
Graphics Processing Units (GPUs) support dynamic voltage and frequency
scaling (DVFS) in order to balance computational performance and energy
consumption. However, a simple and accurate method is still lacking for
estimating the performance of a given GPU kernel under different frequency
settings on real hardware, which is important for deciding the best frequency
configuration for energy saving. This paper presents a fine-grained model to
estimate the execution time of GPU kernels under both core and memory
frequency scaling. Over a 2.5x range of both core and memory frequencies
among 12 GPU kernels, our model achieves accurate results (within 3.5\%) on
real hardware. Compared with cycle-level simulators, our model needs only
some simple micro-benchmarks to extract a set of hardware parameters, plus the
performance counters of the kernels, to produce this high accuracy.

With huge amounts of training data, deep learning has made great
breakthroughs in many artificial intelligence (AI) applications. However, such
large-scale data sets present computational challenges, requiring training to
be distributed on a cluster equipped with accelerators like GPUs. With the
rapid growth of GPU computing power, data communication among GPUs has become
a potential bottleneck for overall training performance. In this paper, we
first propose a general directed acyclic graph (DAG) model to describe the
distributed synchronous stochastic gradient descent (SSGD) algorithm, which
has been widely used in distributed deep learning frameworks. To understand the
practical impact of data communications on training performance, we conduct
extensive empirical studies on four state-of-the-art distributed deep learning
frameworks (i.e., Caffe-MPI, CNTK, MXNet and TensorFlow) over multi-GPU and
multi-node environments with different data communication techniques, including
PCIe, NVLink, 10GbE, and InfiniBand. Through both analytical and experimental
studies, we identify the potential bottlenecks and overheads that could be
further optimized. Finally, we make the data set of our experimental traces
publicly available, which could be used to support simulation-based studies.
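
To illustrate how a DAG view exposes communication bottlenecks, the following minimal Python sketch computes the critical-path length of one synchronous SGD iteration. The task names, durations, and dependency edges are hypothetical values chosen for illustration; they are not measurements or the exact graph used in the paper.

```python
from graphlib import TopologicalSorter


def critical_path_length(durations, deps):
    """Return the longest-path length through a task DAG.

    durations: task name -> execution time (hypothetical units)
    deps:      task name -> list of tasks that must finish first
    """
    finish = {}
    # static_order() yields each task only after all of its predecessors.
    for task in TopologicalSorter(deps).static_order():
        start = max((finish[d] for d in deps.get(task, ())), default=0.0)
        finish[task] = start + durations[task]
    return max(finish.values())


# Hypothetical two-layer network on one worker. Gradient communication of
# layer 2 can overlap with the backward pass of layer 1.
durations = {
    "io": 1.0, "fwd1": 2.0, "fwd2": 2.0,
    "bwd2": 3.0, "bwd1": 3.0,
    "comm2": 4.0, "comm1": 4.0, "update": 1.0,
}
deps = {
    "fwd1": ["io"],
    "fwd2": ["fwd1"],
    "bwd2": ["fwd2"],
    "bwd1": ["bwd2"],
    "comm2": ["bwd2"],            # starts as soon as layer-2 gradients exist
    "comm1": ["bwd1", "comm2"],   # assumed: one link, transfers serialized
    "update": ["comm1"],
}

iteration_time = critical_path_length(durations, deps)
serial_time = sum(durations.values())
```

In this toy graph the critical path runs through the two communication tasks, so speeding up computation alone would not shorten the iteration.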

Recommendation systems have been widely used in different areas. Collaborative
filtering focuses on ratings and ignores the features of the items themselves.
To effectively evaluate customers' preferences for books, taking into account
the characteristics of offline book retail, we use an LDA model to calculate
customers' preferences for book topics and word2vec to calculate customers'
preferences for book types. When forecasting book ratings, we take two factors
into consideration: the similarity of customers and the correlation between
customers and books. Experiments show that our feature-based hybrid
recommendation method performs better than a single recommendation method on
offline book retail data.

We propose a Doppler tracking system for gravitational wave detection via
Double Optical Clocks in Space (DOCS). In this configuration, two spacecraft
(each containing an optical clock) are launched into space for Doppler shift
observations. Compared to the similar attempt at gravitational wave detection
in the Cassini mission, the radio signal of DOCS that contains the relative
frequency changes completely avoids noise effects due, for instance, to the
troposphere, ionosphere, ground-based antenna and transponder. Given the high
stabilities of the two optical clocks (Allan deviation $\sim 4.1\times
10^{-17}$ @ 1000 s), an overall estimated sensitivity of $5 \times 10^{-19}$
could be achieved with an observation time of 2 years, and would allow the
detection of gravitational waves in the frequency range from $\sim 10^{-4}$ Hz
to $\sim 10^{-2}$ Hz.

We study the magnitudes and temperature dependences of the electron-electron
and electron-phonon interaction times, which play the dominant role in the
formation and relaxation of the photon-induced hotspot in two-dimensional
amorphous WSi films. The time constants are obtained through
magnetoconductance measurements in a perpendicular magnetic field in the
superconducting fluctuation regime and through time-resolved photoresponse to
optical pulses. The excess magnetoconductivity is interpreted in terms of the
weak-localization effect and superconducting fluctuations. Aslamazov-Larkin
and Maki-Thompson superconducting fluctuations alone fail to reproduce the
magnetic field dependence in the relatively high magnetic field range when the
temperature is rather close to $T_c$, because the suppression of the
electronic density of states due to the formation of short-lifetime Cooper
pairs needs to be considered. The
time scale $\tau_i$ of inelastic scattering is ascribed to a combination of
electron-electron ($\tau_{e-e}$) and electron-phonon ($\tau_{e-ph}$)
interaction times, and a characteristic electron-fluctuation time
($\tau_{e-fl}$), which makes it possible to extract their magnitudes and
temperature dependences from the measured $\tau_i$. The ratio of
phonon-electron ($\tau_{ph-e}$) and electron-phonon interaction times is
obtained via measurements of the optical photoresponse of WSi microbridges.
Relatively large $\tau_{e-ph}/\tau_{ph-e}$ and $\tau_{e-ph}/\tau_{e-e}$
ratios ensure that in WSi the photon energy is more efficiently confined in the
electron subsystem than in other materials commonly used in the technology of
superconducting nanowire single-photon detectors (SNSPDs). We discuss the
impact of interaction times on the hotspot dynamics and compare relevant
metrics of SNSPDs from different materials.

We find the number of compositions over finite abelian groups under two types
of restrictions: (i) each part belongs to a given subset and (ii) small runs of
consecutive parts must have given properties. Waring's problem over finite
fields can be converted to type~(i) compositions, whereas Carlitz and locally
Mullen compositions can be formulated as type~(ii) compositions. We use the
multisection formula to translate the problem from integers to group elements,
the transfer matrix method to do exact counting, and finally the
Perron-Frobenius theorem to derive asymptotics. We also exhibit bijections
involving certain restricted classes of compositions.
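
As a concrete instance of type (i) counting, the sketch below counts $k$-part compositions of every element of the cyclic group $\mathbb{Z}_n$ with parts drawn from a given subset, by applying one transfer step per part. Here a composition is taken to mean an ordered tuple of parts; the function name and the restriction to $\mathbb{Z}_n$ (rather than a general finite abelian group) are simplifying assumptions made for illustration.

```python
def count_compositions(n, parts, k):
    """counts[g] = number of ordered k-tuples from `parts` summing to g mod n."""
    counts = [0] * n
    counts[0] = 1  # the empty composition sums to the identity element
    for _ in range(k):
        nxt = [0] * n
        # One transfer step: append each allowed part to every partial sum.
        for g, c in enumerate(counts):
            if c:
                for s in parts:
                    nxt[(g + s) % n] += c
        counts = nxt
    return counts


# Example: 2-part compositions over Z_5 with parts in {1, 2}.
counts = count_compositions(5, [1, 2], 2)
```

Each step is a multiplication by a fixed $n \times n$ matrix, which is the kind of structure that a transfer matrix argument exploits for exact counts.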

We theoretically study spin transport through a single-molecule magnet (SMM)
in the sequential and cotunneling regimes, where the SMM is weakly coupled to
one ferromagnetic lead and one normal-metallic lead. Using a master-equation
approach, we find that the spin polarization injected from the ferromagnetic
lead is amplified and a highly polarized spin current can be generated, due to
the exchange coupling between the transport electron and the anisotropic spin
of the SMM. Moreover, the spin-current polarization can be tuned by the gate
or bias voltage, and thus an efficient spin injection device based on the SMM
is proposed for molecular spintronics.

A remarkable quantitative agreement is found between the non-Markovian
quantum kinetic approach and the time-dependent Dirac equation (TDDE) approach
over a large range of the Keldysh parameter, in the investigation of
electron-positron pair production in an electric field that is spatially
homogeneous and has a pulse-shaped envelope. If a subcritical bound potential
is immersed in this background field, the TDDE results show that the creation
probability is enhanced by two orders of magnitude through bound-state
resonance. We also establish a TDDE formalism for spatially homogeneous fields
that greatly reduces the required computing resources.

In this paper, we construct some new classes of complete permutation
monomials with exponent $d=\frac{q^n-1}{q-1}$ using a special case of the AGW
criterion. This proves two recent conjectures in [Wuetal2] and extends some of
these recent results to more general $n$'s.

Permutation polynomials over finite fields have been studied extensively
recently due to their wide applications in cryptography, coding theory, and
communication theory, among others. Recently, several authors have studied
permutation trinomials of the form $x^rh\left(x^{q-1}\right)$ over
$\mathbb{F}_{q^2}$, where $q=2^k$, $h(x)=1+x^s+x^t$ and $r, s, t, k>0$ are
integers. Their methods essentially use a multiplicative version of the AGW
criterion, because they all transform the problem of proving that a polynomial
permutes $\mathbb{F}_{q^2}$ into that of showing that the corresponding
fractional polynomial permutes a smaller set $\mu_{q+1}$, where
$\mu_{q+1}:=\{x\in\mathbb{F}_{q^2} : x^{q+1}=1\}$. Motivated by these results,
we characterize the permutation polynomials of the form
$x^rh\left(x^{q-1}\right)$ over $\mathbb{F}_{q^2}$ such that
$h(x)\in\mathbb{F}_q[x]$ is arbitrary and $q$ is also an arbitrary prime power.
Using the AGW criterion twice, once multiplicatively and once additively, we
reduce the problem of proving permutation polynomials over $\mathbb{F}_{q^2}$
to that of showing permutations over a small subset $S$ of a proper subfield
$\mathbb{F}_{q}$, which is significantly different from previously known
methods. In particular, we demonstrate our method by constructing many new
explicit classes of permutation polynomials of the form
$x^rh\left(x^{q-1}\right)$ over $\mathbb{F}_{q^2}$. Moreover, we can explain
most of the known permutation trinomials, which are in [6, 13, 14, 16, 20, 29],
over finite fields of even characteristic.
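
For orientation, the multiplicative reduction to $\mu_{q+1}$ described above is commonly stated in the following form. This is a standard formulation from the literature, given here as a sketch rather than as this paper's own statement: for a positive integer $r$ and a polynomial $h(x)\in\mathbb{F}_{q^2}[x]$,

$$
x^r h\left(x^{q-1}\right) \text{ permutes } \mathbb{F}_{q^2}
\iff
\gcd(r,\,q-1)=1 \ \text{ and } \ x^r h(x)^{q-1} \text{ permutes } \mu_{q+1}.
$$

The condition on the right involves only the $q+1$ elements of $\mu_{q+1}$ instead of all $q^2$ elements of the field.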

Quantum key distribution (QKD) uses individual light quanta in quantum
superposition states to guarantee unconditional communication security between
distant parties. In practice, the achievable distance for QKD has been limited
to a few hundred kilometers, due to the channel loss of fibers or terrestrial
free space, which exponentially reduces the photon rate. Satellite-based QKD
promises to establish a global-scale quantum network by exploiting the
negligible photon loss and decoherence in empty outer space. Here, we develop
and launch a low-Earth-orbit satellite to implement decoy-state QKD with a key
rate exceeding 1 kHz from the satellite to ground over a distance up to 1200
km, which is up to 20 orders of magnitude more efficient than that expected
using an optical fiber (with 0.2 dB/km loss) of the same length. The
establishment of a reliable and efficient space-to-ground link for faithful
quantum state transmission constitutes a key milestone for global-scale
quantum networks.
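
The scale of the fiber loss can be checked with a short calculation. The sketch below (the function name is illustrative) converts the stated 0.2 dB/km attenuation over 1200 km into a channel transmittance. It covers only the fiber side of the comparison and makes no assumption about the satellite link budget.

```python
import math


def fiber_transmittance(loss_db_per_km, length_km):
    """Fraction of photons that survive a fiber of the given length."""
    total_loss_db = loss_db_per_km * length_km
    return 10 ** (-total_loss_db / 10)


# Values quoted in the abstract: 0.2 dB/km over 1200 km, i.e. 240 dB in total.
eta = fiber_transmittance(0.2, 1200)
orders_of_magnitude = -math.log10(eta)
```

A total loss of 240 dB corresponds to a transmittance of about $10^{-24}$, which is why direct fiber transmission over this distance is impractical.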

We propose an optimized design for nanowire superconducting single-photon
detectors, using the recently discovered position-dependent detection
efficiency in these devices. This knowledge allows the design of
meandering-wire NbN detectors to be optimized by altering the field
distribution across the wire. In order to calculate the response of detectors
with different geometries, we use a monotonic local detection efficiency from
a nanowire and the optical absorption distribution obtained via
finite-difference time-domain simulations. The calculations predict a
trade-off between average absorption and the edge effect, leading to a
predicted optimal wire width close to 100 nm for 1550 nm wavelength, which
drops to 50 nm for 600 nm wavelength. The absorption at the edges can be
enhanced by depositing a silicon nanowire on top of the superconducting
nanowire, which improves both the total absorption efficiency and the internal
detection efficiency of meandering-wire structures.

Self-supported electrocatalysts that are generated and employed directly as
electrodes for energy conversion have been intensively pursued in the fields
of materials chemistry and energy. Herein, we report a synthetic strategy to
prepare free-standing, hierarchically structured, nitrogen-doped nanoporous
graphitic carbon membranes functionalized with Janus-type Co/CoP nanocrystals
(termed HNDCM-Co/CoP), which were successfully applied as a highly efficient,
binder-free electrode in the hydrogen evolution reaction (HER). Benefiting
from multiple structural merits, such as a high degree of graphitization,
three-dimensionally interconnected micro/meso/macropores, uniform nitrogen
doping, well-dispersed Co/CoP nanocrystals, and the confinement effect of the
thin carbon layer on the nanocrystals, HNDCM-Co/CoP exhibited superior
electrocatalytic activity and long-term operation stability for HER under both
acidic and alkaline conditions. As a proof of concept of practical usage, a
macroscopic piece of HNDCM-Co/CoP of 5.6 cm x 4 cm x 60 um in size was
prepared in our laboratory. Driven by a solar cell, electroreduction of water
under alkaline conditions (pH 14) was performed, and H2 was produced at a rate
of 16 ml/min, demonstrating its potential for real-life energy conversion
systems.

Let $p$ be an odd prime, $n$ a positive integer and $g$ a primitive root of
$p^n$. Suppose
$D_i^{(p^n)}=\{g^{2s+i}\mid s=0,1,2,\cdots,\frac{(p-1)p^{n-1}}{2}\}$, $i=0,1$,
are the generalized cyclotomic classes with $Z_{p^n}^{\ast}=D_0\cup D_1$. In
this paper, we prove that the Gauss periods based on $D_0$ and $D_1$ are both
equal to 0 for $n\geq2$. As an application, we determine a lower bound on the
2-adic complexity of a class of Ding-Helleseth generalized cyclotomic
sequences of period $p^n$. The result shows that the 2-adic complexity is at
least $p^n-p^{n-1}-1$, which is larger than $\frac{N+1}{2}$, where $N=p^n$ is
the period of the sequence.
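
The vanishing of the Gauss periods for $n\geq2$ can be checked numerically for small cases. The sketch below assumes the usual definition $\eta_i=\sum_{x\in D_i}\omega^x$ with $\omega=e^{2\pi i/p^n}$; the helper names are illustrative and the brute-force primitive-root search is only suitable for small moduli.

```python
import cmath
from math import gcd


def primitive_root(m):
    """Smallest generator of the unit group of Z_m (m = p^n, p an odd prime)."""
    units = [x for x in range(1, m) if gcd(x, m) == 1]
    for g in units:
        x, order = g, 1
        while x != 1:
            x = x * g % m
            order += 1
        if order == len(units):
            return g
    raise ValueError("no primitive root found")


def gauss_periods(p, n):
    """Return [eta_0, eta_1] for the classes D_0, D_1 modulo p^n."""
    m = p ** n
    g = primitive_root(m)
    half = (p - 1) * p ** (n - 1) // 2  # size of each class
    omega = cmath.exp(2j * cmath.pi / m)
    periods = []
    for i in (0, 1):
        cls = {pow(g, 2 * s + i, m) for s in range(half)}
        periods.append(sum(omega ** x for x in cls))
    return periods


eta0, eta1 = gauss_periods(3, 2)  # classes modulo 9
```

For $n=1$ the same function returns the classical quadratic Gauss periods, which are nonzero, so the contrast with $n\geq2$ is easy to observe.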

Let $p,q$ be distinct primes satisfying $\mathrm{gcd}(p-1,q-1)=d$ and let
$D_i$, $i=0,1,\cdots,d-1$, be Whiteman's generalized cyclotomic classes with
$Z_{pq}^{\ast}=\cup_{i=0}^{d-1}D_i$. In this paper, we give the values of Gauss
periods based on the generalized cyclotomic sets
$D_0^{\ast}=\sum_{i=0}^{\frac{d}{2}-1}D_{2i}$ and
$D_1^{\ast}=\sum_{i=0}^{\frac{d}{2}-1}D_{2i+1}$. As an application, we
determine a lower bound on the 2-adic complexity of the modified Jacobi
sequence. Our result shows that the 2-adic complexity of the modified Jacobi
sequence is at least $pq-p-q-1$ with period $N=pq$. This indicates that the
2-adic complexity of the modified Jacobi sequence is large enough to resist
the attack of the rational approximation algorithm (RAA) for feedback with
carry shift registers (FCSRs).
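
For readers unfamiliar with the quantity being bounded, the sketch below computes the 2-adic complexity of one period of a binary sequence, assuming the commonly used definition $\log_2\frac{2^N-1}{\gcd(2^N-1,\,S(2))}$ with $S(2)=\sum_{i=0}^{N-1}s_i2^i$. Some authors round this value, and the example sequence is an arbitrary period-7 $m$-sequence, not a modified Jacobi sequence.

```python
from math import gcd, log2


def two_adic_complexity(seq):
    """2-adic complexity of a binary sequence given by one period `seq`."""
    n = len(seq)
    # S(2) = sum of s_i * 2^i over one period.
    s2 = sum(bit << i for i, bit in enumerate(seq))
    modulus = (1 << n) - 1
    return log2(modulus // gcd(modulus, s2))


example = [1, 1, 1, 0, 1, 0, 0]   # illustrative period-7 sequence
complexity = two_adic_complexity(example)
exceeds_half_period = complexity > len(example) / 2
```

A value above $N/2$ is the threshold usually quoted for resisting the rational approximation algorithm.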

Methods based on Discriminant Correlation Filters (DCF) have become a
dominant approach to online object tracking. The features used in these
methods, however, are either hand-crafted features like HoGs, or
convolutional features trained independently from other tasks like image
classification. In this work, we present an end-to-end lightweight network
architecture, namely DCFNet, to learn the convolutional features and perform
the correlation tracking process simultaneously. Specifically, we treat DCF as
a special correlation filter layer added in a Siamese network, and carefully
derive the backpropagation through it by defining the network output as the
probability heatmap of object location. Since the derivation is still carried
out in Fourier frequency domain, the efficiency property of DCF is preserved.
This enables our tracker to run at more than 60 FPS during test time, while
achieving a significant accuracy gain compared with KCF using HoGs. Extensive
evaluations on the OTB-2013, OTB-2015, and VOT2015 benchmarks demonstrate that
the proposed DCFNet tracker is competitive with several state-of-the-art trackers,
while being more compact and much faster.

Many algorithms in modern stream ciphers, such as ZUC, the LTE encryption
algorithm and the LTE integrity algorithm, use bit-component sequences of
$p$-ary $m$-sequences as input. Therefore, analyzing the statistical
properties (for example, autocorrelation, linear complexity and 2-adic
complexity) of bit-component sequences of $p$-ary $m$-sequences is becoming an
important research topic. In this paper, we first derive some autocorrelation
properties of LSB (Least Significant Bit) sequences of $p$-ary $m$-sequences,
i.e., we convert the problem of computing autocorrelations of LSB sequences of
period $p^n-1$ for any integer $n\geq2$ to the problem of determining
autocorrelations of the LSB sequence of period $p-1$. Then, based on this
property and computer calculation, we list some autocorrelation distributions
of LSB sequences of $p$-ary $m$-sequences with order $n$ for some small primes
$p$, such as $p=3,5,7,11,17,31$. Additionally, using their autocorrelation
distributions and the method inspired by Hu, we give lower bounds on the
2-adic complexities of these LSB sequences. Our results show that the main
parts of all the lower bounds on the 2-adic complexity of these LSB sequences
are larger than $\frac{N}{2}$, where $N$ is the period of these sequences.
Therefore, these bounds are large enough to resist the analysis of RAA
(Rational Approximation Algorithm) for FCSR (Feedback with Carry Shift
Register). In particular, for a Mersenne prime $p=2^k-1$, since all
bit-component sequences of a $p$-ary $m$-sequence are shift equivalent, our
results hold for all of its bit-component sequences.
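
The autocorrelation values discussed above can be computed directly from one period of a binary sequence. The sketch below uses the standard periodic autocorrelation $C(\tau)=\sum_{i=0}^{N-1}(-1)^{s_i+s_{i+\tau}}$; the example is a binary $m$-sequence of period 7 chosen for brevity, not an LSB sequence of a $p$-ary $m$-sequence.

```python
def periodic_autocorrelation(seq, tau):
    """Periodic autocorrelation of a binary sequence at shift tau."""
    n = len(seq)
    # Agreeing positions contribute +1, disagreeing positions contribute -1.
    return sum((-1) ** (seq[i] ^ seq[(i + tau) % n]) for i in range(n))


m_seq = [1, 1, 1, 0, 1, 0, 0]   # illustrative binary m-sequence, period 7
values = [periodic_autocorrelation(m_seq, t) for t in range(len(m_seq))]
distribution = {v: values.count(v) for v in set(values)}
```

Tabulating `distribution` for each shift is the kind of computer calculation by which an autocorrelation distribution can be listed for a given sequence.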

Pseudorandom sequences with good statistical properties, such as low
autocorrelation, high linear complexity and large 2-adic complexity, have been
applied in stream ciphers. In general, it is difficult to give both the linear
complexity and 2-adic complexity of a periodic binary sequence. Cai and Ding
\cite{Cai Ying} gave a class of sequences with almost optimal autocorrelation
by constructing almost difference sets. Wang \cite{Wang Qi} proved that one
type of those sequences by Cai and Ding has large linear complexity. Sun et al.
\cite{Sun Yuhua} showed that another type of sequences by Cai and Ding also has
large linear complexity. Additionally, Sun et al. generalized the
construction by Cai and Ding using $d$-form functions with the
difference-balanced property. In this paper, we first give the detailed
autocorrelation distribution of the sequences generalized from Cai and Ding
\cite{Cai Ying} by Sun et al. \cite{Sun Yuhua}. Then, inspired by the method of
Hu \cite{Hu Honggang}, we analyse their 2-adic complexity and give a lower
bound on the 2-adic complexity of these sequences. Our results show that the
2-adic complexity of these sequences is at least $N-\mathrm{log}_2\sqrt{N+1}$
and that it reaches $N-1$ in many cases, which is large enough to resist the
rational approximation algorithm (RAA) for feedback with carry shift registers
(FCSRs).

Deep learning has proven to be a successful machine learning method for a
variety of tasks, and its popularity has resulted in numerous open-source deep
learning software tools. Training a deep network is usually a very
time-consuming process. To address the computational challenge in deep
learning, many tools exploit hardware features such as multi-core CPUs and
many-core GPUs to shorten the training time. However, different tools exhibit
different features and running performance when training different types of
deep networks on different hardware platforms, which makes it difficult for end
users to select an appropriate pair of software and hardware. In this paper, we
aim to make a comparative study of the state-of-the-art GPU-accelerated deep
learning software tools, including Caffe, CNTK, MXNet, TensorFlow, and Torch.
We first benchmark the running performance of these tools with three popular
types of neural networks on two CPU platforms and three GPU platforms. We then
benchmark some distributed versions on multiple GPUs. Our contribution is
twofold. First, for end users of deep learning tools, our benchmarking results
can serve as a guide to selecting appropriate hardware platforms and software
tools. Second, for software developers of deep learning tools, our in-depth
analysis points out possible future directions to further optimize the running
performance.

Nanoporous graphitic carbon membranes with defined chemical composition and
pore architecture are novel nanomaterials that are actively pursued. Compared
to easy-to-make porous carbon powders that dominate the porous carbon research
and applications in energy generation/conversion and environmental remediation,
porous carbon membranes are synthetically more challenging though rather
appealing from an application perspective due to their structural integrity,
interconnectivity and purity. Here we report a simple bottom-up approach to
fabricate large-size, free-standing, porous carbon membranes that feature an
unusual single-crystal-like graphitic order and hierarchical pore architecture
plus favorable nitrogen doping. When loaded with cobalt nanoparticles, such
carbon membranes serve as a high-performance carbon-based non-noble metal
electrocatalyst for overall water splitting.

In this note, we give a shorter proof of the result of Zheng, Yu, and Pei on
the explicit formula of inverses of generalized cyclotomic permutation
polynomials over finite fields. Moreover, we characterize all these cyclotomic
permutation polynomials that are involutions. Our results provide a fast
algorithm (only modular operations are involved) to generate many classes of
generalized cyclotomic permutation polynomials, their inverses, and
involutions.

A ThO$_{2}$ sample and a nickel activation foil were irradiated in the
leakage neutron field of the CFBR-II reactor. The activities of the activation
products were measured after irradiation to obtain the reaction rates. The
normalized reaction rates were also calculated based on the ENDF/B-VII.1,
CENDL-3.1, JENDL-4.0, and BROND-2.2 databases. The experimental reaction rate
ratio is 4.37 with an uncertainty of 3.9\%, which agrees with each of the
ratios calculated based on the ENDF/B-VII.1, JENDL-4.0, and BROND-2.2
databases, but is 11.2\% larger than that based on the CENDL-3.1 database.

Energy efficiency has become one of the top design criteria for current
computing systems. Dynamic voltage and frequency scaling (DVFS) has been
widely adopted by laptop computers, servers, and mobile devices to conserve
energy, while GPU DVFS is still at an early stage. This paper aims at
exploring the impact of GPU DVFS on application performance and power
consumption, and furthermore, on energy conservation. We survey the
state-of-the-art GPU DVFS characterizations, and then summarize recent research
works on GPU power and performance models. We also conduct real GPU DVFS
experiments on NVIDIA Fermi and Maxwell GPUs. According to our experimental
results, GPU DVFS has significant potential for energy saving. The effect of
scaling core voltage/frequency and memory voltage/frequency depends not only on
the GPU architecture, but also on the characteristics of GPU applications.