
The frequent directions (FD) technique is a deterministic approach for online
sketching that has many applications in machine learning. The conventional FD
is a heuristic procedure that often outputs rank deficient matrices. To
overcome the rank deficiency problem, we propose a new sketching strategy
called robust frequent directions (RFD) by introducing a regularization term.
RFD can be derived from an optimization problem. It updates the sketch matrix
and the regularization term adaptively and jointly. RFD reduces the
approximation error of FD without increasing the computational cost. We also
apply RFD to online learning and propose an effective hyperparameterfree
online Newton algorithm. We derive a regret bound for our online Newton
algorithm based on RFD, which guarantees the robustness of the algorithm. The
experimental studies demonstrate that the proposed method outperforms
stateoftheart second order online learning algorithms.

A major challenge for building statistical models in the big data era is that
the available data volume far exceeds the computational capability. A common
approach for solving this problem is to employ a subsampled dataset that can be
handled by available computational resources. In this paper, we propose a
general subsampling scheme for largescale multiclass logistic regression and
examine the variance of the resulting estimator. We show that asymptotically,
the proposed method always achieves a smaller variance than that of the uniform
random sampling. Moreover, when the classes are conditionally imbalanced,
significant improvement over uniform sampling can be achieved. Empirical
performance of the proposed method is compared to other methods on both
simulated and realworld datasets, and these results match and confirm our
theoretical analysis.

Sparse generalized eigenvalue problem (GEP) plays a pivotal role in a large
family of highdimensional statistical models, including sparse Fisher's
discriminant analysis, canonical correlation analysis, and sufficient dimension
reduction. Sparse GEP involves solving a nonconvex optimization problem. Most
existing methods and theory in the context of specific statistical models that
are special cases of the sparse GEP require restrictive structural assumptions
on the input matrices. In this paper, we propose a twostage computational
framework to solve the sparse GEP. At the first stage, we solve a convex
relaxation of the sparse GEP. Taking the solution as an initial value, we then
exploit a nonconvex optimization perspective and propose the truncated Rayleigh
flow method (Rifle) to estimate the leading generalized eigenvector. We show
that Rifle converges linearly to a solution with the optimal statistical rate
of convergence for many statistical models. Theoretically, our method
significantly improves upon the existing literature by eliminating structural
assumptions on the input matrices for both stages. To achieve this, our
analysis involves two key ingredients: (i) a new analysis of the gradient based
method on nonconvex objective functions, and (ii) a finegrained
characterization of the evolution of sparsity patterns along the solution path.
Thorough numerical studies are provided to validate the theoretical results.

We study active object tracking, where a tracker takes as input the visual
observation (i.e., frame sequence) and produces the camera control signal
(e.g., move forward, turn left, etc.). Conventional methods tackle the tracking
and the camera control separately, which is challenging to tune jointly. It
also incurs many human efforts for labeling and many expensive trialanderrors
in realworld. To address these issues, we propose, in this paper, an endtoend
solution via deep reinforcement learning, where a ConvNetLSTM function
approximator is adopted for the direct frametoaction prediction. We further
propose an environment augmentation technique and a customized reward function,
which are crucial for a successful training. The tracker trained in simulators
(ViZDoom, Unreal Engine) shows good generalization in the case of unseen object
moving path, unseen object appearance, unseen background, and distracting
object. It can restore tracking when occasionally losing the target. With the
experiments over the VOT dataset, we also find that the tracking ability,
obtained solely from simulators, can potentially transfer to realworld
scenarios.

CuxBi2Se3 hosts both topological surface states and bulk superconductivity.
It has been identified recently as a topological superconductor (TSC) with an
extraordinary nematic, i. e. C2symmetric, superconducting state and oddparity
pairing. Here, using scanning tunneling microscopy (STM), we directly examine
the response of the superconductivity of CuxBi2Se3 to magnetic field. Under
out ofplane (caxis) fields, we discover elongated magnetic vortices hosting
zerobias conductance peaks consistent with the Majorana bound states expected
in a TSC. Under inplane (abplane) fields, the average superconducting gap
exhibits twofold symmetry with field orientation, the long C2 symmetry axes
are pinned to the dihedral mirror planes under B//=0.5 T but slightly rotate
under B//=1.0 T. Moreover, the intrinsic C2symmetric gap distribution in
momentum space is semiquantitatively determined for the first time. Our data
paint a microscopic picture of the nematic superconductivity in CuxBi2Se3 and
pose strong constraints on theory.

Optimizing distributed learning systems is an art of balancing between
computation and communication. There have been two lines of research that try
to deal with slower networks: {\em quantization} for low bandwidth networks,
and {\em decentralization} for high latency networks. In this paper, we explore
a natural question: {\em can the combination of both decentralization and
quantization lead to a system that is robust to both bandwidth and latency?}
Although the system implication of such combination is trivial, the
underlying theoretical principle and algorithm design is challenging: simply
quantizing data sent in a decentralized training algorithm would accumulate the
error. In this paper, we develop a framework of quantized, decentralized
training and propose two different strategies, which we call {\em extrapolation
compression} and {\em difference compression}. We analyze both algorithms and
prove both converge at the rate of $O(1/\sqrt{nT})$ where $n$ is the number of
workers and $T$ is the number of iterations, matching the {\rc convergence}
rate for full precision, centralized training. We evaluate our algorithms on
training deep learning models, and find that our proposed algorithm outperforms
the best of merely decentralized and merely quantized algorithm significantly
for networks with {\em both} high latency and low bandwidth.

In the task of quantifiable sequence editing (QuaSE), a model needs to edit
an input sentence to generate an output that satisfies a given outcome, which
is a numerical value measuring a certain property of the output. For example,
for review sentences, the outcome could be review ratings; for advertisement,
the outcome could be clickthrough rate. We propose a framework which performs
QuaSE by incorporating pseudoparallel data. Our framework can capture the
content similarity and the outcome differences by exploiting pseudoparallel
sentence pairs, which enables a better disentanglement of the latent factors
that are relevant to the outcome and thus provides a solid basis to generate
output satisfying the desired outcome. The dual reconstruction structure
further enhances the capability of generating expected output by exploiting the
coupling of latent factors of pseudoparallel sentences. We prepare a dataset
of Yelp review sentences with the ratings as outcome. Experimental results show
that our framework can outperform stateoftheart methods under both sentiment
polarity accuracy and target value errors.

Nowadays, billions of videos are online ready to be viewed and shared. Among
an enormous volume of videos, some popular ones are widely viewed by online
users while the majority attract little attention. Furthermore, within each
video, different segments may attract significantly different numbers of views.
This phenomenon leads to a challenging yet important problem, namely
finegrained video attractiveness prediction. However, one major obstacle for
such a challenging problem is that no suitable benchmark dataset currently
exists. To this end, we construct the first finegrained video attractiveness
dataset, which is collected from one of the most popular video websites in the
world. In total, the constructed FVAD consists of 1,019 drama episodes with
780.6 hours covering different categories and a wide variety of video contents.
Apart from the large amount of videos, hundreds of millions of user behaviors
during watching videos are also included, such as "view counts",
"fastforward", "fastrewind", and so on, where "view counts" reflects the
video attractiveness while other engagements capture the interactions between
the viewers and videos. First, we demonstrate that video attractiveness and
different engagements present different relationships. Second, FVAD provides us
an opportunity to study the finegrained video attractiveness prediction
problem. We design different sequential models to perform video attractiveness
prediction by relying solely on video contents. The sequential models exploit
the multimodal relationships between visual and audio components of the video
contents at different levels. Experimental results demonstrate the
effectiveness of our proposed sequential models with different visual and audio
representations, the necessity of incorporating the two modalities, and the
complementary behaviors of the sequential prediction models at different
levels.

The success of current deep saliency detection methods heavily depends on the
availability of largescale supervision in the form of perpixel labeling. Such
supervision, while laborintensive and not always possible, tends to hinder the
generalization ability of the learned models. By contrast, traditional
handcrafted features based unsupervised saliency detection methods, even though
have been surpassed by the deep supervised methods, are generally
datasetindependent and could be applied in the wild. This raises a natural
question that "Is it possible to learn saliency maps without using labeled data
while improving the generalization ability?". To this end, we present a novel
perspective to unsupervised saliency detection through learning from multiple
noisy labeling generated by "weak" and "noisy" unsupervised handcrafted
saliency methods. Our endtoend deep learning framework for unsupervised
saliency detection consists of a latent saliency prediction module and a noise
modeling module that work collaboratively and are optimized jointly. Explicit
noise modeling enables us to deal with noisy saliency maps in a probabilistic
way. Extensive experimental results on various benchmarking datasets show that
our model not only outperforms all the unsupervised saliency methods with a
large margin but also achieves comparable performance with the recent
stateoftheart supervised deep saliency methods.

In this paper, we propose a novel tensor graph convolutional neural network
(TGCNN) to conduct convolution on factorizable graphs, for which here two types
of problems are focused, one is sequential dynamic graphs and the other is
crossattribute graphs. Especially, we propose a graph preserving layer to
memorize salient nodes of those factorized subgraphs, i.e. cross graph
convolution and graph pooling. For cross graph convolution, a parameterized
Kronecker sum operation is proposed to generate a conjunctive adjacency matrix
characterizing the relationship between every pair of nodes across two
subgraphs. Taking this operation, then general graph convolution may be
efficiently performed followed by the composition of small matrices, which thus
reduces high memory and computational burden. Encapsuling sequence graphs into
a recursive learning, the dynamics of graphs can be efficiently encoded as well
as the spatial layout of graphs. To validate the proposed TGCNN, experiments
are conducted on skeleton action datasets as well as matrix completion dataset.
The experiment results demonstrate that our method can achieve more competitive
performance with the stateoftheart methods.

In this paper, we consider a hybrid millimeter wave (mmWave) and micro wave
($\mu$Wave) network from the perspective of \emph{wireless caching} and study
the optimal probabilistic content/file caching placement at desirable base
stations (BSs) using a stochastic geometric framework. Considering the average
success probability (ASP) of file delivery as the performance metric, we derive
expressions for the association probability of the typical user to the mmWave
and $\mu$Wave networks. Accordingly, we provide an upper bound for the ASP of
file delivery and formulate the content caching placement scheme as an
optimization problem with respect to caching probabilities, that jointly
optimizes the ASP of file delivery considering both content placement and
delivery phases. In particular, we consider the caching placement strategy
under both noiselimited and interferencelimited environments. We numerically
evaluate the performance of the proposed caching schemes under essential
factors, such as blockages in the mmWave network, cluster radius, BS density,
and path loss and compare it with uniform caching placement, caching $M$ most
popular contents, and random caching placement. Numerical results demonstrate
the superiority of the proposed caching scheme over others, albeit certain
tradeoffs.

In this paper, we first construct varieties of any dimension $n>2$ fibered
over curves with low slopes. These examples violate the conjectural slope
inequality of Barja and Stoppino [BS14b].
Led by their conjecture, we focus on finding the lowest possible slope when
$n=3$. Based on a characteristic $p > 0$ method, we prove that the sharp lower
bound of the slope of fibered $3$folds over curves is $4/3$, and it occurs
only when the general fiber is a $(1, 2)$surface. Otherwise, the sharp lower
bound is $2$. We also obtain a CornalbaHarrisXiao type slope inequality for
families of surfaces of general type over curves, and it is sharper than all
previously known results.
As an application of the slope bound, we deduce a sharp NoetherSeveri type
inequality that $K_X^3 \ge 2\chi(X, \omega_X)$ for an irregular minimal
$3$fold $X$ of general type not having a $(1,2)$surface Albanese fibration.
It answers a question in [Zha15] and thus completes the full Severi type
inequality for irregular $3$folds of general type.

Unconventional superconductivity often intertwines with various forms of
order, such as the "nematic" order which breaks the rotational symmetry of the
lattice. Investigation of these ordered phases sheds crucial light on the
superconductivity itself. Here we report a lowtemperature scanning tunneling
microscopy (STM) study on RbFe2As2, a heavily holedoped Febased
superconductor (FeSC). We observe significant symmetry breaking in its
electronic structure and magnetic vortex which differentiates the ({\pi},
{\pi}) and ({\pi}, {\pi}) directions. It is thus a novel nematic state,
distinct from the nematicity of undoped/lightlydoped FeSCs which breaks the
({\pi}, 0) / (0, {\pi}) equivalence. Moreover, we observe a clear "V" shaped
superconducting gap which can be well fitted with a nodal gap function. The gap
is found to be suppressed on surface Rb vacancies and at step edges, and
particularly strong at the [110]oriented edges, which is possibly due to a
dx2y2 like pairing component with nodes along diagonal directions. Our results
highlight the intimate connection between nematicity and superconducting
pairing, and promote a universal understanding of them.

Stateoftheart Convolutional Neural Network (CNN) benefits much from
multitask learning (MTL), which learns multiple related tasks simultaneously
to obtain shared or mutually related representations for different tasks. The
most widely used MTL CNN structure is based on an empirical or heuristic split
on a specific layer (e.g., the last convolutional layer) to minimize multiple
taskspecific losses. However, this heuristic sharing/splitting strategy may be
harmful to the final performance of one or multiple tasks. In this paper, we
propose a novel CNN structure for MTL, which enables automatic feature fusing
at every layer. Specifically, we first concatenate features from different
tasks according to their channel dimension, and then formulate the feature
fusing problem as discriminative dimensionality reduction. We show that this
discriminative dimensionality reduction can be fulfilled by 1x1 Convolution,
Batch Normalization, and Weight Decay in one CNN, which we refer to as Neural
Discriminative Dimensionality Reduction (NDDR). We perform detailed ablation
analysis for different configurations in training the proposed NDDRCNN
network. The experiments carried out on different network structures and
different task sets demonstrate the promising performance and desirable
generalizability of our proposed method.

We prove a sharp relative Clifford inequality for relatively special divisors
on varieties fibered by curves. It generalizes the classical Clifford
inequality about a single curve to a family of curves. It yields a geographical
inequality for varieties of general type and Albanesefibered by curves,
extending the work of Horikawa, Persson, and Xiao in dimension two to arbitrary
dimensions. We also apply it to deduce a slope inequality for some arbitrary
dimensional families of curves. It sheds light on the existence of a most
general CornalbaHarrisXiao type inequality for families of curves.
The whole proof is built on a new treelike filtration for nef divisors on
varieties fibered by curves. One key ingredient of the proof is to estimate the
sum of all admissible products of nef thresholds with respect to this
filtration.

We consider the problem of \emph{fully decentralized} multiagent
reinforcement learning (MARL), where the agents are located at the nodes of a
timevarying communication network. Specifically, we assume that the reward
functions of the agents might correspond to different tasks, and are only known
to the corresponding agent. Moreover, each agent makes individual decisions
based on both the information observed locally and the messages received from
its neighbors over the network. Within this setting, the collective goal of the
agents is to maximize the globally averaged return over the network through
exchanging information with their neighbors. To this end, we propose two
decentralized actorcritic algorithms with function approximation, which are
applicable to largescale MARL problems where both the number of states and the
number of agents are massively large. Under the decentralized structure, the
actor step is performed individually by each agent with no need to infer the
policies of others. For the critic step, we propose a consensus update via
communication over the network. Our algorithms are fully incremental and can be
implemented in an online fashion. Convergence analyses of the algorithms are
provided when the value functions are approximated within the class of linear
functions. Extensive simulation results with both linear and nonlinear function
approximations are presented to validate the proposed algorithms. Our work
appears to be the first study of fully decentralized MARL algorithms for
networked agents with function approximation, with provable convergence
guarantees.

The model averaging problem is to average multiple models to achieve a
prediction accuracy not much worse than that of the best single model in terms
of mean squared error. It is known that if the models are misspecified, model
averaging is superior to model selection. Specifically, let $n$ be the sample
size, then the worst case regret of the former decays at a rate of $O(1/n)$
while the worst case regret of the latter decays at a rate of $O(1/\sqrt{n})$.
The recently proposed $Q$aggregation algorithm \citep{DaiRigZhang12} solves
the model averaging problem with the optimal regret of $O(1/n)$ both in
expectation and in deviation; however it suffers from two limitations: (1) for
continuous dictionary, the proposed greedy algorithm for solving
$Q$aggregation is not applicable; (2) the formulation of $Q$aggregation
appears ad hoc without clear intuition. This paper examines a different
approach to model averaging by considering a Bayes estimator for deviation
optimal model averaging by using exponentiated least squares loss. We establish
a primaldual relationship of this estimator and that of $Q$aggregation and
propose new greedy procedures that satisfactorily resolve the above mentioned
limitations of $Q$aggregation.

We propose a DC proximal Newton algorithm for solving nonconvex regularized
sparse learning problems in high dimensions. Our proposed algorithm integrates
the proximal Newton algorithm with multistage convex relaxation based on the
difference of convex (DC) programming, and enjoys both strong computational and
statistical guarantees. Specifically, by leveraging a sophisticated
characterization of sparse modeling structures/assumptions (i.e., local
restricted strong convexity and Hessian smoothness), we prove that within each
stage of convex relaxation, our proposed algorithm achieves (local) quadratic
convergence, and eventually obtains a sparse approximate local optimum with
optimal statistical properties after only a few convex relaxations. Numerical
experiments are provided to support our theory.

Generative adversarial networks (GAN) have become popular for generating data
that mimic observations by learning a suitable variable transformation from a
random variable. However, empirically, GAN is known to suffer from instability.
Also, the theory provided based on the minimax optimization formulation of GAN
cannot explain the widelyused practical procedure that uses the socalled logd
trick. This paper provides a different theoretical foundation for generative
adversarial methods which does not rely on the minimax formulation. We show
that with a strong discriminator, it is possible to learn a good variable
transformation via functional gradient learning, which updates the functional
definition of a generator model, instead of updating only the model parameters
as in GAN. The theory guarantees that the learned generator improves the
KLdivergence between the probability distributions of real data and generated
data after each functional gradient step, until the KLdivergence converges to
zero. This new point of view leads to enhanced stable procedures for training
generative models that can utilize arbitrary learning algorithms. It also gives
a new theoretical insight into the original GAN procedure both with and without
the logd trick. Empirical results are shown on image generation to illustrate
the effectiveness of our new method.

Pronouns are frequently omitted in prodrop languages, such as Chinese,
generally leading to significant challenges with respect to the production of
complete translations. To date, very little attention has been paid to the
dropped pronoun (DP) problem within neural machine translation (NMT). In this
work, we propose a novel reconstructionbased approach to alleviating DP
translation problems for NMT models. Firstly, DPs within all source sentences
are automatically annotated with parallel information extracted from the
bilingual training corpus. Next, the annotated source sentence is reconstructed
from hidden representations in the NMT model. With auxiliary training
objectives, in terms of reconstruction scores, the parameters associated with
the NMT model are guided to produce enhanced hidden representations that are
encouraged as much as possible to embed annotated DP information. Experimental
results on both ChineseEnglish and JapaneseEnglish dialogue translation tasks
show that the proposed approach significantly and consistently improves
translation performance over a strong NMT baseline, which is directly built on
the training data annotated with DPs.

Finding the electromagnetic (EM) counterpart of binary compact star merger,
especially the binary neutron star (BNS) merger, is critically important for
gravitational wave (GW) astronomy, cosmology and fundamental physics. On Aug.
17, 2017, Advanced LIGO and \textit{Fermi}/GBM independently triggered the
first BNS merger, GW170817, and its high energy EM counterpart, GRB 170817A,
respectively, resulting in a global observation campaign covering gammaray,
Xray, UV, optical, IR, radio as well as neutrinos. The High Energy Xray
telescope (HE) onboard \textit{Insight}HXMT (Hard Xray Modulation Telescope)
is the unique highenergy gammaray telescope that monitored the entire GW
localization area and especially the optical counterpart (SSS17a/AT2017gfo)
with very large collection area ($\sim$1000 cm$^2$) and microsecond time
resolution in 0.25 MeV. In addition, \textit{Insight}HXMT quickly implemented
a Target of Opportunity (ToO) observation to scan the GW localization area for
potential Xray emission from the GW source. Although it did not detect any
significant high energy (0.25 MeV) radiation from GW170817, its observation
helped to confirm the unexpected weak and soft nature of GRB 170817A.
Meanwhile, \textit{Insight}HXMT/HE provides one of the most stringent
constraints (~10$^{7}$ to 10$^{6}$ erg/cm$^2$/s) for both GRB170817A and any
other possible precursor or extended emissions in 0.25 MeV, which help us to
better understand the properties of EM radiation from this BNS merger.
Therefore the observation of \textit{Insight}HXMT constitutes an important
chapter in the full context of multiwavelength and multimessenger observation
of this historical GW event.

In this paper, we generalize (accelerated) Newton's method with cubic
regularization under inexact secondorder information for (strongly) convex
optimization problems. Under mild assumptions, we provide global rate of
convergence of these methods and show the explicit dependence of the rate of
convergence on the problem parameters. While the complexity bounds of our
presented algorithms are theoretically worse than those of their exact
counterparts, they are at least as good as those of the optimal firstorder
methods. Our numerical experiments also show that using inexact Hessians can
significantly speed up the algorithms in practice.

We present novel minibatch stochastic optimization methods for empirical risk
minimization problems, the methods efficiently leverage variance reduced
firstorder and subsampled higherorder information to accelerate the
convergence speed. For quadratic objectives, we prove improved iteration
complexity over stateoftheart under reasonable assumptions. We also provide
empirical evidence of the advantages of our method compared to existing
approaches in the literature.

Principal component analysis (PCA) has been a prominent tool for
highdimensional data analysis. Online algorithms that estimate the principal
component by processing streaming data are of tremendous practical and
theoretical interests. Despite its rich applications, theoretical convergence
analysis remains largely open. In this paper, we cast online PCA into a
stochastic nonconvex optimization problem, and we analyze the online PCA
algorithm as a stochastic approximation iteration. The stochastic approximation
iteration processes data points incrementally and maintains a running estimate
of the principal component. We prove for the first time a nearly optimal
finitesample error bound for the online PCA algorithm. Under the subgaussian
assumption, we show that the finitesample error bound closely matches the
minimax information lower bound.

We present a novel deep neural network architecture for unsupervised subspace
clustering. This architecture is built upon deep autoencoders, which
nonlinearly map the input data into a latent space. Our key idea is to
introduce a novel selfexpressive layer between the encoder and the decoder to
mimic the "selfexpressiveness" property that has proven effective in
traditional subspace clustering. Being differentiable, our new selfexpressive
layer provides a simple but effective way to learn pairwise affinities between
all data points through a standard backpropagation procedure. Being nonlinear,
our neuralnetwork based method is able to cluster data points having complex
(often nonlinear) structures. We further propose pretraining and finetuning
strategies that let us effectively learn the parameters of our subspace
clustering networks. Our experiments show that the proposed method
significantly outperforms the stateoftheart unsupervised subspace clustering
methods.