• ### Electronic effect of doped oxygen atoms in Bi2201 superconductors determined by scanning tunneling microscopy(1803.03400)

April 3, 2018 cond-mat.supr-con
The oxygen dopants are essential in tuning electronic properties of Bi$_2$Sr$_2$Ca$_{n-1}$Cu$_n$O$_{2n+4+\delta}$ superconductors. Here we apply the technique of scanning tunneling microscopy and spectroscopy to study the influence of oxygen dopants in an optimally doped Bi$_2$Sr$_{2-x}$La$_x$CuO$_{6+\delta}$ and an overdoped Bi$_{2-y}$Pb$_y$Sr$_2$CuO$_{6+\delta}$. In both samples, we find that interstitial oxygen atoms on the SrO layers dominate over the other two forms of oxygen dopants, oxygen vacancies on the SrO layers and interstitial oxygen atoms on the BiO layers. The hole doping is estimated from the oxygen concentration, as compared to the result extracted from the measured Fermi surface. The precise spatial location is employed to obtain a negative correlation between the oxygen dopants and the inhomogeneous pseudogap.
• ### The study of electronic nematicity in an overdoped (Bi, Pb)$_2$Sr$_2$CuO$_{6+\delta}$ superconductor using scanning tunneling spectroscopy(1803.03403)

March 9, 2018 cond-mat.supr-con
The pseudogap (PG) state and its related intra-unit-cell symmetry breaking remain the focus in the research of cuprate superconductors. Although the nematicity has been studied in Bi$_2$Sr$_2$CaCu$_2$O$_{8+\delta}$, especially underdoped samples, its behavior in other cuprates and different doping regions is still unclear. Here we apply a scanning tunneling microscope to explore an overdoped (Bi, Pb)$_2$Sr$_2$CuO$_{6+\delta}$ with a large Fermi surface (FS). The establishment of a nematic order and its real-space distribution is visualized as the energy scale approaches the PG.
• ### Aspect-Aware Latent Factor Model: Rating Prediction with Ratings and Reviews(1802.07938)

Feb. 22, 2018 cs.IR
Although latent factor models (e.g., matrix factorization) achieve good accuracy in rating prediction, they suffer from several problems including cold-start, non-transparency, and suboptimal recommendation for local users or items. In this paper, we employ textual review information with ratings to tackle these limitations. Firstly, we apply a proposed aspect-aware topic model (ATM) on the review text to model user preferences and item features from different aspects, and estimate the aspect importance of a user towards an item. The aspect importance is then integrated into a novel aspect-aware latent factor model (ALFM), which learns users' and items' latent factors based on ratings. In particular, ALFM introduces a weighted matrix to associate those latent factors with the same set of aspects discovered by ATM, such that the latent factors could be used to estimate aspect ratings. Finally, the overall rating is computed via a linear combination of the aspect ratings, which are weighted by the corresponding aspect importance. In this way, our model could alleviate the data sparsity problem and gain good interpretability for recommendation. Besides, an aspect rating is weighted by an aspect importance, which is dependent on the targeted user's preferences and the targeted item's features. Therefore, it is expected that the proposed method can model a user's preferences on an item more accurately for each user-item pair locally. Comprehensive experimental studies have been conducted on 19 datasets from Amazon and the Yelp 2017 Challenge dataset. Results show that our method achieves significant improvement compared with strong baseline methods, especially for users with only a few ratings. Moreover, our model could interpret the recommendation results in depth.
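The final step described in this abstract, an overall rating formed as a linear combination of aspect ratings weighted by aspect importance, can be illustrated with a minimal sketch. The function name `overall_rating` and the normalization of the importance weights to sum to one are assumptions made for illustration; the abstract does not state how ALFM normalizes the weights or how the aspect ratings themselves are estimated.

```python
def overall_rating(aspect_ratings, aspect_importance):
    """Combine per-aspect ratings into one rating.

    aspect_ratings:    estimated rating for each aspect (hypothetical inputs).
    aspect_importance: non-negative importance of each aspect for this
                       user-item pair; normalized here so the weights sum
                       to 1 (an assumption of this sketch).
    """
    if len(aspect_ratings) != len(aspect_importance):
        raise ValueError("one importance weight is needed per aspect rating")
    total = sum(aspect_importance)
    if total <= 0:
        raise ValueError("importance weights must have a positive sum")
    # Linear combination: sum over aspects of (normalized weight * rating).
    return sum(r * w / total for r, w in zip(aspect_ratings, aspect_importance))


# Two aspects: the first matters three times as much as the second.
print(overall_rating([4.0, 2.0], [3.0, 1.0]))
```

Because the weights depend on the specific user-item pair, the same aspect ratings can yield different overall ratings for different users, which is the local behavior the abstract refers to.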
• ### Evidence of Electron-Hole Imbalance in WTe2 from High-Resolution Angle-Resolved Photoemission Spectroscopy(1708.08265)

Aug. 28, 2017 cond-mat.mtrl-sci
WTe2 has attracted a great deal of attention because it exhibits extremely large and nonsaturating magnetoresistance. The underlying origin of such a giant magnetoresistance is still under debate. Utilizing laser-based angle-resolved photoemission spectroscopy with high energy and momentum resolutions, we reveal the complete electronic structure of WTe2. This makes it possible to determine accurately the electron and hole concentrations and their temperature dependence. We find that, with increasing temperature, the overall electron concentration increases while the total hole concentration decreases. This indicates that the electron-hole compensation, if it exists, can only occur in a narrow temperature range, and in most of the temperature range there is an electron-hole imbalance. Our results are not consistent with the perfect electron-hole compensation picture that is commonly considered to be the cause of the unusual magnetoresistance in WTe2. We identified a flat band near the Brillouin zone center that is close to the Fermi level and exhibits a pronounced temperature dependence. Such a flat band can play an important role in dictating the transport properties of WTe2. Our results provide new insight into the origin of the unusual magnetoresistance in WTe2.
• ### Electronic structure of heavy fermion system CePt2In7 from angle-resolved photoemission spectroscopy(1706.05794)

June 19, 2017 cond-mat.str-el
We have carried out high-resolution angle-resolved photoemission measurements on the Ce-based heavy fermion compound CePt2In7 that exhibits stronger two-dimensional character than the prototypical heavy fermion system CeCoIn5. Multiple Fermi surface sheets and a complex band structure are clearly resolved. We have also performed detailed band structure calculations on CePt2In7. The good agreement found between our measurements and the calculations suggests that the band renormalization effect is rather weak in CePt2In7. A comparison of the common features of the electronic structure of CePt2In7 and CeCoIn5 indicates that CeCoIn5 shows a much stronger band renormalization effect than CePt2In7. These results provide new information for understanding the heavy fermion behaviors and unconventional superconductivity in Ce-based heavy fermion systems.
• ### DIMM-SC: A Dirichlet mixture model for clustering droplet-based single cell transcriptomic data(1704.02007)

April 6, 2017 q-bio.QM, stat.ML
Motivation: Single cell transcriptome sequencing (scRNA-Seq) has become a revolutionary tool to study cellular and molecular processes at single cell resolution. Among existing technologies, the recently developed droplet-based platform enables efficient parallel processing of thousands of single cells with direct counting of transcript copies using Unique Molecular Identifier (UMI). Despite the technology advances, statistical methods and computational tools are still lacking for analyzing droplet-based scRNA-Seq data. Particularly, model-based approaches for clustering large-scale single cell transcriptomic data are still under-explored. Methods: We developed DIMM-SC, a Dirichlet Mixture Model for clustering droplet-based Single Cell transcriptomic data. This approach explicitly models UMI count data from scRNA-Seq experiments and characterizes variations across different cell clusters via a Dirichlet mixture prior. An expectation-maximization algorithm is used for parameter inference. Results: We performed comprehensive simulations to evaluate DIMM-SC and compared it with existing clustering methods such as K-means, CellTree and Seurat. In addition, we analyzed public scRNA-Seq datasets with known cluster labels and in-house scRNA-Seq datasets from a study of systemic sclerosis with prior biological knowledge to benchmark and validate DIMM-SC. Both simulation studies and real data applications demonstrated that overall, DIMM-SC achieves substantially improved clustering accuracy and much lower clustering variability compared to other existing clustering methods. More importantly, as a model-based approach, DIMM-SC is able to quantify the clustering uncertainty for each single cell, facilitating rigorous statistical inference and biological interpretations, which are typically unavailable from existing clustering methods.
• ### Electronic structure of the ingredient planes of cuprate superconductor Bi2Sr2CuO6+$\delta$: a comparison study with Bi2Sr2CaCu2O8+$\delta$(1512.09230)

By means of low-temperature scanning tunneling microscopy, we report on the electronic structures of BiO and SrO planes of Bi2Sr2CuO6+$\delta$ (Bi-2201) superconductor prepared by argon-ion bombardment and annealing. Depending on post annealing conditions, the BiO planes exhibit either pseudogap (PG) with sharp coherence peaks and an anomalously large gap of 49 meV or van Hove singularity (VHS) near the Fermi level, while the SrO is always characteristic of a PG-like feature. This contrasts with Bi2Sr2CaCu2O8+$\delta$ (Bi-2212) superconductor where VHS occurs solely on the SrO plane. We disclose the interstitial oxygen dopants ($\delta$ in the formulas) as a primary cause for the occurrence of VHS, which are located dominantly around the BiO and SrO planes, respectively, in Bi-2201 and Bi-2212. This is supported by the contrasting structural buckling amplitude of BiO and SrO planes in the two superconductors. Our findings provide solid evidence for the irrelevance of PG to the superconductivity in the two superconductors, as well as insights into why Bi-2212 can achieve a higher superconducting transition temperature than Bi-2201, and by implication, the mechanism of cuprate superconductivity.
• ### Spectroscopic Evidence of Type II Weyl Semimetal State in WTe2(1604.04218)

Quantum topological materials, exemplified by topological insulators, three-dimensional Dirac semimetals and Weyl semimetals, have attracted much attention recently because of their unique electronic structure and physical properties. Very recently it was proposed that the three-dimensional Weyl semimetals can be further classified into two types. In the type I Weyl semimetals, a topologically protected linear crossing of two bands, i.e., a Weyl point, occurs at the Fermi level resulting in a point-like Fermi surface. In the type II Weyl semimetals, the Weyl point emerges from a contact of an electron and a hole pocket at the boundary resulting in a highly tilted Weyl cone. In type II Weyl semimetals, the Lorentz invariance is violated and a fundamentally new kind of Weyl Fermions is produced that leads to new physical properties. WTe2 is interesting because it exhibits anomalously large magnetoresistance. It has ignited new excitement because it is proposed to be the first candidate for realizing type II Weyl Fermions. Here we report angle-resolved photoemission (ARPES) evidence identifying the type II Weyl Fermion state in WTe2. By utilizing our latest generation laser-based ARPES system with superior energy and momentum resolutions, we have revealed a full picture of the electronic structure of WTe2. A clear surface state has been identified and its connection with the bulk electronic states in the momentum and energy space shows a good agreement with the calculated band structures with the type II Weyl states. Our results provide spectroscopic evidence for the observation of type II Weyl states in WTe2. This lays a foundation for further exploration of novel phenomena and physical properties in the type II Weyl semimetals.
• ### Electronic Evidence for Type II Weyl Semimetal State in MoTe2(1604.01706)

Topological quantum materials, including topological insulators and superconductors, Dirac semimetals and Weyl semimetals, have attracted much attention recently for their unique electronic structure, spin texture and physical properties. Very recently, a new type of Weyl semimetal has been proposed where the Weyl Fermions emerge at the boundary between electron and hole pockets in a new phase of matter, which is distinct from the standard type I Weyl semimetals with a point-like Fermi surface. The Weyl cone in these type II semimetals is strongly tilted and the related Fermi surface undergoes a Lifshitz transition, giving rise to a new kind of chiral anomaly and other new physics. MoTe2 is proposed to be a candidate type II Weyl semimetal; the sensitivity of its topological state to lattice constants and correlation also makes it an ideal platform to explore possible topological phase transitions. By performing laser-based angle-resolved photoemission (ARPES) measurements with unprecedentedly high resolution, we have uncovered electronic evidence of the type II semimetal state in MoTe2. We have established a full picture of the bulk electronic states and surface state for MoTe2 that are consistent with the band structure calculations. A single branch of surface state is identified that connects bulk hole pockets and bulk electron pockets. Detailed temperature-dependent ARPES measurements show high intensity spot-like features that are ~40 meV above the Fermi level and are close to the momentum space consistent with the theoretical expectation of the type II Weyl points. Our results constitute electronic evidence on the nature of the Weyl semimetal state that favors the presence of two sets of type II Weyl points in MoTe2.
• ### Electronic Evidence of Temperature-Induced Lifshitz Transition and Topological Nature in ZrTe5(1602.03576)

Topological materials have attracted much attention recently. While three-dimensional topological insulators are becoming abundant, two-dimensional topological insulators remain rare, particularly in natural materials. ZrTe5 has hosted a long-standing puzzle on its anomalous transport properties; its underlying origin remains elusive. Lately, ZrTe5 has ignited renewed interest because it is predicted that single-layer ZrTe5 is a two-dimensional topological insulator and there is possibly a topological phase transition in bulk ZrTe5. However, the topological nature of ZrTe5 is under debate as some experiments point to its being a three-dimensional or quasi-two-dimensional Dirac semimetal. Here we report high-resolution laser-based angle-resolved photoemission measurements on ZrTe5. The electronic property of ZrTe5 is dominated by two branches of nearly-linear-dispersion bands at the Brillouin zone center. These two bands are separated by an energy gap that decreases with decreasing temperature but persists down to the lowest temperature we measured (~2 K). The overall electronic structure exhibits a dramatic temperature dependence; it evolves from a p-type semimetal with a hole-like Fermi pocket at high temperature, to a semiconductor around 135 K where its resistivity exhibits a peak, to an n-type semimetal with an electron-like Fermi pocket at low temperature. These results provide clear electronic evidence of the temperature-induced Lifshitz transition in ZrTe5. They provide a natural understanding of the underlying origin of the resistivity anomaly at ~135 K and its associated reversal of the charge carrier type. Our observations also provide key information for deciphering the topological nature of ZrTe5 and a possible temperature-induced topological phase transition.
• ### Identification of Topological Surface State in PdTe2 Superconductor by Angle-Resolved Photoemission Spectroscopy(1505.06642)

High resolution angle-resolved photoemission measurements have been carried out on the transition metal dichalcogenide PdTe2, a superconductor with a Tc of 1.7 K. Combined with theoretical calculations, we have discovered for the first time the existence of a topologically nontrivial surface state with a Dirac cone in the PdTe2 superconductor. It is located at the Brillouin zone center and possesses helical spin texture. Distinct from the usual three-dimensional topological insulators where the Dirac cone of the surface state lies at the Fermi level, the Dirac point of the surface state in PdTe2 lies deep below the Fermi level at ~1.75 eV binding energy and is well separated from the bulk states. The identification of a topological surface state in the PdTe2 superconductor deep below the Fermi level provides a unique system to explore new phenomena and properties and opens a door for finding new topological materials in transition metal chalcogenides.
• ### Imputation of truncated p-values for meta-analysis methods and its genomic application(1501.04415)

Jan. 19, 2015 q-bio.QM, stat.AP
Microarray analysis to monitor expression activities in thousands of genes simultaneously has become routine in biomedical research during the past decade. A tremendous number of expression profiles are generated and stored in the public domain, and information integration by meta-analysis to detect differentially expressed (DE) genes has become popular to obtain increased statistical power and validated findings. Methods that aggregate transformed $p$-value evidence have been widely used in genomic settings, among which Fisher's and Stouffer's methods are the most popular ones. In practice, raw data and $p$-values of DE evidence are often not available in genomic studies that are to be combined. Instead, only the detected DE gene lists under a certain $p$-value threshold (e.g., DE genes with $p$-value $<0.001$) are reported in journal publications. The truncated $p$-value information makes the aforementioned meta-analysis methods inapplicable and researchers are forced to apply a less efficient vote counting method or naïvely drop the studies with incomplete information. The purpose of this paper is to develop effective meta-analysis methods for such situations with partially censored $p$-values. We developed and compared three imputation methods - mean imputation, single random imputation and multiple imputation - for a general class of evidence aggregation methods of which Fisher's and Stouffer's methods are special examples. The null distribution of each method was analytically derived and subsequent inference and genomic analysis frameworks were established. Simulations were performed to investigate the type I error, power and the control of false discovery rate (FDR) for (correlated) gene expression data. The proposed methods were applied to several genomic applications in colorectal cancer, pain and liquid association analysis of major depressive disorder (MDD). The results showed that imputation methods outperformed existing naïve approaches. Mean imputation and multiple imputation methods performed the best and are recommended for future applications.
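To make the idea of imputing truncated $p$-values concrete, here is a minimal sketch of Fisher's statistic, $-2\sum_i \ln p_i$, with one plausible form of mean imputation. It assumes that a truncated study only reports $p < \alpha$ and that, under the null, such a $p$-value is uniform on $(0, \alpha)$, so its contribution $-2\ln p$ is replaced by the conditional mean $2(1 - \ln\alpha)$. Whether this matches the paper's exact definition of mean imputation is an assumption, and the function name is hypothetical. The sketch returns only the statistic: after imputation it no longer follows the usual chi-square null, which is why the abstract mentions deriving the null distribution analytically.

```python
import math


def fisher_stat_mean_imputed(p_values, alpha):
    """Fisher's combined statistic with mean imputation for truncated p-values.

    p_values: list of floats in (0, 1], or None where a study only reports
              that its p-value fell below the threshold `alpha`.
    alpha:    the reporting threshold shared by the truncated studies
              (a simplifying assumption of this sketch).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    # E[-2 ln p | p < alpha] for p ~ Uniform(0, 1) equals 2 * (1 - ln alpha).
    imputed_term = 2.0 * (1.0 - math.log(alpha))
    total = 0.0
    for p in p_values:
        if p is None:
            total += imputed_term          # truncated: use the conditional mean
        else:
            total += -2.0 * math.log(p)    # fully observed p-value
    return total


# Two studies report exact p-values; one only reports p < 0.001.
print(fisher_stat_mean_imputed([0.04, 0.20, None], alpha=0.001))
```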
• ### Subgroup Mixable Inference in Personalized Medicine, with an Application to Time-to-Event Outcomes(1409.0713)

Sept. 2, 2014 stat.ME
Measuring treatment efficacy in a mixture of subgroups from a randomized clinical trial is a fundamental problem in personalized medicine development, in deciding whether to treat the entire patient population or to target a subgroup. We show that some commonly used efficacy measures are not suitable for a mixture population. We also show that, while it is important to adjust for imbalance in the data using least squares means (LSmeans) (not marginal means) estimation, the current practice of applying LSmeans to directly estimate the efficacy in a mixture population for any type of outcome is inappropriate. Proposing a new principle called *subgroup mixable estimation*, we establish the logical relationship among parameters that represent efficacy and develop a general inference procedure to confidently infer efficacy in subgroups and their mixtures. Using oncology studies with time-to-event outcomes as an example, we show that the Hazard Ratio is not suitable for measuring efficacy in a mixture population, and provide alternative efficacy measures with a valid inference procedure.
• ### The role of handbooks in knowledge creation and diffusion: A case of science and technology studies(1406.2886)

June 11, 2014 cs.DL
Genre is considered to be an important element in scholarly communication and in the practice of scientific disciplines. However, scientometric studies have typically focused on a single genre, the journal article. The goal of this study is to understand the role that handbooks play in knowledge creation and diffusion and their relationship with the genre of journal articles, particularly in highly interdisciplinary and emergent social science and humanities disciplines. To shed light on these questions we focused on handbooks and journal articles published over the last four decades belonging to the research area of Science and Technology Studies (STS), broadly defined. To get a detailed picture we used the full-text of five handbooks (500,000 words) and a well-defined set of 11,700 STS articles. We confirmed the methodological split of STS into qualitative and quantitative (scientometric) approaches. Even when the two traditions explore similar topics (e.g., science and gender) they approach them from different starting points. The change in cognitive foci in both handbooks and articles partially reflects the changing trends in STS research, often driven by technology. Using text similarity measures we found that, in the case of STS, handbooks play no special role in either focusing the research efforts or marking their decline. In general, they do not represent the summaries of research directions that have emerged since the previous edition of the handbook.
• ### Entitymetrics: Measuring the Impact of Entities(1309.2486)

Sept. 10, 2013 cs.DL
This paper proposes entitymetrics to measure the impact of knowledge units. Entitymetrics highlight the importance of entities embedded in scientific literature for further knowledge discovery. In this paper, we use Metformin, a drug for diabetes, as an example to form an entity-entity citation network based on literature related to Metformin. We then calculate the network features and compare the centrality ranks of biological entities with results from the Comparative Toxicogenomics Database (CTD). The comparison demonstrates the usefulness of entitymetrics for detecting most of the outstanding interactions manually curated in CTD.
• ### Estimating mean survival time: when is it possible?(1307.8369)

July 31, 2013 math.ST, stat.TH
For right censored survival data, it is well known that the mean survival time can be consistently estimated when the support of the censoring time contains the support of the survival time. In practice, however, this condition can be easily violated because the follow-up of a study is usually within a finite window. In this article we show that the mean survival time is still estimable from a linear model when the support of some covariate(s) with nonzero coefficient(s) is unbounded regardless of the length of follow-up. This implies that the mean survival time can be well estimated when the covariate range is wide in practice. The theoretical finding is further verified for finite samples by simulation studies. Simulations also show that, when both models are correctly specified, the linear model yields reasonable mean square prediction errors and outperforms the Cox model, particularly with heavy censoring and short follow-up time.
• ### Meta Path-Based Collective Classification in Heterogeneous Information Networks(1305.4433)

May 20, 2013 cs.LG, stat.ML
Collective classification has been intensively studied due to its impact in many important applications, such as web mining, bioinformatics and citation analysis. Collective classification approaches exploit the dependencies of a group of linked objects whose class labels are correlated and need to be predicted simultaneously. In this paper, we focus on studying the collective classification problem in heterogeneous networks, which involves multiple types of data objects interconnected by multiple types of links. Intuitively, two objects are correlated if they are linked by many paths in the network. However, most existing approaches measure the dependencies among objects through direct links or indirect links without considering the different semantic meanings behind different paths. In this paper, we study the collective classification problem that is defined among the same type of objects in heterogeneous networks. Moreover, by considering different linkage paths in the network, one can capture the subtlety of different types of dependencies among objects. We introduce the concept of meta-path based dependencies among objects, where a meta path is a path consisting of a certain sequence of link types. We show that the quality of collective classification results strongly depends upon the meta paths used. To accommodate the large network size, a novel solution, called HCC (meta-path based Heterogenous Collective Classification), is developed to effectively assign labels to a group of instances that are interconnected through different meta-paths. The proposed HCC model can capture different types of dependencies among objects with respect to different meta paths. Empirical studies on real-world networks demonstrate the effectiveness of the proposed meta path-based collective classification approach.
• ### Collective allocation of science funding: from funding agencies to scientific agency(1304.1067)

April 3, 2013 physics.soc-ph, cs.DL
Public agencies like the U.S. National Science Foundation (NSF) and the National Institutes of Health (NIH) award tens of billions of dollars in annual science funding. How can this money be distributed as efficiently as possible to best promote scientific innovation and productivity? The present system relies primarily on peer review of project proposals. In 2010 alone, NSF convened more than 15,000 scientists to review 55,542 proposals. Although considered the scientific gold standard, peer review requires significant overhead costs, and may be subject to biases, inconsistencies, and oversights. We investigate a class of funding models in which all participants receive an equal portion of yearly funding, but are then required to anonymously donate a fraction of their funding to peers. The funding thus flows from one participant to the next, each acting as if he or she were a funding agency themselves. Here we show through a simulation conducted over large-scale citation data (37M articles, 770M citations) that such a distributed system for science may yield funding patterns similar to existing NIH and NSF distributions, but may do so at much lower overhead while exhibiting a range of other desirable features. Self-correcting mechanisms in scientific peer evaluation can yield an efficient and fair distribution of funding. The proposed model can be applied in many situations in which top-down or bottom-up allocation of public resources is either impractical or undesirable, e.g. public investments, distribution chains, and shared resource management.
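The funding model described in this abstract can be sketched as a single redistribution round. This is a minimal illustration under stated assumptions: each participant's budget is taken to be an equal base amount plus donations received in the previous round, a fixed fraction of that budget is donated, and the donation is split according to a hypothetical `preferences` matrix. The abstract's simulation is driven by citation data instead, and the function and parameter names here are invented for illustration.

```python
def funding_round(received_prev, base, fraction, preferences):
    """Run one round of peer-to-peer funding redistribution.

    received_prev: donations each participant received in the previous round.
    base:          equal base funding given to every participant this round.
    fraction:      share of each participant's budget that must be donated.
    preferences:   preferences[i][j] is the non-negative weight participant i
                   gives to peer j (hypothetical; self-weights are ignored).
    Returns (kept, received): funds kept and donations received per participant.
    """
    n = len(received_prev)
    budgets = [base + r for r in received_prev]
    received = [0.0] * n
    for i in range(n):
        donation = fraction * budgets[i]
        # Total weight over peers only; nobody may donate to themselves.
        peer_weight = sum(w for j, w in enumerate(preferences[i]) if j != i)
        if peer_weight <= 0:
            raise ValueError(f"participant {i} has no peer to donate to")
        for j, w in enumerate(preferences[i]):
            if j != i:
                received[j] += donation * w / peer_weight
    kept = [b - fraction * b for b in budgets]
    return kept, received


# Three participants; participant 0 splits its donation between 1 and 2,
# while 1 and 2 both donate everything they must give to participant 0.
prefs = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
kept, received = funding_round([0.0, 0.0, 0.0], base=100.0, fraction=0.5, preferences=prefs)
print(kept, received)
```

Feeding `received` back in as `received_prev` for the next round lets funding flow from one participant to the next, and the total amount of money in the system is conserved within each round.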
• ### What is the Nature of Chinese MicroBlogging: Unveiling the Unique Features of Tencent Weibo(1211.2197)

Dec. 12, 2012 physics.soc-ph, cs.SI
China has the largest number of online users in the world, and about 20% of internet users are from China. This is a huge, as well as a mysterious, market for the IT industry due to various reasons such as cultural differences. Twitter is the largest microblogging service in the world and Tencent Weibo is one of the largest microblogging services in China. Employing the two data sets as a source in our study, we try to unveil the unique behaviors of Chinese users. We have collected the entire Tencent Weibo from Oct. 10, 2011 to Jan. 5, 2012 and obtained 320 million user profiles and 5.15 billion user actions. We study Tencent Weibo at both macro and micro levels. At the macro level, Tencent users are more active in forwarding messages, but have fewer reciprocal relationships than Twitter users; their topic preferences are very different from those of Twitter users in both content and time consumption; besides, information can be diffused more efficiently in Tencent Weibo. At the micro level, we mainly evaluate users' social influence using two indexes, "Forward" and "Follower"; we study how users' actions contribute to their social influence, and further identify unique features of Tencent users. According to our studies, Tencent users' actions are more personalized and diverse, and the influential users play a more important part in the whole network. Based on the above analysis, we design a graphical model for predicting users' forwarding behaviors. Our experimental results on the large Tencent Weibo data validate the correctness of the discoveries and the effectiveness of the proposed model. To the best of our knowledge, this work is the first quantitative study on the entire Tencentsphere and information diffusion on it.
• ### Citation content analysis (cca): A framework for syntactic and semantic analysis of citation content(1211.6321)

Nov. 27, 2012 physics.soc-ph, cs.IT, math.IT, cs.IR, cs.DL
This paper proposes a new framework for Citation Content Analysis (CCA), for syntactic and semantic analysis of citation content that can be used to better analyze the rich sociocultural context of research behavior. The framework could be considered the next generation of citation analysis. This paper briefly reviews the history and features of content analysis in traditional social sciences, and its previous application in Library and Information Science. Based on critical discussion of the theoretical necessity of a new method as well as the limits of citation analysis, the nature and purposes of CCA are discussed, and potential procedures to conduct CCA, including principles to identify the reference scope and a two-dimensional (citing and cited) and two-modular (syntactic and semantic modules) codebook, are provided and described. Future work and implications are also suggested.
• ### A bird's-eye view of scientific trading: Dependency relations among fields of science(1211.5820)

Nov. 25, 2012 cs.DL
We use a trading metaphor to study knowledge transfer in the sciences as well as the social sciences. The metaphor comprises four dimensions: (a) Discipline Self-dependence, (b) Knowledge Exports/Imports, (c) Scientific Trading Dynamics, and (d) Scientific Trading Impact. This framework is applied to a dataset of 221 Web of Science subject categories. We find that: (i) the Scientific Trading Impact and Dynamics of Materials Science and Transportation Science have increased; (ii) Biomedical Disciplines, Physics, and Mathematics are significant knowledge exporters, as is Statistics & Probability; (iii) in the social sciences, Economics, Business, Psychology, Management, and Sociology are important knowledge exporters; (iv) Discipline Self-dependence is associated with specialized domains which have ties to professional practice (e.g., Law, Ophthalmology, Dentistry, Oral Surgery & Medicine, Psychology, Psychoanalysis, Veterinary Sciences, and Nursing).
• ### Topic-Level Opinion Influence Model(TOIM): An Investigation Using Tencent Micro-Blogging(1210.6497)

Oct. 24, 2012 cs.SI, cs.LG, cs.CY
Mining user opinion from Micro-Blogging has been extensively studied on the most popular social networking sites such as Twitter and Facebook in the U.S., but few studies have been done on Micro-Blogging websites in other countries (e.g. China). In this paper, we analyze the social opinion influence on Tencent, one of the largest Micro-Blogging websites in China, endeavoring to unveil the behavior patterns of Chinese Micro-Blogging users. This paper proposes a Topic-Level Opinion Influence Model (TOIM) that simultaneously incorporates topic factor and social direct influence in a unified probabilistic framework. Based on TOIM, two topic level opinion influence propagation and aggregation algorithms are developed to consider the indirect influence: CP (Conservative Propagation) and NCP (None Conservative Propagation). Users' historical social interaction records are leveraged by TOIM to construct their progressive opinions and neighbors' opinion influence through a statistical learning process, which can be further utilized to predict users' future opinions on some specific topics. To evaluate and test this proposed model, an experiment was designed and a sub-dataset from Tencent Micro-Blogging was used. The experimental results show that TOIM outperforms baseline methods in predicting users' opinions. CP and NCP do not differ significantly from each other, and both could significantly improve the recall and F1-measure of TOIM.
• ### A sieve M-theorem for bundled parameters in semiparametric models, with application to the efficient estimation in a linear model for censored data(1203.2470)

March 12, 2012 math.ST, stat.TH
In many semiparametric models that are parameterized by two types of parameters---a Euclidean parameter of interest and an infinite-dimensional nuisance parameter---the two parameters are bundled together, that is, the nuisance parameter is an unknown function that contains the parameter of interest as part of its argument. For example, in a linear regression model for censored survival data, the unspecified error distribution function involves the regression coefficients. Motivated by developing an efficient estimating method for the regression parameters, we propose a general sieve M-theorem for bundled parameters and apply the theorem to deriving the asymptotic theory for the sieve maximum likelihood estimation in the linear regression model for censored survival data. The numerical implementation of the proposed estimating method can be achieved through the conventional gradient-based search algorithms such as the Newton--Raphson algorithm. We show that the proposed estimator is consistent and asymptotically normal and achieves the semiparametric efficiency bound. Simulation studies demonstrate that the proposed method performs well in practical settings and yields more efficient estimates than existing estimating equation based methods. Illustration with a real data example is also provided.
Most studies on social influence have focused on direct influence, while another interesting question is whether indirect influence exists between two users who are not directly connected in the network, and what affects such influence. In addition, the theory of *complex contagion* tells us that more spreaders will enhance the indirect influence between two users. Our observation of the intensity of indirect influence, propagated by $n$ parallel spreaders and quantified by retweeting probability on Twitter, shows that complex contagion is validated globally but is violated locally. In other words, the retweeting probability increases non-monotonically with some local drops.