• MOTIVATION: Proteins fold into complex structures that are crucial for their biological functions. Experimental determination of protein structures is costly and therefore limited to a small fraction of all known proteins. Hence, different computational structure prediction methods are necessary for the modelling of the vast majority of all proteins. In most structure prediction pipelines, the last step is to select the best available model and to estimate its accuracy. This model quality estimation problem has been growing in importance during the last decade, and progress is believed to be important for large scale modelling of proteins. The current generation of model quality estimation programs performs well at separating incorrect and good models, but fails to consistently identify the best possible model. State-of-the-art model quality assessment methods use a combination of features that describe a model and the agreement of the model with features predicted from the protein sequence. RESULTS: We first introduce a deep neural network architecture to predict model quality using significantly fewer input features than state-of-the-art methods. Thereafter, we propose a methodology to train the deep network that leverages the comparative structure of the problem. We also show the possibility of applying transfer learning on databases of known protein structures. We demonstrate its viability by reaching state-of-the-art performance using only a reduced set of input features and a coarse description of the models. AVAILABILITY: The code will be freely available for download at github.com/ElofssonLab/ProQ4.
  • Summary: Protein quality assessment is a long-standing problem in bioinformatics. For more than a decade we have developed state-of-art predictors by carefully selecting and optimising inputs to a machine learning method. The correlation has increased from 0.60 in ProQ to 0.81 in ProQ2 and 0.85 in ProQ3 mainly by adding a large set of carefully tuned descriptions of a protein. Here, we show that a substantial improvement can be obtained using exactly the same inputs as in ProQ2 or ProQ3 but replacing the support vector machine by a deep neural network. This improves the Pearson correlation to 0.90 (0.85 using ProQ2 input features). Availability: ProQ3D is freely available both as a webserver and a stand-alone program at http://proq3.bioinfo.se/
  • Motivation: To assess the quality of a protein model, i.e. to estimate how close it is to its native structure, using no other information than the structure of the model has been shown to be useful for structure prediction. The state of the art method, ProQ2, is based on a machine learning approach that uses a number of features calculated from a protein model. Here, we examine if these features can be exchanged with energy terms calculated from Rosetta and if a combination of these terms can improve the quality assessment. Results: When using the full atom energy function from Rosetta in ProQRosFA the QA is on par with our previous state-of-the-art method, ProQ2. The method based on the low-resolution centroid scoring function, ProQRosCen, performs almost as well and the combination of all the three methods, ProQ2, ProQRosFA and ProQCenFA into ProQ3 show superior performance over ProQ2. Availability: ProQ3 is freely available on BitBucket at https://bitbucket.org/ElofssonLab/proq3
  • Rapidly growing public gene expression databases contain a wealth of data for building an unprecedentedly detailed picture of human biology and disease. This data comes from many diverse measurement platforms that make integrating it all difficult. Although RNA-sequencing (RNA-seq) is attracting the most attention, at present the rate of new microarray studies submitted to public databases far exceeds the rate of new RNA-seq studies. There is clearly a need for methods that make it easier to combine data from different technologies. In this paper, we propose a new method for processing RNA-seq data that yields gene expression estimates that are much more similar to corresponding estimates from microarray data, hence greatly improving cross-platform comparability. The method we call PREBS is based on estimating the expression from RNA-seq reads overlapping the microarray probe regions, and processing these estimates with microarray summarisation algorithm RMA. Using paired microarray and RNA-seq samples from TCGA LAML data set we show that PREBS expression estimates derived from RNA-seq are more similar to microarray-based expression estimates than those from other RNA-seq processing methods. In an experiment to retrieve paired microarray samples from a database using an RNA-seq query sample, gene signatures defined based on PREBS expression estimates were found to be much more accurate than those from other methods. PREBS also allows new ways of using RNA-seq data, such as expression estimation for microarray probe sets. An implementation of the proposed method is available in the Bioconductor package `prebs'.