1. Abstract
N6-methyladenosine (m6A), an abundant eukaryotic mRNA modification, is a crucial mark that is dynamically regulated by demethylases (erasers), methyltransferases (writers), and binding proteins (readers). Decoding the stochastic profile of m6A across the transcriptome is therefore invaluable to our understanding of the biological functions of RNA. We summarized m6A frequencies at 1,624,625 DRACH motifs on human exons from 40 single-base resolution m6A experiments. Four machine learning algorithms, generalized linear model (GLM), multi-layer perceptron (MLP), extreme gradient boosting (XGBoost), and random forest (RF), were implemented to build Poisson regression models. Compared with the classification models used in previous studies, our Poisson regression approach provides a new framework for integrating multiple single-base resolution RNA modification datasets. We demonstrate that the Poisson regressors predict the biological and technical variation between experiments better than the corresponding classifiers trained on the same feature set. In addition, we introduce, for the first time, the binding sites of 17 m6A regulators as predictive features. Compared with using only the sequence-derived and genome-derived features (MSE 1.020 / CE 0.579; AUC 0.854 / MCC 0.410), predictive performance improves significantly after adding the regulator features (MSE 0.855 / CE 0.503; AUC 0.883 / MCC 0.469). These results highlight the importance of protein regulator information when building high-accuracy epi-transcriptomic predictors. Finally, we provide a predicted stochastic m6A map of the entire human exonic region and perform an in-depth analysis of feature importance for both the linear and non-linear models.
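To make the Poisson regression framing concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the response for each site is a count (for example, the number of experiments in which the site is reported as methylated) and that a single binary feature marks overlap with a regulator binding site. The toy data and the function names `fit_poisson_glm` and `predict_rate` are hypothetical and serve only as illustration.

```python
import math

def fit_poisson_glm(X, y, lr=0.05, n_iter=5000):
    """Fit log(E[y]) = b0 + sum_j b_j * x_j by gradient ascent
    on the mean Poisson log-likelihood (log link)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)  # beta[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            mu = math.exp(beta[0] + sum(b * v for b, v in zip(beta[1:], xi)))
            resid = yi - mu  # gradient of the Poisson log-likelihood w.r.t. the linear predictor
            grad[0] += resid
            for j, v in enumerate(xi):
                grad[j + 1] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def predict_rate(beta, xi):
    """Predicted expected count for one feature vector."""
    return math.exp(beta[0] + sum(b * v for b, v in zip(beta[1:], xi)))

# Hypothetical toy data: one binary feature (1 = site overlaps a regulator
# binding site) and a per-site count of experiments reporting methylation.
X = [[0], [0], [0], [0], [1], [1], [1], [1]]
y = [1, 0, 2, 1, 6, 8, 7, 7]

beta = fit_poisson_glm(X, y)
rate_unbound = predict_rate(beta, [0])
rate_bound = predict_rate(beta, [1])
print(round(rate_unbound, 3), round(rate_bound, 3))
```

With an intercept and a single binary feature, the maximum-likelihood fit reproduces the mean count within each group, which gives a simple check on the optimizer. Unlike a binary classifier, the model outputs an expected count per site, so sites methylated in many experiments are distinguished from sites methylated in only a few.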