G. Wang, H. Yin, B. Li, C. Yu, F. Wang, X. Xu, J. Cao, Y. Bao, L. Wang, A.A. Abbasi, V.B. Bajic, L. Ma, Z. Zhang
Bioinformatics (2019)
Motivation
The significance of long non-coding RNAs (lncRNAs) in many biological
processes and diseases has attracted intense interest over the past
several years. However, computational identification of lncRNAs in a
wide range of species remains challenging because it requires prior
knowledge of well-established sequences and annotations or
species-specific training data, yet only a limited number of species
have high-quality sequences and annotations.
Results
Here we first characterize lncRNAs in contrast to protein-coding RNAs
based on feature relationship and find that the relationship between
open reading frame (ORF) length and guanine-cytosine (GC) content
diverges substantially and universally between lncRNAs and
protein-coding RNAs, as observed in a broad variety of species. Based on
this feature relationship, we present LGC, a novel algorithm for
identifying lncRNAs that is able to accurately distinguish lncRNAs from
protein-coding RNAs in a cross-species manner without any prior
knowledge. As validated on large-scale empirical datasets, comparative
results show that LGC outperforms existing algorithms, achieving
higher accuracy and well-balanced sensitivity and specificity, and is
robustly effective (>90% accuracy) in discriminating lncRNAs from
protein-coding RNAs across diverse species that range from plants to
mammals. To our knowledge, this study is the first to differentially
characterize lncRNAs and protein-coding RNAs based on feature
relationship and to apply that characterization to the computational
identification of lncRNAs. Taken together, our study represents a
significant advance in the characterization and identification of
lncRNAs, and LGC thus has broad potential utility for computational
analysis of lncRNAs in a wide range of species.
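
The two features named in the Results, open reading frame (ORF) length and GC content, can be computed from a transcript sequence alone. The following is a minimal sketch of that feature extraction, not the LGC algorithm itself: the abstract does not specify how ORFs are defined or how the two features are combined into a classifier. The sketch assumes an ORF runs from an ATG to the first in-frame stop codon on the forward strand only, and the function names are hypothetical.

```python
# Illustrative feature extraction only; this is NOT the LGC model.
# Assumption: an ORF starts at ATG and ends at the first in-frame stop
# codon, searched in the three forward reading frames.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def normalize(seq: str) -> str:
    """Upper-case the sequence and map RNA 'U' to DNA 'T'."""
    return seq.upper().replace("U", "T")


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = normalize(seq)
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_orf_length(seq: str) -> int:
    """Length in nucleotides of the longest ORF, stop codon included."""
    seq = normalize(seq)
    best = 0
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def orf_gc_features(seq: str) -> tuple[int, float]:
    """Return the (longest ORF length, GC content) pair for one transcript."""
    return longest_orf_length(seq), gc_content(seq)


if __name__ == "__main__":
    # Toy sequence with one ORF (ATG AAA TTT TAG) in the third frame.
    print(orf_gc_features("CCATGAAATTTTAGGC"))  # (12, 0.375)
```

Computing this pair for sets of annotated lncRNAs and protein-coding RNAs is one way to inspect how the two features relate in each class. How LGC models that relationship is described in the full paper rather than in this abstract.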