PhD Defense | Contributions to Computational Methods for Association Extraction from Biomedical Data: Applications to Text Mining and In Silico Toxicology

Nov 01 2018 11:00 AM - Nov 01 2018 01:00 PM

Contributions to Computational Methods for Association Extraction from Biomedical Data: Applications to Text Mining and In Silico Toxicology


The task of association extraction involves identifying links between different entities. Here, we
make contributions to two applications in the biomedical field. The first application is in
the domain of text mining and aims to extract associations between methylated genes and
diseases from biomedical literature. Gathering such associations can benefit disease diagnosis
and treatment decisions. We developed the DDMGD database to provide a comprehensive
repository of information related to genes methylated in diseases, gene expression, and disease
progression. Using DEMGD, a text mining system that we developed, together with additional
post-processing, we extracted ~100,000 such associations from free text. The accuracy of
the extracted associations is 82%, as estimated on 2,500 hand-curated entries. The second application
is in the domain of computational toxicology, which aims to identify relationships between
chemical compounds and toxicity effects. Identifying toxicity effects of chemicals is a necessary
step in many processes including drug design. To extract these associations, we propose using
multi-label classification (MLC) methods. These methods have not undergone comprehensive
benchmarking in the domain of predictive toxicology, even though such benchmarking could help
identify guidelines for overcoming their existing deficiencies. Therefore, we performed extensive
benchmarking and analysis of ~19,000 MLC models. We demonstrated variability in the
performance of these models under several conditions and identified the best-performing model,
which achieves an accuracy of 91% on an independent test set. Finally, we propose a novel
framework, LDR (learning from dense regions), for developing MLC and multi-target regression
(MTR) models from datasets with missing labels. The framework is generic, so it can be applied
to predict associations between samples and discrete or continuous labels. Our assessment shows
that LDR performed better than the baseline approach (i.e., the binary relevance algorithm) when
evaluated using four MLC and five MTR datasets. LDR achieved accuracy scores of up to 97%
on the MLC test datasets and R² scores of up to 88% on the MTR test datasets. Additionally, we
developed a novel method for minority oversampling to tackle the problem of imbalanced MLC
datasets. Our method improved the precision score of LDR by 10%.
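
For readers unfamiliar with the baseline named above, the following is a minimal sketch of the binary relevance idea applied to a label matrix with missing entries. It is not the LDR framework and not the authors' implementation. It assumes that missing labels are encoded as NaN and that each per-label classifier simply ignores rows whose label is unobserved. The function names, the toy data, and the small NumPy logistic regression are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit a plain logistic regression by batch gradient descent (illustrative only)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))          # predicted probabilities
        w -= lr * Xb.T @ (p - y) / len(y)          # gradient step on the log-loss
    return w

def fit_binary_relevance(X, Y):
    """Train one independent binary classifier per label column.

    Assumption: missing labels are NaN, and each classifier is trained
    only on the rows where its own label is observed.
    """
    models = []
    for j in range(Y.shape[1]):
        observed = ~np.isnan(Y[:, j])
        models.append(fit_logistic(X[observed], Y[observed, j]))
    return models

def predict_binary_relevance(models, X):
    """Predict every label independently and stack the results column-wise."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    scores = np.column_stack([Xb @ w for w in models])
    return (scores > 0).astype(int)                # score > 0 means probability > 0.5

# Toy data: two labels, each determined by the sign of one feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y_true = np.column_stack([X[:, 0] > 0, X[:, 1] > 0]).astype(float)

# Hide roughly 30% of the label entries to mimic a dataset with missing labels.
Y = Y_true.copy()
Y[rng.random(Y.shape) < 0.3] = np.nan

models = fit_binary_relevance(X, Y)
pred = predict_binary_relevance(models, X)
accuracy = (pred == Y_true).mean()
```

Because each label is modeled separately, binary relevance cannot exploit correlations between labels, which is one reason it is commonly used as a baseline for comparison.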


Arwa Raies is currently a PhD Candidate in Computer Science at King Abdullah University of
Science and Technology (KAUST) in Thuwal, Saudi Arabia. She holds a Master’s Degree in
Computer Science from KAUST, and a Bachelor’s Degree in Computer Science from Prince
Sultan University in Riyadh, Saudi Arabia.
Arwa has led research projects in the domains of machine learning, natural language processing, and
computational toxicology. She has published several papers in peer-reviewed journals, has
been a plenary speaker, and has presented several posters. Her research interests broadly concern
bioinformatics, cheminformatics, data mining, and health care.