PhD Defense Proposal | "Contributions to Computational Methods for Association Extraction from Biomedical Data: Applications to Text Mining and In Silico Toxicology"

May 13 2018 02:00 PM - May 13 2018 03:30 PM

Committee Chairperson: Vladimir B. Bajic
Committee Members: Mikhail Moshkov, Takashi Gojobori


Association extraction is the task of identifying links between entities. Such links are often not apparent or easily identified by human inspection alone. In this work, we contribute to two applications of association extraction in the biomedical domain.

The first application is in text mining, where we extract associations between methylated genes and diseases from the biomedical literature. Gathering such associations is important for disease diagnosis and treatment decisions. We therefore developed the DDMGD database, a comprehensive repository of information on genes methylated in diseases, together with gene expression and disease progression information. By extending DEMGD, a text mining system that we developed earlier, and applying additional post-processing, we extracted ~100,000 such associations from free text. The accuracy of the extracted associations is 82%, as estimated on 2,500 hand-curated entries.

The second application is in in silico toxicology, where we identify relationships between chemical compounds and toxicity effects. Identifying the toxicity effects of chemicals is a necessary step in many processes, including drug design. To extract these associations, we propose using multi-label classification methods, which have not yet been comprehensively benchmarked in predictive toxicology. We therefore benchmarked and analyzed ~19,000 multi-label classification models that we generated from combinations of various computational methods. We demonstrated that the performance of the methods varies under several conditions, and we identified the best-performing model, which achieves an accuracy of 91% on an independent test set.

Finally, we propose a novel framework for developing multi-label classification models from datasets with missing labels. The framework is motivated by the observation that, although some labels are missing, there are regions of the label space in which a subset of labels is known for a group of samples. We therefore developed LDR, an ensemble-based approach that uses bi-clustering to partition the label space. Our preliminary assessment shows that LDR performs better than the baseline approach (i.e., the binary relevance algorithm). The contributions of this research may be used to predict unforeseen associations between existing or new entities and, subsequently, to propose novel hypotheses.
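
For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of binary relevance on data with missing labels: one independent binary classifier is trained per label, using only the samples for which that label is known. It does not reproduce LDR, whose details are not given in this abstract. The nearest-centroid learner, the function names, and the toy data are all assumptions chosen only to keep the example self-contained.

```python
from typing import List, Optional, Sequence

Vector = Sequence[float]


def _centroid(rows: List[Vector]) -> List[float]:
    """Column-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def _sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestCentroid:
    """Toy binary classifier (an assumption, not the proposal's learner)."""

    def fit(self, X: List[Vector], y: List[int]) -> "NearestCentroid":
        pos = [x for x, t in zip(X, y) if t == 1]
        neg = [x for x, t in zip(X, y) if t == 0]
        if not pos or not neg:
            raise ValueError("each label needs known positives and negatives")
        self.pos_, self.neg_ = _centroid(pos), _centroid(neg)
        return self

    def predict_one(self, x: Vector) -> int:
        # Predict 1 when the sample is closer to the positive centroid.
        return 1 if _sq_dist(x, self.pos_) < _sq_dist(x, self.neg_) else 0


def fit_binary_relevance(
    X: List[Vector], Y: List[List[Optional[int]]]
) -> List[NearestCentroid]:
    """Train one classifier per label; None in Y marks a missing label."""
    models = []
    for j in range(len(Y[0])):
        # Keep only the samples whose j-th label is known.
        known = [(x, row[j]) for x, row in zip(X, Y) if row[j] is not None]
        models.append(
            NearestCentroid().fit([x for x, _ in known], [t for _, t in known])
        )
    return models


def predict_binary_relevance(models: List[NearestCentroid], x: Vector) -> List[int]:
    """Predict every label independently for a single sample."""
    return [m.predict_one(x) for m in models]


# Hypothetical toy data: two features, two labels, two missing entries.
X = [[0, 0], [0, 1], [5, 5], [5, 6], [0, 5], [5, 0]]
Y = [[0, 0], [0, None], [1, 1], [1, 1], [None, 1], [1, 0]]

models = fit_binary_relevance(X, Y)
print(predict_binary_relevance(models, [4.5, 0.5]))  # prints [1, 0]
```

Because each label is modeled in isolation, this baseline ignores any dependence between labels, which is the kind of structure an approach that partitions the label space could exploit.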