• Digital-Health-Conference-2020


The Use of Custom Embeddings Generated from Pubmed Corpora for Cancer Research

14:35 - 14:55 Building 19, Hall 1

In natural language processing, one of the big open questions is “what is the optimal approach to embed natural language in a vector space?” An embedding essentially transforms words into series of numbers. Ideally, these numbers represent semantic meaning: in a multidimensional space, the different dimensions should correspond to different types of meaning (e.g. size of an entity, sex of an animal) that a computer algorithm can subsequently use to make inferences.
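As a minimal illustration of this idea, the sketch below uses a hypothetical two-dimensional embedding whose axes are hand-labeled "size" and "sex". The words and values are invented for illustration; real embeddings are learned from text, and their dimensions are rarely this interpretable.

```python
import math

# Hypothetical toy embedding: each word maps to (size, sex).
# All values are invented for illustration, not learned from any corpus.
toy_embedding = {
    "bull": (0.9, 1.0),
    "cow": (0.8, -1.0),
    "rooster": (0.1, 1.0),
    "hen": (0.1, -1.0),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def offset(word_a, word_b):
    """Vector difference between the embeddings of two words."""
    return tuple(x - y for x, y in zip(toy_embedding[word_a], toy_embedding[word_b]))


# The male/female offset points in nearly the same direction for both
# animal pairs, which is the kind of regularity an algorithm can exploit.
sex_offset_similarity = cosine(offset("bull", "cow"), offset("rooster", "hen"))
print(round(sex_offset_similarity, 3))
```

In this toy space, the offset between "bull" and "cow" is almost parallel to the offset between "rooster" and "hen", so a single direction captures the sex distinction independently of size.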

Institutions and corporations endowed with big text data claim that only large corpora produce performant embeddings. In this presentation, we will investigate the minimal corpus size that is useful for extracting cancer-related statements. To this end, we developed a literature knowledge mining tool, “sina” (https://github.com/dicaso/sina), which allows extracting statements relevant to specific conditions and the research question at hand by selecting a specific corpus of documents with which to establish a custom word embedding.
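The general idea of restricting the corpus before building an embedding can be sketched as follows. This is not sina's implementation or API: the documents, the keyword filter, and the function names are hypothetical, and a simple co-occurrence count stands in for whatever embedding method the tool actually uses.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus; in practice these would be PubMed abstracts.
documents = [
    "tumor growth requires angiogenesis",
    "tumor cells evade apoptosis",
    "the telescope observed a distant galaxy",
    "metastasis follows tumor invasion",
]

# Hypothetical selection criterion for the research question at hand.
keywords = {"tumor", "metastasis"}


def select_corpus(docs, keywords):
    """Keep only documents that mention at least one keyword."""
    return [d for d in docs if keywords & set(d.lower().split())]


def cooccurrence_embedding(docs, window=2):
    """Represent each word by counts of the words seen within `window` tokens."""
    vectors = defaultdict(Counter)
    for doc in docs:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


corpus = select_corpus(documents, keywords)
embedding = cooccurrence_embedding(corpus)
print(len(corpus), sorted(embedding["tumor"]))
```

Because the unrelated astronomy sentence is filtered out before counting, its vocabulary never enters the embedding, which is the sense in which the resulting vectors are specific to the selected corpus.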