Using Machine Reading to Aid Cancer Understanding and Treatment
PubMed, a repository and search engine for biomedical literature, now indexes more than 1 million articles each year. At the same time, a typical large-scale patient profiling effort produces petabytes of data, a volume expected to reach exabytes in the near future. Combining these large profiling data sets with the mechanistic biological information in the literature is an exciting opportunity that can yield a causal, predictive understanding of cellular processes. Such understanding can unlock important downstream applications in medicine and biology. Unfortunately, most of the mechanistic knowledge in the literature is not in a computable form and therefore remains largely hidden.
In the first part of the talk I will describe a natural language processing (NLP) approach that captures a system-scale, mechanistic understanding of cellular processes through automated, large-scale reading of scientific literature. At the core of this approach are compact semantic grammars that capture mentions of biological entities (e.g., genes, proteins, protein families, simple chemicals), events that operate over these entities (e.g., biochemical reactions), and nested events that operate over other events (e.g., catalyses). This grammar-based approach is a departure from recent trends in NLP such as deep learning, but I will argue that it is a better direction for cross-disciplinary projects such as this one. Grammar-based approaches are modular (i.e., errors can be attributed to a specific rule) and are easier for non-NLP users to understand. This means that biologists can actively participate in the debugging and maintenance of the overall system. Additionally, the proposed approach captures other complex language phenomena such as hedging and coreference. I will highlight how these phenomena differ in biomedical texts versus open-domain language.
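As a rough illustration of the three grammar layers described above (entities, events over entities, and nested events over other events), here is a minimal sketch in Python. It is not the system presented in the talk: the toy lexicon, the trigger words, the rule names, and the regular-expression rule format are all assumptions made for this example. Each extracted object records the name of the rule that produced it, which is what makes errors attributable to a specific rule.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Union

# Hypothetical toy lexicon and trigger words (assumptions for illustration only).
LEXICON = {"MEK": "Protein", "ERK": "Protein", "RAS": "ProteinFamily", "ATP": "SimpleChemical"}
CONTROL_TRIGGERS = {"catalyzes": "Catalysis", "inhibits": "Inhibition"}


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    rule: str  # name of the rule that produced this mention


@dataclass(frozen=True)
class Event:
    label: str
    theme: Union[Entity, "Event"]  # a nested event has another event as its theme
    controller: Optional[Entity]
    rule: str


_ENTITY = "|".join(re.escape(name) for name in sorted(LEXICON, key=len, reverse=True))
_TRIGGER = "|".join(CONTROL_TRIGGERS)
# Simple event rule: "<trigger> of <entity>".
_SIMPLE = re.compile(rf"\b(?P<trigger>phosphorylation|ubiquitination) of (?P<theme>{_ENTITY})\b")
# Nested event rule: "<entity> <control trigger> [the]" immediately before a simple event.
_CONTROL = re.compile(rf"\b(?P<controller>{_ENTITY}) (?P<trigger>{_TRIGGER}) (?:the )?$")


def make_entity(text: str) -> Entity:
    """Entity rule: a plain lexicon lookup."""
    return Entity(text, LEXICON[text], rule="entity-lexicon")


def extract(sentence: str) -> List[Event]:
    """Apply the simple-event rule, then try to wrap each match in a nested event."""
    events: List[Event] = []
    for match in _SIMPLE.finditer(sentence):
        simple = Event(
            label=match.group("trigger").capitalize(),
            theme=make_entity(match.group("theme")),
            controller=None,
            rule="simple-event",
        )
        events.append(simple)
        # Look only at the text that precedes the simple event.
        control = _CONTROL.search(sentence[: match.start()])
        if control:
            events.append(
                Event(
                    label=CONTROL_TRIGGERS[control.group("trigger")],
                    theme=simple,
                    controller=make_entity(control.group("controller")),
                    rule="nested-event",
                )
            )
    return events


if __name__ == "__main__":
    for event in extract("MEK catalyzes the phosphorylation of ERK"):
        print(event.rule, event.label)
```

Under these assumptions, the sentence "MEK catalyzes the phosphorylation of ERK" yields one simple event (a phosphorylation with theme ERK) and one nested event (a catalysis controlled by MEK whose theme is that phosphorylation). A domain expert who sees a wrong extraction can read the rule field and inspect the single pattern responsible.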
I will show that the proposed approach performs machine reading with accuracy comparable to that of human domain experts, but at much higher throughput, and, more importantly, that this automatically derived knowledge substantially improves the inference capacity of existing biological data analysis algorithms. Using this knowledge, we were able to identify a large number of previously unidentified, yet highly statistically significant, signaling modules that are altered in a mutually exclusive manner in several cancers. These modules led to novel biological hypotheses within the corresponding cancer context.
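To make the notion of mutually exclusive alterations concrete, the sketch below scores a single pair of genes with a one-sided Fisher's exact test computed from the hypergeometric distribution. This is an assumption chosen for illustration: the abstract does not name the statistical test used, and the analysis it describes concerns modules of several genes, not just pairs. The function name and the toy alteration vectors are hypothetical.

```python
from math import comb
from typing import Sequence


def mutual_exclusivity_pvalue(gene_a: Sequence[int], gene_b: Sequence[int]) -> float:
    """Probability of seeing this few (or fewer) co-altered samples by chance.

    gene_a and gene_b are 0/1 alteration flags, one per sample. Under
    independence the number of co-altered samples is hypergeometric, so a
    small lower-tail probability suggests mutual exclusivity.
    """
    if len(gene_a) != len(gene_b):
        raise ValueError("both genes must be profiled in the same samples")
    n = len(gene_a)
    n_a = sum(1 for x in gene_a if x)
    n_b = sum(1 for y in gene_b if y)
    both = sum(1 for x, y in zip(gene_a, gene_b) if x and y)
    lowest = max(0, n_a + n_b - n)  # smallest feasible overlap
    favourable = sum(
        comb(n_a, k) * comb(n - n_a, n_b - k) for k in range(lowest, both + 1)
    )
    return favourable / comb(n, n_b)


if __name__ == "__main__":
    # Toy cohort of 10 samples: the two genes are never altered together.
    a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    b = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    print(mutual_exclusivity_pvalue(a, b))
```

In this toy cohort the overlap is zero, and the only arrangement with zero overlap is 1 of the 252 equally likely ways to place gene B's five alterations, so the p-value is 1/252. Extending the idea to whole modules, and correcting for the many candidate modules tested, requires more machinery than this pairwise sketch shows.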