How fast can you read? Probably not fast enough to keep up with scientific papers published at a rate of almost two a minute, or to catch up with the roughly 50 million papers waiting to be perused in public databases worldwide. All is not lost, though: a team of scientists and analysts has created a data mining tool that could help researchers mine medical literature and use the information to formulate hypotheses.
Being able to keep up with the latest breakthroughs is critical for scientific research, but difficulties with access could slow scientific and medical progress. While it is relatively straightforward to search the scientific literature, a search can simply generate a mass of information that is difficult to analyse and draw conclusions from. As Olivier Lichtarge, director of the Center of Computational and Integrative Biomedical Research at Baylor, explains, scientists formulate hypotheses based on what they read and know, but because there is so little that they can actually read in the time available, hypotheses can be biased. However, computers and data mining could provide an answer.
“A computer certainly may not reason as well as a scientist but the little it can, logically and objectively, may contribute greatly when applied to our entire body of knowledge,” says Lichtarge.
The team of computational biologists at Baylor College of Medicine, along with people from the University of Texas MD Anderson Cancer Center and IBM Research, created the Knowledge Integration Toolkit (KnIT), a prototype data mining system. Using a combination of text mining, entity detection, neighbour-text feature analysis and graph-based diffusion of information, KnIT mines the data and presents it in a network that can be queried, and then generates new and testable hypotheses that can be used to help direct laboratory studies.
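KnIT's implementation has not been released, but the core idea of mining text into a queryable network can be sketched in a few lines. The following is a minimal, illustrative sketch only: the abstracts are invented, the entity list is hard-coded (a real system like KnIT uses trained entity detection and much richer features), and direct co-occurrence stands in for its graph-based diffusion.

```python
from collections import defaultdict
from itertools import combinations

# Toy corpus of invented abstract snippets (not real abstracts).
abstracts = [
    "CHK2 phosphorylates p53 and stabilises the tumour suppressor.",
    "ATM signalling activates CHK2 in response to DNA damage.",
    "CK1 phosphorylates p53 at several N-terminal sites.",
]

# Entities we can detect; a real system would use entity recognition.
entities = {"p53", "CHK2", "ATM", "CK1"}

def cooccurrence_graph(texts, entities):
    """Link two entities whenever they appear in the same abstract."""
    graph = defaultdict(set)
    for text in texts:
        found = {e for e in entities if e in text}
        for a, b in combinations(sorted(found), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

graph = cooccurrence_graph(abstracts, entities)
print(sorted(graph["p53"]))  # entities directly linked to p53

# Two-hop neighbours are indirect candidates - the kind of link a
# hypothesis-generation step would rank and propose for lab testing.
candidates = set().union(*(graph[n] for n in graph["p53"])) - graph["p53"] - {"p53"}
print(candidates)
```

Here ATM surfaces as a candidate because it is linked to p53 only indirectly, through CHK2; proposing and ranking such indirect links is the hypothesis-generation step the article describes.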
In a case study to assess the principles of the KnIT technology, the team used it to mine published data on the tumour suppressor protein p53, to identify new protein kinases that turn the protein on by phosphorylation.
“On average, a scientist might read between one and five research papers on a good day,” said Lichtarge. “But, to put this in perspective with p53, there are over 70,000 papers published on this protein. Even if a scientist reads five papers a day, it could take nearly 38 years to completely understand all of the research already available today on this protein.”
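The arithmetic behind Lichtarge's estimate is straightforward: 70,000 papers at five per day is 14,000 reading days, or a little over 38 years.

```python
papers = 70_000
per_day = 5

days = papers / per_day      # 14,000 reading days
years = days / 365           # roughly 38.4 years
print(f"{years:.1f} years")
```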
Before 2003, only ten phosphorylating protein kinases that target p53 had been discovered. Mining the literature before this date, KnIT correctly identified the ten known kinases and predicted seven of the nine that would be found in the following decade. The results were presented at KDD ’14 (the 20th ACM SIGKDD international conference on knowledge discovery and data mining).
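The retrospective test amounts to a simple recall measurement: of the kinases discovered after the cutoff date, what fraction did the system flag in advance? Using the figures reported above:

```python
known_recovered = 10   # pre-2003 kinases KnIT re-identified
future_total = 9       # kinases discovered in the following decade
future_predicted = 7   # of those, predicted in advance by KnIT

recall = future_predicted / future_total
print(f"prospective recall: {recall:.0%}")
```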
“This study showed that … we can, in fact, suggest new relationships and new functions associated with p53, which can later be directly validated in the laboratory,” said Lichtarge.
The long-term hope for the technology is to be able to extract data systematically from the totality of the public medical literature. However, this will need further technological advances to be able to read text, extract facts from every sentence and integrate the information into a network that describes the relationship between all of the objects and entities discussed in the literature.
“This first study is promising, because it suggests a proof of principle for a small step towards this type of knowledge discovery. With more research, we hope to get closer to clinical and therapeutic applications,” says Lichtarge.
While GenoKey was not involved in this project, its combinatorial data mining algorithms tackle a similar challenge: running on standard computer hardware, they can cut through complex datasets in minutes, finding patterns and correlations quickly and accurately.