
Cancer is a public health crisis that afflicts nearly one in two people during their lifetime.
With hundreds of different types of cancer affecting more than 70 organs, cancer is also a very complex disease. However, cancer registries — databases of information about individual cancer cases recorded in the United States — may help find a ‘cure’ by providing vital statistics to doctors, researchers, and policymakers.
“Population-level cancer surveillance is critical for monitoring the effectiveness of public health initiatives aimed at preventing, detecting, and treating cancer,” noted Gina Tourassi, director of the Health Data Sciences Institute and the National Center for Computational Sciences at the Department of Energy’s Oak Ridge National Laboratory.
“Collaborating with the National Cancer Institute, we’re developing advanced artificial intelligence (AI) solutions to modernize the national cancer surveillance program by automating the time-consuming data capture effort and providing near real-time cancer reporting,” Tourassi added.

A visualization of how a multitask convolutional neural network classifies primary cancer sites. Photo courtesy: Hong-Jun Yoon/ORNL
Trends in diagnoses and treatment
Using the information in digital cancer registries, scientists can identify trends in cancer diagnoses and treatment responses, which can in turn help guide research dollars and public resources. However, because notation and language vary from report to report, understanding and interpreting cancer pathology reports is complex, and even the human cancer registrars trained to analyze these reports may not always succeed.
To better leverage cancer data for research, scientists at Oak Ridge National Laboratory have developed an artificial intelligence-based natural language processing tool to improve information extraction from textual pathology reports. The project is part of a DOE-National Cancer Institute collaboration known as the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) that is accelerating research by merging cancer data with advanced data analysis and high-performance computing.
As the Department of Energy’s largest Office of Science laboratory, Oak Ridge National Laboratory houses unique computing resources to tackle this challenge, including the world’s most powerful supercomputer for AI and a secure data environment for processing protected information such as health data.
SEER program
Through its Surveillance, Epidemiology, and End Results (SEER) Program, the National Cancer Institute (NCI) receives data from cancer registries, such as the Louisiana Tumor Registry, which includes diagnosis and pathology information for individual cases of cancerous tumors.
“Manually extracting information is costly, time-consuming, and error-prone, so we are developing an AI-based tool,” said Mohammed Alawad, research scientist in the Oak Ridge National Laboratory Computing and Computational Sciences Directorate and lead author of a paper published in the Journal of the American Medical Informatics Association on the results of the team’s AI tool.
In a first for cancer pathology reports, the team developed a multitask convolutional neural network, or CNN. This deep learning model processes language as a two-dimensional numerical dataset and ‘learns’ to perform tasks, such as identifying keywords in a body of text.
“We use a common technique called word embedding, which represents each word as a sequence of numerical values,” noted Alawad.
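The idea Alawad describes can be sketched in a few lines. The vocabulary, vector size, and values below are illustrative stand-ins, not the team’s learned embeddings: each word maps to a row of numbers, so a report becomes a two-dimensional numerical array.

```python
import numpy as np

# Toy vocabulary and embedding table (random illustrative values,
# not the trained model's embeddings).
vocab = {"carcinoma": 0, "lung": 1, "prostate": 2}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 4))  # each word -> 4 numerical values

def embed(tokens):
    """Map a token sequence to a 2-D array: one row of numbers per word."""
    return np.stack([embeddings[vocab[t]] for t in tokens])

report = ["carcinoma", "lung"]
matrix = embed(report)
print(matrix.shape)  # (2, 4): 2 words, each represented by 4 numbers
```

The resulting matrix is what a CNN can then scan with convolutional filters, just as it would scan pixels in an image.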
Semantic relationship
Words can have semantic relationships. In simple terms, this means there are associations between the meanings of individual words (semantic relationships at the word level) and between the meanings of phrases and sentences (semantic relationships at the phrase or sentence level). Words that together convey a specific meaning end up close to each other in this dimensional space as vectors (values that have magnitude and direction).
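“Close to each other as vectors” is usually measured with cosine similarity. The three-dimensional vectors below are hand-picked for illustration, not learned embeddings; they show how related terms score near 1 and unrelated terms near 0.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: near 1.0 for vectors pointing the same way."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-picked toy vectors: 'tumor' and 'carcinoma' point in similar
# directions; 'laterality' does not (illustrative values only).
tumor = np.array([0.9, 0.8, 0.1])
carcinoma = np.array([0.8, 0.9, 0.2])
laterality = np.array([0.1, -0.2, 0.9])

print(cosine(tumor, carcinoma))   # close to 1: semantically related
print(cosine(tumor, laterality))  # close to 0: largely unrelated
```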
By entering this textual data into the neural network and filtering it through network layers according to parameters that find connections within the data, researchers can develop a language processing system that understands the complex texts in the nation’s cancer registries. These parameters are then increasingly honed as more and more data is processed.
Although some single-task convolutional neural network models are already being used to comb through pathology reports, each model can extract only one of the characteristics from the range of information included in the reports.
For example, a single-task convolutional neural network may be trained to ‘extract’ just the primary cancer site, outputting the organ where the cancer was detected (e.g., lung, prostate, bladder, or others). But extracting information on the histological grade, or growth of cancer cells, would require training a separate deep learning model.
The Oak Ridge National Laboratory research team improved the efficiency of their approach by developing a network that can complete multiple tasks in roughly the same amount of time as a single-task convolutional neural network.
The team’s new neural network is able to simultaneously extract information for five characteristics: primary site (the body organ), laterality (right or left organ, if applicable), behavior, histological type (cell type), and histological grade (how quickly the cancer cells are growing or spreading).
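The shape of such a model — one shared trunk feeding five task-specific output heads — can be sketched with plain NumPy. The layer sizes, class counts, and random weights below are stand-ins for illustration, not the team’s trained network; the point is that one forward pass through the shared layers serves all five tasks.

```python
import numpy as np

rng = np.random.default_rng(1)

# Shared trunk: one set of weights reused by every task
# (hard parameter sharing). All weights here are random stand-ins.
shared_W = rng.normal(size=(4, 8))  # 4-dim word vectors -> 8 shared features

# One lightweight linear head per task; class counts are illustrative.
heads = {
    "site":       rng.normal(size=(8, 70)),
    "laterality": rng.normal(size=(8, 2)),
    "behavior":   rng.normal(size=(8, 3)),
    "histology":  rng.normal(size=(8, 50)),
    "grade":      rng.normal(size=(8, 4)),
}

def forward(doc_matrix):
    """One shared pass over the document, then one cheap head per task."""
    features = np.maximum(doc_matrix @ shared_W, 0).max(axis=0)  # pool over words
    return {name: int(np.argmax(features @ head)) for name, head in heads.items()}

doc = rng.normal(size=(12, 4))  # a 'report' of 12 words, 4-dim embeddings
preds = forward(doc)
print(preds)  # five labels from a single forward pass
```

Because the expensive shared computation runs once, adding another head costs little — which is why the multitask model runs in roughly the time of one single-task model.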
The new multitask convolutional neural network also outperformed the original single-task system on all five ‘tasks’ within the same amount of time. However, Alawad said, “It’s not so much that it’s five times as fast. It’s that it’s n-times as fast. If we had n different tasks, then it would take one-nth of the time per task.”
New architecture
The team’s key to success was the development of a convolutional neural network architecture that enables layers to share information across tasks without draining efficiency or undercutting performance.
“It’s efficiency in computing and efficiency in performance,” Alawad said.
“If we use single-task models, then we need to develop a separate model per task, whereas with multitask learning we only need to develop one model. However, developing this one model and figuring out the architecture was computationally time-consuming. We needed a supercomputer for model development,” he explained.
Supercomputer
To build an efficient multitask convolutional neural network, the researchers called on the world’s most powerful and ‘smartest’ supercomputer, the 200-petaflop Summit.
Summit’s theoretical peak speed is 200 petaflops (200,000 teraflops). In human terms, roughly 6.3 billion people would each need to make one calculation per second, every second, for an entire year to match what Summit can do in just one second.
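That comparison checks out arithmetically: 6.3 billion people, each doing one calculation per second for a year, comes to about 2 × 10¹⁷ calculations — the same as 200 petaflops sustained for one second.

```python
people = 6.3e9
seconds_per_year = 365 * 24 * 3600       # about 3.15e7 seconds
calculations = people * seconds_per_year  # about 2.0e17 calculations
summit_per_second = 200e15                # 200 petaflops = 2e17 ops/second

ratio = calculations / summit_per_second
print(ratio)  # close to 1: a year of human effort equals one Summit-second
```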
Using this powerful tool, the researchers began by developing two types of multitask convolutional neural network architectures. The first used a common machine learning method known as hard parameter sharing; the second used a method borrowed from image classification known as cross-stitch. Hard parameter sharing uses the same few parameters across all tasks, whereas cross-stitch uses more parameters split between tasks, producing outputs that must be ‘stitched’ together.
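The ‘stitching’ can be illustrated with a minimal two-task cross-stitch unit: each task’s next-layer input is a learned linear mix of both tasks’ activations. The 2×2 mixing weights and feature values below are illustrative, not learned parameters from the paper.

```python
import numpy as np

# Illustrative cross-stitch mixing matrix for two tasks: each row says
# how much of each task's features the corresponding task receives.
alpha = np.array([[0.9, 0.1],    # task A keeps mostly its own features
                  [0.2, 0.8]])   # task B borrows a little from task A

feat_a = np.array([1.0, 0.0, 2.0])  # task A activations at some layer
feat_b = np.array([0.0, 3.0, 1.0])  # task B activations at the same layer

stitched = alpha @ np.stack([feat_a, feat_b])
feat_a_next, feat_b_next = stitched  # mixed features fed to each task's next layer
print(feat_a_next)  # = 0.9*feat_a + 0.1*feat_b
```

The extra per-task parameters and mixing steps are what make cross-stitch heavier than hard parameter sharing, where every task simply reuses one shared set of weights.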
To train and test the multitask convolutional neural networks with real health data, the researchers used Oak Ridge National Laboratory’s secure data environment and over 95,000 pathology reports from the Louisiana Tumor Registry. They compared their convolutional neural networks to three other established AI models, including a single-task network.
“In addition to offering HPC and scientific computing resources, Oak Ridge National Laboratory has a place to train and store secure data, which is an extremely important aspect of this project,” Alawad noted.
During testing, the researchers found that the hard parameter sharing multitask model outperformed the four other models (including the cross-stitch multitask model) and increased efficiency by reducing computing time and energy consumption. Compared with the single-task CNN and conventional AI models, the hard parameter sharing multitask convolutional neural network completed the challenge in a fraction of the time and most accurately classified each of the five cancer characteristics.
“The next step is to launch a large-scale user study where the technology will be deployed across cancer registries to identify the most effective ways of integration in the registries’ workflows. The goal is not to replace the human but rather augment the human,” Tourassi concluded.
Reference
[1] Alawad M, Gao S, Qiu JX, Yoon HJ, Christian JB, Penberthy L, Mumphrey B, Wu XC, Coyle L, Tourassi G. Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks, Journal of the American Medical Informatics Association, Volume 27, Issue 1, January 2020, Pages 89–98, https://doi.org/10.1093/jamia/ocz153