Natural language processing as a research aid

From the Divisions of Clinical Research and Public Health Sciences

Large population-based health registries, such as the National Cancer Institute’s SEER program, are extensive databases that researchers can use to understand the epidemiology of, and mortality from, specific diseases. Since its development in the early 1970s, the SEER registry has proven to be a tremendous research tool, providing insights into cancer statistics and patient demographics for the US population. Much of the information contained within the SEER database can be readily extracted for use in research studies; for other types of information, however, extraction tools have yet to be developed. One example is genetic test results for mutations in the epidermal growth factor receptor (EGFR) and anaplastic lymphoma kinase (ALK) genes. Mutations in these genes influence treatment recommendations for patients diagnosed with advanced-stage non-small-cell lung cancer (NSCLC). Thus, the ability to efficiently extract test results for these mutations from the SEER database could enhance research on NSCLC by providing valuable information about molecular subgroups of these patients.

Natural language processing (NLP) is a type of artificial intelligence (AI) that could be developed to aid in extracting EGFR and ALK results from the electronic pathology reports available within the SEER registries. The usefulness of NLP as an extraction tool has been demonstrated in other cancer research studies, but none have used it to ascertain mutation test results in lung cancer patients. In a new study recently published in the journal JCO Clinical Cancer Informatics, researchers from the Clinical Research and Public Health Sciences Divisions developed a new NLP tool to do exactly this and tested its validity. “This is the first study that assessed NLP as a method to track important mutations in pathology reports of lung cancer patients included in SEER registries,” said Dr. Bernardo Goulart, lead author of the study.

Emily Silgard, Fred Hutch NLP engineer and co-author of the study, indicated that “part of what made this work possible was that NCI is very invested in developing more automated capability at SEER sites around the country.” The authors developed a hybrid rule-based and machine learning method to ascertain EGFR and ALK results in electronic pathology reports (see figure). The first question the algorithm addressed was whether EGFR or ALK results were available in a report. For reports with results, the algorithm then determined whether the result was positive or negative and which laboratory technique had been used to assess the mutation.
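The paper itself details the authors' hybrid system; purely as an illustration of the sequence of questions described above, a simplified rule-based pass over report text might look something like the Python sketch below, where every pattern, label, and function name is an assumption for demonstration rather than the study's actual rules.

```python
import re

# Illustrative, simplified rule-based pass over pathology report text.
# All patterns and labels here are assumptions, not the study's actual rules.
EGFR_MENTION = re.compile(r"\bEGFR\b", re.IGNORECASE)
NEGATIVE = re.compile(r"\b(no mutations? detected|negative for|wild[- ]type)\b", re.IGNORECASE)
POSITIVE = re.compile(r"\b(mutation detected|positive for|exon 19 deletion|L858R)\b", re.IGNORECASE)
TECHNIQUE = re.compile(r"\b(PCR|sequencing|NGS|FISH|IHC)\b", re.IGNORECASE)

def classify_report(text: str) -> dict:
    """Answer the three questions in order: reported? result? technique?"""
    if not EGFR_MENTION.search(text):
        return {"reported": False, "result": None, "technique": None}
    # Check negated phrasings before positive ones so "no mutation detected"
    # is not misread as a positive result.
    if NEGATIVE.search(text):
        result = "negative"
    elif POSITIVE.search(text):
        result = "positive"
    else:
        result = "unknown"
    technique = TECHNIQUE.search(text)
    return {"reported": True,
            "result": result,
            "technique": technique.group(0).upper() if technique else None}

print(classify_report("EGFR testing by PCR: exon 19 deletion detected (positive)."))
# -> {'reported': True, 'result': 'positive', 'technique': 'PCR'}
```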

Figure: Schematic of the study design, showing the sequence of steps used to develop and validate the natural language processing program for ascertaining EGFR and ALK mutations in non-small-cell lung cancer patients. Image from Dr. Bernardo Goulart

To establish a gold standard against which to compare the NLP algorithms, two of the study’s authors, both oncologists, manually classified test results from over 800 pathology reports to serve as a training set for the NLP algorithm. The authors then assessed the program against this training set and continued to modify it until it reached a score of 1.0 for both sensitivity and specificity. From there, the authors conducted an internal validation of the NLP algorithms by applying them to a larger set of 3,400 electronic pathology reports in the Seattle-Puget Sound SEER registry. Overall, the NLP algorithms performed very well in this internal validation, with sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) between 0.95 and 1.00 for both EGFR and ALK on the questions of whether results were reported and what the test result was. Sensitivity and PPV were also high for identifying which technique was used to assess the mutation, although specificity and NPV for this question were lower for EGFR (0.41 and 0.64, respectively).
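For readers less familiar with these measures, sensitivity, specificity, PPV, and NPV can all be derived from a simple confusion matrix of NLP output versus the gold-standard labels. The short Python sketch below illustrates the calculations with toy data, not the study's actual counts.

```python
def binary_metrics(gold, pred):
    """Sensitivity, specificity, PPV, and NPV for paired True/False labels."""
    tp = sum(g and p for g, p in zip(gold, pred))
    tn = sum(not g and not p for g, p in zip(gold, pred))
    fp = sum(not g and p for g, p in zip(gold, pred))
    fn = sum(g and not p for g, p in zip(gold, pred))
    return {
        "sensitivity": tp / (tp + fn),  # true positives found among all gold positives
        "specificity": tn / (tn + fp),  # true negatives found among all gold negatives
        "ppv": tp / (tp + fp),          # how often a positive call is correct
        "npv": tn / (tn + fn),          # how often a negative call is correct
    }

# Toy labels only; the study compared NLP output against oncologist review.
gold = [True, True, False, False, True, False]
pred = [True, False, False, False, True, False]
print(binary_metrics(gold, pred))
# -> sensitivity 0.67, specificity 1.0, ppv 1.0, npv 0.75
```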

To determine the external validity of the NLP algorithms, that is, how well they performed in a separate SEER registry not used in the program's development, the authors applied the algorithms to 1,041 electronic reports from the Kentucky SEER registry. These external reports were also scored by the same oncologists for comparison with the NLP results. In contrast to the high performance in the internal validation, the external validation revealed gaps in the NLP algorithms. Performance was lowest for the EGFR tests, with the algorithms frequently misclassifying answers to the questions about test report availability, test results (whether the test was positive or negative), and the technique used. Overall, the program performed better for ALK, with the exception of specificity and PPV in ascertaining mutation test results, both of which received low scores.

Goulart highlighted the major contributions of this study: “…we demonstrated the feasibility of using AI methods as tools to identify important genomic alterations in lung tumors from patients included in cancer registries by demonstrating high internal validity of our NLP algorithms” and the ability to “use the NLP tool in the Fred Hutch’s Cancer Surveillance System (CSS) (an NCI-funded SEER registry) to identify lung cancer patients whose cancers are positive for EGFR and ALK mutations. The ability to identify these patients in our registry allows us to assess the treatments they receive in community practices, and whether these patients have access to costly but effective targeted therapies.”

The authors noted that adapting NLP programs for use with clinical data can present unique challenges related to the essential security measures that accompany the conduct of clinical research. These measures can limit data sharing across institutions and, in turn, improvements in algorithm performance. “This project was pretty special in that regard; we had the opportunity to validate externally with another SEER site. That validation raised some important issues about the algorithm we used and the overfitting on our internal dataset. However, it also brought to light important differences between SEER sites and especially about the prevalence of EGFR and ALK testing,” said Silgard.

The authors already have new studies underway to follow up on this work and the limitations identified in the external validation. “A more robust solution would involve the use of deep learning and distributional semantics (word embeddings), which essentially allow us to see the similarity in words based merely on the context they appear in, even if the algorithm has never seen that specific word before,” said Silgard. Indeed, Goulart indicated that as a next step they plan to use “more robust AI methods to enhance their external validity, or, in other words, the ability of the AI tool to correctly identify the EGFR and ALK results in SEER registries from other regions of the US.” The authors are also enthusiastic about applying the NLP tool in clinical studies: “The other very important direction is to use the current NLP tool to facilitate studies of treatment patterns in EGFR and ALK positive patients treated in Washington state. Such studies are already ongoing, and we will present the initial results at the ASCO Quality Care Symposium in San Diego in September,” said Goulart.
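As a rough illustration of the distributional-semantics idea Silgard describes, word embeddings place words that occur in similar contexts near each other in a vector space, so similarity can be measured even for terms the algorithm never saw verbatim. The toy Python below uses hand-set vectors purely for demonstration; it is not the authors' planned approach.

```python
import numpy as np

# Hand-set toy vectors standing in for learned word embeddings; real embeddings
# would be trained on large text corpora (e.g., word2vec), not written by hand.
embeddings = {
    "mutation":   np.array([0.90, 0.10, 0.30]),
    "alteration": np.array([0.85, 0.15, 0.35]),  # appears in similar contexts
    "carcinoma":  np.array([0.10, 0.90, 0.20]),
}

def cosine(u, v):
    """Cosine similarity: near 1.0 for similar directions, lower for unrelated ones."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words used in similar contexts end up with similar vectors, so a model can
# generalize to terms it never saw verbatim in the training reports.
print(cosine(embeddings["mutation"], embeddings["alteration"]))  # high (~1.0)
print(cosine(embeddings["mutation"], embeddings["carcinoma"]))   # lower
```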

This work was supported by the SEER-NCI grant contract HHSN261201300012I.

Fred Hutch/UW Cancer Consortium members Drs. Bernardo Goulart, Christina Baik, Aasthaa Bansal, Scott Ramsey, and Stephen Schwartz contributed to this research.

Goulart BHL, Silgard ET, Baik CS, Bansal A, Sun Q, Durbin EB, Hands I, Shah D, Arnold SM, Ramsey SD, Kavuluru R, Schwartz SM. 2019. Validity of natural language processing for ascertainment of EGFR and ALK test results in SEER cases of stage IV non-small-cell lung cancer. JCO Clinical Cancer Informatics. doi: 10.1200/CCI.18.00098.