Big Data in Cancer Research and Precision Medicine

December 2017, Vol 8, No 5

This special feature is supported by funding from Janssen Pharmaceutical Companies of Johnson & Johnson.

How is big data influencing cancer research and control, and what does it look like in real life? David G. Anderson, PhD, Director, Planning & Development, BDC Advisors, Miami, FL, discussed the implications of big data at the 2017 Association for Value-Based Cancer Care Summit.

The original definition of “big data” is data with “high volume, high velocity, and high variability,” said Dr Anderson. Examples of big data include genetic information, physiological sensor data (eg, cardiac monitoring), and unstructured clinical data (eg, physician’s notes and clinical images). Increasingly, big data is perceived as “a large data set containing some structured and some unstructured information,” he said.

As with Google, big data involves storing data and using algorithms to search for patterns. “We have very fast search algorithms that now can search for all kinds of things and make connections. The ability of big data to create associations between phenomena, outputs, inputs, etc, is unsurpassed,” Dr Anderson said.

Big databases are used for countless analytical tasks, such as structuring unstructured data (mammograms, colonoscopies, free text), standardizing data from different sources, populating disease registries, developing predictive models to risk-stratify patients, measuring performance, developing benchmarks, improving risk-adjustment methodologies, and tracking longitudinal outcomes.

Big Data in Clinical Research

Structured and unstructured big data have benefits and limitations. Structured data are used in statistical analysis and research and are responsible for creating “gold standards” for testing the efficacy and effectiveness of interventions. But randomized controlled clinical trials that produce these data sets are expensive, their results are limited to a few variables of interest, and they are highly dependent on “clean data,” because missing data reduce statistical power.

“There are some real limitations that we have in the typical traditional structured data sets that we’re using for research,” Dr Anderson acknowledged.

“Unstructured data get away from some of those problems,” he noted. They are “cheap” to collect, can be stored inexpensively and searched rapidly, produce associations that generate useful hypotheses, are useful for assessing the results of multilevel interventions, and do not have to be “clean.”

But unstructured big data also has limitations—it is often incomplete, may be ambiguous (different metrics are used for the same phenomena), may not capture important variables (eg, outcomes), and must be converted to structured data to be used in statistical analysis and predictive modeling.
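The conversion step described above can be illustrated with a minimal Python sketch. It assumes a hypothetical case in which tumor size is recorded in free-text notes using different units, and it shows one way such notes might be turned into a single structured field. The pattern, the function name `tumor_size_mm`, and the sample notes are all invented for illustration; they do not come from Dr Anderson's presentation or from any registry's actual tooling.

```python
import re
from typing import Optional

# Hypothetical pattern: a number followed by "mm" or "cm" (illustrative only).
SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|cm)\b", re.IGNORECASE)


def tumor_size_mm(note: str) -> Optional[float]:
    """Return the first size mentioned in a note, in millimeters, or None."""
    match = SIZE_PATTERN.search(note)
    if match is None:
        # Incomplete note: the variable of interest was never recorded.
        return None
    value = float(match.group(1))
    unit = match.group(2).lower()
    # Resolve the ambiguity of mixed units by standardizing on millimeters.
    return value * 10 if unit == "cm" else value


# Invented sample notes, not real patient data.
notes = [
    "Mass in left breast measuring 2.5 cm, biopsy recommended.",
    "Lesion approx 18mm on imaging.",
    "Patient reports fatigue; no measurements recorded.",
]

# One structured row per note, ready for statistical analysis.
structured = [
    {"note_id": i, "tumor_size_mm": tumor_size_mm(text)}
    for i, text in enumerate(notes)
]

for row in structured:
    print(row)
```

The third note yields a missing value, which reflects the incompleteness problem noted above: converting unstructured data to a structured form exposes gaps but does not fill them.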

Precision Medicine

In research, unstructured big data is most useful in the early, exploratory phase, in which the intervention (cause-and-effect) model is being developed. As the research evolves, structured big data becomes necessary to develop prototype studies and randomized clinical trials to test the intervention under controlled and realistic conditions.

Big data is very informative in the area of precision medicine. Precision medicine is supercharging the role of big data in cancer, because genomic and proteomic research is strongly influenced by patient-level genomic characteristics, and such genomic data for populations constitute big data, suggested Dr Anderson.

Big data analytic techniques can be useful in developing population-based cause-and-effect models from patient-level genomic data. Several companies are leading the way in using big data in cancer, including Health Catalyst, Genomic Health, and the American Society of Clinical Oncology’s CancerLinQ.

The Louisiana Tumor Registry

The Louisiana Tumor Registry (LTR) is one of the oldest tumor registries in the country. It originated in 1947 at Charity Hospital in New Orleans, went statewide in 1988, and is now funded by the National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) program. LTR has been the recipient of many awards, including the National Cancer Institute’s first-place award for data quality for 8 consecutive years.

“It’s very eclectic,” said Dr Anderson. LTR obtains data from different sources and compiles and publishes the data in the form of monographs and academic articles. The Cancer in Louisiana monograph, issued annually since 1987, contains detailed statistics on the state’s cancer incidence and mortality, and “it’s one of the more impressive [monographs] that I’ve seen,” he said.

In the past 3 years alone, LTR has participated in more than 50 special studies, including 9 SEER Rapid Response Surveillance Studies, and has contributed data to more than 500 published articles, including those appearing in the New England Journal of Medicine, JAMA, and Lancet.

“It’s interesting that Louisiana, which is not a wealthy state, made this commitment so early on and now keeps it up very effectively,” said Dr Anderson.

LTR evolved from a registry that was dependent on paper abstracts, separate databases, and few data items to one in which all Louisiana Commission on Cancer–accredited hospitals participate. LTR now effectively collects data across the whole cancer control continuum, including prevention, detection, diagnosis, treatment, and survivorship (Figure).


LTR uses electronic medical records, e-abstraction tools, various data elements and linkages (including linkages with providers), and other technological advances. In the future, LTR will aim for real-time e-pathology and expanded e-radiology reporting, in addition to improvements in the timeliness, completeness, and quality of data, and the increased use of the data in cancer control and cancer research.

But even after 30-plus years, LTR still has significant data gaps, acknowledged Dr Anderson. A large amount of data still needs to be brought in; data from physicians’ offices, for example, are by and large not part of the database, he added.

For various reasons, LTR has been unable to access the Louisiana Health Information Exchange; therefore, it is developing its own health information exchange from physician offices. It also lacks access to commercial insurance data (although it does have access to Medicare data), its hospital discharge data are incomplete, and it has faced obstacles to obtaining radiology images. These gaps need to be filled, and LTR still needs to work through the coding schemes that will make data from specimens and images usable “in a statistical way,” said Dr Anderson.
