Data deluge


Ask a pharma researcher how well they’re able to leverage their organization’s medical imaging data, and you might hear a discouraging response. Most pharma companies hold massive amounts of clinical and medical imaging data, yet much of it isn’t ready for modern research processes and infrastructure. This imaging data is an untapped asset: it’s disorganized, difficult or impossible to query, not normalized, and in no way ready for machine learning and AI. The result is that innovation slows.

Imaging data is a rich source of information that can hold the key to many discoveries, but it is complex to work with. Pharma companies need sophisticated data management infrastructure to manage this complexity as they seek to scale up their research.

Here are five common data management problems to consider as your organization evaluates its path forward.

1. Data is siloed and disorganized

The data your organization needs for its next project could be located in any number of places: within your own internal archives, at a clinical institution or research organization, or with another external partner. Each of these locations likely has its own storage practices, naming and labeling conventions, and quality-checking processes. Moreover, these factors often vary even within a single organization, as personnel and practices change over time.

To leverage archived data or data from disparate sources, pharma researchers face significant challenges in organizing and curating it. This upfront cleansing and normalization is crucial to making the data useful for querying, analysis and reuse, not just for today’s research but for tomorrow’s as well. Organizations should implement practices that support good data hygiene and organization, recognizing that investing in them now can pay dividends for future research efforts. First and foremost, data-driven innovation depends on consistent, high-quality data.

2. Varying modalities present efficiency challenges

The biomedical data that pharma companies need for their research often takes the form of MR or CT scans, X-rays or other imaging modalities. While the data held in these imaging assets is tremendously valuable, extracting and cataloging it is a massive undertaking. Some researchers perform imaging data curation and analysis manually, which is typically error-prone, inconsistent and enormously costly.

Other research teams have learned to create automated, algorithm-driven workflows that perform some of the basic tasks necessary to prepare their data, for example converting a file from one format to another (a minimal sketch of such a step appears below). But each modality requires its own specific workflow, and creating these algorithms to curate diverse data to a common standard is time-consuming in itself.
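As one illustration, a format-conversion step can be automated in a few lines. The following is a minimal sketch, assuming the open-source SimpleITK package and an illustrative folder of DICOM slices; the paths and function name are hypothetical, and a production workflow would add validation, de-identification and logging.

```python
# Minimal sketch of one automated preparation step: converting a DICOM
# series into a single NIfTI volume. Paths are illustrative, not real data.
import SimpleITK as sitk

def dicom_series_to_nifti(dicom_dir: str, output_path: str) -> None:
    """Read every slice of a DICOM series and write one NIfTI volume."""
    reader = sitk.ImageSeriesReader()
    series_files = reader.GetGDCMSeriesFileNames(dicom_dir)
    reader.SetFileNames(series_files)
    volume = reader.Execute()              # stack slices into a 3D image
    sitk.WriteImage(volume, output_path)   # write compressed NIfTI

if __name__ == "__main__":
    dicom_series_to_nifti("studies/subject_001/ct", "curated/subject_001_ct.nii.gz")
```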

Pharma leaders should have a clear map of the modalities in use across their enterprises and create comprehensive architectures and strategies for bringing automation to the task of onboarding their increasingly diverse imaging assets onto a common platform.

3. Datasets are huge, with corresponding computational demands

Pharma companies have petabytes of medical imaging data sourced from past and current clinical trials, real-world data partners, and other sources, all accumulated over time.

Now, imagine the researcher’s task of effectively working with a database of this scale. Recall the algorithmic workflows described previously, and add in the fact that workflows are frequently pipelined together so that the output of one step becomes the input to the next. It’s easy to see how computational demands at this scale become massive, requiring cloud-scale resources and often hybrid environments that include high-performance compute clusters.
Further, the demand for processing power does not remain constant, so pharma leaders need modern tools and infrastructure that can scale elastically to fit the research need quickly and cost-effectively.
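To make the pipelining idea concrete, here is a minimal sketch of chaining processing steps so that each step’s output feeds the next. The step names and the metadata-record structure are illustrative assumptions, not any particular product’s API; at scale, each step would typically be dispatched to cloud or HPC workers rather than run in-process.

```python
# Minimal sketch of pipelined workflow steps: each step receives the
# previous step's output. Steps and fields are illustrative placeholders.
from typing import Callable, List

Step = Callable[[dict], dict]

def run_pipeline(record: dict, steps: List[Step]) -> dict:
    """Apply each processing step in order, passing the result forward."""
    for step in steps:
        record = step(record)
    return record

# Illustrative steps operating on a metadata record for one imaging session.
def convert_format(rec: dict) -> dict:
    return {**rec, "format": "nifti"}

def deidentify(rec: dict) -> dict:
    return {**rec, "phi_removed": True}

def extract_metadata(rec: dict) -> dict:
    return {**rec, "indexed": True}

if __name__ == "__main__":
    result = run_pipeline({"session": "subject_001_ct"},
                          [convert_format, deidentify, extract_metadata])
    print(result)
```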

4. Artificial intelligence and machine learning have additional data demands

Researchers often need to tap into every available data source, including data that predates AI and ML, to achieve the scale and diversity required for accurate and viable ML/AI models. If an organization hasn’t been cataloging and archiving its recent data with an eye toward ML/AI use, preparing that data is a challenge in itself; preparing older data can seem positively Herculean.

Once again, consistent curation — and automation wherever possible — are essential to the project. Furthermore, consistency in curation is vital for ML/AI, as training data must be normalized and free of factors that could lead to bias and missed insights.
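As a small example of what “normalized” can mean in practice, the sketch below z-scores the voxel intensities of a scan so that values are comparable across images. Z-scoring is an assumption chosen for illustration; real curation pipelines select normalization schemes per modality and record that choice as part of provenance.

```python
# Minimal sketch of one normalization step for ML training data:
# rescaling voxel intensities to zero mean and unit variance.
import numpy as np

def zscore_normalize(volume: np.ndarray) -> np.ndarray:
    """Return the volume rescaled to zero mean and unit variance."""
    mean = volume.mean()
    std = volume.std()
    if std == 0:
        return volume - mean  # constant image: just center it
    return (volume - mean) / std

if __name__ == "__main__":
    # Synthetic stand-in for a scan, not real data.
    fake_scan = np.random.default_rng(0).normal(loc=100, scale=20, size=(4, 4, 4))
    normalized = zscore_normalize(fake_scan)
    print(round(float(normalized.mean()), 6), round(float(normalized.std()), 6))
```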

ML/AI project leaders should also consider the additional challenge of ensuring comprehensive provenance, which is needed for reproducibility and regulatory approval. In practice, this means that access logs, versions and processing actions must all be recorded. Using a system that automates this tracking and logging is the only way to ensure accuracy and compliance.
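A provenance trail of this kind can be captured automatically as data is processed. Below is a minimal sketch that appends each action to a JSON-lines audit file; the field names and file format are assumptions for illustration, not a prescribed standard or a specific platform’s schema.

```python
# Minimal sketch of automated provenance capture: every processing action
# is appended to an audit trail with an actor, version and timestamp.
import json
from datetime import datetime, timezone

def log_provenance(audit_path: str, asset_id: str, action: str,
                   actor: str, version: str) -> dict:
    """Append one provenance entry to a JSON-lines audit file and return it."""
    entry = {
        "asset_id": asset_id,
        "action": action,      # e.g. "converted", "normalized", "accessed"
        "actor": actor,        # user or pipeline that touched the data
        "version": version,    # asset version after this action
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(audit_path, "a") as audit:
        audit.write(json.dumps(entry) + "\n")
    return entry

if __name__ == "__main__":
    # Illustrative entries, not real records.
    log_provenance("audit.jsonl", "subject_001_ct", "converted", "pipeline:dicom2nifti", "v2")
    log_provenance("audit.jsonl", "subject_001_ct", "accessed", "user:researcher_42", "v2")
```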

5. Collaboration must be compliant

Even before COVID, research organizations faced a high bar to ensure that their internal and external teams were meeting regulatory requirements when using sensitive biomedical data. Given today’s even more far-flung collaborations and an increase in remote work, pharma companies must remain vigilant to keep their information secure while still allowing researchers the access they need.

And once again, the size of datasets can pose a challenge in this area. The pace of research can grind to a halt if it depends on terabytes of data being downloaded and uploaded from collaborators’ individual networks. Centralizing work on a shared, secure and compliant platform is the most effective way to keep projects moving in this new environment.

Accelerating discovery with modern data management

These data challenges (and others) are extremely common in pharma companies. The good news is that forward-thinking researchers are finding ways to automate data capture and curation at scale, efficiently query and run computational analysis on their data, and collaborate with others, all while maintaining compliance and provenance. While these processes require upfront investment, planning and new ways of thinking, they are foundational for digital transformation and a much-accelerated pace of data-driven discovery and innovation within pharma.

Jim Olson is CEO of Flywheel, a biomedical research informatics platform leveraging the power of cloud-scale computing infrastructure to address the increasing complexity of modern computational science and machine learning. Jim is a “builder” at his core. His passion is developing teams and growing companies. Jim has over 35 years of leadership experience in technology, digital product development, business strategy, high growth companies, and healthcare at both large and startup companies, including West Publishing, now Thomson Reuters, Iconoculture, Livio Health Group and Stella/Blue Cross Blue Shield of Minnesota.