Automating Pharmacovigilance: ML and NLP for Detecting Adverse Drug Reactions in Scientific Literature

No one likes adverse drug reactions (ADRs) – not pharmaceutical companies, not regulators, and certainly not patients.

That’s why ongoing monitoring for ADRs is a core responsibility of pharmaceutical companies around the world. This process is known as pharmacovigilance (PV): monitoring the safety and risk-benefit profile of pharmaceutical products.

However, despite this vigilance, ADRs are still a growing problem, accounting for nearly 30 percent of EU emergency room visits. Combined with other challenges such as an aging global population, more widespread chronic disease, and an increasingly complex array of pharmaceutical products on the market, it’s clear that identifying and tracking ADRs via PV is more important than ever. 

Intelligent automation of PV through machine learning and natural language processing (NLP) can improve the effectiveness of PV and ADR monitoring. But it’s first necessary to look back at more traditional PV approaches – and examine why they’ve become somewhat stale.

The problem with traditional pharmacovigilance

PV occurs throughout a pharmaceutical product’s entire life cycle, from clinical trials to postmarket surveillance. The industry’s traditional approach to PV is rooted in spontaneous reporting systems (SRSs), which contain ADR reports for regulators or drug companies submitted by healthcare professionals or consumers. 

It’s hard to overstate the importance of SRSs for pharmacovigilance. Indeed, the World Health Organization says a national SRS is one of five minimum requirements for an effective national PV strategy (other requirements include a national ADR database and a national ADR advisory committee).

This SRS-centric approach, however, has taken plenty of criticism in recent years. Spontaneous reports can be incomplete or incorrect, biased (leading to underreported ADRs), and slow to materialize. Additionally, an SRS is by nature a passive signal detection approach: it relies on healthcare professionals and consumers to report suspected ADRs, so only a small percentage of adverse events are ever reported.

SRSs also don’t consider large and growing pools of other valuable information around ADRs, such as medical literature and data from social media and blogs.

Other emerging factors have complicated the industry’s traditional approach to PV, including:

  1. Implementing SRS in emerging markets: Implementing an effective SRS strategy can be difficult, especially in low- and medium-income regions with many remote areas, poor telecommunications infrastructure, and patchwork healthcare systems.
  2. Legislation: Increased scrutiny has led to relatively new legislation surrounding PV practices in jurisdictions such as Europe, Canada, and the U.S. This legislation has increasingly made screening scientific literature and other sources (along with SRSs) a requirement for pharmaceutical companies.
  3. Large (and increasing) amounts of unstructured data: It’s one thing to commit to analyzing medical literature, but another to actually do it, especially using a manual approach. Medical literature is growing exponentially, doubling in volume every few months or so. That’s a massive amount of potential ADR signals but also a considerable challenge for pharmacovigilance officers tasked with wading through all that information.

In the EU, pharma companies must revisit this ever-growing mountain of data at least once per week “to maintain awareness of possible publications through a systematic literature review of widely used reference databases (e.g., Medline, Pubmed, or Embase).” 

Along with mainstream scientific literature in major databases, manufacturers are also expected to monitor local publications in regions that sell their products.
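
As a rough illustration of what a recurring weekly screen of one such database might look like, the sketch below builds a PubMed search URL restricted to records added in the last seven days. It assumes NCBI's public E-utilities ESearch endpoint and its `db`, `term`, `reldate`, `datetype`, `retmode`, and `retmax` parameters; the function name and the example drug and search terms are hypothetical, and a real screening strategy would be defined by a company's own PV procedures.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint for querying PubMed.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_weekly_search_url(drug_name, days=7, max_results=200):
    """Build (but do not send) a PubMed search URL for recent ADR literature.

    The search term is a hypothetical example, not a validated PV search strategy.
    """
    params = {
        "db": "pubmed",
        "term": f"{drug_name} AND (adverse drug reaction OR adverse event)",
        "reldate": days,       # only records from the last `days` days
        "datetype": "edat",    # filter on the date the record entered PubMed
        "retmode": "json",
        "retmax": max_results,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical usage: the drug name is chosen purely for illustration.
weekly_url = build_weekly_search_url("warfarin")
print(weekly_url)
```

Scheduling this query once per week (and repeating it for Embase and local-language sources through their own interfaces) would be one way to approach the review cadence described above.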

Automation through ML and NLP can improve pharmacovigilance and medical literature review

Rapidly growing volumes of medical data, combined with expanding regulatory requirements, have put greater pressure on pharma companies to find more efficient methods of conducting PV reviews. “It is impossible for researchers, scientists, and physicians to read and process the large body of scientific articles and remain abreast of the foremost information,” explain Tafti et al. “Therefore, there is a pressing need to develop intelligent computational methods, particularly big data analytics solutions, to efficiently process this wealth of data.”

One effective solution for analyzing vast tracts of medical data – including scientific literature and SRSs, along with clinical trials, epidemiology databases, social media signals, and other sources – is the deployment of machine learning (ML) and natural language processing (NLP) models.

Real-world scientific studies have confirmed the effectiveness of this approach. Tafti et al. (2017) developed a scalable text mining solution (underpinned by Apache Spark, ML and NLP models, and an Elasticsearch NoSQL distributed database) to analyze biomedical articles and health-related social media posts.

The project featured three tiers of automation:

  1. Data collection: A web crawler was deployed to collect data as XML files, which were then converted to plain text files with associated metadata (publication, author, etc.) and stored in the NoSQL database.
  2. NLP: Natural language processing was used in two different applications:
    a. Selecting relevant documents: As the system accumulated data, it established criteria to ensure high-quality results.
    b. Text processing: Following text normalization, documents were converted to a set of individual sentences to ensure model accuracy. Humans then annotated random groups of sentences to refine the model further.
  3. ML: A predictive model then classified each document into one of two states: containing an ADR or not containing an ADR.
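
The NLP and ML tiers above can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: it uses plain Python instead of Apache Spark and Elasticsearch, substitutes a tiny naive Bayes classifier for whatever model the study used, and trains on six made-up annotated sentences. All function names, labels, and training data here are hypothetical.

```python
import math
import re
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (a simple stand-in for text normalization)."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence):
    """Keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence)


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing."""

    def fit(self, sentences, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            n_words = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / total)
            for tok in tokenize(sentence):
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log(
                        (self.word_counts[c][tok] + 1) / (n_words + len(self.vocab))
                    )
            scores[c] = score
        return max(scores, key=scores.get)


# Hypothetical human-annotated sentences (already normalized).
TRAIN = [
    ("the patient developed a severe rash after starting the drug", "adr"),
    ("nausea and vomiting occurred following treatment", "adr"),
    ("the drug caused liver injury in two patients", "adr"),
    ("the study enrolled adults from three hospitals", "no_adr"),
    ("the drug was administered once daily for two weeks", "no_adr"),
    ("results were analyzed using standard statistical methods", "no_adr"),
]

model = NaiveBayes().fit([s for s, _ in TRAIN], [y for _, y in TRAIN])


def document_has_adr(text, model):
    """Flag a document if any of its sentences is classified as 'adr'."""
    sentences = split_sentences(normalize(text))
    return any(model.predict(s) == "adr" for s in sentences)


doc = "The trial enrolled 40 adults. One patient developed a rash after treatment."
print(document_has_adr(doc, model))
```

Classifying at the sentence level and then rolling up to the document is one plausible reading of the steps above; a production system would need far more annotated data and a proper evaluation set.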

After analyzing hundreds of articles and posts from sources such as PubMed Central, MedHelp, and WebMD, the solution achieved an accuracy rate of 92.7 percent, with 93.6 percent precision and 93 percent recall. 
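
For readers less familiar with these three metrics, the sketch below shows how each is derived from a binary classifier's confusion counts. The function name and the counts in the usage example are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts.

    tp: ADR items correctly flagged      fp: non-ADR items wrongly flagged
    tn: non-ADR items correctly passed   fn: ADR items that were missed
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of everything flagged as an ADR, how much really was one?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all true ADR items, how many did the model catch?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)
print(metrics)
```

In a PV setting recall tends to matter most, since a missed ADR (a false negative) is usually costlier than a false alarm that a reviewer can dismiss.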

“This work not only detected and classified ADE sentences from big data biomedical literature but also scientifically visualized ADE interactions,” according to the authors.

Improve your medical literature review for pharmacovigilance with CapeStart’s ML and NLP teams

CapeStart’s machine learning and natural language processing experts have worked with dozens of life sciences and pharma companies to improve and streamline their PV processes. From data collection and annotation to the development and deployment of highly sophisticated ML and NLP models designed to cut through mountains of medical literature, CapeStart can improve the speed, cost, and accuracy of your literature reviews for pharmacovigilance.

Contact us today to schedule a brief consultation with one of our experts.