Practical Applications of Machine Learning for Clinical Trials

In this blog post, you’ll learn about the business cases for machine learning that offer a clearer path to inspection readiness, as well as greater efficiency, timeliness, and quality in the TMF. You’ll understand the challenges that can be solved by applying machine learning to your documentation processes. Finally, you’ll see use cases that demonstrate how you can optimize clinical documentation and TMF automation with machine learning technologies.

Machine learning technologies have often been touted as the magic that will optimize the way we live and work for years to come.

In clinical research, buzz about machine learning surrounds questions about how we can use it to improve quality, efficiency, and perhaps the success of the study itself.

You may be looking at applying machine learning to your clinical trial strategies and operations as early as next year—but is all the talk about machine learning for clinical trials for real, or is it just hype?

Business Cases for Machine Learning in the TMF

There are several business cases where machine learning fits well into TMF documentation processes.

Document Indexing

Document indexing is the primary use case that comes to mind when people think of machine learning for the TMF. Instead of using manual classification methods, you can employ an algorithm to automatically detect a document’s TMF number, level, zone, section, artifact, sub-artifact, site, contacts, and more.

Metadata Extraction

Machine learning can capture specific items such as the site, country, contact, and document date. You can run many other checks against this data for accuracy.

PHI Redaction

You can also employ machine learning for PHI redaction, i.e., the removal of protected health information (PHI) and personally identifiable information (PII) from site documents. Machine learning models can be trained to recognize information that should be flagged and redacted.

Inspection Readiness

With machine learning technologies, you may be able to conduct a comprehensive assessment of eTMF completeness and also pinpoint issues, identify anomalies, and get a truly accurate measure of your eTMF health.


Correspondence

Correspondence is another area where machine learning technologies like natural language processing can detect which information is relevant and which is not.

TMF Design

Lastly, machine learning could help you get a better look at your TMF configuration and protocol to determine what’s needed in your TMF based on the study type.

Technical Approaches to Machine Learning

Managing clinical trial documents comes with many challenges. Handwritten logs can be difficult for current technology to decipher accurately. Protocol approvals can be filed under multiple TMF areas, which might be tough to teach a machine. There are also over 200 ways to classify documents. These are just some of the challenges we could answer with machine learning.

Several technical approaches to machine learning can address these issues. Statistical classification is one useful approach that has been used for more than 20 years. It does not depend on recent deep neural network (DNN) technology, yet it still works well.
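As a rough illustration of statistical classification, the sketch below trains a tiny naive Bayes classifier on document text and reports a confidence value with each prediction. The training snippets, the artifact labels, and the function names are all invented for this example; a real system would train on a large set of verified TMF documents and would likely use an established library.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def train(samples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return (best_label, probability) using Laplace-smoothed naive Bayes."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1) / denom)
        log_scores[label] = score
    # Turn log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    exp_scores = {lbl: math.exp(s - top) for lbl, s in log_scores.items()}
    norm = sum(exp_scores.values())
    best = max(exp_scores, key=exp_scores.get)
    return best, exp_scores[best] / norm

# Hypothetical training data: a few words per document, already labeled.
SAMPLES = [
    ("statement of investigator form fda 1572", "Form FDA 1572"),
    ("investigator commitments form fda 1572 signature", "Form FDA 1572"),
    ("curriculum vitae education employment history", "Curriculum Vitae"),
    ("curriculum vitae publications education", "Curriculum Vitae"),
]

model = train(SAMPLES)
label, confidence = classify("Signed Form FDA 1572 for the investigator", model)
print(label, round(confidence, 3))
```

The confidence value is what a user-defined threshold would later be compared against when deciding whether a person needs to look at the document.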

DNNs are definitely where most of the innovation is happening. These technologies are behind high-profile advancements like facial recognition and self-driving cars. For clinical trials, DNNs are helpful for document-centric tools, optical character recognition (OCR), and handwriting recognition.

Predictive models determine the change in confidence score based on action and data input, which can be useful if we’re trying to give the computer a way to make a yes or no decision. Health care is increasingly using predictive models with big data linked to vital signs to predict the incidence of readmission, heart attack, stroke, or sepsis. This approach can get it right more than 90 percent of the time.

Finally, algorithms, or programmed results, are where decisions are baked in and not learned. If we can combine machine learning models with algorithms, we can harness a lot of power for the TMF. We could configure quality checks based on document classification and metadata extraction, look for anomalies, and check for missing essential documents.

Machine learning can give us a whole workbench of tools to assist our efforts to maintain a healthy, up-to-date TMF. Now, let’s look at some real-life use cases for these tools.

TMF Use Cases


Use Case 1: Document Indexing

Near-duplicate detection (NDD) is an algorithmic and statistical approach to determining document resemblance. Although NDD is not technically considered machine learning, you can leverage machine learning to make NDD much more effective. This approach makes it possible to create a sketch of a document based on its structure and content. Using that sketch, you can compare many documents, stratify information, check for similarities, and find common formats.

For example, you can use NDD to confirm that one 1572 you found is pretty much the same as a 1572 you found somewhere else, because the documents look the same.
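One simple way to picture a document sketch is as a set of overlapping word windows ("shingles"), with resemblance measured as the overlap between two sets (Jaccard similarity). The snippet below is a minimal sketch under that assumption, not a description of any particular product; the sample texts and function names are made up. Production systems often compress shingle sets with techniques such as MinHash so that many documents can be compared quickly.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(text_a, text_b, k=3):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)

# Hypothetical OCR text from two 1572 forms and one unrelated document.
form_site_a = ("Form FDA 1572 Statement of Investigator "
               "name and address of investigator Jane Doe")
form_site_b = ("Form FDA 1572 Statement of Investigator "
               "name and address of investigator John Roe")
cv_text = "Curriculum vitae education employment history and publications"

print(resemblance(form_site_a, form_site_b))  # high: same form, different names
print(resemblance(form_site_a, cv_text))      # no shared shingles
```

Documents whose resemblance exceeds a chosen cutoff can be grouped together, so that one human decision can cover the whole group.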

In this approach, TMF documents pass through several DNNs that handle image correction, language detection, and OCR. NDD is then used to build a “fingerprint” of the document and calculate a confidence level. A human then reviews the document to determine whether the machine classified it correctly.

Here, the strata of documents that look similar are classified using human intelligence to train the model.

For example, as each document is classified and verified by a clinical document specialist, the model is further trained to recognize the way a certain document looks. All other documents in the same stratum automatically go in the same classification.

In short: The machine classifies the document and determines the level of confidence, users set the threshold for confidence scores, and feedback goes back into the training model to use for the next classification.
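The loop summarized above might be sketched as follows. The threshold value, the routing names, and the feedback record layout are assumptions made for illustration; the only parts taken from the description are that the machine supplies a label with a confidence score, users choose the threshold, and verified labels are kept for later training.

```python
from dataclasses import dataclass, field

@dataclass
class Prediction:
    document_id: str
    label: str
    confidence: float  # between 0.0 and 1.0

@dataclass
class ReviewLoop:
    threshold: float = 0.90  # hypothetical user-set confidence threshold
    training_feedback: list = field(default_factory=list)

    def route(self, prediction):
        """Accept confident predictions for a quality check; send the rest to a person."""
        if prediction.confidence >= self.threshold:
            return "quality_check"
        return "manual_review"

    def record_review(self, prediction, verified_label):
        """Store the specialist's verified label for the next training run."""
        self.training_feedback.append(
            {
                "document_id": prediction.document_id,
                "predicted": prediction.label,
                "verified": verified_label,
                "was_correct": prediction.label == verified_label,
            }
        )

loop = ReviewLoop(threshold=0.90)
confident = Prediction("doc-001", "Form FDA 1572", 0.97)
uncertain = Prediction("doc-002", "Curriculum Vitae", 0.62)

print(loop.route(confident))
print(loop.route(uncertain))

# The specialist corrects the uncertain prediction; the correction is kept.
loop.record_review(uncertain, "Medical License")
print(loop.training_feedback)
```

Raising the threshold sends more documents to people and fewer straight to the quality check, which is the trade-off users tune.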

A word of caution: Don’t make your machine learning model overly generic. Models must be specific. For example, a single curriculum vitae classification is too broad, because a CV from the U.S. looks completely different from a CV from Israel. Making CV classifications specific to the country helps your model identify these differences and classify the documents correctly in your eTMF.

Use Case 2: Metadata Extraction

For metadata extraction, natural language processing and extraction tools can identify structured and fielded data in a document.

For example, OCR is used to process forms that have the same structure, and natural language processing (NLP) algorithms may then be used to analyze the data extracted from those documents and further classify it by site, country, investigator, and contact.

In this process, TMF documents are processed for image correction and language detection, and metadata is extracted with natural language processing based on a set of trained document models. As with the process for document classification, a human then determines whether the metadata has been extracted correctly, and feedback goes into the DNN training model.

Once this process is complete, you can pull data right from the document—such as the name, address, city, or state data on a 1572—and use NLP to index the document automatically. If the document hits the confidence threshold set by users, it then goes through a quality check.
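To make the idea of pulling fielded data from a form concrete, here is a deliberately simple sketch that uses regular expressions as a stand-in for trained NLP models. The field labels, the sample OCR text, and the coverage score are all hypothetical; real forms and real OCR output are far messier than this.

```python
import re

# Hypothetical field labels; real forms and OCR output will differ.
FIELD_PATTERNS = {
    "investigator": re.compile(r"^Name of Investigator:[ \t]*(.+)$", re.MULTILINE),
    "city": re.compile(r"^City:[ \t]*(.+)$", re.MULTILINE),
    "state": re.compile(r"^State:[ \t]*(.+)$", re.MULTILINE),
    "document_date": re.compile(r"^Date:[ \t]*(\d{4}-\d{2}-\d{2})$", re.MULTILINE),
}

def extract_metadata(ocr_text):
    """Return the fields that were found plus the share of expected fields found."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            found[name] = match.group(1).strip()
    coverage = len(found) / len(FIELD_PATTERNS)
    return found, coverage

SAMPLE_OCR_TEXT = """Statement of Investigator
Name of Investigator: Jane Doe
City: Springfield
State: IL
Date: 2021-03-15
"""

metadata, coverage = extract_metadata(SAMPLE_OCR_TEXT)
print(metadata)
print(coverage)
```

Here the coverage value plays the role of a crude confidence score: a form with missing fields would fall below a user-set threshold and be routed to a person.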

Use Case 3: Inspection Readiness

Applying machine learning for inspection readiness is a simpler process than those for document classification and metadata extraction. Once your TMF is coded and all metadata has been extracted from documents, you can use algorithms to make decisions about eTMF health.

Here, machine learning models can help you identify all documents with the same classification, perform rule-based conformity checks, do quality verification of outliers and anomalies, and get a quality report showing percent of error with visual indicators of document issues.
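A rule-based conformity check of this kind could look something like the sketch below. The list of expected artifacts, the record layout, and the way the percentage is computed are assumptions chosen for the example, not a reference standard.

```python
# Hypothetical list of artifacts expected for every site.
EXPECTED_SITE_ARTIFACTS = {"Form FDA 1572", "Curriculum Vitae", "Medical License"}

def check_site(documents):
    """Return a list of issues for one site's classified documents."""
    issues = []
    filed = {doc["artifact"] for doc in documents}
    # Rule 1: every expected artifact must be present.
    for artifact in sorted(EXPECTED_SITE_ARTIFACTS - filed):
        issues.append(f"missing: {artifact}")
    # Rule 2: every filed document needs an extracted document date.
    for doc in documents:
        if not doc.get("date"):
            issues.append(f"no document date: {doc['artifact']}")
    return issues

def health_report(tmf):
    """Map each site to its issues and compute the percent of sites with issues."""
    issues_by_site = {site: check_site(docs) for site, docs in tmf.items()}
    sites_with_issues = sum(1 for issues in issues_by_site.values() if issues)
    percent = 100.0 * sites_with_issues / len(tmf) if tmf else 0.0
    return issues_by_site, percent

# Hypothetical eTMF contents after classification and metadata extraction.
TMF = {
    "Site 101": [
        {"artifact": "Form FDA 1572", "date": "2021-03-15"},
        {"artifact": "Curriculum Vitae", "date": "2021-02-01"},
        {"artifact": "Medical License", "date": "2021-01-20"},
    ],
    "Site 102": [
        {"artifact": "Form FDA 1572", "date": ""},
        {"artifact": "Curriculum Vitae", "date": "2021-02-11"},
    ],
}

report, percent_with_issues = health_report(TMF)
for site, issues in report.items():
    print(site, issues or "OK")
print(percent_with_issues)
```

Because these rules are programmed rather than learned, they are only as good as the classification and metadata they run on, which is why the earlier steps come first.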

As with document classification and metadata extraction, the process allows for manual correction of the data by humans, which updates models for greater accuracy.

Use Case 4: PHI Redaction

Regulation requires that all documents in the eTMF be free of PHI and personally identifiable information (PII). The challenge, of course, is that PII can range from an email address or a bank account number to a date of birth. This makes it very challenging to determine exactly what needs to be redacted, unless there is meaning around the PII data. For example, how does the AI know a date is someone’s birthday?

Some machine learning models are built for document redaction, or the ability to identify and flag or remove PII in a document before it is archived in the TMF.

However, these models tend to cast too wide a net for information, thus removing more than necessary. We recommend using machine learning for basic document redaction use cases, like removal of email addresses.
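Email removal is a good example of a basic case because the pattern is recognizable without context. The sketch below uses a simplified regular expression (an assumption; it does not cover every valid address format) and a made-up sample sentence. Note that the date in the sample is left alone, since nothing in the text says whether it is a birthday.

```python
import re

# Simplified email pattern; it will not match every valid address format.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact_emails(text, placeholder="[REDACTED EMAIL]"):
    """Replace each email-like string and return (redacted_text, count)."""
    return EMAIL_PATTERN.subn(placeholder, text)

note = ("Contact jane.doe@example.org or the site at "
        "coordinator@site101.example.com by 2021-03-15.")

redacted, count = redact_emails(note)
print(redacted)
print(count)
```

Returning the count alongside the text makes it easy to flag documents where something was redacted so a person can confirm the result.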


At this point in time, machine learning won’t be able to handle all the work of document management and TMF automation. Businesses will still require human quality checks and management to ensure analyses are accurate in all of the above-mentioned use cases.

However, machine learning can definitely optimize these processes, and the capabilities of these technologies will advance with time.

Although we don’t know when machine learning will be able to power a self-filing eTMF, we can already apply machine learning technologies to influence how we approach data in future trials.


For information on how machine learning can help you automate your TMF and clinical document processes, contact us at