Healthcare and Life Sciences
NYU Langone Health
deciphEHR is a genomic medicine program from NYU Langone Health that seeks to advance genomic medicine for both research and clinical use. The program sequences patients' genomes and links them to their electronic health records (EHRs) to provide a full health picture of each patient. Using NVIDIA® Parabricks®, deciphEHR cut alignment time from 27 minutes to five minutes and variant calling time from seven hours to 40 minutes.
Accelerated Computing Tools & Techniques
NVIDIA Parabricks
NVIDIA Data Center / Cloud
Key Takeaways
Launched in 2024, deciphEHR aims to enable genomic medicine for both research and clinical utility. To date, the program has sequenced 12,000 NYU Langone patients and linked the data to their EHRs. To reach its goal of sequencing 100,000 genomes by the end of the pilot phase, deciphEHR uses low-coverage (or low-pass) sequencing, which handles large sample volumes accurately while remaining cost-effective and scalable. Higher-coverage or long-read sequencing may still be implemented for specific cohorts.
“Our goal is to enable genomic research in the medical system in a way that links the data to the full electronic health records of a patient,” says Or Yaacov, research assistant professor and program lead for deciphEHR. “To link the entire depth, history, and every feature that exists in electronic records that often can’t really be obtained from biobanks.”
To meet its scalability goals, deciphEHR has implemented a robotic system to process samples. For any patient who has signed a consent form, the program uses leftover blood from clinical blood tests. Instead of being thrown away, the leftover blood is spotted onto dried blood spot cards, a technique often used in neonatal testing. DNA remains stable on dried filter paper, and the automated system can complete library prep without isolating DNA.
“Everything is really designed in order to scale fast. That’s why it’s important to be able to use GPUs. Because if we want to scale, we need to process genomes faster,” says Yaacov. “That’s actually one of the slowest parts of the entire project: the computational processing of the sequencing files.”
The team at deciphEHR built its computational workflows on the standard Genome Analysis Toolkit (GATK) packages. With a goal of sequencing 100,000 genomes, the team needed to increase its processing volume. However, running these workflows on the existing high-performance computing (HPC) system created a scalability challenge.
“What we really wanted to do was increase volume,” states Jonathan McCafferty, senior research scientist for deciphEHR. “Executing those processes on our internal HPC infrastructure poses a challenge due to long runtimes and high contention for shared resources.”
In particular, alignment and variant calling, two critical and historically time-intensive steps in genomic sequencing analysis, were bottlenecks in their process. On CPUs, alignment took roughly 27 minutes and variant calling took seven to eight hours, causing delays and limiting the team's ability to process samples.
“NVIDIA was a way to speed up a lot of those processes. The alignment and variant calling processes, which take up the most resources and compute time, helped move us away from waiting in queues and getting processing samples through,” recalls McCafferty.
Since alignment and variant calling were the two most time- and resource-intensive parts of deciphEHR’s workloads, it was critical to find ways to reduce their runtimes. To do so, the team implemented NVIDIA Parabricks, a scalable genomics software suite for secondary analysis that provides GPU-accelerated versions of trusted, open-source tools.
The team measured alignment and variant calling runtimes on CPUs as a baseline, then compared them against runs on the NVIDIA GPU types they had available, including NVIDIA A100, NVIDIA L40, and NVIDIA V100 Tensor Core GPUs.
Over 5x Reduction in Alignment
On CPU, alignment took 1,600 seconds (roughly 27 minutes) to complete. Running alignment on one to four GPUs across the different node types brought the runtime down to five minutes, a more than 5x reduction for the alignment step alone.
Data provided by NYU Langone Health.
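The arithmetic behind the "more than 5x" figure can be checked directly. The snippet below is a minimal sketch that uses only the two alignment runtimes quoted above (1,600 seconds on CPU and five minutes on GPU); the function name `speedup` is illustrative and not part of any tool mentioned here.

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Return how many times faster the accelerated run is than the baseline."""
    if baseline_seconds <= 0 or accelerated_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / accelerated_seconds


# Alignment runtimes reported in this case study.
cpu_alignment_s = 1600      # roughly 27 minutes on CPU
gpu_alignment_s = 5 * 60    # five minutes on one to four GPUs

alignment_speedup = speedup(cpu_alignment_s, gpu_alignment_s)
print(f"Alignment speedup: {alignment_speedup:.1f}x")  # prints "Alignment speedup: 5.3x"
```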
“Without GPUs, alignment took approximately 27 minutes to complete. But once we tested it on one to four GPUs, we brought that time down to just five minutes.”
Jonathan McCafferty
Senior Research Scientist, deciphEHR
Since Parabricks provides GPU-accelerated versions of trusted, open-source tools, the team at deciphEHR was able to compare CPU and GPU versions of the tools they were already using, including FQ2BAM (the Parabricks GPU-accelerated counterpart to BWA-MEM) for alignment and HaplotypeCaller for variant calling.
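For readers unfamiliar with how these two tools are invoked, the sketch below assembles `pbrun fq2bam` and `pbrun haplotypecaller` command lines in Python. It is an illustration under stated assumptions, not the deciphEHR pipeline: the file names are hypothetical placeholders, the helper function names are invented for this example, and the flags shown (`--ref`, `--in-fq`, `--out-bam`, `--in-bam`, `--out-variants`, `--num-gpus`) should be checked against the Parabricks documentation for the installed release. The commands are only built and printed, not executed.

```python
import shlex
from typing import List


def fq2bam_command(ref: str, fastq_r1: str, fastq_r2: str,
                   out_bam: str, num_gpus: int = 1) -> List[str]:
    """Build a `pbrun fq2bam` command for GPU-accelerated alignment."""
    return [
        "pbrun", "fq2bam",
        "--ref", ref,
        "--in-fq", fastq_r1, fastq_r2,
        "--out-bam", out_bam,
        "--num-gpus", str(num_gpus),
    ]


def haplotypecaller_command(ref: str, in_bam: str,
                            out_vcf: str, num_gpus: int = 1) -> List[str]:
    """Build a `pbrun haplotypecaller` command for GPU-accelerated variant calling."""
    return [
        "pbrun", "haplotypecaller",
        "--ref", ref,
        "--in-bam", in_bam,
        "--out-variants", out_vcf,
        "--num-gpus", str(num_gpus),
    ]


# Hypothetical file names, used only to show the shape of the commands.
align_cmd = fq2bam_command("ref.fasta", "sample_R1.fastq.gz",
                           "sample_R2.fastq.gz", "sample.bam", num_gpus=4)
call_cmd = haplotypecaller_command("ref.fasta", "sample.bam",
                                   "sample.vcf", num_gpus=4)

print(shlex.join(align_cmd))
print(shlex.join(call_cmd))
```

On a machine with Parabricks installed, each list could be passed to `subprocess.run(cmd, check=True)`. Keeping the reference file and parameters identical between the CPU and GPU runs is what makes a like-for-like comparison possible.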
Over 10x Reduction in Variant Calling
The deciphEHR team had already achieved a more than 5x reduction in alignment runtime, but variant calling was the main bottleneck and needed the largest reduction.
“Variant calling is where we saw the most dramatic speedup. GATK was taking seven to eight hours to run—even after splitting the genome into 10 equal regions,” states McCafferty. “Once we integrated Parabricks and started using GPUs, we brought that down to just 40 minutes per genome. That’s a tremendous improvement for our pipeline.”
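The quote mentions splitting the genome into 10 equal regions for the CPU runs. The source does not describe how the team did this, so the following is only a generic sketch of one way to partition a set of contigs into regions of nearly equal total length. The contig names and lengths are toy values, and `split_genome` is a hypothetical helper written for this example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (contig, 1-based start, inclusive end)


def split_genome(contig_lengths: Dict[str, int],
                 n_regions: int) -> List[List[Interval]]:
    """Partition contigs into n_regions groups of (nearly) equal total length."""
    if n_regions < 1:
        raise ValueError("n_regions must be at least 1")
    total = sum(contig_lengths.values())
    # Region k covers cumulative base positions bounds[k] + 1 .. bounds[k + 1].
    bounds = [total * k // n_regions for k in range(n_regions + 1)]
    regions: List[List[Interval]] = [[] for _ in range(n_regions)]
    offset = 0  # number of bases in the contigs already processed
    k = 0       # index of the region currently being filled
    for contig, length in contig_lengths.items():
        start = 1
        while start <= length:
            # Move on to the next region once the current one is full.
            while offset + start > bounds[k + 1]:
                k += 1
            end = min(length, bounds[k + 1] - offset)
            regions[k].append((contig, start, end))
            start = end + 1
        offset += length
    return regions


# Toy contig lengths, far smaller than real chromosomes.
toy_contigs = {"chrA": 1000, "chrB": 600, "chrC": 400}
regions = split_genome(toy_contigs, n_regions=10)

for i, intervals in enumerate(regions):
    size = sum(end - start + 1 for _, start, end in intervals)
    print(i, size, intervals)
```

Each group of intervals can then be handed to a separate variant calling job and the per-region results merged afterward. Even with this kind of scatter step, the quote reports seven to eight hours on CPU, versus 40 minutes per genome with Parabricks on GPUs.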
Because the team had already been using the standard, open-source version of HaplotypeCaller on CPU, Parabricks brought another benefit: the team could use the same parameters in the GPU-accelerated version and maintain the same level of accuracy.
Data provided by NYU Langone Health.
Implementing Force Calling Mode in HaplotypeCaller
In addition to significant speedups in alignment and variant calling, the team was also impressed with NVIDIA’s responsiveness in implementing feedback. The deciphEHR team knew the open-source version of HaplotypeCaller met their needs, but they needed capabilities they were already using, like force-calling mode, to be added to the Parabricks version of HaplotypeCaller.
“The team at NVIDIA did an excellent job integrating the specific parameters and options we needed into HaplotypeCaller, aligning it with the configuration we were already using,” recalls McCafferty. “This was a major advantage, as it let us maintain identical parameters and reference files, enabling reproducibility and consistency with our CPU-based results.”
Not only was force-calling mode added to the Parabricks version of HaplotypeCaller, but it also continues to be accelerated, along with other tools, with every new Parabricks release.
In addition to significant speedups for key steps in their workloads and building capabilities into future releases, the deciphEHR team has been able to access beta releases early and test out new features. Plus, the team has simplified workloads and removed complexities in their processes.
“Parabricks has greatly simplified our workloads by reducing the number of containers we need to manage. It’s made our workflows easier to understand and maintain—unlike more complex packages from other organizations that involve maintaining a large number of containers. This has been a major improvement for our team.”
Removing complexity and simplifying workloads have also given deciphEHR’s team more flexible deployment options on their HPC infrastructure and the ability to make better use of their existing resources.
Yaacov adds, “Sharing the load between the GPU nodes and the CPU nodes gives us the flexibility to use more resources at a single time. We don’t have to use only GPUs or only CPUs.”
From reducing complexities to reducing runtimes, Parabricks has accelerated critical steps in the team’s processes to meet their sequencing goals. As a result, the deciphEHR team can process samples faster, freeing up time to focus on their goal of pioneering genomic medicine.
“Variant calling used to take over seven hours per genome. After integrating NVIDIA Parabricks and leveraging GPUs, we reduced that to just 40 minutes—a dramatic improvement for our workflow.”
Jonathan McCafferty
Senior Research Scientist, deciphEHR