Craig Yoshioka is a research associate professor at OHSU and director of one of the three national cryo-electron microscopy (cryo-EM) centers funded by the National Institutes of Health (NIH) in 2018.
The center collects data from four 300 kV Krios electron microscopes, along with a number of smaller ones used for screening samples before passing them on to the larger microscopes. The devices are semiautomated and spend 130 to 150 hours per week collecting data.
These instruments can be used to unravel the mechanisms of life at a molecular level, but doing so requires collecting very large image datasets: an average of 3 terabytes of data per device per day, or 8 to 16 terabytes daily for the facility.
Dr. Craig Yoshioka and cryo-EM equipment (courtesy: Blocks & Files)
The scope and complexity of cryo-EM data continue to increase. Much of the raw data is noise, and while processing techniques are still evolving, they have not yet reached the point where they can automatically determine what to keep and what to throw away. Datasets that prove intractable using today’s tools might still contain valuable information that can be extracted with the tools of tomorrow. As such, Craig’s team needs to keep the data for an extended period, typically a year, so that researchers at the facility have time to get meaningful results from it.
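To put the retention requirement in perspective, the figures above can be turned into a rough capacity estimate. The following is a minimal sketch, not the center's actual sizing method: it assumes a constant daily ingest, a retention window of exactly 365 days, decimal units (1 petabyte = 1,000 terabytes), and no overhead for replication, erasure coding, or processed results. The function and variable names are hypothetical.

```python
# Rough capacity estimate from the figures quoted in the article.
# Assumptions (not from the source): constant ingest rate, a 365-day
# retention window, decimal units, and no redundancy overhead.

DAYS_PER_YEAR = 365
TB_PER_PB = 1000


def retained_terabytes(daily_tb: float, retention_days: int = DAYS_PER_YEAR) -> float:
    """Raw data held at steady state for a given daily ingest rate."""
    return daily_tb * retention_days


# Four Krios microscopes at an average of 3 TB per device per day.
krios_count = 4
avg_tb_per_krios_per_day = 3
krios_daily_tb = krios_count * avg_tb_per_krios_per_day

# Facility-wide range quoted in the article: 8 to 16 TB per day.
low_tb = retained_terabytes(8)
high_tb = retained_terabytes(16)

print(f"Krios ingest at the average rate: {krios_daily_tb} TB/day")
print(f"One year retained: {low_tb / TB_PER_PB:.2f} to {high_tb / TB_PER_PB:.2f} PB")
```

Under these assumptions, a year of retained raw data works out to roughly 2.9 to 5.8 petabytes, before any redundancy or derived data is counted.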
Craig’s team supports the efforts of around 900 researchers. There are at least 200 active projects at any given time. Cryo-EM data must be sent to labs at different institutions, as well as moved locally, while maintaining consistently high performance. Craig needed a storage solution that could offer both massive capacity and high performance.
A typical data set for us runs into the terabyte range. When you’re collecting data from multiple experiments per day per microscope, that’s a lot of data that needs to be stored and processed to extract meaningful results. As such, we needed high storage capacity and high performance, and those two things are generally hard to get at the same time.
Craig Yoshioka Ph.D. – Research Associate Professor at OHSU
The center previously relied entirely on the Pacific Northwest National Laboratory (PNNL) for its data storage and high-performance computing needs. However, in order to facilitate local data sharing and eliminate bandwidth and reliability issues, Craig sought a solution that would bring more processing power and storage closer to the microscopes.
The original storage system was a large-disk array powered by ZFS on Linux. However, while it performed well enough to write all the data, processing it in real time was difficult and frequently resulted in the entire storage system slowing to a crawl. Moreover, it was a single storage server and a single point of failure, which spelt bad news for resilience.
Craig evaluated several possible options for overcoming the center’s storage woes. He considered scaling up the existing ZFS pool or implementing a DIY BeeGFS cluster. With the help of hardware manufacturer Advanced HPC, he also explored the possibilities of using Weka.io, VAST, and Panasas. However, all of these potential solutions shared a common limitation: maintaining them would take time that Craig wanted to dedicate to research. He shortlisted some non-negotiable features: a distributed file system that he and his team could access via a centralized web interface, with a single namespace and intuitive management utilities built in.
Craig found Quobyte in November 2022, and it ticked all the boxes. In the months before making his decision, the department had bought all the storage hardware, which uses a mix of traditional high-capacity hard disks and fast solid-state drives. Migrating to Quobyte was a very straightforward process, and even though one of the servers initially contained a faulty memory module that caused the entire server to crash, Quobyte handled this underlying hardware problem extremely well. By January, Craig’s team had already moved 1.5 petabytes of data to the new environment.
Performance-wise, I can say Quobyte does everything I was hoping for it to do. For example, in some processing scenarios, where you’d trigger something that evicts all data from memory, our old server would fall flat on its face. Quobyte performs much more consistently, so our storage can now keep up with our data processing.
Thanks to Quobyte, Craig no longer needs to worry about real-time data processing jobs running into barriers due to storage limitations. Even when overloaded, Quobyte degrades gracefully across the entire network instead of coming to a grinding halt. As a result, performance issues no longer interrupt important research projects, and data-related workflows are substantially more streamlined than they were before. Moreover, being a distributed file system, Quobyte has eliminated the single point of failure that came with the single ZFS storage server.
Craig eventually hopes to extend adoption of Quobyte to transferring large data sets over the internet. Because Quobyte can also grant access to the file system via an Amazon S3-compatible interface and provide granular access controls and storage policies via its management dashboard, it is well suited to cloud environments too. Craig hopes to see Quobyte adopted more generally across the OHSU community as well, making it more central to the institution’s infrastructure. As he summarizes, he sees Quobyte as a linchpin for data storage across their facility.
About Oregon Health and Science University
OHSU is Oregon’s only public academic health center. We are a system of hospitals and clinics across Oregon and southwest Washington. We are an institution of higher learning, with schools of medicine, nursing, pharmacy, dentistry and public health – and with a network of campuses and partners throughout Oregon. We are a national research hub, with thousands of scientists developing lifesaving therapies and deeper understanding.
About Advanced HPC
Advanced HPC, Inc. is a leading provider of high-performance computing and storage solutions. We deliver best-in-class parallel storage solutions using Quobyte software. Each solution is custom designed for its application and backed by world-class support.
Our team has over 100 years of experience in the technology industry. Our staff includes industry veterans in sales, customer service, and production to ensure that you are getting the best solution to fit your needs.