What is high-performance computing (HPC) storage?

Let's start with a quick overview of what high-performance computing actually is. High-performance computing is the field of IT that deals with solving large problems - often scientific or research problems - using supercomputers or compute clusters.

When problems like weather predictions or protein structure alignment became too big for a single server or machine, researchers had to aggregate the computing power of many machines to solve them.

Dividing a big problem into smaller pieces and then solving the smaller pieces in parallel on many smaller machines proved to be a winning proposition: Scaling out allowed scientists to solve even bigger problems. They didn't even have to wait for CPUs and machines to become faster. With individual CPU cores not getting significantly faster anymore, even traditional software must now parallelize across the many cores in modern CPUs.
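This divide-and-parallelize idea can be sketched in a few lines of Python. The example below is illustrative only - the function names and the toy workload (a sum of squares) are stand-ins for real scientific computations, and `multiprocessing` stands in for the many machines of a cluster:

```python
from multiprocessing import Pool

def solve_piece(chunk):
    """Solve one small piece of the problem.
    Here: a toy stand-in (sum of squares) for real scientific work."""
    lo, hi = chunk
    return sum(i * i for i in range(lo, hi))

def scaled_out_sum(n, workers=4):
    """Divide the range [0, n) into chunks and solve them in parallel,
    then combine the partial results."""
    step = n // workers
    chunks = [(i * step, (i + 1) * step if i < workers - 1 else n)
              for i in range(workers)]
    with Pool(workers) as pool:
        return sum(pool.map(solve_piece, chunks))

if __name__ == "__main__":
    # The scaled-out result matches the single-process computation.
    assert scaled_out_sum(100_000) == sum(i * i for i in range(100_000))
```

On a real cluster the workers would be separate machines communicating over a network, but the structure - partition, solve in parallel, combine - is the same.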

Scaling out an application across many machines suddenly changed the requirements for storage, too: the applications have to read and write a lot of data in parallel. This required new storage systems that provide massively parallel IO and can scale out together with the compute.

Scratch, Home, Project storage

The focus for those high-performance computing file systems was mainly on performance and throughput; reliability was secondary since a lot of the data was temporary, e.g., for intermediate results or checkpoints. This type of storage is often referred to as scratch storage.

Most HPC clusters offer several types of storage spaces besides the fast scratch. Home directories are reliable storage, but are often slower and have lower capacity than scratch. This is where researchers store binaries, scripts, input files, and results.

Some clusters have additional storage space for projects, or for input files shared across different projects, like the Protein Data Bank (PDB), image collections, and so on. These are similar to home directories in that this storage needs to be reliable.

IO patterns in HPC and what is MPI-IO?

Many HPC applications were created at a time when flash was too expensive. And even today, storage capacities in HPC often range from tens to hundreds of petabytes. At this scale, the price difference of roughly a factor of 5 between flash and spinning disks makes all-flash unaffordable.

Most HPC applications write data in large sequential blocks, something that hard drives love. What is often forgotten is that large sequential IO can be prefetched and cached very efficiently, dramatically reducing IO wait times for the CPU.
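To make the pattern concrete, here is a minimal sketch of writing a file in large sequential blocks. The 4 MiB block size is an arbitrary illustrative choice - the point is that each write continues where the previous one ended, so a spinning disk never has to seek and the OS readahead can prefetch effectively on later reads:

```python
import os
import tempfile

BLOCK = 4 * 1024 * 1024  # large blocks amortize per-IO overhead and seek cost

def write_sequential(path, total_bytes):
    """Write a file as one long run of large sequential blocks -
    the access pattern hard drives and readahead caches handle best."""
    block = bytes(BLOCK)  # zero-filled payload, a stand-in for real results
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(BLOCK, total_bytes - written)
            f.write(block[:n])
            written += n
    return written

path = os.path.join(tempfile.mkdtemp(), "output.dat")
write_sequential(path, 10 * 1024 * 1024)
assert os.path.getsize(path) == 10 * 1024 * 1024
```

Random small writes, by contrast, force a seek per IO on a hard drive and defeat prefetching, which is why HPC storage systems are tuned for exactly this streaming pattern.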

A special pattern of IO comes from MPI applications. MPI (Message Passing Interface) is a library that helps scientists and developers create parallelized applications that distribute a computation across many machines. Its IO subsystem is called MPI-IO and is often used interchangeably with a very specific IO pattern: each of the parallelized processes, running on a separate machine, writes its output to the same big file. In this scenario, each process gets a small chunk of the file allocated to write its results. A file system that allows many processes to write to the same file concurrently, i.e., without serializing the writers with locks or other methods, is called a parallel file system.
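The shared-file pattern can be sketched without MPI at all: because every rank writes to its own non-overlapping region, a positional write at `rank * chunk_size` is all that is needed, with no locking between writers. This is a simulation in plain Python - real MPI ranks would run on different machines and use a call such as `MPI_File_write_at`; the file name and chunk size here are arbitrary:

```python
import os
import tempfile

def write_rank_chunk(fd, rank, chunk_size, payload):
    """One 'rank' (parallel process) writes its results into its own
    region of the shared file. Regions never overlap, so no locks are
    needed - the essence of the MPI-IO shared-file pattern."""
    assert len(payload) == chunk_size
    os.pwrite(fd, payload, rank * chunk_size)  # positional write at the rank's offset

# Simulate 4 ranks writing concurrently-writable regions of one file.
CHUNK = 1024
path = os.path.join(tempfile.mkdtemp(), "shared_output.dat")
fd = os.open(path, os.O_CREAT | os.O_RDWR)
os.ftruncate(fd, 4 * CHUNK)  # preallocate the shared output file
for rank in range(4):
    write_rank_chunk(fd, rank, CHUNK, bytes([rank]) * CHUNK)
os.close(fd)

# The file contains each rank's chunk at its assigned offset.
with open(path, "rb") as f:
    data = f.read()
assert data == b"".join(bytes([r]) * CHUNK for r in range(4))
```

On a parallel file system, these writes really do happen concurrently from many machines; the file system's job is to make that concurrent access fast instead of funneling all writers through a single lock.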

Quobyte is a parallel distributed file system for HPC and enterprise scale-out workloads.
