What is high performance computing (HPC) storage?

Let's start with a quick overview of what high performance computing actually is. High performance computing is the field of IT that deals with solving large problems, often from science or research, using supercomputers or compute clusters.

When problems like weather prediction or protein structure alignment became too big for a single server or machine, researchers had to aggregate the compute power of many machines to solve them.

Dividing a big problem into smaller pieces and then solving those pieces in parallel on many smaller machines proved to be a winning proposition: scaling out allowed scientists to solve even bigger problems without having to wait for CPUs and machines to become faster. Now that CPUs are no longer getting significantly faster, even traditional software must parallelize across the many cores in modern CPUs.
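The divide-and-combine idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a thread pool stands in for the machines of a cluster, and the names `split_into_chunks`, `solve_piece` and `solve_in_parallel`, as well as the sum-of-squares workload, are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, remainder = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks get one extra element.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def solve_piece(chunk):
    """Hypothetical per-worker job: a sum of squares over one chunk."""
    return sum(x * x for x in chunk)


def solve_in_parallel(data, n_workers=4):
    """Hand one chunk to each worker, then combine the partial results."""
    chunks = split_into_chunks(data, n_workers)
    # Assumption: threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(solve_piece, chunks))
    return sum(partial_results)


if __name__ == "__main__":
    print(solve_in_parallel(list(range(1_000_000))))
```

On a real cluster the workers would be separate processes on separate machines, and the scatter and combine steps would go over the network rather than through shared memory.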

Scaling out an application across many machines changed the requirements for storage too: the applications have to read and write a lot of data in parallel. This required new storage systems that provide massively parallel IO and are able to scale out together with the compute.

Scratch, Home, Project storage

The focus for those high performance or HPC file systems was mainly on performance and throughput. Reliability was secondary, since a lot of the data was temporary, e.g. intermediate results or checkpoints. This type of storage is often referred to as scratch storage.

Most HPC clusters offer several types of storage spaces besides the fast scratch. Home directories are often slower but reliable storage with lower capacity than scratch. This is where researchers store binaries, scripts, input files and results.

Some clusters have additional storage space for projects or for input files that are shared across projects, like the Protein Data Bank (PDB), images and so on. This storage is similar to home directories in that it needs to be reliable.

IO patterns in HPC and what is MPI-IO?

Many HPC applications were created at a time when flash was too expensive. Even today, the amount of storage in HPC clusters often ranges from tens to hundreds of petabytes. At this scale, the price difference of roughly a factor of 5 between flash and spinning disk makes all-flash storage unaffordable.

Most HPC applications write data in large sequential blocks, something that hard drives love. What is often forgotten is that large sequential IO can be prefetched and cached very efficiently, which dramatically reduces the time the CPU spends waiting for IO.

A special pattern of IO comes from MPI applications. MPI (Message Passing Interface) is a standard, available as a library, that helps scientists and developers create parallelized applications that distribute the computation across many machines. Its IO interface is called MPI-IO, and the term is often used interchangeably with a very specific IO pattern: each of the parallel processes, running on separate machines, writes its output to the same big file. In this scenario each process is allocated a small chunk of the file, to which it writes its results. A file system that allows many processes to write to the same file concurrently, i.e. without serializing the writers with locks or other methods, is called a parallel file system.
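The shared-file pattern can be imitated on a single machine to make it concrete. The sketch below does not use MPI at all: threads stand in for the MPI processes, and POSIX `os.pwrite` (available on Unix-like systems) stands in for an MPI-IO call with an explicit offset such as `MPI_File_write_at`. The chunk size, the payload format and the function names are assumptions chosen for this illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 16  # bytes reserved per writer (hypothetical value)


def write_chunk(fd, rank):
    """Write one writer's result into its own region of the shared file."""
    payload = f"rank {rank:04d}".encode().ljust(CHUNK_SIZE, b".")
    # pwrite takes an explicit offset, so writers do not share a file position.
    os.pwrite(fd, payload, rank * CHUNK_SIZE)


def write_shared_file(path, n_writers):
    """Let n_writers fill disjoint regions of a single file concurrently."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        # Assumption: threads play the role of MPI ranks in this sketch.
        with ThreadPoolExecutor(max_workers=n_writers) as pool:
            list(pool.map(lambda rank: write_chunk(fd, rank), range(n_writers)))
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        shared_path = os.path.join(tmp_dir, "shared.out")
        write_shared_file(shared_path, 4)
        with open(shared_path, "rb") as f:
            print(f.read())
```

Because every writer owns a disjoint byte range, no writer has to wait for another. On a cluster, whether those writes really proceed in parallel depends on the file system underneath, which is the property the term parallel file system refers to.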

Quobyte is a parallel distributed file system for HPC and enterprise scale-out workloads.
