I/O tuning¶

HPC I/O software can be thought of as a stack of data-handling tools. For example, when you use the popular H5py python library, you are working with the topmost element of a stack that passes data from H5py to HDF5, to MPI-IO, to POSIX. Most applications tend to use high-level I/O libraries, such as HDF5, but there are also many legacy applications written to work directly with MPI-IO, or even POSIX I/O. Tuning I/O involves optimizing the complex data transfer process between computational resources and storage by adjusting parameters across the I/O stack.
Different I/O tuning strategies are appropriate for different layers. We recommend that users tune the layer that the application is directly built upon. For example, if your application is built with H5py, please refer to H5py I/O Tuning.

MPI I/O Tuning¶

MPI I/O tuning involves optimizing performance for large-scale parallel applications by using collective I/O operations, which group I/O from multiple processes into coordinated calls, and by tuning MPI-IO library parameters through environment variables, such as buffer sizes and aggregator counts.

MPI-I/O Collective Mode¶

Collective I/O operations are a set of optimizations available in many implementations of MPI-I/O that improve the performance of large-scale I/O to shared files. To enable these optimizations, you must use the collective calls in the MPI-I/O library, which end in _all, e.g. MPI_File_write_at_all(). Also, all MPI tasks in the given MPI communicator must participate in the collective call, even if they are not performing any I/O themselves. The MPI-I/O library has a heuristic to determine whether to enable collective buffering, the primary optimization used in collective mode.

Collective Buffering¶

Collective buffering, also called two-phase I/O, breaks an I/O operation into two stages. For a collective read, the first stage uses a subset of MPI tasks (called aggregators) to communicate with the I/O servers (OSTs in Lustre) and read a large chunk of data into a temporary buffer. In the second stage, the aggregators ship the data from the buffer to its destination among the remaining MPI tasks using point-to-point MPI calls. A collective write does the reverse, aggregating data with MPI into buffers on the aggregator nodes, then writing from the aggregator nodes to the I/O servers. The advantage of collective buffering is that fewer nodes are communicating with the I/O servers, which reduces contention while still attaining high performance through concurrent I/O transfers. In fact, Lustre performs best with a one-to-one mapping of aggregator nodes to OSTs.

Since the release of mpt/3.3, Cray has included a Lustre-aware implementation of the MPI-I/O collective buffering algorithm. This implementation is able to buffer data on the aggregator nodes into stripe-sized chunks so that all read and writes to the Lustre file system are automatically stripe-aligned without requiring any padding or manual alignment from an application. Because of the way Lustre is designed, alignment is a key factor in achieving optimal performance.

MPI-I/O Hints¶

Several environment variables can be used to control the behavior of collective buffering on Perlmutter. The MPICH_MPIIO_HINTS variable specifies hints to the MPI-I/O library that can, for instance, override the built-in heuristic and force collective buffering on:

export MPICH_MPIIO_HINTS="*:romio_cb_write=enable:romio_ds_write=disable"

Placing this command in your batch file before calling aprun will cause your program to use these hints. The * indicates that the hint applies to any file opened by MPI-I/O, while romio_cb_write controls collective buffering for writes and romio_ds_write controls data sieving for writes, an older collective mode optimization that is no longer used and can interfere with collective buffering. The options for these hints are enabled, disabled, or automatic (the default value, which uses the built-in heuristic).

It is also possible to control the number of aggregator nodes using the cb_nodes hint, although the MPI-I/O library will default to automatically setting this to the stripe count of your file.

When set to a value of 1, the MPICH_MPIIO_HINTS_DISPLAY variable causes your program to dump a summary of the current MPI-I/O hints to stderr each time a file is opened. This is useful for debugging and as a sanity check against spelling errors in your hints.

More detail on MPICH runtime environment variables, including a full list and description of MPI-I/O hints, is available from the intro_mpi man page on Perlmutter.

HDF5 I/O Tuning¶

Introduction to HDF5

H5py I/O Tuning¶

Choose the most modern format in file creation

By default, every file that is created by HDF5 uses the most backwardly compatible version of the file format, which is less-efficient than more recent versions of the file format. By turning on the 'latest' format during file creation, you may save I/O cost. Refer to version bounding for more information.

f = h5py.File('name.h5', libver='earliest') # most compatible
f = h5py.File('name.h5', libver='latest')   # most modern

Tip

Using 1 process to create 1 file with 8000 objects (including subgroups, different datasets, etc), 'latest' version achieved 2.25X speedup.

Below are some sample codes demonstrating collective I/O to accelerate application I/O:

with dset.collective:
 dset[start:end,:]=temp

Tip

"Using 1K processes to write a 1TB file, collective IO achieved a 2X speedup on Cori."

Avoid Datatype Conversions¶

By default, numpy uses 64-bit floating-point values. However, whenever HDF5 accesses a dataset with another datatype (through H5py), those elements must must be converted, which is a costly operation. For example, if we create a dataset with dtype 'f' (which indicates a 32-bit float), and then assign a numpy array to this dataset, HDF5 will perform datatype conversion as it performs the I/O. To avoid type conversion when writing numpy arrays, the 'f8' type must be used when creating the H5py dataset. For example:

#slower performance
dset = f.require_dataset('field', shape, 'f')
dset = temp # temp is a numpy array, which by default is 64-bit floats

#faster performance
dset = f.require_dataset('field', shape, 'f8')
dset = temp

Tip

This greatly reduces the I/O time, from 527 seconds to 1.3 seconds when writing a 100x100x100 array with 5x5x5 procs

Use Low-level API¶

H5py provides a nice object-oriented interface by hiding many details that are available in the HDF5 C interface. Fortunately, users can still leverage the flexible HDF5 C features for tuning I/O performance. H5py allows you to do so by using the h5py low-level API. For examples:

space = h5py.h5s.create_simple((100,)) 
plist = h5py.h5p.create(h5py.h5p.DATASET_CREATE) 
plist.set_alloc_time(h5py.h5d.ALLOC_TIME_EARLY)

Application developers should also consider disabling dataset element pre-filling, using the low-level API. The following code creates a 900GB dataset, and by using 'FILL_TIME_NEVER' reduces the I/O cost from 40 minutes to less than 1 second:

fx = h5py.File('test_nofill.h5', 'w') 
spaceid = h5py.h5s.create_simple((30000, 8000000))  
plist = h5py.h5p.create(h5py.h5p.DATASET_CREATE)  
plist.set_fill_time(h5py.h5d.FILL_TIME_NEVER)  
plist.set_chunk((60, 15625))  
datasetid = h5py.h5d.create(fx.id, "data", h5py.h5t.NATIVE_FLOAT, spaceid, plist)
dset = h5py.Dataset(datasetid)
fx.close()

Case Study with H5Boss¶

H5py has been used in many scientific applications. One use case is from astronomy, in which the developers built a customized file structure based on HDF5 and developed query/subsetting/updating functions for managing BOSS spectral data, SDSS-II. For more details: H5Boss.

Performance vs. Productivity¶

NERSC Data Seminar May 26 2017