TensorFlow¶
TensorFlow is a deep learning framework developed and supported by Google. It supports a large variety of state-of-the-art neural network layers, activation functions, optimizers, and tools for analyzing, profiling, and debugging deep neural networks. For users who want to get started, we recommend browsing the TensorFlow tutorials. The TensorFlow page also provides complete API documentation.
Using TensorFlow at NERSC¶
There are multiple ways to use and run TensorFlow on NERSC systems like Perlmutter.
Using NERSC TensorFlow modules¶
The TensorFlow modules are the easiest and fastest way to get started with a complete Python + TensorFlow environment, including all the features supported by the system.
You can load the TensorFlow module with
where <version> should be replaced with the version string you are
trying to load. To see which ones are available use module avail
tensorflow.
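For example (with <version> as a placeholder for an actual version string):

```shell
module avail tensorflow          # list the available versions
module load tensorflow/<version>
```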
Customizing the module environments: If you want to integrate your own packages into the NERSC TensorFlow module environment, you can install additional packages on top with pip.
This leverages the $PYTHONUSERBASE variable which is set by the modulefiles
to specify a location for your additional packages specific to that module version.
These packages will then be available every time you load the module.
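For instance, installing an extra package on top of a loaded module might look like the following (the package name is illustrative):

```shell
module load tensorflow/<version>
# pip's --user installs go under $PYTHONUSERBASE, which the modulefile
# sets to a location specific to this module version
pip install --user netCDF4
```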
Building your own environments¶
If you want to build your own complete environment with full control over the packages and versions installed, it is recommended to use conda as described in our Python documentation. Follow the appropriate instructions in the TensorFlow documentation to install TensorFlow into your environment.
For TensorFlow's GPU functionality you need to have the requisite
CUDA libraries available. One option, as mentioned in the
TensorFlow docs,
is to install cudatoolkit and cudnn via conda, into your
custom conda environment. Alternatively, you can simply load the
cudatoolkit and cudnn modules on Perlmutter.
Note that you should take care to match the
CUDA version
of the module to your TensorFlow version.
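A sketch of one possible workflow (the environment name, Python version, and module names are assumptions; check `module avail` on your system):

```shell
module load conda                      # or whichever module provides conda
conda create -n my-tf-env python=3.11
conda activate my-tf-env
python -m pip install tensorflow       # per the TensorFlow install docs
module load cudatoolkit cudnn          # match CUDA versions to your TensorFlow build
```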
Please contact us at consult@nersc.gov if you want to build Horovod for your custom environment.
Using containers¶
It is also possible to use your own containers with TensorFlow using shifter. Refer to the NERSC shifter documentation for help deploying your own containers.
On Perlmutter, we provide nersc/tensorflow shifter images based on
NVIDIA GPU Cloud
(NGC) containers, with a few extra packages added for convenience. They are named like
nersc/tensorflow:24.08.01 with the 24.08 tag referring to the corresponding NGC tag.
To run a container in batch jobs, we strongly recommend using the Slurm image shifter options for best performance:

```shell
#SBATCH --image=nersc/tensorflow:24.08.01
#SBATCH --module=gpu,nccl-plugin

srun shifter python my_python_script.py
```
On Perlmutter, best performance for multi-node distributed training is achieved by
using the NCCL shifter modules
along with the default gpu shifter module. Please refer to the
NCCL shifter modules page
to identify the correct argument for your container.
Customizing your containers in shifter: Shifter containers are
read-only, which means you cannot modify the image contents at runtime.
However, you can specify a path on the host system for additional
packages by setting $PYTHONUSERBASE. You can use the Shifter --env
option to set this variable, e.g.:

```shell
shifter --image=nersc/tensorflow:24.08.01 --module gpu,nccl-plugin --env PYTHONUSERBASE=$HOME/.local/perlmutter/nersc_tf_24.08.01
pip install netCDF --user
```
You also need to set $PYTHONUSERBASE in your Slurm batch scripts to use your custom libraries at runtime, pointing at the same location used for the install:

```shell
#SBATCH --image=nersc/tensorflow:24.08.01
#SBATCH --module=gpu,nccl-plugin

srun shifter --env PYTHONUSERBASE=$HOME/.local/perlmutter/nersc_tf_24.08.01 python my_python_script.py
```
You can also customize the images further by building your own Docker/Shifter image based on NERSC or NGC images following the standard Shifter image building instructions. The recipes for NERSC NGC images, which are built on top of NVIDIA's NGC images, are a good starting point for building optimized GPU-enabled containers.
NGC tensorflow containers on Perlmutter
Please note that for running multi-node distributed
training with Horovod in NGC TensorFlow containers,
you will need to include --mpi=pmi2 as an option to srun
and --module=gpu,nccl-plugin as an option to shifter.
The full job step command would look something like:

```shell
srun --mpi=pmi2 ... shifter --module=gpu,nccl-plugin ...
```
Unset NCCL_CROSS_NIC for multi-node training with container versions 24.11 and higher
There is an issue with the TensorFlow containers 24.11+ (which ship NCCL versions 2.23 - 2.25).
To circumvent this, unset NCCL_CROSS_NIC when launching your jobs with these containers.
Distributed TensorFlow¶
We recommend using Uber Horovod for distributed data parallel training. The version of Horovod we provide is compiled against the optimized Cray MPI and thus integrates well with Slurm. Check out our example Slurm scripts for running Horovod on Perlmutter, using modules and containers. Also, Horovod provides TensorFlow 1 and 2 examples.
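The core script changes for Horovod data-parallel training are small. Below is a minimal sketch for TensorFlow 2 with Keras, not a complete training script; it assumes the tensorflow and horovod packages are installed, and the learning rate value is illustrative:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU, typically launched via srun

# Pin each rank to a single local GPU
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Scale the learning rate with the number of workers and wrap the optimizer
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))

# When fitting, broadcast initial state from rank 0 so all workers start identical
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
```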
Splitting Data¶
It is important to note that splitting the data among the workers is up
to the user and needs to be done in addition to the modifications stated
above. Horovod provides utility functions for this: hvd.size() returns
the total number of ranks and hvd.rank() the global rank ID. If multiple
ranks are employed per node, hvd.local_rank() and hvd.local_size()
return the node-local rank ID and the node-local number of ranks. If
the
dataset API
is being used, we recommend using the dataset.shard option to split
the dataset. In other cases, the data sharding needs to be done
manually and is application dependent.
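The manual case boils down to giving each rank a disjoint subset of the samples. Here is a pure-Python sketch of one common strategy, interleaved sharding; in a real job, the rank and size arguments would come from hvd.rank() and hvd.size():

```python
def shard(samples, rank, size):
    """Return every size-th sample starting at rank, so that shards are
    disjoint, cover all samples, and are near-equal in length."""
    return samples[rank::size]

samples = list(range(10))
shards = [shard(samples, r, 4) for r in range(4)]
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

With tf.data, `dataset.shard(num_shards=hvd.size(), index=hvd.rank())` implements the same idea.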
Frequently Asked Questions¶
I/O Performance and Data Feeding Pipeline¶
For performance reasons, we recommend storing the data on the scratch
filesystem, accessible via the SCRATCH environment variable.
For efficient data feeding
we recommend using the TFRecord data format and using the dataset
API to feed
data to the model. In particular, note that the TFRecordDataset
constructor takes buffer_size and num_parallel_reads options, which
allow for prefetching and multi-threaded reads. These should be tuned
for good performance, but note that a thread is dispatched for
every independent read, so the number of inter-op threads needs
to be adjusted accordingly (see "Potential Issues" below). The buffer_size parameter is
specified in bytes and should be an integer multiple of the
node-local batch size for optimal performance.
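Putting these pieces together, a hypothetical input pipeline might look like the following (the file pattern, batch size, and tuning values are illustrative and should be adjusted for your workload):

```python
import os
import tensorflow as tf

pattern = os.path.join(os.environ['SCRATCH'], 'dataset', '*.tfrecord')
files = tf.data.Dataset.list_files(pattern)
ds = tf.data.TFRecordDataset(
    files,
    buffer_size=8 * 1024 * 1024,  # per-file read buffer, in bytes
    num_parallel_reads=4,         # one thread is dispatched per read
)
ds = ds.batch(64).prefetch(tf.data.AUTOTUNE)
```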
On Perlmutter, there is 126GB of node-local DRAM temporary storage,
mounted at /tmp. This can be used to speed up data pipelines,
either by staging data there once, at the start of a job, or by caching
dataset elements there via the
cache()
option for tf.data.Datasets.
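For example, caching to node-local storage might look like this (paths are illustrative; the cache file is written during the first pass over the data):

```python
import tensorflow as tf

# First epoch reads from the source and writes the cache under /tmp;
# subsequent epochs read from the node-local cache instead.
ds = tf.data.TFRecordDataset('/path/to/scratch/train.tfrecord')
ds = ds.cache('/tmp/train_cache')
```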