HPSS Best Practices¶
The best guide for how files should be stored in HPSS is how you might
want to retrieve them. If you are backing up against accidental
directory deletion or failure, then you would want to store your files
in a structure where you use htar to separately bundle up each
directory. On the other hand, if you are archiving data files, you
might want to bundle things up according to the month the data was taken
or detector run characteristics, etc. The optimal size for htar
bundles is between 100 GB and 2 TB, so you may need to create several htar
bundles for each set, depending on the size of the data. Other best practices
described in this section include:
- Grouping smaller files
- Ordering large retrievals
- Avoiding very large files
- Clearing disk cache after backing up
- Using Globus to access data remotely
- Using the xfer QOS for long-running transfers
Group Small Files Together¶
If you need to store many files smaller than 100 GB,
please use htar to bundle them
together before archiving. HPSS is a tape system and responds
differently than a typical file system. If you upload large numbers of
small files, they will be spread across dozens or hundreds of tapes,
requiring multiple loads into tape drives and repositionings of the
tape. Storing many small files in HPSS without bundling them together
will result in extremely long retrieval times for these files and will
slow down the HPSS system for all users.
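For example, a single directory of small files can be bundled into one archive written directly to HPSS with htar (a sketch; run_2024_01 is a placeholder directory name):
# bundle all the small files under run_2024_01/ into a single HPSS archive
htar -cvf run_2024_01.tar run_2024_01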
Order Large Retrievals¶
If you are retrieving many (> 100) files from HPSS, you need to order
your retrievals so that all files on a single tape will be retrieved in
a single pass in the order they are on the tape. This is most easily
accomplished by creating a list of the files you want ordered by their
appearance on tape and using that with the hsi or htar command.
NERSC has several scripts to help you generate an ordered list for
retrievals with these commands.
When retrieving large amounts of compressed data all at once, where the total size of the tape-archived data is greater than 100 TB and each data chunk is larger than 100 GB, please open a ticket with the HPSS team to receive guidance on performing the retrieval incrementally. This helps ensure that all users can access HPSS simultaneously.
Caution
If you're retrieving a large data set from HPSS with Globus, please see our Globus page for instructions on how to best retrieve files in correct tape order using the command line interface for Globus.
Generating A Tape-Sorted List¶
The script generate_sorted_list_for_hpss.py (already available in your shell path) will generate a list
of tape-sorted files. This list can be used with htar or hsi to
extract the files. For hsi, please see the description
below
for a more advanced script that will also re-create the directory
structure you had in HPSS.
To use the script, you first need a list of fully qualified file path names. If you do not already have such a list, you can query HPSS using the following command:
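One plausible form of that query (the HPSS path is a placeholder, and the grep patterns shown are just one way to do the filtering):
# recursively list the files you want, dropping directory headers and
# htar .idx index files, and save the result to temp.txt
hsi -q "ls -1 -R <HPSS_path>" 2>&1 | grep -v -e ':$' -e '\.idx$' > temp.txt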
(the stdout+stderr pipe to grep removes directories and index files from the output, keeping only files). Once you have the list of files, feed it to the sorting script:
generate_sorted_list_for_hpss.py -i temp.txt | grep -v "Generating sorted list from input file" > sorted_list.txt
The file sorted_list.txt will have a sorted list of files to
retrieve. If these are htar files, you can extract them with htar
into your current directory:
nersc$ awk '{print "htar -xvf",$1}' sorted_list.txt > extract.script
nersc$ chmod u+x extract.script
nersc$ ./extract.script
Tip
You can use the xfer QOS to parallelize your extractions using the sorted list. Just split the list into N smaller lists and submit N separate xfer jobs, as in the sketch below.
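For example, assuming GNU split and four chunks (the chunk count and output prefix are arbitrary):
# -n l/4 splits into 4 pieces without breaking lines; -d gives numeric suffixes
split -d -n l/4 sorted_list.txt sorted_list.part.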
Ordering hsi Retrievals and Recreating Directory Structure¶
The script hpss_get_sorted_files.py (already available in your shell path) will retrieve the files in the
proper tape order and also recreate the directory structure the files
had in HPSS.
To use the script, you first need a list of fully qualified file path names and/or directory path names. If you do not already have such a list, you can query HPSS using the following command:
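One plausible form of that query (the HPSS path is a placeholder):
# recursively list the files you want, dropping directory headers,
# and save the result to temp.txt
hsi -q "ls -1 -R <HPSS_path>" 2>&1 | grep -v ':$' > temp.txt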
(the stdout+stderr pipe to grep removes directories from the output, keeping only files). Once you have the list of files, feed it to the sorting script:
hpss_get_sorted_files.py -i temp.txt -o <your_target_directory (default: current directory)> -s <strip string (default: none)>
For files in HPSS under /home/e/elvis/unique_data, you might want to
strip off /home/e/elvis from the target directory. You can do that
by adding the -s /home/e/elvis flag.
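Putting that example together (a sketch; the retrieved files would land under ./unique_data in the current directory):
hpss_get_sorted_files.py -i temp.txt -s /home/e/elvis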
Avoid Very Large Files¶
File sizes greater than 2 TB can be difficult for HPSS to work with and lead to longer transfer times, increasing the possibility of transfer interruptions. Generally it's best to aim for file sizes in the 100 GB to 2 TB range. NERSC reserves the right to terminate HPSS transfers for files 20 TB or larger because these tie up resources on the system and are rarely retrieved successfully.
You can use tar and split to break up
large aggregates or large files into 500 GB chunks:
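A sketch of the archiving step (my_directory is a placeholder for the data to be archived; GNU split is assumed):
# stream a tar of the directory into split, producing 500 GB pieces with
# numeric suffixes (.00, .01, ...)
tar cvf - my_directory | split -d --bytes=500G - my_output_tarname.tar.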
This will generate a number of files with names like
my_output_tarname.tar.00, my_output_tarname.tar.01, which you can
use hsi put to archive into HPSS. When you retrieve these files, you
can recombine them with cat:
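A sketch of the recombination step (run in the directory holding the pieces):
# the shell expands the numbered pieces in order; tar unpacks the recombined stream
cat my_output_tarname.tar.* | tar xvf -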
Tip
If you're generating these chunks on the Lustre file system, be sure to follow the Lustre striping guidelines.
Clear Disk Cache After Backing Up¶
When working with a large amount of data, it is a good practice to delete
the data from the disk cache as soon as it has been backed up on HPSS.
This helps ensure optimal usage of HPSS resources. If you store individual files larger
than 1 GB that will not be needed in the near future, you are encouraged to open a
ticket and ask for these files to be purged from the disk cache. You can also clear the
cache yourself by running the hsi command migrate -f -P <filepath>; the -P flag ensures
that the disk copy is removed once the tape copy has been created.
Note that the migrate command will hang until migration to tape is
complete. To avoid this delay, you may want to wait until HPSS has
automatically sent the file to tape before running the migrate command;
to check whether data has been stored at the tape level, use the hsi
command ls -V <filepath>. Since files are normally migrated to tape
automatically within hours of being stored in HPSS, issuing the
migrate -P command 12 or 24 hours after the file
was created can be an easier way to achieve the same result.
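As a sketch (the file path is a placeholder), the check-then-purge sequence looks like:
# check whether the file already has a tape copy
hsi "ls -V /home/e/elvis/mydata.tar"
# once a tape copy exists, purge the disk-cache copy
hsi "migrate -f -P /home/e/elvis/mydata.tar"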
Use Globus to Access HPSS Data Remotely¶
Users can access HPSS data remotely by using
Globus. We recommend a two-stage process for moving data
between HPSS and a remote site. Use Globus to transfer
the data between NERSC and the remote site (your scratch or CFS
directory would make a useful temporary staging point at NERSC), and
use hsi or htar to move the data into or out of HPSS.
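For example, for an inbound transfer (a sketch; the project and dataset names are placeholders), once the Globus transfer has landed in a CFS staging directory, a second step moves the data into HPSS:
# stage 2: bundle the staged dataset into a single HPSS archive
cd $CFS/myproject/staging
htar -cvf remote_dataset.tar remote_dataset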
Use the Xfer QOS¶
Use the dedicated xfer QOS for long-running transfers to or from HPSS. You can also submit jobs to the xfer QOS after your computations are done. The xfer QOS is configured to limit the number of running jobs per user to the same number as the limit of HPSS sessions.
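A minimal job-script sketch (the time limit, job name, and archive/directory names are placeholders, and additional #SBATCH options may be appropriate for your case):
#!/bin/bash
#SBATCH --qos=xfer
#SBATCH --time=12:00:00
#SBATCH --job-name=hpss_archive

# archive the results directory into HPSS
htar -cvf results_2024.tar results_2024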