Pegasus adds checksum computation and integrity checking steps for non-shared filesystem deployments (nonsharedfs and condorio). The main motivation is to ensure that data transferred for a workflow is not inadvertently corrupted, either in transit during workflow execution or at rest on a staging site. Users can specify sha256 checksums for the input files in the replica catalog. If checksums are not provided, Pegasus computes them during data transfers and enforces them whenever a PegasusLite job starts on a remote node.
The checksums for outputs created by user executables are generated by pegasus-kickstart and published in its provenance record. The kickstart output is brought back to the submit host as part of the job's standard output, using the built-in HTCondor file transfer mechanisms. The generated checksums are then populated in the Stampede workflow database.
PegasusLite-wrapped jobs invoke pegasus-integrity-check before launching any computational task. pegasus-integrity-check computes checksums on files and compares them against the existing checksum values passed to it as input. The transfer tool pegasus-transfer has also been extended to invoke pegasus-integrity-check after completing the transfer of files.
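The compute-and-compare step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of pegasus-integrity-check: the function names `sha256_of` and `verify_files` and the shape of the `expected` mapping are assumptions made for this example.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(expected):
    """Compare each file against its reference checksum.

    `expected` (hypothetical structure) maps a file path to the hex
    digest it should have. Returns the paths whose digest differs.
    """
    return [
        str(path)
        for path, reference in expected.items()
        if sha256_of(path) != reference.lower()
    ]
```

A caller would treat a non-empty return value as an integrity error and refuse to start the computational task on the affected files.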
Integrity checks in the workflows are implemented at 3 levels:
after the input data has been staged to the staging server - pegasus-transfer verifies the integrity of the staged files.
before a compute task starts on a remote compute node - this ensures that the checksums of the data staged in match the checksums specified in the input replica catalog, or the ones computed when that piece of data was generated by a previous task in the workflow.
after the workflow output data has been transferred to storage servers - this ensures that output data staged to the final location was not corrupted in transit.
The figure below illustrates the points at which integrity checks are implemented. In our approach, the reference checksums for a job's input files are sent to the remote node where the job executes, using the built-in HTCondor file transfer mechanism.
Currently, there are a few scenarios in which integrity checks will not happen in non-shared filesystem deployments:
checksums are not enforced for user executables specified in the transformation catalog. In the future, we plan to support checksumming for staged executables.
checksums are not enforced if you have set pegasus.transfer.bypass.input.staging to true to bypass staging of input files via the staging server, and have not specified the checksums in the replica catalog.
pegasus-statistics now includes a section containing integrity statistics:
# Integrity Metrics
# Number of files for which checksums were compared/computed along with total time spent doing it.
171 files checksums generated with total duration of 8.705 secs
# Integrity Errors
# Total:
#       Total number of integrity errors encountered across all job executions(including retries) of a workflow.
# Failures:
#       Number of failed jobs where the last job instance had integrity errors.
Failures: 0 job failures had integrity errors
Currently, we support the following dials for integrity checking:
none - no integrity checking
full - full integrity checking for non shared filesystem deployments at the 3 levels described in this section.
By default, the integrity checking dial is set to full. To change this, you can set the following property
For raw input files for your workflow you can specify the checksums along with file locations in the Replica Catalog. Pegasus will check against these checksums when a PegasusLite job starts up on a remote node. If checksums are not specified, then Pegasus will compute them during the data transfer to the staging site, and use them.
To specify checksums in the replica catalog, you need to specify two additional attributes with your LFN -> PFN mapping:
checksum.type - the checksum type. Currently, only sha256 is supported.
checksum.value - the checksum for the file.
For example, here is how you would specify the checksum for a file in a file-based replica catalog:
# file-based replica catalog: 2018-10-25T02:10:02.293-07:00
f.a file:///lfs1/input-data/f.a checksum.type="sha256" checksum.value="ca8ed5988cb4ca0b67c45fd80fd17423aba2a066ca8a63a4e1c6adab067a3e92" site="condorpool"
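If you need to produce such entries for many input files, the checksum can be computed locally and formatted into the same line layout as the example above. The following is a minimal sketch, assuming the files are readable on the machine where the catalog is written; `replica_catalog_entry` is a hypothetical helper written for this example and is not part of Pegasus.

```python
import hashlib
from pathlib import Path


def replica_catalog_entry(lfn, local_path, site):
    """Build one file-based replica catalog line with sha256 attributes.

    Hypothetical helper: the PFN is derived from the local path as a
    file:// URL, which is an assumption suited to locally stored inputs.
    """
    digest = hashlib.sha256()
    with open(local_path, "rb") as handle:
        # Hash in 1 MiB chunks to keep memory use constant.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    pfn = Path(local_path).resolve().as_uri()
    return (
        f'{lfn} {pfn} checksum.type="sha256" '
        f'checksum.value="{digest.hexdigest()}" site="{site}"'
    )
```

Calling `replica_catalog_entry("f.a", "/lfs1/input-data/f.a", "condorpool")` would yield a line of the same shape as the catalog entry shown above, with the checksum taken from the file's current contents.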