Created: January 21, 2010
Updated: July 22, 2010
When running large-scale workflows on grids and clusters we are often interested in runtime information about the jobs in the workflow. Such information includes statistics about job runtimes, I/O, memory usage, and CPU utilization. This data can be used by workflow management systems to develop feasible schedules that minimize workflow makespan. For example, knowing the peak memory usage of a particular job type can help you decide how many concurrent instances of that job type are able to efficiently run on a single node with a given amount of memory. Knowing I/O and CPU utilization could help to co-schedule I/O-bound and CPU-bound tasks to maximize resource utilization. The data can also be used to debug problems such as memory leaks, or to develop summary statistics for papers and proposals.
The collection of this kind of data is what wfprof is designed to do. For a single process running on a local machine the data is relatively easy to collect using tools such as top for memory usage and CPU utilization, time for runtimes, and iostat for I/O. These tools cannot be used, however, when running a workflow that may contain thousands of individual jobs. Tools such as top and iostat produce output that is difficult to parse and to correlate with individual jobs. Also note that it is not sufficient to profile one job, or one category of jobs, and extrapolate to the entire workflow: each job in the workflow is given different parameters that may result in widely varying behavior.
Wfprof is composed of two tools, ioprof and pprof, that wrap the job and use the Linux ptrace() API to intercept and analyze the system calls and events produced by the job. These events are recorded, summarized, and logged by the wrappers in a format that is easy to parse (currently CSV, later XML). Because wfprof relies on the ptrace() API it is only available on Linux. In the future we may develop versions for other platforms if the need arises.
The ptrace() API has relatively low overhead. In our experiments ioprof adds about 10-15% to the runtime of a profiled workflow; some of that overhead is due to the fact that ioprof is written in Python, and we plan to rewrite it in C, which we expect will reduce the overhead substantially. Unlike ioprof, which must intercept almost all system calls, pprof only needs to intercept process lifecycle events (fork, exec, exit, etc.), for which the ptrace() API provides special facilities. As a result, in our experiments pprof does not add a noticeable amount to the runtime of a workflow.
ioprof, as the name suggests, collects information about I/O by tracing the system calls that operate on file descriptors (e.g. open(), close(), read(), write()). By examining the arguments passed to each system call and the value it returns, ioprof records every file opened by the job and the amount of I/O performed on it. The 10-15% overhead of ioprof does not affect the I/O statistics it collects, but it does inflate the runtime of the job. As a result, to get a complete profile we need to run the workflow twice: once to collect I/O stats with ioprof, and once to collect runtime, CPU, and memory stats with pprof.
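To make the mechanism concrete, here is a minimal sketch (not ioprof itself, which is currently a Python tool) of this kind of syscall interception using the ptrace() API. It assumes x86_64 Linux, where the syscall number and return value appear in the orig_rax and rax registers, and it only counts bytes read and written by the immediate child; following the whole process tree and mapping file descriptors back to file names, as ioprof does, is omitted:

#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <sys/syscall.h>
#include <sys/types.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s cmd [args...]\n", argv[0]);
        return 1;
    }

    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* ask to be traced by the parent */
        execvp(argv[1], &argv[1]);
        _exit(127);
    }

    long bread = 0, bwrite = 0;
    int status, in_syscall = 0;

    waitpid(child, &status, 0);                  /* initial stop after execvp() */
    while (1) {
        /* Resume the tracee and stop it again at the next syscall entry or exit. */
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);
        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            break;

        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);

        /* Stops alternate between entry and exit; the return value (bytes
         * transferred for read()/write()) is only meaningful on exit. */
        if (in_syscall) {
            long ret = (long)regs.rax;
            if (ret > 0 && regs.orig_rax == SYS_read)
                bread += ret;
            else if (ret > 0 && regs.orig_rax == SYS_write)
                bwrite += ret;
        }
        in_syscall = !in_syscall;
    }
    printf("bytes read: %ld, bytes written: %ld\n", bread, bwrite);
    return 0;
}

The sketch is run as a wrapper around the command to be profiled, which is the same wrapping model wfprof uses.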
The output of ioprof looks like this:
xform pid exe file bread nread bwrite nwrite nseek mode flags
mAdd:3.0 3573 /montage/bin/mAdd <stdin> 0 0 0 0 0 None None
mAdd:3.0 3573 /montage/bin/mAdd <stdout> 0 0 27 1 0 None None
mAdd:3.0 3573 /montage/bin/mAdd <stderr> 0 0 0 0 0 None None
mAdd:3.0 3573 /montage/bin/mAdd <socket> 0 0 0 0 0 None None
mAdd:3.0 3573 /montage/bin/mAdd <pipe> 7 2 0 0 0 None None
mAdd:3.0 3573 /montage/bin/mAdd <pipe> 5 2 0 0 0 None None
mAdd:3.0 3573 /montage/bin/mAdd shrunken_090442_7144.hdr 304 2 0 0 0 None O_RDONLY
mAdd:3.0 3573 /montage/bin/mAdd shrunken_090442_7144.hdr 304 2 0 0 0 None O_RDONLY
mAdd:3.0 3573 /montage/bin/mAdd simages_090442_7144.tbl 4096 1 0 0 1 None O_RDONLY
mAdd:3.0 3573 /montage/bin/mAdd simages_090442_7144.tbl 9102 4 0 0 1 None O_RDONLY
mAdd:3.0 3573 /montage/bin/mAdd shrunken_090442_7144_area.fits 0 0 7876800 1924 0 0666 O_RDWR|O_CREAT|O_TRUNC
...
mAdd:3.0 3574 /bin/uname <stdin> 0 0 0 0 0 None None
mAdd:3.0 3574 /bin/uname <stdout> 0 0 0 0 0 None None
mAdd:3.0 3574 /bin/uname <stderr> 0 0 0 0 0 None None
mAdd:3.0 3574 /bin/uname <pipe> 0 0 7 1 0 None None
mAdd:3.0 3575 /bin/basename <stdin> 0 0 0 0 0 None None
mAdd:3.0 3575 /bin/basename <stdout> 0 0 0 0 0 None None
mAdd:3.0 3575 /bin/basename <stderr> 0 0 0 0 0 None None
mAdd:3.0 3575 /bin/basename <pipe> 0 0 5 1 0 None None
- xform - the job type (short for transformation)
- pid - process ID
- exe - the executable
- file - the file read/written
- bread - the number of bytes read
- nread - the number of read operations
- bwrite - the number of bytes written
- nwrite - the number of write operations
- nseek - the number of seek operations
- mode - the mode from the open() call
- flags - the flags from the open() call
There are two things to notice about the above output. First, each job in the workflow can be composed of many processes, each running a different executable. Second, the output contains data for file descriptors that are not actually files, such as pipes, standard streams, and sockets.
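To illustrate how a row of this output might be consumed, the hypothetical snippet below parses the fields of one data row (copied from the example above) and skips the pseudo-file entries; the parsing code is ours and is not part of wfprof:

#include <stdio.h>

int main(void)
{
    /* One data row copied from the example ioprof output above. */
    const char *row = "mAdd:3.0 3573 /montage/bin/mAdd shrunken_090442_7144.hdr "
                      "304 2 0 0 0 None O_RDONLY";

    char xform[64], exe[256], file[256];
    long pid, bread, nread, bwrite, nwrite, nseek;

    if (sscanf(row, "%63s %ld %255s %255s %ld %ld %ld %ld %ld",
               xform, &pid, exe, file, &bread, &nread, &bwrite, &nwrite, &nseek) == 9) {
        if (file[0] == '<')
            /* <stdin>, <pipe>, <socket>, ... are descriptors with no backing file. */
            printf("%s: pseudo-file, skipping\n", file);
        else
            printf("%s (%s): %s read %ld bytes in %ld ops, wrote %ld bytes in %ld ops, %ld seeks\n",
                   xform, exe, file, bread, nread, bwrite, nwrite, nseek);
    }
    return 0;
}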
pprof collects statistics about the memory usage and runtimes of the processes that make up the job. Like ioprof it uses the ptrace() API, but it does NOT intercept every system call, which keeps its overhead negligible. Instead, pprof watches for specific process lifecycle events, such as the start of a new process or the end of a running one. These events correspond to the fork (clone), exec, and exit system calls, and pprof catches them using special facilities provided by the ptrace() API. When an event is detected it is quickly recorded and the process is allowed to continue.
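The sketch below (again not pprof itself) shows the lifecycle-event facilities in question on Linux: PTRACE_SETOPTIONS asks the kernel to stop the tracee only at fork/clone, exec, and exit, and PTRACE_CONT lets it run freely in between, so there are no per-syscall stops. Following every process in the tree and querying /proc at exit, which a real tracer needs to do, is left out:

#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/types.h>

/* Current wall-clock time in seconds, like the tstart/tstop columns below. */
static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s cmd [args...]\n", argv[0]);
        return 1;
    }

    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execvp(argv[1], &argv[1]);
        _exit(127);
    }

    int status, sig = 0;
    waitpid(child, &status, 0);            /* initial stop after execvp() */

    /* Ask the kernel to stop the tracee only at lifecycle events. */
    long opts = PTRACE_O_TRACEFORK | PTRACE_O_TRACEVFORK | PTRACE_O_TRACECLONE |
                PTRACE_O_TRACEEXEC | PTRACE_O_TRACEEXIT;
    ptrace(PTRACE_SETOPTIONS, child, NULL, (void *)opts);

    while (1) {
        /* Let the tracee run freely (no per-syscall stops), forwarding signals. */
        ptrace(PTRACE_CONT, child, NULL, (void *)(long)sig);
        if (waitpid(child, &status, 0) < 0 || WIFEXITED(status))
            break;

        int event = status >> 16;          /* PTRACE_EVENT_*, or 0 for a signal stop */
        sig = 0;
        if (event == PTRACE_EVENT_EXEC) {
            printf("%f exec in pid %d\n", now(), (int)child);
        } else if (event == PTRACE_EVENT_EXIT) {
            printf("%f exit of pid %d\n", now(), (int)child);
        } else if (event == PTRACE_EVENT_FORK || event == PTRACE_EVENT_VFORK ||
                   event == PTRACE_EVENT_CLONE) {
            unsigned long newpid = 0;
            ptrace(PTRACE_GETEVENTMSG, child, NULL, &newpid);
            printf("%f pid %d created pid %lu\n", now(), (int)child, newpid);
            /* The new process is auto-attached and stopped; a real tracer would
             * keep following it.  This sketch just detaches and lets it run. */
            int s;
            waitpid((pid_t)newpid, &s, 0);
            ptrace(PTRACE_DETACH, (pid_t)newpid, NULL, NULL);
        } else if (WIFSTOPPED(status)) {
            sig = WSTOPSIG(status);        /* re-deliver ordinary signals */
        }
    }
    return 0;
}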
The output of pprof looks like this:
xform pid ppid exe lstart lstop tstart tstop vmpeak rsspeak utime stime wtime
mAdd:3.0 26340 26339 /bin/uname 2 3 1253814165.849284 1253814165.850822 3788 480 0.000 0.000 0.001538
mAdd:3.0 26341 26339 /bin/basename 4 5 1253814165.851547 1253814165.853214 3788 476 0.000 0.000 0.001667
mAdd:3.0 26339 26338 /montage/bin/mAdd 1 6 1253814165.846537 1253814166.146199 17000 8020 0.140 0.060 0.299662
- xform - job type (short for transformation)
- pid - ID of process
- ppid - ID of process's parent
- exe - executable
- lstart - logical start time relative to other processes in the job
- lstop - logical stop time relative to other processes in the job
- tstart - actual start time of process
- tstop - actual stop time of process
- vmpeak - peak virtual memory usage
- rsspeak - peak physical memory usage
- utime - time spent scheduled in CPU in user mode
- stime - time spent scheduled in CPU in kernel mode
- wtime - wall time of job (tstop-tstart)
Notice that we can compute the CPU utilization of each process as (utime+stime)/wtime; for the mAdd process above this gives (0.140 + 0.060) / 0.299662 ≈ 0.67, or about 67% utilization. Also, vmpeak and rsspeak are maintained by the Linux kernel for each process; pprof simply queries the /proc file system when the process exits to find out what the values are.
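For illustration, these peaks can be read from /proc/<pid>/status, where the kernel reports VmPeak (peak virtual memory) and VmHWM (the resident set "high water mark"), both in kB. The sketch below, with a helper name of our choosing, shows the lookup; it reads the calling process's own entry, whereas pprof would read the traced process's entry as it exits:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

/* Read VmPeak (peak virtual memory) and VmHWM (peak resident set size) in kB
 * from /proc/<pid>/status.  Returns 0 on success, -1 if the file cannot be read. */
static int read_peak_memory(pid_t pid, long *vmpeak_kb, long *rsspeak_kb)
{
    char path[64], line[256];
    snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    *vmpeak_kb = *rsspeak_kb = -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmPeak:", 7) == 0)
            sscanf(line + 7, "%ld", vmpeak_kb);
        else if (strncmp(line, "VmHWM:", 6) == 0)
            sscanf(line + 6, "%ld", rsspeak_kb);
    }
    fclose(f);
    return 0;
}

int main(void)
{
    long vmpeak, rsspeak;
    if (read_peak_memory(getpid(), &vmpeak, &rsspeak) == 0)
        printf("vmpeak=%ld kB rsspeak=%ld kB\n", vmpeak, rsspeak);
    return 0;
}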
Once the logs are generated by ioprof and pprof there are several tools that can be used to analyze the data. These tools group the data by job type and output summary statistics (min, max, mean, variance, etc.) for the workflow.
The analysis tools include:
- iostats - computes statistics for I/O read and I/O write
- memstats - computes peak physical memory statistics
- runstats - computes runtime statistics
- utilstats - computes CPU utilization statistics
Each analysis script computes the statistics grouped by job type (xform) and by job type + executable. Note that memstats has to reconstruct the timeline of overlapping processes for each job to determine which processes were running concurrently, and then sum their peak memory accordingly. Because the peak memory usage of a process is not necessarily sustained for its entire lifetime, there is some error when adding these values together; as a result, the peak memory calculated by memstats at the job level is more of an upper bound than an exact value.
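As a rough illustration of what these scripts compute, here is a hypothetical sketch in the spirit of runstats (it is not the actual tool): it reads pprof log lines from standard input, keeps only the rows whose xform matches the first argument, and prints the count, min, max, mean, and (population) variance of the wtime column. The real tools group over every job type, and every job type + executable, rather than filtering on a single one:

#include <stdio.h>
#include <string.h>
#include <math.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s xform < pprof.log\n", argv[0]);
        return 1;
    }

    char line[4096], xform[256];
    double wtime, sum = 0, sumsq = 0, min = INFINITY, max = -INFINITY;
    long n = 0;

    while (fgets(line, sizeof(line), stdin)) {
        /* Columns: xform pid ppid exe lstart lstop tstart tstop
         *          vmpeak rsspeak utime stime wtime                */
        if (sscanf(line, "%255s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %lf",
                   xform, &wtime) != 2)
            continue;                      /* skips the header and malformed lines */
        if (strcmp(xform, argv[1]) != 0)
            continue;
        sum += wtime;
        sumsq += wtime * wtime;
        n++;
        if (wtime < min) min = wtime;
        if (wtime > max) max = wtime;
    }

    if (n == 0) {
        fprintf(stderr, "no rows matched %s\n", argv[1]);
        return 1;
    }
    double mean = sum / n;
    printf("count=%ld min=%f max=%f mean=%f variance=%f\n",
           n, min, max, mean, sumsq / n - mean * mean);   /* population variance */
    return 0;
}

For example, running the compiled sketch as ./a.out mAdd:3.0 < pprof.log would summarize the wtime values of the mAdd processes shown earlier.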
The wrapper scripts are in the Pegasus Subversion repository: https://pegasus.isi.edu/svn/pegasus/trunk/src/tools/wfprof/old/
We are currently working on a C version of ioprof that should be a bit faster than the current Python version.