GPTL - General Purpose Timing Library
(with optional PAPI interface)
Download the latest release by clicking here
Description
GPTL is a library to instrument C, C++, and Fortran codes for
performance analysis and profiling. The instrumentation can be inserted
manually by the user wherever they wish, or it can be done automatically by
the compiler at function entry and exit points if the application being
profiled is built with GNU, Pathscale, Intel, or PGI (8.0.2 or later)
compilers. To auto-instrument an application,
add -finstrument-functions (Pathscale, GNU, Intel)
or -Minstrument:functions (PGI) to the compile and link flags
of the source files to be profiled.
Automatic instrumentation of a number of MPI routines is also possible. In this
case no special compiler flags are necessary, and users can obtain profiles
with zero changes to their source files. See
Example 6 for further details.
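For compilers that accept -finstrument-functions, the flag makes the compiler emit calls to two hook functions, __cyg_profile_func_enter() and __cyg_profile_func_exit(), at every function entry and exit. The following is a minimal sketch of such hooks, not GPTL's own implementation. The hook_depth and hook_entries counters are hypothetical names used only for illustration.

```c
#include <stdio.h>

static int  hook_depth   = 0;  /* current call depth seen by the hooks (illustrative) */
static long hook_entries = 0;  /* total function entries observed (illustrative) */

/* The hooks themselves must not be instrumented, or they would recurse. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));

/* Called on entry to every instrumented function. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)call_site;
    ++hook_entries;
    /* Print the function address, indented by the current depth. */
    fprintf(stderr, "%*senter %p\n", 2 * hook_depth, "", this_fn);
    ++hook_depth;
}

/* Called on exit from every instrumented function. */
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    if (hook_depth > 0)
        --hook_depth;
}
```

The hooks receive addresses rather than names, which is why a post-processing step is needed to recover human-readable function names.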
Here is a portion of GPTL printout after running the HPCC benchmark
with compiler-based automatic instrumentation enabled:
Stats for thread 0:
                        Called Recurse Wallclock     max     min   FP_OPS e6_/_sec       CI
  total                      1       -    64.021  64.021  64.021 3.50e+08     5.47 7.20e-02
  HPCC_Init                 11      10     0.157   0.157   0.000    95799     0.61 8.90e-02
* HPL_pdinfo               120     118     0.019   0.018   0.000    96996     4.99 8.56e-02
* HPL_all_reduce             7       -     0.043   0.036   0.000      448     0.01 1.03e-02
* HPL_broadcast             21       -     0.041   0.036   0.000      126     0.00 6.72e-03
  HPL_pdlamch                2       -     0.004   0.004   0.000    94248    21.21 1.13e-01
* HPL_fprintf              240     120     0.001   0.000   0.000     1200     0.93 6.67e-03
  HPCC_InputFileInit        41      40     0.001   0.001   0.000      194     0.27 8.45e-03
  ReadInts                   2       -     0.000   0.000   0.000       12     3.00 1.61e-02
  PTRANS                    21      20    22.667  22.667   0.000 4.19e+07     1.85 3.19e-02
  MaxMem                     5       4     0.000   0.000   0.000      796     2.70 1.79e-02
* iceil_                   132       -     0.000   0.000   0.000      792     2.88 1.75e-02
* ilcm_                     14       -     0.000   0.000   0.000       84     2.71 1.71e-02
  param_dump                18      12     0.000   0.000   0.000       84     0.82 7.05e-03
  Cblacs_get                 5       -     0.000   0.000   0.000       30     1.43 1.67e-02
  Cblacs_gridmap            35      30     0.005   0.001   0.000      225     0.05 1.79e-03
* Cblacs_pinfo               7       1     0.000   0.000   0.000       40     3.08 1.54e-02
* Cblacs_gridinfo           60      50     0.000   0.000   0.000      260     2.28 2.10e-02
  Cigsum2d                   5       -     0.088   0.047   0.000      165     0.00 6.37e-03
  pdmatgen                  20       -    21.497   1.213   0.942 4.00e+07     1.86 3.08e-02
* numroc_                   96       -     0.000   0.000   0.000      576     2.87 1.69e-02
* setran_                   25       -     0.000   0.000   0.000      150     2.94 1.72e-02
* pdrand               3.7e+06   2e+06    15.509   0.041   0.000 1.72e+07     1.11 2.24e-02
  xjumpm_                57506   57326     0.219   0.030   0.000   230384     1.05 2.66e-02
  jumpit_                60180   40120     0.214   0.021   0.000   280840     1.32 2.18e-02
  slboot_                    5       -     0.000   0.000   0.000       30     1.30 1.01e-02
  Cblacs_barrier            10       5     0.481   0.167   0.000       50     0.00 3.26e-03
  sltimer_                  10       -     0.000   0.000   0.000      614     3.05 1.90e-02
* dwalltime00               15       -     0.000   0.000   0.000      150     2.54 2.57e-02
* dcputime00                15       -     0.000   0.000   0.000      373     3.06 1.91e-02
* HPL_ptimer_cputime        17       -     0.000   0.000   0.000      170     2.66 2.29e-02
  pdtrans                   14       9     0.124   0.045   0.000   573505     4.61 1.36e-01
  Cblacs_dSendrecv          12       8     0.115   0.042   0.000       56     0.00 2.24e-03
  pdmatcmp                   5       -     0.448   0.295   0.003 1.29e+06     2.87 2.94e-01
* HPL_daxpy               2596       -     0.008   0.000   0.000 1.34e+06   177.06 4.40e-01
* HPL_idamax              2966       -     0.007   0.000   0.000   767291   104.75 4.15e-01
...
Function names on the left of the output are indented to indicate their
parent, and depth in the call tree. An asterisk next to an entry means it
has more than one parent (see Example 2 for
further details). Other entries in this output show the number of
invocations, number of recursive invocations, wallclock timing
statistics, and PAPI-based information. In this example, HPL_daxpy
executed 1.34e6 floating point operations at 177.06 MFlops/sec, with a
computational intensity (floating point operations per memory reference) of
0.440.
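The derived columns can be reproduced from raw counts. The sketch below assumes that the e6_/_sec column is the event count divided by wallclock seconds and by 1e6, which is consistent with the "total" row (3.50e8 / 64.021 s is about 5.47). The function names are hypothetical, and the memory reference count is an input that does not appear in the printout.

```c
/* Millions of events per second, as in the "e6_/_sec" column.
   Returns 0.0 when no time has accumulated. */
double rate_e6_per_sec(double count, double wallclock_sec)
{
    return wallclock_sec > 0.0 ? count / wallclock_sec / 1.0e6 : 0.0;
}

/* Computational intensity: floating point operations per memory reference.
   Returns 0.0 when there are no memory references. */
double computational_intensity(double fp_ops, double mem_refs)
{
    return mem_refs > 0.0 ? fp_ops / mem_refs : 0.0;
}
```

Rows with very small wallclock values will not reproduce exactly from the printed numbers, because the times shown are rounded to three decimal places.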
If the PAPI library is
installed on the target platform, GPTL can be used to
access all available PAPI events.
To count floating point operations, for example, add
a call such as:
ret = GPTLsetoption (PAPI_FP_OPS, 1);
The second argument "1" in the above call means "enable". Any non-zero
integer means "enable", and a zero means "disable".
Multiple GPTL or PAPI options can be specified with additional
calls to GPTLsetoption(). The man pages provided with the
distribution describe the full API specification. The interface is
identical for Fortran and C/C++
codes, apart from Fortran's case-insensitivity.
Calls to GPTLstart() and GPTLstop() can be nested to an
arbitrary depth. As shown above, GPTL handles nested regions by
presenting output in an indented fashion. The example also shows how
auto-instrumentation
can be used to easily produce a dynamic call tree of
the application being profiled, where region names correspond to function
entry and exit points.
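To illustrate how nested start/stop pairs lead to indented output, here is a small stand-in timer. It is a sketch under simplifying assumptions (single thread, a fixed-size table, depth recorded at first start) and is not GPTL's implementation. The names region_start(), region_stop(), region_find(), and region_print() are hypothetical and only mirror the roles of GPTLstart() and GPTLstop().

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

#define MAX_REGIONS 64

typedef struct {
    char   name[32];
    int    depth;   /* nesting depth when the region was first started */
    long   calls;   /* number of invocations */
    double wall;    /* accumulated wallclock seconds */
    double t0;      /* start time of the currently open invocation */
} Region;

static Region regions[MAX_REGIONS];
static int nregions  = 0;
static int cur_depth = 0;

/* Wallclock time in seconds (C11 timespec_get; not monotonic, fine for a sketch). */
static double wall_now(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Look up a region by name; returns NULL if it has never been started. */
Region *region_find(const char *name)
{
    for (int i = 0; i < nregions; ++i)
        if (strcmp(regions[i].name, name) == 0)
            return &regions[i];
    return NULL;
}

/* Start (or restart) a named region. Returns 0 on success, -1 if the table is full. */
int region_start(const char *name)
{
    Region *r = region_find(name);
    if (r == NULL) {
        if (nregions == MAX_REGIONS)
            return -1;
        r = &regions[nregions++];
        snprintf(r->name, sizeof r->name, "%s", name);
        r->depth = cur_depth;
        r->calls = 0;
        r->wall  = 0.0;
    }
    r->t0 = wall_now();
    ++r->calls;
    ++cur_depth;
    return 0;
}

/* Stop a named region and accumulate its elapsed time. Returns -1 on misuse. */
int region_stop(const char *name)
{
    Region *r = region_find(name);
    if (r == NULL || cur_depth == 0)
        return -1;
    r->wall += wall_now() - r->t0;
    --cur_depth;
    return 0;
}

/* Print one line per region, indented two spaces per nesting level. */
void region_print(FILE *fp)
{
    for (int i = 0; i < nregions; ++i)
        fprintf(fp, "%*s%s  calls=%ld  wallclock=%.3f\n",
                2 * regions[i].depth, "", regions[i].name,
                regions[i].calls, regions[i].wall);
}
```

Starting "outer", then "inner", and stopping them in reverse order leaves "inner" recorded at depth 1, so region_print() shows it indented beneath "outer".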
What's new in the latest release
(gptl3_6_3.tar.gz)
- GPTLprint_memusage() converts memory usage units to MB by default (if possible).
- Added support for bluegene (see macros.make.bluegene).
- Bugfix for gptl_papilibraryinit (Fortran): needed to return an int.
- Bugfix for GPTLpr_summary: slave tried to receive too much data.
- Changed LINUX ifdef to HAVE_SLASHPROC. Not all Linux systems have /proc/pid/statm.
- Makefile uses "findstring xlf" to decide how Fortran defines are set.
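The HAVE_SLASHPROC item above refers to reading memory usage from /proc/pid/statm. As a rough illustration of that idea on Linux, and not GPTL's actual code, the hypothetical helper resident_mb() below reads the second field of /proc/self/statm (resident pages) and converts it to MB using the system page size.

```c
#include <stdio.h>
#include <unistd.h>

/* Resident set size of the calling process in MB.
   Returns -1.0 if /proc/self/statm is missing or cannot be parsed. */
double resident_mb(void)
{
    long size_pages = 0;  /* first field: total program size, in pages */
    long rss_pages  = 0;  /* second field: resident set size, in pages */
    long page_bytes;
    FILE *fp = fopen("/proc/self/statm", "r");

    if (fp == NULL)
        return -1.0;
    if (fscanf(fp, "%ld %ld", &size_pages, &rss_pages) != 2) {
        fclose(fp);
        return -1.0;
    }
    fclose(fp);

    page_bytes = sysconf(_SC_PAGESIZE);
    if (page_bytes <= 0)
        return -1.0;
    return (double)rss_pages * (double)page_bytes / (1024.0 * 1024.0);
}
```

On systems without /proc the function returns -1.0, which mirrors why a feature macro rather than a plain LINUX check is the safer guard.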
What was new in previous releases
- Bugfix for auto-profiling MPI_Recv wrapper (when ENABLE_PMPI is set in macros.make):
previous version could cause hangs in some cases.
- Added auto-profiling entries for more MPI routines: MPI_Iprobe, MPI_Probe, MPI_Ssend,
MPI_Alltoallv, MPI_Scatterv, MPI_Test.
- Better estimates of bytes transferred for auto-profiled MPI routines.
- Makefile simplification. Can now run "make" from ctests/ and ftests/.
- Initial set of PMPI wrappers. Automatically generates MPI times and
statistics for the most common MPI calls.
- Option to synchronize and time certain collectives (see ENABLE_PMPI in
macros.make.linux). Note that the set of MPI routines profiled is not yet
complete. This option has not yet been fully tested.
- Bugfix for when omp_get_max_threads() returns zero.
- GPTLallocate() returns error when asked for zero bytes.
- OpenMP applications now work when GPTL is built with PTHREADS.
- Fortran bugfix enables longer event names. This allows one to enable the
PAPI native event names which can be long.
- Removed some of the less useful entries from ctests/ and ftests/.
- Tested on AIX.
- Easier linking with C++ applications.
- Options for call-tree generation based on number of invocations per
parent: most_frequent (default), first_parent, last_parent, full_tree.
Previous versions always used first_parent. The new full_tree option can
produce a large amount of output, depending on the nature of the call tree, but it
can also be very useful because it shows all parent-child relationships.
- Derived events based on PAPI:
- L2 miss rate (GPTL_L2MRT)
- Load-stores per L2 miss (GPTL_LSTPL2M)
- L3 miss rate (GPTL_L3MRT)
- Function GPTLpr_summary() now takes an MPI communicator as its
argument. Passing an "int" doesn't work with some MPI implementations (e.g. OpenMPI).
- New subroutine gptlprocess_namelist() enables Fortran codes to
use a namelist to set GPTL options. This allows changing settings
without having to recompile or relink application
codes. See Example 5 for example usage.
- New function GPTLget_eventvalue() allows an application to query
the current value of any PAPI-based event, including derived events.
- New function GPTLget_wallclock() allows an application to query
the current wallclock accumulation for any region.
- New function GPTLbarrier() calls MPI_Barrier() and times it.
- parsegptlout.pl now takes header name as an argument rather
than an integer index.
- hex2name.pl converts auto-instrumented entries for thread summary regions.
- Derived events based on PAPI:
- Computational intensity (GPTL_CI)
- Instructions per cycle (GPTL_IPC)
- FP ops per cycle (GPTL_FPC)
- FP ops per instruction (GPTL_FPI)
- Load-store instruction fraction (GPTL_LSTPI)
- L1 miss rate (GPTL_DCMRT)
- Load-stores per L1 miss (GPTL_LSTPDCM)
- New entry points GPTLevent_code_to_name() and GPTLevent_name_to_code()
- Ability to disable portions of printed output (e.g. GPTLdopr_preamble)
- Better description of enabled events
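The call-tree options listed above (most_frequent, first_parent, last_parent) decide which parent a multi-parent region is printed under. The sketch below shows one plausible reading of those policies. It assumes the parents are stored in order of first invocation, and all type and function names are hypothetical rather than taken from GPTL. Under full_tree the region would simply be printed under every parent, so no selection is needed.

```c
#include <stddef.h>

typedef enum { FIRST_PARENT, LAST_PARENT, MOST_FREQUENT } ParentPolicy;

typedef struct {
    const char *name;   /* parent region name */
    long        calls;  /* times this parent invoked the child */
} ParentLink;

/* Choose the parent under which a child region is printed.
   parents[] is assumed to be ordered by first invocation.
   Returns an index into parents[], or -1 if there are no parents. */
int choose_parent(const ParentLink *parents, int n, ParentPolicy policy)
{
    int best = 0;

    if (parents == NULL || n <= 0)
        return -1;

    switch (policy) {
    case FIRST_PARENT:
        return 0;
    case LAST_PARENT:
        return n - 1;
    case MOST_FREQUENT:
        /* Ties keep the earliest parent. */
        for (int i = 1; i < n; ++i)
            if (parents[i].calls > parents[best].calls)
                best = i;
        return best;
    }
    return -1;
}
```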
Features
- Low overhead.
- No external dependencies (PAPI interface is optional).
- Automatically multiplexes requested PAPI counters when required.
- Thread-safe, and reports per-thread statistics for multi-threaded
codes.
- Includes utility functions to print memory usage
(GPTLprint_memusage()) and get timestamps (GPTLstamp()).
- Includes utility scripts to post-process multi-threaded and
multi-tasked output for easy assessment of load balance
characteristics.
- Support for derived (PAPI-based) events such as computational
intensity and instructions per cycle. Run ctests/avail to list
available events.
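Derived events are ratios of raw counters. As an illustration only, the sketch below computes four of the ratios named on this page from their descriptions (instructions per cycle, FP ops per cycle, FP ops per instruction, load-store instruction fraction). The Counters struct and function names are hypothetical; in GPTL the raw values would come from PAPI.

```c
typedef struct {
    long long cycles;        /* total cycles */
    long long instructions;  /* total instructions */
    long long fp_ops;        /* floating point operations */
    long long load_stores;   /* load/store instructions */
} Counters;

/* Guarded division: returns 0.0 when the denominator is zero. */
static double ratio(long long num, long long den)
{
    return den != 0 ? (double)num / (double)den : 0.0;
}

/* Instructions per cycle. */
double ipc(const Counters *c)   { return ratio(c->instructions, c->cycles); }

/* Floating point operations per cycle. */
double fpc(const Counters *c)   { return ratio(c->fp_ops, c->cycles); }

/* Floating point operations per instruction. */
double fpi(const Counters *c)   { return ratio(c->fp_ops, c->instructions); }

/* Fraction of instructions that are loads or stores. */
double lstpi(const Counters *c) { return ratio(c->load_stores, c->instructions); }
```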
Download and Installation
Examples
These pages contain simple codes which illustrate the use of some features of
GPTL. All examples were run on a Linux x86 system using GNU compilers.
Example 1 is a manually-instrumented
Fortran code which uses PAPI to count floating point
operations.
Example 2 is a C code compiled
with gcc's auto-instrumentation hooks to print a dynamic call tree. Perl
script hex2name.pl is used to convert addresses to
human-readable names.
Example 3 is a simple MPI code, the
output of which is post-processed using Perl script
parsegptlout.pl to examine load imbalance.
Example 4 is an auto-instrumented C++ code.
Issues related to in-line constructors are illustrated.
Example 5 is a Fortran code which uses
gptlprocess_namelist() and an associated namelist file to
set GPTL options.
Example 6 is a Fortran code which utilizes the
new ENABLE_PMPI option to automatically time various MPI calls and print the
average number of bytes transferred.
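Load imbalance, as examined in Example 3, comes down to comparing one region's time across MPI tasks. The following sketch is a generic calculation and does not reproduce what parsegptlout.pl does. LoadStats and load_stats() are hypothetical names, and the imbalance measure used here (maximum over mean) is one common choice among several.

```c
typedef struct {
    double min;        /* fastest task */
    double max;        /* slowest task */
    double mean;       /* average over tasks */
    double imbalance;  /* max / mean; 1.0 means perfectly balanced */
} LoadStats;

/* Summarize per-task wallclock times for one region.
   Returns 0 on success, -1 if the input is empty or invalid. */
int load_stats(const double *times, int ntasks, LoadStats *out)
{
    double sum;

    if (times == 0 || out == 0 || ntasks <= 0)
        return -1;

    out->min = out->max = sum = times[0];
    for (int i = 1; i < ntasks; ++i) {
        if (times[i] < out->min) out->min = times[i];
        if (times[i] > out->max) out->max = times[i];
        sum += times[i];
    }
    out->mean = sum / ntasks;
    out->imbalance = out->mean > 0.0 ? out->max / out->mean : 0.0;
    return 0;
}
```

An imbalance well above 1.0 means the slowest task spends much longer in the region than the average task, so the other tasks are likely waiting on it.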
Bugs
- Calling GPTLinitialize() after GPTLfinalize() is untested.
- Increasing the thread count after GPTLinitialize() has been
called does not work when GPTL threading is via OMP. This should
work OK with PTHREADS, but is untested.
- The PAPI library warns about using omp_get_thread_num() as the
underlying routine to get the thread number. Therefore it's better to
build GPTL with PTHREADS than OPENMP (see macros.make.linux).
Bug Reports
Please email me bug reports and/or feature requests (jmrosinski2 AT gmail DOT com).
Author
GPTL was written
by Jim Rosinski,
currently at NOAA/ESRL, formerly
at ORNL,
SiCortex,
and NCAR.
Copyright
This software is Open Source. My only request is that you don't
embed GPTL library source itself in software that you intend to sell.