Tuesday, September 27, 2011

Google Performance Tools


Version 0.5 (last updated Fri Mar 11 05:58:27 PST 2005) of google-perftools includes the following four tools:
* thread-caching malloc
* heap-checking using tcmalloc
* heap-profiling using tcmalloc
* CPU profiler
The project claims "The fastest malloc we've seen; works particularly well with threads and STL. Also: thread-friendly heap-checker, heap-profiler, and cpu-profiler."  The tools work with C++ programs on Linux (Red Hat 9).  The README mentions that Google is porting the tools to Windows, but gives no clear date for when that will happen.  The tcmalloc library's performance is impressive, and the CPU profiler works great, especially the graphical display of its output.
To download, go to http://sourceforge.net/projects/goog-perftools.  To install, read the INSTALL file in the unzipped directory; the main steps are below.  If you are running a Linux 2.6.x kernel, there are also RPMs ready to install (without having to compile):

1. `cd' to the directory containing the package's source code and type
`./configure' to configure the package for your system. If you're
using `csh' on an old version of System V, you might need to type
`sh ./configure' instead to prevent `csh' from trying to execute
`configure' itself. Running `configure' takes a while; while it runs,
it prints some messages telling which features it is checking for.
2. Type `make' to compile the package.
3. Optionally, type `make check' to run any self-tests that come with
the package.
4. Type `make install' to install the programs and any data files and
documentation.

CPU Profiler



Your code can only scale if it uses CPU cycles efficiently - cycles spent doing real work, not overhead.  Enter CPU profiling.  The google-perftools CPU profiler lets you profile in two ways: 1) compile the profiling library into your code and run it; 2) set environment variables, if you can't recompile the program because you don't have the source.  The example graphical output looks like this - NOTICE: the largest box in the display is the BIGGEST CPU consumer, the next largest is the second biggest, and so on - very cool:

This graph is generated by the 'pprof' command, part of google-perftools, with the '--gv' option (which displays via Ghostview).  You must have 'dot' installed.  Dot is distributed by AT&T Research and is available for restricted use. The companion tools are called dotty, neato and tcldot. You can obtain a non-commercial license for dot/dotty/neato/tcldot and download the software from the web page: http://www.research.att.com/sw/tools/graphviz.
You run a CPU-profiler profiled binary, for example:
# ./profiler4_unittest 200 10 /tmp/cpu.prof
Then you generate the analysis report using the 'pprof' tool, like this:
# pprof "profiler4_unittest" "/tmp/cpu.prof"
     460  25.2%  25.2%      460  25.2% __pthread_mutex_lock_internal
     399  21.9%  47.1%      399  21.9% __pthread_mutex_unlock_usercnt
     196  10.8%  57.9%      196  10.8% vfprintf
     156   8.6%  66.5%      156   8.6% __lll_mutex_lock_wait
     141   7.7%  74.2%      141   7.7% __lll_mutex_unlock_wake
     110   6.0%  80.2%      240  13.2% __vsnprintf
      60   3.3%  83.5%       60   3.3% _IO_default_xsputn_internal
      50   2.7%  86.3%       50   2.7% _IO_str_init_static_internal
      45   2.5%  88.7%       45   2.5% _IO_old_init
      35   1.9%  90.7%      462  25.4% __snprintf
      34   1.9%  92.5%       34   1.9% __find_specmb
(Reading the columns: samples in the function itself, percentage of total samples in the function itself, a running cumulative total of that percentage, then samples and percentage for the function including its callees.)
Here are some other 'pprof' commands, many of which generate graphical output:
% pprof --gv "program" "profile"
  Generates annotated call-graph and displays via "gv"

% pprof --gv --focus=Mutex "program" "profile"
  Restrict to code paths that involve an entry that matches "Mutex"

% pprof --gv --focus=Mutex --ignore=string "program" "profile"
  Restrict to code paths that involve an entry that matches "Mutex"
  and does not match "string"

% pprof --list=IBF_CheckDocid "program" "profile"
  Generates disassembly listing of all routines with at least one
  sample that matches the --list= pattern.  The listing is
  annotated with the flat and cumulative sample counts at each line.

Heap Profile & TCMalloc

The interesting point google-perftools makes about memory allocation is TCMalloc, or Thread-Caching Malloc. From its documentation, by Sanjay Ghemawat and Paul Menage <opensource@google.com>:
"Motivation:
TCMalloc is faster than the glibc 2.3 malloc (available as a separate library called ptmalloc2) and other mallocs that I have tested. ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair. Speed is important for a malloc implementation because if malloc is not fast enough, application writers are inclined to write their own custom free lists on top of malloc. This can lead to extra complexity, and more memory usage unless the application writer is very careful to appropriately size the free lists and scavenge idle objects out of the free list.
TCMalloc also reduces lock contention for multi-threaded programs. For small objects, there is virtually zero contention. For large objects, TCMalloc tries to use fine grained and efficient spinlocks. ptmalloc2 also reduces lock contention by using per-thread arenas but there is a big problem with ptmalloc2's use of per-thread arenas. In ptmalloc2 memory can never move from one arena to another. This can lead to huge amounts of wasted space. For example, in one Google application, the first phase would allocate approximately 300MB of memory for its data structures. When the first phase finished, a second phase would be started in the same address space. If this second phase was assigned a different arena than the one used by the first phase, this phase would not reuse any of the memory left after the first phase and would add another 300MB to the address space. Similar memory blowup problems were also noticed in other applications.
Another benefit of TCMalloc is space-efficient representation of small objects. For example, N 8-byte objects can be allocated while using space approximately 8N * 1.01 bytes. I.e., a one-percent space overhead. ptmalloc2 uses a four-byte header for each object and (I think) rounds up the size to a multiple of 8 bytes and ends up using 16N bytes."
See also:
Automatic Leaks Checking Support
Profiling heap usage

http://www.performancewiki.com/google-perftools.html
