Version 0.5 of google-perftools (last updated Fri Mar 11 05:58:27 PST 2005) includes the following four tools:
* thread-caching malloc
* heap-checking using tcmalloc
* heap-profiling using tcmalloc
* CPU profiler
The project describes itself as "The fastest malloc we've seen; works particularly well with threads and STL. Also: thread-friendly heap-checker, heap-profiler, and cpu-profiler." The tools work on Linux 9 with C++ programs. The README says that Google is porting them to Windows, but gives no clear date for when that will happen. The tcmalloc library's performance is impressive, and the CPU profiler works great, especially the display of its output.
To download, go to http://sourceforge.net/projects/goog-perftools. To install, read the INSTALL file in the unpacked directory; the main installation steps are reproduced below. If you are running a Linux 2.6.x kernel, there are RPMs ready to install (without having to compile); otherwise, build from source:
1. `cd' to the directory containing the package's source code and type `./configure' to configure the package for your system. If you're using `csh' on an old version of System V, you might need to type `sh ./configure' instead to prevent `csh' from trying to execute `configure' itself. Running `configure' takes a while; while it runs, it prints messages telling which features it is checking for.
2. Type `make' to compile the package.
3. Optionally, type `make check' to run any self-tests that come with the package.
4. Type `make install' to install the programs and any data files and documentation.
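Once the package is installed, using the libraries from your own code is mostly a matter of linking; the perftools documentation describes the -ltcmalloc flag for the allocator and -lprofiler for the CPU profiler. Here is a minimal sketch (the file name and program are hypothetical):

  // hello_perftools.cc -- hypothetical example program
  //
  // Build against the installed libraries, e.g.:
  //   g++ hello_perftools.cc -o hello_perftools -ltcmalloc    (use TCMalloc)
  //   g++ hello_perftools.cc -o hello_perftools -lprofiler    (use the CPU profiler)
  #include <cstdio>
  #include <cstdlib>

  int main() {
    // When linked with -ltcmalloc, every malloc/free and new/delete in the
    // program is served by TCMalloc instead of the glibc allocator.
    char* buf = static_cast<char*>(std::malloc(64));
    std::printf("allocated 64 bytes at %p\n", static_cast<void*>(buf));
    std::free(buf);
    return 0;
  }

No source changes are needed to switch allocators; relinking is enough.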
CPU Profiler
The annotated call graph is generated by the 'pprof' command, part of google-perftools, with the '--gv' option, which displays the result with 'gv' (ghostview). You must have 'dot' installed. Dot is distributed by AT&T Research for restricted use; the companion tools are dotty, neato and tcldot. You can obtain a non-commercial license for dot/dotty/neato/tcldot and download the software from http://www.research.att.com/sw/tools/graphviz.
You run a binary built with the CPU profiler, for example:
# ./profiler4_unittest 200 10 /tmp/cpuprofile
Then you generate an analysis report using the 'pprof' tool, like this:
# ./profiler4_unittest 200 10 /tmp/cpu.prof
# pprof "profiler4_unittest" "/tmp/cpu.prof"
 460  25.2%  25.2%   460  25.2%  __pthread_mutex_lock_internal
 399  21.9%  47.1%   399  21.9%  __pthread_mutex_unlock_usercnt
 196  10.8%  57.9%   196  10.8%  vfprintf
 156   8.6%  66.5%   156   8.6%  __lll_mutex_lock_wait
 141   7.7%  74.2%   141   7.7%  __lll_mutex_unlock_wake
 110   6.0%  80.2%   240  13.2%  __vsnprintf
  60   3.3%  83.5%    60   3.3%  _IO_default_xsputn_internal
  50   2.7%  86.3%    50   2.7%  _IO_str_init_static_internal
  45   2.5%  88.7%    45   2.5%  _IO_old_init
  35   1.9%  90.7%   462  25.4%  __snprintf
  34   1.9%  92.5%    34   1.9%  __find_specmb
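To profile your own binary instead of the bundled unittest, a minimal sketch looks like the following (this assumes the ProfilerStart/ProfilerStop functions declared in <google/profiler.h> and linking with -lprofiler; the file names here are hypothetical):

  // profile_me.cc -- hypothetical example: profile one region of a program
  //
  // Build:  g++ profile_me.cc -o profile_me -lprofiler
  #include <google/profiler.h>
  #include <cstdio>

  // Some CPU-bound work worth profiling.
  static unsigned long busy_work() {
    unsigned long sum = 0;
    for (int i = 0; i < 20000; ++i)
      for (int j = 0; j < 20000; ++j)
        sum += static_cast<unsigned long>(i) * j;
    return sum;
  }

  int main() {
    ProfilerStart("/tmp/profile_me.prof");   // begin writing samples to this file
    unsigned long result = busy_work();
    ProfilerStop();                          // flush and close the profile
    std::printf("result = %lu\n", result);
    return 0;
  }

The resulting /tmp/profile_me.prof file is then analyzed with 'pprof' exactly as above, e.g. 'pprof profile_me /tmp/profile_me.prof'. (The profiler can also be enabled for a whole run, without code changes, by setting the CPUPROFILE environment variable to an output file name.)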
Here are some other 'pprof' commands, many of which generate graphical output:
% pprof --gv "program" "profile"
    Generates an annotated call-graph and displays it via "gv".
% pprof --gv --focus=Mutex "program" "profile"
    Restricts to code paths that involve an entry matching "Mutex".
% pprof --gv --focus=Mutex --ignore=string "program" "profile"
    Restricts to code paths that involve an entry matching "Mutex" and not matching "string".
% pprof --list=IBF_CheckDocid "program" "profile"
    Generates a disassembly listing of all routines with at least one sample that match the --list= pattern. The listing is annotated with the flat and cumulative sample counts at each line.
Check out the Google CPU Profiler documentation for more details.
Heap Profile & TCMalloc
The interesting point google-perftools brings up about memory allocation is TCMalloc, or Thread-Caching Malloc. The Motivation section of the TCMalloc documentation, by Sanjay Ghemawat and Paul Menage <opensource@google.com>, explains (a rough benchmark sketch follows the excerpt):
TCMalloc is faster than the glibc 2.3 malloc (available as a separate library called ptmalloc2) and other mallocs that I have tested. ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair. Speed is important for a malloc implementation because if malloc is not fast enough, application writers are inclined to write their own custom free lists on top of malloc. This can lead to extra complexity, and more memory usage unless the application writer is very careful to appropriately size the free lists and scavenge idle objects out of the free list.
TCMalloc also reduces lock contention for multi-threaded programs. For small objects, there is virtually zero contention. For large objects, TCMalloc tries to use fine grained and efficient spinlocks. ptmalloc2 also reduces lock contention by using per-thread arenas but there is a big problem with ptmalloc2's use of per-thread arenas. In ptmalloc2 memory can never move from one arena to another. This can lead to huge amounts of wasted space. For example, in one Google application, the first phase would allocate approximately 300MB of memory for its data structures. When the first phase finished, a second phase would be started in the same address space. If this second phase was assigned a different arena than the one used by the first phase, this phase would not reuse any of the memory left after the first phase and would add another 300MB to the address space. Similar memory blowup problems were also noticed in other applications.
Another benefit of TCMalloc is space-efficient representation of small objects. For example, N 8-byte objects can be allocated while using space approximately 8N * 1.01 bytes, i.e., a one-percent space overhead. ptmalloc2 uses a four-byte header for each object and (I think) rounds up the size to a multiple of 8 bytes and ends up using 16N bytes.
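To get a feel for the malloc/free pair cost quoted above, here is a rough benchmarking sketch of my own (not part of perftools; the object size, iteration count, and gettimeofday-based timing are arbitrary choices). Building it once against glibc and once with -ltcmalloc lets you compare the two allocators on your own machine:

  // malloc_bench.cc -- rough timing of small malloc/free pairs
  //
  // Default allocator:  g++ -O2 malloc_bench.cc -o bench_glibc
  // TCMalloc:           g++ -O2 malloc_bench.cc -o bench_tcmalloc -ltcmalloc
  #include <cstdio>
  #include <cstdlib>
  #include <sys/time.h>

  int main() {
    const int kIterations = 1000000;   // number of malloc/free pairs to time
    const size_t kSize = 32;           // a "small object" size in bytes
    volatile char sink = 0;            // keeps the allocations from being optimized away

    timeval start, end;
    gettimeofday(&start, NULL);
    for (int i = 0; i < kIterations; ++i) {
      char* p = static_cast<char*>(std::malloc(kSize));
      p[0] = static_cast<char>(i);     // touch the memory
      sink += p[0];
      std::free(p);
    }
    gettimeofday(&end, NULL);

    double usec = (end.tv_sec - start.tv_sec) * 1e6 +
                  (end.tv_usec - start.tv_usec);
    std::printf("%.1f ns per malloc/free pair (sink=%d)\n",
                usec * 1000.0 / kIterations, static_cast<int>(sink));
    return 0;
  }

The absolute numbers will of course differ from the 2.8 GHz P4 figures quoted above, but the relative difference between the two binaries is what the TCMalloc authors are describing.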
See also:
* Automatic Leak-Checking Support
* Profiling heap usage
http://www.performancewiki.com/google-perftools.html