
Valgrind

A practical walkthrough of Valgrind's most useful tools — Memcheck, Callgrind, Cachegrind, and Massif — and when to reach for each one.

Valgrind is a framework that runs your program on a synthetic CPU and lets different analysis tools observe every instruction. It’s not one tool — the --tool= flag picks which analysis runs.

valgrind ./program           # runs memcheck by default
valgrind --tool=callgrind ./program
valgrind --tool=massif ./program

The main ones worth knowing:

Tool             What it does
memcheck         Finds memory errors and leaks
callgrind        Counts instructions and records the call graph
cachegrind       Simulates cache behaviour
massif           Tracks heap allocations over time
helgrind / drd   Finds data races in threaded programs

The tradeoff for all of them is speed — your program runs 5–100× slower because Valgrind is simulating a CPU rather than running on one. The upside is that results are deterministic and need no hardware counters.


Build flags

gcc -g -O0 -o demo src/demo.c   # callgrind
gcc -g -O1 -o demo src/demo.c   # memcheck

-g is the important one. Without it you get addresses instead of filenames and line numbers. -O0 matters mainly for callgrind — at -O2 the compiler inlines small functions and merges loops, so the call graph stops reflecting what you actually wrote. For memcheck, -O1 or -O2 is fine since Valgrind can still track memory regardless of optimisation level.


Memcheck

Memcheck catches the bugs that are otherwise hard to reproduce:

  • reads past the end of a heap allocation
  • reads from memory that was freed
  • reads of uninitialised values
  • memory leaks

valgrind \
  --tool=memcheck \
  --leak-check=full \
  --show-leak-kinds=all \
  --track-origins=yes \
  ./program

--leak-check=full gives you a stack trace for every leak instead of just a count. --track-origins=yes tells you where an uninitialised value came from — it roughly doubles the slowdown but saves a lot of guessing.

A typical error looks like this:

==4821== Invalid read of size 4
==4821==    at 0x10919E: process (demo.c:42)
==4821==    by 0x109312: main (demo.c:71)
==4821==  Address 0x52040a0 is 0 bytes after a block of size 40 alloc'd
==4821==    at 0x484DA83: malloc (vg_replace_malloc.c:431)
==4821==    by 0x1090F2: main (demo.c:65)

The ==PID== prefix appears on every line — useful when tracing child processes. The first stack trace is where the bad access happened. The second is where the memory was originally allocated. With those two together you usually know what went wrong without touching a debugger.

Leak output at the end of a run:

==4821== LEAK SUMMARY:
==4821==    definitely lost: 40 bytes in 1 blocks
==4821==    indirectly lost: 0 bytes in 0 blocks
==4821==      possibly lost: 0 bytes in 0 blocks
==4821==    still reachable: 0 bytes in 0 blocks

“Definitely lost” is a real leak — pointer gone, memory never freed. “Still reachable” means a live pointer exists at exit; often fine for global allocations that the OS reclaims. Focus on “definitely lost” first.

For CI, --error-exitcode=1 makes Valgrind exit non-zero when it finds anything:

valgrind --tool=memcheck --leak-check=full --error-exitcode=1 ./program

Callgrind

Callgrind counts every instruction the CPU executes, per function, and records who called whom. No sampling — you get exact counts.

valgrind \
  --tool=callgrind \
  --callgrind-out-file=callgrind.out \
  --dump-instr=yes \
  --collect-jumps=yes \
  ./program

--dump-instr=yes records at instruction granularity rather than just source lines. --collect-jumps=yes adds branch taken/not-taken counts. The output goes to callgrind.out.

During the run you see:

==8559== Callgrind, a call-graph generating cache profiler
==8559== Command: ./demo data
==8559== 
==8559== I   refs:      194,508,428

“I refs” is the total instruction count. That number is what callgrind distributes across your functions.

Reading the results

callgrind_annotate callgrind.out

The per-function table, sorted by instruction count:

Ir                   file:function
──────────────────────────────────────────────────────────────────
57,434,686 (29.53%)  strtod_l.c:____strtod_l_internal   [libc]
20,140,768 (10.35%)  strtod_l.c:str_to_mpn              [libc]
11,300,000 ( 5.81%)  strtok_r.c:strtok_r                [libc]
10,442,480 ( 5.37%)  strcmp-avx2.S:__strcmp_avx2         [libc]
10,248,782 ( 5.27%)  src/demo.c:find_or_create
 8,608,707 ( 4.43%)  strcspn-sse4.c:__strcspn_sse42      [libc]
 7,650,898 ( 3.93%)  src/demo.c:accumulate
 4,450,000 ( 2.29%)  src/demo.c:parse_line
 3,750,000 ( 1.93%)  src/demo.c:parse_score
 2,800,535 ( 1.44%)  iofgets.c:fgets                    [libc]
   750,230 ( 0.39%)  src/demo.c:read_file

This run was a CSV parser — 50,000 rows, three strtod calls per row. The results make the problem obvious: strtod and its internals take ~50% of all instructions. The next biggest thing in our own code is find_or_create at 5.3%, which does a linear strcmp scan through 20 categories on every row.

The Ir column is exclusive cost — instructions executed inside that function, not counting callees. That’s why parse_score shows only 1.93% even though it calls strtod which accounts for 29.53%; the callee cost is attributed to strtod itself.

Source annotation

callgrind_annotate --auto=yes callgrind.out

This annotates every line of your source with its instruction count:

 1,200,000 ( 0.62%)  static double parse_score(const char *field) {
 1,050,000 ( 0.54%)      double v = strtod(field, &end);
108,658,465 (55.86%)  => strtod (150,000x)
   750,000 ( 0.39%)      return (end == field) ? 0.0 : v;

The => line shows the callee’s cost inline. That one strtod call accounts for 56% of the entire program. Hard to miss.

   250,000 ( 0.13%)  static int parse_line(char *buf) {
   400,000 ( 0.21%)      char *id  = strtok(buf,  ",");
 5,500,000 ( 2.83%)  => strtok_r (50,000x)
   350,000 ( 0.18%)      char *nm  = strtok(NULL, ",");
 6,483,552 ( 3.33%)  => strtok_r (50,000x)
                     ...
   200,000 ( 0.10%)      Stats *st = find_or_create(cat);
21,736,060 (11.17%)  => find_or_create (50,000x)

Six strtok calls per row, each attributing several million instructions to libc’s strtok_r. And find_or_create — a 20-element linear strcmp loop called 50,000 times — costs more instructions than anything else in our code.

GUI

callgrind.out opens in qcachegrind (macOS: brew install qcachegrind, Ubuntu: apt install kcachegrind). The GUI adds an interactive call graph, click-through to source, and a sortable flat profile. Worth installing if you’re spending any real time on a profile.


Cachegrind

valgrind --tool=cachegrind ./program
cg_annotate cachegrind.out.<pid>

Simulates the L1 instruction cache, L1 data cache, and last-level cache. Useful when you think cache misses are the problem rather than raw instruction count. The output metrics:

  • Ir — instruction reads
  • D1mr / DLmr — L1 and LL data read misses
  • D1mw / DLmw — L1 and LL data write misses

Most of the time callgrind is the better first tool. Cachegrind becomes relevant once you know which loop is hot and want to understand whether it’s bound by memory access patterns.


Massif

valgrind --tool=massif --massif-out-file=massif.out ./program
ms_print massif.out

Tracks heap allocations over the lifetime of the program and produces a graph showing peak usage and which call sites are responsible. Useful for “why is this process using 4 GB” situations.


Suppressions

Libraries you don’t control sometimes trigger memcheck errors. Rather than stare at them forever, suppress them:

valgrind --gen-suppressions=all ./program 2>raw.supp

Edit raw.supp to keep only the entries you want, then pass it with --suppressions=my.supp. Many distros also ship standard suppression files at /usr/lib/valgrind/*.supp.
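An edited entry looks like this — the name on the first line is arbitrary, and the frame and library names here are placeholders:

```
{
   libfoo-uninit-cond
   Memcheck:Cond
   fun:inner_helper
   ...
   obj:*/libfoo.so*
}
```

`Memcheck:Cond` matches conditional-jump-on-uninitialised-value reports; `fun:` and `obj:` match stack frames, and `...` matches any number of intervening frames.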


Practical patterns

CI memory check:

valgrind --tool=memcheck --leak-check=full --error-exitcode=1 ./tests

Find the hot function:

valgrind --tool=callgrind --callgrind-out-file=callgrind.out ./program
callgrind_annotate --threshold=80 callgrind.out   # top 80% only

Compare before and after a change:

callgrind_annotate before.out | grep "PROGRAM TOTALS"
callgrind_annotate after.out  | grep "PROGRAM TOTALS"

Write one log file per child process:

valgrind --log-file=valgrind-%p.log --trace-children=yes ./program

Step through a Valgrind run in GDB:

valgrind --tool=memcheck --vgdb=yes --vgdb-error=0 ./program
# in another terminal:
gdb ./program
(gdb) target remote | vgdb
