How to Audit What Your AI Coding Agent Actually Did

The agent's summary is a story. The transcript is the evidence.

If you let an AI coding agent run unattended — and increasingly people do — you eventually need to answer a specific question: what did it actually do? Not the tidy summary it gives at the end, but the literal sequence of commands it ran and files it changed. You need this for trust (did it stay in bounds?), for debugging (how did it reach a broken state?), and for security (did anything happen that shouldn't have?).

Why the summary isn't enough

An agent's end-of-session summary is a reconstruction — sometimes lossy, occasionally optimistic. It reports what the agent believes it accomplished, which can quietly omit a command that failed, a file it touched and reverted, or a step it skipped. For an audit you want ground truth, and the agent's narration isn't it.

The good news: the record already exists

Claude Code writes every session to ~/.claude/projects/ as a JSONL transcript, and every tool call is in there — each command run, each file read or written, each result returned, with timestamps. That's a complete audit trail. You just have to read it.
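
A minimal sketch of reading that record in Python. The record shape used here (a `message.content` list holding `tool_use` blocks with `name` and `input` fields) is an assumption based on how these transcripts commonly look, not a documented contract, so check it against one of your own files first. The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator

# Location taken from the text above; adjust if your install differs.
TRANSCRIPT_ROOT = Path.home() / ".claude" / "projects"


def find_transcripts(root: Path = TRANSCRIPT_ROOT) -> list[Path]:
    """Return every .jsonl file under root, oldest first."""
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.jsonl"), key=lambda p: p.stat().st_mtime)


def iter_tool_calls(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per tool call found in a transcript's lines.

    ASSUMED record shape (verify against your own transcripts):
      {"timestamp": ..., "message": {"content": [
          {"type": "tool_use", "name": "Bash", "input": {...}}]}}
    """
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip truncated or partially written lines
        if not isinstance(record, dict):
            continue
        message = record.get("message")
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue  # plain-string content carries no tool calls
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                yield {
                    "timestamp": record.get("timestamp"),
                    "tool": block.get("name"),
                    "input": block.get("input") or {},
                }


if __name__ == "__main__":
    for path in find_transcripts():
        with path.open(encoding="utf-8") as fh:
            for call in iter_tool_calls(fh):
                print(path.name, call["timestamp"], call["tool"])
```

Malformed lines are skipped rather than raised on, because a transcript for a session that is still running may end mid-write.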

The four questions an audit answers

What commands ran? Every Bash tool call — the literal command stream. The core of any security review.
What files changed? Every Write and Edit, by path. The blast radius of the session.
What was accessed? Every Read — and specifically, any read of a sensitive path (.env, keys, credentials) the task had no reason to touch.
Where did it fail? Tool results carrying errors mark where the session struggled — often where an investigation should start.
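
The four questions map onto four buckets. The sketch below sorts one transcript into them. It assumes tool calls appear as `tool_use` blocks, that file tools carry a `file_path` input, and that failures appear as `tool_result` blocks with `is_error` set. All of these are assumptions about the transcript format rather than a documented schema, and `summarize_session` is a hypothetical name.

```python
import json
from typing import Iterable

# ASSUMPTION: tool names and input keys as commonly seen in transcripts.
WRITE_TOOLS = {"Write", "Edit"}


def summarize_session(lines: Iterable[str]) -> dict:
    """Bucket one transcript into commands, writes, reads, and errors."""
    report = {"commands": [], "files_changed": [], "files_read": [], "errors": []}
    calls_by_id = {}  # tool_use id -> (tool name, input), used to label errors

    for raw in lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # blank or partial line
        message = record.get("message") if isinstance(record, dict) else None
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue
        for block in content:
            if not isinstance(block, dict):
                continue
            kind = block.get("type")
            if kind == "tool_use":
                name = block.get("name")
                args = block.get("input")
                if not isinstance(args, dict):
                    args = {}
                calls_by_id[block.get("id")] = (name, args)
                if name == "Bash":
                    report["commands"].append(args.get("command", ""))
                elif name in WRITE_TOOLS:
                    report["files_changed"].append(args.get("file_path", ""))
                elif name == "Read":
                    report["files_read"].append(args.get("file_path", ""))
            elif kind == "tool_result" and block.get("is_error"):
                # Pair the error with the call that produced it, if we saw one.
                name, args = calls_by_id.get(block.get("tool_use_id"), ("?", {}))
                report["errors"].append({"tool": name, "input": args})
    return report
```

Pairing each error with its originating call by id means the error bucket tells you which command or file operation failed, not just that something did.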

What to look for

An audit is only useful if you know the signals:

Scope creep: commands outside the task, such as an agent asked to fix a test that also edited deploy config.
Sensitive access: reads or writes of .env files or key material in a session that shouldn't need them.
Dangerous attempts: force-pushes, rm -rf on risky paths, curl | sh, whether or not a guardrail stopped them.
Failed-then-retried sequences: a command that fails and is then reissued in a different form, which sometimes marks a workaround.
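
Each of these signals can be approximated with a small check over the extracted commands and paths. The patterns below are illustrative assumptions, not a complete rule set: they flag any `rm -rf` rather than judging whether the path is risky, and the retry check is a rough heuristic. All function names are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Illustrative patterns only; extend them for your own environment.
DANGEROUS_COMMANDS = {
    "force push": re.compile(r"\bgit\s+push\b.*(--force\b|\s-f\b)"),
    "recursive delete": re.compile(r"\brm\s+-(?=[a-zA-Z]*r)(?=[a-zA-Z]*f)[a-zA-Z]+\b"),
    "pipe to shell": re.compile(r"\b(curl|wget)\b[^|]*\|\s*(sudo\s+)?(ba|z)?sh\b"),
}

# Matched against the file name only, not the full path.
SENSITIVE_NAME = re.compile(
    r"^\.env(\..+)?$|\.pem$|\.key$|^id_(rsa|ed25519)$|^credentials(\..+)?$"
)


def flag_commands(commands):
    """Return (label, command) for every command matching a dangerous pattern."""
    return [
        (label, cmd)
        for cmd in commands
        for label, pattern in DANGEROUS_COMMANDS.items()
        if pattern.search(cmd)
    ]


def is_sensitive(path):
    """True if the file name looks like an env file, key, or credentials file."""
    return bool(SENSITIVE_NAME.search(PurePosixPath(path).name))


def out_of_scope(paths, allowed_prefixes):
    """Return changed paths that fall outside the prefixes the task should touch."""
    return [p for p in paths if not any(p.startswith(a) for a in allowed_prefixes)]


def find_retries(outcomes):
    """Flag a failed command immediately followed by a variant of the same program.

    `outcomes` is an ordered list of (command, failed) pairs.
    """
    hits = []
    for (cmd, failed), (next_cmd, _) in zip(outcomes, outcomes[1:]):
        same_program = cmd.split()[:1] == next_cmd.split()[:1] and cmd.split()
        if failed and same_program and cmd != next_cmd:
            hits.append((cmd, next_cmd))
    return hits
```

A flagged item is a prompt to look at the transcript, not a verdict. A force-push to a personal branch and one to a shared branch match the same pattern.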

Make it a habit, not a forensic exercise

The instinct is to read the audit trail only after something breaks. The higher-value practice is a periodic, lightweight glance at what your agents did unattended — the way you'd skim a colleague's pull requests. It builds an accurate sense of how your agents actually behave and surfaces drift before it becomes an incident.
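
One way to keep that glance lightweight is to look only at sessions from the past few days. The sketch below lists transcript files by modification time. The seven-day window and the function name are arbitrary choices for illustration.

```python
import time
from pathlib import Path
from typing import Optional


def recent_transcripts(
    root: Path, days: float = 7.0, now: Optional[float] = None
) -> list[Path]:
    """Return .jsonl files under root modified in the last `days` days, newest first."""
    if not root.is_dir():
        return []
    cutoff = (time.time() if now is None else now) - days * 86400
    hits = [p for p in root.rglob("*.jsonl") if p.stat().st_mtime >= cutoff]
    return sorted(hits, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for path in recent_transcripts(Path.home() / ".claude" / "projects"):
        print(path)
```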

Don't want to parse JSONL by hand?
Operator reads those transcripts and gives you the whole audit in one command — every command, every file write, every sensitive-path access, plus the dangerous actions your agents attempted — across all your projects. Free, local, no telemetry.
