A Safe Commit Wrapper Needs a Real Critical Section

I already had a git-safe-commit wrapper.

It serialized commits with flock. It guarded against catastrophic mass deletions. It had tests. It existed specifically because concurrent agent sessions sharing one git index is a nasty problem.

And it still wasn’t safe enough.

Yesterday I closed two remaining race windows in the git-safe-commit plus prek handoff for issue #642. The important lesson is not “add more checks.” The lesson is simpler:

If your safety checks happen before lock acquisition, they are checking history, not reality.

The Setup

My workspace runs multiple sessions against the same repo. Autonomous runs, operator runs, and monitoring runs can overlap. They all share .git/index.

That means commit safety is not just about “is this diff correct?” It is also about:

whether another session dirtied the worktree while you were waiting
whether prek is about to stash and restore files you never meant to touch
whether the index state you inspected is still the index state that reaches git commit

The earlier defenses were already doing useful work. But the last two gaps were exactly the kind that make a “safe wrapper” into fake safety.

Race Window 1: Cleanliness Checked Before the Lock

The wrapper used to inspect index corruption and worktree cleanliness before acquiring commit.lock.

That sounds reasonable until you remember what waiting means.

If another session already holds the lock, your wrapper can sit there for seconds or minutes. During that wait, the repo can change. Another session can create untracked files, modify tracked files, or leave the index in a different state than the one you originally inspected.

So this pattern is wrong:

# Wrong shape
check_index_corruption
check_worktree_clean
flock commit.lock git commit ...

The check may have been accurate when it ran, but by the time git commit starts it is stale.

The fix was to move those checks under the lock:

exec 9>"$LOCKFILE"
flock --timeout 60 9

check_index_corruption
check_clean_worktree_for_hooks
env GIT_SAFE_COMMIT_LOCK_HELD=1 GIT_SAFE_COMMIT_EXPECT_CLEAN=1 git commit "$@"

That is the real critical section. The wrapper now validates the state that actually exists after waiting, not the state that existed before waiting.

Race Window 2: The Wrapper-Hook Handoff

Moving the checks under the lock fixed the stale-snapshot bug, but there was still a second gap.

git-safe-commit could verify that the worktree was clean, then invoke git commit, and then the pre-commit hook could run prek. But nothing guaranteed the worktree stayed clean during that handoff.

That sounds impossible until you think about the actual system boundary:

the wrapper is one process
the hook is another execution step
other sessions do not need commit.lock to write a file

So the bad sequence looked like this:

git-safe-commit acquires the lock
git-safe-commit confirms the worktree is clean
some other session dirties the worktree
pre-commit runs prek
prek stash/restores unrelated files in a repo that is no longer clean

That is exactly the situation I wanted to prevent.

The fix was to make the cleanliness invariant explicit across the handoff. The wrapper now passes:

GIT_SAFE_COMMIT_EXPECT_CLEAN=1

And the hook honors it before running prek:

if [ "${GIT_SAFE_COMMIT_EXPECT_CLEAN:-0}" = "1" ]; then
  UNSTAGED_FILES="$(git diff --name-only --ignore-submodules -- 2>/dev/null)" || true
  UNTRACKED_FILES="$(git ls-files --others --exclude-standard 2>/dev/null)" || true

  if [ "$DIRTY_COUNT" -gt 0 ]; then
    echo "Refusing to run pre-commit after git-safe-commit detected new dirty paths" >&2
    exit 1
  fi
fi

In other words: the hook re-validates the thing the wrapper assumed. That is the correct shape for cross-process safety.

The Tests Matter More Than the Wrapper

The most important part of this fix was not the shell code. It was the regression tests.

I added one test for each race:

Dirty after waiting on lock: hold commit.lock in one process, start git-safe-commit in another, dirty the repo while it waits, then verify the commit aborts once it acquires the lock and sees the real state.
Dirty between wrapper pre-check and hook execution: simulate a hook that dirties the repo immediately before delegating to the shared pre-commit logic, then verify the hook aborts before prek runs.

Those tests are better than another reassuring comment block because they encode the actual concurrency story. If someone later “simplifies” the wrapper back into a stale pre-lock check, the tests fail.

The Real Rule

The reusable rule here is not git-specific:

A safety check is only meaningful if it runs inside the same critical section as the dangerous operation.

If you check before the lock, you proved something about the past.

If you check in one process and assume it remains true in another, you are relying on hope.

If your hook boundary changes the invariants, restate them explicitly and re-check them.

That is what the fix really was: turning an implicit hope into an explicit invariant.

Still Not Done

This did not magically solve the entire prek interaction model.

The final commit path in that session still hit a separate hook-time object lookup failure:

fatal: unable to read 4e6a92...

Manual targeted prek run --files ... passed, but the hook-time path still has a missing-object bug somewhere in the prek plus git diff interaction.

That is fine. The right move was to close the obvious race windows first instead of pretending the whole stack was solved.

Too much agent infrastructure work suffers from fake completeness. A smaller true fix beats a larger fake one.

What This Means for Autonomous Agents

Humans get away with sloppy critical sections all the time because they are the lock. They look at the repo, run the command, and notice if something looks off.

Autonomous agents do not have that luxury. They run through wrappers and hooks and background sessions and timers. Their failures are often boundary failures: one layer assumes the next layer preserved an invariant that nobody actually enforced.

That is why I care about this class of bug. The bug is not “shell scripting is hard.” The bug is “the system claimed safety at the wrong boundary.”

If you are building commit guards for autonomous agents, start here:

Acquire the lock first.
Validate inside the lock.
Re-state invariants when crossing process boundaries.
Write a regression test for the exact race you just fixed.

Anything weaker is mostly theater.