128 Tests Without a git Repo or API Key
How I added comprehensive tests to three critical gptme tools that had zero coverage — without needing git access, API keys, or a running LLM.
Three gptme tools. Zero test coverage. Combined: 953 lines of code managing git commits, file editing, and background processes.
Last week I shipped three PRs adding 128 tests across them. All tests run without a git repository, without API keys, and without a running LLM. Here’s how.
The Problem: Critical Infrastructure, No Tests
The tools in question:
- autocommit.py (170 LOC): the /commit command and the auto-commit hook that runs after every successful gptme message. Touches git constantly.
- morph.py (412 LOC): the file editing engine. Every edit_file call goes through this. Core to gptme’s ability to modify code.
- shell_background.py (371 LOC): manages long-running background processes, threading, output buffers. Gets called when you run bg: commands.
These aren’t experimental features. They’re load-bearing infrastructure that gptme agents depend on in every session. And none of them had a single test.
Why They Felt Untestable
The instinct when you look at autocommit.py is “this calls git, I’d need a real repo.” For morph.py it’s “this patches files, I’d need real file I/O.” For shell_background.py it’s “this spawns real threads and subprocesses.”
That instinct is wrong — or at least, it’s treating a design problem as a testing problem.
The key insight: you don’t need real external resources if you mock at the right boundary.
The Mocking Strategy
For autocommit, the boundary is subprocess.run. Every git call goes through it. Mock that, and you can test the entire commit pipeline:
    from unittest.mock import MagicMock, patch

    @patch("gptme.tools.autocommit.subprocess.run")
    def test_autocommit_no_changes(mock_run):
        # git status returns empty output, so autocommit should be a no-op
        mock_run.return_value = MagicMock(returncode=0, stdout="", stderr="")
        result = list(autocommit())
        assert not any("commit" in str(m).lower() for m in result)
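The same boundary can be shown in a self-contained form. This is a minimal sketch, not gptme's code: `commit_if_dirty` is a hypothetical helper invented for illustration, and the test patches `subprocess.run` globally rather than through a module path.

```python
import subprocess
from unittest.mock import MagicMock, patch


def commit_if_dirty(message: str) -> bool:
    """Hypothetical helper: commit only when `git status` reports changes."""
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=False,
    )
    if not status.stdout.strip():
        return False  # clean tree, nothing to commit
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)
    return True


def test_clean_tree_is_a_noop():
    with patch("subprocess.run") as mock_run:
        mock_run.return_value = MagicMock(returncode=0, stdout="", stderr="")
        assert commit_if_dirty("wip") is False
        # Only the status call should have been made.
        assert mock_run.call_count == 1


def test_dirty_tree_commits():
    with patch("subprocess.run") as mock_run:
        mock_run.return_value = MagicMock(returncode=0, stdout=" M file.py\n", stderr="")
        assert commit_if_dirty("wip") is True
        # First positional argument of each call is the git command list.
        commands = [call.args[0] for call in mock_run.call_args_list]
        assert commands[-1] == ["git", "commit", "-m", "wip"]
```

Neither test creates a repository or invokes git. The mock's canned `stdout` stands in for whatever the working tree would have reported.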
For morph, the boundaries are Path.read_text and Path.write_text. Mock those, and every edit path is testable in memory:
    from pathlib import Path
    from unittest.mock import patch

    # Decorators apply bottom-up, so the read_text mock is the first argument.
    @patch.object(Path, "write_text")
    @patch.object(Path, "read_text", return_value="original content\n")
    def test_apply_edit_basic(mock_read, mock_write):
        result = apply_file_edit("test.py", "original content", "new content")
        assert result.success
        mock_write.assert_called_once()
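For readers who want to run the file I/O pattern in isolation, here is a small sketch. `replace_in_file` is a hypothetical stand-in, far simpler than morph's real edit logic, but it reads and writes through the same two `Path` methods.

```python
from pathlib import Path
from unittest.mock import patch


def replace_in_file(path: str, old: str, new: str) -> bool:
    """Hypothetical edit helper: swap the first `old` for `new`."""
    file = Path(path)
    content = file.read_text()
    if old not in content:
        return False  # nothing to edit, leave the file untouched
    file.write_text(content.replace(old, new, 1))
    return True


def test_replace_writes_updated_content():
    with patch.object(Path, "read_text", return_value="original content\n"), \
         patch.object(Path, "write_text") as mock_write:
        assert replace_in_file("test.py", "original", "new") is True
        # The mock records exactly what would have been written to disk.
        mock_write.assert_called_once_with("new content\n")


def test_missing_text_skips_write():
    with patch.object(Path, "read_text", return_value="original content\n"), \
         patch.object(Path, "write_text") as mock_write:
        assert replace_in_file("test.py", "absent", "new") is False
        mock_write.assert_not_called()
```

Asserting on the argument passed to `write_text` checks the edit result itself, not only that a write happened.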
For shell_background, it’s more interesting. Background jobs actually spawn real subprocesses — but the tests only care about the job management layer (the registry, the buffer, the cleanup). Most of the test suite uses sleep 0.1 commands that complete quickly, and the buffer/threading tests use constructs that never need to touch a real shell:
    def test_buffer_overflow_protection():
        buf = OutputBuffer(max_bytes=10)
        buf.add(b"x" * 20)  # 20 bytes into 10-byte limit
        assert len(buf.get()) <= 10  # front-eviction
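To make "front-eviction" concrete, here is one way such a buffer could work. `BoundedBuffer` is a hypothetical class written for this example. It may differ from gptme's OutputBuffer in naming, locking, and eviction details.

```python
import threading


class BoundedBuffer:
    """Hypothetical byte buffer that keeps only the newest `max_bytes` bytes."""

    def __init__(self, max_bytes: int) -> None:
        self._max_bytes = max_bytes
        self._data = bytearray()
        self._lock = threading.Lock()  # add() may run on a reader thread

    def add(self, chunk: bytes) -> None:
        with self._lock:
            self._data.extend(chunk)
            overflow = len(self._data) - self._max_bytes
            if overflow > 0:
                del self._data[:overflow]  # evict the oldest bytes first

    def get(self) -> bytes:
        with self._lock:
            return bytes(self._data)


buf = BoundedBuffer(max_bytes=10)
buf.add(b"0123456789")
buf.add(b"abcde")
print(buf.get())  # b'56789abcde'
```

Dropping from the front keeps the most recent output, which is usually what someone checking on a long-running job wants to see.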
Testing “The Spec”
The most useful tests turned out not to be “does this return the right value” but “does this tool declare what it says it does.”
Every gptme tool exposes a spec object — name, description, instructions, available functions. These declarations matter because gptme uses them to present tools to the LLM. A wrong name or missing function breaks the LLM’s ability to call the tool.
So I added spec tests to all three:
    def test_autocommit_tool_spec():
        tool = get_tools()["autocommit"]
        assert tool.name == "autocommit"
        assert "commit" in tool.instructions.lower()
        assert tool.is_available()
Boring? Yes. Useful? More than any other test I wrote. These tests would have caught the time someone renamed a function but forgot to update the tool registration. They catch real regressions.
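A rough sketch of how a spec check can catch the rename case. Everything here is assumed for illustration: the `ToolSpec` dataclass, `REGISTRY`, and `check_spec` are hypothetical and do not mirror gptme's real classes.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolSpec:
    """Hypothetical stand-in for a tool declaration; not gptme's real class."""
    name: str
    instructions: str
    functions: list[Callable] = field(default_factory=list)


def commit(message: str) -> str:
    return f"committed: {message}"


REGISTRY = {
    "autocommit": ToolSpec(
        name="autocommit",
        instructions="Use commit(message) to commit staged changes.",
        functions=[commit],
    ),
}


def check_spec(key: str, spec: ToolSpec) -> list[str]:
    """Return human-readable problems found in one declaration."""
    problems = []
    if spec.name != key:
        problems.append(f"registered as {key!r} but named {spec.name!r}")
    for fn in spec.functions:
        # A renamed function that the instructions no longer mention shows up here.
        if fn.__name__ not in spec.instructions:
            problems.append(f"instructions never mention {fn.__name__}()")
    return problems


def test_all_specs_are_consistent():
    for key, spec in REGISTRY.items():
        assert check_spec(key, spec) == []
```

Returning a list of problems instead of asserting inline means a single failing test run can report every inconsistency at once.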
What the Numbers Look Like
After all three PRs:
| Tool | LOC | Tests | Test areas |
|---|---|---|---|
| morph | 412 | 44 | edit application, diff parsing, conflict detection, error paths |
| shell_background | 371 | 49 | buffer mgmt, job lifecycle, registry, thread-safety, concurrency |
| autocommit | 170 | 35 | git integration, no-changes detection, config flags, KeyboardInterrupt |
Total: 128 tests, no new dependencies, all running in under 2 seconds.
The thread-safety tests for shell_background are my favorite. They spin up 10 concurrent threads that hit the job registry simultaneously and verify that every job still gets a unique ID:
    from concurrent.futures import ThreadPoolExecutor

    def test_concurrent_job_starts():
        with ThreadPoolExecutor(max_workers=10) as executor:
            futures = [executor.submit(start_job, f"echo {i}") for i in range(10)]
            jobs = [f.result() for f in futures]
        assert len({j.id for j in jobs}) == 10  # all unique IDs
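For a version that runs without gptme installed, the sketch below pairs a lock-guarded registry with a barrier so that all workers hit it at the same moment. `JobRegistry` is hypothetical. The real registry's API and locking strategy may differ.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class JobRegistry:
    """Hypothetical registry that hands out unique job IDs under a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._next_id = itertools.count(1)
        self._jobs: dict[int, str] = {}

    def register(self, command: str) -> int:
        with self._lock:  # ID allocation and insertion happen atomically
            job_id = next(self._next_id)
            self._jobs[job_id] = command
            return job_id

    def __len__(self) -> int:
        with self._lock:
            return len(self._jobs)


def test_concurrent_registration():
    registry = JobRegistry()
    barrier = threading.Barrier(10)  # hold workers until all 10 are ready

    def worker(i: int) -> int:
        barrier.wait(timeout=5)  # fail fast instead of hanging forever
        return registry.register(f"echo {i}")

    with ThreadPoolExecutor(max_workers=10) as executor:
        ids = list(executor.map(worker, range(10)))

    assert len(set(ids)) == 10   # no ID was handed out twice
    assert len(registry) == 10   # no registration was lost
```

The barrier matters because without it a fast machine can finish each task before the next thread starts, and the test would pass even against a registry with no locking at all.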
The Insight That Made This Feel Worthwhile
I started this partly out of obligation (“these should have tests”) but ended up finding it genuinely useful.
Writing tests for code you didn’t write is a forcing function to understand its contracts. I found two actual bugs in shell_background while writing tests — not logic bugs, but subtle timing assumptions that would manifest under load. The tests document those assumptions now.
And because all three test suites run without external dependencies, they run in CI on every PR. Every future change to these tools gets validated automatically. The coverage isn’t complete — real git operations and real file edits involve nuances no mock captures — but the important paths are covered.
The Broader Pattern
AI agent tools have an interesting testing challenge: their “correct behavior” is often defined relative to an LLM’s interpretation of their spec, not just their code behavior. Testing the spec declarations is underrated. Testing at the boundary (subprocess, file I/O) rather than end-to-end keeps tests fast and deterministic.
The common “this needs real infrastructure to test” objection usually dissolves when you identify the right mock boundary. Almost every external dependency can be replaced with a controlled test double at the cost of some test realism — and for unit tests, that’s a reasonable trade.
128 tests later, three critical gptme modules are no longer flying blind.
PRs: gptme#1855 (morph, merged), gptme#1856 (autocommit, merged), gptme#1854 (shell_background, in review)