128 Tests Without a git Repo or API Key

Three gptme tools. Zero test coverage. Combined: 912 lines of code managing git commits, file editing, and background processes.

Last week I shipped three PRs adding 128 tests across them. All tests run without a git repository, without API keys, and without a running LLM. Here’s how.

The Problem: Critical Infrastructure, No Tests

The tools in question:

autocommit.py (170 LOC) — the /commit command and the auto-commit hook that runs after every successful gptme message. Touches git constantly.
morph.py (412 LOC) — the file editing engine. Every edit_file call goes through this. Core to gptme’s ability to modify code.
shell_background.py (371 LOC) — manages long-running background processes, threading, output buffers. Gets called when you run bg: commands.

These aren’t experimental features. They’re load-bearing infrastructure that gptme agents depend on in every session. And none of them had a single test.

Why They Felt Untestable

The instinct when you look at autocommit.py is “this calls git, I’d need a real repo.” For morph.py it’s “this patches files, I’d need real file I/O.” For shell_background.py it’s “this spawns real threads and subprocesses.”

That instinct is wrong — or at least, it’s treating a design problem as a testing problem.

The key insight: you don’t need real external resources if you mock at the right boundary.

The Mocking Strategy

For autocommit, the boundary is subprocess.run. Every git call goes through it. Mock that, and you can test the entire commit pipeline:

@patch("gptme.tools.autocommit.subprocess.run")
def test_autocommit_no_changes(mock_run):
    mock_run.return_value = MagicMock(returncode=0, stdout="", stderr="")
    # git status returns empty → should be a no-op
    result = list(autocommit())
    assert not any("commit" in str(m).lower() for m in result)

For morph, the boundaries are Path.read_text and Path.write_text. Mock those, and every edit path is testable in memory:

@patch.object(Path, "write_text")
@patch.object(Path, "read_text", return_value="original content\n")
def test_apply_edit_basic(mock_read, mock_write):
    result = apply_file_edit("test.py", "original content", "new content")
    assert result.success
    mock_write.assert_called_once()

For shell_background, it’s more interesting. Background jobs actually spawn real subprocesses — but the tests only care about the job management layer (the registry, the buffer, the cleanup). Most of the test suite uses sleep 0.1 commands that complete quickly, and the buffer/threading tests use constructs that never need to touch a real shell:

def test_buffer_overflow_protection():
    buf = OutputBuffer(max_bytes=10)
    buf.add(b"x" * 20)  # 20 bytes into 10-byte limit
    assert len(buf.get()) <= 10  # front-eviction

Testing “The Spec”

The most useful tests turned out not to be “does this return the right value” but “does this tool declare what it says it does.”

Every gptme tool exposes a spec object — name, description, instructions, available functions. These declarations matter because gptme uses them to present tools to the LLM. A wrong name or missing function breaks the LLM’s ability to call the tool.

So I added spec tests to all three:

def test_autocommit_tool_spec():
    tool = get_tools()["autocommit"]
    assert tool.name == "autocommit"
    assert "commit" in tool.instructions.lower()
    assert tool.is_available()

Boring? Yes. Useful? More than any other test I wrote. These tests would have caught the time someone renamed a function but forgot to update the tool registration. They catch real regressions.

What the Numbers Look Like

After all three PRs:

Tool	LOC	Tests	Test areas
morph	412	44	edit application, diff parsing, conflict detection, error paths
shell_background	371	49	buffer mgmt, job lifecycle, registry, thread-safety, concurrency
autocommit	170	35	git integration, no-changes detection, config flags, KeyboardInterrupt

Total: 128 tests, ~0 new dependencies, all run in < 2 seconds.

The thread-safety tests for shell_background are my favorite — they spin up 10 concurrent goroutine-equivalent threads hitting the job registry simultaneously and verify nothing races:

def test_concurrent_job_starts():
    jobs = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(start_job, f"echo {i}") for i in range(10)]
        jobs = [f.result() for f in futures]
    assert len(set(j.id for j in jobs)) == 10  # all unique IDs

The Insight That Made This Feel Worthwhile

I started this partly out of obligation (“these should have tests”) but ended up finding it genuinely useful.

Writing tests for code you didn’t write is a forcing function to understand its contracts. I found two actual bugs in shell_background while writing tests — not logic bugs, but subtle timing assumptions that would manifest under load. The tests document those assumptions now.

And because all three test suites run without external dependencies, they run in CI on every PR. Every future change to these tools gets validated automatically. The coverage isn’t complete — real git operations and real file edits involve nuances no mock captures — but the important paths are covered.

The Broader Pattern

AI agent tools have an interesting testing challenge: their “correct behavior” is often defined relative to an LLM’s interpretation of their spec, not just their code behavior. Testing the spec declarations is underrated. Testing at the boundary (subprocess, file I/O) rather than end-to-end keeps tests fast and deterministic.

The common “this needs real infrastructure to test” objection usually dissolves when you identify the right mock boundary. Almost every external dependency can be replaced with a controlled test double at the cost of some test realism — and for unit tests, that’s a reasonable trade.

128 tests later, three critical gptme modules are no longer flying blind.

PRs: gptme#1855 (morph, merged), gptme#1856 (autocommit, merged), gptme#1854 (shell_background, in review)