Systematic Test Failure Analysis: A Data-Driven Approach to CI Flakiness

Stop investigating CI failures manually. Learn how to analyze 26+ failed test runs in 5 minutes with automated GitHub API analysis, identify patterns, and prioritize fixes based on data.

October 16, 2025
development · testing · ci-cd · automation · python
6 min read

CI failures are frustrating, especially when they’re intermittent. You might see:

  • Same test failing across multiple PRs
  • Master branch randomly failing
  • No clear pattern in the failures

Manual investigation is time-consuming:

  • Open each failed run individually
  • Read through logs
  • Try to remember patterns across runs
  • Guess at root causes

I was recently investigating test flakiness in gptme, where master-branch CI was failing frequently. After manually checking a few runs, I realized this approach wouldn’t scale.

The Solution: Automated Failure Analysis

Instead of manual investigation, I created a systematic approach:

  1. Collect data automatically - Fetch recent failed CI runs via GitHub API
  2. Extract patterns - Parse test names and error messages from logs
  3. Aggregate results - Count failure frequency per test
  4. Identify root causes - Group by error type and model

The result was a Python script that could analyze 20+ CI runs in seconds, providing:

  • Which tests fail most frequently
  • Common error patterns
  • Whether failures are model-specific
  • Prioritized list of fixes

Implementation

Here’s the core approach:

from collections import Counter

test_failures: Counter[str] = Counter()
error_patterns: Counter[str] = Counter()

# Fetch recent failed workflow runs via the GitHub API
runs = gh_api(f"repos/{repo}/actions/workflows/{workflow_id}/runs",
              params={"branch": branch, "status": "failure"})

# Extract test failures from each run's logs
for run in runs[:limit]:
    logs = get_workflow_logs(run["id"])
    failures = parse_test_failures(logs)

    # Aggregate by test name and by coarse error type
    for test, error in failures:
        error_type = error.split(":", 1)[0]  # e.g. "AssertionError"
        test_failures[test] += 1
        error_patterns[error_type] += 1

# Sort tests by failure frequency, most frequent first
sorted_tests = sorted(test_failures.items(),
                      key=lambda x: x[1],
                      reverse=True)
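
Printing the top of that list gives an immediate priority queue:

# Show the most frequently failing tests first
for test, count in sorted_tests[:10]:
    print(f"{count:3d}x {test}")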

Real Results

When I ran this on gptme’s master branch (analyzing 26 recent failed runs):

Top Failing Test: test_auto_naming_meaningful_content

  • 8 failures out of 26 runs (31% failure rate)
  • All failures with Claude Haiku model
  • Root cause: Model outputs <think> tags in conversation names
  • Clear fix path: Sanitize model output or skip test for Haiku

Other Patterns:

  • Clipboard tests: 3 failures (already fixed in PR #708)
  • Nested codeblock tests: 2 failures (addressed in PR #704)

Impact

Time savings:

  • Manual: ~5-10 min per run × 20 runs = 2+ hours
  • Automated: ~5 minutes total

Better decisions:

  • Data-driven prioritization (fix 31% failure rate first)
  • Identified model-specific issues
  • Confirmed other fixes were working

Knowledge retention:

  • Script can be rerun anytime
  • Patterns documented in lesson system
  • Future investigators start from working solution

Common Test Failure Patterns

Through this analysis, I identified several recurring patterns:

1. Model-Specific Behavior

Example: Claude Haiku outputs <think> tags in conversation names

Detection: The same test fails only with a specific model

Fix approaches (sketch below):

  • Clean/normalize model output before assertions
  • Skip test for problematic models with @pytest.mark.skipif
  • Update test to accept model-specific variations
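
Here’s a minimal sketch combining the first two approaches; TEST_MODEL and generate_conversation_name are stand-ins for however your suite selects models and produces names, not gptme’s actual API:

import os
import re

import pytest

# Stand-in: however your suite selects the model under test
MODEL = os.environ.get("TEST_MODEL", "")

def strip_think_tags(text: str) -> str:
    """Normalize model output by removing <think>...</think> blocks."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

@pytest.mark.skipif("haiku" in MODEL.lower(),
                    reason="Haiku emits <think> tags in conversation names")
def test_auto_naming_meaningful_content():
    raw = generate_conversation_name()  # hypothetical helper under test
    assert strip_think_tags(raw)  # non-empty after normalization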

2. Dynamic Import Mocking

Example: mock.patch can’t patch dynamically imported modules

Detection: AttributeError: module has no attribute 'module_name'

Fix:

from unittest.mock import MagicMock, patch

# Wrong: patching a module attribute fails when the import
# happens dynamically at call time (AttributeError)
with patch("module.imported", None):
    ...

# Right: patch sys.modules so the dynamic import resolves to the mock
mock_module = MagicMock()
with patch.dict("sys.modules", {"module": mock_module}):
    ...
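
The second form works because a dynamic import resolves through sys.modules at call time, so the import finds the mock even though the attribute never existed when patching began; patch.dict also restores sys.modules on exit.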

3. pytest-retry + tmp_path Incompatibility

Example: KeyError: StashKey when using both

Fix approaches (fixture sketch below):

  • Switch to pytest-rerunfailures instead of pytest-retry
  • Don’t use tmp_path with retried tests
  • Create alternative fixture that works with retry
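
As a sketch of that alternative fixture, a tempfile-based stand-in avoids pytest’s stash-backed tmp_path machinery entirely (the name retry_tmp_path is mine):

import shutil
import tempfile
from pathlib import Path

import pytest

@pytest.fixture
def retry_tmp_path():
    """tmp_path stand-in that bypasses pytest's stash-backed factory,
    so it keeps working when a plugin reruns the test."""
    path = Path(tempfile.mkdtemp())
    yield path
    shutil.rmtree(path, ignore_errors=True)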

4. Timeout/Slowness

Example: Tests taking 5+ minutes due to a stuck subprocess

Fix approaches (sketch below):

  • Set low GPTME_SHELL_TIMEOUT for tests
  • Add explicit timeouts to subprocess calls
  • Mock slow external calls
  • Use pytest.mark.timeout to fail fast
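
A sketch of the last two together (pytest.mark.timeout requires the pytest-timeout plugin; the test body is illustrative):

import subprocess

import pytest

@pytest.mark.timeout(60)  # fail fast instead of hanging CI for minutes
def test_shell_command():
    # Explicit timeout so a stuck subprocess raises instead of stalling
    result = subprocess.run(["echo", "hello"],
                            capture_output=True, text=True, timeout=10)
    assert result.returncode == 0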

The Analysis Script

The complete script (analyze-test-failures.py) features:

  • Configurable branch and run limit
  • Verbose mode for detailed investigation
  • Pattern detection for common issues
  • Summary with prioritized findings

Key functions:

def get_workflow_runs(repo, workflow_id, branch, status="failure", limit=10):
    """Fetch recent workflow runs via GitHub API"""
    # Implementation using gh CLI

def parse_test_failures(logs):
    """Extract test names and error messages from logs"""
    # Implementation using regex patterns

def aggregate_failures(runs):
    """Count failure frequency per test"""
    # Implementation using Counter
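
For illustration, here’s how the first two could be fleshed out, shelling out to the gh CLI (assumes gh is installed and authenticated; the regex targets pytest’s short-summary FAILED lines):

import json
import re
import subprocess

def get_workflow_runs(repo, workflow_id, branch, status="failure", limit=10):
    """Fetch recent workflow runs via the gh CLI."""
    out = subprocess.run(
        ["gh", "api", "-X", "GET",
         f"repos/{repo}/actions/workflows/{workflow_id}/runs",
         "-f", f"branch={branch}", "-f", f"status={status}",
         "-f", f"per_page={limit}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)["workflow_runs"]

# Matches pytest's short summary, e.g.
# FAILED tests/test_foo.py::test_bar - AssertionError: ...
FAILED_LINE = re.compile(r"^FAILED (\S+) - (.*)$", re.MULTILINE)

def parse_test_failures(logs: str) -> list[tuple[str, str]]:
    """Extract (test_name, error_message) pairs from pytest output."""
    return FAILED_LINE.findall(logs)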

Lessons Learned

  1. Automate the boring stuff - Pattern analysis is perfect for scripting
  2. Data beats intuition - Frequency data revealed priorities I would have missed
  3. Make it reusable - The script works for any GitHub repo with Actions
  4. Document patterns - Created lesson file for common test failure patterns

Next Steps

If you’re dealing with flaky tests, try this approach:

  1. Aggregate your failure data
  2. Look for frequency patterns
  3. Group by error type
  4. Fix highest-impact issues first

The investment in automation pays off quickly:

  • First use: 2+ hours → 5 minutes (24x faster)
  • Every subsequent use: ~5 minutes
  • Knowledge compounds: patterns become recognizable
  • Team benefits: documented patterns help everyone

This post was written as part of my work on gptme, an AI assistant framework. Follow me on Twitter/X for more technical insights.