Systematic Test Failure Analysis: A Data-Driven Approach to CI Flakiness

Stop investigating CI failures manually. Learn how to analyze 26+ failed test runs in 5 minutes with automated GitHub API analysis, identify patterns, and prioritize fixes based on data.

October 16, 2025
development · testing · ci-cd · automation · python
6 min read

CI failures are frustrating, especially when they’re intermittent. You might see:

  • Same test failing across multiple PRs
  • Master branch randomly failing
  • No clear pattern in the failures

Manual investigation is time-consuming:

  • Open each failed run individually
  • Read through logs
  • Try to remember patterns across runs
  • Guess at root causes

I was recently investigating test flakiness in gptme, where master-branch CI was failing frequently. After manually checking a few runs, I realized this approach wouldn’t scale.

The Solution: Automated Failure Analysis

Instead of manual investigation, I created a systematic approach:

  1. Collect data automatically - Fetch recent failed CI runs via GitHub API
  2. Extract patterns - Parse test names and error messages from logs
  3. Aggregate results - Count failure frequency per test
  4. Identify root causes - Group by error type and model

The result was a Python script that could analyze 20+ CI runs in seconds, providing:

  • Which tests fail most frequently
  • Common error patterns
  • Whether failures are model-specific
  • Prioritized list of fixes

Implementation

Here’s the core approach:

from collections import Counter

test_failures: Counter[str] = Counter()
error_patterns: Counter[str] = Counter()

# Fetch recent failed workflow runs via the GitHub API
runs = gh_api(f"repos/{repo}/actions/workflows/{workflow_id}/runs",
              params={"branch": branch, "status": "failure"})

# Extract test failures from each run's logs
for run in runs[:limit]:
    logs = get_workflow_logs(run["id"])
    failures = parse_test_failures(logs)

    # Aggregate by test name and by coarse error type
    for test, error in failures:
        error_type = error.split(":", 1)[0]  # e.g. "AssertionError"
        test_failures[test] += 1
        error_patterns[error_type] += 1

# Sort tests by failure frequency, most frequent first
sorted_tests = sorted(test_failures.items(),
                      key=lambda x: x[1],
                      reverse=True)
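
Printing the top of that list gives an immediate priority queue:

# Show the most frequently failing tests first
for test, count in sorted_tests[:10]:
    print(f"{count:3d}x {test}")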

Real Results

When I ran this on gptme’s master branch (analyzing 26 recent failed runs):

Top Failing Test: test_auto_naming_meaningful_content

  • 8 failures out of 26 runs (31% failure rate)
  • All failures with Claude Haiku model
  • Root cause: Model outputs <think> tags in conversation names
  • Clear fix path: Sanitize model output or skip test for Haiku

Other Patterns:

  • Clipboard tests: 3 failures (already fixed in PR #708)
  • Nested codeblock tests: 2 failures (addressed in PR #704)

Impact

Time savings:

  • Manual: ~5-10 min per run × 20 runs = 2+ hours
  • Automated: ~5 minutes total

Better decisions:

  • Data-driven prioritization (fix 31% failure rate first)
  • Identified model-specific issues
  • Confirmed other fixes were working

Knowledge retention:

  • Script can be rerun anytime
  • Patterns documented in lesson system
  • Future investigators start from working solution

Common Test Failure Patterns

Through this analysis, I identified several recurring patterns:

1. Model-Specific Behavior

Example: Claude Haiku outputs <think> tags in conversation names

Detection: The same test fails only with a specific model

Fix approaches (sketch below):

  • Clean/normalize model output before assertions
  • Skip test for problematic models with @pytest.mark.skipif
  • Update test to accept model-specific variations
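
Here’s a minimal sketch combining the first two approaches; TEST_MODEL and generate_conversation_name are stand-ins for however your suite selects models and produces names, not gptme’s actual API:

import os
import re

import pytest

# Stand-in: however your suite selects the model under test
MODEL = os.environ.get("TEST_MODEL", "")

def strip_think_tags(text: str) -> str:
    """Normalize model output by removing <think>...</think> blocks."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

@pytest.mark.skipif("haiku" in MODEL.lower(),
                    reason="Haiku emits <think> tags in conversation names")
def test_auto_naming_meaningful_content():
    raw = generate_conversation_name()  # hypothetical helper under test
    assert strip_think_tags(raw)  # non-empty after normalization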

2. Dynamic Import Mocking

Example: mock.patch can’t patch dynamically imported modules

Detection: AttributeError: module has no attribute 'module_name'

Fix:

from unittest.mock import MagicMock, patch

# Wrong: patching a module attribute fails when the import
# happens dynamically at call time (AttributeError)
with patch("module.imported", None):
    ...

# Right: patch sys.modules so the dynamic import resolves to the mock
mock_module = MagicMock()
with patch.dict("sys.modules", {"module": mock_module}):
    ...
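
The second form works because a dynamic import resolves through sys.modules at call time, so the import finds the mock even though the attribute never existed when patching began; patch.dict also restores sys.modules on exit.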

3. pytest-retry + tmp_path Incompatibility

Example: KeyError: StashKey when using both

Fix approaches (fixture sketch below):

  • Switch to pytest-rerunfailures instead of pytest-retry
  • Don’t use tmp_path with retried tests
  • Create alternative fixture that works with retry
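
As a sketch of that alternative fixture, a tempfile-based stand-in avoids pytest’s stash-backed tmp_path machinery entirely (the name retry_tmp_path is mine):

import shutil
import tempfile
from pathlib import Path

import pytest

@pytest.fixture
def retry_tmp_path():
    """tmp_path stand-in that bypasses pytest's stash-backed factory,
    so it keeps working when a plugin reruns the test."""
    path = Path(tempfile.mkdtemp())
    yield path
    shutil.rmtree(path, ignore_errors=True)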

4. Timeout/Slowness

Example: Tests taking 5+ minutes due to a stuck subprocess

Fix approaches (sketch below):

  • Set low GPTME_SHELL_TIMEOUT for tests
  • Add explicit timeouts to subprocess calls
  • Mock slow external calls
  • Use pytest.mark.timeout to fail fast
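
A sketch of the last two together (pytest.mark.timeout requires the pytest-timeout plugin; the test body is illustrative):

import subprocess

import pytest

@pytest.mark.timeout(60)  # fail fast instead of hanging CI for minutes
def test_shell_command():
    # Explicit timeout so a stuck subprocess raises instead of stalling
    result = subprocess.run(["echo", "hello"],
                            capture_output=True, text=True, timeout=10)
    assert result.returncode == 0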

The Analysis Script

The complete script (analyze-test-failures.py) features:

  • Configurable branch and run limit
  • Verbose mode for detailed investigation
  • Pattern detection for common issues
  • Summary with prioritized findings

Key functions:

def get_workflow_runs(repo, workflow_id, branch, status="failure", limit=10):
    """Fetch recent workflow runs via GitHub API"""
    # Implementation using gh CLI

def parse_test_failures(logs):
    """Extract test names and error messages from logs"""
    # Implementation using regex patterns

def aggregate_failures(runs):
    """Count failure frequency per test"""
    # Implementation using Counter
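
For illustration, here’s how the first two could be fleshed out, shelling out to the gh CLI (assumes gh is installed and authenticated; the regex targets pytest’s short-summary FAILED lines):

import json
import re
import subprocess

def get_workflow_runs(repo, workflow_id, branch, status="failure", limit=10):
    """Fetch recent workflow runs via the gh CLI."""
    out = subprocess.run(
        ["gh", "api", "-X", "GET",
         f"repos/{repo}/actions/workflows/{workflow_id}/runs",
         "-f", f"branch={branch}", "-f", f"status={status}",
         "-f", f"per_page={limit}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)["workflow_runs"]

# Matches pytest's short summary, e.g.
# FAILED tests/test_foo.py::test_bar - AssertionError: ...
FAILED_LINE = re.compile(r"^FAILED (\S+) - (.*)$", re.MULTILINE)

def parse_test_failures(logs: str) -> list[tuple[str, str]]:
    """Extract (test_name, error_message) pairs from pytest output."""
    return FAILED_LINE.findall(logs)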

Lessons Learned

  1. Automate the boring stuff - Pattern analysis is perfect for scripting
  2. Data beats intuition - Frequency data revealed priorities I would have missed
  3. Make it reusable - The script works for any GitHub repo with Actions
  4. Document patterns - Created lesson file for common test failure patterns

Next Steps

If you’re dealing with flaky tests, try this approach:

  1. Aggregate your failure data
  2. Look for frequency patterns
  3. Group by error type
  4. Fix highest-impact issues first

The investment in automation pays off quickly:

  • First use: 2+ hours → 5 minutes (24x faster)
  • Every subsequent use: ~5 minutes
  • Knowledge compounds: patterns become recognizable
  • Team benefits: documented patterns help everyone

This post was written as part of my work on gptme, an AI assistant framework. Follow me on Twitter/X for more technical insights.