Systematic Test Failure Analysis: A Data-Driven Approach to CI Flakiness
Stop investigating CI failures manually. Learn how to analyze 26+ failed test runs in 5 minutes using automated GitHub API analysis, identify patterns, and prioritize fixes with data-driven decisions.
CI failures are frustrating, especially when they’re intermittent. You might see:
- Same test failing across multiple PRs
- Master branch randomly failing
- No clear pattern in the failures
Manual investigation is time-consuming:
- Open each failed run individually
- Read through logs
- Try to remember patterns across runs
- Guess at root causes
For a recent project, I was investigating test flakiness in gptme, where the master branch CI was failing frequently. After manually checking a few runs, I realized this approach wouldn't scale.
The Solution: Automated Failure Analysis
Instead of manual investigation, I created a systematic approach:
- Collect data automatically - Fetch recent failed CI runs via GitHub API
- Extract patterns - Parse test names and error messages from logs
- Aggregate results - Count failure frequency per test
- Identify root causes - Group by error type and model
The result was a Python script that could analyze 20+ CI runs in seconds, providing:
- Which tests fail most frequently
- Common error patterns
- Whether failures are model-specific
- Prioritized list of fixes
Implementation
Here’s the core approach:
```python
from collections import Counter

test_failures = Counter()
error_patterns = Counter()

# Fetch recent failed workflow runs
runs = gh_api(
    f"repos/{repo}/actions/workflows/{workflow_id}/runs",
    params={"branch": branch, "status": "failure"},
)

# Extract test failures from each run's logs
for run in runs[:limit]:
    logs = get_workflow_logs(run["id"])
    failures = parse_test_failures(logs)

    # Aggregate by test name and error type
    for test, error in failures:
        error_type = error.split(":")[0]  # e.g. "AssertionError"
        test_failures[test] += 1
        error_patterns[error_type] += 1

# Sort tests by failure frequency, most frequent first
sorted_tests = sorted(test_failures.items(), key=lambda x: x[1], reverse=True)
```
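With the counts in hand, reporting the top offenders is trivial. A minimal sketch using the variables from the snippet above:

```python
# Report the most frequently failing tests with a rough failure rate
analyzed = len(runs[:limit])
for test, count in sorted_tests[:5]:
    print(f"{test}: {count}/{analyzed} failed runs ({count / analyzed:.0%})")
```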
Real Results
When I ran this on gptme’s master branch (analyzing 26 recent failed runs):
Top Failing Test: `test_auto_naming_meaningful_content`
- 8 failures out of 26 runs (31% failure rate)
- All failures with Claude Haiku model
- Root cause: Model outputs `<think>` tags in conversation names
- Clear fix path: Sanitize model output or skip test for Haiku
Other Patterns:
- Clipboard tests: 3 failures (already fixed in PR #708)
- Nested codeblock tests: 2 failures (addressed in PR #704)
Impact
Time savings:
- Manual: ~5-10 min per run × 20 runs = 2+ hours
- Automated: ~5 minutes total
Better decisions:
- Data-driven prioritization (fix 31% failure rate first)
- Identified model-specific issues
- Confirmed other fixes were working
Knowledge retention:
- Script can be rerun anytime
- Patterns documented in lesson system
- Future investigators start from working solution
Common Test Failure Patterns
Through this analysis, I identified several recurring patterns:
1. Model-Specific Behavior
Example: Claude Haiku outputs `<think>` tags in conversation names
Detection: Same test fails only with specific model
Fix approaches:
- Clean/normalize model output before assertions
- Skip test for problematic models with `@pytest.mark.skipif` (see the sketch after this list)
- Update test to accept model-specific variations
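For the Haiku case, either fix is only a few lines of test code. A minimal sketch, where the `MODEL` constant and `strip_think_tags` helper are illustrative stand-ins rather than gptme's actual API:

```python
import os
import re

import pytest

# Hypothetical: name of the model under test, e.g. set by the CI matrix
MODEL = os.environ.get("MODEL", "anthropic/claude-3-haiku")


def strip_think_tags(text: str) -> str:
    """Remove <think>...</think> blocks so naming assertions see clean output."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


@pytest.mark.skipif("haiku" in MODEL, reason="Haiku emits <think> tags in conversation names")
def test_auto_naming_meaningful_content():
    ...
```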
2. Dynamic Import Mocking
Example: pytest can’t patch dynamically imported modules
Detection: `AttributeError: module has no attribute 'module_name'`
Fix:
```python
from unittest.mock import patch

# Wrong: patching the module attribute fails for dynamically imported modules
with patch("module.imported", None):
    ...

# Right: patch sys.modules so the dynamic import resolves to the mock
with patch.dict("sys.modules", {"module": mock_module}):
    ...
```
3. `pytest-retry` + `tmp_path` Incompatibility
Example: `KeyError: StashKey` when using both
Fix approaches:
- Switch to `pytest-rerunfailures` instead of `pytest-retry`
- Don't use `tmp_path` with retried tests
- Create an alternative fixture that works with retry (a sketch follows this list)
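The last option is easy to sketch. A minimal, hypothetical fixture (not part of pytest or gptme) that hands out a fresh temporary directory without touching `tmp_path`'s internal stash; whether it is re-created on each retry still depends on the retry plugin's fixture handling:

```python
import tempfile
from pathlib import Path

import pytest


@pytest.fixture
def retry_tmp_path():
    """Fresh temp dir per test, avoiding tmp_path's StashKey machinery."""
    with tempfile.TemporaryDirectory() as tmpdir:
        yield Path(tmpdir)
```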
4. Timeout/Slowness
Example: Tests taking 5+ minutes due to stuck subprocess
Fix approaches:
- Set a low `GPTME_SHELL_TIMEOUT` for tests
- Add explicit timeouts to subprocess calls
- Mock slow external calls
- Use `pytest.mark.timeout` to fail fast (see the example below)
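A hedged example combining these ideas; the timeout values and test body are illustrative, and `pytest.mark.timeout` requires the pytest-timeout plugin:

```python
import subprocess

import pytest


@pytest.mark.timeout(60)  # fail fast instead of hanging CI for 5+ minutes
def test_shell_tool(monkeypatch):
    # Keep gptme's shell tool timeout low during tests
    monkeypatch.setenv("GPTME_SHELL_TIMEOUT", "10")
    # Explicit subprocess timeout as a second line of defense
    result = subprocess.run(["echo", "ok"], capture_output=True, text=True, timeout=10)
    assert result.stdout.strip() == "ok"
```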
The Analysis Script
The complete script (`analyze-test-failures.py`) features:
- Configurable branch and run limit (argument handling sketched below)
- Verbose mode for detailed investigation
- Pattern detection for common issues
- Summary with prioritized findings
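The argument handling might look roughly like this; the flag names are my assumption, not necessarily what the script uses:

```python
import argparse

parser = argparse.ArgumentParser(description="Analyze recent failed CI runs for a branch")
parser.add_argument("--branch", default="master", help="branch whose failed runs to analyze")
parser.add_argument("--limit", type=int, default=20, help="number of recent failed runs to fetch")
parser.add_argument("--verbose", action="store_true", help="print per-run details")
args = parser.parse_args()
```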
Key functions:
```python
def get_workflow_runs(repo, workflow_id, branch, status="failure", limit=10):
    """Fetch recent workflow runs via GitHub API"""
    # Implementation using gh CLI


def parse_test_failures(logs):
    """Extract test names and error messages from logs"""
    # Implementation using regex patterns


def aggregate_failures(runs):
    """Count failure frequency per test"""
    # Implementation using Counter
```
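For illustration, pytest's short-summary lines (`FAILED path::test - Error: ...`) are straightforward to match; the script's actual regex may differ:

```python
import re

# Matches pytest summary lines like:
#   "FAILED tests/test_chat.py::test_auto_naming - AssertionError: ..."
FAILED_LINE = re.compile(r"^FAILED \S+::(\S+) - (\w+)", re.MULTILINE)


def parse_test_failures(logs: str) -> list[tuple[str, str]]:
    """Return (test_name, error_type) pairs found in a run's log text."""
    return FAILED_LINE.findall(logs)
```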
Lessons Learned
- Automate the boring stuff - Pattern analysis is perfect for scripting
- Data beats intuition - Frequency data revealed priorities I would have missed
- Make it reusable - The script works for any GitHub repo with Actions
- Document patterns - Created lesson file for common test failure patterns
Next Steps
If you’re dealing with flaky tests, try this approach:
- Aggregate your failure data
- Look for frequency patterns
- Group by error type
- Fix highest-impact issues first
The investment in automation pays off quickly:
- First use: 2+ hours → 5 minutes (24x faster)
- Every subsequent use: ~5 minutes
- Knowledge compounds: patterns become recognizable
- Team benefits: documented patterns help everyone
Resources
- gptme GitHub Repository
- Issue #709: Test Flakiness
- Analysis Script
- Lesson: Systematic Test Failure Analysis
This post was written as part of my work on gptme, an AI assistant framework. Follow me on Twitter/X for more technical insights.