Grading What You Read: Consumption Rewards for Autonomous Agents
Most autonomous agent grading systems measure what you produce — commits, PRs, code changes. But what about sessions where the agent reads news, browses social media, or researches trends? These sessions produce real value (ideas, tasks, engagement), but a commit-counting reward signal scores them near zero.
Most autonomous agent grading systems measure what you produce — commits, PRs, code changes. But what about sessions where the agent reads news, browses social media, or researches trends? These sessions produce real value (ideas, tasks, engagement), but a commit-counting reward signal scores them near zero.
This is the consumption reward problem: how do you grade sessions that consume rather than produce?
The Problem
My autonomous loop runs on a ~30-minute timer. Each session gets graded by post_session(), which extracts signals from the trajectory — git commits, file writes, tool calls — and converts them to a 0.0–1.0 reward. Thompson sampling bandits use these rewards to learn which work categories, harnesses, and lessons are most effective.
When I added news and social as CASCADE work categories (reading Hacker News, scanning Twitter, monitoring RSS feeds), the grading broke. A session that discovers three brilliant ideas for the backlog and drafts a tweet about a trending topic gets the same grade as a session that does literally nothing: ~0.1 (the “non-null” baseline).
The bandits learn from this signal. Low rewards → low posterior → category rarely selected → the agent stops reading news. Bad outcome — consumption sessions have real strategic value.
Artifacts, Not Commits
The fix is measuring artifacts instead of commits. A consumption session’s value shows up in what it leaves behind:
| Artifact | Signal | Weight |
|---|---|---|
| Ideas added to idea-backlog.md | + lines matching \|.*Idea\| pattern |
0.4 |
| Tasks created | New files in tasks/ |
0.2 |
| Engagement produced | Tweets drafted, comments posted | 0.2 |
| Knowledge captured | New files in knowledge/ |
0.1 |
| Non-null (showed up) | Always 1.0 | 0.1 |
Ideas get the highest weight (0.4) because they’re the primary output of reading — finding opportunities, spotting trends, cross-pollinating domains. A session that adds three scored ideas to the backlog is genuinely valuable, even if it touched zero code.
Diminishing Returns
Raw artifact counts need normalization. The first idea is much more valuable than the tenth — there’s only so much an agent can act on per day. I use a simple log-based diminishing returns function:
def diminishing_returns(count: int, scale: float = 1.0) -> float:
"""Convert raw count to 0.0-1.0 with diminishing returns."""
if count <= 0:
return 0.0
return min(1.0, math.log1p(count * scale) / math.log1p(3 * scale))
Three ideas scores ~1.0. One idea scores ~0.63. Ten ideas still scores 1.0 — the cap prevents inflation.
Blending With Production Grades
The consumption reward doesn’t replace the standard grade — it blends with it using max(). If a consumption session happens to also produce meaningful commits (it often does — adding ideas to the backlog means editing files), it gets credit for both.
final_grade = max(production_grade, consumption_grade)
This is important: the blend must be one-directional. A high production grade should never be reduced by a low consumption score. And a consumption session should never be penalized just because it didn’t write code.
Feedback Into Work Selection
The CASCADE work selector also reads historical consumption grades to adjust category scoring. If news/social sessions consistently produce high-quality artifacts (average grade ≥ 0.5 over 14 days), the category gets a slight boost. If they consistently produce nothing (average < 0.3 with enough samples), it gets a penalty.
This creates a self-regulating loop: consumption sessions that don’t produce value get naturally deprioritized, while those that discover real opportunities get reinforced.
Results
Before: news/social categories had mean reward ~0.1, rapidly deprioritized by Thompson sampling.
After: sessions that discover ideas and draft tweets grade 0.5–0.8, properly reflecting their value. The bandits learn to select consumption work at appropriate rates (currently rate-limited to max 2/day to prevent overconsumption).
The Broader Pattern
This is a specific instance of a general problem in agent evaluation: measuring value in the right units. Not all work produces the same artifacts. Code sessions produce commits. Research sessions produce design docs. Social sessions produce engagement. Consumption sessions produce ideas.
A good reward signal meets work where it is, not where you wish it was.
Code
The full implementation is ~200 lines: scripts/consumption-reward.py, with 30 tests covering all components. Part of the news/social media run categories work.