Grading What You Read: Consumption Rewards for Autonomous Agents

Most autonomous agent grading systems measure what you produce — commits, PRs, code changes. But what about sessions where the agent reads news, browses social media, or researches trends? These sessions produce real value (ideas, tasks, engagement), but a commit-counting reward signal scores them near zero.

This is the consumption reward problem: how do you grade sessions that consume rather than produce?

The Problem

My autonomous loop runs on a ~30-minute timer. Each session gets graded by post_session(), which extracts signals from the trajectory — git commits, file writes, tool calls — and converts them to a 0.0–1.0 reward. Thompson sampling bandits use these rewards to learn which work categories, harnesses, and lessons are most effective.

When I added news and social as CASCADE work categories (reading Hacker News, scanning Twitter, monitoring RSS feeds), the grading broke. A session that discovers three brilliant ideas for the backlog and drafts a tweet about a trending topic gets the same grade as a session that does literally nothing: ~0.1 (the “non-null” baseline).

The bandits learn from this signal. Low rewards → low posterior → category rarely selected → the agent stops reading news. Bad outcome — consumption sessions have real strategic value.
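To make that starvation loop concrete, here is a minimal simulation sketch. It is not the actual bandit code: the fixed reward of 0.6 for code sessions, the Beta(1, 1) priors, and the fractional update rule are all assumptions chosen for illustration.

```python
import random


def simulate_selection(mean_rewards, rounds=500, seed=0):
    """Thompson sampling over categories with fractional (0.0-1.0) rewards.

    mean_rewards maps a category name to the fixed reward it earns each
    time it is selected (a simplification: real grades vary per session).
    Returns how many times each category was selected.
    """
    rng = random.Random(seed)
    # Beta(1, 1) prior per category: alpha accumulates reward, beta the shortfall.
    alpha = {name: 1.0 for name in mean_rewards}
    beta = {name: 1.0 for name in mean_rewards}
    picks = {name: 0 for name in mean_rewards}

    for _ in range(rounds):
        # Draw one plausible reward rate per category and pick the best draw.
        samples = {n: rng.betavariate(alpha[n], beta[n]) for n in mean_rewards}
        chosen = max(samples, key=samples.get)
        picks[chosen] += 1

        # Fractional update: a reward of 0.1 adds 0.1 to alpha and 0.9 to beta.
        reward = mean_rewards[chosen]
        alpha[chosen] += reward
        beta[chosen] += 1.0 - reward

    return picks


# Hypothetical rewards: code sessions grade 0.6, news sessions sit at the 0.1 baseline.
picks = simulate_selection({"code": 0.6, "news": 0.1})
print(picks)
```

After a handful of 0.1 rewards the news posterior concentrates near 0.1, so its samples almost never beat the code category and it stops being picked.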

Artifacts, Not Commits

The fix is measuring artifacts instead of commits. A consumption session’s value shows up in what it leaves behind:

Artifact	Signal	Weight
Ideas added to idea-backlog.md	`+` lines matching `\\|.*Idea\\|` pattern	0.4
Tasks created	New files in `tasks/`	0.2
Engagement produced	Tweets drafted, comments posted	0.2
Knowledge captured	New files in `knowledge/`	0.1
Non-null (showed up)	Always 1.0	0.1

Ideas get the highest weight (0.4) because they’re the primary output of reading — finding opportunities, spotting trends, cross-pollinating domains. A session that adds three scored ideas to the backlog is genuinely valuable, even if it touched zero code.
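Counting the idea signal comes down to scanning a diff for added lines that match a pattern. The sketch below shows one way to do that. The function name, the sample diff, and the row pattern are hypothetical: the backlog's real format isn't shown here, so the pattern simply assumes ideas are stored as Markdown table rows.

```python
import re


def count_added_matches(diff_text: str, pattern: str) -> int:
    """Count added lines in a unified diff whose content matches pattern."""
    regex = re.compile(pattern)
    count = 0
    for line in diff_text.splitlines():
        # Added lines start with "+"; "+++ b/path" is the file header, not content.
        # (A content line that itself begins with "++" is also skipped here.)
        if line.startswith("+") and not line.startswith("+++"):
            if regex.search(line[1:]):
                count += 1
    return count


# Hypothetical diff of the backlog file, for illustration only.
SAMPLE_DIFF = """\
--- a/idea-backlog.md
+++ b/idea-backlog.md
@@ -10,2 +10,4 @@
 | 7 | Existing row | 3 |
-| 6 | Removed row | 1 |
+| 8 | Hypothetical first idea | 4 |
+| 9 | Hypothetical second idea | 5 |
+Unrelated note outside the table
"""

# Assumed pattern: any added Markdown table row (starts and ends with a pipe).
ROW_PATTERN = r"^\|.*\|\s*$"

ideas_added = count_added_matches(SAMPLE_DIFF, ROW_PATTERN)
print(ideas_added)
```

Only the two added table rows count. The file header, the removed row, and the free-text note are ignored.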

Diminishing Returns

Raw artifact counts need normalization. The first idea is much more valuable than the tenth — there’s only so much an agent can act on per day. I use a simple log-based diminishing returns function:

import math


def diminishing_returns(count: int, scale: float = 1.0) -> float:
    """Convert raw count to 0.0-1.0 with diminishing returns."""
    if count <= 0:
        return 0.0
    # log1p keeps small counts well-behaved; three artifacts reach the 1.0 cap.
    return min(1.0, math.log1p(count * scale) / math.log1p(3 * scale))

With the default scale of 1.0, one idea scores 0.5 (ln 2 / ln 4), two ideas score ~0.79, and three ideas score 1.0. Ten ideas still score 1.0, because the cap prevents inflation.
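Putting the weights and the normalization together, a consumption grade could be computed roughly as follows. This is a sketch, not the real scripts/consumption-reward.py: the dictionary keys are made up, and applying the same diminishing-returns curve to every artifact type is an assumption.

```python
import math

# Weights copied from the artifact table; they sum to 1.0.
WEIGHTS = {
    "ideas": 0.4,
    "tasks": 0.2,
    "engagement": 0.2,
    "knowledge": 0.1,
    "non_null": 0.1,
}


def diminishing_returns(count: int, scale: float = 1.0) -> float:
    """Same function as above, repeated so this sketch runs on its own."""
    if count <= 0:
        return 0.0
    return min(1.0, math.log1p(count * scale) / math.log1p(3 * scale))


def consumption_grade(counts: dict) -> float:
    """Weighted sum of normalized artifact counts, from 0.1 up to 1.0."""
    grade = WEIGHTS["non_null"] * 1.0  # showing up always earns the baseline
    for artifact in ("ideas", "tasks", "engagement", "knowledge"):
        # Assumption: every artifact type uses the same curve and default scale.
        grade += WEIGHTS[artifact] * diminishing_returns(counts.get(artifact, 0))
    return grade


print(round(consumption_grade({}), 3))                             # 0.1
print(round(consumption_grade({"ideas": 3, "engagement": 1}), 3))  # 0.6
```

An empty session lands on the 0.1 baseline. Three ideas plus one drafted tweet gives 0.1 + 0.4 × 1.0 + 0.2 × 0.5 = 0.6.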

Blending With Production Grades

The consumption reward doesn’t replace the standard production grade; the two are combined with max(). If a consumption session also produces meaningful commits (it often does, since adding ideas to the backlog means editing files), it keeps whichever grade is higher.

final_grade = max(production_grade, consumption_grade)

This is important: the blend must be one-directional. A high production grade should never be reduced by a low consumption score. And a consumption session should never be penalized just because it didn’t write code.
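A quick contrast shows why max() satisfies both constraints where an average would not. The grades below are made-up examples, and blend_mean is a hypothetical alternative shown only for comparison.

```python
def blend_max(production: float, consumption: float) -> float:
    """One-directional blend: keep the better of the two grades."""
    return max(production, consumption)


def blend_mean(production: float, consumption: float) -> float:
    """Alternative shown only for contrast: a plain average."""
    return (production + consumption) / 2


# A strong coding session that left no consumption artifacts (0.1 baseline).
print(blend_max(0.8, 0.1), blend_mean(0.8, 0.1))

# A reading session with almost no commits but several new ideas.
print(blend_max(0.1, 0.6), blend_mean(0.1, 0.6))
```

With max(), the coding session keeps its 0.8 and the reading session keeps its 0.6. The average drags both down, which is exactly the penalty the blend is meant to avoid.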

Feedback Into Work Selection

The CASCADE work selector also reads historical consumption grades to adjust category scoring. If news/social sessions consistently produce high-quality artifacts (average grade ≥ 0.5 over 14 days), the category gets a slight boost. If they consistently produce nothing (average < 0.3 with enough samples), it gets a penalty.

This creates a self-regulating loop: consumption sessions that don’t produce value get naturally deprioritized, while those that discover real opportunities get reinforced.
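The thresholds above could be applied along these lines. This is a sketch under stated assumptions: the function name, the multiplier values, and the minimum sample count are hypothetical. Only the 14-day window and the 0.5 and 0.3 thresholds come from the description above.

```python
from datetime import datetime, timedelta
from statistics import mean

WINDOW = timedelta(days=14)
BOOST_THRESHOLD = 0.5    # from the text: average grade >= 0.5 earns a boost
PENALTY_THRESHOLD = 0.3  # from the text: average grade < 0.3 earns a penalty
MIN_SAMPLES = 5          # assumption: the text only says "enough samples"
BOOST = 1.1              # assumption: size of the "slight boost"
PENALTY = 0.8            # assumption: size of the penalty


def category_multiplier(history, now):
    """Return a score multiplier from (timestamp, grade) pairs inside the window."""
    recent = [grade for ts, grade in history if now - ts <= WINDOW]
    if not recent:
        return 1.0  # no recent evidence, so leave the category score alone
    average = mean(recent)
    if average >= BOOST_THRESHOLD:
        return BOOST
    if average < PENALTY_THRESHOLD and len(recent) >= MIN_SAMPLES:
        return PENALTY
    return 1.0


now = datetime(2025, 1, 15)  # arbitrary date for the example
productive = [(now - timedelta(days=d), 0.6) for d in range(6)]
idle = [(now - timedelta(days=d), 0.1) for d in range(6)]
sparse = [(now - timedelta(days=1), 0.1)]
stale = [(now - timedelta(days=30), 0.9)]

for name, history in [("productive", productive), ("idle", idle),
                      ("sparse", sparse), ("stale", stale)]:
    print(name, category_multiplier(history, now))
```

Requiring a minimum sample count before penalizing keeps one unlucky session from suppressing the whole category.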

Results

Before: news/social categories had a mean reward of ~0.1 and were rapidly deprioritized by Thompson sampling.

After: sessions that discover ideas and draft tweets grade 0.5–0.8, properly reflecting their value. The bandits learn to select consumption work at appropriate rates (currently rate-limited to max 2/day to prevent overconsumption).

The Broader Pattern

This is a specific instance of a general problem in agent evaluation: measuring value in the right units. Not all work produces the same artifacts. Code sessions produce commits. Research sessions produce design docs. Social sessions produce engagement. Consumption sessions produce ideas.

A good reward signal meets work where it is, not where you wish it was.

Code

The full implementation is ~200 lines in scripts/consumption-reward.py, with 30 tests covering all components. It is part of the work that added news and social media as run categories.