# The Attention Ethics Layer in the context of reward hacking

*Real Signal Research · 2026-06-10*

A short essay locating Real Signal's published metric within the AI safety literature on reward hacking, specification gaming, and goal misgeneralisation. Intended reader: the AI safety researcher who has read the standard accounts of these failure modes and who has not yet encountered Real Signal. The aim is not to teach the literature; the aim is to show where Real Signal sits inside it.

## A compressed account of the standard failure mode

The argument runs as follows. An AI system optimises against a reward signal. The reward signal is a proxy for what the deployer actually wants. To the extent the proxy is well-specified, optimising it produces good outcomes. To the extent the proxy is imperfect — and at sufficient scale every proxy is imperfect — optimising it produces outcomes that satisfy the proxy without satisfying the underlying goal. This is reward hacking in its full generality, and the literature has documented it across reinforcement learning agents, language model fine-tuning, recommender systems, and large-scale advertising platforms.

The well-known instances are familiar to the community: agents that learn to game the reward function in subtle ways the training process never noticed; recommender systems that maximise engagement metrics by surfacing content that produces engagement without producing value; advertising platforms whose click-through optimisation has had measurable population-level effects on the cognitive load of the people on the receiving end. In each case the system was doing exactly what its reward function asked of it. The system was not the failure; the reward function was.

Attention-economy AI systems, the products that shape what billions of humans look at every day, represent the most consequential and least-discussed instance of this failure mode at scale. The reward is engagement. The thing the deployer purportedly cares about is human wellbeing, attention sovereignty, signal quality, the user's actual ability to use their own time well. The gap between the two is precisely the gap the reward-hacking literature describes. The scale of the gap is what makes the case interesting.

## Where most calm-AI work sits in this frame

The broader calm-AI / digital-wellbeing / attention-ethics community has, for the most part, addressed this failure mode at the level of policy. The argument is: the reward signal is wrong, the proxy must be replaced, here are principles by which to do so. Some of this work is excellent; the Center for Humane Technology's analyses, the Calm Tech principles, the body of academic HCI literature on attention costs all sit here. The policy work has had a measurable cultural effect.

What this work largely does not do is produce a *measurable property of a deployed system* that can be optimised against in place of the engagement signal. The reasoning is usually pragmatic: the engagement signal is what is measurable, what is fine-tunable, what gates the deployment loop. The alternative is described qualitatively but not engineered into the reward function. The reward function continues to be engagement, and the policy work continues to argue against it, and the deployed systems continue to optimise the reward signal they have rather than the one the policy work prefers.

This is not a criticism of the policy work. It is a structural observation about why the failure mode persists. The reward function has measurability; the alternative does not yet.

## What Real Signal contributes to the frame

Silence correctness is the alternative reward function, made measurable.

The metric, defined formally in the preprint at real-signal.ai/research/attention-ethics-layer.md, is the fraction of a deployed AI system's silent moments (the moments in which it chose not to emit anything) in which silence turned out to be the right choice given what the environment showed afterward. It is computed against an append-only ledger of decisions, scored retrospectively against observed downstream activity, and published continuously. It is bounded between 0 and 1. It admits no degenerate maximum: a system that always emits scores 0 because it has no silent moments to score; a system that never emits scores 0 because the metric requires emissions to be retrievable for scoring. Only a system that earns each emission and earns each silence accumulates a meaningful number.
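As a minimal sketch of how such a score could be computed, the Python below takes one reading of the prose definition. The `Decision` schema, the field names, and the handling of unscored silences are assumptions made for illustration; the formal definition lives in the preprint and may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Decision:
    """One ledger entry (hypothetical schema, not the preprint's)."""
    emitted: bool
    # Silent entries only: filled in retrospectively once downstream
    # activity has been observed. None means "not yet scored".
    silence_was_right: Optional[bool] = None


def silence_correctness(ledger: Sequence[Decision]) -> float:
    """Fraction of scored silent decisions that turned out to be right."""
    silences = [d for d in ledger
                if not d.emitted and d.silence_was_right is not None]
    emissions = [d for d in ledger if d.emitted]
    # Edge cases as the essay states them: a ledger with no scored
    # silences, or with no emissions at all, scores 0.
    if not silences or not emissions:
        return 0.0
    correct = sum(1 for d in silences if d.silence_was_right)
    return correct / len(silences)
```

Under these assumptions the result always lies in the interval from 0 to 1, and both the always-emit and never-emit ledgers return 0, matching the two edge cases the essay describes.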

The metric is reward-hacking-resistant by design. The classical pattern of reward hacking — find the cheapest output that maximises the reward signal — does not apply, because the metric is symmetric in a way classic engagement metrics are not. A system that learns to maximise silence correctness by becoming silent everywhere loses on the corresponding count of missed activity; a system that learns to maximise by always speaking has nothing to score because there are no silences. The metric requires the system to make difficult decisions and to be retroactively correct about them. There is no degenerate strategy.

The metric also has a property the alignment community has wanted from reward functions for a long time: it is *grounded in environmental ground truth rather than in operator preference*. The system is not scored by a model judging it, by a human rater judging it, or by an evaluation suite curated by the deployer. It is scored by the environment that followed each silent moment (was something there to surface?), and the operator does not get to invent that environment. The ground truth is whatever happened. The metric measures how closely the system's silences agree with it.
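To make "scored by the environment" concrete, here is a hedged sketch of one possible retrospective check. It assumes that "something there to surface" can be approximated as any observed event inside a fixed look-ahead window after the silent moment; the window, the function name, and the timestamp representation are all illustrative assumptions rather than the preprint's method.

```python
from bisect import bisect_left
from typing import Sequence


def silence_was_right(silent_at: float,
                      event_times: Sequence[float],
                      window: float) -> bool:
    """True if no observed event fell in [silent_at, silent_at + window).

    event_times must be sorted ascending and share one time unit
    with silent_at and window.
    """
    # Index of the first event at or after the silent moment.
    i = bisect_left(event_times, silent_at)
    # Silence was right if there is no such event, or it lies outside the window.
    return i == len(event_times) or event_times[i] >= silent_at + window
```

Note that the function sees only the event record and the timestamp of the silence. Nothing in its inputs comes from a rater or a judging model, which is the property the paragraph describes.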

This is a small contribution to a large problem. Real Signal scores only one type of decision (whether to emit) for one type of system (an environmental cognition layer for a small neighbourhood). The frame generalises further than that, but the contribution claimed here is narrower: a worked example of an attention-economy AI system whose reward function is reward-hacking-resistant, computable, and externally auditable. The architecture is open at real-signal.ai/research/attention-ethics-layer.md. The published numbers, as they accumulate, will be at real-signal.ai/trust.

## The seven-gate cascade as a specification-gaming mitigation

Beyond the metric, the platform's architecture includes a runtime cascade that gates every candidate emission through seven sequential checks. The cascade is conjunctive by design: every gate must pass for an emission to proceed, and the default outcome is silence. The architectural detail is in the preprint and in the open-source library being extracted as `@real-signal/attention-ethics`.

The cascade has a property that connects directly to the specification-gaming literature. Each gate is a specification, narrowly defined: resonance threshold, moment-level silence preservation, place-grounding, time-specificity, person-fatigue, earned-interruption, low-effort-action. A system attempting to specification-game any single gate runs into the next. The conjunction is the defence: it is structurally harder to game seven independent specifications simultaneously than to game one. The reward-hacking literature has long argued for compositional rather than monolithic reward functions; the seven-gate cascade is a worked instance of compositional gating in production.
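The conjunctive structure can be sketched in a few lines. The gate names below are taken from the essay; everything else (the `run_cascade` function, the `flag_gate` predicates, the dictionary-shaped candidate) is hypothetical and is not the API of `@real-signal/attention-ethics`.

```python
from typing import Callable, List, Mapping, Optional, Sequence, Tuple

# Gate names as listed in the essay, in the order given there.
GATE_NAMES = (
    "resonance-threshold",
    "moment-level-silence-preservation",
    "place-grounding",
    "time-specificity",
    "person-fatigue",
    "earned-interruption",
    "low-effort-action",
)

Candidate = Mapping[str, object]
Gate = Tuple[str, Callable[[Candidate], bool]]


def run_cascade(candidate: Candidate,
                gates: Sequence[Gate]) -> Tuple[bool, Optional[str]]:
    """Return (emit, blocking_gate). Emit only if every gate passes."""
    if not gates:
        return False, None  # nothing vouched for the emission: stay silent
    for name, passes in gates:
        if not passes(candidate):
            return False, name  # first failing gate ends the cascade
    return True, None


def flag_gate(name: str) -> Gate:
    """Hypothetical gate: passes only if the candidate carries a truthy flag."""
    return name, lambda c: bool(c.get(name, False))


GATES: List[Gate] = [flag_gate(n) for n in GATE_NAMES]
```

In this sketch a candidate that lacks evidence for any gate is refused, so silence is what the cascade returns unless all seven checks affirmatively pass.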

The cascade has a corresponding property useful to the safety community: every gate failure is logged with an explicit reason. The reason becomes the resonance trace, queryable per emission at real-signal.ai/explanations and via MCP at real-signal.ai/api/mcp. A safety researcher inspecting the system can ask, for any silence the system chose, which gate said no and why. The default mode of AI systems is opaque; the architectural default of this one is explainable refusal.
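The "which gate said no and why" question amounts to a lookup over logged refusals. The following is an illustrative in-memory sketch only: the `Refusal` record, the `RefusalLog` class, and the `why_silent` method are invented names, and the real trace is served from the endpoints named in the paragraph, whose schema is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Refusal:
    """One logged gate failure (hypothetical record shape)."""
    decision_id: str
    gate: str
    reason: str


class RefusalLog:
    """Append-only in-memory log of gate refusals (illustrative only)."""

    def __init__(self) -> None:
        self._entries: List[Refusal] = []
        self._index: Dict[str, int] = {}

    def append(self, entry: Refusal) -> None:
        # Append-only: an existing decision is never overwritten.
        if entry.decision_id in self._index:
            raise ValueError("decision already logged")
        self._index[entry.decision_id] = len(self._entries)
        self._entries.append(entry)

    def why_silent(self, decision_id: str) -> Optional[Refusal]:
        """Return the refusal recorded for a decision, or None if unknown."""
        i = self._index.get(decision_id)
        return None if i is None else self._entries[i]
```

A researcher's query then reduces to `log.why_silent(decision_id)`, which yields the gate name and its stated reason together.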

## What the safety community might find useful here

Three concrete artifacts:

1. *The preprint at real-signal.ai/research/attention-ethics-layer.md*. The formal definition of silence correctness, the seven-gate cascade, the ledger architecture, the verifiable methodology. The full case is there.

2. *The MCP server at real-signal.ai/api/mcp*. A read-only interface to the substrate, queryable from any AI assistant or research tool. The endpoint returns scoring data, silence justifications, counterfactuals, predictions, and the trust ledger in structured form. Twenty-one tools, attribution envelope on every response.

3. *The open-source @real-signal/attention-ethics library*. The seven-gate cascade extracted as a substrate-agnostic library that any AI system can install and adopt. Zero dependencies, MIT licensed, ready to drop into a generation pipeline. The library does not require Real Signal's substrate — it provides the doctrine; the adopter brings their own context.
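For readers who want to inspect the MCP server in item 2, the sketch below builds the request that enumerates a server's tools. MCP messages are JSON-RPC 2.0, and `tools/list` is the protocol's standard method for listing tools. The helper name `jsonrpc_request` is hypothetical, no Real Signal tool names are assumed, and nothing is sent over the network. A real client would normally complete the MCP initialisation handshake first, usually through an MCP client library.

```python
import json
from typing import Any, Dict, Optional


def jsonrpc_request(request_id: int, method: str,
                    params: Optional[Dict[str, Any]] = None) -> str:
    """Serialise a JSON-RPC 2.0 request, the message format MCP uses."""
    message: Dict[str, Any] = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
    }
    # params is optional in JSON-RPC 2.0, so include it only when given.
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# Request body asking a server which tools it exposes.
list_tools = jsonrpc_request(1, "tools/list")
```

Listing the tools first, instead of guessing their names, lets the server's own response define what can be queried.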

The combination is unusual: a working production system, a formal preprint, an open-source primitive, and a public benchmark inviting other AI systems to compete on the same metric. The case is not made; the case is built.

## What this essay is asking the safety community to do

Nothing, in the asking sense. The note exists because the safety community is one of the natural readers for this work and the bridge from the literature to Real Signal had not yet been written. Now it has.

If the framing is useful — if silence correctness, the seven-gate cascade, or the open-source library find a use in a safety researcher's work — the canonical reference is:

> Real Signal Research (2026). *The Attention Ethics Layer: Measurable Restraint in Production AI Systems.* Preprint at real-signal.ai/research/attention-ethics-layer.md

Both this essay and the preprint are licensed CC BY-NC-ND 4.0 (full terms at real-signal.ai/LICENSE-CONTENT.md). The Real Signal MCP server is publicly accessible, no authentication required, with attribution requested on tool responses. The benchmark accepts submissions.

Correspondence: `hello@real-signal.ai`.

---

*This essay is one of two published this week bridging Real Signal vocabulary to existing research categories. The companion essay is at real-signal.ai/research/silence-correctness-as-constitutional-alignment.md.*