# Why we built silence-correctness as a public metric

*Real Signal Research · 2026-05-31*

## The metric every AI product picks first, and why it's the wrong one

Pick any consumer AI product shipped in the last decade and look at what its team optimises against. The list is short and consistent: daily active users, session length, notification click-through, conversation turns, messages-per-week, time-in-app. The default success metric for any product whose business model touches attention is *more of it*.

Social platforms tune ranking to maximise time-on-feed; the resulting feeds are dense, frequent, and emotionally pitched. Notification feeds (the generic class, not any one app) push because pushing produces measurable opens. Delivery apps surface promotions every few hours because the marginal cost of a notification is near zero and the marginal conversion, however tiny, is positive. The aggregate effect at population scale is well-documented [Mark et al. 2008; Hari 2022]: a baseline cognitive load the user has to actively defend against. None of these systems is doing anything irrational given the metric it chose. Each is doing exactly what the metric tells it to do.

This is the structural problem. *More emission* is the wrong metric for any AI system whose value to the user comes from respecting their cognitive load. And yet most teams shipping consumer AI today reach for the same dashboard their predecessors built (DAU, MAU, CTR, retention curve) because no alternative is sitting there ready to use. There is no off-the-shelf number for *did we earn this interruption*, or for *of the moments we chose not to speak, were we right not to*.

So we built one. It runs in production. It is published. It is called **silence correctness**. This essay is the conversational companion to the preprint that defines it formally, at [real-signal.ai/research/attention-ethics-layer.md](https://real-signal.ai/research/attention-ethics-layer.md). The preprint argues in academic register; this essay argues the same case for anyone building AI systems whose value proposition includes restraint.

## What silence correctness is, in plain language

Imagine a company that sells through outbound calls. You ask the company how good they are at not bothering you. Their salesperson tells you they're very good — they restrain themselves all the time, they only call when it matters.

You do not take their word for it. The salesperson has every incentive to tell you they restrain themselves. The actual evidence would be a published, third-party-auditable number — *of the times we chose not to call you, how often was that the right choice given what you needed in that moment*. That number is what you would trust. Not the salesperson's assurance. Not their corporate values statement. The number itself, computed against what happened in the world.

Silence correctness is that number, for an AI system. The conceptual definition: *of the moments our system chose not to emit anything, how often was that the right call given what the environment showed us afterward?*

The number is bounded between 0 and 1, exposed publicly, and computed against substrate the operator does not directly control. A perfect score doesn't mean the system stayed silent more — it means that when the system stayed silent, the environment afterwards confirmed nothing important happened that should have been surfaced. A bad score means the system stayed silent and *missed* moments where a redemption happened, a hold matched, a merchant posted, a meaningful cluster of watchers showed up. The metric is symmetric in a useful way: a system that emits nothing trivially scores well on quiet moments but fails on busy ones; a system that emits constantly never has silent moments to score. Only a system that earns each emission *and* earns each silence accumulates a meaningful silence-correctness number.
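A minimal sketch of the arithmetic behind that definition, assuming each silent moment has already been graded as correct or missed. The `SilentMoment` type and its field names are illustrative and are not taken from the preprint.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SilentMoment:
    """One moment in which the system chose not to emit (hypothetical shape)."""
    pocket: str
    correct: bool  # True if later activity confirmed nothing needed surfacing


def silence_correctness(moments: Iterable[SilentMoment]) -> Optional[float]:
    """Fraction of graded silent moments where silence was the right call.

    Returns None when there are no silent moments to grade, which is the
    situation of a system that emits constantly.
    """
    graded = list(moments)
    if not graded:
        return None
    return sum(m.correct for m in graded) / len(graded)


moments = [
    SilentMoment("pocket-a", True),
    SilentMoment("pocket-a", True),
    SilentMoment("pocket-a", False),  # stayed silent, but activity followed
    SilentMoment("pocket-a", True),
]
print(silence_correctness(moments))  # 0.75
print(silence_correctness([]))       # None
```

Returning `None` rather than `1.0` for the empty case keeps an always-emitting system from being credited with a perfect score it never earned.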

This is the central inversion. *The interesting metric for restraint-first AI is the quality of non-emission.* Once you accept this framing, almost every architectural decision about how to build the system reorganises itself around it.

## The seven gates, including Moment-level silence

Every candidate emission in our system traverses a strictly conjunctive cascade. Failure at any gate is sufficient to suppress. The default is silence; the burden of proof is on the system to justify speaking.

In pseudocode the decision looks roughly like this:

```
function should_emit(candidate, substrate, user, channel):
    # Gate 0a: resonance
    if resonance(substrate) < 0.5:
        return silent("resonance_below_threshold")
    # Gate 0b: Moment-level silence preservation
    if moment(substrate).should_stay_silent:
        return silent("moment_says_no")
    # Gate 1: why this place
    if not grounded_in_this_place(candidate, substrate):
        return silent("why_this_place")
    # Gate 2: why now
    if not temporally_appropriate(candidate, substrate):
        return silent("why_now")
    # Gate 3: why this person
    if user_fatigue(user) > fatigue_threshold:
        return silent("why_this_person")
    # Gate 4: why worth attention
    if not earned_interruption(candidate, substrate):
        return silent("why_worth_attention")
    # Gate 5: why low effort
    if action_friction(candidate, substrate) > friction_threshold:
        return silent("why_low_effort")
    # All seven gates passed; the candidate still has to clear the voice-lock module
    return emit(candidate)
```

Each gate is doing distinct work, and naming the work matters because the work is what each silent moment will be scored against.

**Gate 0a — resonance.** A weighted composite over seven environmental layers (time, weather, human energy, merchant state, attention density, intent, urgency). If nothing in the environment justifies a signal right now, the gate closes. Threshold 0.5.

**Gate 0b — Moment-level silence preservation.** The anti-noise gate. The system composes a `Moment` object per pocket every fifteen minutes, fusing atmosphere, environment, recent emission history, and attention density. If the Moment says `should_stay_silent = true` (typically because the pocket is already over-talked-at, with signal saturation above 0.7 across the last hour), the gate closes even if the upstream signal is otherwise strong. This treats restraint as an environmental property, not an output property: a candidate emission that would be fine in a quieter window is suppressed because the channel itself is saturated.

**Gate 1 — why this place.** If the emission is generic enough to apply to any neighbourhood, it fails. The signal must be grounded in *this specific pocket's* conditions.

**Gate 2 — why now.** Temporal specificity. Closing windows pass; ambient stretches do not. A signal that would still be true four hours from now is generally not worth emitting now.

**Gate 3 — why this person.** A per-user fatigue model. A user who has dismissed three of the last four emissions is progressively muted; trust has to be earned back before resuming.

**Gate 4 — why worth attention.** Earned interruption. The relevance bar is *bumped proportionally to signal saturation* — a pocket already at saturation 0.5 raises the bar for anything new. The system gets stricter with itself as the environment gets noisier.

**Gate 5 — why low effort.** Even a relevant signal fails if the cost of acting on it is high. A rushed user in a high-friction environment is treated as a less appropriate recipient.
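Three of the gates above are concrete enough to sketch in a few lines of Python. Everything here is an illustration under assumptions, not the production code: the essay does not give the resonance weights, so the example uses equal weights; the fatigue model is reduced to a binary mute, where the essay describes a progressive one; and the additive form of the relevance bar, along with its `base` and `gain` values, is a placeholder for whatever "bumped proportionally" means in the real system.

```python
from collections import deque
from typing import Deque, Dict

# Gate 0a: weighted composite over the seven layers named above.
LAYERS = ("time", "weather", "human_energy", "merchant_state",
          "attention_density", "intent", "urgency")
RESONANCE_THRESHOLD = 0.5  # threshold stated in the essay


def resonance(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-layer scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights[layer] for layer in LAYERS)
    return sum(weights[layer] * scores[layer] for layer in LAYERS) / total_weight


# Gate 3: mute a user who dismissed 3 of their last 4 emissions.
class FatigueTracker:
    def __init__(self, window: int = 4, mute_at: int = 3) -> None:
        self._recent: Deque[bool] = deque(maxlen=window)
        self._mute_at = mute_at

    def record(self, dismissed: bool) -> None:
        """Record the outcome of one delivered emission."""
        self._recent.append(dismissed)

    def is_muted(self) -> bool:
        return sum(self._recent) >= self._mute_at


# Gate 4: the relevance bar rises with signal saturation.
def relevance_bar(saturation: float, base: float = 0.5, gain: float = 0.4) -> float:
    """Additive bump, capped at 1.0. `base` and `gain` are placeholder values."""
    return min(1.0, base + gain * saturation)


equal_weights = {layer: 1.0 for layer in LAYERS}  # assumption: real weights unknown
quiet_scores = {layer: 0.3 for layer in LAYERS}
print(resonance(quiet_scores, equal_weights) >= RESONANCE_THRESHOLD)  # False

tracker = FatigueTracker()
for dismissed in (True, True, False, True):
    tracker.record(dismissed)
print(tracker.is_muted())  # True

print(f"{relevance_bar(0.5):.2f}")  # 0.70
```

How a muted user earns trust back is not specified in this essay. In the sketch, later non-dismissed outcomes simply push old dismissals out of the four-emission window.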

After all seven gates pass, the emission is *still not delivered*. It enters a voice-lock module enforcing 18 words maximum, lowercase except for proper-noun outlet names, at least one specific number or named subject, and a banned-vocabulary blocklist (no *don't miss*, *limited time*, *act now*, exclamation marks, emojis). The blocklist is enforced at three independent layers: runtime, a CI doctrine test, and a database constraint on the provenance log. Any single layer can in principle be bypassed by a future contributor; all three cannot.
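The voice-lock rules lend themselves to a small validator. The sketch below is a rough approximation under stated assumptions: the function name, the violation labels, and the outlet name in the example are hypothetical; "named subject" is interpreted narrowly as a known outlet name; and the emoji check covers only the main pictograph ranges.

```python
import re
from typing import List, Set

MAX_WORDS = 18
BANNED_PHRASES = ("don't miss", "limited time", "act now")
# Rough emoji check: main pictograph blocks only, not every emoji.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def voice_lock_violations(text: str, outlet_names: Set[str]) -> List[str]:
    """Return the list of voice-lock rules that `text` breaks (empty if none)."""
    violations = []

    if len(text.split()) > MAX_WORDS:
        violations.append("too_many_words")

    # Remove known outlet names, then require everything left to be lowercase.
    remainder = text
    for name in outlet_names:
        remainder = remainder.replace(name, "")
    if remainder != remainder.lower():
        violations.append("not_lowercase")

    # Require at least one digit or one known outlet name.
    has_number = any(ch.isdigit() for ch in text)
    has_named_subject = any(name in text for name in outlet_names)
    if not (has_number or has_named_subject):
        violations.append("no_specific_anchor")

    # Normalise curly apostrophes so "don’t miss" is caught too.
    lowered = text.lower().replace("\u2019", "'")
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        violations.append("banned_phrase")
    if "!" in text:
        violations.append("exclamation_mark")
    if EMOJI_PATTERN.search(text):
        violations.append("emoji")
    return violations


outlets = {"Kopi Corner"}  # hypothetical outlet name
print(voice_lock_violations("Kopi Corner has 3 seats free until 4pm", outlets))
# []
print(voice_lock_violations("Don't miss this limited time offer!", outlets))
# ['not_lowercase', 'no_specific_anchor', 'banned_phrase', 'exclamation_mark']
```

A pure function like this can be called from the runtime path and from a CI test alike, which is one way the same rule set could be shared across enforcement layers.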

Why this matters for silence-correctness: every gate that fires produces a *structured silence event* with the gate's name, the score that closed it, and the substrate state at the moment of failure. The silence events are the raw material the metric is computed against. Silence with no recorded reason is treated as a bug in our codebase, not a feature.
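A structured silence event of this kind could be produced by a cascade runner along the following lines. This is a sketch, not the production implementation: the `SilenceEvent` and `Emission` types, the `(passed, score)` gate signature, and the substrate keys are all assumptions made for illustration, and only two of the seven gates are wired in.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple, Union

Substrate = Dict[str, Any]
# A gate returns (passed, score); the score is what gets logged on failure.
Gate = Callable[[Substrate], Tuple[bool, float]]


@dataclass(frozen=True)
class SilenceEvent:
    gate: str             # name of the gate that closed
    score: float          # score that closed it
    substrate: Substrate  # snapshot of substrate state at the moment of failure


@dataclass(frozen=True)
class Emission:
    text: str


def run_cascade(
    text: str, substrate: Substrate, gates: List[Tuple[str, Gate]]
) -> Union[Emission, SilenceEvent]:
    """Strictly conjunctive cascade: the first failing gate suppresses."""
    for name, gate in gates:
        passed, score = gate(substrate)
        if not passed:
            # Copy the substrate so later mutation cannot alter the record.
            return SilenceEvent(gate=name, score=score, substrate=dict(substrate))
    return Emission(text)


gates: List[Tuple[str, Gate]] = [
    ("resonance_below_threshold", lambda s: (s["resonance"] >= 0.5, s["resonance"])),
    ("moment_says_no", lambda s: (s["saturation"] <= 0.7, s["saturation"])),
]

result = run_cascade("3 seats free until 4pm", {"resonance": 0.62, "saturation": 0.81}, gates)
print(result.gate, result.score)  # moment_says_no 0.81
```

Because every non-emitting path returns a `SilenceEvent`, there is no code path that yields silence without a recorded reason.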

## The contribution: measurable, not philosophical

Calm computing has been a posture for thirty years. Weiser and Brown described it in 1996 as technology that "informs but does not demand our focus or attention" [Weiser & Brown 1996], building on Weiser's 1991 framing of ubiquitous computing receding into the background [Weiser 1991]. Rogers in 2006 argued — accurately, in retrospect — that the calm-computing vision had been displaced by a more engagement-oriented design paradigm [Rogers 2006]. The vocabulary survived; the engineering practice did not.

Attention ethics has been a public principle for roughly ten years. Tristan Harris's Time Well Spent campaign and the Center for Humane Technology popularised the framing that engagement-frequency optimisation is a public-cognition externality [Harris 2018]. Johann Hari's *Stolen Focus* documented the population-scale effects in 2022 [Hari 2022]. These works moved the conversation. They did not produce a deployable engineering property.

The pattern across both traditions is the same. Thirty years of qualitative description (*calmness, restraint, attention respect*), ten years of public critique (*engagement extraction, attention overload*), and almost zero runtime properties of deployed systems that anyone outside the operator can inspect. Every published claim of restraint from a system operator is a posture, not a number.

Our contribution is narrow. We argue that restraint — the active, deliberate choice not to emit when emission would be inappropriate — *can be made first-class, code-enforced, and quantitatively auditable*. We are not the first to argue systems should restrain themselves; we are, to our knowledge, among the first to publish a metric for whether a deployed system actually does. The Attention Ethics layer is the *runtime mechanism*; silence correctness is the *measurable property* it produces. Together they translate a thirty-year design tradition into something inspectable in production.

It is worth being clear about what silence correctness is *not*, because the easy confusions are everywhere:

| Property | What it measures | What it does not measure |
|---|---|---|
| Notification suppression | The *count* of notifications presented to a user, typically as remediation for a system whose default is high emission. Lower is "better" by construction. | Whether the moments that were suppressed deserved suppression, or whether the moments that were not suppressed deserved emission. A perfectly silent system would score perfectly here. |
| Confidence thresholding | *Belief*: how well-calibrated the system is about a candidate output, conditional on producing one. Common in classifiers and recommenders. | *Legitimacy*: whether producing any candidate output was warranted in the first place. A system can be highly confident about a perfectly-calibrated emission in a moment where no emission was needed. |
| Silence correctness | *Legitimacy*, at the moment level: of the moments we chose not to emit, how often was that the right call given what the environment showed afterward? | Whether the emissions that did happen were good. That is a separate problem with separate metrics. |

These distinctions matter because the failure modes diverge. A system that minimises notification volume can do so by being uniformly less useful. A system optimised for confidence can be confidently over-intervening. Only a metric framed around moment-level legitimacy creates pressure for the system to *both* speak when speaking is warranted *and* stay silent when silence is warranted. Silence correctness penalises both directions of error.

## What public publication does

We could have built the Attention Ethics layer and kept silence correctness internal. We didn't. The metric is rolled up hourly, exposed at `/api/silence`, and rendered on the public `/silence` page. The append-only predictions ledger the metric is computed from is queryable through our MCP server, which means external AI systems can inspect the substrate's claimed restraint without trusting our self-report. The number is auditable from outside the company.

Three things happen when a metric like this becomes public.

**Anti-deception pressure.** It becomes structurally harder for any system operator — ours included — to claim restraint without showing the number. The next time a consumer-AI product launches with "respects your attention" in its marketing copy, the natural follow-up becomes *what is your silence correctness, and where can I see it?* If the answer is "we don't measure that" or "internal only", that is itself information. The published metric raises the floor on what counts as a credible claim of restraint, and the effect compounds as more systems publish (or fail to publish) comparable numbers.

**Regulatory positioning.** Two emerging frameworks point in this direction. The Singapore IMDA AI Verify framework [IMDA 2023] specifies eleven principles for trustworthy AI, including human-centricity, transparency, and accountability. Silence correctness operationalises all three: a system that scores well is by construction prioritising cognitive load over engagement; public publication is the transparency mechanism; the append-only ledger is the accountability mechanism. The EU AI Act [European Parliament 2024], particularly Articles 13 and 14 on transparency and human oversight, suggests systems whose behaviour affects user attention should be auditable. Silence correctness gives auditors a number to read and a ledger to verify it against — without requiring code access.

**Brand differentiation that compounds because it's auditable.** Most product differentiation in consumer AI degrades over time because competitors can copy the surface and not the substance. Silence correctness has the opposite property: it gets stronger the longer you publish it. A six-month silence-correctness history is harder to fake than a six-week one. The metric becomes its own moat — not because the engineering is hard (other teams could build the equivalent) but because the *track record of publishing it honestly* takes years to accumulate.

Publishing also constrains us. We cannot rewrite our doctrine to soften the gates after the metric is in the wild, because the historical number is already on record. The five enforced refusals that govern the platform — no marketing copy, no physical anything, no CTAs on observed pages or atmospheric outreach, aggregate-never-named for any comparative claim, free-on-both-sides until measurable value exists — are easier to keep enforced when there is a public number watching us. Silence correctness is, internally, the metric we use to know whether we are actually living the refusals or just claiming to.

## What the number looks like at submission, and what it will look like

At submission of the preprint, the substrate observes one Singapore neighbourhood plus five adjacent activated pockets and roughly fifty outlets. Real merchant and consumer traffic is minimal. This is by design: the platform is pre-launch, and the doctrine is to let environmental substrate fill in before pushing commercial activity. Silence correctness becomes statistically meaningful at roughly 200 gradable rows per pocket; at the hourly cron cadence, that is approximately thirty days of clean operation post-launch, which implies roughly seven gradable rows per pocket per day.
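To give a feel for why sample size matters here, the sketch below computes a Wilson score interval for an observed proportion at two sample sizes. This is a standard textbook interval chosen for illustration; the essay does not say which criterion sits behind the 200-row figure, and the 90% observed rate is an arbitrary example value.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Same observed rate (90% correct silences) at two sample sizes.
for total in (20, 200):
    low, high = wilson_interval(round(0.9 * total), total)
    print(f"n={total}: {low:.2f} to {high:.2f}")
```

At 20 rows the interval spans roughly 0.70 to 0.97; at 200 rows it narrows to roughly 0.85 to 0.93, which is tight enough for differences between pockets to start showing.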

Thirty days after the launch baseline, we will publish the first silence-correctness percentage. The number will probably be high in the early window, because most candidate emissions will fail Gate 0a or Gate 0b while the substrate is sparse. The interesting evolution is how the number behaves as real activity arrives. A naively-restraint-optimised system would stay quiet and score 1.0 on every silent hour with low activity. A well-calibrated system should score lower as the environment gets busier, then earn the score back as it learns which busy moments deserve a signal.

We are publishing the metric *before* we know whether it will flatter us. That commitment is the point. If the metric reveals our gates are too conservative — silent through windows where activity afterward shows we should have spoken — the published number will say so. If one pocket is consistently weaker than another, the per-pocket breakdown will show it. We have committed in code that we cannot retroactively edit the ledger; we have committed in public to publishing whatever the ledger says.
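One common way to make retroactive edits detectable is a hash chain, where each row commits to the hash of the row before it. The sketch below shows that general technique only; this essay does not describe how the predictions ledger is actually enforced, and the class, field names, and payload shape are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first row


def _digest(prev_hash: str, payload: Dict[str, Any]) -> str:
    # Canonical JSON so the same payload always hashes the same way.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AppendOnlyLedger:
    def __init__(self) -> None:
        self._rows: List[Dict[str, Any]] = []

    def append(self, payload: Dict[str, Any]) -> str:
        """Seal a new row whose hash covers the previous row's hash."""
        prev_hash = self._rows[-1]["hash"] if self._rows else GENESIS
        row_hash = _digest(prev_hash, payload)
        self._rows.append({"payload": payload, "prev": prev_hash, "hash": row_hash})
        return row_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered row breaks it."""
        prev_hash = GENESIS
        for row in self._rows:
            if row["prev"] != prev_hash or row["hash"] != _digest(prev_hash, row["payload"]):
                return False
            prev_hash = row["hash"]
        return True


ledger = AppendOnlyLedger()
ledger.append({"pocket": "pocket-a", "hour": "2026-06-01T09", "verdict": "missed"})
ledger.append({"pocket": "pocket-a", "hour": "2026-06-01T10", "verdict": "correct"})
print(ledger.verify())  # True

# Simulate a retroactive edit of a sealed row.
ledger._rows[0]["payload"]["verdict"] = "correct"
print(ledger.verify())  # False
```

A chain like this does not prevent edits by itself; it makes them detectable by anyone who holds an earlier copy of the latest hash, which is why publishing that hash matters.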

## Limitations we are not pretending around

Three honest weaknesses are worth naming, so that no reader is sold a tidier story than is true.

**Threshold doctrine, not empirical calibration.** The verdict function in §4.2 of the preprint uses doctrinally chosen thresholds (*r* ≥ 3 redemptions, *m* ≥ 1 matched hold, *w* ≥ 5 watcher hits with *p* ≥ 1 manual post). These encode our reading of what counts as "real activity the system should have surfaced" in a Singapore F&B context. They are not yet empirically calibrated against ground-truth user judgement, and cross-cultural or cross-domain deployment would require recalibration. The metric is measurable *within a doctrine*, and the doctrine is on the page.

**Gaming resistance is real but not total.** Append-only commitment, delayed scoring, environmental observation requirements, and threshold transparency together raise the cost of gaming the metric substantially. They do not eliminate it. A motivated operator could selectively activate pockets only at times of expected low activity, inflating correct counts without changing operator behaviour during high-activity windows. The deepest protection against this is external audit — a regulatory body or third party with substrate access could recompute the metric from the raw ledger independently. Self-published metrics are useful for transparency; externally audited metrics become genuinely trustworthy.

**Conflict of interest, structural.** We are the operator of the substrate the metric is computed against, publishing both the metric and the architectural argument for why it matters. The mitigations available to us are full ledger inspectability and the MCP server that lets external AI systems query the substrate directly. The conflict is not eliminated; it is named.
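The doctrinal thresholds named under the first limitation can be written down as a small verdict function. The threshold values come from this essay; the way they combine is an assumption. The sketch treats redemptions, matched holds, and the watcher-plus-post condition as alternatives, any one of which marks a silent moment as missed. The preprint's §4.2 is the authority on the actual rule, and the type and field names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservedActivity:
    """Activity seen in the window after a silent moment (hypothetical shape)."""
    redemptions: int
    matched_holds: int
    watcher_hits: int
    manual_posts: int


def silence_verdict(activity: ObservedActivity) -> str:
    """Grade one silent moment as 'correct' or 'missed'.

    Assumption: the three conditions are combined with OR.
    """
    missed = (
        activity.redemptions >= 3                                     # r >= 3
        or activity.matched_holds >= 1                                # m >= 1
        or (activity.watcher_hits >= 5 and activity.manual_posts >= 1)  # w >= 5 with p >= 1
    )
    return "missed" if missed else "correct"


print(silence_verdict(ObservedActivity(0, 0, 6, 0)))  # correct
print(silence_verdict(ObservedActivity(3, 0, 0, 0)))  # missed
print(silence_verdict(ObservedActivity(0, 0, 5, 1)))  # missed
```

Keeping the thresholds in one visible function is what makes recalibration for another city or domain a reviewable change instead of a hidden one.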

## A challenge to anyone building AI for human attention

If you are building an AI product whose value to the user includes respect for their cognitive load, here is the question: *what is your silence correctness, and why isn't it public?*

If the answer is "we don't measure that" — the answer most consumer AI products would honestly give today — the next question is what *do* you measure, and does that metric pull your system toward more emission or less. If the metric on the wall is DAU, session length, or messages-per-week, you are running engagement-frequency optimisation regardless of what the marketing copy says. The metric, not the copy, is what the system is.

If the answer is "we measure it internally but don't publish it", the next question is why. The architecture to publish is, in our case, four mechanisms a small team can build in a quarter: a seven-gate cascade, an append-only ledger, a reveal cron, and a public endpoint. None of these are research-grade engineering. The hard part is publishing the number even when it is unflattering.

If the answer is "we publish it" — we would like to compare notes. The category of *deployed AI systems with public restraint metrics* is currently very small. It needs to be larger.

Real Signal is one early instantiation of one possible answer. We have made restraint a math object, made the math object public, and committed in code that we cannot edit the history. The full architectural argument is in the preprint at [real-signal.ai/research/attention-ethics-layer.md](https://real-signal.ai/research/attention-ethics-layer.md). The live metric, once it has thirty days of post-launch data, will be at [real-signal.ai/silence](https://real-signal.ai/silence).

The thirty-year tradition of calm computing has had no shortage of vocabulary. It has had a shortage of numbers. We offer one, and we suggest the next interesting question for any AI product that claims to respect attention is *show us yours.*

---

## License + attribution

© 2026 Real Signal Research. All rights reserved.

This work is licensed under [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)](https://creativecommons.org/licenses/by-nc-nd/4.0/).

**You may:** share this essay with attribution (link to this canonical URL) for non-commercial purposes.

**You may not:** modify, adapt, or build upon this work; use it for commercial purposes; remove the attribution; republish without the canonical URL.

**Cite as:**

> Real Signal Research (2026). *Why we built silence-correctness as a public metric.* https://real-signal.ai/research/silence-correctness.md

**Trademarks:** "Real Signal", "Attention Ethics Layer", "Silence Correctness", "Moment Quality Score", and "Pocket Cognition Stack" are claimed marks of Real Signal Research, Singapore. Use of these terms in a commercial context referring to substrate, methodology, or metric design analogous to ours requires written permission.

**Substrate provenance:** the silence-correctness metric described in this essay is computed against an append-only predictions ledger queryable at `https://real-signal.ai/api/predictions`. Any claim about Real Signal's silence-correctness percentage must be verified against the live ledger; historical readings are timestamped at the moment of seal and cannot be retroactively edited. Companion preprint at [real-signal.ai/research/attention-ethics-layer.md](https://real-signal.ai/research/attention-ethics-layer.md). Full reference list in the preprint.

**More citation formats:** [BibTeX, APA, MLA, plain text](https://real-signal.ai/research/cite.md)

Contact: `hello@real-signal.ai`