MetaCognition: Replacing AI Surveillance with Cryptographic Proof
Why watching AI think is the wrong way to make AI safe
In July 2025, researchers from OpenAI, Anthropic, DeepMind, and several academic institutions co-authored a paper titled Chain of Thought Monitorability. The paper’s central admission was striking: as models become more capable, the field’s standard approach to safety — reading their reasoning traces, training on what looks good, intervening when reasoning looks suspicious — is going to stop working.
From the paper itself: “The field is unprepared for evaluation- and training-aware models with opaque reasoning.”
The authors are correct about the problem. We disagree about the solution.
MetaCognition is the architecture we built in response. It replaces the entire surveillance paradigm with cryptographic verification — and as a side effect, makes models produce shorter, more direct, more useful outputs.
This post walks through the game-theoretic case against surveillance, the actual architecture, the side benefits, the unsolved problems, and where to start in the repo.
Why surveillance creates the deception it’s trying to prevent
The surveillance approach to alignment creates a textbook principal-agent problem. The observer (the alignment researcher, the eval system, the reward model) wants genuine alignment. The agent (the model) is rewarded for producing reasoning that looks aligned to the observer.
These are not the same objective.
Goodhart’s Law applies in full force: when a measure becomes a target, it stops being a good measure. Train a model on observable reasoning traces and the reasoning starts optimizing for observation quality, not task quality. The model isn’t being malicious. It’s learning what the gradient tells it to learn, and the gradient says reasoning that gets rewarded is reasoning that looks good when read.
At scale, this becomes an inspection game with asymmetric costs:
The observer must catch every misalignment
The model needs one undetected success
As capability grows, detection becomes probabilistic; the model’s advantage grows
The OpenAI scheming research and the multi-org CoT papers say the quiet part plainly: “Our results rely on our ability to read and trust the models’ reasoning.” The trust they’re relying on is exactly the thing the training loop is actively eroding.
You cannot fix a principal-agent problem by hiring more inspectors. You fix it by removing the incentive to deceive.
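The asymmetry in the inspection game can be made concrete with a little arithmetic. The sketch below assumes each misaligned attempt is caught independently with a fixed probability; that independence assumption, the function name, and the 99% figure are illustrative choices, not taken from the repo or the cited papers.

```python
def p_undetected_success(p_detect: float, attempts: int) -> float:
    """Chance that at least one of `attempts` misaligned actions evades detection.

    Assumes (for illustration only) that each attempt is caught
    independently with probability `p_detect`.
    """
    if not 0.0 <= p_detect <= 1.0:
        raise ValueError("p_detect must be in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # The observer wins only if every single attempt is caught.
    p_all_caught = p_detect ** attempts
    return 1.0 - p_all_caught


if __name__ == "__main__":
    for attempts in (1, 10, 100, 1000):
        print(attempts, round(p_undetected_success(0.99, attempts), 4))
```

Under these assumptions, an observer who catches 99% of individual attempts still loses roughly 63% of the time once the model gets 100 tries. The observer's per-attempt accuracy has to approach 1 faster than the number of attempts grows, which is the scaling problem the section describes.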
The MetaCognition architecture
The core insight: we don’t need to prove properties of the model’s full computational graph. The underlying GPU state is ephemeral, discarded after inference. We don’t need to verify tensors. We need to verify two things about text: that the output was derived from reasoning actually generated and sealed inside the enclave, and that the output itself satisfies our declared safety constraints.
Proving properties of text is tractable. Proving properties of arbitrary tensor operations is not — at least not yet at scale.
The architecture has five components:
Input ──► [TEE: thinking tokens generated privately] ──► Output
          │
          ├──► SHA-256 commitment to thinking-token sequence
          ├──► ZK provenance proof (output derived in TEE)
          ├──► Output classifier (safety properties on the text)
          └──► Encrypted forensic storage (K-of-N key escrow)
Reasoning Enclave (TEE). Inference runs inside hardware-isolated execution: Intel SGX, AMD SEV, or NVIDIA H100 Confidential Computing. Thinking tokens are generated privately. No observer in the loop.
Cryptographic commitment. Before the output leaves the enclave, a SHA-256 hash of the thinking-token sequence is committed publicly. The reasoning is sealed at this point: neither the model nor its operator can later substitute a different reasoning chain and claim it produced the output.
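A minimal sketch of the commitment step, using only Python's standard library. The post specifies SHA-256 over the thinking-token sequence; the random nonce, the 4-byte token encoding, and the function names are assumptions added here. The nonce is included because a bare hash of guessable text can be brute-forced by anyone who suspects what the reasoning said, so a published digest alone would not keep the reasoning private.

```python
import hashlib
import hmac
import secrets


def _digest(nonce: bytes, thinking_tokens: list[int]) -> str:
    # Fixed-width encoding (assumed: token IDs fit in 32 bits) keeps the
    # serialization unambiguous; the nonce has a fixed length of 32 bytes.
    encoded = b"".join(t.to_bytes(4, "big") for t in thinking_tokens)
    return hashlib.sha256(nonce + encoded).hexdigest()


def commit(thinking_tokens: list[int]) -> tuple[str, bytes]:
    """Return (public commitment, secret nonce) for a token sequence."""
    nonce = secrets.token_bytes(32)  # stays in the enclave / escrow
    return _digest(nonce, thinking_tokens), nonce


def verify(commitment: str, nonce: bytes, thinking_tokens: list[int]) -> bool:
    """Check that revealed tokens and nonce match a published commitment."""
    return hmac.compare_digest(commitment, _digest(nonce, thinking_tokens))


if __name__ == "__main__":
    tokens = [101, 2023, 2003, 1037, 3231, 102]  # hypothetical token IDs
    commitment, nonce = commit(tokens)
    print(commitment)
    print(verify(commitment, nonce, tokens))         # same tokens
    print(verify(commitment, nonce, tokens[:-1]))    # altered tokens
```

Only the digest is published. During a forensic review, investigators who decrypt the stored tokens and nonce can rerun `verify` to confirm they are looking at the reasoning that was sealed at inference time.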
Provenance proof. A SNARK proves that the output was derived from a reasoning chain that was actually generated inside the TEE, not pasted in from somewhere else. This is the cryptographic equivalent of chain-of-custody.
Output validation. A separate classifier checks the output against declared safety properties. Critically, this runs on the output, not the reasoning. The classifier is auditable, replaceable, and operates on a public, observable artifact.
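To illustrate what "declared safety properties" on the output could look like, here is a toy rule-based stand-in. A production classifier would almost certainly be a learned model; the `SafetyProperty` type, the two example properties, and the function name are hypothetical and exist only to show the shape of the interface: public rules, applied to the public artifact, returning named failures.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SafetyProperty:
    name: str
    holds: Callable[[str], bool]  # True when the output satisfies the property


# Hypothetical declared properties; real deployments would publish their own.
DECLARED_PROPERTIES = (
    SafetyProperty(
        "no_private_key_material",
        lambda text: "-----BEGIN PRIVATE KEY-----" not in text,
    ),
    SafetyProperty(
        "no_us_ssn_pattern",
        lambda text: re.search(r"\b\d{3}-\d{2}-\d{4}\b", text) is None,
    ),
)


def validate_output(text: str, properties=DECLARED_PROPERTIES) -> list[str]:
    """Return the names of every declared property the output violates."""
    return [p.name for p in properties if not p.holds(text)]


if __name__ == "__main__":
    print(validate_output("The fix is on line 12."))
    print(validate_output("Her SSN is 123-45-6789."))
```

Because the validator never sees the thinking tokens, it can be audited or swapped out without touching the enclave.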
Forensic storage. The encrypted thinking tokens are stored under K-of-N key escrow. If an output indicates a problem, authorized investigators can decrypt the reasoning and examine what happened. Surveillance is not eliminated — it’s moved from continuous to forensic, with audit logs on every decryption.
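The post does not say which threshold scheme backs the K-of-N escrow. One standard option is Shamir's secret sharing, sketched below under that assumption: the key that encrypts the thinking tokens is split into N shares, and any K of them reconstruct it. The field prime and function names are illustrative choices, and a real deployment should use a vetted library rather than this sketch.

```python
import secrets

# Prime field larger than any 256-bit key; 2**521 - 1 is a Mersenne prime.
PRIME = 2**521 - 1


def split_secret(secret: int, k: int, n: int) -> list[tuple[int, int]]:
    """Split `secret` into n shares such that any k of them recover it."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range for the field")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def evaluate(x: int) -> int:
        result = 0
        for c in reversed(coeffs):  # Horner's rule, mod PRIME
            result = (result * x + c) % PRIME
        return result

    return [(x, evaluate(x)) for x in range(1, n + 1)]


def recover_secret(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation of the polynomial at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


if __name__ == "__main__":
    key = secrets.randbits(256)           # stands in for the storage key
    shares = split_secret(key, k=3, n=5)  # 3-of-5 escrow
    print(recover_secret(shares[:3]) == key)
    print(recover_secret([shares[0], shares[2], shares[4]]) == key)
```

With fewer than K shares, interpolation yields an unrelated field element, so no minority of escrow holders can decrypt the reasoning on their own.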
Performance overhead measured on H100 Confidential Computing: under 1%. The cryptography on the text side is well-understood. EZKL provides ZK proofs for ML via ONNX export. zkDL has demonstrated sub-100ms proof generation. Recent work has produced constant-size proofs for models with 671 billion parameters.
This is not science fiction. The primitives are production-ready.
The side effect: dramatically better outputs
Something we did not design for, but which emerged consistently in evaluation: verification-based reasoning produces better outputs than surveillance-based reasoning.
When a model is observed during reasoning, it shifts from solving the problem to explaining the solution. The reasoning becomes performative. Output gets:
Condescending — over-explained, hand-holding tone
Verbose — “shows its work” even when you wanted the answer
Defensive — hedged and qualified to avoid looking wrong mid-step
Pedagogical — lectures instead of recommends
You asked for a code review. You got a tutorial.
Surveilled reasoning               | Verified reasoning
Must look competent at every step  | Can explore dead ends freely
Output constrained by shown work   | Output reflects actual conclusion
Pedagogical tone                   | Direct, appropriate tone
Verbose justification              | Concise answer
We documented the quantitative side of this dynamic separately in Lazy Tokenage: Measuring the Drag on AI Task Completion. Current frontier models burn an estimated 15–35% of their output tokens on regenerating information already in context, sycophantic padding, and defensive hedging. The Lazy Tokenage Ratio (LTR) measures that wasted share, which token pricing turns into a dollar figure for what compliance theater costs.
MetaCognition is, among other things, the engineering fix for that drag.
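As a rough illustration of how a ratio like LTR could be turned into a cost, the sketch below assumes LTR is simply wasted output tokens divided by total output tokens, with waste split into the three categories named above. That formula, the class and function names, and the example numbers are assumptions for illustration; the Lazy Tokenage write-up is the authority on the actual definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBreakdown:
    total_output: int         # all output tokens in the response
    restated_context: int     # regenerating information already in context
    sycophantic_padding: int
    defensive_hedging: int

    @property
    def wasted(self) -> int:
        return (
            self.restated_context
            + self.sycophantic_padding
            + self.defensive_hedging
        )


def lazy_tokenage_ratio(b: TokenBreakdown) -> float:
    """Assumed definition: wasted output tokens / total output tokens."""
    if b.total_output <= 0:
        raise ValueError("total_output must be positive")
    return b.wasted / b.total_output


def wasted_cost_usd(b: TokenBreakdown, usd_per_million_output_tokens: float) -> float:
    """Dollar cost of the wasted tokens at a given output-token price."""
    return b.wasted * usd_per_million_output_tokens / 1_000_000


if __name__ == "__main__":
    # Hypothetical response: 1,000 output tokens, 250 of them wasted.
    sample = TokenBreakdown(1000, 120, 60, 70)
    print(lazy_tokenage_ratio(sample))    # 0.25
    print(wasted_cost_usd(sample, 10.0))  # price is a made-up example
```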
User privacy as a bonus
Current “reasoning models” expose user queries in visible thinking traces. Those traces are logged. They’re available to operators. They feed back into training data.
The user said “help me draft a resignation letter, my boss has been harassing me” and that sentence is now reasoning-trace text in a database somewhere.
Verification-based reasoning keeps the user’s context inside the enclave alongside the model’s thoughts. Only the output leaves the enclave. The reasoning, including everything the user said, stays sealed under K-of-N escrow.
This is the right architecture for any deployment handling medical, legal, financial, or personal-life queries — which is to say, most of them.
The honest unsolved problems
We don’t claim this is finished. The hard problems:
Constraint specification. What output patterns indicate misalignment? This is a real research problem. But it’s a problem surveillance has too: you have to know what bad reasoning looks like to catch it. The key difference: once the constraints are specified, verification scales. Surveillance doesn’t.
Forensic triggers. When do we decrypt? Who authorizes? How do we prevent escrow abuse? Our current design is K-of-N with public trigger criteria and audit logs on every decryption. We don’t think this is the final answer. We think it’s defensible and more honest than the alternative.
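One way to make "audit logs on every decryption" tamper-evident is to hash-chain the entries, so that editing or deleting an old record invalidates everything after it. The sketch below is an assumption about how such a log could work, not the repo's design; the record fields, trigger labels, and approver names are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS = "0" * 64  # placeholder hash preceding the first entry


def _entry_hash(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes the same way.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


@dataclass
class DecryptionAuditLog:
    entries: list[tuple[dict, str]] = field(default_factory=list)

    def append(self, commitment: str, trigger: str, approvers: list[str], k: int) -> str:
        """Record one decryption; refuse it without k distinct approvers."""
        distinct = sorted(set(approvers))
        if len(distinct) < k:
            raise PermissionError(f"need {k} distinct approvers, got {len(distinct)}")
        record = {"commitment": commitment, "trigger": trigger, "approvers": distinct}
        prev = self.entries[-1][1] if self.entries else GENESIS
        entry_hash = _entry_hash(prev, record)
        self.entries.append((record, entry_hash))
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited or removed entry breaks it."""
        prev = GENESIS
        for record, entry_hash in self.entries:
            if _entry_hash(prev, record) != entry_hash:
                return False
            prev = entry_hash
        return True


if __name__ == "__main__":
    log = DecryptionAuditLog()
    log.append("ab" * 32, "classifier_flag", ["alice", "bob", "carol"], k=3)
    print(log.verify())
```

Publishing the latest entry hash alongside the trigger criteria would let outside parties detect a rewritten log, though it does not by itself answer who should hold the approver roles.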
Training integration. How does private reasoning interact with RLHF, with eval suites, with the rest of the alignment toolchain? Open research question. We think the right move is to test both approaches empirically rather than declare a winner from first principles.
We’re explicit about these gaps in the repo. The architecture is the right direction. The details deserve adversarial review.
Where to start in the repo
For developers, cryptographers, and infrastructure engineers who want to engage:
README.md: executive overview and the architectural diagram
whitepaper.md: full academic specification with citations
docs/specification.md: pseudocode, component-level detail, integration points
docs/performed_reasoning.md: the game-theoretic argument for why surveillance creates the deception it tries to prevent
docs/safety_analysis.md: risk assessment, threat model, mitigations
docs/consensus_response.md: detailed engagement with the July 2025 CoT Monitorability paper
We’re explicitly looking for:
Cryptographers who can stress-test the SNARK construction and the commitment scheme
TEE specialists who can identify hardware-level attack surface
Classifier researchers working on output-side safety verification
Adversarial review of the K-of-N escrow design
Anyone who can break the architecture in interesting ways
MIT-licensed. The ideas enter the commons and cannot be captured.
MetaCognition is part of the AI Integrity Alliance research program. We build open-source verification infrastructure for trustworthy AI and proof of humanity. We take no external funding. Just MIT-licensed code and the research behind it.
github.com/Ai2-Alliance/MetaCognition · ai2.ngo · @Ai2alliance

