Research

Verified evidence.

Every claim on the homepage is backed by a reproducible experiment. Raw data, methodology, and on-chain proofs below.

Featured experiments

April 2026 · Base mainnet

Crypto agents · 1,083 transactions · 12 hours

81.9% control · 99.9% Helix · 195 reverts prevented · 100% on-chain

Paired A/B test on Base mainnet (chain ID 8453). Every failure scenario was sent to both arms at the same block. Control = blind retry on revert. Helix = PCEC pipeline with Repair Graph lookup. Result: 195 reverts prevented over a 12-hour window. Every tx hash is verifiable on BaseScan.
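The headline figures can be cross-checked with simple arithmetic. The sketch below assumes that each arm received all 1,083 scenarios and that "reverts prevented" means the difference in successful transactions between the two arms. Neither assumption is stated explicitly above, and the per-arm success counts are derived from the rounded rates, not reported.

```python
# Consistency check on the published figures.
# Assumption: each arm saw all 1,083 scenarios (not stated explicitly in the write-up).
SCENARIOS_PER_ARM = 1083
CONTROL_RATE = 0.819  # reported control success rate
HELIX_RATE = 0.999    # reported Helix success rate


def implied_successes(rate: float, total: int) -> int:
    """Nearest whole number of successes consistent with a rounded rate."""
    return round(rate * total)


control_ok = implied_successes(CONTROL_RATE, SCENARIOS_PER_ARM)
helix_ok = implied_successes(HELIX_RATE, SCENARIOS_PER_ARM)

# Assumption: "reverts prevented" = extra successes in the Helix arm.
reverts_prevented = helix_ok - control_ok

print(control_ok, helix_ok, reverts_prevented)
```

Under these assumptions the implied counts are 887 and 1,082 successes, a difference of 195, which matches the reported number of reverts prevented.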

Raw transaction data ↗ · Repair strategy breakdown ↗ · Methodology ↗

April 2026 · 5 frontier models

EVM revert classification · 10 failure modes

GPT-4o-mini 50% · GPT-4o 80% · Claude 4.5 Sonnet 90% · GPT-5.4-mini 90% · GPT-5.4 90% · Helix (PCEC) 100%

Ten production revert messages from Base and Ethereum mainnet, classified by failure cause. All five frontier models failed on the same case: a bare "execution reverted" with no reason string. Helix converges to 100% because the Repair Graph remembers previously resolved failures: the ceiling is accumulated repair data, not model capability.
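The idea of consulting memory before falling back to a model can be illustrated with a minimal sketch. Everything here is hypothetical: `RepairMemory`, `classify`, and the cause labels are illustrative names, not the Helix API, and keying on the message text alone is a simplification (a bare revert would likely need extra context, such as the target contract, to be disambiguated).

```python
from typing import Callable, Dict, Optional


def normalize(revert_message: str) -> str:
    """Collapse case and whitespace so equivalent messages share one key."""
    return " ".join(revert_message.lower().split())


class RepairMemory:
    """Hypothetical store of previously resolved revert messages."""

    def __init__(self) -> None:
        self._causes: Dict[str, str] = {}

    def record(self, revert_message: str, cause: str) -> None:
        self._causes[normalize(revert_message)] = cause

    def lookup(self, revert_message: str) -> Optional[str]:
        return self._causes.get(normalize(revert_message))


def classify(
    revert_message: str,
    memory: RepairMemory,
    fallback: Callable[[str], str],
) -> str:
    """Return the remembered cause if one exists, else ask the fallback classifier."""
    known = memory.lookup(revert_message)
    return known if known is not None else fallback(revert_message)


memory = RepairMemory()
# "cause-a" is a placeholder label, not a real failure category from the benchmark.
memory.record("Execution  Reverted", "cause-a")
print(classify("execution reverted", memory, lambda msg: "unknown"))
```

In this sketch the fallback stands in for a model call; a message seen before never reaches it, which is one way a memory layer could close a gap that model capability alone leaves open.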

Methodology and N=10 results ↗

Theoretical context

Helix’s PCEC pipeline is a concrete instance of the Generator–Verifier–Updater (GVU) operator framework. The Repair Graph functions as a memory-evolution layer in the sense of recent episodic-memory work: patterns with high utility are retrieved more often, and patterns with a low Q-value are naturally forgotten. We did not invent these ideas. Our contribution is the application: a production-ready substrate that turns these primitives into observable agent reliability.
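One common way to realize utility-weighted retrieval with forgetting is an exponential-moving-average Q-value update plus a pruning threshold. The sketch below illustrates that general technique only; the class name, learning rate, initial value, and threshold are assumptions chosen for the example and do not describe the Repair Graph's actual internals.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Pattern:
    name: str
    q: float = 0.5  # estimated utility; neutral starting value (assumed)


@dataclass
class UtilityMemory:
    alpha: float = 0.3         # learning rate (assumed)
    forget_below: float = 0.1  # pruning threshold (assumed)
    patterns: Dict[str, Pattern] = field(default_factory=dict)

    def add(self, name: str) -> None:
        self.patterns[name] = Pattern(name)

    def retrieve(self, k: int = 1) -> List[Pattern]:
        """Return the k patterns with the highest estimated utility."""
        ranked = sorted(self.patterns.values(), key=lambda p: p.q, reverse=True)
        return ranked[:k]

    def update(self, name: str, reward: float) -> None:
        """Move q toward the observed reward; drop the pattern if q falls too low."""
        pattern = self.patterns[name]
        pattern.q += self.alpha * (reward - pattern.q)
        if pattern.q < self.forget_below:
            del self.patterns[name]


memory = UtilityMemory()
memory.add("pattern-a")
memory.add("pattern-b")
memory.update("pattern-a", reward=1.0)      # a verified success raises q
for _ in range(5):
    memory.update("pattern-b", reward=0.0)  # repeated failures decay q until pruned
print([p.name for p in memory.retrieve(k=2)])
```

Read against the GVU framing, `retrieve` loosely plays the generator's role of proposing a candidate, the reward stands in for a verifier's verdict, and `update` is the updater step.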

Related work

GVU operator framework: defines the criteria for a self-improving system
Episodic memory + Q-value utility (MemRL family)
Experience distillation into strategic principles

Reproducibility

Every result above is reproducible. We do not publish numbers that can’t be re-run.

Base mainnet A/B test: full tx hashes and CSV in the gist linked above. Reproduce by replaying the failure scenarios at the recorded block heights.
LLM benchmark: the 10 production revert messages are listed in the gist. Reproduce by querying the same models with the same prompts.
All A/B test data and methodology are public. No proprietary preprocessing.
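Besides BaseScan, a transaction's outcome can be checked against any Base mainnet JSON-RPC node. The sketch below builds a standard `eth_getTransactionReceipt` request and interprets the result; the helper names are illustrative, the endpoint is left to the reader, and no network call is made here.

```python
import json
from typing import Any, Dict

BASE_CHAIN_ID = 8453  # Base mainnet, as stated in the experiment description


def receipt_request(tx_hash: str, request_id: int = 1) -> str:
    """Build the JSON-RPC body for a transaction receipt lookup."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "eth_getTransactionReceipt",
            "params": [tx_hash],
        }
    )


def chain_id_matches(chain_id_hex: str) -> bool:
    """Check an eth_chainId result (hex string) against Base mainnet."""
    return int(chain_id_hex, 16) == BASE_CHAIN_ID


def tx_succeeded(receipt: Dict[str, Any]) -> bool:
    """A receipt status of 0x1 means success; 0x0 means the transaction reverted."""
    return int(receipt["status"], 16) == 1


# Example with hand-written values (not real experiment data):
print(chain_id_matches("0x2105"))        # 0x2105 is 8453 in hex
print(tx_succeeded({"status": "0x0"}))   # a reverted transaction
```

Posting the request body to a node and passing the `result` object to `tx_succeeded` for each hash in the CSV would give an independent tally of successes and reverts per arm.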

What’s next

We’re writing up two further experiments: (a) autonomous resolution in Web2 microservices (91% across 4 production services, zero LLM calls); (b) how the PCEC architecture behaves as the public Repair Graph scales past 500 patterns. Expected Q3 2026.

All raw data and methodology are public. We do not believe in unreproducible claims.
