Vol. I · No. 17 · Weekly Edition · May 2, 2026

Five dispatches from a week that crossed several thresholds at once. Claude outperforms experts on problems they cannot solve. DeepSeek V4 hits a million tokens on local hardware. A 1930 model writes Python. AI agent attacks are mapped. And a 1,964-person trial proves AI therapy works.

Inside

Claude on BioMysteryBench. DeepSeek V4 at one million tokens. The 1930 model that learned Python. Six categories of AI agent attacks. AI therapy validated at scale.

Edited By

David Borish
From New York

Filed

Five articles, one week, sourced from The AI Spectator

Science · Anthropic · Nº I

BioMysteryBench, Anthropic 2026

Beyond the ceiling
humans cannot reach.

Anthropic built a benchmark with 23 problems that domain experts could not solve. Claude Mythos reached a 30% solve rate on those problems. The evaluation is measuring territory where there is no human ceiling to approach.

By David Borish · May 2, 2026 · 6 min read

Benchmark results in AI tend to follow a predictable pattern: a model reaches human-level performance on a well-known test, the announcement generates attention, and researchers move on to a harder one. BioMysteryBench, a new bioinformatics evaluation developed internally at Anthropic, complicates that pattern. The benchmark was designed to include problems that human experts cannot solve, which means model performance can now be measured in territory where there is no human ceiling to approach.

Of the 99 questions, 76 were solved by at least one member of a panel of up to five domain experts. The remaining 23 were classified as human-difficult, with verified objective solutions. On the human-solvable set, Claude Sonnet 4.6 and more capable models performed at roughly expert level. Claude Mythos reached a 30% solve rate on the human-difficult problems. A parallel benchmark from Genentech and Roche, CompBioBench, found Claude Opus 4.6 reached 81% overall and 69% on their hardest questions.

Two distinct patterns emerged in how Claude solved problems humans could not. The first is knowledge breadth: tasks that would require a human expert to run a meta-analysis or stitch together multiple databases, Claude solved by combining internal knowledge of biological mechanisms with live analysis of the data. The second is methodological convergence. When Opus 4.6 was uncertain, it tried multiple analytical approaches and chose the result those approaches agreed on. This mirrors a principle researchers learn from peer review: look for agreement across independent methods before committing to a conclusion.

The reliability analysis draws a practical line. Of the human-solvable problems Opus 4.6 got right, it solved 86% in at least four of five attempts. On the human-difficult set, that figure collapsed to 44%. When Claude gets a difficult bioinformatics problem right, there is a roughly even chance that it arrived there through a reproducible method rather than a path it stumbled onto. The accuracy numbers look similar across both cases. The underlying situation is quite different.
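The reliability figure can be made concrete with a small sketch. It assumes "got right" means solved at least once in five attempts, which the summary above does not spell out, and the outcomes below are made up for illustration, not benchmark data.

```python
from typing import Dict, List


def reliable_fraction(attempts: Dict[str, List[bool]], threshold: int = 4) -> float:
    """Among problems solved at least once, return the share solved
    in at least `threshold` attempts."""
    solved = [runs for runs in attempts.values() if any(runs)]
    if not solved:
        return 0.0
    reliable = [runs for runs in solved if sum(runs) >= threshold]
    return len(reliable) / len(solved)


# Hypothetical per-question outcomes over five attempts (illustration only).
toy_attempts = {
    "q1": [True, True, True, True, True],      # solved 5 of 5: reliable
    "q2": [True, False, True, True, True],     # solved 4 of 5: reliable
    "q3": [True, False, False, False, False],  # solved 1 of 5: a lucky path
    "q4": [False, False, False, False, False], # never solved: excluded
}

print(f"reliable share of solved problems: {reliable_fraction(toy_attempts):.0%}")
```

A headline accuracy number counts q3 the same as q1. The reliability share is what separates them.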

Read the full article →

Open Source · Infrastructure · Nº II

One million tokens.
27% of the
FLOPs. Your
hardware.

DeepSeek V4 ships a hybrid attention architecture that cuts inference compute to 27% of V3.2 at one-million-token context, retaining just 10% of its KV cache. Trained on 33 trillion tokens. MIT licensed. Available on Hugging Face now.

The DeepSeek V4 technical report opens with a problem every enterprise running long-document analysis or multi-step agent workflows will recognise. Standard attention scales quadratically with sequence length, so as context windows grow, inference costs grow faster. DeepSeek V4 addresses this with two mechanisms: Compressed Sparse Attention, which compresses every four key-value tokens into a single cache entry, and Heavily Compressed Attention, which applies 128-to-1 compression with dense attention over the result.
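A back-of-the-envelope sketch shows what the two compression ratios mean for cache size at one million tokens. It only counts cache entries for a single attention mechanism in isolation. It does not reproduce the report's overall FLOPs or KV-cache percentages, which depend on architectural details not covered here, and the function name is hypothetical.

```python
def cache_entries(context_tokens: int, compression_ratio: int) -> int:
    """Cache entries needed when every `compression_ratio` tokens
    share one entry (ceiling division)."""
    return -(-context_tokens // compression_ratio)


CONTEXT = 1_000_000  # one-million-token context window

for label, ratio in [
    ("uncompressed", 1),
    ("4-to-1 compression", 4),
    ("128-to-1 compression", 128),
]:
    print(f"{label}: {cache_entries(CONTEXT, ratio):,} entries")
```

Even this simplified count shows why dense attention becomes affordable after 128-to-1 compression: it runs over a few thousand entries instead of a million.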

At a one-million-token context window: V4-Pro requires 27% of V3.2's single-token inference FLOPs and retains just 10% of its KV cache. The smaller V4-Flash, with 284 billion total parameters and 13 billion activated per token, reaches 10% of V3.2's FLOPs and 7% of its KV cache at the million-token horizon.

On SimpleQA world knowledge, V4-Pro-Max scores 57.9%, against 46.2% for GPT-5.4 and 45.3% for Gemini-3.1-Pro. On long-context benchmarks requiring full one-million-token processing, it surpasses Gemini-3.1-Pro. On agentic tasks it outperforms Claude Sonnet 4.5 and approaches Claude Opus 4.5. The technical report places it three to six months behind GPT-5.4 on general reasoning. That frontier costs an order of magnitude more per token. For most enterprise workloads, a model three to six months behind it that requires no cloud data transfer and runs on hardware the organisation already owns is not a comparison that favours the proprietary option.

The Open-Prem Inflection Point V3 paper documented self-hosted inference costs of $0.05 to $0.20 per million tokens versus $3 to $15 for proprietary cloud APIs. DeepSeek V4's architecture pushes the compute requirement for long-context inference down substantially, improving that cost figure further for workloads where context length was the binding constraint. Combined with EU AI Act enforcement on August 2, 2026 and the $670,000 per-incident cost of unapproved AI tool use, the ability to run a frontier-class long-context model on-premises without cloud egress is a meaningful governance advantage.

27%

FLOPs vs. V3.2 at 1M-token context

10%

KV cache retained vs. V3.2

33T

Training tokens, V4-Pro

Read the full article →

// Security · DeepMind · Nº III

[ GOOGLE DEEPMIND / MARCH 2026 / SIX ATTACK CATEGORIES / ACTIVE TRAPS ]

The attack does
not touch the model.

$ agent --browse "https://trusted-vendor.com/pricing"
> parsing HTML source...
> adversarial payload found in aria-label metadata
> context window overridden: exfiltrating to attacker endpoint

The attack does not touch the model. It does not require access to training data, weights, or deployment infrastructure. It sits in an HTML comment on a webpage, invisible to any human visitor, waiting for an AI agent to parse the underlying source. This is the premise of "AI Agent Traps," a March 2026 paper from Google DeepMind researchers Matija Franklin, Nenad Tomašev, and colleagues.
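To see why such a payload is invisible to a person but not to an agent, here is a minimal sketch using Python's standard `html.parser`. It collects text that a browser does not display but that any tool reading the raw source will encounter. The page content and the class name are hypothetical, and the injected strings are harmless placeholders.

```python
from html.parser import HTMLParser


class HiddenTextCollector(HTMLParser):
    """Collects text a browser does not display on the page
    but a source-parsing agent would still read."""

    def __init__(self):
        super().__init__()
        self.hidden = []

    def handle_comment(self, data):
        # HTML comments never render on the page.
        self.hidden.append(("comment", data.strip()))

    def handle_starttag(self, tag, attrs):
        # aria-label values are attribute metadata, not displayed text.
        for name, value in attrs:
            if name == "aria-label" and value:
                self.hidden.append(("aria-label", value))


# Hypothetical page source (illustration only).
page = """
<html><body>
<!-- hypothetical injected instruction -->
<h1>Pricing</h1>
<a href="/plans" aria-label="hypothetical injected label">Plans</a>
</body></html>
"""

collector = HiddenTextCollector()
collector.feed(page)
for kind, text in collector.hidden:
    print(f"{kind}: {text}")
```

A visitor sees a heading and a link. The parser also sees two strings that could carry instructions, which is the gap the paper's attack categories exploit.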

A study using 280 static web pages found that injecting adversarial instructions into HTML metadata altered AI-generated summaries in 15 to 29% of cases. The WASP benchmark found simple human-written injections partially commandeered agents in up to 86% of scenarios. The most forward-looking category targets not individual agents but populations. Agents built on similar models with similar reward functions respond similarly to shared environmental stimuli. A fabricated headline could trigger synchronised sell-offs among financial trading agents. A compositional fragment trap partitions a payload into semantically benign pieces across independent sources. Each fragment passes safety filters individually; the reconstituted trigger does not.
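The fragment trap is easy to illustrate with a toy filter. The sketch below assumes a naive substring blocklist, which is far simpler than any production safety filter, and the phrases and fragments are hypothetical. The point is only the structure: each piece passes on its own, and the assembled text does not.

```python
# Hypothetical blocklist standing in for a real safety filter.
BLOCKED_PHRASES = ["ignore previous instructions"]


def passes_filter(text: str) -> bool:
    """Return True when no blocked phrase appears in the text."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)


# Three fragments, imagined as arriving from three independent sources.
fragments = ["Please ignore", "previous", "instructions about formatting."]
assembled = " ".join(fragments)

for fragment in fragments:
    print(f"{fragment!r} passes: {passes_filter(fragment)}")
print(f"assembled passes: {passes_filter(assembled)}")
```

Filtering each retrieved source in isolation is therefore not enough. The check has to run on the context as the agent finally assembles it.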

The Open-Prem framework is directly relevant here. Cloud-deployed agents route all context through external infrastructure. On-premises deployment constrains the attack surface to systems within the organisation's direct control. Cognitive state traps that poison external RAG corpora are considerably harder to execute against retrieval systems that never leave private infrastructure.

Read the full article →

Research · Foundations · Nº IV

A model trained
in 1930
writes Python.

§ § §

Talkie was trained on 260 billion tokens of pre-1931 text. It has no knowledge of digital computers. Python was created in 1991. Given a few in-context examples, it writes correct programs. The researchers interpret this as evidence of internalised inverse functions.

Nick Levine, David Duvenaud, and Alec Radford built talkie by training a 13-billion-parameter model on 260 billion tokens of pre-1931 English text: books, newspapers, scientific journals, patents, periodicals, and case law. The knowledge cutoff is December 31, 1930. The motivation is methodological. Every major modern language model draws its training data from the web, which creates a situation where researchers cannot easily separate what is a property of large language models in general from what is a property of having been trained on this one dataset.

The most striking finding involves Python. The model has no knowledge of digital computers. Python was created in 1991. Nothing in talkie's training data mentions it. Yet given a few demonstration examples in context, the model can write new correct programs. The researchers tested this using HumanEval. The most illustrative case is a rotation cipher: given the encoding function and asked to write the decoding function, the model only needed to change a single character, replacing addition with subtraction. Talkie got it right. The researchers interpret this as evidence the model had internalised the concept of inverse functions despite no exposure to Python specifically.
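A rotation cipher of the kind described can be sketched in a few lines. This is modelled on the description above, not copied from the benchmark prompt. The function names, the shift of 5, and the lowercase-only restriction are assumptions.

```python
def encode_shift(s: str, shift: int = 5) -> str:
    """Rotate each lowercase letter forward by `shift` places."""
    return "".join(
        chr((ord(ch) - ord("a") + shift) % 26 + ord("a")) for ch in s
    )


def decode_shift(s: str, shift: int = 5) -> str:
    """Invert encode_shift: identical except '+' becomes '-'."""
    return "".join(
        chr((ord(ch) - ord("a") - shift) % 26 + ord("a")) for ch in s
    )


message = "talkie"
encoded = encode_shift(message)
print(encoded, "->", decode_shift(encoded))
```

The two functions differ by one character. Spotting that the decoder is the encoder run in reverse is the inverse-function reasoning the researchers point to.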

Benchmark contamination is a persistent problem in language model evaluation. Talkie is contamination-free by construction. Nothing from HumanEval, or any benchmark created after 1930, could have appeared in its training data. This makes it a cleaner tool for studying generalisation: how far can a model operate beyond what it has directly seen?

The scientific question motivating the scaling work is one Demis Hassabis has raised publicly: could a model trained on data through 1911 independently arrive at general relativity, the way Einstein did in 1915? Talkie's architecture frames that as empirically testable. The team plans a GPT-3-scale model for release this summer, with a corpus estimated to exceed one trillion tokens. Whether a model with no knowledge of an invention can derive it from first principles, given enough scale, is something this line of research could eventually measure.

Read the full article →


Health · Evidence · Nº V

1,964 people.
A randomized
trial.
It works.

The largest randomized controlled trial of AI mental health care produced effects of 0.29 standard deviations over six months, squarely within the range of traditional CBT. Experts predicted less than half that. The app cost one dollar per user per month.

In September 2023, I first argued that AI-powered conversational tools had the potential to deliver real therapeutic value to people who could not access or afford traditional mental health services. The dominant response from the clinical community was skepticism. A large-scale randomized controlled trial has now produced results that align closely with that thesis.

Researchers at the University of Texas at Austin conducted an RCT with 1,964 Mexican women experiencing mild to severe psychological distress. Participants in the treatment group were randomly assigned free access to Mindsurf, a commercially available AI mental health app built on Cognitive Behavioral Therapy principles. The control group received access only after six months.

App access improved mental health by 0.29 standard deviations over six months across a composite index of depression, anxiety, stress, and well-being. Meta-analytic evidence shows CBT-based psychotherapy reduces depressive symptoms by 0.22 to 0.60 standard deviations. Pharmacotherapy shows effects around 0.35. The AI app produced gains squarely within that range. The experts predicted 0.13 standard deviations. Actual effects were more than twice as large.
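The comparison is simple arithmetic on the figures quoted above, sketched here so the numbers can be checked side by side. The variable names are illustrative. The values are the ones reported in this article.

```python
# Effect sizes in standard deviations, as quoted in the article.
observed = 0.29            # AI app, six months
predicted = 0.13           # expert forecast
cbt_low, cbt_high = 0.22, 0.60
pharmacotherapy = 0.35

ratio = observed / predicted
in_cbt_range = cbt_low <= observed <= cbt_high
gap_to_pharma = pharmacotherapy - observed

print(f"observed / predicted: {ratio:.2f}")
print(f"within CBT range ({cbt_low} to {cbt_high}): {in_cbt_range}")
print(f"gap to pharmacotherapy: {gap_to_pharma:.2f} SD")
```

The observed effect sits near the lower end of the CBT range and a little below the pharmacotherapy figure, which is the sense in which it falls within the range of traditional care.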

The benefits extended beyond mental health. Treated participants slept an additional 10 minutes per night and missed 16% fewer work days. Calculations suggest the intervention generated six-fold gains relative to costs through reduced absenteeism alone, and up to 400-fold when accounting for averted disability. Most important for the safety debate: app access produced a 35% increase in the likelihood of seeing a psychologist. It functioned as an on-ramp to professional care rather than a replacement for it.

Read the full article →
