Five stories from the week in artificial intelligence — architecture bets, orbital data centers, the creativity tax of accuracy, real-time models, and a benchmark that broke its own ceiling.
Subquadratic's $29M claim against transformer dominance. Google and SpaceX negotiate the first orbital data centers. Harvard researchers find a way to unlock what models hide. Thinking Machines Lab ships an interaction model built from scratch. Claude Mythos breaks METR's benchmark suite.
A Miami startup says it has solved AI's most persistent compute constraint. Third-party evaluation will confirm or complicate those numbers over the next few months. The field is watching.
The Exponential Replacement Curve's doubling time has compressed from seven months to 4.3. METR's benchmark hit its ceiling. These are not unrelated developments. — davidborish.com
The AI Spectator, May 16, 2026
A Miami startup raised $29 million and launched out of stealth with a direct challenge to transformer architecture's quadratic compute cost. The benchmark numbers are strong. The validation process has just begun.
The transformer architecture has powered nearly every frontier AI model since its introduction in 2017. It also carries a structural cost that has shaped product design, infrastructure budgets, and competitive strategy across the industry: compute requirements scale quadratically with context length. Every token is compared against every other token. Double the input length and you roughly quadruple the processing cost.
Subquadratic launched out of stealth on May 5, 2026, with $29 million in seed funding and a model called SubQ 1M-Preview, built on what the company calls Subquadratic Sparse Attention. The architecture is designed to process context with compute that scales linearly rather than quadratically. Rather than comparing every token against every other, it selects a small subset of positions for each query token and computes exact attention only over those.
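The general idea can be sketched in a few lines. This is not Subquadratic's method, which the company has not detailed here. The selection rule (a short local window plus a handful of evenly spaced earlier positions) and every name below are hypothetical, chosen only so that each query attends over a fixed-size set of positions.

```python
import math

def select_positions(i, window=4, num_global=4):
    """Hypothetical selector: recent positions plus a few evenly spaced earlier ones.

    The result never exceeds window + num_global entries, so the per-query
    cost stays constant as the context grows.
    """
    local = set(range(max(0, i - window + 1), i + 1))
    spaced = {(i * g) // num_global for g in range(num_global)}
    return sorted(local | spaced)

def sparse_attention(queries, keys, values, window=4, num_global=4):
    """Exact softmax attention, computed only over the selected positions."""
    outputs, scored = [], 0
    dim = len(values[0])
    for i, q in enumerate(queries):
        idx = select_positions(i, window, num_global)
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, keys[j])) / scale for j in idx]
        scored += len(idx)  # count query-key comparisons actually performed
        top = max(scores)   # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * values[j][d] for w, j in zip(weights, idx)) / total
             for d in range(dim)]
        )
    return outputs, scored

def dense_pair_count(n):
    """Causal dense attention compares each token with itself and all earlier ones."""
    return n * (n + 1) // 2
```

With these defaults the sparse path performs at most 8 comparisons per token, so doubling the context roughly doubles the work, while `dense_pair_count` roughly quadruples. Whether a selector this cheap can preserve recall is exactly the open question; the sketch only illustrates the cost structure.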
The benchmark numbers from launch were strong enough to generate serious attention. SubQ 1M-Preview scored 95.6 percent on RULER 128K, compared to 94.8 percent for Claude Opus 4.6. On MRCR v2, the company reports 65.9, against Claude Opus 4.7 at 32.2 and Gemini 3.1 Pro at 26.3. On SWE-Bench Verified, SubQ scores 81.8 against Opus 4.6's 80.8. The architecture runs 52 times faster than FlashAttention in internal comparisons and requires 63 percent less compute.
The cost figures will attract the most scrutiny. Running SubQ on the 128K RULER benchmark reportedly costs approximately $8, compared to roughly $2,600 on comparable frontier models. If those figures hold under production conditions, they reshape the economics for any team currently rationing context to manage inference costs.
The research community's skepticism is grounded in history. A January 2026 technical analysis examined the pattern of prior subquadratic attempts and concluded that no frontier model has shipped with a pure subquadratic architecture without performance tradeoffs. Mamba-attention hybrids recover performance by mixing subquadratic layers with standard quadratic attention, which reintroduces the scaling cost. Magic.dev raised hundreds of millions of dollars in venture funding on a 100 million token context model with claimed efficiency gains of roughly 1,000 times; as of early 2026, there is no public evidence of significant external adoption.
On May 14, Subquadratic announced a partnership with LayerLens to run SubQ through Stratix, a benchmark platform covering more than 200 models and nearly 100 benchmarks. Results will be published at stratix.layerlens.ai with prompt-level breakdowns and head-to-head comparisons. Subquadratic has committed to publishing complete results, including areas where the model underperforms. The evaluation will cover retrieval accuracy at depth, positional consistency, and synthesis from extended inputs.
Every AI team currently managing context windows is doing so because cost and performance constraints force tradeoffs. RAG pipelines exist because sending a full document corpus through a transformer is not economically viable. SubQ Code, the coding agent launched alongside the API, processes entire codebases in a single context window without multi-agent coordination. For teams managing large repositories, the value of genuine long-context recall at this cost differential is measurable and direct.
Google and SpaceX are in talks to launch data centers into space. The Wall Street Journal broke the story May 12. The technology is unproven at scale. The companies involved and the hardware already in orbit make this more than a concept pitch.
Google's orbital initiative, called Project Suncatcher, was announced in late 2025. The plan is to launch a cluster of roughly 81 satellites spanning a one-kilometer radius in low-Earth orbit, carrying custom Tensor Processing Units. Planet Labs is building the satellites. Two prototypes are planned for this year, with broader deployment targeted for 2027.
SpaceX has filed with the FCC to launch up to a million satellites for its own orbital data center constellation, a project central to its pitch to investors ahead of a planned IPO valued at $1.75 trillion. The company acquired xAI in February in a deal valuing the combined entity at $1.25 trillion. Last week SpaceX struck a deal with Anthropic to supply 300 megawatts of computing capacity using more than 220,000 Nvidia GPUs.
The two companies are not simply partnering. They are also competitors. SpaceX dominates commercial access to orbit, which means anyone looking to put satellites in space must at least evaluate a SpaceX deal regardless of competitive overlap.
The case for orbital data centers starts with power. Terrestrial facilities face real constraints: land acquisition, grid access, water cooling, permitting, and growing community resistance. A satellite in sun-synchronous orbit can receive near-continuous sunlight at roughly 36 percent higher irradiance than on Earth's surface, with no night cycle. Cooling works through radiative heat dissipation into cold vacuum, eliminating the water-intensive systems that make terrestrial data centers resource-heavy.
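One plausible reading of the 36 percent figure is a comparison between the solar constant above the atmosphere and the standard clear-sky peak used to rate panels on the ground. Both reference values below are supplied here for illustration and do not come from the companies involved.

```python
# Reference values supplied for illustration, not taken from the companies.
SOLAR_CONSTANT_W_M2 = 1361.0   # mean irradiance above the atmosphere
SURFACE_PEAK_W_M2 = 1000.0     # standard clear-sky peak used to rate panels

def irradiance_gain(orbit_w_m2=SOLAR_CONSTANT_W_M2, surface_w_m2=SURFACE_PEAK_W_M2):
    """Fractional increase in irradiance in orbit relative to the surface."""
    return orbit_w_m2 / surface_w_m2 - 1.0

gain = irradiance_gain()
print(f"Orbital irradiance gain: {gain:.0%}")
```

The gap in delivered energy is larger than the irradiance gap alone, since a ground panel also loses output to night, weather, and sun angle.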
Starcloud, a Y Combinator-backed startup with Nvidia as a partner, launched its first satellite carrying an H100-class GPU in late 2025 and trained a language model in orbit. Axiom Space launched dedicated orbital data center nodes in January 2026. Blue Origin has announced plans under a project called TeraWave. The field has moved from white papers to hardware in roughly 18 months.
The economics remain the obstacle. Estimates put the cost of a one-gigawatt orbital setup at roughly three times that of an equivalent ground installation. SpaceX's own February 2026 pricing listed standard rideshare rates at $7,000 per kilogram. SpaceX disclosed to investors in April that its orbital data center business might not pay off. The companies involved are betting on trajectory: launch costs have fallen dramatically over the past decade, and the Starship vehicle, if it achieves its target flight cadence, could reduce per-kilogram costs further.
Some researchers have drawn a direct line from today's orbital clusters to the concept of a Dyson swarm, a theoretical megastructure of solar-collecting satellites surrounding a star. Sam Altman has referenced a Dyson sphere of data centers around the Sun. The CEO of Starcloud has described early Earth-orbit clusters as the seed for a solar-system-scale swarm. What looks like a launch services negotiation is also, in a longer view, the first serious commercial step toward moving computation off the planet entirely.
Most AI research focuses on making models smarter. A new paper from Thinking Machines Lab, led by Mira Murati, focuses on something different: making models usable in the way humans naturally work together. The paper introduces what the team calls interaction models, a class of AI systems designed to handle real-time, multimodal collaboration natively rather than through add-on components.
The structural problem Thinking Machines identifies is real. Current large language models experience the world one turn at a time. A user types or speaks, the model waits, the model responds, the user waits. Until the user finishes, the model perceives nothing. Until the model finishes generating, it receives no new information. The researchers cite a telling admission buried in a recent frontier model card: when models are used in interactive, hands-on patterns, users often perceive them as too slow, and the card suggested autonomous, long-running harnesses better elicit model capabilities. Thinking Machines reads this as a design concession, not a feature.
The core technical idea is what the team calls time-aligned micro-turns. Rather than waiting for a complete user input and generating a complete response, the model processes 200-millisecond chunks of audio, video, and text continuously, interleaving input and output in a single token stream. At any given moment the model is both listening and generating, which allows it to interrupt when something is wrong, backchannel while the user is talking, respond to visual cues without being explicitly prompted, and speak simultaneously with the user when tasks require it.
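A toy sketch of the interleaving, under loud assumptions: the `respond` callback stands in for the model, the stream is a plain Python list rather than tokens, and the interruption policy is invented for illustration. Only the 200-millisecond chunk size comes from the paper as described above.

```python
CHUNK_MS = 200  # chunk size reported for the interaction model

def micro_turn_stream(input_chunks, respond):
    """Interleave input and output chunks in one stream, one tick per chunk.

    `respond` is a hypothetical stand-in for the model: it sees the whole
    stream so far and returns an output chunk, or None to stay silent.
    """
    stream = []
    for tick, chunk in enumerate(input_chunks):
        t = tick * CHUNK_MS
        stream.append(("in", t, chunk))
        out = respond(stream)
        if out is not None:
            stream.append(("out", t, out))
    return stream

def interrupt_on_error(stream):
    """Toy policy: speak up as soon as a wrong claim is heard, mid-utterance."""
    _, _, latest = stream[-1]
    if "2+2=5" in latest:
        return "Quick correction: 2+2=4."
    return None

user_audio = ["so as we know", "2+2=5 and", "therefore the total"]
stream = micro_turn_stream(user_audio, interrupt_on_error)
for event in stream:
    print(event)
```

The correction lands in the stream at the 200 ms tick, before the user has finished speaking. A turn-based loop would not see the error until the final chunk arrived.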
Most existing real-time systems emulate this through external components, primarily voice-activity-detection systems that predict when a speaker has finished. Thinking Machines argues that these components, being far less intelligent than the models they manage, place a ceiling on what real-time AI can do. Proactive behaviors like "interrupt me when I make a factual error" or "count how many pushups I complete" require the system to decide when to respond based on context rather than audio boundary detection. A harness that listens for silence cannot do that.
For tasks requiring deeper reasoning than real-time generation can produce, the interaction model delegates to a background model running asynchronously. The interaction model stays active throughout, handling follow-ups and new input, then integrates the background results when the moment is right.
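The delegation pattern resembles ordinary asynchronous task handling. The sketch below uses Python's `asyncio` to show the shape of it; `background_reasoner`, the `hard:` prefix, and the timings are all hypothetical and say nothing about how Thinking Machines implements the handoff.

```python
import asyncio

async def background_reasoner(question):
    """Hypothetical slow model; the sleep stands in for long-running reasoning."""
    await asyncio.sleep(0.05)
    return f"detailed answer to: {question}"

async def interaction_loop(user_turns):
    """Answer quick turns immediately while one hard question runs in the background."""
    log = []
    pending = None
    for turn in user_turns:
        if turn.startswith("hard:") and pending is None:
            # Hand the hard question off and keep the conversation going.
            pending = asyncio.create_task(background_reasoner(turn[5:].strip()))
            log.append("On it, give me a moment.")
        else:
            log.append(f"quick reply to: {turn}")
        await asyncio.sleep(0.01)  # next tick; lets the background task progress
        if pending is not None and pending.done():
            log.append(pending.result())  # fold the result in once it is ready
            pending = None
    if pending is not None:
        log.append(await pending)
    return log

log = asyncio.run(interaction_loop(["hi", "hard: plan my week", "what time is it?"]))
for line in log:
    print(line)
```

The follow-up question gets a quick reply while the background task is still running, and the detailed answer is appended once it completes.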
The model has 276 billion parameters with 12 billion active, a mixture-of-experts design. The team acknowledges that larger pretrained models are currently too slow to serve in this real-time setting and plans larger versions later this year. The research preview is limited; Thinking Machines is accepting feedback and plans broader access on the same timeline.
Harvard researchers found that as models become more accurate, they become less diverse. The information is in there. Standard decoding just never finds it. Queenie Luo, Gary King, Michael Puett, and Michael D. Smith propose a fix that requires no retraining.
Every major AI lab is racing to improve benchmark scores. Models that answer factual questions more accurately and reason more reliably are winning on leaderboards and attracting users. What gets less attention is what is sacrificed in the process. A new working paper from Harvard identifies a specific casualty: the ability to generate genuinely diverse, creative responses over extended use.
As models are trained to converge on correct answers, their probability distributions become more peaked around the most common outputs. More information gets pushed into statistical tails where standard decoding methods never look. Newer, more accurate models generate narrower and more repetitive outputs than older ones when asked open-ended questions.
The researchers call the most affected category the "search quest": any prolonged, exploratory effort to find an answer that does not yet exist in a fixed form. Choosing a wedding dress, finding a research topic, naming a startup, designing a product. Unlike a factual query with a correct answer, a search quest requires learning the space of possibilities. You do not know what you want until you have seen enough options to understand what is out there. Current AI tools produce a useful first list and then start repeating themselves.
The mechanics of why this happens are grounded in how models generate text. At each step, a model computes a probability distribution over all possible next tokens. Standard decoding methods select from the highest-probability tokens, keeping output fluent but ensuring the vast majority of information encoded in the model's parameters is never accessed. The researchers illustrate this: when prompted to brainstorm 18th century world history book topics, a model's top tokens lead to European history. Move to the 300th through 2000th positions and you find tokens like "Asia" and "African," topics the model clearly knows but never volunteers.
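Top-k sampling is one common decoding method of this kind, and a toy version shows the effect. The distribution below is invented for illustration (rank r gets weight proportional to 1/(r+1)) and is not taken from the paper or from any real model.

```python
import random

def top_k_sample(ranked_tokens, probs, k, rng):
    """Sample only from the k highest-probability tokens (inputs sorted descending)."""
    return rng.choices(ranked_tokens[:k], weights=probs[:k], k=1)[0]

# Toy distribution, invented for illustration: 2,000 tokens ranked by probability.
vocab = [f"tok{r}" for r in range(2000)]
weights = [1.0 / (r + 1) for r in range(2000)]
total = sum(weights)
probs = [w / total for w in weights]

rng = random.Random(0)
samples = {top_k_sample(vocab, probs, k=50, rng=rng) for _ in range(10_000)}

highest_rank_seen = max(int(tok[3:]) for tok in samples)
tail_mass = sum(probs[300:])  # probability held by ranks 300 through 1999

print(f"highest rank sampled: {highest_rank_seen}")
print(f"probability mass beyond rank 300: {tail_mass:.1%}")
```

In this toy setup a substantial share of the probability mass sits beyond rank 300, yet no sample ever comes from there, because truncation removes those tokens before sampling begins.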
The Harvard team's solution, called Recoding-Decoding (RD), introduces two forms of randomness at generation time without retraining. A random priming phrase, constructed from the 2,000 most common English nouns, is embedded in the prompt as "Related to NOUN." A random diverting token is drawn from three-letter starting stems of the 5,000 most common English words and placed at the start of each new sentence. Both exploit positional bias, a documented property of language models: they attend more heavily to tokens at the beginning and end of input sequences.
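A minimal sketch of how the two random elements might be assembled, assuming a completion-style interface. The word lists are tiny placeholders, the function names are hypothetical, and the exact prompt layout is an assumption; the paper's own templates may differ.

```python
import random

# Tiny placeholder lists. The paper draws on the 2,000 most common English
# nouns and stems from the 5,000 most common English words.
NOUNS = ["river", "market", "engine", "garden", "signal"]
WORDS = ["about", "because", "during", "history", "trade", "the"]

def diverting_stems(words):
    """Three-letter starting stems, deduplicated and sorted."""
    return sorted({w[:3] for w in words if len(w) >= 3})

def build_rd_prompt(task, rng):
    """Return (prompt, stem) for one generation.

    The prompt carries a random priming phrase. The stem is what a
    completion-style call would place at the start of the next sentence,
    so the model has to continue from an unlikely opening.
    """
    noun = rng.choice(NOUNS)
    stem = rng.choice(diverting_stems(WORDS))
    prompt = f"{task} Related to {noun}."
    return prompt, stem

rng = random.Random(1)
prompt, stem = build_rd_prompt("Brainstorm an 18th century world history book topic.", rng)
print(prompt)
print(f"Start the next sentence with: {stem.capitalize()}")
```

Because both draws are independent of the user's question, two people asking the same thing are pushed toward different regions of the model's output space.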
The battlefield test is the most striking result. Standard GPT-5.1, queried about interesting world history battlefields, produced 19 unique locations across 1,000 runs, all in Europe or North America. RD produced 1,307 unique battlefields across a globally distributed range, including East Asia, South Asia, the Middle East, Africa, and Australia. Standard decoding identified zero battlefields that RD did not also find.
Across 50 brainstorming topics, RD consistently outperformed all ordinary decoding variants on diversity while maintaining relevance scores of 0.99, essentially matching standard decoding's 0.99 to 1.00. On five public datasets spanning 500 prompts, RD increased diversity by 161 percent on GPT-5.1 and 140 percent on Gemini-3, with relevance scores within one percentage point of standard decoding.
One finding cuts against a common assumption: that newer, more capable models are better for creative tasks. The data shows the opposite. As model accuracy improves, the gap between standard decoding and RD widens. Better benchmark performance comes at a measurable cost to output variety when the model is used for open-ended exploration. GPT-5.1 under standard decoding scores lower on diversity than older GPT-3.5.
The paper also addresses collective effects. Research published in Science Advances found that generative AI enhances individual creativity while reducing collective diversity: different people's outputs converge on the same ideas. The researchers replicated this pattern when they discovered that students in a university course submitted nearly identical essays without communicating. RD addresses this by ensuring different users asking similar questions receive genuinely different suggestions.
The method requires either a completion API or a simulated version using the chat completion API. As frontier labs increasingly restrict raw API access in favor of structured chat interfaces, the practical ceiling for this approach depends on whether providers make completion endpoints available. The research is a reminder that what AI labs optimize for determines what they get and sometimes what they sacrifice.
On May 8, METR added Claude Mythos Preview to its time horizon tracker and immediately flagged a caveat: "Measurements above 16 hours are unreliable with our current task suite." The suite was not designed for a model this capable. The Exponential Replacement Curve projected this. The timeline has compressed.
METR's time horizon metric measures the length of a software task, calibrated by how long a human expert takes to complete it, that an AI agent can complete with a given reliability. GPT-4o, released in mid-2024, had a 50 percent time horizon of roughly 7 minutes. Claude 3.7 Sonnet reached about 2 hours. By late 2025, Claude Opus 4.5 and GPT-5 were clustering around 5 to 6 hours. Mythos Preview pushed past what the measurement infrastructure can reliably capture.
The acceleration behind that number is the real story. METR's original March 2025 paper found frontier time horizons doubling approximately every 7 months across the 2019 to 2025 period. The updated methodology from January 2026 found the post-2023 doubling time had compressed to 130.8 days, or about 4.3 months. One independent analysis pegged it at roughly 105 days.
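The difference between those doubling times is easier to see as growth per year. The arithmetic below assumes a constant doubling time and an average month of about 30.44 days; the function names are illustrative.

```python
DAYS_PER_YEAR = 365.25
DAYS_PER_MONTH = DAYS_PER_YEAR / 12  # average month, about 30.44 days

def days_to_months(days):
    """Convert a duration in days to average-length months."""
    return days / DAYS_PER_MONTH

def yearly_growth(doubling_days):
    """Multiplicative growth over one year at a fixed doubling time."""
    return 2 ** (DAYS_PER_YEAR / doubling_days)

def projected_horizon(start_hours, elapsed_days, doubling_days):
    """Time horizon after elapsed_days, assuming the doubling time holds."""
    return start_hours * 2 ** (elapsed_days / doubling_days)

for label, days in [("7 months", 7 * DAYS_PER_MONTH), ("130.8 days", 130.8), ("105 days", 105.0)]:
    print(f"{label}: {days_to_months(days):.1f} months per doubling, "
          f"{yearly_growth(days):.1f}x per year")
```

A seven-month doubling time compounds to a little over 3x per year. At 130.8 days the same calculation gives roughly 7x, which is why a task suite sized for the older trend runs out of headroom early.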
The Mozilla Firefox team used Mythos Preview to fix 423 security bugs in April 2026, compared to a prior monthly average of 17 to 31. The model identified vulnerabilities requiring multi-component reasoning across large codebases, including a 20-year-old XSLT bug and a race condition that could enable sandbox escape. The UK AI Security Institute found that Mythos can autonomously execute a complete corporate network takeover, succeeding in 30 percent of attempts on a complex attack range that AISI estimates would require roughly 20 hours for a human expert.
The Exponential Replacement Curve framework accounts for the gap between benchmark performance and actual deployment. The capability threshold is what METR measures. The displacement timeline depends on organizational adoption speed, integration costs, and regulatory friction. But the METR data shows the capability side of that equation is moving faster than the framework's original projections assumed.
The Stanford and ADP data already showed that employment for workers aged 22 to 25 in AI-exposed occupations fell 6 percent between late 2022 and July 2025, with young software developers down 20 percent from their late 2022 peak. The BLS still projects software developer employment growing 17.9 percent through 2033. That projection was made before a model could fix 423 security bugs in a single month, against a prior human-team average of 17 to 31, and before the leading capability benchmark had to flag its own measurements as unreliable.
When the benchmark suite designed to track exponential progress can no longer keep up, the rate of change has outrun the institutions built to monitor it. That is itself a data point for the Exponential Replacement Curve's wave model, which projects displacement pressure building in advance of widespread organizational awareness.