
On 29 April 2026, Science Advances published a benchmark study with one of the more interesting AI findings of the year: when forecasting record-breaking weather extremes — the heat waves, cold snaps, and wind events that trigger early-warning systems and disaster response — the European Centre for Medium-Range Weather Forecasts' physics-based HRES model still consistently outperforms the leading AI weather models. The story isn't that AI loses; it's that the production stack operational meteorology is converging on is hybrid AI-physics, and that is exactly the pattern every other high-stakes AI deployment will converge on too.
The dataset alone is worth pausing on.
For 2020, researchers Zhongwei Zhang, Erich Fischer, Jakob Zscheischler, and Sebastian Engelke (ETH Zurich, KIT, Geneva, UFZ) identified 162,751 heat records, 32,991 cold records, and 53,345 wind records that broke the previous records set at the same 0.25° grid cell during the AI models' 1979-2017 training period. The same exercise on 2018 — a year with completely different ENSO conditions, transitioning from La Niña to El Niño rather than 2020's El Niño-to-La Niña shift — produced the same ranking. That dual-year robustness check, plus a second one defining records on a 31-day running window rather than per-month, plus a third using HRES-fc0 as the ground truth instead of ERA5, all give the same answer. The five AI models tested — Google DeepMind's GraphCast (and its operational variant), Huawei Cloud's Pangu-Weather (and its operational variant), and Shanghai Academy of AI for Science's Fuxi — all systematically underestimate the frequency and intensity of record-breaking events. The physics model doesn't.
The most useful framing isn't "physics beats AI." That's the AI-hype-cycle reading and it misses the actual story. AI weather models match or exceed HRES on standard global temperature and wind metrics — and run orders of magnitude faster and cheaper. That hasn't changed. ECMWF, NOAA, and Microsoft are all running AI models in production alongside physics models. The new paper is a precise map of where AI quietly underperforms — the tail of the distribution — and an argument for why the operational world has already converged on hybrid stacks rather than pure AI replacement. For any business deploying AI in a high-stakes domain, that pattern is the headline.
What the benchmark actually measured
The benchmark uses ERA5 reanalysis data — the ECMWF dataset produced by re-running atmospheric data assimilation across all available historical observations — on a 0.25° latitude-longitude grid covering 244,450 land grid cells (Antarctica excluded due to anomalous AI behavior in those latitudes). Records are defined locally, per grid cell, per calendar month, across the 1979-2017 training window. A "record-breaking event" in 2020 is any 00 or 12 UTC observation that exceeds the corresponding monthly record at that grid cell. The records span every climatic zone the test year covers: tropics, subtropics, mid-latitudes, northern high latitudes — though South America, Southeast Asia, the Maritime Continent, and Australia have few or no records in 2020.
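To make the record definition concrete, here is a minimal NumPy sketch of the bookkeeping (hypothetical array names and shapes, not the authors' code): for each grid cell and calendar month, take the maximum over the training window, then flag any 2020 timestep that exceeds it. Cold and wind records work the same way, using minima and wind-speed maxima respectively.

```python
import numpy as np

# Hypothetical shapes: temp_train is (n_train_times, n_lat, n_lon) of 00/12 UTC
# 2m temperature for 1979-2017; temp_2020 is (n_2020_times, n_lat, n_lon).
# months_train / months_2020 give the calendar month (1-12) of each timestep.
def find_heat_records(temp_train, months_train, temp_2020, months_2020):
    records = []
    for month in range(1, 13):
        # Historical monthly record per grid cell over the training window.
        hist_max = temp_train[months_train == month].max(axis=0)  # (n_lat, n_lon)
        for t in np.where(months_2020 == month)[0]:
            exceed = temp_2020[t] - hist_max      # exceedance magnitude per cell
            iy, ix = np.where(exceed > 0)          # cells breaking their record
            records += [(t, y, x, exceed[y, x]) for y, x in zip(iy, ix)]
    return records  # one entry per record-breaking event (time, cell, exceedance)
```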
The dataset includes prominent real-world events: the Siberian heatwave of early 2020, the U.S. heatwave of August 2020, and tens of thousands of less-publicised local-scale records that the public rarely hears about but matter operationally for grid planners, agricultural insurers, water utilities, and emergency response. The forecast accuracy metric is RMSE — root mean square error — between the forecast at lead time τ and the ground truth at the same location and time, latitude-weighted to account for grid-cell area variation.
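The metric itself is short to write down. A common convention, sketched below, weights each latitude row by the cosine of latitude, which is how grid-cell area shrinks toward the poles; the paper's exact normalisation may differ in detail.

```python
import numpy as np

def lat_weighted_rmse(forecast, truth, lats_deg):
    """RMSE over a (n_lat, n_lon) field, weighting each latitude row by
    cos(latitude) so the shrinking area of high-latitude grid cells does not
    over-count their errors. Assumed convention: weights normalised to mean 1
    over the retained (non-Antarctic) latitudes."""
    w = np.cos(np.deg2rad(lats_deg))        # (n_lat,)
    w = w / w.mean()
    sq_err = (forecast - truth) ** 2        # (n_lat, n_lon)
    return float(np.sqrt((w[:, None] * sq_err).mean()))
```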
Two things make this evaluation unusual and convincing. First: the sample size. Most prior AI-vs-physics extreme-event evaluations use case studies — a single tropical cyclone, a single heatwave — so their conclusions are inherently fragile. The 249,087 records identified for 2020 alone, with a comparable set for 2018, spanning multiple climate regimes, form a population, not a sample. Second: the careful controls. The authors evaluated AI forecasts against ERA5 (their training distribution) and HRES against HRES-fc0 (the standard ECMWF analysis used to initialise HRES). They re-ran the analysis using HRES-fc0 as ground truth for the AI models' operational variants — yielding 170,136 heat, 109,155 cold, and 338,235 wind records on that grid — and got the same ranking. They re-ran with a 31-day running-window record definition (90,471 heat, 18,054 cold) — same ranking. They re-ran with a forecast-conditioned evaluation (avoiding the "forecaster's dilemma" of conditioning on observations) — same ranking. The result is robust to almost every methodological choice the authors made.
Three findings worth remembering
Errors grow with record severity. The AI models' errors "grow almost linearly with respect to the degree of record exceedance." The further past the historical record an event sits, the worse the AI models do at calling it. The authors describe the behaviour as if the AI predictions had "an implicit (soft) cap at a certain local value" — interpolating between observed training points and effectively flattening near the largest historical case. Physics-based HRES, by contrast, "is more robust to extreme record exceedances." For cold records HRES shows nearly constant error across exceedance magnitudes; for heat and wind records it shows mild but far smaller bias.
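That diagnostic is straightforward to reproduce on any forecast system. A hedged sketch, not the paper's exact analysis: bin the forecast errors at record-breaking points by how far each event exceeded its historical record, and watch how the mean error changes across bins.

```python
import numpy as np

def error_vs_exceedance(errors, exceedances, n_bins=10):
    """Mean forecast error per exceedance-magnitude bin.
    errors: forecast minus observed value at each record-breaking event;
    exceedances: observed value minus the previous historical record.
    A roughly linear, increasingly negative trend across bins is the
    'soft cap' signature described in the paper."""
    edges = np.quantile(exceedances, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(exceedances, edges[1:-1]), 0, n_bins - 1)
    return np.array([errors[idx == b].mean() for b in range(n_bins)]), edges
```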
AI models miss records and miscount them. Beyond underpredicting intensity, the AI models systematically underpredict the number of record-breaking events relative to ground truth. That produces "a low number of true positives and a high number of false negatives, and consequently low recall." In risk-management language: AI weather models miss extremes more often than physics models do. Across all record types and lead times, HRES's precision-recall curves sit consistently above those of GraphCast, Pangu-Weather, and Fuxi — meaning higher precision for the same recall, or higher recall at the same precision.
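In code, the record-detection framing reduces to a binary-classification tally per grid cell and timestep. A minimal sketch with hypothetical inputs:

```python
import numpy as np

def record_precision_recall(forecast, observed, hist_record):
    """Treat 'breaks the historical record' as the positive class.
    All arrays share the same shape (e.g. time x lat x lon)."""
    pred_pos = forecast > hist_record
    obs_pos = observed > hist_record
    tp = np.sum(pred_pos & obs_pos)     # record forecast, record observed
    fp = np.sum(pred_pos & ~obs_pos)    # false alarm
    fn = np.sum(~pred_pos & obs_pos)    # missed record
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```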
A small but telling artefact: Pangu-Weather's record counts trace a zigzag pattern across lead times. The reason is that Pangu-Weather is itself a chained system — a 6-hour model and a 24-hour model — that handles different lead-time ranges with different sub-models. Each component has its own characteristic underprediction rate, and the discontinuity at the model handover shows up as a visible step in the error curves. It's a useful reminder that "AI weather model" is not one thing; under the hood, these systems are themselves engineered hybrids, and where their components swap, the error structure swaps with them.
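The handover is easy to picture. A chained system rolls the state forward by composing its sub-models greedily, as many long steps as fit and then short steps for the remainder, so a 30-hour forecast and a 24-hour forecast pass through different model combinations. A toy sketch (the real Pangu-Weather configuration has more sub-models and detail than this):

```python
def chained_forecast(state, lead_hours, step_24h, step_6h):
    """Greedy lead-time decomposition: apply the 24 h sub-model as many times
    as possible, then the 6 h sub-model for the remainder. step_24h / step_6h
    are callables mapping a state to the state 24 h / 6 h later.
    Assumes lead_hours is a multiple of 6."""
    remaining = lead_hours
    while remaining >= 24:
        state = step_24h(state)
        remaining -= 24
    while remaining >= 6:
        state = step_6h(state)
        remaining -= 6
    return state
```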
AI models fail in correlated ways. Perhaps the most operationally important finding for anyone running an AI ensemble: "All AI models are positively correlated with each other, showing that they tend to make errors on the same events." The authors attribute this to shared biases learned from the common ERA5 training set. The implication is that running multiple AI weather models in parallel does not give you the diversification benefit you'd expect from an ensemble — they tend to be wrong together. A physics model run alongside provides the actual statistical independence.
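The arithmetic behind the diversification point is worth seeing once. The error variance of an average of N forecasters with equal variance and pairwise error correlation rho scales as (1 + (N-1)·rho)/N, so at rho near 1 adding models barely helps. A quick simulation with illustrative numbers only:

```python
import numpy as np

rng = np.random.default_rng(0)
n_events, n_models, rho = 100_000, 5, 0.9   # illustrative numbers only

# Per-event errors for 5 models sharing a common bias component (correlation
# rho) plus an independent component, each with unit total variance.
common = rng.normal(size=(n_events, 1))
indep = rng.normal(size=(n_events, n_models))
errors = np.sqrt(rho) * common + np.sqrt(1 - rho) * indep

print(np.var(errors[:, 0]))          # ~1.0: one model's error variance
print(np.var(errors.mean(axis=1)))   # ~0.92: five correlated models barely help
# With rho = 0 the ensemble mean's variance would drop to ~0.2 (i.e. 1/5).
```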
Carbon Brief quoted Prof. Erich Fischer of ETH Zurich on the practical reading: a "warning shot" against replacing traditional models with AI "too quickly." Co-author Dr. Zhongwei Zhang framed the structural reason: AI weather models are trained around general atmospheric conditions and effectively treat extreme forecasting as a "secondary task." The training loss is overall RMSE; extremes contribute little to the aggregate score, so the model has weak gradient signal pointing toward better tail behavior.
The Dubai counter-example: when AI does extrapolate
The paper is careful to acknowledge a counter-result that complicates the simple narrative. In April 2024, Dubai experienced an unprecedented rainfall event — beyond anything in the regional historical record — and GraphCast predicted it well. Sun et al. (2025) attributed this to the event sharing dynamical similarity with extreme rainfall events from other regions in the training data. GraphCast hadn't seen Dubai-scale rain before, but it had seen the synoptic pattern that drives such events.
The lesson is more nuanced than "AI can't extrapolate." It's that AI models can extrapolate when the underlying dynamics of an unprecedented event resemble dynamics they've seen elsewhere — what the literature now calls "translocation" rather than true extrapolation. They struggle when the event is dynamically novel relative to the entire training set. For a CTO designing an AI system around this insight, the right question is not "have I seen this exact case before?" but "have I seen this family of cases before?" The first is data engineering. The second is a much harder feature-design problem.
This is also the cleanest argument against treating extrapolation as a binary. AI models extrapolate gracefully along some axes and fail catastrophically along others, and the axes are often invisible until something breaks. The systematic record-breaking benchmark forces those axes into the open in a way single-event case studies cannot.
Why this happens — the out-of-distribution problem
The paper's discussion is direct about the structural reason. AI weather models "do not use any knowledge of physical principles and do not explicitly enforce energy balances or other physical constraints. They are purely data-driven and essentially interpolate between observed historical weather patterns in the training period."
In machine-learning terms: out-of-distribution generalisation is a well-known fundamental limitation of neural networks. Balestriero et al. (2021) showed that in any high-dimensional predictor space, even held-out test points typically lie outside the convex hull of training data — meaning they technically require extrapolation. The record dataset goes further: every event in it has at least one variable beyond the univariate range of training data, satisfying a much stronger out-of-distribution definition. The model is being asked to predict points it has provably never seen along at least one dimension, and neural networks have no clean mechanism for doing this.
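That stronger out-of-distribution criterion is easy to operationalise for any feature set. A minimal sketch with hypothetical array names:

```python
import numpy as np

def outside_training_range(X_train, x_new):
    """True for each dimension where a new point exceeds the per-dimension
    min/max envelope of the training data: the 'at least one variable beyond
    the univariate range' criterion. Convex-hull membership is a weaker
    condition that high-dimensional test points almost never satisfy anyway."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return (x_new < lo) | (x_new > hi)   # boolean mask over dimensions
```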
The paper cites parallel findings from other fields: image classification under domain shift, protein fitness landscape prediction, and large language model generalisation studies all show similar tail-distribution failures. The pattern is structural to data-driven models, not specific to weather.
There's also a more subtle issue. Deterministic AI weather models are trained to minimise RMSE — which makes them predict the conditional mean of the target distribution. The mean smooths out sharp peaks. A real wind storm with a 60-knot gust gets predicted as a softer 45-knot blob in space and time, because that's what minimises RMSE on a training set where most cells most of the time have moderate wind. Probabilistic AI models trained on proper scoring rules avoid some of this smoothing — but they're still trained on the same ERA5 data, so they likely face the same extrapolation ceiling. A different loss function does not manufacture training examples that don't exist.
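The mean-reversion effect shows up even in a toy regression. In the sketch below the target is mostly moderate, with rare spikes the inputs barely distinguish; a least-squares fit predicts close to the conditional mean and never reaches the spikes, however much data it sees. Illustrative numbers only:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Target: usually moderate "wind", plus a rare extreme regime that the input
# only weakly distinguishes (as is typical for sub-grid gusts).
x = rng.normal(size=n)
is_extreme = rng.random(n) < 0.02
y = 10 + 2 * x + np.where(is_extreme, 35.0, 0.0) + rng.normal(scale=2, size=n)

# Least-squares fit: the RMSE-optimal prediction is the conditional mean, so
# the rare +35 spikes get smeared into a small upward shift for everyone.
coeffs = np.polyfit(x, y, deg=1)
pred = np.polyval(coeffs, x)

print(y[is_extreme].mean())     # ~45: what extreme events actually look like
print(pred[is_extreme].mean())  # ~11: what the RMSE-optimal model predicts for them
```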
What's actually running in production
It would be easy to read this as "AI weather is overhyped." The operational world has already drawn the more useful conclusion. Production weather forecasting in 2026 is hybrid, not pure AI replacement.
NOAA's hybrid "grand ensemble" combines AI-driven systems with traditional numerical models and consistently outperforms both AI-only and physics-only ensemble systems. ECMWF runs HRES, the AIFS (its in-house data-driven forecasting system, Lang et al. 2024), and produces experimental GraphCast and Pangu-Weather charts in parallel. Microsoft published Aurora (Bodnar et al. 2025) as a "foundation model for the Earth system" trained on a far broader dataset than ERA5 alone — explicitly designed to test whether wider data exposure improves tail behavior. A 2024 study in npj Climate and Atmospheric Science found hybrid physics-AI models outperform pure numerical weather prediction for extreme precipitation nowcasting. The framing across all of these isn't "which one wins." It's "which one for which conditions, and how do we route between them."
The paper itself spells out three concrete improvement directions for next-generation AI weather models:
- Data augmentation with simulated extremes. Numerical climate models can produce physically plausible extreme events that lie outside the historical record. AI models trained on these synthetic extremes should, in theory, raise their implicit cap. The paper cites FourCastNet's tropical-cyclone performance improving substantially when trained with augmented cyclone datasets — direct evidence that the implicit cap can be lifted by widening the training distribution.
- Physics-informed neural networks and hybrid models. Replace specific parameterisations in physical climate models with AI components, or train neural networks with physical conservation laws (energy, mass) explicitly embedded in the loss function (see the sketch after this list). The result is models that combine AI's learning capacity with the extrapolation guarantees that physical constraints provide.
- Adapting principles from extreme-value theory. The statistical-learning literature has spent decades on tail estimation; the recent generation of AI forecasters has barely tapped it. Engression, progression, and generative-extreme-value methods (cited as Shen & Meinshausen 2024; Buriticá & Engelke 2024; Boulaguiem et al. 2022) offer concrete machinery for fitting models that extrapolate by design rather than interpolate by default.
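For the second direction, here is a minimal sketch of what "embedding a conservation law in the loss" can look like. The constraint and the weighting are illustrative, not the formulation of any particular published model:

```python
import numpy as np

def physics_informed_loss(pred, target, lam=0.1):
    """Squared error plus a penalty for violating a (toy) conservation
    constraint: the domain-mean of the predicted field should match the
    domain-mean of the truth if the quantity is neither created nor destroyed.
    Real formulations use proper mass/energy budgets, not a simple mean."""
    mse = np.mean((pred - target) ** 2)
    residual = pred.mean() - target.mean()   # toy conservation residual
    return mse + lam * residual ** 2
```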
The authors' closing line is the operational version: "it remains vital to fund and run physics-based NWP and AI weather models in parallel and to rigorously evaluate their performance for the most impactful type of weather events."
Why this pattern generalises
The hybrid pattern is the part of the story most worth taking seriously if you're a CTO, founder, or COO whose business has nothing to do with weather. The training-data limitation is structural to any data-driven model at the tail of any distribution.
AI models trained on historical data are excellent at the typical case and progressively weaker as you move away from it. The further out into the tail — the rarer, more extreme, more consequential the event — the larger the systematic error tends to be. For weather, that's the heat dome that hasn't happened before. For credit risk, it's the default scenario your training set didn't see. For fraud detection, it's the new attack pattern. For supply-chain forecasting, it's the disruption with no historical analogue. For autonomous-system safety, it's the edge case the model wasn't designed to anticipate. For LLM agents, it's the ambiguous user request that doesn't pattern-match to training conversations.
In each of these domains, the standard-case performance of AI is now usually better than the legacy baseline. That's why AI is being deployed. But the standard case is not where decisions matter most. Tail decisions — the ones with the largest cost-of-being-wrong — are exactly where the training-data limitation bites hardest. The weather paper makes this concrete in a way most internal AI evaluations don't: run any production AI system through a record-breaking-events test set, not just the standard validation split, and you'll likely see a similar pattern.
The hybrid stack is the answer that meteorology has converged on, and it is the answer most domains will converge on too. The cheapest, fastest, AI-native model handles the typical case and most of the operational volume. A more expensive but mechanistically grounded model — a physics simulator, a deterministic rules engine, a tighter human-in-the-loop process — covers the tail. The system as a whole is more robust than any single component, and the cost profile is reasonable because the tail handler only fires when needed. The integration challenge is the routing logic, not the components themselves.
Four operating principles for your own AI roadmap
For anyone building AI agents, production AI workflow automation, or any proprietary intelligence layer where the cost of a wrong tail call is large, the Science Advances result sharpens four operating principles.
Build an extreme-events evaluation track separately from your standard validation. Most AI evaluation harnesses score against held-out data sampled from the same distribution as training data. That's exactly the eval that gives you the highest scores and the smallest insight into tail behaviour. Build a parallel test set drawn from rare events — record breakers, regime shifts, regulatory edge cases, novel attack patterns — and score against it explicitly. Track the standard-case and tail-case scores side by side. The gap between them is your real risk surface, and you cannot manage what you do not measure.
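In practice this is often just a second scoring loop over a separately curated test set. A minimal sketch with hypothetical model and metric names:

```python
def dual_track_eval(model, standard_set, tail_set, metric):
    """Score the same model on the routine validation split and on a curated
    set of rare/extreme cases, and report both plus the gap.
    standard_set / tail_set are iterables of (input, truth) pairs;
    metric is lower-is-better (e.g. RMSE)."""
    def score(dataset):
        vals = [metric(model(x), y) for x, y in dataset]
        return sum(vals) / len(vals)

    standard = score(standard_set)
    tail = score(tail_set)
    return {"standard": standard, "tail": tail, "gap": tail - standard}
```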
Be honest about which decisions are tail decisions. The AI weather models look great in marketing materials because the marketing references the standard-case scores. The operational reality is sharper: accurate forecasts matter most before record-breaking extremes trigger early warning systems — exactly where AI underperforms. Translate that to your own business. Which of your AI-driven decisions have outsized consequences when wrong? A wrongly-flagged routine transaction is a customer-service issue. A missed novel fraud pattern is a board-level event. Those two should not share the same model, and they certainly should not share the same evaluation harness.
Don't trust ensembles of similar AI models to give you diversification. The paper's correlated-failures finding is the most underappreciated operational implication. Stacking three AI models trained on similar data does not give you three independent forecasts — they share the same blind spots. Genuine diversification requires a structurally different forecaster: a physics simulator, a rules-based system, a different model family, or a human reviewer with different priors. This is true beyond weather. Five LLMs voting on the same prompt is not a five-way ensemble; it's largely the same prediction five times. Three credit risk models all trained on the same loan tape are not three independent views of risk, no matter how different their architectures look on a slide.
Keep your non-AI systems alive on purpose, not by accident. This is the lesson most enterprise AI rollouts get wrong. The pressure to retire "legacy" rules engines, deterministic models, and human-review processes is enormous once an AI system outperforms them on standard metrics. The weather paper is a clean argument against it. Standard-metric outperformance doesn't tell you anything about the tail; the operational reason ECMWF still runs HRES is the same operational reason your business should keep its non-AI fallback alive: AI is faster and cheaper most of the time, and the deterministic process still catches the cases where being right matters most. Build the routing logic that decides when each system fires. That routing layer is the real intellectual property of a hybrid stack.
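The routing layer can start very simple: a guard that checks whether the current input sits inside the envelope the AI model was trained on, and escalates to the fallback when it does not. A hedged sketch with hypothetical component names:

```python
import numpy as np

class HybridRouter:
    """Route typical cases to the cheap AI model and out-of-envelope cases to
    the mechanistic fallback (physics model, rules engine, or human review)."""

    def __init__(self, ai_model, fallback, X_train, margin=0.0):
        self.ai_model, self.fallback = ai_model, fallback
        self.lo = X_train.min(axis=0) - margin
        self.hi = X_train.max(axis=0) + margin

    def predict(self, x):
        in_envelope = np.all((x >= self.lo) & (x <= self.hi))
        handler = self.ai_model if in_envelope else self.fallback
        return handler(x), ("ai" if in_envelope else "fallback")
```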
The honest version of the story
The most useful framing of this result isn't "physics beats AI." It's that AI weather forecasting in 2026 is good enough to handle the bulk of operational work cheaply, and the legacy physics infrastructure that AI threatens to displace remains essential for the events where the public consequences of being wrong are largest. The right organisational response is to upgrade to a hybrid stack, not to swing entirely in either direction.
That hybrid pattern is the more important takeaway. The AI hype cycle wants binary answers — AI replaces humans, AI replaces traditional software, AI replaces incumbent vendors. Production reality is hybrids that capture the cost advantages of AI on the standard case and the reliability of mechanistic models on the tail. The 249,087-record benchmark is the cleanest evidence yet for why this is the right architecture; the work ahead, in weather and in every other domain that takes AI seriously, is engineering the routing, the evaluation, and the human accountability that make hybrids actually work.
The companies — in weather, in finance, in operations, in any high-stakes AI deployment — that get this calibration right will compound advantage. The ones that swap out their tail-handling capability in pursuit of headline AI scores will eventually find out, expensively, what their training distribution didn't include.