Industry Analysis

How LLMs Reshaped Scientific Writing — and What Operators Learn

By Oliver Grant · Chief Digital Officer · April 30, 2026 · 8 min read

Two large-scale analyses published over the past year give us the cleanest measurement we've had of how generative AI has changed the language of scientific writing. They're worth pausing on — both for what they say about science, and for what their methodology says about every other text corpus a business owns.

Kobak and colleagues analysed 15.1 million biomedical abstracts indexed by PubMed from 2010 to 2024 and showed that the November 2022 release of ChatGPT triggered an abrupt, sustained shift in the words researchers use to describe their work. Science Advances ran the result in July 2025. The authors' lower-bound estimate: at least 13.5% of 2024 PubMed abstracts were processed by an LLM. In specific subgroups — Chinese-affiliated computational biology papers, MDPI's Sensors journal, the rapidly growing journal Cureus — the figure rises to 25%, 34%, even 41%.

13.5%
Lower bound — share of 2024 PubMed biomedical abstracts processed by an LLM
Kobak et al., Sci. Adv. 2025. Subgroup peaks reach 41% (China + computational biology) and 34% (South Korea + MDPI Sensors). This is a lower bound on the population-level estimate; actual prevalence is likely higher.

A parallel study by Liang and colleagues at Stanford NLP, published in Nature Human Behaviour in 2025, applied a different statistical framework to a different corpus — over a million papers across arXiv, bioRxiv, and the Nature portfolio between January 2020 and September 2024 — and converged on the same picture. Computer-science abstracts top the chart at up to 22% LLM-modified content; mathematics sits at the bottom; the Nature portfolio falls in between (~6%). Two studies, with different methods on different corpora, both show the inflection point at the end of 2022 and a steepening curve through 2024.

How "excess vocabulary" works as a measurement

The Tübingen group adapted the COVID-era statistical concept of excess mortality — observed deaths minus the deaths a counterfactual model would have predicted — to language. They built a counterfactual frequency distribution for every word in PubMed abstracts based on 2021–2022 trends, then measured the gap between what the model predicted for 2024 and what 2024 actually contained. Words whose 2024 frequency dramatically exceeded their predicted frequency are flagged as excess words — the linguistic fingerprints of widespread LLM editing.
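To make the mechanism concrete, here is a minimal sketch in Python. The linear-trend counterfactual below is a deliberate simplification, not the paper's actual model, and the helper names are mine; the point is only the gap-measuring logic.

```python
from collections import Counter

def doc_frequency(abstracts: list[str]) -> dict[str, float]:
    """Share of abstracts that contain each word at least once."""
    counts = Counter()
    for text in abstracts:
        counts.update(set(text.lower().split()))
    return {word: n / len(abstracts) for word, n in counts.items()}

def excess_ratios(freq_2021: dict, freq_2022: dict,
                  freq_2024: dict, floor: float = 1e-5) -> dict[str, float]:
    """Observed 2024 frequency divided by a counterfactual prediction.

    The counterfactual here is a naive linear extrapolation of the
    2021 -> 2022 trend carried two yearly steps forward; the paper's
    model is more careful, but the divergence measurement is the same idea.
    """
    ratios = {}
    for word, observed in freq_2024.items():
        f21 = freq_2021.get(word, 0.0)
        f22 = freq_2022.get(word, 0.0)
        predicted = max(f22 + 2 * (f22 - f21), floor)  # floor avoids /0
        ratios[word] = observed / predicted
    return ratios
```

Feed it three year-slices of a corpus and sort the output descending; words with no organic reason to spike rise straight to the top.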

The headline excess words are now famous. Delves appears 28× more often in 2024 PubMed abstracts than the counterfactual predicts. Underscores is up 13.8×. Showcasing is up 10.7×. Among common words, potential, findings, crucial, across, additionally, comprehensive, enhancing, insights, notably, particularly, and within all show statistically meaningful gaps. Stanford's analysis on arXiv computer-science abstracts identifies realm, intricate, showcasing, and pivotal — words that maintained low, stable frequencies across more than a decade and then spiked starting in 2023.

The strength of this approach is what it doesn't require. There's no labelled training set of human-vs-AI text. There's no per-paper classifier (which would have unavoidable false positives and false negatives at the corpus scale). It just measures the gap between two distributions of word frequencies — the predicted one and the observed one — and reports the divergence. The result is a population-level estimate, not a per-paper classification, but population-level is exactly the right granularity for understanding what's happening to a corpus as a whole.

Where the signal is strongest

The geography of LLM-assisted writing isn't uniform. Kobak and colleagues found that papers from China, South Korea, and Taiwan show measurably higher excess vocabulary than papers from the United States or United Kingdom. Computational and bioinformatics subfields lead disciplinary adoption, with biology and clinical medicine somewhat behind. The pattern is sharpest in the journals that publish the most papers fastest — MDPI's Sensors, Cureus, and similar high-volume venues — and weakest in the most selective journals, where Nature, Science, and Cell sit at around 7%.

Peak LLM-modification rates across subgroups (Kobak et al., 2025)
  • China + computational biology: 41%
  • South Korea + MDPI Sensors: 34%
  • MDPI Sensors: 25%
  • Cureus: 20%
  • Nature family: 10%
  • Nature / Science / Cell: 7%
Frequency-gap (Δ) of LLM-preferred excess vocabulary, expressed as approximate share of subgroup abstracts. Source: Kobak et al., Sci. Adv. 2025.

The Stanford analysis adds an author-behaviour layer. LLM modification is highest among researchers who post preprints frequently, who work in crowded research areas (where competitive pressure is highest), and who write shorter papers (where the ratio of prose to substance is highest). In other words, the pattern lands exactly where you would predict: the contexts where writing is the bottleneck, where prose throughput is part of the production rate, and where the marginal benefit of an LLM is largest.

The methodology generalises beyond science

For the average business, the more interesting finding isn't the share of LLM-touched papers — it's the measurement methodology. The same approach that works on PubMed works on any sufficiently large text corpus you care about: customer reviews on your site, support tickets, internal documentation, vendor proposals, candidate cover letters, contributor pull-request descriptions, content submitted by freelance writers.

Build a baseline distribution of word frequencies from a pre-2023 reference period. Compare against the current period. Anything with a frequency ratio of 5× or more — and especially the now-canonical delve, underscore, showcase, realm, pivotal, intricate — is a quiet signal. The signal won't tell you which specific item was AI-written; it will tell you what the corpus-level shift looks like, which is genuinely useful information for any decision-maker who wants to understand whether the prose they're reading represents the writers they think they're hearing from.
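As a sketch of that scan, assuming nothing more exotic than naive whitespace tokenisation and two lists of strings (the 5× threshold and marker words come from the discussion above; the floor value and helper names are arbitrary choices of mine):

```python
from collections import Counter

CANONICAL_MARKERS = {"delve", "delves", "underscore", "underscores",
                     "showcase", "showcasing", "realm", "pivotal", "intricate"}

def token_frequency(docs: list[str]) -> dict[str, float]:
    """Relative frequency of each token across the whole corpus."""
    counts, total = Counter(), 0
    for text in docs:
        tokens = text.lower().split()
        counts.update(tokens)
        total += len(tokens)
    return {word: n / total for word, n in counts.items()}

def flag_excess(baseline_docs: list[str], current_docs: list[str],
                min_ratio: float = 5.0, floor: float = 1e-6):
    """Words whose current frequency is >= min_ratio times the baseline.

    `floor` stands in for words absent from the pre-2023 baseline so
    genuinely new words don't divide by zero; tune it to corpus size.
    """
    base = token_frequency(baseline_docs)
    curr = token_frequency(current_docs)
    flagged = []
    for word, f_now in curr.items():
        ratio = f_now / max(base.get(word, 0.0), floor)
        if ratio >= min_ratio:
            flagged.append((word, round(ratio, 1), word in CANONICAL_MARKERS))
    return sorted(flagged, key=lambda item: -item[1])
```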

For content marketing operations specifically, this is a brand-defence question. The AI-sludge stylistic profile is now empirically detectable, increasingly recognisable to readers, and increasingly dilutive of any brand voice that converges with it. The companies that will continue to read as distinctly human in 2027 are the ones whose corpora — public marketing, product copy, customer comms — measurably don't drift toward the excess-vocabulary baseline.

Quality is a separate problem from prevalence

The Tübingen authors are careful in their framing. The fact that 13.5% of biomedical abstracts are LLM-touched isn't intrinsically a quality problem. An LLM editing a non-native-English author's draft for fluency is a different action from an LLM writing the abstract from scratch. What the methodology detects is the stylistic signature, not the substantive correctness.

That said, the paper flags the documented quality concerns plainly: LLMs "are infamous for making up references, providing inaccurate summaries, and making false claims that sound authoritative and convincing." The risk isn't that an abstract uses delves — it's that the abstract was written by a system whose error mode is fluent confabulation, and that the writer didn't catch the confabulations because the prose looked correct.

The lesson translates directly to any production AI automation workflow that produces consequential written output. Two failure modes worth monitoring separately:

  • Stylistic homogenisation. Your output sounds like the AI training distribution, not like your team. Detectable, fixable by post-editing.
  • Substantive confabulation. Your output asserts things that aren't true with appropriate-sounding confidence. Less detectable, much costlier when missed.

The two have different fixes. The first is a brand-and-style-guide problem. The second is a verification-and-grounding problem — and is exactly why the operational pattern in proprietary intelligence layers is hybrid: an AI supplies the writing speed, and a deterministic verification step (against a database of known-true facts, or a structured retrieval over your domain) catches the second failure mode before the output ships.
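A toy sketch of that gate, assuming the simplest possible grounding, a hand-maintained dictionary of verified figures. Every name here is hypothetical, and a real system would use structured retrieval or citation checking rather than a regex:

```python
import re

# Hypothetical store of verified figures; in production this would be
# a database or a retrieval layer over your own domain.
KNOWN_FACTS = {
    "pubmed_2024_llm_share_lower_bound": "13.5%",
    "mdpi_sensors_subgroup_peak": "34%",
}

def extract_stat_claims(draft: str) -> list[str]:
    """Pull percentage claims out of drafted copy (crude regex pass)."""
    return re.findall(r"\d+(?:\.\d+)?%", draft)

def unverified_claims(draft: str, facts: dict[str, str]) -> list[str]:
    """Every percentage in the draft that matches no verified figure.
    An empty list means the deterministic gate passes."""
    verified = set(facts.values())
    return [claim for claim in extract_stat_claims(draft)
            if claim not in verified]

draft = "At least 13.5% of abstracts were LLM-touched, peaking at 52%."
print(unverified_claims(draft, KNOWN_FACTS))  # ['52%']
```

The point isn't the regex; it's that the check is deterministic and runs after the fluent text is produced, so a confabulated number can't ride through on good prose.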

What operators should take from this

Three practical reads.

Build the same measurement for your own corpus. Pick a pre-2023 reference period and compute the excess-vocabulary distribution for any text domain you own — internal docs, public marketing, customer-facing content. You don't need to publish the analysis; you need to know the answer. The baseline won't be 13.5% for everyone, but the methodology lets you set a numerical target for how AI-touched you want your voice to be, and then measure whether you're hitting it.
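One illustrative way to turn "measure whether you're hitting it" into a single trackable number: the share of documents containing any canonical marker word. The marker list and the metric definition below are mine, not from either paper; a sketch:

```python
MARKER_WORDS = {"delve", "delves", "underscore", "underscores",
                "showcase", "showcasing", "realm", "pivotal", "intricate"}

def drift_score(docs: list[str]) -> float:
    """Share of documents containing at least one canonical LLM marker.
    One number to set a target against and re-measure each quarter."""
    hits = sum(1 for text in docs if MARKER_WORDS & set(text.lower().split()))
    return hits / len(docs) if docs else 0.0
```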

Treat prevalence and quality as separate questions. The 13.5% figure is interesting; the false-citation and false-claim risk that often comes alongside the same processes is what actually breaks things. AI deployment programmes that conflate the two end up with strange policies — banning LLM use entirely instead of building the verification layer that catches confabulations regardless of whether AI or a human introduced them.

Use stylistic divergence as a brand signal, not just a detection signal. Once delves and underscores are surveillance markers across millions of documents, the inverse becomes valuable: prose that demonstrably doesn't sit in the excess-vocabulary distribution reads as more clearly human-authored, and increasingly so as the median public corpus drifts AI-ward. For any business whose communications are part of its product, that's a strategic asset.

The Kobak and Liang studies are the cleanest instrumentation we've yet had for measuring LLM penetration into a meaningful corpus. The most useful response isn't shock at the size of the number — it's recognising that the same instrumentation works on every other corpus that matters to your business, and starting to read it.

Sources

Kobak et al. · Delving into LLM-assisted writing in biomedical publications through excess vocabulary · Science Advances · 2025-07-02 · https://www.science.org/doi/10.1126/sciadv.adt3813
Liang et al. · Quantifying large language model usage in scientific papers · Nature Human Behaviour · 2025 · https://www.nature.com/articles/s41562-025-02273-8
