The Curious Case of Language Models and the Vanishing Semicolon

05 May, 2025

ChatGPT Image May 3, 2025, 10_07_33 PM

ChatGPT has a serious writing issue. It overuses em dashes and undervalues semicolons to such an extent that you can guess if a piece of content is AI-written just by noticing this pattern. When I confronted GPT about this, it said other LLMs behave similarly, providing the following explanations:

Training data: Em dashes appear more frequently, and semicolons less so.
Avoiding misuse: Semicolons are often misused or misunderstood; em dashes are considered a safer alternative.
Versatility: Em dashes substitute for commas, parentheses, or colons, supposedly offering greater versatility.
Modern style: Em dashes are visually distinctive and better suited for conversational and contemporary writing.

Is the Em Dash Really the Better Understood One?

I can agree that em dashes might be more popular—I rarely see semicolons in recent writing—but are em dashes actually more understood? Not really. Perhaps slightly more than semicolons, but I've often seen them misused as well. In fact, there have been several instances where GPT uses em dashes as a replacement for commas and parentheses in ways I don't particularly agree with.

Consider this sentence generated by GPT:
"The project deadline—originally set for June—was pushed back—due to unforeseen circumstances."

Here's my draft:
"The project deadline, originally set for June, was pushed back due to unforeseen circumstances."

I'm not sure if the AI-generated version was incorrect, but I found the em dash usage unnecessary, creating awkward pauses and disrupting the flow. I found my version clearer, with smoother readability.

This example illustrates how GPT's default stylistic choices often ignore better alternatives. Commas, in this case, preserve the flow and tone without the dramatization that em dashes introduce.

This makes the "versatility" argument unconvincing. It feels more like GPT is inventing reasons to justify its choices. Sure, using em dashes might not be technically incorrect, but language offers sufficient diverse options precisely because different punctuation marks serve nuanced purposes. Settling for a one-size-fits-all fix robs the language of its character and style, creating a dry, functional version suited more for documentation than genuine expression.

Why It’s Likely to Get Worse

Language models are increasingly trained on synthetic data (artificially generated text mimicking real-world patterns). If these models already favour one punctuation mark, this preference will amplify, further flattening language diversity.

Some might argue that this shift is actually beneficial, that it will make language more accessible. But the point of language is to equip people with sufficient tools to express nuanced thoughts and feelings. That's the reason we have so many punctuation marks in the first place. They help shape how something is expressed, not just what is expressed. By curtailing their usage, the models are also curtailing the expression behind it.

Can people not express themselves without them? Certainly, yes. But doing so is like painting with fewer colours. Possible, but inevitably more limited.

What I’m Doing About It

I don't generate much content using AI; I usually use models for reviewing purposes. Even during reviews, however, GPT frequently suggests replacing semicolons with em dashes, reinforcing its stylistic bias. To push back, I've started explicitly instructing GPT in my prompts:

"Do not suppress linguistic diversity. Use and keep varied punctuation to maintain nuanced expression."

It’s a small step, but a necessary one to preserve the richness and expressive quality of language against AI-driven homogenization, which in itself is a much broader concern, going far beyond mere punctuation use. I hope OpenAI, Meta, Anthropic, and others take cognisance of this.