Gone (Cat)Fishing: How Language Detectives Tackle Online Anonymity

Chi Luu

Chi Luu is a peripatetic linguist who speaks Australian English and studies dead languages. Every two weeks, she’ll uncover curious stories about language from around the globe for Lingua Obscura.

In the recent case of online harassment of currency campaigner Caroline Criado-Perez, it was reported that those found guilty of sending abusive tweets had set up multiple anonymous Twitter accounts solely for the purpose of Twitter trolling. It’s increasingly easy to do. However, the harassers, identified as Isabella Sorley and John Nimmo, had not been so diligent in covering their tracks and were ultimately revealed by links to other social media accounts that contained information about their real identities. Had their online anonymity been maintained, could forensic linguistic methods have uncovered the mystery of their identities based purely on textual clues?

Is linguistic anonymity possible when the language detectives are hot on your heels?

Following on from our previous discussion of online anonymity, we’ve seen that in general internet interaction, mindful users can often obscure the more obvious identity markers that can’t always be hidden in real life, gender being a major one. Subconscious assumptions about the stereotypical linguistic behaviors of men and women play a large part in allowing a superficial anonymity among regular internet users. The Internet has long been viewed as a stage for users to role play and freely explore different personas. In the fast-paced blur of casual online social interaction, interlocutors often accept internet personas at face value, a fact that can be easily manipulated by those who want their real life identities to stay out of the spotlight.

The Internet phenomenon of catfishing, in which deceivers role play and perpetuate false identities for various scams, popularly for the purpose of online romance, is seemingly widespread with victims all too ready to believe that an anonymous person online is really who they say they are. Willing believers clearly make online anonymity a cinch for role players. But how easy is it for experts to trace a single individual by their language use across different online media, regardless of who they pretend to be, whether they’re a female gamer playing under the radar to avoid harassment, or a male chatroom user exploring a female persona?

When online anonymity involves a crime or a serious case of abuse, often the only evidence available to identify the perpetrators is text-based (and of this, there may be very little to go on). It is then that language experts working in a little-known branch of linguistics and computer science might be called in to crack the case. Welcome to the world of forensic linguistics and stylometry. It might be a surprise to most, but forensic linguistics, in which language science is applied to law and police work, and stylometry, which investigates elements of linguistic style in writing, are actually a thing. It may be a case of blink and you’d miss the forensic linguist. As Peter Tiersma and Lawrence M. Solan’s 2002 review of forensic linguistics in legal contexts states, “despite these indications of an increasing presence of expert linguists in American courtrooms, […] the vast majority of American lawyers and judges have little or no experience with linguistic expertise in a legal matter. Many have never even heard of it.”

This is partly because forensic linguistics is still a growing area, though by no means a new one. The fields of forensic linguistics and forensic stylometry, where language science, the law and detective work become oddball partners in crime, certainly existed before the advent of the internet, in cases involving bomb threats, hoaxes, speaker identification and the like. In those early days, this branch of applied linguistics was still finding its feet: voiceprints were claimed to be as unique and accurately identifying as fingerprints, and inexperienced, untrained academics were often called upon to act as expert witnesses in court and police cases, sometimes with disastrous results. Over time, more robust methods have been developed, resulting in more successful convictions in the courtroom on the basis of linguistic detective work.

Experts claim a regular anonymous internet user may be tracked through linguistic clues they unwittingly leave behind in their writing. According to Dr Tim Grant in an article for The Conversation, “everything from the way someone uses capitalization or personal pronouns, to the words someone typically omits or includes, to a breakdown of average word or sentence length, can help identify the writer of even a short text like a Tweet or text message.” So it might surprise you how much of your individual writing style you leave behind for linguists to rifle through, even if you are a success at pretending to be someone different on the internet.
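To make those clues a little more concrete, here is a minimal sketch in Python of how a few of the features Grant mentions (capitalization, personal pronouns, average word and sentence length) might be counted and compared. It is only an illustration, not the method any forensic linguist actually uses: the function names `style_profile` and `profile_distance`, the short pronoun list, and the sample texts are all assumptions made up for this example, and real stylometric systems rely on far richer feature sets and proper statistical models.

```python
import re

# Hypothetical, deliberately short list of personal pronouns (an assumption).
PRONOUNS = {"i", "me", "my", "mine", "we", "us", "our", "you", "your",
            "he", "him", "his", "she", "her", "they", "them", "their"}

def style_profile(text):
    """Return a few crude style measurements for a piece of text."""
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Treat runs of letters and apostrophes as words.
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return {"avg_word_len": 0.0, "avg_sentence_len": 0.0,
                "capitalized_ratio": 0.0, "pronoun_ratio": 0.0}
    n = len(words)
    return {
        "avg_word_len": sum(len(w) for w in words) / n,
        "avg_sentence_len": n / max(len(sentences), 1),
        "capitalized_ratio": sum(w[0].isupper() for w in words) / n,
        "pronoun_ratio": sum(w.lower() in PRONOUNS for w in words) / n,
    }

def profile_distance(a, b):
    """Sum of absolute differences between two profiles (smaller = more alike)."""
    return sum(abs(a[key] - b[key]) for key in a)

# Made-up sample texts, purely for illustration.
known = style_profile("I told you already. I am not going to that party!")
anonymous = style_profile("i dont care what u think lol. whatever")
print(profile_distance(known, anonymous))
```

A raw sum like this lets the largest-valued feature (sentence length) dominate, so a serious comparison would at least rescale each feature first, and would need far more text than a single tweet to say anything with confidence.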

AOL discovered this to their detriment in August 2006, when they released “a file containing 20 million search queries for over 650,000 users. Although the identities of the users were not included in the data, it was found that many users in the file could be easily re-identified. This caused such fierce public protests that AOL removed the data from the website within days,” Xiao-Bai Li and Sumit Sarkar’s 2009 paper states. A more recent example of anonymity identification is the notorious case of Robert Galbraith, a supposed fledgling mystery writer who was uncovered, partially through authorship analysis, as J.K. Rowling*. This was possible using newer computational linguistic methods to a greater level of certainty than ever before. So simply put, they have the technology and they can build it – a linguistic case against anonymity.

With the rise of social media use, there has also been a rise in online harassment by anonymous perpetrators, and more cases may now be legally considered crimes, according to Chris Reed’s 2010 discussion on crime and the online persona. As a result, the role of the forensic linguist in identifying anonymous suspects has become ever more important in the online realm. Although the level of certainty in anonymous identification may not be as definitive as for a forensic technician working with DNA samples, computational forensic linguistic methods are rapidly improving, and online anonymity under expert scrutiny is becoming harder to pull off. Courts are growing more accepting of this linguistic evidence as expert testimony.

So how easy is it to deliberately hide your linguistic identity online? An interesting paper on adversarial stylometry by Michael Brennan, Sadia Afroz, and Rachel Greenstadt suggests that it is possible to obfuscate your writing, or to imitate another person’s writing style, in ways that lower the chances of being identified by modern stylometric methods. “The obfuscation approach weakens all methods to the point that they are no better than randomly guessing the correct author of a document. The imitation approach was widely successful in causing authorship to be attributed to the intended imitation target. Additionally, these passages were generated by participants in very short periods of time by amateur writers who lacked expertise in stylometry,” according to the authors.

In time, can detection methods in forensic linguistics and stylometry for linguistic ‘fingerprints’ reach a stage where they are accepted as being as definitive as DNA evidence? The jury is out on that point. As to the question of whether an individual can deliberately obfuscate their writing to achieve online anonymity? Well, you might get away with it for a short time, if it weren’t for those pesky forensic linguists.

*(Speaking of which, J.K. Rowling should surely have known to beware the inventor of Parseltongue, since Dr Francis Nolan also happens to be a well-known forensic linguist).

JSTOR Citations

The Linguist on the Witness Stand: Forensic Linguistics in American Courts

By: Peter Tiersma and Lawrence M. Solan

Language Vol. 78, No. 2 (Jun., 2002), pp. 221-239

Linguistic Society of America

Stylometric Identification in Electronic Markets: Scalability and Robustness

By: Ahmed Abbasi, Hsinchun Chen and Jay F. Nunamaker Jr.

Journal of Management Information Systems Vol. 25, No. 1 (Summer, 2008), pp. 49-78

Taylor & Francis, Ltd.

Authorship Attribution

By: David I. Holmes

Computers and the Humanities, Vol. 28, No. 2 (Apr., 1994), pp. 87-106


Gender and sexual identity authentication in language use: the case of chat rooms


Discourse Studies Vol. 10, No. 2 (April 2008), pp. 251-270

Sage Publications, Ltd.

Privacy and Data-Based Research

By: Ori Heffetz and Katrina Ligett

The Journal of Economic Perspectives Vol. 28, No. 2 (Spring 2014), pp. 75-98

American Economic Association

Against Classification Attacks: A Decision Tree Pruning Approach to Privacy Protection in Data Mining

By: Xiao-Bai Li and Sumit Sarkar

Operations Research, Vol. 57, No. 6 (Nov. - Dec., 2009), pp. 1496-1509


Why Must You Be Mean to Me? Crime and the Online Persona

By: Chris Reed

New Criminal Law Review: An International and Interdisciplinary Journal, Vol. 13, No. 3 (Summer 2010), pp. 485-514

University of California Press

Authorship attribution in the wild

By: Moshe Koppel, Jonathan Schler and Shlomo Argamon

Language Resources and Evaluation Vol. 45, No. 1, Plagiarism and Authorship Analysis (Winter 2011), pp. 83-94


Forensic Phonetics

By: Francis Nolan

Journal of Linguistics Vol. 27, No. 2 (Sep., 1991), pp. 483-493

Cambridge University Press

Chi Luu

Chi Luu is a computational linguist and NLP researcher who tinkers with tiny models and machines to uncover curious mysteries in human language. She has degrees in Theoretical Linguistics and Literature, with a morbid focus on dead and dying languages. She has worked on dictionaries, multi-language search engines, and question answering applications.
