Arts & Culture, Opinion

13th January 2026

Archived Fluency: Deconstructing Corpus Linguistics

Introduction

Language has remained a constant in my life, both as a source of enjoyment and as an academic interest. I have always been proficient in analysing literary and non-literary works, but I am also intrigued by technological developments (especially in relation to media).

A while back, I was having a look at my Year 2 and 3 modules, and one topic in particular intrigued me: Corpus and Computational Linguistics. I enjoyed studying Computer Science at GCSE and A-Level, but the specifications never offered this interdisciplinary approach to digital linguistics. I also really enjoyed English at GCSE and A-Level, but I wanted to go beyond the classic, rigid works of Shakespeare and Fitzgerald.

Moreover, media’s blend of technology with textual analysis complements my interests. If you’d like to learn more about what it was like studying this combination, check out my article titled Narratives, Media and Algorithms: The Interdisciplinary Approach.

After reading about this module, I began to do some research on corpus linguistics, which turned out to be fascinating and to sit right at the intersection of my interests. In this article, I will discuss what it is, its historical context and how it works today. This will by no means cover every nuance, as it is still a topic I’m getting to grips with.

Additionally, I will address computational linguistics within a future article and then do a comparison between the two. Keep on the lookout for that.

Before we start: a large part of my research draws on McEnery and Wilson’s 2001 book Corpus Linguistics, so a huge thank you to them for helping me comprehend the subject. I would also recommend this EBSCO guide to fill in any gaps I leave.

What is corpus linguistics?

According to McEnery and Wilson, corpus linguistics can be simply defined as “the study of language based on examples of ‘real life’ language use”; this means it is not a branch of linguistics (e.g – syntax, semantics, computational linguistics, sociolinguistics, etc) but a methodology: an approach to how we study and understand language. This empirical outlook can be applied within most areas of linguistics, such as syntax, semantics and pragmatics.

What constitutes a corpus?

  • A corpus, in simplified terms, is a collection of texts (written, or transcribed from speech) that has been systematically arranged to allow for language analysis; building one requires some level of abstraction – simplifying the raw material by removing unnecessary details.
  • Corpora is the plural of corpus.


  • Within modern linguistics, a corpus has 4 main attributes:
    Sampling (gathering a small, specialised sample of utterances) and representativeness (a diverse range of texts from different genres/authors/periods)
    Finite size (typically around 1,000,000 words; however, some corpora continue to grow in scope and magnitude)
    Machine-readable format (the term once referred to printed text; machine-readable form is now the default)
    A standard reference (a foundation which allows successive studies to be carried out and compared)

  • They can exist in 2 forms (illustrated below):
    Annotated (enriched with linguistic information, e.g – part-of-speech tags, acting as a repository of linguistic analysis)
    Unannotated (the corpus in its initial raw state of plain text)
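
To make the contrast concrete, here is a minimal sketch in Python. The tags are hypothetical simplifications, loosely modelled on the CLAWS-style part-of-speech tagging used for corpora such as the BNC:

```python
# Unannotated: the corpus in its initial raw state of plain text.
unannotated = "The cat sat on the mat."

# Annotated: the same text enriched with linguistic information,
# here hypothetical CLAWS-style part-of-speech tags appended to each word.
annotated = "The_AT0 cat_NN1 sat_VVD on_PRP the_AT0 mat_NN1 ._PUN"
```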

For example, the British National Corpus (BNC) is:

“a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th Century, both spoken and written”

How was the BNC created?

Created and maintained by the BNC Consortium (led by Oxford University Press), the collection comprises 90% written works (e.g – newspaper extracts, journals, periodicals, academic books, essays, etc) and 10% spoken works (e.g – transcripts of everyday conversations, radio shows, interviews, meetings).

Work began on the corpus in 1991 after it received funding from commercial partners, the Science and Engineering Research Council (now known as the EPSRC) and the UK government’s Department of Trade and Industry (whose remit now sits with the Department for Science, Innovation and Technology | DSIT) under the JFIT (Joint Framework for Information Technology) programme. The British Library and the British Academy also aided the corpus’ development.

The corpus was completed in 1994; whilst the text was refined before the release of the second edition, BNC World (2001), and the third, the XML Edition (2007), no additional texts have been incorporated. This contrasts with John Sinclair’s COBUILD team at the University of Birmingham, who developed a monitor corpus (a dynamic corpus which constantly incorporates new texts, expanding its scope and longevity), ensuring that more contemporary texts and broader linguistic samples can be obtained.

For more information on how the BNC carried out work on the project, click here.

How was corpus linguistics created?

In the early 1900s, the field itself wasn’t truly defined; in retrospect, however, linguists such as the German-American anthropologist Franz Boas adopted a framework akin to corpus-based approaches. In Randy Allen Harris’ 1993 book The Linguistics Wars, he states on page 27 that this approach started “with a large collection of recorded utterances from some language, a corpus. The corpus was subjected to a clear, stepwise, bottom-up strategy of analysis”. This bottom-up strategy suggests to me that early iterations of corpus linguistics took a structural approach, starting from the simplest units and building outwards to encompass broader structures and connections.

Addressing ‘early corpus linguistics’:

When examining texts and studies from this period, you won’t find the term ‘early corpus linguistics’. The term is used in McEnery and Wilson’s book to “categorise all of this work” (pg.3) in retrospect. The first true attempts at corpus-based description took place in the 19th and early 20th Century, when child language was studied via language acquisition research (roughly between 1876-1926). These studies were based on “carefully composed parental diaries recording the child’s locutions” (their vocabulary choices). Corpus research and collection expanded beyond diary studies over the following decades.

Corpus-based language acquisition research:

Researchers such as William Preyer (whose 1882 book The Mind of the Child described the losses and gains of naturally developing language, relating this to neuropsychology) and William Stern (whose 1924 theory of language development led him to label his perspective “personalistic-genetic”) are still cited in language acquisition research today. For instance, David Ingram’s 1978 book First Language Acquisition detailed how children receive and produce language in real time through the key fundamental areas (e.g – phonology, morphology, syntax, semantics).

Language acquisition studies – large vs longitudinal:

McEnery and Wilson identify 2 forms of sample study carried out within language acquisition research: large and longitudinal. Large sample studies occurred roughly between 1927-1957, with corpora gathered from a large number of children in an attempt to establish norms for the development of language acquisition.

Meanwhile, longitudinal studies have been carried out from 1957 to the present day, based on the collection of utterances (typically following around 3 children as a data source over time); e.g – Roger Brown’s 1973 book A First Language: The Early Stages and Lois Bloom’s investigations into language acquisition, such as her 1970 piece Language Development: Form and Function in Emerging Grammars.

Foreign language pedagogy:

We must also consider language in the context of teaching: specifically, foreign language pedagogy (the method and practice of teaching). This area of corpus development was popularised by works such as Fries and Traver’s English Word Lists from 1940 and Bongers’ 1947 study of vocabulary control, The History and Principles of Vocabulary Control. Vocabulary lists for foreign learners became crucial tools, made possible by corpus research and development.

Corpus approaches within comparative linguistics:

According to McEnery and Wilson, comparative linguistics also displays a corpus-based approach, such as Helen Eaton’s 1940 An English-French-German-Spanish Word Frequency Dictionary. Comparable corpora wouldn’t be created again until the early 1990s, e.g – the CRATER project, whose sentence and word alignment McEnery and Oakes discuss in a 1996 piece (in which they mention parallel corpora – where the same text is held in its original language alongside a translation), published within Jenny Thomas and Mick Short’s book Using Corpora for Language Research.

Chomsky and discourse surrounding corpus linguistics:

McEnery and Wilson also suggest that the development of corpus linguistics stalled in the late 1950s because the methodology came under heavy scrutiny, driven by Noam Chomsky’s criticisms of the corpus as a source of information. Because of his considerable influence, his criticisms carried great weight, and his remarks against corpus data drew protest from those who wanted to utilise it.

Rationalists vs empiricists:

According to them, Chomsky was the catalyst for a long-running debate within linguistics: rationalists vs empiricists. This debate isn’t exclusive to linguistics; it is, at root, “the basic decision of whether to rely on artificially induced observations or to rely on naturally occurring observations”.

Rationalist theory is based on artificial data and introspective judgements (e.g – a native speaker of a language offering claims based on their own reflections).

In comparison, empiricist theory relies on naturally occurring data, frequently through the medium of the corpus (e.g – written/spoken). The example McEnery and Wilson provide is this:

Imagine we decide to determine whether sentence x is a valid sentence of language y by looking in a corpus of the language in question. We may then gather evidence of how sentence x conforms to the norms of the grammar.

Both approaches have pros and cons.

Chomsky’s criticisms against corpus:

Chomsky’s dissatisfaction with corpus linguistics stemmed from his belief that linguists must seek to model language competence (internal linguistic knowledge) rather than performance (the external evidence of that internal knowledge, as shown in actual language use). He argued that competence explains and characterises a speaker’s language proficiency, whereas performance can easily be distorted by factors such as drink or short-term memory limitations, which restrict what it can tell us.

A corpus is intrinsically a collection of externalised utterances and is thus performance data; for Chomsky, this means it lacks true value to linguists attempting to model competence.

Considering data subject context within corpus research:

When developing a corpus, we must consider the full context of the subjects providing the data, since linguistic competence may otherwise be misconstrued through incorrect modelling. The example McEnery and Wilson use is:

“To paint an extreme example, consider a large body of transcribed speech based on conversations with aphasics (people whose brain damage impairs their ability to communicate). We could easily end up modelling features of aphasia as grammatical competence unless we are made aware of the nature of the corpus”

In other words, if you don’t know where your data stems from, false conclusions about language can easily be drawn. Language use depends on who is speaking and under what conditions (e.g – a boss speaking to an employee, friends engaging in conversation). In the case of aphasia, subjects may omit words, mix up meanings and struggle to form sentences.

Chomsky’s assumptions about corpus linguistics:

According to Chomsky, corpus linguistics rests on 2 fundamentally flawed assumptions:

  • The sentences of a natural language are finite
  • The sentences of a natural language can be collected and catalogued

Addressing said assumptions:

However, McEnery and Wilson – alongside you and me – recognise that the sentences of a natural language cannot feasibly be counted: their number is potentially infinite, as lexis and syntax combine to produce an endless variety of sentences. Performance data, such as that of a corpus, therefore cannot objectively explain the nature of language.

Corpora are definitively incomplete entities: language is incalculable, and no finite corpus can fairly represent language as a concept. Some sentences are present within a corpus because they’re frequent constructions (e.g – a corpus focused on what individuals had for lunch may frequently contain “Today, for lunch I had…” + “and then I had…” + “which was…”). Others are present through sheer random chance.

Chomsky’s paper about corpus:

Chomsky addressed this skewedness in a paper presented at the Ninth International Congress of Linguists, called The Logical Basis of Linguistic Theory. In it, he remarked:

“Any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [based upon it] would be no more than a mere list”

Recognising Chomsky’s claims:

McEnery and Wilson accept Chomsky’s observation as perfectly accurate, with corpora being partial in 2 senses of the word. First, they are partial in the sense of being incomplete, containing only some of the valid sentences of a natural language. Secondly, they are partial in the sense of being skewed, with the frequency of a language feature in everyday use being a large determiner of its inclusion (e.g – Chomsky remarked that the sentence “I live in New York” is fundamentally more likely than “I live in Dayton, Ohio”, simply because more people are likely to say the former than the latter).

Forcing the scientific method:

One criticism of early corpus linguists was that they were forcing a scientific approach onto linguistics, with Paul Postal deeming their view of the discipline, as “involving rigorous collection of data and cataloguing”, to be “very primitive and silly” (cited in McEnery and Wilson, pg.10). In Modular Approaches to the Study of the Mind, Chomsky claims that “if you sit and think for a few minutes, you’re flooded with relevant data” (pg.44). We can see this spontaneity within this exchange:

Chomsky: The verb perform cannot be used with mass word objects: one can perform a task but one cannot perform labour.
Hatcher: How do you know, if you don’t use a corpus and have not studied the verb perform?
Chomsky: How do I know? Because I am a native speaker of the English language.

Credit: The Third Texas Conference on Problems of Linguistic Analysis in English – University of Texas (1962)

As McEnery and Wilson comment, Chomsky’s adamance here is admirable and definitely seems impressive at first. But it exposes the necessity of corpus data: Chomsky was incorrect in his claim. As they remark, “One can perform magic”. Still, our own critical thinking can often reduce the time potentially spent scouring through a corpus.

Summarising the proposed arguments against corpus linguistics:

To sum up, these are the arguments opposing corpora so far:

  • Corpora encourage false modelling, as we document performance rather than competence (Chomsky argued linguistics aimed to offer introspection and explanation of linguistic competence, rather than the counting and describing of linguistic performance)
  • Natural languages aren’t finite, making the potential linguistic goal of counting and describing them impossible to achieve; counting sentences will never provide a satisfying description of language, as how can a partial corpus encompass all of an infinite language?
  • We must partly embrace introspection to detect ungrammatical and ambiguous structures
  • Scientific methods were being forced onto linguistics, diverging from its nature as a discipline

Other criticisms of corpus linguistics:

Beyond Chomsky, others had identified problems arising from corpus linguistics, particularly in relation to processing data. In his 1965 study Studies in Phonetics and Linguistics, David Abercrombie coined the term pseudo-procedure: a practice commonly claimed in principle but not practically implemented due to its difficulty. The term applied to the majority of corpus research carried out on language at the time – in this case, the manual examination of a corpus.

McEnery and Wilson pose a question to us as readers: can you imagine searching through an 11 million word corpus using nothing more than your eyes? The process sounds excruciatingly time-consuming, as well as costly and prone to human error. Early corpus linguists assembled large teams (e.g – Kaeding’s 5000 Prussian analysts | pg.12) to dissect these corpora, which required payment and carried the risk of analytical errors.

A world of corpus linguistics pre-computerisation must certainly have been challenging, and all but impossible to put into practice in the 1950s and beforehand. Early corpus linguistics was held back by manual documentation and data processing, which increased cost, time and inaccuracy.

Corpus computerisation:

Nowadays, the term corpus is strongly associated with its digital implementation, used for data processing. This is because computers have made tasks that were once unachievable within an environment of pseudo-procedure feasible. The use of searching (e.g – binary & linear search) and sorting (e.g – bubble & merge sort) algorithms, and the ability to carry out calculations, yields a vast array of untapped potential for a corpus linguist.

Computers do so by processing machine-readable text (most common) and/or digitised speech (becoming more common), rendering the pseudo-procedure invalid in some areas. We can ask the machine to (see the sketch after this list):

  • Search for a specified word within the text
  • Retrieve all examples of a specified word (usually with context; these different scenarios present its concordance – the different ways a word is used)
  • Count the number of times that specified word appears; this provides information on a word’s frequency
  • Sort the data in some way (e.g – alphabetically, ascending → descending and vice versa)
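
As a minimal sketch of those four abilities (in Python, with a tiny made-up corpus standing in for the millions of words a real corpus would contain):

```python
from collections import Counter

# A tiny made-up, unannotated corpus; real corpora run to millions of words.
corpus = (
    "today for lunch i had soup and then i had bread "
    "for lunch i had soup which was cold"
).split()

# 1. Search for a specified word within the text.
print("soup" in corpus)  # True

# 2. Retrieve all examples of a word with surrounding context (its concordance).
def concordance(tokens, word, window=3):
    for i, token in enumerate(tokens):
        if token == word:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left} [{token}] {right}")

concordance(corpus, "soup")

# 3. Count how many times a specified word appears (frequency information).
frequencies = Counter(corpus)
print(frequencies["soup"])  # 2

# 4. Sort the data, e.g. words by descending frequency.
for word, count in frequencies.most_common(5):
    print(word, count)
```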

However, pseudo-procedure is still present today. For example, McEnery and Wilson ask:

“if we wished to look at a non-annotated corpus to determine… the average number of noun phrases in an English sentence…[a] computer would be of little help if it could not accurately identify noun phrases”

This example highlights the continued necessity of human involvement in corpus research: a computer must receive meticulously constructed instructions, and it cannot count noun phrases unless it has first been told, precisely, how to recognise one (typically via annotation).
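
To illustrate, here is a minimal sketch in Python. It assumes the corpus has already been annotated with part-of-speech tags (by a human or a tagging tool); the tags and the crude noun-phrase pattern are hypothetical simplifications:

```python
# A sentence from a hypothetical annotated corpus: (word, part-of-speech) pairs.
# Without these tags, the computer has no way to recognise a noun phrase.
sentence = [
    ("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"),
    ("jumped", "VERB"), ("over", "PREP"),
    ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"),
]

def count_simple_noun_phrases(tagged):
    """Count noun phrases of the crude form: determiner, optional adjectives, noun."""
    count, i = 0, 0
    while i < len(tagged):
        if tagged[i][1] == "DET":
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "ADJ":
                j += 1
            if j < len(tagged) and tagged[j][1] == "NOUN":
                count += 1
                i = j
        i += 1
    return count

print(count_simple_noun_phrases(sentence))  # 2 ("the quick fox", "the lazy dog")
```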

The revival of corpus linguistics:

Corpus-based work – despite its criticisms – prevailed into the 1960s and 1970s (and surged in academic popularity from the 1980s onwards); however, it remained a niche field of research compared to other areas of linguistics. The methodology continued because of clear drawbacks in Chomsky’s rationalist approach to linguistics: unlike his introspective analysis of language, naturally occurring data can be directly observed and examined.

McEnery and Wilson ask “how can we be sure of [a speaker making an introspective judgement]” or when “they express an opinion on a thought process” – a process which remains invisible to the naked eye and clouded by one’s own perspective.

Performance data is at its most credible compared to the internal judgements of language competence: because it is publicly available, everyone can recognise and verify the same linguistic evidence. Moreover, the corpus offers a more systematic means of approaching language analysis than the hidden thought processes of linguistic competence.

Conclusion

So, that, in a nutshell, is corpus linguistics. I recognise this article doesn’t cover everything, but it should be enough of a start for newcomers to this methodology like myself.

Any questions? Where do you stand on corpus linguistics? Does it hold back or further the study of linguistics as a discipline? Feel free to contact me via johnjoyce4535@gmail.com!

Check out my last piece: Into the Abyss – Breaking Down the Stranger Things Finale

For more arts & culture, check out the link below:

https://www.liverpoolguildstudentmedia.co.uk/category/arts-culture
