Corpus linguistics – sometimes confused with computational linguistics, a related but distinct field – is not in and of itself the study of language. Rather, it’s a set of sophisticated tools and methods that help linguists study and understand some very interesting features of large volumes of language or textual data. These datasets are often compared against reference corpora to see what makes them stand out.
At the heart of this methodology is the design and building of a corpus – literally a body, or database, of textual material that you might want to examine more closely for any number of reasons. Once you have built your corpus (no small task in itself), you can interrogate the textual data with specialised software tools*.
Using corpus software allows you to analyse thousands, even millions, of words from your corpus in very short order. Typically you could be looking for the frequency of a word in everyday use in a particular part of the world; you might want to look at particular words and the other words that are associated with them – keywords in context (KWIC) or collocations, as we call them in linguistics; and if your corpus has been annotated (or tagged) for certain language or other features such as gender, you might be able to look at gendered patterns of language use.
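Two of the queries mentioned above, word frequency and collocation, can be sketched in a few lines of plain Python. The sample sentence, the node word and the window size of two are all illustrative assumptions; dedicated corpus tools add statistical association measures on top of raw counts like these:

```python
# Minimal sketch of two basic corpus queries:
# word frequency, and collocates within a fixed window around a node word.
from collections import Counter

def word_frequencies(tokens):
    """Count how often each word form occurs in the corpus."""
    return Counter(tokens)

def collocates(tokens, node, window=2):
    """Count words appearing within `window` positions of each occurrence of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for t in tokens[lo:hi] if t != node)
    return counts

tokens = "the cat sat on the mat and the cat slept".split()
print(word_frequencies(tokens).most_common(2))  # the two most frequent word forms
print(collocates(tokens, "cat"))                # words found near "cat"
```

On a real corpus you would of course rank collocates by an association score rather than raw frequency, but the mechanics of windowed counting are the same.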
So, that’s just a starting point. Getting into a corpus is probably easier than working out exactly what research questions you want to ask. We’ll explore this issue more in later posts.
*Anthony, L. (2013). A critical look at software tools in corpus linguistics. Linguistic Research, 30(2), 141–161.