Text Analysis: Linguistic Meets Data Science: 2.2. Patterns in language and text: 2.2.1. Collocations

2.2. Patterns in language and text

2.2.1. Collocations

Collocation queries are one of the key methods used in corpus linguistics to find words that tend to appear together (co-locate) with the query word in our concordances. This can tell us a lot about how the word we are interested in is used in our specific sample. One of the classic sayings in linguistics, attributed to the linguist J.R. Firth (1957), is that

"you shall know a word by the company it keeps".

Collocation queries provide a list of the words that commonly occur with the query word, together with their frequencies. By viewing these collocate frequencies, we can get a good idea of how different items are distributed around the term we originally queried. For example, the collocates of words like "immigrant" or "refugee" may reveal that they appear with words like "illegal" or "crime" in some newspapers and with words like "help" and "welcome" in others.

Importantly, collocations are usually not identified simply by frequencies, because doing so would almost invariably list common function words such as articles and prepositions as frequent collocates of any given word. Instead, collocations are calculated statistically using a variety of strength-of-association statistics. These methods take into account the frequencies and distributions of all the words in the corpus and use them to estimate how often any two words would be expected to co-occur by chance. Depending on the statistic used, the strength of association (which we could think of as attraction or repulsion) is quantified, and the words that appear near the query word more often than chance would predict are listed as collocates.
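To make the contrast between raw frequency and statistical association concrete, here is a minimal sketch in Python. The sentences are invented toy data, and the statistic shown is mutual information (MI) with a span-adjusted expected frequency. That is only one common formulation among many and is our assumption here, not a method prescribed by this unit.

```python
import math
from collections import Counter

# Invented toy sentences, for illustration only (not real newspaper data).
sentences = [
    "the refugee needs help from the town",
    "the town offers help to the refugee",
    "the council met in the town hall",
    "the weather in the north was cold",
    "the refugee thanked the council",
]
NODE = "refugee"  # the query word
SPAN = 3          # words counted on each side of the query word

tokenised = [s.split() for s in sentences]
word_freq = Counter(w for sent in tokenised for w in sent)
n_tokens = sum(word_freq.values())

# Observed co-occurrences: every word within SPAN positions of the node.
cooc = Counter()
for sent in tokenised:
    for i, word in enumerate(sent):
        if word == NODE:
            cooc.update(sent[max(0, i - SPAN):i] + sent[i + 1:i + 1 + SPAN])

def mutual_information(collocate):
    """log2(observed / expected), where 'expected' assumes chance co-occurrence."""
    observed = cooc[collocate]
    expected = word_freq[NODE] * word_freq[collocate] * (2 * SPAN) / n_tokens
    return math.log2(observed / expected)

by_frequency = [w for w, _ in cooc.most_common()]
mi_scores = {w: mutual_information(w) for w in cooc}
by_mi = sorted(mi_scores, key=mi_scores.get, reverse=True)

print(by_frequency[:2])               # ['the', 'help']
print(round(mi_scores["the"], 2))     # -0.45
print(round(mi_scores["help"], 2))    # 0.87
print(by_mi[-1])                      # the
```

By raw frequency, the article "the" is the top collocate of "refugee". Once its overall frequency in the corpus is taken into account, it co-occurs slightly less often than chance would predict and drops to the bottom of the list, while "help" scores positively. A corpus this small is far too little data for real conclusions; the point is only the mechanics.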

There are numerous different ways of calculating association strengths. Some correspond very closely with the word associations that a human reader might intuitively produce, while others produce lists that prioritise words that are strongly associated but more difficult for us to notice. Each method has its own uses. It is important to know which method we are using!
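As a rough illustration of why the choice of method matters, the sketch below scores two hypothetical collocates with two widely used statistics, mutual information (MI) and the t-score. All counts are invented for the example, and the function name and parameters are our own; real tools may define the expected frequency slightly differently.

```python
import math

def association_scores(observed, node_freq, collocate_freq, corpus_size, window=6):
    """Return (MI, t-score) for one node/collocate pair.

    `window` is the total number of positions searched around the node
    (for example 3 left + 3 right = 6); this span adjustment is an assumption.
    """
    expected = node_freq * collocate_freq * window / corpus_size
    mi = math.log2(observed / expected)
    t_score = (observed - expected) / math.sqrt(observed)
    return mi, t_score

# Hypothetical counts: a one-million-word corpus, query word occurring 1,000 times.
CORPUS_SIZE = 1_000_000
NODE_FREQ = 1_000

# A rare word that nearly always appears near the query word ...
rare_mi, rare_t = association_scores(20, NODE_FREQ, 50, CORPUS_SIZE)
# ... and a frequent word that appears near it often in absolute terms.
freq_mi, freq_t = association_scores(100, NODE_FREQ, 5_000, CORPUS_SIZE)

print(round(rare_mi, 2), round(rare_t, 2))   # 6.06 4.41
print(round(freq_mi, 2), round(freq_t, 2))   # 1.74 7.0
```

With these made-up figures, MI ranks the rare word first, while the t-score ranks the frequent word first. Both are valid measures of association; they simply answer slightly different questions, which is why the same query can yield quite different collocate lists.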

Collocations are commonly calculated from a span of words to the right (R) and left (L) of the query word; typical spans range from 3 to 5 words on each side. There is no hard-and-fast rule about the correct number of words on either side. Still, the general principle is that a shorter span will identify words that are directly associated with the query word, while longer spans will tell us more about topical associations.

We might also want to look beyond the immediate left- or right-side collocate. To do so, we specify the position of each word relative to the query word: R1, R2 and R3 refer to the first, second and third words to the right, and L1, L2 and L3 to the first, second and third words to the left.

Example:

Query: lead

KWIC line: All roads lead to Rome.
Collocates: L2 = All, L1 = roads, R1 = to, R2 = Rome
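The position labels in this example can be assigned mechanically. The following is a minimal sketch in Python; the function name `positional_collocates` and its parameters are hypothetical, and the tokenisation (splitting on spaces, punctuation left out) is deliberately naive.

```python
def positional_collocates(tokens, query, left=2, right=2):
    """For each hit of `query`, map slot labels (L2, L1, R1, ...) to words."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() != query.lower():
            continue
        slots = {}
        for offset in range(-left, right + 1):
            if offset == 0:
                continue  # offset 0 is the query word itself
            j = i + offset
            if 0 <= j < len(tokens):  # skip slots that fall outside the text
                label = f"L{-offset}" if offset < 0 else f"R{offset}"
                slots[label] = tokens[j]
        hits.append(slots)
    return hits

tokens = "All roads lead to Rome".split()
result = positional_collocates(tokens, "lead")
print(result)
# [{'L2': 'All', 'L1': 'roads', 'R1': 'to', 'R2': 'Rome'}]
```

Widening `left` and `right` corresponds to choosing a longer span; in a real corpus, the words collected in each slot across all hits would then be counted and scored.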

Collocates will often lead us to other interesting patterns in language, for instance, idiomatic expressions. While most top collocates are likely to make sense to a competent speaker of the language, idiomatic expressions can produce collocates that we might not expect to find. Idiomatic expressions, or idioms, are phrases whose meaning cannot be worked out from the literal meanings of their words, such as "piece of cake", meaning that something is easy, or "raining cats and dogs", meaning heavy rain rather than pets actually falling from the sky. This highlights the importance of being aware of linguistic phenomena when approaching quantitative or data-driven methods: language viewed out of context might not make sense, and some expressions only make sense to human readers. We will discuss formulaic patterns in language later in this unit.