Text Analysis: Linguistic Meets Data Science

3.4. Text Classification

3.4.1. Text Classification

Let's look at text classification as a supervised sibling of topic modelling. As we remember from the introduction to ML in our lesson on sentiment analysis, supervised ML means that the model learns from a training dataset whose examples have already been labelled. Our goal with text classification is to define a set of categories and then train a model to detect those categories in documents, so that the documents can be classified accordingly. Sentiment analysis can itself be seen as a form of text classification, as it relies on predefined categories (positive, neutral, negative sentiment) to classify unstructured data (text).

Approaches to text classification can be grouped into three categories: rule-based, ML-based, and hybrid. Rule-based and ML-based should be familiar to us, as they mirror the lexicon-based versus ML-based approaches we discussed in sentiment analysis. Hybrid, then, indicates a mix of the two. We will discuss how these three approaches can be used in text classification, and then we will look at an example workflow in KNIME.

Rule-based systems rely on sets of rules created manually by humans. At their most basic level, these are formed exactly like the lexicons used for sentiment analysis, meaning that they make comparisons based on a collection of words defined as belonging to our desired categories. These can, of course, also include pre-defined N-grams or phrases that can be considered indicative of a category. The text is then categorized based on the frequency of words belonging to each category within it. As with sentiment analysis, this approach has weaknesses in terms of scalability and human error. Its strengths lie partly in the explicitness of the rule set (meaning we know what we put in it) and partly in the ability to attain very high accuracy in complex categorizations, because human judgement guides the process.
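To make the frequency-based idea concrete, here is a minimal sketch in Python. The categories, the lexicon entries, and the function name classify_by_lexicon are hypothetical and chosen only for illustration; a real lexicon would be much larger and would be built by someone who knows the domain.

```python
import re
from collections import Counter

# Hypothetical lexicon: each category maps to indicative words or phrases.
LEXICON = {
    "sports": ["match", "goal", "league", "world cup"],
    "finance": ["market", "stock", "interest rate", "bank"],
}


def classify_by_lexicon(text, lexicon=LEXICON, default="unknown"):
    """Return the category whose lexicon entries occur most often in text."""
    lowered = text.lower()
    counts = Counter()
    for category, terms in lexicon.items():
        for term in terms:
            # Word boundaries keep "goal" from matching inside "goalless".
            pattern = r"\b" + re.escape(term) + r"\b"
            counts[category] += len(re.findall(pattern, lowered))
    # If no lexicon entry occurs at all, we do not guess a category.
    if not counts or max(counts.values()) == 0:
        return default
    return counts.most_common(1)[0][0]


print(classify_by_lexicon("The striker scored a late goal in the league match."))
print(classify_by_lexicon("A quiet walk in the forest."))
```

Note that every decision can be traced back to an entry we wrote ourselves, which is the explicitness mentioned above. The sketch also shows the scalability problem: any word we forgot to list simply does not count.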

Much like ML-based approaches to sentiment analysis, ML-based approaches to text classification rely on a training dataset that is used to train a model to distinguish between our desired categories. As with other supervised ML approaches, we start with a set of documents that have already been tagged with our intended categories. Once we have prepared our training dataset, we can let the model learn the connections between the categories we have specified and the content of the texts. In practice, the model breaks each text down into features and then connects those features to the category. As previously mentioned, features can be words or patterns, or even POS-tag distributions or stylistic features if our categories are stylistic in nature, such as genres of writing.
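As an illustration of how features are connected to categories, the sketch below implements a small multinomial Naive Bayes classifier using only the Python standard library. Naive Bayes is our choice for the example, not a method prescribed by this lesson, and the class name, the four training documents, and their labels are made up. The features here are simply lowercase word tokens.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Break a text into lowercase word features."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        # Words never seen in training carry no information, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior of the category
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, already-tagged training documents.
train_docs = [
    "the team won the match with a late goal",
    "the league season ends with a cup final",
    "the bank raised interest rates again",
    "stock markets fell as investors sold shares",
]
train_labels = ["sports", "sports", "finance", "finance"]

model = NaiveBayesTextClassifier().fit(train_docs, train_labels)
print(model.predict("a goal in the final match"))
print(model.predict("investors worry about bank shares"))
```

Unlike the rule-based approach, nobody wrote down which words belong to which category here. The model derived those connections from the tagged examples, which is why the quality of the training dataset matters so much.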

Hybrid systems for text classification are intended to bring together the best of the rule-based and ML-based approaches. A common hybrid application is to start with an ML-based approach and then augment it with a manually created rule-based lexicon to improve accuracy. The result is a model that is easier to fine-tune and that can handle the more obvious classification tasks automatically. In theory, this balances accuracy against human workload, as a human should only need to get involved where issues occur.
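One possible way to arrange such a hybrid is sketched below. It assumes that the ML model returns a label together with a confidence score between 0 and 1; when the model is unsure, a small set of manual rules is consulted, and if the rules do not help either, the document is flagged for a human. The rule phrases, the threshold of 0.8, and the function toy_ml_predict (a stand-in for a trained model) are all hypothetical.

```python
import re

# Hypothetical manual rules: phrases that decide a category when present.
OVERRIDE_RULES = {
    "sports": ["world cup", "penalty shootout"],
    "finance": ["interest rate", "stock exchange"],
}


def rule_label(text, rules=OVERRIDE_RULES):
    """Return a category if exactly one category's rules match, else None."""
    lowered = text.lower()
    matched = [
        category
        for category, phrases in rules.items()
        if any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in phrases)
    ]
    return matched[0] if len(matched) == 1 else None


def hybrid_classify(text, ml_predict, threshold=0.8):
    """Combine an ML prediction with manual rules.

    ml_predict is any function returning (label, confidence).
    Returns (label, source), where source records who made the decision.
    """
    label, confidence = ml_predict(text)
    if confidence >= threshold:
        return label, "ml"            # obvious case, handled automatically
    ruled = rule_label(text)
    if ruled is not None:
        return ruled, "rule"          # manual rule settles the unsure case
    return label, "needs_review"      # a human should look at this one


def toy_ml_predict(text):
    """Stand-in for a trained model; the numbers are invented."""
    if "bank" in text.lower():
        return "finance", 0.95
    return "sports", 0.55


print(hybrid_classify("The bank reported profits.", toy_ml_predict))
print(hybrid_classify("The central authority cut the interest rate.", toy_ml_predict))
print(hybrid_classify("An unusual document.", toy_ml_predict))
```

Recording the source of each decision makes it easy to see how much of the work is done automatically and how many documents still end up with a human.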

So, what should we consider when picking a route for text classification? Coming from topic modelling, we can apply some of the same ideas when we build our categories: How many do we need? How broad should they be? In addition to those familiar considerations, we must also decide how to build our training dataset. We could certainly start this process with a topic model of a representative dataset and then take the hybrid approach by manually altering and refining the resulting categorizations. We could also simply write a lexicon for a rule-based approach if we are familiar enough with the context, for instance, if we are categorizing research articles from our own field of study. If we are working with documents from a context that we do not feel secure in dealing with, letting the ML machinery do its work might be the best choice. To see how text classification can be used in a practical context, have a look at this discussion in OER 7. You can then return here and consider how the three types of text classification could fit the needs of a knowledge organisation system.

We are now going to move on to an example from KNIME and talk a bit about how a text classification workflow could look.