3.2. Sentiment analysis

3.2.2. Machine learning and sentiment analysis

Before moving on to Machine Learning (ML) as an approach for sentiment analysis, we should first consider the strengths and weaknesses of lexicon-based approaches, so that we have a clear picture of what ML sentiment approaches can bring to the table in comparison. In summary, lexicon-based sentiment analysis is a good fit for contexts we know well, and it allows for detailed refinement through direct work on the dictionaries involved, particularly via the corpus-based approach. Some applications struggle with sentiment expressions in context, while others incorporate syntactic rules to handle emphasis markers and intensifiers. The main benefits, then, are the ease of refinement and the straightforward nature of the approach.

However, there are weaknesses in this kind of manual refinement, too. The inclusion of new contexts or new language developments depends on the active involvement and initiative of whoever maintains the resources, and, as seen in the KNIME workflow example, some approaches to extending the dictionaries can generate errors. Furthermore, while well-developed solutions such as VADER can incorporate some level of pattern sensitivity based on syntactic rules, incorporating pattern sensitivity into our own contextual dictionaries brings a lot of complexity into play. So, how can these weaknesses be addressed through an ML approach? First, we need to talk a bit about how ML works in text analysis.
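To make the trade-off concrete, here is a minimal lexicon-based scorer sketched in Python. The word scores and intensifier weights below are invented for illustration; they are not taken from VADER or any real lexicon:

```python
# A toy lexicon-based sentiment scorer. The dictionaries are illustrative
# assumptions, far smaller than any real sentiment lexicon.
SENTIMENT_LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}

def score_text(text: str) -> float:
    """Sum word scores, boosting a word when an intensifier precedes it."""
    tokens = text.lower().split()
    total = 0.0
    for i, token in enumerate(tokens):
        if token in SENTIMENT_LEXICON:
            weight = SENTIMENT_LEXICON[token]
            # A simple syntactic rule: check the immediately preceding token.
            if i > 0 and tokens[i - 1] in INTENSIFIERS:
                weight *= INTENSIFIERS[tokens[i - 1]]
            total += weight
    return total

print(score_text("the food was very good"))  # 1.5
print(score_text("an awful experience"))     # -2.0
```

Even this single rule hints at how quickly complexity accumulates: handling negation ("not good"), clause boundaries, or sarcasm each requires yet more hand-written rules, which is exactly the maintenance burden described above.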

ML is often divided into supervised and unsupervised learning, which refers to whether or not the model is trained on labelled data. Supervised ML is used to predict the classification of new data based on previous categorization: we first train the model on a labelled dataset and then use it to predict labels for a new set of items, based on the same features used in training. Unsupervised ML, on the other hand, is used to discern patterns in large amounts of data without a training dataset. Supervised ML makes the most sense for sentiment analysis, but we will return to unsupervised ML in a later lesson.

When we talk about applying ML for text analysis, we usually mean that we are going to train and apply a predictive model. A well-known secret is that ML is essentially applied statistics and probability calculation, so while it might be more fun to pretend that we are teaching a newborn artificial intelligence how to read, and risking the future of humanity through its imminent acquisition of consciousness, we are actually just homing a mathematical algorithm in on our purposes by assigning values to different language items, such as words or phrases.

The most common way to train a predictive model is with a training dataset. In the case of sentiment analysis, this means that we first need a correctly analyzed dataset of items, i.e., documents or texts, together with their sentiment scores. The model is then built by learning to match the content of the items with the scores they have been tagged with. We accomplish this by breaking our items down into features that the model can understand, for instance words or N-grams, and then making connections between the provided score and the features of the corresponding item. However, unlike lexicon-based solutions, ML is focused on pattern recognition and will look at these features in the context of each other. This brings word order, sentence structure, and other syntactic clues into play, making the ML approach more holistic in its scoring.
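As a sketch of what such training looks like in code, here is a minimal Naive Bayes classifier written in plain Python, using single words as features. The tiny labelled dataset is invented for illustration; a real training set would contain thousands of items:

```python
import math
from collections import Counter, defaultdict

# Invented labelled examples standing in for a real training dataset.
TRAIN = [
    ("what a great flight loved it", "positive"),
    ("great service and friendly crew", "positive"),
    ("terrible delay awful experience", "negative"),
    ("lost my luggage terrible airline", "negative"),
]

def train(data):
    """Count word frequencies per label; these counts ARE the model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in data:
        label_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab

def predict(text, word_counts, label_counts, vocab):
    """Pick the label with the highest log-probability, with add-one smoothing."""
    best_label, best_score = None, float("-inf")
    total_docs = sum(label_counts.values())
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        label_total = sum(word_counts[label].values())
        for word in text.split():
            # Laplace smoothing so unseen words do not zero out the score.
            p = (word_counts[label][word] + 1) / (label_total + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(predict("a great trip", *model))       # positive
print(predict("awful delay again", *model))  # negative
```

Note that this simple bag-of-words model deliberately ignores word order, so it illustrates the training loop rather than the full pattern recognition described above; models such as BERT, discussed below, do account for word order and sentence structure.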

When we introduce a new dataset to our model, it should now be able to predict sentiment scores with some accuracy. At a basic level, that accuracy will depend on how similar the two datasets are (a model trained on Twitter data will have issues predicting sentiment in newspapers, for instance). This is also where we start reaping the benefits of the ML approach, as we can continually refine our model by indicating which predictions are correct and which are not. This further increases the accuracy and widens the contextual scope in a very intuitive way. For this reason, maintained ML sentiment analysis models remain relevant to current language trends, as they can continually be taught new things.
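The refinement loop can be sketched schematically. The predictor below is a deliberately trivial stand-in (a keyword check, not a trained model); the point is the feedback step, where human-corrected predictions flow back into the training pool for the next round of training:

```python
# Schematic refinement loop. `simple_predict` is a stand-in for a real
# trained model; the labelled pool and word list are invented examples.
training_pool = [
    ("great flight", "positive"),
    ("awful delay", "negative"),
]

POSITIVE_WORDS = {"great", "good"}  # stand-in for a learned model

def simple_predict(text):
    return "positive" if any(w in POSITIVE_WORDS for w in text.split()) else "negative"

new_items = ["good crew", "terrible food"]
for text in new_items:
    guess = simple_predict(text)
    corrected = guess  # in practice, a human reviewer confirms or fixes this
    training_pool.append((text, corrected))  # the pool grows for retraining

print(len(training_pool))  # 4
```

Each pass through this loop enlarges and corrects the training data, which is how a maintained model keeps up with new contexts and language trends.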

There are different types of predictive algorithms that are used for ML sentiment analysis, and if you are interested in learning more about them, a summary is available here. In this lesson, we will now move on to discussing an ML sentiment analysis model available through KNIME called BERT.

BERT stands for Bidirectional Encoder Representations from Transformers and is a language model developed at Google AI. The key feature of BERT, compared to other ML language analysis models, is that it processes sentences bidirectionally rather than reading from left to right. As we just learned, one of the strengths of ML is that it takes context into account and can understand how language structures influence sentiment. By not reading in a fixed direction, BERT can take the words on both sides of a given word into account when scoring a sentence. Because of this, it outperforms other ML models both in accuracy and in how quickly that accuracy is built up. The drawback, however, is that BERT is comparatively slower than other models.
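BERT itself is far too large to reproduce here, but the difference between left-to-right and bidirectional context can be shown with a purely conceptual Python sketch (the example sentence and target word are arbitrary choices, and real BERT works on learned vector representations, not raw token lists):

```python
def left_to_right_context(tokens, i):
    """A unidirectional model sees only the words before position i."""
    return tokens[:i]

def bidirectional_context(tokens, i):
    """A bidirectional model like BERT sees the words on both sides of position i."""
    return tokens[:i] + tokens[i + 1:]

sentence = "the service was not great at all".split()
i = sentence.index("great")

print(left_to_right_context(sentence, i))  # ['the', 'service', 'was', 'not']
print(bidirectional_context(sentence, i))  # ['the', 'service', 'was', 'not', 'at', 'all']
```

Here the words after "great" complete the negation pattern "not ... at all"; a model reading only left to right would score "great" without ever seeing them.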

The BERT model available through KNIME can be seen used in a workflow here; in this example, it is trained on a Kaggle dataset of Tweets on the topic of aeroplane travel. On a contextual level, we can then expect it to be more accurate on Tweets than on other media, and most accurate on Tweets with air travel as the topic. However, many other datasets available on Kaggle could be used to train the BERT model for other contexts or topics. So, contextual awareness matters here just as it did with the lexicon-based approaches.