Analyzing YouTube comments with Natural Language Processing

Sometimes, weekend side projects start out of the blue, triggered by curiosity about events happening around us. Here’s the story of one such project, which made me dive back into the Natural Language Processing (NLP) techniques I learned in a Coursera class in 2012 (cf. the course archive on stanford.edu).

I’ve been watching the 2023 surf competition organized by the World Surf League (WSL), and the results of one of the events became very controversial: the event judges awarded the victory to American competitor Griffin Colapinto, whereas many viewers felt that the Brazilian Italo Ferreira had surfed better and deserved to win. Gabriel Medina (three-time WSL world champion) publicly expressed his concerns, and the WSL published a letter to the community in an effort to defend the judges’ choices. The departure of Erik Logan, CEO of the WSL, from his post two months later might not be unrelated to the controversy.

Wondering whether there was a consensus about the fairness, or lack thereof, of the championship’s results, I grabbed the comments published under the videos of this year’s WSL event finals, and used NLP techniques to extract insights from this content.

This project resulted in an infographic titled “Waves of indignation”, which reached more than 90,000 people after its publication on the social media site reddit.com.


The WSL Championship Tour

Each year, the WSL organizes the Championship Tour, where the best professional surfers compete in a series of events held all around the world. Each event consists of a set of elimination rounds opposing two or three surfers, in which the defeated are eliminated. Two surfers remain in the event’s final, which determines the event’s champion.

During the competition heats, a panel of five judges scores each surfed wave. In surfing, as in other sports, it is not uncommon for fans to disagree with the judges’ opinion, and there is occasional public outrage. Surfing has grown considerably in popularity in Brazil over the past decades, with many Brazilian athletes qualifying for the WSL Championship Tour and winning world titles. A recurring concern, expressed even by some of the competitors themselves, is the perception that the WSL is biased against Brazilian surfers and that they get underscored, supposedly because Brazil is a less lucrative market and WSL funding depends on sponsors.

This year, the finals of “Surf ranch” became very controversial. The Surf ranch competition takes place in a wave pool in central California rather than in the ocean, and the surfers showcase their skills on an artificial wave. While the judges gave better scores to Griffin Colapinto, many viewers complained that his opponent Italo Ferreira did better. Someone published a video showing both surfers in action side by side. In the following event, “Surf city”, held in El Salvador, Filipe Toledo, another Brazilian competitor, made it to the finals and won. His best wave was awarded a score of 9 (excellent) by the judges, which the public interpreted as an attempt to “compensate” for their unfairness at Surf ranch.

I decided to get the comments for the final heats of all the events of the Championship Tour this year, and have a look: do people always disagree with the judges, regardless of who competes and who wins? Or is there a consensus to disagree with the judging of just these two events?

The WSL publishes videos of its events on YouTube, where viewers can post comments. This is a perfect place for us to get data.

Data exploration

What story should we tell, and what should we focus on? At the moment, all we have is assumptions and intuitions about what the public reactions look like. We don’t know for sure, so we should explore the data to find actual insights.

The first step is to download the comments. They are retrieved using a small program called youtube-comment-downloader. You give it the YouTube video URL, and it creates a file with the comments data, including the comment text, the author’s username, and whether the comment is a reply.

After getting the comment files, data exploration can start. While I have sometimes used the Observable platform to explore data (here, for example), this time I’m just doing it in a terminal script in Node.js. I get text figures instead of visuals, but it’s faster and sufficient.
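As an illustration, here is a minimal sketch of what such a terminal script could start with. It assumes the downloaded file contains one JSON object per line, and the field names `text`, `author` and `reply` are assumptions made for this example, as is the sample data itself.

```javascript
// Hypothetical sample standing in for a downloaded comments file.
// Assumption: one JSON object per line, with `text`, `author` and `reply` fields.
const sampleFile = [
  '{"text": "What a final!", "author": "@user_a", "reply": false}',
  '{"text": "Agreed", "author": "@user_b", "reply": true}',
  '{"text": "kkkk", "author": "@user_c", "reply": false}',
].join("\n");

// Parse line-delimited JSON, skipping blank lines.
function parseComments(fileContent) {
  return fileContent
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Count top-level comments and replies separately.
function summarize(comments) {
  const replies = comments.filter((comment) => comment.reply).length;
  return {
    total: comments.length,
    topLevel: comments.length - replies,
    replies,
  };
}

const summary = summarize(parseComments(sampleFile));
console.log(summary);
```

In a real script, the sample string would be replaced by the content of each event’s comment file.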

Comments count

Let’s have a look at how many comments each event got.

Number of comments in each event's finals YouTube video

There is a huge difference between women’s and men’s events: the women’s most discussed final, in Supertubos, got fewer than 50 comments. With so few comments on the female surfers’ videos, it does not make much sense to run an analysis on them, so the rest of the processing focuses on the male surfers’ videos.

We also notice how heated the discussion was at the Surf ranch finals, with more than 1200 comments posted, more than 5 times the number of comments of any of the other finals this year.

Language detection

The first thing we’d like to know is the language used in each comment. A particularity of language detection is that the shorter the text, the more difficult it gets. YouTube comments tend to be short: often only a sentence, sometimes a bit longer, rarely a few paragraphs. The tone is very casual, and a lot of abbreviations are used.

The language detection library tinyld has impressive accuracy compared to other libraries. This is because it attempts to recognize each language’s unique character patterns and word usage (a technique borrowed from the machine learning domain), in addition to the classic Bayesian scoring algorithm. Read their algorithm description article for details.

Annotated language detection results for Surf ranch event finals

Some of the results look surprising: is surfing so popular in Poland? Is anyone really writing in Berber nowadays? By inspecting a subset of the comments, we confirm that these are mismatches: the acronym of the competition organizer, WSL, confuses the language detection algorithm into thinking that the comment is in Polish. Some abbreviations, like lmao in English or kkkk in Portuguese (coincidentally, both commonly used to express laughter in their respective language), get comments matched with Berber, and so on.

In order to increase the quality of the results, we limit tinyld to detecting only English, Portuguese and Spanish. All the other languages are ignored. We’ll only analyze English and Portuguese comments, the most frequent ones, but including Spanish in the detection turns out to be useful to avoid having Spanish comments mistakenly detected as Portuguese.
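To see why the list of candidate languages matters, here is a toy sketch. It is not tinyld’s algorithm: it simply counts how many words of a comment belong to a tiny, hand-made list of marker words per language. The `MARKERS` lists, the `detectLanguage` function and the sample comment are all made up for this illustration.

```javascript
// Hypothetical marker words per language; real detectors rely on much richer statistics.
const MARKERS = {
  en: new Set(["the", "is", "was", "and", "what", "this"]),
  pt: new Set(["o", "que", "de", "foi", "um", "roubo", "não"]),
  es: new Set(["el", "la", "que", "de", "fue", "un", "robo", "esto"]),
};

// Return the candidate language whose marker words match the most tokens,
// or an empty string when nothing matches.
function detectLanguage(text, candidates) {
  const tokens = text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
  let best = { lang: "", score: 0 };
  for (const lang of candidates) {
    const score = tokens.filter((token) => MARKERS[lang].has(token)).length;
    if (score > best.score) best = { lang, score };
  }
  return best.lang;
}

const comment = "que robo de la wsl";
const withoutSpanish = detectLanguage(comment, ["en", "pt"]);
const withSpanish = detectLanguage(comment, ["en", "pt", "es"]);
console.log(withoutSpanish, withSpanish);
```

Without Spanish among the candidates, the Spanish comment falls back on the closest remaining language, Portuguese. Once Spanish is included, the comment is attributed to it and can be set aside.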

Once the comments’ languages are known, we’d like to find the most common keywords for each language. We’ll start by running a few preprocessing algorithms that help get quality results.

Filtering, tokenizing and stemming

A few text processing steps are commonly performed in Natural Language Processing, to facilitate text analysis:

Stopwords filtering: very common words such as and or the don’t add any value to a text analysis. They are therefore filtered out of the text corpus before running the analysis, using a predefined stopword list for each language, plus a set of hand-picked words: competitors’ and places’ names, for example.
Tokenizing: tokenizing consists in detecting the individual words in the text. Why is it needed, aren’t the words obvious? Tokenizing is a bit more subtle than just splitting a text on space and punctuation characters. In English, for instance, contracted forms need to be handled properly: we might want to tokenize hasn't into has + n't. Then, n't needs to be interpreted as not, which is where the next step, stemming, comes into play.
Stemming: nouns and verbs exist in many variations: singular, plural, conjugated. Surf, surfs, surfing, surfed describe the same concept, and all these words have the same root (stem): surf. Stemming consists in identifying the stem of each word of the text to analyze.

These three data processing algorithms let us see the relevant data, and get more accurate statistics. We use the node.js library Natural for this, as it is pretty fully-featured and supports the languages we want to process.
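The three steps can be sketched in a few lines of plain JavaScript. This is a deliberately naive illustration, not the Natural library’s implementation: the stopword list is a tiny made-up sample, and the `stem` function just strips a few English suffixes instead of applying a real stemming algorithm. The sketch tokenizes first so that stopwords can be matched word by word.

```javascript
// Tiny hand-made stopword list (assumption); real lists contain many more words.
const STOPWORDS = new Set(["the", "and", "a", "was", "is", "he", "she", "of", "this"]);

// Lowercase, expand the contraction "n't" into "not", then split on non-letters.
// Limitation of this naive rule: "can't" becomes "ca" + "not".
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/n't\b/g, " not")
    .split(/[^\p{L}]+/u)
    .filter((token) => token.length > 1);
}

// Naive suffix stripping, keeping at least three characters in the stem.
function stem(word) {
  for (const suffix of ["ing", "ed", "s"]) {
    if (word.endsWith(suffix) && word.length - suffix.length >= 3) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// Full pipeline: tokenize, drop stopwords, stem what remains.
function preprocess(text) {
  return tokenize(text)
    .filter((token) => !STOPWORDS.has(token))
    .map(stem);
}

const stems = preprocess("He surfs, she surfed and the surfing wasn't good");
console.log(stems);
```

The three variations of surf collapse into the same stem, which is what makes the later counting steps meaningful.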

Finding the top keywords and phrases with n-grams

N-gram algorithms are useful to identify the top words or top phrases in a text. n-grams are sequences of consecutive words of a given length: bigrams are phrases of length two, trigrams of length three, 4-grams of length four, etc.

Rather than looking up the most frequently used words in the comments, we want to be smart about it and also detect phrases. If the words professional and surfing are frequently used together, then we want to see professional surfing as a top phrase, rather than professional and surfing as top keywords individually.

While most NLP libraries compute n-grams in a rigid way, asking the user whether they want to count top words, top bigrams, or n-grams of a given length, gramophone cleverly identifies phrases depending on how often they are used.
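For reference, the “rigid” fixed-length approach can be sketched as follows. This is not how gramophone works; it only counts n-grams of one given length across a few made-up tokenized comments.

```javascript
// Build all n-grams (as space-joined strings) from a list of tokens.
function ngrams(tokens, n) {
  const result = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    result.push(tokens.slice(i, i + n).join(" "));
  }
  return result;
}

// Count n-grams across several tokenized comments and return the most frequent ones.
function topNgrams(tokenizedComments, n, limit) {
  const counts = new Map();
  for (const tokens of tokenizedComments) {
    for (const gram of ngrams(tokens, n)) {
      counts.set(gram, (counts.get(gram) || 0) + 1);
    }
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, limit);
}

// Hypothetical tokenized comments.
const sampleComments = [
  ["professional", "surfing", "needs", "fair", "judging"],
  ["professional", "surfing", "at", "its", "best"],
  ["fair", "judging", "please"],
];

const topBigrams = topNgrams(sampleComments, 2, 2);
console.log(topBigrams);
```

With this approach, we would have to run the count once per length and then decide by hand whether professional surfing should replace professional and surfing in the ranking, which is the work an adaptive tool takes care of.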

For each event, we get the list of top phrases in English and Portuguese, using the stems previously determined.

Visualization of the top 5 n-gram stems for each event, with the number of occurrences of each. The stems have been colored in order to easily identify the common ones across events.

We can see that the topic of judging is quite present: it appears in the top 5 stems of five of the seven events we’re analyzing. Adjectives like good and great are also used a lot, although we cannot really deduce anything from them directly: they may refer to the weather conditions for surfing, or qualify the waves surfed, and might be used in negative form as well as positive. The stem condit corresponds to the noun conditions, and is used in many comments on some of the events, for which the weather was not optimal for a surf contest.

Side-note: while only top 5 stems are shown in this article for the sake of conciseness, the top 30 were actually analyzed in the project.

Do we have a story? Well, maybe, but it’s not really interesting. The same terms can be found in most videos, which makes sense, because they are all about the same topic: the finals round of a surf contest.

Fortunately, another text analysis algorithm is perfect to figure out what is unique in each of the videos.

Unveiling the topics specific to each event: tf-idf algorithm

tf-idf stands for “Term Frequency - Inverse Document Frequency”. This algorithm identifies the topics (keywords or n-grams) that stand out in one text in comparison to a set of texts: a term scores high when it is frequent in that text but rare in the others.
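Here is a minimal sketch of one common tf-idf variant, where the score of a term in a document is its relative frequency in that document multiplied by log(N / df), N being the number of documents and df the number of documents containing the term. Libraries use slightly different formulas, so this is not necessarily the exact computation used in this project, and the stems below are made-up sample data.

```javascript
// documents: array of token arrays, one per event.
// Returns, for each document, its terms sorted by decreasing tf-idf score.
function tfidf(documents) {
  const documentCount = documents.length;

  // Document frequency: in how many documents does each term appear?
  const documentFrequency = new Map();
  for (const doc of documents) {
    for (const term of new Set(doc)) {
      documentFrequency.set(term, (documentFrequency.get(term) || 0) + 1);
    }
  }

  return documents.map((doc) => {
    // Raw count of each term within this document.
    const counts = new Map();
    for (const term of doc) counts.set(term, (counts.get(term) || 0) + 1);

    // tf = relative frequency in the document, idf = log(N / df).
    const scores = [...counts.entries()].map(([term, count]) => [
      term,
      (count / doc.length) * Math.log(documentCount / documentFrequency.get(term)),
    ]);
    return scores.sort((a, b) => b[1] - a[1]);
  });
}

// Hypothetical stems for three events.
const eventStems = [
  ["judg", "win", "good", "wave"],
  ["judg", "win", "rob", "rob"],
  ["judg", "win", "overscor", "wave"],
];

const ranked = tfidf(eventStems);
console.log(ranked.map((terms) => terms[0][0]));
```

Stems present in every document, such as judg and win in this sample, get a score of zero, which is why the terms specific to each event rise to the top.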

When applying tf-idf to each event’s comments, what is specific to each one becomes clear. This is especially interesting in the last two events.

Top stems based on the tf-idf algorithm show more interesting figures

At the very controversial Surf ranch event finals, instead of win, the top 5 stems now include rob. Many comments indeed claim that Italo Ferreira was robbed of the victory he deserved.
In the following event finals in Surf city, many people state that while Filipe Toledo surfed better than his opponent and deserved to win, he was overscored and didn’t deserve a 9 for the best wave he took.

Sentiment analysis

Sentiment analysis consists in determining the feelings expressed in some text. In its most basic version, we assess the polarity of the comments: whether the feelings are negative, neutral or positive. Natural has such a sentiment analysis algorithm.

Results from such an analysis are to be taken with a grain of salt: the algorithm is smart enough to understand negations, but it is not able to detect sarcasm, so comments like “What a joke” end up being counted as positive instead of negative, the term joke being interpreted as positive.
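The behavior described above can be reproduced with a toy lexicon-based scorer. This is only a sketch of the general technique, not Natural’s implementation: the `LEXICON` scores and the negation handling are assumptions made for this example.

```javascript
// Hypothetical mini-lexicon: word -> polarity score.
const LEXICON = new Map([
  ["good", 2],
  ["great", 3],
  ["joke", 2],
  ["bad", -2],
  ["robbed", -3],
  ["unfair", -2],
]);
const NEGATIONS = new Set(["not", "no", "never"]);

// Sum the scores of the known words, flipping the sign of the first
// scored word that follows a negation.
function polarity(text) {
  const tokens = text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
  let score = 0;
  let negate = false;
  for (const token of tokens) {
    if (NEGATIONS.has(token)) {
      negate = true;
      continue;
    }
    const value = LEXICON.get(token);
    if (value !== undefined) {
      score += negate ? -value : value;
      negate = false;
    }
  }
  if (score > 0) return "positive";
  if (score < 0) return "negative";
  return "neutral";
}

console.log(polarity("Italo was robbed"));
console.log(polarity("That was not a good call"));
console.log(polarity("What a joke"));
```

The negation in the second comment is handled correctly, but the sarcastic third comment comes out as positive, because the scorer only sees the word joke and its lexicon score.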

Another interesting angle of sentiment analysis is to look at the emoji present in the comments. Emoji bring more nuance than a simple positive / negative assessment: they can show joy, laughter, loathing, fear, anger, and many more emotions. Interpreting them is still a challenge, though: a laughing emoji can express fun or mockery, depending on the context.

Still, comparing the emoji used in each event, we definitely get the picture as to how the public feels:

Most frequent emoji found in YouTube comments for the Margaret River finals

Most frequent emoji found in YouTube comments for the Surf ranch finals
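Counting the most frequent emoji of an event can be sketched with a Unicode property escape in a regular expression. This is a simplified approach with made-up sample comments: it counts individual pictographic characters, so composed emoji (skin tone variants, sequences joined into a single glyph) are not handled as one unit.

```javascript
// Count pictographic characters across comments, most frequent first.
// \p{Extended_Pictographic} is used rather than \p{Emoji}, because the latter
// also matches plain digits and the "#" and "*" characters.
function countEmoji(comments) {
  const counts = new Map();
  for (const comment of comments) {
    for (const emoji of comment.match(/\p{Extended_Pictographic}/gu) || []) {
      counts.set(emoji, (counts.get(emoji) || 0) + 1);
    }
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Hypothetical comments.
const emojiCounts = countEmoji([
  "🤣🤣 what a call",
  "🤡 judges",
  "🔥🔥",
  "score of 9? 🤣",
]);
console.log(emojiCounts);
```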

Data visualization design

While this case study’s focus is Natural Language Processing, let’s still have a brief look at the design of the infographic.

The purpose of the analysis is to satisfy some curiosity and confirm assumptions; we don’t aim to make decisions based on its outcome. For this reason, we can make some choices favoring aesthetics over accuracy.

A ridgeline chart can look great, and its shape of waves is a nod to our topic of surfing.

Because I had already spent a considerable amount of time working on this weekend side project, the sentiment analysis visuals were simplified:

show the sentiment statistics using a simple bar chart,
skip the design of a visual for the emoji statistics: done is better than perfect.

Similarly, again to save time, the comments in Portuguese were left out of the infographic. The same trends can be observed in these comments. They could have been displayed side by side, or even merged, based on the translated terms.

Results

I published the infographic in three subreddits (thematic channels on the Reddit social media site), focused either on data visualizations or on surfing. Ten days after being published, the infographic had the following reach and engagement results (source: Reddit insights):

channel: `r/dataisbeautiful`

85.7k Total views
Score: 59 (77% upvote rate)
25 shares
17 comments

channel: `r/surfing`

7.1k Total views
Score: 35 (97% upvote rate)
4 shares
4 comments

channel: `r/surf`

1.3k Total views
Score: 15 (100% upvote rate)
0 shares
0 comments

Closing thoughts: NLP in the age of AI

Natural Language Processing is valuable knowledge to have in our toolbox, as it is complementary to other data analysis methods.

While there is a lot of hype around Artificial Intelligence and Large Language Models nowadays, NLP and other traditional methods are definitely still relevant, as they provide quality results for a fraction of the cost: they don’t require training and loading heavy models, they don’t need much processing power or memory to run, and the way the tools work is transparent. On a related note, a research paper was just published, showing that the ‘old school’ text compression technology gzip can outperform Deep Neural Networks for text classification.

Alef specializes in building web-based data analysis and data visualization solutions.
