This is part 3 of a three-part blog post. It uses the data that was scraped in part 1 and prepared in part 2.
Now that we have the data in a nice format, let’s make a frequency plot! First let’s load the data and the packages:
library("tidyverse")
library("ggthemes") # To use different themes and colors

renert_tokenized = readRDS("renert_tokenized.rds")
Using the ggplot2 package, I can produce a plot of the most frequent words.
renert_tokenized %>%
  count(word, sort = TRUE) %>%
  filter(n > 50) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  theme_minimal()
So, the most frequent word is kinnek, meaning King! kinnek is mentioned more times than renert, the name of the hero. Next are här and wollef meaning mister and wolf. In fifth position we have fuuss, for fox. I’ll let you use Google Translate for the other words 😄.
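To make the counting step behind the plot more concrete, here is a minimal sketch on a tiny, made-up data frame. `toy_tokens` and its contents are hypothetical and only stand in for `renert_tokenized`; the threshold is lowered so the toy data survives the filter.

```r
library("dplyr")

# Hypothetical toy data standing in for renert_tokenized
toy_tokens = tibble(
  word = c("kinnek", "renert", "kinnek", "fuuss", "kinnek", "renert")
)

toy_counts = toy_tokens %>%
  count(word, sort = TRUE) %>%      # one row per word, most frequent first
  filter(n > 1) %>%                 # drop rare words (the post uses n > 50)
  mutate(word = reorder(word, n))   # factor whose levels are ordered by count

toy_counts
levels(toy_counts$word)
```

The `reorder()` call is what makes the bars appear sorted by frequency: ggplot2 draws a discrete axis in the order of the factor levels, not in the order of the rows.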
Next, I do a sentiment analysis using the AFINN word list, in which each word carries a score indicating its sentiment. You can download the original list from here.
Because such a list is not available in Luxembourgish, I have translated it using Google’s Translate API. Here is the code to do that:
library("tidyverse")
library("translate") # google translate api
library("tidytext") # to load the AFINN dictionary

api_key = "api_key_goes_here"
set.key(api_key)

afinn = get_sentiments("afinn")

# I wrap the `translate()` function around `purrr::possibly()` so that in case of an
# error, I get the translations that worked back.
possibly_translate = purrr::possibly(translate::translate, otherwise = "error")

afinn_lux = afinn %>%
  mutate(lux = map(word, possibly_translate, source = "en", target = "lb")) %>%
  mutate(lux = unlist(lux))

write_csv(afinn_lux, "afinn_lux.csv")
For the above code to work, you need to have a Google Cloud account, which you can create for free.
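If you don't have an API key at hand, the effect of `purrr::possibly()` can still be seen with a stand-in function. The sketch below assumes a hypothetical `fake_translate()` that fails on empty input; it is not the real `translate::translate()` and makes no network calls.

```r
library("purrr")

# Hypothetical stand-in for translate::translate(): fails on empty input
fake_translate = function(word) {
  if (word == "") stop("nothing to translate")
  toupper(word)
}

# Same wrapping as in the post: errors are replaced by the string "error"
possibly_fake_translate = possibly(fake_translate, otherwise = "error")

words = c("good", "", "bad")
translated = map_chr(words, possibly_fake_translate)
translated
```

The failing element comes back as `"error"` instead of aborting the whole `map_chr()` call, so the translations that did work are kept and the failed rows can be filtered out or retried later.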
I did not check the quality of the translations, and I’m sure they are far from perfect. The translated dictionary is also available on the GitHub repository here. Again, contributions are more than welcome!
Now, I need to merge the dictionary with the data from each song. First, let’s load the dictionary:
afinn_lux = read.csv("afinn_lux.csv")

# I only keep the `lux` column (and rename it to `word`) and the `score` column
afinn_lux = afinn_lux %>%
  select(word = lux, score)
What does this dictionary look like? Let’s see:
##             word score
## 1       opzeginn    -2
## 2     verloossen    -2
## 3       opzeginn    -2
## 4 entfouert ginn    -2
## 5       entlooss    -2
## 6      entfouert    -2
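The first rows already show the same translated word more than once (opzeginn). The post keeps these duplicates as they are. If one wanted a single score per word instead, one possible approach, which is my assumption and not part of the original analysis, is to average the scores per word. A minimal sketch on a hypothetical toy dictionary:

```r
library("dplyr")

# Hypothetical toy dictionary with duplicated words, mimicking afinn_lux
toy_dict = tibble(
  word  = c("opzeginn", "verloossen", "opzeginn", "räich", "räich"),
  score = c(-2, -2, -2, 2, 3)
)

# Collapse duplicates: one row per word, with the mean of its scores
toy_dict_unique = toy_dict %>%
  group_by(word) %>%
  summarise(score = mean(score))

toy_dict_unique
```

Averaging yields non-integer scores, so whether it is preferable to keeping the duplicates depends on what the scores are used for afterwards.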
Let’s load the tokenized songs, and merge them with the dictionary:
renert_songs_tokenized = readRDS("renert_songs_tokenized.rds")

# Join each song's tokens with the dictionary on the shared `word` column
renert_songs_sentiment = map(renert_songs_tokenized, ~full_join(., afinn_lux, by = "word"))
I can now merge the data in a single data frame and do some further cleaning:
renert_songs_sentiment = renert_songs_sentiment %>%
  bind_rows() %>%
  filter(!is.na(score)) %>%
  filter(!is.na(gesank))
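A side note on the join: a full join followed by dropping the rows where `score` or `gesank` is missing keeps only the words found in both tables, which is what an inner join does. The sketch below illustrates this on hypothetical toy data (`toy_song` and `toy_dict` are made up), assuming neither input has missing values in those columns to begin with.

```r
library("dplyr")

# Hypothetical toy inputs: one tokenized song and a small dictionary
toy_song = tibble(word = c("léiw", "kinnek", "fest"), gesank = "éischte")
toy_dict = tibble(word = c("léiw", "fest", "räich"), score = c(-3L, 2L, 2L))

# The approach from the post: full join, then drop unmatched rows
via_full_join = full_join(toy_song, toy_dict, by = "word") %>%
  filter(!is.na(score)) %>%
  filter(!is.na(gesank))

# The same result in one step
via_inner_join = inner_join(toy_song, toy_dict, by = "word")

via_full_join
via_inner_join
```

Both versions keep only léiw and fest: kinnek has no score, and räich does not occur in the song.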
What does the final data look like? Here it is:
## # A tibble: 6 x 3
##   word  gesank  score
##   <chr> <chr>   <int>
## 1 rifft éischte    -2
## 2 léiw  éischte    -3
## 3 fest  éischte     2
## 4 fest  éischte     2
## 5 räich éischte     2
## 6 räich éischte     3
We see that some words appear several times, sometimes with different scores. That’s most probably because the translation of the dictionary was not very good: distinct English entries can end up with the same Luxembourgish translation. Oh well, let’s do a boxplot of the sentiment for each song:
order = c("éischte", "zwete", "drëtte", "véierte", "fënnefte", "sechste", "siwente",
          "aachte", "néngte", "zéngte", "elefte", "zwielefte", "dräizengte", "véierzengte")

renert_songs_sentiment %>%
  ggplot(aes(gesank, score)) +
  scale_x_discrete(limits = order) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
As we can see, there is no discernible pattern. This can mean two things: either the general sentiment within each song is fairly neutral, or the quality of the translation was too poor for the results to make any sense.
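A boxplot can also be complemented with a small numerical summary per song. The sketch below is not part of the original analysis; it uses a hypothetical `toy_sentiment` data frame with made-up scores, shaped like `renert_songs_sentiment`.

```r
library("dplyr")

# Hypothetical toy data shaped like renert_songs_sentiment
toy_sentiment = tibble(
  gesank = c("éischte", "éischte", "éischte", "zwete", "zwete"),
  score  = c(-2, -3, 2, 2, 3)
)

# Number of scored words, mean and median score for each song
toy_summary = toy_sentiment %>%
  group_by(gesank) %>%
  summarise(
    n_words      = n(),
    mean_score   = mean(score),
    median_score = median(score)
  )

toy_summary
```

Looking at `n_words` next to the averages helps judge how much weight a song's summary deserves: a song with only a handful of matched words says little about its overall sentiment.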
That’s it for this series of posts! I hope you enjoyed reading it as much as I enjoyed writing it and analyzing the data!