Text mining my Thesis

A simple text mining exercise with data from my Ph.D. thesis, using R’s beautiful tidytext package and examples from its companion book, Text Mining with R. I have never mined text before, so I’ll stick to the examples from the book, but using my own data.

I’ll also use dplyr for handling tibbles (a sort of data.frame) and ggplot2 to plot the results. This report is generated with reprex, since I have not yet gotten around to using knitr.
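In case you have never used it, rendering a script with reprex is roughly a one-liner; something along these lines (the file name here is just a placeholder):

# Hypothetical example: render an R script as a GitHub-flavoured reprex
# ("thesis_textmining.R" is a placeholder file name)
reprex::reprex(input = "thesis_textmining.R", venue = "gh")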


Ready? Let’s start by loading the packages we will need:

library(readr)
library(tidytext)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(furrr)
#> Loading required package: future
library(ggplot2)

We then collect the data. We will use the .tex files from my Ph.D. thesis, which has 6 chapters that we ingest separately (but in parallel, because we are cool and we live in a bright future). Notice that format = "latex" is needed so that the tokenizer handles the LaTeX markup properly:

input_files = paste0("https://raw.githubusercontent.com/adrfantini/PhD-thesis/master/chapters/", c(
    "1_intro.tex",
    "2_obs.tex",
    "3_itaobs.tex",
    "4_methods.tex",
    "5_validation.tex",
    "6_conclusions.tex"
))

plan(multiprocess) # Yay multiprocess!

a = future_map_dfr(input_files, function(x) {
    tibble(text = read_lines(x)) %>%
        unnest_tokens(word, text, format = "latex") %>%
        mutate(word = tolower(word)) %>%
        anti_join(stop_words) %>%
        mutate(chapter=basename(x))
})
#> Joining, by = "word"
#> Joining, by = "word"
#> Joining, by = "word"
#> Joining, by = "word"
#> Joining, by = "word"
#> Joining, by = "word"

We now have a nice tibble with all the words, so we can take a look at which ones are most common, like so:

a %>% count(word, sort = TRUE)
#> # A tibble: 2,413 x 2
#>    word              n
#>    <chr>         <int>
#>  1 precipitation   251
#>  2 flood           196
#>  3 data            176
#>  4 model           134
#>  5 fig             121
#>  6 discharge       109
#>  7 italy           106
#>  8 station         105
#>  9 datasets         98
#> 10 chym             97
#> # … with 2,403 more rows

Precipitation easily comes out on top. No surprises there.

Let’s now get the 10 most common words in each chapter, and plot them:

a_n = a %>%
    group_by(chapter) %>%
    count(word, sort = TRUE)

a_short = a_n %>%
    top_n(10, n) %>%
    ungroup

a_short %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(x = word, y = n, fill = chapter)) +
        geom_col(show.legend = FALSE) +
        facet_wrap(~chapter, scales = 'free') +
        coord_flip()

Uh-oh! What’s going on? The ordering of some words on the y axis is wrong, even though we did reorder them by n with mutate(word = reorder(word, n)). The problem is that reorder() assigns a single, global ordering to each word (based on its counts across all chapters), while the same word can rank differently in different chapters, so no single ordering can be correct in every facet. We can fix this with a small trick: creating a new variable (column) lab that identifies each word-chapter combination uniquely. We build the labels with the dummy separator +, since we know that neither our filenames nor our words contain that character, and later we only display the part of the label after the +:

a_short = a_short %>%
    mutate(lab = paste(chapter, word, sep = '+'))

a_short %>%
    mutate(lab = reorder(lab, n)) %>%
    ggplot(aes(x = lab, y = n, fill = chapter)) +
        geom_col(show.legend = FALSE) +
        facet_wrap(~chapter, scales = 'free') +
        scale_x_discrete(labels = function(b) {strsplit(b, split = '+', fixed = TRUE) %>% sapply(function(x) x[2])}) +
        coord_flip()

Much better!

Now, let’s look at how well Zipf’s law holds for this small corpus of files. Zipf’s law states that a term’s frequency is roughly inversely proportional to its rank, so on log-log axes we expect something close to a straight line. We can check by following the example in chapter 3.2 of the text mining book:

a_n_rank = a_n %>%
    group_by(chapter) %>%
    mutate(rank = row_number(),
           `term frequency` = n / sum(n))

# Zipf's law!
a_n_rank %>%
    ggplot(aes(rank, `term frequency`, color = chapter)) +
        geom_line(size = 1.1, alpha = 0.8) +
        scale_x_log10() +
        scale_y_log10()

Not exactly linear, right? Well, that is to be expected with such a small number of words and in a technical publication.
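If you want an actual number for the exponent, you can fit a straight line on the log-log scale, much like the book does. A minimal sketch (the rank window to keep is an arbitrary choice of mine):

# Sketch: estimate the Zipf exponent with a linear fit on the log-log scale
# (the rank window 10-500 is an arbitrary choice to avoid both extremes)
zipf_fit = a_n_rank %>%
    filter(rank > 10, rank < 500) %>%
    lm(log10(`term frequency`) ~ log10(rank), data = .)
coef(zipf_fit) # a slope close to -1 would be textbook Zipf behaviour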

The book also shows how to compute the tf-idf metric. This is the product of a term’s frequency within a chapter (tf) and its inverse document frequency (idf), which down-weights terms that appear in many chapters. In practice, it highlights the terms that are characteristic of a particular chapter compared to the rest of the collection.
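Concretely, tf is the share of a chapter’s words taken up by the term, and idf is the natural log of the number of chapters divided by the number of chapters containing the term. Just to make the definition tangible, a manual version would look something like this (a sketch only; bind_tf_idf below does this for us):

# Sketch: tf-idf by hand, mirroring what bind_tf_idf computes below
n_chapters = n_distinct(a$chapter)
tf_idf_manual = a %>%
    count(chapter, word) %>%
    group_by(chapter) %>%
    mutate(tf = n / sum(n)) %>%                              # term frequency within the chapter
    group_by(word) %>%
    mutate(idf = log(n_chapters / n_distinct(chapter))) %>%  # inverse document frequency
    ungroup() %>%
    mutate(tf_idf = tf * idf)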

This is quite easy following the examples; we’ll take the top 12 words for each chapter this time:

tf_idf = a %>%
    group_by(chapter) %>%
    count(word, sort = TRUE) %>%
    bind_tf_idf(word, chapter, n) %>%
    arrange(desc(tf_idf))

tf_idf %>%
    group_by(chapter) %>%
    top_n(12, tf_idf) %>%
    slice(1:12) %>%
    ungroup() %>%
    mutate(word = reorder(word, tf_idf)) %>%
    ggplot(aes(word, tf_idf, fill = chapter)) +
        geom_col(show.legend = FALSE) +
        facet_wrap(~chapter, scales = "free") +
        ylab("tf-idf") +
        coord_flip()

Oh no, the ordering issue strikes again! But… since the plot comes out almost correct and I’m too lazy to fix it right now, this will have to do, sorry. (If this were a textbook I would say the exercise is left for the reader.)
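For the curious, the same chapter+word label trick from above should work here as well; an untested sketch:

# Sketch: the '+'-separated label trick applied to the tf-idf plot
tf_idf %>%
    group_by(chapter) %>%
    top_n(12, tf_idf) %>%
    slice(1:12) %>%
    ungroup() %>%
    mutate(lab = paste(chapter, word, sep = '+')) %>%
    mutate(lab = reorder(lab, tf_idf)) %>%
    ggplot(aes(lab, tf_idf, fill = chapter)) +
        geom_col(show.legend = FALSE) +
        facet_wrap(~chapter, scales = "free") +
        scale_x_discrete(labels = function(b) sapply(strsplit(b, split = '+', fixed = TRUE), function(x) x[2])) +
        ylab("tf-idf") +
        coord_flip()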

What this plot shows is that in chapter 1 I really talk a lot about risk, a term that I never use in the other chapters. This is because my thesis deals with hazard, which is a different concept from risk, and in chapter 1 I make this distinction very clear. Interpolation is very prominent in chapter 3, since that is where I describe the gridding procedures behind the observational precipitation dataset that I use. All in all, most of what is shown is what I would expect, knowing what is in every chapter. How boring!

Speaking of boring, how positive is each chapter? We can assign a positivity score to each word (yes, there are sentiment lexicons for exactly this, and yes, Trump’s tweets have been found to be quite negative) and average the scores:

a_sent = a_n %>%
    inner_join(get_sentiments("afinn"), by = "word") %>%
    group_by(chapter) %>%
    summarize(score = sum(score * n) / sum(n))

a_sent %>%
    ggplot(aes(chapter, score, fill = score > 0)) +
        geom_col(show.legend = FALSE) +
        coord_flip() +
        ylab("Average sentiment score")

It turns out that the first three chapters are a bit negative, with the first (the introduction) being the worst. But I recover and start thinking in positive terms (or rather, selling my work with fancy words) in the second half of the thesis, peaking with some very positively written conclusions.


Well, this wraps up this rather short and quite uneventful diversion into text mining. I hope you enjoyed it!