An NLP Analysis of the Mueller Testimony
In Robert Mueller’s July 24th testimony before the House Judiciary Committee, he averaged a whopping 7.7 words per response to the questions asked of him. This wasn’t exactly shocking, given he had warned Congress that he would not depart from the contents already included in his report. Beneath all those monosyllabic responses and “I’d have to pass on that” comments, there are some surprising insights that can be gleaned from fairly basic data analysis and NLP techniques. We can confirm Mueller’s unbiased treatment of Democrats and Republicans, look at where he does treat them differently, and even detect party membership from an interviewer’s questions.
Preprocessing
A complete transcript of the Mueller testimony can be found on the Washington Post’s website here.
I manually copied and pasted the contents into a .txt file, skipping the first few interactions that represented opening statements. From there, I could read the file into Python as a single string. Each line break needed to be replaced with a space unless it was followed by an all-caps word with a colon after it, since that pattern marks a new speaker rather than just a new paragraph. The following regular expression does the trick to identify all such locations:
\b[A-Z \-\.]+\b:
The space, dash, and period were included in this expression to account for interviewers with names like “M. Johnson”, “Jackson Lee” and “Mucarsel-Powell.”
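Here is a minimal sketch of how that splitting step can look in Python; the file name and variable names are placeholders rather than the exact ones in my script:

```python
import re

# Pattern identifying a speaker tag: an all-caps name (possibly containing
# spaces, dashes, or periods) followed by a colon.
speaker_tag = r"\b[A-Z \-\.]+\b:"

# Placeholder file name for the pasted transcript.
with open("mueller_testimony.txt", encoding="utf-8") as f:
    raw_text = f.read()

# Replace line breaks with spaces unless the next line starts with a speaker tag.
flattened = re.sub(r"\n(?!" + speaker_tag + ")", " ", raw_text)

# Split into alternating [tag, comment, tag, comment, ...] chunks and pair them up.
chunks = re.split("(" + speaker_tag + ")", flattened)
turns = [
    (tag.strip(" :"), comment.strip())
    for tag, comment in zip(chunks[1::2], chunks[2::2])
]
```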
There were a few other weird things in the transcript I needed to consider:
- Nadler questioned Mueller but also moderated the testimony. Comments after his initial back-and-forth with Mueller needed to be flagged as separate but not dropped, since they are important for analyzing who ran out of time versus who yielded their remaining time.
- There were a lot of typos. “Mueller” was spelled “Muelller” and “Garcia” was spelled “Garica” on multiple occasions. One name was followed by “.” instead of “:”. All of these typos can cause issues when splitting the data and using name as a unique ID.
- In a couple of cases the same person spoke twice in a row, with their name repeated at the start of each paragraph. I had to identify consecutive rows in my dataset that had the same name and merge the contents of the two comments (see the sketch after this list).
- The transcript didn’t identify political party, a key label I wanted to base my analysis around. I initially tried making an alternating list of “Dem” and “Rep” labels. However, after noticing that one Democrat yielded his time to another and that the alternating questioning stopped entirely towards the end, I explored scraping the Wikipedia table of current representatives. This was going to add all sorts of issues since the name formats weren’t consistent with the Washington Post transcript. Given that there were only 41 names to go through, I opted to create a party table manually.
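Here is a small pandas sketch of that consecutive-speaker merge; the toy rows and the column names (name, comment) are purely illustrative:

```python
import pandas as pd

# Toy rows for illustration only (not actual transcript lines); note the
# back-to-back GOHMERT comments that need to be merged.
df = pd.DataFrame({
    "name":    ["NADLER", "MUELLER", "GOHMERT", "GOHMERT", "MUELLER"],
    "comment": ["Will the gentleman yield?", "Yes.", "Thank you.",
                "One more question.", "I refer you to the report."],
})

# A new block starts whenever the speaker differs from the previous row;
# consecutive rows with the same name share a block id and get concatenated.
block_id = (df["name"] != df["name"].shift()).cumsum()
merged = (
    df.groupby(block_id)
      .agg({"name": "first", "comment": " ".join})
      .reset_index(drop=True)
)
```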
Note that none of these preprocessing steps included traditional NLP preprocessing (removing stop words etc.). The most basic initial analysis I performed was to look at word count, something that would be wrecked by Gensim’s preprocessing function, so I decided to save that step for just before the topic modeling & bag of words analyses.
The Dataset & Feature Engineering
After this initial preprocessing, I had a dataset that contained the name of each speaker, his or her comment, Mueller’s response and whether the speaker was serving as a moderator (i.e. any Nadler row where the row number was greater than the first non-Nadler row in the dataset).
With the dataset configured this way, I could easily make the following additional features:
- comment_order (all non-moderator rows numbered 1 to n)
- speaker_order (a unique ID per speaker, assigned in the order in which each speaker questioned Mueller)
- statement_order (a per-name order of an individual’s comments)
- numWords (the number of words in a given comment)
- numWords_m (the number of words in Mueller’s response)
- party (Republican or Democrat label)
Using pandas’ groupby functionality, I was able to make a similar dataset with one row per speaker, with all of his/her comments concatenated and all of Mueller’s responses concatenated. comment_order and statement_order were not applicable to this aggregated dataset, but all the other features were (numWords was now at the speaker level instead of the statement level).
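A hedged sketch of that aggregation, assuming the comment-level DataFrame df uses the feature names above plus a couple of guessed column names (response_m for Mueller’s response, is_moderator for the moderator flag):

```python
# One row per speaker: concatenate each interviewer's comments and Mueller's
# responses, and sum the word counts. Column names are placeholders.
speaker_df = (
    df[~df["is_moderator"]]
      .groupby(["name", "speaker_order", "party"], as_index=False)
      .agg(
          comments=("comment", " ".join),
          responses=("response_m", " ".join),
          numWords=("numWords", "sum"),
          numWords_m=("numWords_m", "sum"),
      )
)
```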
I also used the vaderSentiment package to calculate a positive, neutral and negative score for each comment and Mueller’s corresponding response. Finally, the Indicoio API provided emotion scores for each comment, reporting values for anger, fear, joy, sadness and surprise.
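Here is roughly what the scoring step can look like. The VADER call is the package’s documented API; the Indico call is left commented out because it requires an API key, and the exact client function is an assumption based on the old indicoio package:

```python
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_scores(text):
    # polarity_scores returns 'neg', 'neu', 'pos' (and a 'compound' score).
    scores = analyzer.polarity_scores(text)
    return pd.Series({k: scores[k] for k in ("pos", "neu", "neg")})

# Score each interviewer comment and Mueller's corresponding response.
df[["pos", "neu", "neg"]] = df["comment"].apply(vader_scores)
df[["pos_m", "neu_m", "neg_m"]] = df["response_m"].apply(vader_scores)

# Emotion scores (anger, fear, joy, sadness, surprise) came from the Indico API.
# The call below follows the old indicoio client and is an assumption:
# import indicoio
# indicoio.config.api_key = "YOUR_KEY"
# df["emotion"] = df["comment"].apply(indicoio.emotion)
```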
Findings
MUELLER: It is unusual for a prosecutor to testify about a criminal investigation. And given my role as a prosecutor, there are reasons why my testimony will necessarily be limited.
The correlation between the length of a congressman’s question and the length of Mueller’s response was .002. It did not matter how long-winded a congressman was; the odds of getting a non-monosyllabic response from Mueller didn’t change. At the aggregated speaker level this correlation was higher (.59), but this can be attributed to the question-and-answer format. If a congressman spoke more total words, he likely asked more questions, so he got more responses from Mueller. The difference between the parties for these correlations was insignificant.
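For concreteness, both correlations fall out directly from the two DataFrames built earlier (column names again being placeholders):

```python
# Comment level: interviewer words vs. Mueller's words in the direct response.
comment_corr = df["numWords"].corr(df["numWords_m"])

# Speaker level: each interviewer's total words vs. Mueller's total words back.
speaker_corr = speaker_df["numWords"].corr(speaker_df["numWords_m"])
```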
H. JOHNSON: You’ve stuck closely to your report and you have declined to answer many of our questions on both sides.
I also examined how the sentiment behind a question affected the length of Mueller’s response, including whether the speaker’s party affiliation changed how Mueller received the emotion. Both at the speaker and comment level, there was no significant relationship between positive, neutral, negative, fearful, joyful, angry, sad, or surprised sentiment and how many words Mueller chose to respond with, with one notable exception. At the comment level, the sadness score had a significant effect on Mueller’s comment length; however, this effect depended on party affiliation. If a Democrat were to change her statement from a sadness score of 0 to a sadness score of 1 (no sadness to all words marked as sad), Mueller would say about 8 fewer words on average. However, if that same interviewer were a Republican, Mueller would actually say about 3 words more.
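A sketch of how that interaction can be estimated with statsmodels, assuming the comment-level frame has sadness and party columns (placeholder names); the sadness-by-party interaction term is what captures the Democrat/Republican difference:

```python
import statsmodels.formula.api as smf

# Mueller's response length regressed on the interviewer's sadness score,
# their party, and the interaction of the two.
model = smf.ols("numWords_m ~ sadness * party", data=df).fit()
print(model.summary())
```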
NADLER: No. No. No, we’re running short on time.
Mueller’s short responses didn’t stop the Republicans from having significantly wordier interactions with him. Their median total words per speaker was 748 compared to the Democrats’ 530.5. The difference also showed up when examining the proportion of each party’s members who yielded their time as opposed to being cut off by Nadler. For three Democrats it was unclear whether they yielded their time because the transcript simply recorded “crosstalk.” Dropping these individuals from the sample, 85.7% of Democrats yielded their time compared to 23.5% of Republicans, a difference confirmed as statistically significant by a permutation test in which no permuted sample produced a more extreme difference in proportions.
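A minimal sketch of such a permutation test, assuming a speaker-level yielded indicator with the three unclear “crosstalk” Democrats stored as missing values (column names are placeholders):

```python
import numpy as np

# Drop the speakers whose outcome was unclear ("crosstalk").
sample = speaker_df.dropna(subset=["yielded"])
yielded = sample["yielded"].to_numpy(dtype=float)
party = sample["party"].to_numpy()

observed = yielded[party == "Dem"].mean() - yielded[party == "Rep"].mean()

rng = np.random.default_rng(0)
n_more_extreme = 0
for _ in range(10_000):
    shuffled = rng.permutation(party)           # break the party/yield link
    diff = yielded[shuffled == "Dem"].mean() - yielded[shuffled == "Rep"].mean()
    if abs(diff) >= abs(observed):
        n_more_extreme += 1

p_value = n_more_extreme / 10_000
```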
The length of each interviewer’s interaction with Mueller did not stay fixed. Rather, the order in which the interviewer spoke and his or her party both had a significant influence. Democrats spoke about nine fewer words with each additional speaker. Republicans, however, spoke about two more words with each additional speaker.
The length of Mueller’s responses followed a similar trend to the one depicted in the scatterplot above; however, when the number of words Mueller used was regressed on the number of words the interviewer used, neither order nor party was significant. This suggests Mueller simply got wordier with Republicans over time because those Republicans were asking him more questions. Similarly, as Democrats got less wordy, he did not need to respond as much and used fewer words.
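Both regressions described above follow the same pattern as the earlier interaction model; the formulas and column names are placeholders, not the exact original specification:

```python
import statsmodels.formula.api as smf

# Interviewer verbosity over time: the party-specific slopes come from the
# speaker_order coefficient plus the speaker_order-by-party interaction.
interviewer_model = smf.ols("numWords ~ speaker_order * party", data=speaker_df).fit()

# Mueller's verbosity: once the interviewer's word count is in the model,
# neither order nor party adds a significant effect.
mueller_model = smf.ols(
    "numWords_m ~ numWords + speaker_order + party", data=speaker_df
).fit()
```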
CHABOT: Director Mueller, my Democratic colleagues were very disappointed in your report.
The sentiment of the interviewers changed over time as well. The positivity scores for both Republicans and Democrats trended downwards through the testimony, while the neutral sentiment increased. There was no significant difference between the parties.
According to the regression of sentiment on speaker order, each additional speaker’s positivity score decreased by .0009 (p-value .02), while the neutral score increased by .0012 (p-value .04). There’s some evidence Mueller exhibited a similar trend, but even more important was the effect political party had on his sentiment: he responded to Republicans with positivity scores .0616 lower. That’s some serious grumpiness towards Republicans.
MUELLER: I direct you to the — what we’ve written in the report in terms of characterizing his feelings.
We saw earlier that the interviewers’ emotions do not have much of an effect on the length of Mueller’s response (with sadness being the notable exception). In general, both parties display a similar distribution of fear, anger, joy, surprise and sadness. The emotions with which Mueller responded show very little correlation with either the interviewer’s emotion or their party. However, there is an interesting exception when it comes to Mueller’s reaction to surprise.
Mueller’s anger was correlated negatively with Democrats’ surprise and positively with Republicans’ surprise, so Mueller reacted with anger more frequently to Republican surprise than to the same emotion from Democrats. A regression mostly supports this relationship (p-values around .07 rather than under .05), as shown below.
Mueller also reacted with different amounts of sadness when the interviewer expressed surprise. For both parties, there is a positive correlation between surprise and Mueller’s sadness, but he also expressed about .099 higher sadness scores towards Republicans, regardless of their surprise levels. The regression output below supports this idea.
RATCLIFFE: So Americans need to know this, as they listen to the Democrats and socialists on the other side of the aisle, as they do dramatic readings from this report…
It is a well-recognized fact that politics is becoming more and more partisan. After listening to a few interviews from each party, I found the questions from each side predictable. Democrats wanted Mueller to say he only failed to indict Trump because he’s a sitting President. Republicans wanted to berate Mueller for how he handled the investigation. It turns out ML classifiers find this banter pretty predictable too. I fed the contents of each interviewer’s questions into two models. The first was a simple bag of words (using Scikit-learn’s CountVectorizer after removing stop words, punctuation, etc.) fed into a Naive Bayes classifier. Due to the small number of speakers, I ran leave-one-out cross-validation instead of a standard 80/20 training/testing split.
This model yielded an average accuracy of .90, an F1 score of .91 and an AUC of .93.
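Here is a minimal sketch of such a pipeline; the column names and default hyperparameters are placeholders rather than the exact original setup:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Speaker-level text and labels (column names are placeholders).
texts = speaker_df["comments"]
labels = (speaker_df["party"] == "Dem").astype(int)   # 1 = Democrat, 0 = Republican

model = make_pipeline(
    CountVectorizer(stop_words="english"),   # bag of words; drops English stop words
    MultinomialNB(),
)

# Leave-one-out CV: each speaker is held out once.
loo = LeaveOneOut()
preds = cross_val_predict(model, texts, labels, cv=loo)
probs = cross_val_predict(model, texts, labels, cv=loo, method="predict_proba")[:, 1]

print(accuracy_score(labels, preds), f1_score(labels, preds), roc_auc_score(labels, probs))
```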
The other technique I tried was using LDA to get a topic distribution for each interviewer’s text. Choosing the number of topics was a little difficult due to this bug (usually the recommendation is to use the elbow method to look at where the decrease in perplexity begins to slow down), so after looking at some LDA examples, I guessed that 25 would capture the variety of topics. I then ran XGBoost on this topic distribution. Since the topic distributions depend on initial randomness, the success varied a bit across trials. However, in general, average accuracy hovered around .60 after running leave-one-out cross-validation. While LDA is a great way to determine topics, interviewers’ topic distributions did not prove to be good features when predicting party membership. Much better to stick to the basics.
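And a sketch of the LDA-plus-XGBoost alternative using Gensim; 25 topics follows the guess above, while the tokenization choices and hyperparameters are illustrative:

```python
import numpy as np
from gensim import corpora, models
from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess
from sklearn.model_selection import LeaveOneOut, cross_val_score
from xgboost import XGBClassifier

# Same speaker-level text and labels as in the previous sketch.
texts = speaker_df["comments"]
labels = (speaker_df["party"] == "Dem").astype(int)

# Tokenize, build the dictionary and bag-of-words corpus, then fit LDA.
tokens = [simple_preprocess(remove_stopwords(doc)) for doc in texts]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(doc) for doc in tokens]
lda = models.LdaModel(corpus, num_topics=25, id2word=dictionary, random_state=0)

# Each speaker's topic distribution becomes a 25-dimensional feature vector.
X = np.zeros((len(corpus), 25))
for i, bow in enumerate(corpus):
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        X[i, topic_id] = prob

scores = cross_val_score(XGBClassifier(), X, labels, cv=LeaveOneOut())
print(scores.mean())
```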
These predictive models could be run at the comment level instead of the individual level; however, when splitting into training and testing datasets, it is still important to make the splits at the individual level. Check out Scikit-learn’s LeavePGroupsOut method, which is intended for exactly this type of situation.
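A quick sketch of what grouping the splits by speaker looks like (column names are placeholders):

```python
from sklearn.model_selection import LeavePGroupsOut

# Hold out all comments from one speaker at a time, so the same individual
# never appears in both the training and test folds.
lpgo = LeavePGroupsOut(n_groups=1)
for train_idx, test_idx in lpgo.split(df["comment"], df["party"], groups=df["name"]):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    # ... fit on train, evaluate on test
```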
NADLER: And without objection, the hearing is now adjourned.
Mueller’s few discrepancies in his treatment of Republicans and Democrats were the exceptions that prove the rule. After all, if these differences were blatant, it would hardly make for an interesting ML project, would it? Though lacking in verbosity, Mueller administered even treatment to an otherwise highly politicized event. Instead, it was the interviewers who betrayed their partisan views — so much so that nine times out of ten an algorithm can tell where their loyalty stands just from their questions.