Day One

Welcome! This is my first day as a data scientist. I’m new to this and don’t have much formal training, but this blog is the beginning of my journey to becoming an expert. Inspired by blogs I follow, I’m starting the practice of answering my own questions, of which I have many! Here is a tutorial that showcases two key skills: data mining and graphic display.

Parsing Text for Emotion Terms: Analysis & Visualization Using R

In this tutorial, Professor Mesfin Gebeyaw from Capella University uses Saif Mohammad’s NRC Emotion Lexicon to analyze the emotions expressed by Mr. Warren Buffett in his annual shareholder letters over 40 years. Using the tidytext or syuzhet packages, Professor Gebeyaw counts the number of emotion words for each of the eight types: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
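To make the counting step concrete for myself, here is a minimal sketch using syuzhet's `get_sentences()` and `get_nrc_sentiment()`. The sample sentence and the variable names are my own placeholders, not the tutorial's, and turning counts into percentages this way is my assumption about how the shares could be computed.

```r
library(syuzhet)

# Made-up sample text; the tutorial works on the full text of each letter.
letter_text <- "We are delighted with the results, though we fear the losses ahead."

# Split into sentences, then score each sentence against the NRC lexicon.
sentences <- get_sentences(letter_text)
scores <- get_nrc_sentiment(sentences)

# The result also has "negative" and "positive" columns;
# keep only the eight emotion columns.
emotion_cols <- c("anger", "anticipation", "disgust", "fear",
                  "joy", "sadness", "surprise", "trust")
emotion_counts <- colSums(scores[, emotion_cols])

# Assumed normalisation: each emotion as a share of all emotion words found.
emotion_percent <- 100 * emotion_counts / sum(emotion_counts)
print(emotion_percent)
```

Repeating this for each year's letter would give one row of percentages per year, which is the shape a yearly line chart needs.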


Besides the fact that this is an awesome exploration of written communication, two techniques caught my attention. The first is the idea that you can scrape data from a PDF. Here, Professor Gebeyaw uses the pdftools package to extract the words from each document.
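As a rough sketch of what that extraction might look like: `pdftools::pdf_text()` returns one character string per page, which can then be split into words. The helper names `text_to_words()` and `pdf_to_words()` and the file name in the usage note are hypothetical, and the tutorial's own cleaning steps may well differ.

```r
# Split raw text into lowercase words, dropping punctuation and digits.
text_to_words <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]  # remove empty strings left by leading separators
}

# Read a PDF and return its words (requires the pdftools package).
pdf_to_words <- function(path) {
  pages <- pdftools::pdf_text(path)  # one string per page
  text_to_words(paste(pages, collapse = " "))
}

# Quick check of the word splitter on a plain string
print(text_to_words("To the Shareholders of Berkshire Hathaway Inc.:"))
```

With a real file on disk, `pdf_to_words("letter.pdf")` (a placeholder path) would return the letter as a character vector of words.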

The second is pretty basic. Professor Gebeyaw uses ggplot2 to construct his graphs, a package I have never used before. I recently learned the basic structure of its plot commands, so being able to read his code while looking at the graphs gave me a better understanding of the different applications of ggplot2.

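## yearly line chart
library(ggplot2)

# `emotions` is the data frame built in the tutorial (columns: year, percent, sentiment)
ggplot(emotions, aes(x = year, y = percent, color = sentiment, group = sentiment)) +
  geom_line(size = 1) +     # one line per emotion
  geom_point(size = 0.5) +  # small marker at each year
  xlab("Year") +
  ylab("Emotion words count (%)") +
  ggtitle("Emotion words expressed in Mr. Buffett's \n annual shareholder letters")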

This produces a line chart of the yearly percentage of emotion words, with one colored line per emotion.

At first, I thought it would take many lines of code to construct such a clean, aesthetically pleasing, and informative plot. Coming from a world where base plotting is all I’ve ever known, I could picture the coding it would take to specify all these pieces. This is not the case: six lines of plotting code is all it takes. The first specifies the “aesthetics”, which dictate which data are represented on the plot and how; next come the “geoms” (geometric objects), here lines and points; and finally the labels for the axes and the title.

Click through the link above to see the step-by-step process and more lovely graphics!