Reddit Comment Sentiment Analysis (Big Data)

May 27, 2021 · 8 minutes read


Reddit is a social news aggregation, web content rating, and discussion website, and it claims to be “the front page of the internet”. It is a network of communities based on people's interests (Homepage - Reddit, n.d.). For this project, I analyze a subset of the Reddit comments archive to understand whether the sentiment toward a comment is governed by the choice of words rather than the meaning of the comment itself, and whether one can predict the sentiment toward a Reddit comment from word choice alone. I do this in the dhg32_mdf_final_project Jupyter notebook running Python and Spark. I use the score variable (the schema of the whole dataset is shown in Figure 1 of the Appendix) as a proxy for sentiment. Before we get into the methodology and the analysis, let's understand the data at hand.


The dataset I use consists of comments from October, November, and December 2018, and January 2019: a total of 476 million comments drawn from 233,505 different subreddits. The largest number of comments comes from r/AskReddit, followed by r/politics and r/nfl. Because this dataset is so large, I believe it allows me to test whether word choice dictates sentiment toward a comment. I use the score variable as a proxy for sentiment: a score greater than 1 indicates positive sentiment toward the comment, while a score below 1 indicates negative sentiment expressed by the community. I treat a score of exactly 1 as neutral, since that is the default score a comment receives when it is posted (it starts with its author's own upvote). To that end, I create a categorical variable called ‘sentiment’ with values ‘Neutral’, ‘Positive’, and ‘Negative’.

First, I clean the dataset. I drop rows where the body of the comment has been removed or deleted, since those are of no use for the analysis, and I also delete any blank comments, whatever their cause. Further, I discard rows where the comment was not made on a post within a subreddit. I keep only the columns ‘body’, ‘score’, and ‘subreddit_name_prefixed’, since those are my only variables of interest. To simplify the analysis, I take a 5% sample of the 476 million comments, roughly 21.8 million comments. At this point, I create the ‘sentiment’ variable, which takes the value ‘Positive’ if the score is greater than 1, ‘Negative’ if the score is less than 1, and ‘Neutral’ if the score is equal to 1. I also drop any duplicate text entries that may have crept into the dataset. Figure 2 of the Appendix shows the schema of the clean dataset: three variables, ‘body’ (the text of the comment), ‘sentiment’ (the categorical target), and ‘subreddit_name_prefixed’ (the name of the subreddit). Table 1 of the Appendix gives the summary statistics for these variables; since they are all strings, there are no values for mean or standard deviation.
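The labeling rule above is simple enough to sketch. In the notebook it would be a Spark `when`/`otherwise` column expression; the underlying logic, shown here in plain Python with a hypothetical `label_sentiment` helper, is:

```python
def label_sentiment(score: int) -> str:
    """Map a comment's score to a sentiment label.

    A score of 1 is the default a comment receives when posted
    (its author's own upvote), so it is treated as neutral;
    anything above signals approval, anything below disapproval.
    """
    if score > 1:
        return "Positive"
    if score < 1:
        return "Negative"
    return "Neutral"
```

In PySpark the same rule can be written as a chained `F.when(F.col("score") > 1, "Positive").when(F.col("score") < 1, "Negative").otherwise("Neutral")` expression.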

As Table 2 of the Appendix shows, the target variable is unbalanced, which may affect the analysis. The subreddit with the most negative and most positive sentiment overall is r/AskReddit, though this may simply reflect that it contributes the most comments to the dataset. For the analysis, I turn the ‘body’ column of text into a bag of words and use it to predict the sentiment associated with each comment using logistic regression (LR). I use LR because it is easy to implement and efficient to train; given the size of the dataset, I did not want to employ more elaborate methods that would be computationally very expensive. One disadvantage of this approach is possible multicollinearity among the independent variables, i.e. the bag of words. It also assumes a linear relationship, which may not hold: some words may generally induce more negative sentiment than others.
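Turning a comment body into a bag of words amounts to lowercasing, splitting into word tokens, and counting occurrences. The notebook uses Spark's tokenizer for this; the following stdlib sketch (with a hypothetical `bag_of_words` helper and a simple regex tokenization rule) is purely illustrative:

```python
import re
from collections import Counter

def bag_of_words(body: str) -> Counter:
    """Lowercase the comment, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-z']+", body.lower())
    return Counter(tokens)

bow = bag_of_words("Great post, great link!")
# 'great' is counted twice; punctuation and case are stripped.
```

In the actual pipeline these counts become a (very wide, very sparse) feature vector, one dimension per vocabulary word, which is what makes multicollinearity a concern.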

To that end, I fit an LR model with the sentiment toward the comment as the target variable. It is important to note that the number of upvotes or downvotes does not dictate the intensity of the sentiment. For the independent variables, I use the ‘body’ column: I tokenize it into a bag of words and remove stop words. I also encode the target variable into numeric values. I split the data into a 90% training set, holding out 10% for testing, and additionally run 5-fold cross-validation to see whether it differs from the simple train/test split. Finally, I apply term frequency-inverse document frequency weighting to give higher weights to the words that matter most in the corpus of documents.
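The TF-IDF step reweights each token count by how informative the token is across the corpus: words that appear in nearly every comment get pushed toward zero. As a sketch, using the smoothed IDF formula that Spark ML's `IDF` applies, idf(t) = log((N + 1) / (df(t) + 1)), with raw counts as the term frequency (the function name is hypothetical):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term count by log((N + 1) / (df + 1)),
    where N is the number of documents and df is the number
    of documents the term appears in."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append(
            {t: c * math.log((n + 1) / (df[t] + 1)) for t, c in tf.items()}
        )
    return weighted

docs = [["good", "link"], ["good", "post"], ["bad", "post"]]
weights = tf_idf(docs)
# "good" appears in 2 of 3 documents, so it is down-weighted
# relative to "bad", which appears in only 1.
```

This is why TF-IDF can help a bag-of-words classifier: very common words carry little signal about sentiment, and the reweighting suppresses them without removing them outright.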


My LR model performs poorly, with an accuracy of only 46.05%. This suggests that my hypothesis, that sentiment toward a comment is dictated by the choice of words rather than the meaning of the post itself, may not hold. But this analysis is very preliminary. To validate the result, I also run 5-fold cross-validation on the logistic regression model and get an accuracy of 47.10%, a gain of just over one percentage point. Using term frequency-inverse document frequency (TF-IDF) weighting, I run a simple LR model and obtain an accuracy of 45.56%.
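For context on what the cross-validated figure means: 5-fold cross-validation trains five models, each holding out a different fifth of the data for evaluation, and averages the five accuracies. The fold bookkeeping can be sketched as follows (a minimal stdlib illustration with a hypothetical `five_fold_indices` helper, not the notebook's Spark `CrossValidator` code):

```python
def five_fold_indices(n_rows, k=5):
    """Yield (train_indices, test_indices) pairs; each fold holds
    out a different contiguous block of roughly n_rows / k rows."""
    fold_size, rem = divmod(n_rows, k)
    start = 0
    for i in range(k):
        # Spread any remainder across the first `rem` folds.
        end = start + fold_size + (1 if i < rem else 0)
        test = list(range(start, end))
        train = list(range(0, start)) + list(range(end, n_rows))
        yield train, test
        start = end
```

Every row is tested exactly once across the five folds, so the averaged accuracy is a less noisy estimate than a single 90/10 holdout, which is consistent with the small gap between 46.05% and 47.10%.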

There are several possible reasons for the low accuracy. First and foremost, since Reddit is an online forum, the words and structure of sentences may not always be well formed. For example, a comment might read “theremaybea post with textlikethis” (There may be a post with text like this). The tokenizer cannot separate such run-together words, which hurts the analysis. Second, the cleaning methods I apply to the text are very preliminary, and more work could yield an even more refined bag of words. Lastly, it may simply be that the choice of words does not dictate how a post is received: the sentiment toward a comment may not be predictable from the words in the comment at all.

For future work, I would invest more resources in cleaning the text using pre-trained libraries built for text from the internet. It is also possible that this kind of behavior appears only in certain subreddits rather than across all of Reddit, so I may want to add subreddit fixed effects, i.e. dummy variables indicating which subreddit each row of data belongs to; it would be interesting to see patterns across subreddits in this way. All in all, 5-fold cross-validation on the simple bag of words worked best, but much more research can be done to improve the accuracy of the classification model.



Figure 1: Schema of whole dataset

|-- archived: boolean (nullable = true)
|-- author: string (nullable = true)
|-- author_cakeday: boolean (nullable = true)
|-- author_created_utc: long (nullable = true)
|-- author_flair_background_color: string (nullable = true)
|-- author_flair_css_class: string (nullable = true)
|-- author_flair_richtext: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- a: string (nullable = true)
|    |    |-- e: string (nullable = true)
|    |    |-- t: string (nullable = true)
|    |    |-- u: string (nullable = true)
|-- author_flair_template_id: string (nullable = true)
|-- author_flair_text: string (nullable = true)
|-- author_flair_text_color: string (nullable = true)
|-- author_flair_type: string (nullable = true)
|-- author_fullname: string (nullable = true)
|-- author_patreon_flair: boolean (nullable = true)
|-- body: string (nullable = true)
|-- can_gild: boolean (nullable = true)
|-- can_mod_post: boolean (nullable = true)
|-- collapsed: boolean (nullable = true)
|-- collapsed_reason: string (nullable = true)
|-- controversiality: long (nullable = true)
|-- created_utc: long (nullable = true)
|-- distinguished: string (nullable = true)
|-- edited: string (nullable = true)
|-- gilded: long (nullable = true)
|-- gildings: struct (nullable = true)
|    |-- gid_1: long (nullable = true)
|    |-- gid_2: long (nullable = true)
|    |-- gid_3: long (nullable = true)
|-- id: string (nullable = true)
|-- is_submitter: boolean (nullable = true)
|-- link_id: string (nullable = true)
|-- no_follow: boolean (nullable = true)
|-- parent_id: string (nullable = true)
|-- permalink: string (nullable = true)
|-- removal_reason: string (nullable = true)
|-- retrieved_on: long (nullable = true)
|-- score: long (nullable = true)
|-- send_replies: boolean (nullable = true)
|-- stickied: boolean (nullable = true)
|-- subreddit: string (nullable = true)
|-- subreddit_id: string (nullable = true)
|-- subreddit_name_prefixed: string (nullable = true)
|-- subreddit_type: string (nullable = true)

Figure 2: Schema of clean subset of data

|-- body: string (nullable = true)
|-- sentiment: string (nullable = true)
|-- subreddit_name_prefixed: string (nullable = true)


Table 1: Summary Statistics for variables relevant to analysis

summary body sentiment subreddit_name_prefixed
count 20024966 20024966 20024966
mean NaN null null
stddev NaN null null
min .Good link, you a… Negative r/007
max 𨳒杘? Positive r/zzt

Table 2: Distribution of the target variable

sentiment count
Positive 11536391
Neutral 7035137
Negative 1453438

Jupyter Notebook for Code: