A tale of two subreddits

Celeste Indira Short
4 min read · Mar 15, 2021

The purpose of this project is to show how machine learning can predict which subreddit a post belongs to, using accuracy and a confusion matrix as the evaluation metrics.

We chose the two subreddits, houseplants and garden, because we recently became new gardeners. Both are places where “people share photos to highlight new growth, show cool plants, get feedback,” along with gardening ideas for vegetables and advice on when and how to plant seeds.

Subreddits such as houseplants and garden are a great resource for anyone with a home garden, or looking to start one.

Collect Data

First, we have to collect submissions from the two subreddits. We do this using the Pushshift.io API.

The API limits each request to 100 submissions per subreddit. We created a function around ‘requests.get(url, params)’. The data from each subreddit were collected into csv files and later combined. The important part of the download was getting new information each time: a variable called ‘created_utc’ points to the lowest UTC timestamp in the 100 rows of data, and this is used to make additional queries without overlapping entries. We continue this until the required number of rows for the project is met. This process is repeated for the second subreddit.
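The original helper isn’t shown in this post, so here is a minimal sketch of how such a download loop might look. The endpoint and the title/selftext/created_utc fields are real Pushshift details, but the function name, batch count, and output file name are placeholders of our own:

```python
import requests
import pandas as pd

URL = "https://api.pushshift.io/reddit/search/submission"

def get_posts(subreddit, n_batches=10):
    """Download up to n_batches * 100 submissions, paging backwards in time."""
    frames, before = [], None
    for _ in range(n_batches):
        params = {"subreddit": subreddit, "size": 100}
        if before is not None:
            params["before"] = before  # only fetch posts older than the last batch
        data = requests.get(URL, params=params).json()["data"]
        if not data:
            break
        batch = pd.DataFrame(data)
        frames.append(batch)
        before = batch["created_utc"].min()  # lowest UTC in this batch
    return pd.concat(frames, ignore_index=True)

# Placeholder file name; one csv per download session
get_posts("gardening").to_csv("garden_1.csv", index=False)
```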

Combine the Data

Since we downloaded several small files for each subreddit, we now must combine them into a single dataframe.

Then we have to create a feature in the dataset that identifies each row by the subreddit it originated from.

We can then combine them into a new dataframe that includes all the data.
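A minimal sketch of those two steps, assuming the batch files follow a naming pattern like data/garden_*.csv (the actual file names aren’t given in the post):

```python
import glob
import pandas as pd

def load_subreddit(pattern, label):
    # Read every batch csv for one subreddit into a single dataframe
    files = sorted(glob.glob(pattern))
    df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
    df["subreddit"] = label  # feature identifying where each row came from
    return df

garden = load_subreddit("data/garden_*.csv", "garden")
houseplants = load_subreddit("data/houseplants_*.csv", "houseplants")

# One dataframe holding all the data from both subreddits
posts = pd.concat([garden, houseplants], ignore_index=True)
```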

Machines, as smart as they are, are unable to understand words. We used some NLP tools to convert the titles into a format the model can understand. We used Regular Expressions to replace emojis, remove common stop words, and strip unnecessary characters and symbols.

Remove Emoji From Text
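The embedded snippet didn’t survive here, so below is a minimal sketch of one way to strip emoji with a regular expression. The Unicode ranges are a common choice, not an exhaustive one:

```python
import re

# Matches most emoji and pictograph code points (common, not exhaustive)
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # symbols, pictographs, emoticons
    "\U00002600-\U000027BF"  # misc symbols and dingbats
    "]+",
    flags=re.UNICODE,
)

def remove_emoji(text):
    return EMOJI_PATTERN.sub(" ", text)

print(remove_emoji("My monstera got a new leaf 🌱🎉"))
```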

Lemmatization and Stemming are word normalization techniques. Lemmatization produces the dictionary root form (lemma) of a word, while the Porter Stemmer uses suffix stripping to produce the stem.
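A quick illustration of the difference, using NLTK (one common implementation; the post doesn’t say which library was used):

```python
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

print(lemmatizer.lemmatize("leaves"))  # leaf  (a real dictionary word)
print(stemmer.stem("gardening"))       # garden
print(stemmer.stem("leaves"))          # leav  (stems need not be real words)
```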

CountVectorizer is used to transform text into a vector based on the frequency of each word in the data. It creates a matrix in which each unique word is represented by a column and each text sample from the data is a row. Some parameters we used were: max_features, which “cuts” the matrix to the defined number of columns; preprocessor, which takes our cleaning function from above; and analyzer, for which “word” is the default.
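A minimal sketch of that step; the max_features value is a placeholder, and the preprocessor shown is the emoji-stripping function sketched above rather than the project’s full cleaner:

```python
from sklearn.feature_extraction.text import CountVectorizer

cvec = CountVectorizer(
    max_features=5_000,        # keep only the N most frequent words (placeholder)
    preprocessor=remove_emoji, # cleaning function from above
    analyzer="word",           # the default: tokenize on words
    stop_words="english",      # drop common English stop words
)

X = cvec.fit_transform(posts["title"])  # rows = titles, columns = unique words
```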

Classification Algorithms

We then used a Random Forest Classifier as our first algorithm. We set up a pipeline and ran a grid search over it to find the best parameters.

GridSearchCV for the Random Forest Classifier
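The original gist isn’t embedded here, so this is a sketch of what such a pipeline and grid search could look like; the grid values are placeholders, not the ones from the project:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    posts["title"], posts["subreddit"], stratify=posts["subreddit"], random_state=42
)

pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("rf", RandomForestClassifier()),
])

param_grid = {
    "cvec__max_features": [2_000, 5_000],  # placeholder values
    "rf__n_estimators": [100, 200],
    "rf__max_depth": [None, 10],
}

gs = GridSearchCV(pipe, param_grid=param_grid, cv=5)
gs.fit(X_train, y_train)

print(gs.best_params_)
print(gs.score(X_train, y_train), gs.score(X_test, y_test))
```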

The model correctly predicted 561 titles as belonging to the garden subreddit and 553 to houseplants. There were 411 titles that the model misclassified.

Confusion Matrix plot

The first time we ran the model, we used the parameter n_estimators=100 and did not include selftext. The accuracy score was 72.9%.

We re-ran the model with the parameters above and with selftext added, and the score increased to 98.7% for train and 84.4% for test. The gap between the two tells us the model is overfit, which we expected, since words such as “plants” and “help” appear in both subreddits. Due to this overlap, our model may be a little confused.

The confusion matrix results are as follows:

The model correctly predicted that 533 titles belong to garden and 561 to houseplants.

It got it wrong when it predicted 189 titles as garden that were actually houseplants, and 222 as houseplants that were actually garden.

We found it interesting that the confusion matrix changed slightly with selftext.

The results before adding the selftext field:

True Positives: 533
False Positives: 171
True Negatives: 579
False Negatives: 242

Random Forest builds multiple trees, merges them, and picks the majority vote for a more accurate and stable prediction. Because the majority vote averages out the randomness the model injects at each node, it makes sense that its matrix results before and after selftext are so closely similar.
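A toy illustration of the majority vote on made-up data (not the project’s):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: one feature, two classes
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1, 1])

rf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Each tree casts a vote; the forest's answer is the majority class
# (scikit-learn actually averages tree probabilities, which usually agrees)
votes = [int(tree.predict([[2.5]])[0]) for tree in rf.estimators_]
majority = max(set(votes), key=votes.count)
print("votes:", votes, "-> majority class:", rf.classes_[majority])
```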

Additional Algorithms Tried

We also tried Logistic Regression, Stochastic Gradient Descent, and a Naïve Bayes classifier. The pattern was the same across all of them, but Random Forest worked best for this data set.
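For completeness, a sketch of how those comparisons might be run on the same vectorized data; the model settings are placeholders, since the post doesn’t list the exact configurations:

```python
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB

# Reuse the vectorizer from earlier so every model sees the same features
X_train_vec = cvec.fit_transform(X_train)
X_test_vec = cvec.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Stochastic Gradient Descent": SGDClassifier(random_state=42),
    "Naive Bayes": MultinomialNB(),
}

for name, model in models.items():
    model.fit(X_train_vec, y_train)
    print(f"{name}: {model.score(X_test_vec, y_test):.3f}")
```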
