DIY Datasets pt. II

Kenny Oh
2 min read · Aug 5, 2020

…in my last blog post I talked about how to poll data from reddit’s API at a regular interval to generate time series data. All of this was automated with cron, with the project hosted on Heroku. However, I wasn’t entirely sure how I’d retrieve the data in a usable form (a CSV or dataframe). Up to this point I had only been inspecting the data through Heroku’s CLI with heroku pg:psql, but how could I use it in, say, a Jupyter notebook?

To do this, I had to look up the database connection URL, which turned out to be hidden behind another Heroku command: heroku config -a ${INSERT_APP_NAME} | grep DATABASE_URL. That single string is all we need to connect to the Heroku database.

Next I had to install psycopg2 (the most popular PostgreSQL database adapter for Python) in a somewhat atypical way: python -m pip install psycopg2-binary. Using psycopg2 in a Jupyter notebook was then a simple import away. Rather than working with psycopg2’s cursor object directly, we can use pandas’ very handy read_sql_table method to read in the data, much like loading CSVs with read_csv.
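A minimal sketch of that connection step. The table name "subreddit_counts" is a made-up placeholder, and DATABASE_URL stands in for the string printed by heroku config. One wrinkle worth noting: Heroku’s DATABASE_URL uses the postgres:// scheme, which newer SQLAlchemy versions (required by read_sql_table) no longer accept, so it needs a small rewrite first.

```python
import pandas as pd

def heroku_url_for_sqlalchemy(url: str) -> str:
    """Swap Heroku's 'postgres://' scheme for the 'postgresql://'
    scheme that newer SQLAlchemy versions require."""
    return url.replace("postgres://", "postgresql://", 1)

# Hypothetical usage (requires a live database, so commented out):
# from sqlalchemy import create_engine
# engine = create_engine(heroku_url_for_sqlalchemy(DATABASE_URL))
# df = pd.read_sql_table("subreddit_counts", engine)
```

If you’d rather skip SQLAlchemy, pd.read_sql also accepts a raw psycopg2 connection together with an SQL query string.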

After partitioning the data by subreddit, I was able to plot my first time series:

Not the most exciting plot, but it’s the product of almost a month’s worth of data collection! For future iterations I want to plot all subreddits on the same graph and scale them appropriately. I’m hoping the trends will map nicely onto real-world current events.
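Here is one way the partition-and-plot step could look, sketched on toy data since I don’t have the actual schema; the column names ("timestamp", "subreddit", "users") are assumptions. Pivoting gives one column per subreddit, which makes the all-on-one-graph idea a one-liner.

```python
import pandas as pd

# Toy stand-in for the polled reddit data (column names assumed):
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2020-07-01 00:00", "2020-07-01 12:00",
         "2020-07-02 00:00", "2020-07-02 12:00"] * 2),
    "subreddit": ["politics"] * 4 + ["science"] * 4,
    "users": [100, 140, 120, 160, 50, 70, 60, 80],
})

# One column per subreddit, indexed on timestamp:
wide = df.pivot(index="timestamp", columns="subreddit", values="users")

# wide.plot() draws every subreddit on the same axes, and
# (wide / wide.iloc[0]).plot() rescales each series relative to its
# first observation so differently sized subreddits share a y-axis.
```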

Working with time series data in pandas is actually very nice. On a dataframe indexed on timestamp, you can use the .resample() method, passing in “D” for day or “M” for month, and pandas will automatically bucket your time series data into periods of that size, no matter how many observations fall in each bucket. Resampling works like the groupby function, so after bucketing we need to aggregate our data. (In fact, resample('M') behaves like grouping rows by the calendar month they fall in, though unlike groupby(lambda x: x.month) it keeps the same month in different years separate.) For my daily buckets I chose the mean number of users for the given day.
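The daily bucketing step above can be sketched in a few lines on hypothetical observations (the values here are invented for illustration):

```python
import pandas as pd

# Hypothetical twice-daily user counts for one subreddit:
s = pd.Series(
    [10, 20, 30, 40],
    index=pd.to_datetime(["2020-07-01 00:00", "2020-07-01 12:00",
                          "2020-07-02 00:00", "2020-07-02 12:00"]),
)

# Bucket into calendar days, then aggregate each bucket with the mean:
daily = s.resample("D").mean()
# 2020-07-01 -> (10 + 20) / 2 = 15.0
# 2020-07-02 -> (30 + 40) / 2 = 35.0
```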
