Taking Hacker News as an example, let us look at what kinds of posts attract the highest user engagement.
Exploring and Analyzing Hacker News Posts
This is a project to explore Hacker News. In this project, we will identify which Hacker News posts receive high engagement.
The dataset used for the project can be found here. It may be noted that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:
- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted
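The reduction described above (dropping submissions with no comments, then random sampling) was done before this project, but a minimal sketch of how it could be performed, assuming the column layout listed above with the comment count at index 4, might look like:

```python
import random

def reduce_dataset(rows, sample_size, seed=1):
    """Keep only rows with at least one comment, then take a random sample.

    Assumes the column layout above: num_comments is at index 4.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    random.seed(seed)  # make the sample reproducible
    return random.sample(commented, min(sample_size, len(commented)))

# Hypothetical rows in the same layout as hacker_news.csv
rows = [
    ['1', 'Post A', '', '10', '0', 'u1', '8/4/2016 11:52'],
    ['2', 'Post B', '', '5', '3', 'u2', '8/4/2016 12:00'],
    ['3', 'Post C', '', '7', '1', 'u3', '8/4/2016 13:15'],
]
reduced = reduce_dataset(rows, 2)
print(len(reduced))  # 2 rows survive: both commented posts
```

The `reduce_dataset` helper and its arguments are hypothetical; the actual reduction procedure used for the published dataset is not shown in this project.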
First, we will start by reading the dataset.
from csv import reader
# Read the dataset into a list of lists
opened_file = open('hacker_news.csv', encoding='utf8')
hn = list(reader(opened_file))
opened_file.close()  # the rows are already in memory, so the file can be closed
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
It can be noted that the header row is different from the other rows, so we separate the header from the rest of the data.
headers = hn[0]
hn.pop(0)  # remove the header row from the data
print(headers)
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
Extracting Ask HN and Show HN posts
For this project, we are only concerned with Ask HN and Show HN posts, because these are the posts that, respectively, ask the Hacker News community a question or show it something, so they tend to generate a lot of engagement.
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts), len(show_posts), len(other_posts))
1744 1162 17194
We have segregated the posts into three categories:
- Ask HN posts
- Show HN posts
- Other posts
Calculating the average number of comments for Ask HN and Show HN posts
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print('Avg comments on an ask post: ', round(avg_ask_comments,1))
print('Avg comments on a show post: ', round(avg_show_comments,1))
Avg comments on an ask post: 14.0
Avg comments on a show post: 10.3
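As a design note, the two loops above repeat the same accumulation; the logic could be factored into a small helper (a sketch, assuming the list-of-lists layout used throughout, with the comment count at index 4):

```python
def avg_comments(posts, comments_index=4):
    """Average number of comments over a list of post rows."""
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Hypothetical rows with the comment count at index 4
sample_posts = [
    ['1', 'Ask HN: A?', '', '3', '10', 'u1', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '5', '20', 'u2', '8/4/2016 12:00'],
]
print(avg_comments(sample_posts))  # 15.0
```

The `avg_comments` helper is hypothetical; it would let the same calculation be applied to ask, show, and other posts without duplicating the loop.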
It should be noted that, on average, an ask post receives more comments than a show post.
Finding the Amount of Ask Posts and Comments by Hour Created
It is noted that an Ask HN post receives more comments on average than a Show HN post, so for the remainder of this project we will concentrate on Ask HN posts.
We’ll determine if ask posts created at a certain time are more likely to attract comments. We’ll use the following steps to perform this analysis:
- Calculate the amount of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.
import datetime as dt
result_list = []
for post in ask_posts:
    created_at = post[6]
    comments = post[4]
    result_list.append([created_at, comments])
#print(result_list)
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    created_at = row[0]  # e.g. 9/25/2016 23:44
    comments = int(row[1])
    date_obj = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    hour = date_obj.strftime('%H')
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
print(counts_by_hour)
print(comments_by_hour)
{'16': 108, '21': 109, '09': 45, '10': 59, '08': 48, '06': 44, '11': 58, '04': 47, '02': 58, '19': 110, '14': 107, '20': 80, '07': 34, '03': 54, '01': 60, '13': 85, '05': 46, '18': 109, '22': 71, '17': 100, '15': 116, '23': 68, '12': 73, '00': 55}
{'16': 1814, '21': 1745, '09': 251, '10': 793, '08': 492, '06': 397, '11': 641, '04': 337, '02': 1381, '19': 1188, '14': 1416, '20': 1722, '07': 267, '03': 421, '01': 683, '13': 1253, '05': 464, '18': 1439, '22': 479, '17': 1146, '15': 4477, '23': 543, '12': 687, '00': 447}
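As an aside, the if/else accumulation above can also be written with `collections.defaultdict`, which supplies a default value of 0 for hours not yet seen; a minimal sketch using hypothetical [created_at, comments] pairs like those in result_list:

```python
import datetime as dt
from collections import defaultdict

counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

# Hypothetical [created_at, comments] pairs
pairs = [['9/25/2016 23:44', '4'], ['9/25/2016 23:10', '2'], ['9/26/2016 8:05', '1']]
for created_at, comments in pairs:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts_by_hour[hour] += 1          # no 'if hour in ...' check needed
    comments_by_hour[hour] += int(comments)

print(dict(counts_by_hour))    # {'23': 2, '08': 1}
print(dict(comments_by_hour))  # {'23': 6, '08': 1}
```

This is a stylistic alternative, not a change in behavior; the explicit if/else version used in the project works just as well.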
Calculate the average number of comments ask posts receive by hour created.
Based on the counts_by_hour and comments_by_hour dictionaries, we will calculate the average number of comments an Ask HN post receives by hour:
avg comments in an hour = total no. of comments in that hour / total no. of posts in that hour
avg_by_hour = []
for hour in counts_by_hour:
    tot_comments = comments_by_hour[hour]
    tot_posts = counts_by_hour[hour]
    avg_comments = tot_comments / tot_posts
    avg_by_hour.append([hour, avg_comments])
print(avg_by_hour)
[['16', 16.796296296296298], ['21', 16.009174311926607], ['09', 5.5777777777777775], ['10', 13.440677966101696], ['08', 10.25], ['06', 9.022727272727273], ['11', 11.051724137931034], ['04', 7.170212765957447], ['02', 23.810344827586206], ['19', 10.8], ['14', 13.233644859813085], ['20', 21.525], ['07', 7.852941176470588], ['03', 7.796296296296297], ['01', 11.383333333333333], ['13', 14.741176470588234], ['05', 10.08695652173913], ['18', 13.20183486238532], ['22', 6.746478873239437], ['17', 11.46], ['15', 38.5948275862069], ['23', 7.985294117647059], ['12', 9.41095890410959], ['00', 8.127272727272727]]
Sorting and Printing Values from a List of Lists
Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let’s finish by sorting the list of lists and printing the five highest values in a format that’s easier to read.
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
[[16.796296296296298, '16'], [16.009174311926607, '21'], [5.5777777777777775, '09'], [13.440677966101696, '10'], [10.25, '08'], [9.022727272727273, '06'], [11.051724137931034, '11'], [7.170212765957447, '04'], [23.810344827586206, '02'], [10.8, '19'], [13.233644859813085, '14'], [21.525, '20'], [7.852941176470588, '07'], [7.796296296296297, '03'], [11.383333333333333, '01'], [14.741176470588234, '13'], [10.08695652173913, '05'], [13.20183486238532, '18'], [6.746478873239437, '22'], [11.46, '17'], [38.5948275862069, '15'], [7.985294117647059, '23'], [9.41095890410959, '12'], [8.127272727272727, '00']]
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap[:5])
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]
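As a design note, the swap step is only needed because sorted compares the first element of each pair; the same top-5 list can be produced without swapping by passing a key function. A sketch using hypothetical (rounded) [hour, average] pairs:

```python
avg_by_hour = [['16', 16.8], ['02', 23.8], ['15', 38.6],
               ['21', 16.0], ['20', 21.5], ['09', 5.6]]

# Sort by the average (index 1), descending, without building a swapped list
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(top_hours[:5])
# [['15', 38.6], ['02', 23.8], ['20', 21.5], ['16', 16.8], ['21', 16.0]]
```

Either approach gives the same ordering; the key-based version keeps the [hour, average] layout intact.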
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    dt_obj = dt.datetime.strptime(row[1], '%H')  # parse the two-digit hour string
    time_hr = dt_obj.strftime('%H')
    print('{time_hr}:00: {avg_comments:.2f} average comments per post'.format(time_hr=time_hr, avg_comments=row[0]))
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
It can be seen that 15:00 is the best hour to create an ask post in order to receive the most comments.