Taking Hacker News as an example, let us look at what kinds of posts attract the highest user engagement.
Exploring and Analyzing Hacker News Posts
This is a project to explore Hacker News. In this project, we will identify which Hacker News posts receive high engagement.
The dataset used for the project can be found here. It may be noted that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:
- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted
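The reduction described above (dropping submissions with no comments, then random sampling) was done before this project, but a minimal sketch of how it could be performed, assuming the column layout listed above with the comment count at index 4, might look like:

```python
import random

def reduce_dataset(rows, sample_size, seed=1):
    """Keep only rows with at least one comment, then take a random sample.

    Assumes the column layout above: num_comments is at index 4.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    random.seed(seed)  # make the sample reproducible
    return random.sample(commented, min(sample_size, len(commented)))

# Hypothetical rows in the same layout as hacker_news.csv
rows = [
    ['1', 'Post A', '', '10', '0', 'u1', '8/4/2016 11:52'],
    ['2', 'Post B', '', '5', '3', 'u2', '8/4/2016 12:00'],
    ['3', 'Post C', '', '7', '1', 'u3', '8/4/2016 13:15'],
]
reduced = reduce_dataset(rows, 2)
print(len(reduced))  # 2 rows survive: both commented posts
```

The `reduce_dataset` helper and its arguments are hypothetical; the actual reduction procedure used for the published dataset is not shown in this project.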
First, we will start by reading the dataset.
from csv import reader
# Read the dataset into a list of lists
opened_file = open('hacker_news.csv', encoding='utf8')
hn = list(reader(opened_file))
opened_file.close()  # the rows are already in memory, so the file can be closed
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
It can be noted that the header row is different from the other rows, so we separate the header from the rest of the data.
headers = hn[0]
hn.pop(0)  # remove the header row from the data
print(headers)
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
Extracting Ask HN and Show HN posts
For this project, we are only concerned with Ask HN and Show HN posts, because these are the posts that, respectively, ask the Hacker News community a question or show it something, so they tend to generate a lot of engagement.
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts), len(show_posts), len(other_posts))
1744 1162 17194
We have segregated the posts into three categories:
- Ask HN posts
- Show HN posts
- Other posts
Calculating the average number of comments for Ask HN and Show HN posts
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print('Avg comments on an ask post: ', round(avg_ask_comments,1))
print('Avg comments on a show post: ', round(avg_show_comments,1))
Avg comments on an ask post: 14.0
Avg comments on a show post: 10.3
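As a design note, the two loops above repeat the same accumulation; the logic could be factored into a small helper (a sketch, assuming the list-of-lists layout used throughout, with the comment count at index 4):

```python
def avg_comments(posts, comments_index=4):
    """Average number of comments over a list of post rows."""
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Hypothetical rows with the comment count at index 4
sample_posts = [
    ['1', 'Ask HN: A?', '', '3', '10', 'u1', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '5', '20', 'u2', '8/4/2016 12:00'],
]
print(avg_comments(sample_posts))  # 15.0
```

The `avg_comments` helper is hypothetical; it would let the same calculation be applied to ask, show, and other posts without duplicating the loop.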
It should be noted that, on average, an ask post receives more comments than a show post.
Finding the Amount of Ask Posts and Comments by Hour Created
It is noted that an Ask HN post receives more comments on average than a Show HN post, so for the remainder of this project we will concentrate on Ask HN posts.
We’ll determine if ask posts created at a certain time are more likely to attract comments. We’ll use the following steps to perform this analysis:
- Calculate the amount of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.
import datetime as dt
result_list = []
for post in ask_posts:
    created_at = post[6]
    comments = post[4]
    result_list.append([created_at, comments])
#print(result_list)
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    created_at = row[0]  # e.g. 9/25/2016 23:44
    comments = int(row[1])
    date_obj = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    hour = date_obj.strftime('%H')
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
print(counts_by_hour)
print(comments_by_hour)
{'16': 108, '21': 109, '09': 45, '10': 59, '08': 48, '06': 44, '11': 58, '04': 47, '02': 58, '19': 110, '14': 107, '20': 80, '07': 34, '03': 54, '01': 60, '13': 85, '05': 46, '18': 109, '22': 71, '17': 100, '15': 116, '23': 68, '12': 73, '00': 55}
{'16': 1814, '21': 1745, '09': 251, '10': 793, '08': 492, '06': 397, '11': 641, '04': 337, '02': 1381, '19': 1188, '14': 1416, '20': 1722, '07': 267, '03': 421, '01': 683, '13': 1253, '05': 464, '18': 1439, '22': 479, '17': 1146, '15': 4477, '23': 543, '12': 687, '00': 447}
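As an aside, the if/else accumulation above can also be written with `collections.defaultdict`, which supplies a default value of 0 for hours not yet seen; a minimal sketch using hypothetical [created_at, comments] pairs like those in result_list:

```python
import datetime as dt
from collections import defaultdict

counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

# Hypothetical [created_at, comments] pairs
pairs = [['9/25/2016 23:44', '4'], ['9/25/2016 23:10', '2'], ['9/26/2016 8:05', '1']]
for created_at, comments in pairs:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts_by_hour[hour] += 1          # no 'if hour in ...' check needed
    comments_by_hour[hour] += int(comments)

print(dict(counts_by_hour))    # {'23': 2, '08': 1}
print(dict(comments_by_hour))  # {'23': 6, '08': 1}
```

This is a stylistic alternative, not a change in behavior; the explicit if/else version used in the project works just as well.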
Calculate the average number of comments ask posts receive by hour created.
Based on the counts_by_hour and comments_by_hour dictionaries, we will calculate the average number of comments an Ask HN post receives by hour:
avg comments in an hour = total no. of comments in that hour / total no. of posts in that hour
avg_by_hour = []
for hour in counts_by_hour:
    tot_comments = comments_by_hour[hour]
    tot_posts = counts_by_hour[hour]
    avg_comments = tot_comments / tot_posts
    avg_by_hour.append([hour, avg_comments])
print(avg_by_hour)
[['16', 16.796296296296298], ['21', 16.009174311926607], ['09', 5.5777777777777775], ['10', 13.440677966101696], ['08', 10.25], ['06', 9.022727272727273], ['11', 11.051724137931034], ['04', 7.170212765957447], ['02', 23.810344827586206], ['19', 10.8], ['14', 13.233644859813085], ['20', 21.525], ['07', 7.852941176470588], ['03', 7.796296296296297], ['01', 11.383333333333333], ['13', 14.741176470588234], ['05', 10.08695652173913], ['18', 13.20183486238532], ['22', 6.746478873239437], ['17', 11.46], ['15', 38.5948275862069], ['23', 7.985294117647059], ['12', 9.41095890410959], ['00', 8.127272727272727]]
Sorting and Printing Values from a List of Lists
Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let’s finish by sorting the list of lists and printing the five highest values in a format that’s easier to read.
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
[[16.796296296296298, '16'], [16.009174311926607, '21'], [5.5777777777777775, '09'], [13.440677966101696, '10'], [10.25, '08'], [9.022727272727273, '06'], [11.051724137931034, '11'], [7.170212765957447, '04'], [23.810344827586206, '02'], [10.8, '19'], [13.233644859813085, '14'], [21.525, '20'], [7.852941176470588, '07'], [7.796296296296297, '03'], [11.383333333333333, '01'], [14.741176470588234, '13'], [10.08695652173913, '05'], [13.20183486238532, '18'], [6.746478873239437, '22'], [11.46, '17'], [38.5948275862069, '15'], [7.985294117647059, '23'], [9.41095890410959, '12'], [8.127272727272727, '00']]
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap[:5])
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]
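As a design note, the swap step is only needed because sorted compares the first element of each pair; the same top-5 list can be produced without swapping by passing a key function. A sketch using hypothetical (rounded) [hour, average] pairs:

```python
avg_by_hour = [['16', 16.8], ['02', 23.8], ['15', 38.6],
               ['21', 16.0], ['20', 21.5], ['09', 5.6]]

# Sort by the average (index 1), descending, without building a swapped list
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(top_hours[:5])
# [['15', 38.6], ['02', 23.8], ['20', 21.5], ['16', 16.8], ['21', 16.0]]
```

Either approach gives the same ordering; the key-based version keeps the [hour, average] layout intact.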
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    dt_obj = dt.datetime.strptime(row[1], '%H')  # parse the two-digit hour string
    time_hr = dt_obj.strftime('%H')
    print('{time_hr}:00: {avg_comments:.2f} average comments per post'.format(time_hr=time_hr, avg_comments=row[0]))
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
It can be seen that 15:00 is the best hour to create an ask post in order to receive the most comments.