What do all profitable apps have in common? From a data science perspective, this post looks at the profiles of the most profitable apps on the App Store and the Play Store. Read on to find out which categories of apps have a better chance of succeeding.

Analysis : Profitable apps on AppStore and PlayStore

The aim of this project is to find out which apps are profitable on the Apple App Store and Google Play Store markets. This helps developers and the product team make data-driven decisions about what kinds of apps to build.

For this project, we assume that we are interested only in apps that are free to download and install, and that our main source of revenue is in-app ads. This means that the revenue for any given app is mostly influenced by the number of people who use it.

The goal is to analyze the data to help our developers understand what types of apps are likely to attract more users.

Opening and exploring the data

As of the first quarter of 2020, Android users were able to choose between 2.56 million apps, making Google Play the app store with the largest number of available apps. Apple’s App Store was the second-largest app store, with almost 1.85 million available apps for iOS. Source

Instead of scraping the data from scratch, this project works with two existing data sets:

A data set containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from this link.
A data set containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from this link.

Let’s start by opening the two data sets and then continue with exploring the data.

from csv import reader

# Reading App Store data (the with-block closes the file once it has been read)
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios_data = list(reader(opened_file))  # read as a list of lists
ios_header = ios_data[0]
ios_app_data = ios_data[1:]

# Reading Play Store data
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android_data = list(reader(opened_file))  # read as a list of lists
android_header = android_data[0]
android_app_data = android_data[1:]

To make the datasets easier to explore, we write a custom function explore_data() that takes four arguments:

dataset, which is the dataset to be explored
start, which specifies the row at which slicing starts
end, which specifies the row at which slicing ends (not included)
rows_and_columns, a bool value that tells the function whether to also print the number of rows and columns in the dataset. Default is False

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

# Print the first 9 rows of each dataset to check that the data was read correctly
# (explore_data only prints and returns None, so there is nothing to assign)
explore_data(android_app_data, 0, 9, True)
explore_data(ios_app_data, 0, 9, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up']


['Infinite Painter', 'ART_AND_DESIGN', '4.1', '36815', '29M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'June 14, 2018', '6.1.61.1', '4.2 and up']


['Garden Coloring Book', 'ART_AND_DESIGN', '4.4', '13791', '33M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'September 20, 2017', '2.9.2', '3.0 and up']


Number of rows: 10841
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networking', '37', '5', '27', '1']


['282935706', 'Bible', '92774400', 'USD', '0.0', '985920', '5320', '4.5', '5.0', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


['553834731', 'Candy Crush Saga', '222846976', 'USD', '0.0', '961794', '2453', '4.5', '4.5', '1.101.0', '4+', 'Games', '43', '5', '24', '1']


['324684580', 'Spotify Music', '132510720', 'USD', '0.0', '878563', '8253', '4.5', '4.5', '8.4.3', '12+', 'Music', '37', '5', '18', '1']


Number of rows: 7197
Number of columns: 16

Note that the Google Play Store data has 13 columns, whereas the Apple App Store data has 16. Let’s print the headers and check which columns are informative in both datasets. More information about the columns can be found in the dataset descriptions: Appstore:DataDesc and Playstore:DataDesc.

print(ios_header)
print(android_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

It should be noted that the Google Play Store data has an error in one of the rows (index 10472, not counting the header row), according to this

print(android_app_data[10471])
print(android_app_data[10472])

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

print(len(android_app_data[10471]), len(android_app_data[10472]))

13 12

Row 10472 has only 12 columns, whereas 13 are expected. The Category value is missing, so every later value sits one position to the left of where it should be. The Genres value is also an empty string.
We have two options:

Delete the row completely.
Or fill in the missing values with 0 or a placeholder string, so that our program, which relies on column positions, doesn’t produce an unexpected result.

We will go with option 1.
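Rather than relying on a row number reported elsewhere, malformed rows can also be located by comparing each row's length with the header's. The following is a minimal sketch of that idea. The helper name find_malformed_rows and the small demo rows are made up for illustration and are not part of the original notebook.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Tiny stand-in data (hypothetical), shaped like the Play Store rows
demo_header = ['App', 'Category', 'Rating']
demo_rows = [
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # Category is missing
    ['Sat-Fi Voice', 'COMMUNICATION', '3.4'],
]

bad = find_malformed_rows(demo_header, demo_rows)
print(bad)

# Build a cleaned list instead of deleting by a hard-coded index
clean_rows = [row for row in demo_rows if len(row) == len(demo_header)]
print(len(clean_rows))
```

Building a new list has the side benefit that re-running the step does not remove a second, valid row, which can happen when a del statement with a fixed index is executed twice.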

# Listing the data before deleting row 10472
# (explore_data returns None, so the outer print() adds the trailing "None" to the output)
print(explore_data(android_app_data, 10470, 10475, True))

['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']


Number of rows: 10841
Number of columns: 13
None

del android_app_data[10472]

# Listing the data after deleting the problematic row 10472, to confirm that the right row was removed
print(explore_data(android_app_data, 10470, 10475, True))

['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']


['Wi-Fi Visualizer', 'TOOLS', '3.9', '132', '2.6M', '50,000+', 'Free', '0', 'Everyone', 'Tools', 'May 17, 2017', '0.0.9', '2.3 and up']


Number of rows: 10840
Number of columns: 13
None

def seg_duplicate_apps(data_set, idx_app_name):
    unique_apps = []
    duplicate_apps = []

    for app_data in data_set: #each row
        name = app_data[idx_app_name]

        if name in unique_apps:
            # the name was already seen, so record this extra occurrence as a duplicate
            duplicate_apps.append(name)
            #print('App name', name)

        else:
            unique_apps.append(name)
            
    return (unique_apps, duplicate_apps)
        
    
android_unique_apps, android_dup_apps = seg_duplicate_apps(android_app_data,0)
ios_unique_apps, ios_dup_apps = seg_duplicate_apps(ios_app_data,1)

# Removing repeats with set() would lose the original order, which doesn't affect us (left disabled)
'''for app_list in [android_unique_apps, android_dup_apps,ios_unique_apps, ios_dup_apps]:
    app_list = list(set(app_list))'''
    
print(len(android_unique_apps), len(android_dup_apps))
print(len(ios_unique_apps), len(ios_dup_apps))
print(len(android_unique_apps) + len(android_dup_apps))  # sanity check: equals the number of Android rows

9659 1181
7195 2
10840

The function above separates duplicated app names from unique ones in a given dataset.
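As an aside, the same count can be obtained with collections.Counter from the standard library, which avoids the repeated list lookups. This is only a sketch: the helper name count_duplicate_names and the demo rows are hypothetical.

```python
from collections import Counter


def count_duplicate_names(rows, name_idx):
    """Map each repeated app name to the number of rows that carry it."""
    counts = Counter(row[name_idx] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical demo rows: [name, category]
demo_rows = [
    ['Instagram', 'SOCIAL'],
    ['Slack', 'BUSINESS'],
    ['Instagram', 'SOCIAL'],
    ['Instagram', 'SOCIAL'],
]

dupes = count_duplicate_names(demo_rows, 0)
print(dupes)

# Rows beyond the first occurrence of each name
extra_rows = sum(n - 1 for n in dupes.values())
print(extra_rows)
```

Here extra_rows plays the same role as the length of the duplicate list returned by seg_duplicate_apps.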

Note: some apps appear more than twice. Instagram, for example, has four entries:

for idx,app in enumerate(android_app_data):
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']

Analysis on the multiple entries in the data set

Here we are looking at all the entries for ‘Instagram’.

The only difference among the entries is in the Reviews column. All the other columns are exactly the same.
The Reviews values are not necessarily in ascending order, and there is no way to tell whether the last entry is the most recent one. A user could delete a review, or an account could be deleted, so the number of reviews need not increase monotonically. This rules out reliably keeping "the most recent entry".
Instead, we choose to retain the entry with the highest number of reviews.

So in the following section we write code that removes the multiple entries for each app and retains only the entry with the highest number of reviews.
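The idea can be sketched in a single pass with a dictionary that remembers the best row seen so far for each name. This is an illustrative variant under stated assumptions, not the code used for the results in this post: the function name keep_most_reviewed is made up, and ties are resolved by keeping the first row encountered.

```python
def keep_most_reviewed(rows, name_idx, reviews_idx):
    """Keep, for every app name, the row with the highest review count."""
    best = {}  # name -> best row seen so far (dicts keep insertion order)
    for row in rows:
        name = row[name_idx]
        if name not in best or int(row[reviews_idx]) > int(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Demo rows shaped like [App, Category, Rating, Reviews]; the Instagram
# review counts are the ones printed earlier, the last row is hypothetical
demo_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
    ['Some Other App', 'TOOLS', '4.0', '12'],
]

deduped = keep_most_reviewed(demo_rows, 0, 3)
for row in deduped:
    print(row)
```

The rows come back ordered by the first appearance of each name, which may differ slightly from the order produced by a two-pass approach.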

Distribution of apps in the dataset

For the two data sets, the composition of unique, duplicate, and total entries is:

Data Set	Unique Apps	Duplicate Apps	Total Apps
Android	9659	1181	10840
iOS	7197	0	7197

Note: The original Android data set has 10841 rows. We deleted one row earlier because it doesn’t have all the required columns. For iOS, the name-based check above flagged 2 repeated names (7195 unique names), but the iOS rows are not de-duplicated in this analysis, so all 7197 rows are carried forward.

reviews_max = {}
for app_data in android_app_data:
    name = app_data[0]
    n_reviews = float(app_data[3])
    
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max)) # should be equal to the unique apps = 9659

android_clean = []
already_added = []

for app_data in android_app_data:
    name = app_data[0]
    n_reviews = float(app_data[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app_data)
        already_added.append(name)

explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13

Removing Non-English Apps : Part One

We are interested only in English apps, as we cater to an English-speaking audience, so we want to remove non-English apps from the study.

We write a function that takes the name of an app and tells us whether the string contains non-English characters.

def check_english_str(app_name):
    '''
    input : app name
    outputs: Boolean/ False if any character doesn't belong to English ASCII
    '''
    
    for c in app_name:
        if ord(c) > 127:
            return False
        
    return True

sample_apps = ['Instagram','爱奇艺PPS -《欢乐颂2》电视剧热播','Docs To Go™ Free Office Suite', 'Instachat 😜']
for app in sample_apps:
    print(check_english_str(app))

True
False
False
False

Removing non-English apps : Part two

Characters like ‘™’ and emojis have code points greater than 127, so we are wrongly classifying those app names as non-English. We have to change the function accordingly.

So we change the function slightly: it counts the characters that fall outside the ASCII range (0 - 127) and returns False (identifying the string as non-English) once there are too many of them; otherwise it returns True. The intent is to tolerate roughly three such characters. Note that, as written below, the check runs before the counter is incremented, so the function tolerates up to four non-ASCII characters and returns False on the fifth. Although this implementation is crude, it serves our purpose here.

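def check_english_str(app_name):
    '''
    input : app name
    outputs: Boolean/ False once too many characters fall outside the ASCII range
    '''
    cnt = 0  # number of non-ASCII characters seen so far
    for c in app_name:
        if ord(c) > 127:
            # the check runs before the increment, so the fifth non-ASCII character triggers False
            if cnt > 3:
                return False
            cnt += 1
    return True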

sample_apps = ['Instagram','爱奇艺PPS -《欢乐颂2》电视剧热播','Docs To Go™ Free Office Suite', 'Instachat 😜']
for app in sample_apps:
    print(check_english_str(app))

True
False
True
True
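For comparison, here is a variant that makes the threshold explicit. It is a sketch only: the name is_probably_english and the max_non_ascii parameter are made up for illustration. With the default of 3 it is slightly stricter than check_english_str above, so the row counts reported in this post would not necessarily be reproduced with it.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


sample_apps = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]
results = [is_probably_english(app) for app in sample_apps]
print(results)
```

Counting first and comparing once keeps the threshold in one place, which makes it easier to see exactly how many characters are tolerated.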

android_english = []
ios_english = []

for app_row in android_clean:
    name = app_row[0]
    
    if check_english_str(name) :
        android_english.append(app_row)
        
for app_row in ios_app_data:
    name = app_row[1]
    
    if check_english_str(name) :
        ios_english.append(app_row)

explore_data(android_clean, 0, 1, True)
explore_data(android_english, 0, 1, True)
explore_data(ios_app_data, 0, 1, True)
explore_data(ios_english, 0, 1, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 9619
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7197
Number of columns: 16
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 6240
Number of columns: 16

Distribution of apps in the dataset : After removing non-English apps

After removing the non-English apps, the composition of the data sets is:

Data Set	Total Apps	Unique Apps	English
Android	10840	9659	9619
iOS	7197	7197	6240

Isolating the free Apps

So far in the data cleaning process, we:

Removed inaccurate data
Removed duplicate app entries
Removed non-English apps

We will write a function that isolates the free apps.

def isolate_free_apps(ip_data_set, price_idx):
    op_data_set = []
    for app_row in ip_data_set:
        if (app_row[price_idx] == '0') or (app_row[price_idx] == '0.0'):
            op_data_set.append(app_row)
            
    return op_data_set

android_freeapp_list = isolate_free_apps(android_english, 7)
ios_freeapp_list = isolate_free_apps(ios_english, 4)

#print(ios_english[0:3])
explore_data(android_freeapp_list, 0, 1, True)
explore_data(ios_freeapp_list, 0, 1, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 8869
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 3263
Number of columns: 16

Distribution of apps in the dataset : Total apps, unique apps, English apps, free apps

After isolating the free apps, the composition of the data sets is:

Data Set	Total Apps	Unique Apps	English	Free Apps
Android	10840	9659	9619	8869
iOS	7197	7197	6240	3263

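The table implies that:

Android has a higher percentage of free apps than iOS (about 92% of its English apps, versus about 52% for iOS).
Android has a higher percentage of English apps (about 99.6% of its unique apps, versus about 86.7% for iOS).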
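The shares behind these two observations can be computed directly from the counts in the table. A small sketch (the dictionary layout and the share helper are just for illustration):

```python
# Counts copied from the table above
counts = {
    'Android': {'unique': 9659, 'english': 9619, 'free': 8869},
    'iOS': {'unique': 7197, 'english': 6240, 'free': 3263},
}


def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


shares = {}
for store, c in counts.items():
    shares[store] = {
        'english_of_unique': share(c['english'], c['unique']),
        'free_of_english': share(c['free'], c['english']),
    }
    print(store, shares[store])
```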

Our aim is to determine the kinds of apps that are likely to attract more users because the revenue is highly influenced by the numbers of users of the app.

To minimize risks and overhead, our validation strategy for an app idea comprises three steps:

Build a minimal Android version of the app (an MVP), and add it to Google Play.
If the app has a good response from users, we develop it further.
If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Our final aim is to release the app on both Google Play and App Store, so we need to understand the app profiles that are successful on both markets.

Analysing profiles of successful apps:

It is essential to understand the profiles of successful apps. Looking at the data sets, the following columns (zero-based indexes) give an idea of an app's profile:

PlayStore dataset : Columns 8 and 9 correspond to the content rating (Content Rating) and the genre (Genres)
AppleStore dataset : Columns 10 and 11 correspond to the content rating (cont_rating) and the prime genre (prime_genre)

Building frequency tables:

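We will create a function freq_table that takes two inputs, dataset and index.

The function returns the frequency table (as a dictionary) for the column we want. It also computes the frequencies as percentages, but as written it returns the raw counts, so the tables printed below show counts even though their headings say ‘%’.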

def freq_table(dataset, index):
    dict_of_int = {}
    for app_row in dataset:
        col_of_int = app_row[index]
        if col_of_int in dict_of_int:
            dict_of_int[col_of_int] += 1
        else:
            dict_of_int[col_of_int] = 1
                
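    total = sum(dict_of_int.values())
    # percentage version of the table (computed here, but not returned)
    dict_int_avg = {key : round(value*100/total, 2) for key, value in dict_of_int.items() }
    #return dict_int_avg  # use this line instead to get percentages
    return dict_of_int  # raw counts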

prime_genre_android = freq_table(android_freeapp_list,9)
#print(prime_genre_android['Action'])
#print(prime_genre_android_avg['Action'])

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
# display_table only prints and returns None, so its result is not assigned
print('\nFrequency of genres in % : Android\n')
display_table(android_freeapp_list, 9)
print('\nFrequency of genres in % : iOS\n')
display_table(ios_freeapp_list, 11)

Frequency of genres in % : Android

Tools : 749
Entertainment : 538
Education : 476
Business : 407
Lifestyle : 346
Productivity : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 191
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 125
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 81
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Adventure : 11
Action;Action & Adventure : 9
Educational;Pretend Play : 8
Simulation;Action & Adventure : 7
Parenting;Education : 7
Entertainment;Brain Games : 7
Board;Brain Games : 7
Parenting;Music & Video : 6
Educational;Brain Games : 6
Casual;Creativity : 6
Art & Design;Creativity : 6
Education;Pretend Play : 5
Role Playing;Pretend Play : 4
Education;Creativity : 4
Role Playing;Action & Adventure : 3
Puzzle;Action & Adventure : 3
Entertainment;Creativity : 3
Entertainment;Action & Adventure : 3
Educational;Creativity : 3
Educational;Action & Adventure : 3
Education;Music & Video : 3
Education;Brain Games : 3
Education;Action & Adventure : 3
Adventure;Action & Adventure : 3
Video Players & Editors;Music & Video : 2
Sports;Action & Adventure : 2
Simulation;Pretend Play : 2
Puzzle;Creativity : 2
Music;Music & Video : 2
Entertainment;Pretend Play : 2
Casual;Education : 2
Board;Action & Adventure : 2
Video Players & Editors;Creativity : 1
Trivia;Education : 1
Travel & Local;Action & Adventure : 1
Tools;Education : 1
Strategy;Education : 1
Strategy;Creativity : 1
Strategy;Action & Adventure : 1
Simulation;Education : 1
Role Playing;Brain Games : 1
Racing;Pretend Play : 1
Puzzle;Education : 1
Parenting;Brain Games : 1
Music & Audio;Music & Video : 1
Lifestyle;Pretend Play : 1
Lifestyle;Education : 1
Health & Fitness;Education : 1
Health & Fitness;Action & Adventure : 1
Entertainment;Education : 1
Communication;Creativity : 1
Comics;Creativity : 1
Casual;Music & Video : 1
Card;Action & Adventure : 1
Books & Reference;Education : 1
Art & Design;Pretend Play : 1
Art & Design;Action & Adventure : 1
Arcade;Pretend Play : 1
Adventure;Education : 1

Frequency of genres in % : iOS

Games : 1884
Entertainment : 260
Photo & Video : 161
Education : 118
Social Networking : 107
Shopping : 87
Utilities : 82
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 58
Lifestyle : 53
News : 43
Travel : 42
Finance : 41
Weather : 29
Food & Drink : 29
Reference : 18
Business : 18
Book : 15
Navigation : 8
Medical : 6
Catalogs : 4

iOS Analysis: Most common apps by Genre

The following observations show which genres have the most apps:

Most common genre is Games with 57.7% (1884 of 3263 apps)
Second most common genre is Entertainment with 8%
Third most common is Photo & Video with 5%
The general impression is that more apps are made for engagement (spending time) than for productivity.

It should also be noted that we are only counting the apps in each genre. A user may install 4-5 different gaming apps, but only 1-2 social networking apps and a single book-reading or study app.

So this data doesn’t necessarily say anything about utility.
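The percentages quoted in this section can be reproduced from the counts. Below is a minimal sketch of a percentage-returning frequency table built on collections.Counter; the name freq_table_pct and the four demo rows are hypothetical, while 1884 and 3263 are the Games count and the total number of free iOS apps from the output above.

```python
from collections import Counter


def freq_table_pct(rows, index):
    """Frequency table for one column, as percentages sorted from most to least common."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {key: round(100 * n / total, 2) for key, n in counts.most_common()}


# Hypothetical demo rows: [name, genre]
demo_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]
demo_table = freq_table_pct(demo_rows, 1)
print(demo_table)

# Share of Games among the free English iOS apps
games_share = round(100 * 1884 / 3263, 2)
print(games_share)
```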

Android Analysis: Most common apps by Genre

For the Android data set, the observation we made for the iOS data (most apps are made for fun) does not apply. Here the most common genre is Tools with 8%, followed by Entertainment at 6%.

The Android data set also has many more genres than the iOS data, so the underlying trend may be diluted across the many options.
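Many of the Android genre values combine two labels separated by a semicolon, such as ‘Art & Design;Creativity’. One possible way to reduce the dilution is to group by the first label only. This is a sketch under the assumption that the part before the semicolon is the primary genre; the helper name primary_genre is made up, and the post's own results do not use this grouping.

```python
from collections import Counter


def primary_genre(genres_field):
    """Return the part of a Genres value before the first ';'."""
    return genres_field.split(';')[0].strip()


# Genre strings of the kind that appear in the Android output above
demo_genres = [
    'Art & Design',
    'Art & Design;Creativity',
    'Education;Education',
    'Education',
    'Casual;Pretend Play',
]

grouped = Counter(primary_genre(g) for g in demo_genres)
print(grouped.most_common())
```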

Most popular apps by Genre on the App store

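We will find out which genres are the most popular. The App Store data has no column with install counts, so the code below uses the average user rating (the user_rating column) of the apps in each genre as the measure.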

prime_genre_ios = freq_table(ios_freeapp_list, 11)

for key, value in prime_genre_ios.items():
    total = 0
    len_genre = 0
    for app_row in ios_freeapp_list:
        genre_app = app_row[11]
        #print(genre_app, key)
        if genre_app == key:
            avg_user_rating = float(app_row[7])
            total += avg_user_rating
            len_genre += 1
            
    total = total/len_genre
    print(key, total, len_genre)

Catalogs 4.125 4
Navigation 3.875 8
Music 3.946969696969697 66
Shopping 3.8850574712643677 87
Lifestyle 3.358490566037736 53
Food & Drink 3.3620689655172415 29
Book 3.2 15
Travel 3.4404761904761907 42
Finance 3.1341463414634148 41
Medical 3.0 6
Weather 3.3620689655172415 29
Business 3.888888888888889 18
Games 4.01884288747346 1884
Social Networking 3.5934579439252334 107
Utilities 3.5304878048780486 82
News 3.244186046511628 43
Reference 3.6666666666666665 18
Education 3.635593220338983 118
Productivity 4.008620689655173 58
Sports 3.0652173913043477 69
Photo & Video 3.87888198757764 161
Entertainment 3.5173076923076922 260
Health & Fitness 3.769230769230769 65
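The loop above scans the whole list once per genre. For larger data sets the same averages can be collected in a single pass. The following sketch assumes rows are lists of strings, as in this post; the function name average_by_group and the demo rows are hypothetical.

```python
from collections import defaultdict


def average_by_group(rows, group_idx, value_idx):
    """Average of a numeric column for each distinct value of a grouping column."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_idx]
        sums[group] += float(row[value_idx])
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Hypothetical demo rows: [genre, user_rating]
demo_rows = [
    ['Games', '4.5'],
    ['Music', '4.0'],
    ['Games', '4.0'],
]

averages = average_by_group(demo_rows, 0, 1)
print(averages)
```

For the iOS data in this post the call would use index 11 for the genre and index 7 for the rating.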

Recommended profile for an iOS app

The highest average rating belongs to the Catalogs genre (4.125). However, there are only 4 apps in that genre.
Games has the second-highest average rating (about 4.02) and also by far the largest number of apps (1884).

So, based on the App Store dataset, it is recommended to make a gaming app for iOS.

Most popular apps by Genre on the Google Playstore

In the Google PlayStore dataset, column 5 (Installs) specifies how many times an app has been installed. The column holds values like ‘100+’, ‘1,000+’, ‘10,000+’, etc., so the characters ‘+’ and ‘,’ have to be removed before the values can be converted to numbers.
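The cleaning step on its own looks like this. It is a small sketch, and the helper name parse_installs is made up. Keep in mind that a value such as ‘1,000+’ is a lower bound on the installs, not an exact figure.

```python
def parse_installs(text):
    """Convert an Installs value such as '1,000+' to an integer lower bound."""
    return int(text.replace(',', '').replace('+', ''))


examples = ['100+', '1,000+', '1,000,000,000+']
parsed = [parse_installs(value) for value in examples]
print(parsed)
```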

category_android = freq_table(android_freeapp_list, 1)

for key, value in category_android.items():
    total = 0
    len_category = 0
    for app_row in android_freeapp_list:
        category_app = app_row[1]
        #print(genre_app, key)
        if category_app == key:
            no_installs = app_row[5]
            no_installs = no_installs.replace(',','').replace('+','')
            total += float(no_installs)
            len_category += 1
            
    total = total/len_category
    print(key, int(total), len_category)

SPORTS 3638640 301
TOOLS 10801391 750
DATING 854028 165
VIDEO_PLAYERS 24727872 159
LIFESTYLE 1433701 347
LIBRARIES_AND_DEMO 638503 83
EDUCATION 1833495 103
FAMILY 3691833 1678
SOCIAL 23253652 236
HOUSE_AND_HOME 1331540 73
ENTERTAINMENT 11640705 85
HEALTH_AND_FITNESS 4188821 273
PRODUCTIVITY 16787331 345
WEATHER 5074486 71
NEWS_AND_MAGAZINES 9549178 248
BEAUTY 513151 53
PARENTING 542603 58
ART_AND_DESIGN 1986335 57
MAPS_AND_NAVIGATION 4025286 125
AUTO_AND_VEHICLES 647317 82
BOOKS_AND_REFERENCE 8721959 191
TRAVEL_AND_LOCAL 13984077 207
FOOD_AND_DRINK 1924897 110
COMMUNICATION 38456119 287
BUSINESS 1712290 407
FINANCE 1387692 328
PHOTOGRAPHY 17840110 261
COMICS 817657 55
EVENTS 253542 63
PERSONALIZATION 5201482 294
GAME 15588015 862
MEDICAL 120550 313
SHOPPING 7036877 199

Recommended profile for an android app

By number of apps, the largest category is ‘Family’ (1678 apps), followed by ‘Game’ (862) and ‘Tools’ (750).
By average number of installs, the leaders are ‘Communication’ (about 38.5 million), ‘Video Players’ (about 24.7 million) and ‘Social’ (about 23.3 million), while ‘Tools’ averages about 10.8 million. Considering both, the recommended category for an Android Play Store app is Tools.
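The ranking is easier to read once the averages are sorted. The sketch below uses a subset of the category averages printed above, copied by hand; the variable names are just for illustration.

```python
# Average installs per category, copied from the output above (subset)
avg_installs = {
    'TOOLS': 10801391,
    'FAMILY': 3691833,
    'SOCIAL': 23253652,
    'COMMUNICATION': 38456119,
    'VIDEO_PLAYERS': 24727872,
    'PHOTOGRAPHY': 17840110,
    'PRODUCTIVITY': 16787331,
    'GAME': 15588015,
}

# Sort from highest to lowest average
ranked = sorted(avg_installs.items(), key=lambda item: item[1], reverse=True)
for category, avg in ranked:
    print(f'{category:<15}{avg:>12,}')

top_three = [category for category, _ in ranked[:3]]
```

Averages like these can be dominated by a handful of apps with very large install counts, so they are best read together with the number of apps in each category.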

Pavan Kumar