In this post, using pandas we will understand more about fan’s choice on favourite star war’s characters, films etc:-

import pandas as pd
star_wars = pd.read_csv("star_wars.csv", encoding = 'ISO-8859-1')
star_wars.head(10)
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

star_wars.columns
Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?ξ',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')
# Removing NaN rows of RespondentIDs
print(star_wars.shape)
star_wars = star_wars[pd.notnull(star_wars['RespondentID'])]
print(star_wars.shape)
(1187, 38)
(1186, 38)
yes_no = {
    "Yes" : True,
    "No" : False,
}

cols = ['Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?']

for col in cols:
    star_wars[col] = star_wars[col].map(yes_no)
    

star_wars.head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

seen_movies = star_wars.columns[3:9]
print(seen_movies)
Index(['Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8'],
      dtype='object')
import numpy as np
movie_mapping = {
    "Star Wars: Episode I  The Phantom Menace": True,
    np.nan: False,
    "Star Wars: Episode II  Attack of the Clones": True,
    "Star Wars: Episode III  Revenge of the Sith": True,
    "Star Wars: Episode IV  A New Hope": True,
    "Star Wars: Episode V The Empire Strikes Back": True,
    "Star Wars: Episode VI Return of the Jedi": True
}

for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(movie_mapping)

star_wars = star_wars.rename(columns={
        "Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
        "Unnamed: 4": "seen_2",
        "Unnamed: 5": "seen_3",
        "Unnamed: 6": "seen_4",
        "Unnamed: 7": "seen_5",
        "Unnamed: 8": "seen_6"
        })

star_wars.head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

star_wars = star_wars.rename(columns={
        "Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.": "ranking_1",
        "Unnamed: 10": "ranking_2",
        "Unnamed: 11": "ranking_3",
        "Unnamed: 12": "ranking_4",
        "Unnamed: 13": "ranking_5",
        "Unnamed: 14": "ranking_6"
        })

star_wars.head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
ranking_films = star_wars[star_wars.columns[9:15]].mean()
%matplotlib inline
ranking_films.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7fcf794a74a8>

png

Ranking

So far, we have cleaned the data, column names. It should be noted that the films in 1, 2, 3 are relatively modern movies, where as movies in 4, 5, 6 columns are older generation movies in 77, 80 and 83 respectively. As expected, these ranked higher than the newer movies.

The most viewd movie in the StarWars Series is

most_seen_films = star_wars[star_wars.columns[3:9]].sum()
print(most_seen_films)
most_seen_films.plot.bar()
seen_1    673
seen_2    571
seen_3    550
seen_4    607
seen_5    758
seen_6    738
dtype: int64





<matplotlib.axes._subplots.AxesSubplot at 0x7fcf77398470>

png

View counts

It looks like the older movies in the series is watched more than the newer movies of the series. This is consistent with what we have observed in the rankings

Does Males and Females like different sets of movies in the StarWars series?

males = star_wars[star_wars['Gender'] == 'Male']
females = star_wars[star_wars['Gender'] == 'Female']
print(males.shape[0], females.shape[0])
497 549

Interesting: I expected number of males to be higher than the Females

Most liked and seen movie in the Series by Males

males_ranking_films = males[males.columns[9:15]].mean()
males_ranking_films.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7fcf7704b240>

png

males_seen_films = males[males.columns[3:9]].mean()
males_seen_films.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7fcf76fe4978>

png

Most liked and seen movie in the Series by Females

females_ranking_films = females[females.columns[9:15]].mean()
females_ranking_films.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7fcf76ecdc50>

png

females_seen_films = females[females.columns[3:9]].mean()
females_seen_films.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7fcf76e44748>

png

Observations:

  • Both Male and Female viewers show the overall trend. Both sets of viewers like and watched older movies in the series more the newer movies