As a group, we are huge movie fanatics and enjoy great films such as The Godfather, Casablanca, and any Tarantino flick. As data scientists, we wanted to dig deeper into the business side of movies and explore the economics behind what makes a successful movie. Specifically, we wanted to examine whether there are any trends among films that lead them to become successful at the box office, and whether a film's box office success correlates with its ratings. A useful analysis would help us predict how well a film will do at the box office before it screens, without having to rely on critics or our own instincts. Essentially, we want to determine if there is a "Hollywood formula" for making a successful movie.

We found an interesting dataset of more than 5,000 IMDB movies, each described by 28 attributes, here: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset.

You can read more about the data set here: https://blog.nycdatascience.com/student-works/machine-learning/movie-rating-prediction/.

We will focus our analysis on domestic gross: how much a film earned domestically at the box office during its initial run. This figure is in nominal terms and will need to be transformed into real terms. It also excludes international earnings, as well as revenue from DVD rentals, television runs, etc. We will restrict our analysis to films produced within the USA.

Kaggle user **chuansun76** was trying to solve the following problem:

- Given that thousands of movies are produced each year, is there a better way for us to tell the greatness of a movie without relying on critics or our own instincts?
- Does the number of human faces in a movie poster correlate with the movie's rating?

We decided to tackle the problem by trying to answer the following questions:

- Do the genre, IMDB score, and popularity of the cast impact a film's success at the box office?
- Are there any movies with a high gross-to-budget ratio (ROI), and why?

We provide two ways to obtain the data scraped by chuansun76: downloading it directly online, or simply reading the downloaded file from our inputs folder.

First let's import the packages and libraries we'll need.

In [1]:

```
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
```

In [2]:

```
import csv  # needed to parse the raw HTTP response body

DATA_URL = "https://raw.githubusercontent.com/sundeepblue/movie_rating_prediction/master/movie_metadata.csv"
FILE_PATH = "input/movie_metadata.csv"

def load_data_online(data_url):
    """Download the CSV over HTTP and parse it into a DataFrame."""
    data = None
    SUCCESS = 200
    r = requests.get(data_url)
    if r.status_code == SUCCESS:
        # Decode data and read it into a DataFrame
        content = r.content.decode('utf-8')
        cr = csv.reader(content.splitlines(), delimiter=',')
        rows = list(cr)
        data = pd.DataFrame(rows[1:], columns=rows[0])
    return data

# movies_table = load_data_online(DATA_URL)
movies_table = pd.read_csv(FILE_PATH)
movies_table.head()
```

Out[2]:

For our analysis we will drop any rows with null values for `gross`, `budget`, `title_year`, and `country`, as those data points won't contribute to our analysis. We will also focus on movies produced domestically, so we will drop all rows for movies produced outside the United States.

In [3]:

```
# replace NA values with 0 (or "NaN" for country) so we can filter on them
movies_table["gross"].fillna(0, inplace=True)
movies_table["budget"].fillna(0, inplace=True)
movies_table["title_year"].fillna(0, inplace=True)
movies_table["country"].fillna("NaN", inplace=True)

# only consider movies made in the USA; drop all other rows
movies_table.drop(movies_table[~(movies_table["country"].str.contains("USA"))].index, inplace=True)
movies_table.head()
```

Out[3]:

To compare movies across different years, we will need to convert gross and budget values into real dollar amounts, in terms of 2016 purchasing power. To accomplish this we will use the Consumer Price Index (CPI) to adjust for inflation.

Let's scrape the average annual CPI for every year from 1913 through 2016. Using The US Inflation Calculator as our source, we'll traverse all the rows in its table to build our DataFrame.

In [4]:

```
url = "http://www.usinflationcalculator.com/inflation/consumer-price-index-and-annual-percent-changes-from-1913-to-2008/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# The CPI figures live in the first table on the page;
# column 13 holds the annual average for each year.
table = soup.find('table')
rows = table.tbody.findAll('tr')

years = []
cpis = []
for row in rows:
    year = row.findAll('td')[0].get_text()
    if year.isdigit() and int(year) < 2017:
        years.append(int(year))
        cpis.append(float(row.findAll('td')[13].get_text()))

cpi_table = pd.DataFrame({
    "year": years,
    "avg_annual_cpi": cpis
})
cpi_table.head()
```

Out[4]:

Let's define a function to translate the nominal dollars into real dollars in 2016 using the CPI. We'll use this equation to calculate the real value:

Past dollars in terms of recent dollars = (Dollar amount × Ending-period CPI) ÷ Beginning-period CPI.

In [5]:

```
def get_real_value(nominal_amt, old_cpi, new_cpi):
    """Convert a nominal dollar amount into real terms via the CPI ratio."""
    real_value = (nominal_amt * new_cpi) / old_cpi
    return real_value
```
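As a quick sanity check of the formula, here's the function exercised on made-up numbers (assuming a 1980 annual-average CPI of roughly 82.4 and a 2016 CPI of roughly 240 — approximate figures for illustration, not the scraped values):

```python
# Reproduced here so the snippet is self-contained.
def get_real_value(nominal_amt, old_cpi, new_cpi):
    real_value = (nominal_amt * new_cpi) / old_cpi
    return real_value

# $100 in 1980 dollars, expressed in 2016 purchasing power:
print(round(get_real_value(100, 82.4, 240.0), 2))  # → 291.26
```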

Next, let's drop any rows in `movies_table` with a `budget`, `gross`, or `title_year` of 0, as those rows won't contribute to our analysis:

In [6]:

```
movies_table.drop(movies_table[(movies_table["budget"] == 0) | (movies_table["gross"] == 0) |
(movies_table["title_year"] == 0)].index, inplace=True)
```

We will always be converting into *2016* dollars, so let's make things easier for ourselves and set the 2016 CPI to a constant.

In [7]:

```
CPI_2016 = cpi_table.loc[cpi_table['year'] == 2016, 'avg_annual_cpi'].iloc[0]
```

Now we're ready to transform the `budget` and `gross` for each movie into real 2016 dollar terms:

In [8]:

```
real_domestic_gross = []
real_budget_values = []

# transform gross and budget values into real 2016 dollar terms
for index, row in movies_table.iterrows():
    gross = row['gross']
    budget = row['budget']
    year = int(row['title_year'])
    cpi = cpi_table.loc[cpi_table['year'] == year, 'avg_annual_cpi'].iloc[0]
    real_domestic_gross.append(get_real_value(gross, cpi, CPI_2016))
    real_budget_values.append(get_real_value(budget, cpi, CPI_2016))

movies_table["real_domestic_gross"] = real_domestic_gross
movies_table["real_budget"] = real_budget_values
```

We'll also drop the nominal value columns, as those won't contribute to our analysis.

In [9]:

```
# drop the gross and budget cols because we won't use the nominal values
movies_table.drop(labels='gross', axis=1, inplace=True)
movies_table.drop(labels='budget', axis=1, inplace=True)
movies_table.head()
```

Out[9]:

Next, let's compute each film's profit and return on investment (ROI). **We will store the ROI values as percentages.**

In [10]:

```
profits = []
roi_vals = []
for index, row in movies_table.iterrows():
    budget = row['real_budget']
    profit = row['real_domestic_gross'] - budget
    # ROI = profit / cost, converted to a percentage
    roi = (profit / budget) * 100
    profits.append(profit)
    roi_vals.append(roi)

movies_table['profit'] = profits
movies_table['roi'] = roi_vals
movies_table.head()
```

Out[10]:

We'll use ggplot, Seaborn, and matplotlib for making our visualizations.

In [11]:

```
from ggplot import *
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
```

Let's see if there is a relationship between a film's IMDB score and its real domestic gross by creating a simple scatter plot of a sample of the data.

In [12]:

```
x = 'imdb_score'
y = 'real_domestic_gross'
# choose a subset to represent the dataset
subset_data = movies_table.sample(500)[[x, y]]
subset_data.sort_values(['imdb_score'], ascending=True, inplace=True)
subset_data.dropna(inplace=True)
subset_data.plot(x=x, y=y, kind='scatter')
```

Out[12]:

In [13]:

```
ggplot(aes(x=x, y=y), data=subset_data) +\
    geom_violin() +\
    labs(title="IMDB score vs. real domestic gross",
         x=x,
         y=y)
```

Out[13]:

Next, we wanted to analyze whether genre played a role in the correlation between IMDB score and gross. In our original dataset each movie can have multiple genres, so we split each row into one row per genre the movie was classified under.

In [14]:

```
# Subset data #2: one row per (movie, genre) pair
values = []
for idx, row in movies_table.iterrows():
    genres_val = row['genres']
    if not pd.isnull(genres_val) and not pd.isnull(row['real_domestic_gross']):
        for genre in genres_val.split('|'):
            values.append([row[x], row[y], genre])

subset_data2 = pd.DataFrame(values, columns=[x, y, 'genre'])
subset_data2 = subset_data2.groupby(['imdb_score', 'genre'], as_index=False).mean()
```

Now let's graph the imdb score vs. gross relationship for each genre in our dataset.

In [15]:

```
# Plot the data: one regression panel per genre
g = sns.lmplot(x=x, y=y, data=subset_data2, col="genre", hue="genre", scatter=True, fit_reg=True, col_wrap=3)
plt.show()
```

Next, let's make use of the ROI we calculated earlier and try to analyze which films brought studios the most bang for their buck.

In [16]:

```
movies_by_roi = movies_table.sort_values('roi', ascending=False)
for index, row in movies_by_roi.head().iterrows():
print(row["movie_title"], row["roi"])
```

Now, let's sort the movies by greatest absolute profit.

In [17]:

```
movies_by_profit = movies_table.sort_values('profit', ascending=False)
for index, row in movies_by_profit.head().iterrows():
print(row["movie_title"], row["profit"])
```

What were their respective IMDB scores?

In [18]:

```
movies_by_profit[["movie_title","profit","imdb_score"]].head()
```

Out[18]:

Next, let's sort the movies by greatest real domestic gross at the box office.

In [19]:

```
movies_by_gross = movies_table.sort_values('real_domestic_gross', ascending=False)
for index, row in movies_by_gross.head().iterrows():
print(row["movie_title"], row["real_domestic_gross"])
```

Let's sort the movies by highest IMDB score.

In [20]:

```
movies_by_score = movies_table.sort_values('imdb_score', ascending=False)
for index, row in movies_by_score.head().iterrows():
print(row["movie_title"], row["imdb_score"], row["real_domestic_gross"])
```

These are the attributes we have to work with:

In [21]:

```
movies_table.columns
```

Out[21]:

We'll use statsmodels and scikit-learn for the regression and machine-learning portions of our analysis.

In [22]:

```
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import r2_score
```

Let's do linear regression with ordinary least squares.

In [23]:

```
new_data = movies_table.dropna()
x_columns = ['imdb_score', 'real_budget', 'num_critic_for_reviews', 'director_facebook_likes',
             'actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes',
             'movie_facebook_likes', 'cast_total_facebook_likes']
y_column = 'real_domestic_gross'

X = new_data[x_columns].values
X_OLS = sm.add_constant(X)
y = new_data[y_column].values

model = sm.OLS(y, X_OLS)
model.data.xnames = ['const'] + x_columns  # label the coefficients in the summary
results = model.fit()
results.summary()
```

Out[23]:

**Which variables are significant? Which aren't?**

Looking at the p-values with a significance level of 5%, we can see that `imdb_score`, `real_budget`, `actor_1_facebook_likes`, `actor_2_facebook_likes`, `actor_3_facebook_likes`, and `cast_total_facebook_likes` are all significant variables. On the other hand, `num_critic_for_reviews`, `director_facebook_likes`, and `movie_facebook_likes` aren't significant because their p-values are all greater than 0.05.

**Interpretations for significant variables**

An increase of 1 in the IMDB score *increases* the real domestic gross on average by \$3,000,000, holding all other predictors constant.

An increase of \$1 in the budget has a near-zero effect on the real domestic gross (the coefficient rounds to \$0), holding all other predictors constant.

An increase of 1 in actor 1's Facebook likes *decreases* the real domestic gross by \$9,357.47, holding all other predictors constant.

An increase of 1 in actor 2's Facebook likes *decreases* the real domestic gross by \$9,571.03, holding all other predictors constant.

An increase of 1 in actor 3's Facebook likes *decreases* the real domestic gross by \$10,100.00, holding all other predictors constant.

An increase of 1 in the cast's total Facebook likes *increases* the real domestic gross by \$9,090.37, holding all other predictors constant.

Most of the results make sense: a higher score, budget, and cast total Facebook likes lead to a higher real domestic gross. What was most surprising was that an increase in an individual lead actor's Facebook likes was associated with a *lower* gross. This could be due to correlation with the cast total Facebook likes.
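That suspicion about correlated predictors is easy to check. Here's a hedged sketch using a few made-up rows in place of `movies_table` (all column values invented for illustration): because `cast_total_facebook_likes` is roughly the sum of the individual actor-likes columns, the pairwise correlation is very high, which can make individual coefficient signs unstable in OLS.

```python
# Hypothetical stand-ins for the actor-likes columns (invented values).
actor_1_likes = [12000, 40000, 800, 26000, 15000]
actor_2_likes = [5000, 11000, 600, 9000, 7000]
actor_3_likes = [1000, 2000, 300, 4000, 2500]
# cast_total is (roughly) the sum of the individual actors plus minor cast
cast_total = [a + b + c + 500
              for a, b, c in zip(actor_1_likes, actor_2_likes, actor_3_likes)]

def pearson(xs, ys):
    """Plain-Python Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(round(pearson(actor_1_likes, cast_total), 3))  # very close to 1.0
```

The same check could be run against the real columns with `movies_table[...].corr()`.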

Let's now try linear regression on the ROI data we collected.

In [24]:

```
y2_column = 'roi'

X = new_data[x_columns].values
X_OLS = sm.add_constant(X)
y2 = new_data[y2_column].values

model = sm.OLS(y2, X_OLS)
model.data.xnames = ['const'] + x_columns
results = model.fit()
results.summary()
```

Out[24]:

Performing linear regression on ROI yielded strange results. For example, two significant variables that stood out in this model were `real_budget` and `movie_facebook_likes`. For `real_budget`, holding all other predictors constant, a one-dollar increase leads to over an 18,000% decrease in a film's ROI, which seems very inaccurate. And each additional `movie_facebook_likes` a film receives, holding all other predictors constant, reduces the film's ROI by 0.05%. This is counter-intuitive, because we expected a film with more Facebook likes to see more success at the box office.

In conclusion, performing linear regression on ROI didn't yield the best results and the model doesn't fit our data very well.

Next we are going to use **decision trees** and **cross validation**.

Decision trees are a type of machine learning model that learns a set of decision rules (if-then-else statements) from data and uses those rules to predict values.
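To make the idea concrete, here is a hand-written set of decision rules of the kind a tree learner produces; the features, thresholds, and output buckets are all invented for illustration, not learned from our data:

```python
def predict_gross_bucket(imdb_score, real_budget):
    """A toy, hand-crafted 'decision tree': nested if-then-else rules."""
    if real_budget > 100_000_000:      # big-budget branch
        if imdb_score >= 7.0:
            return "high"
        return "medium"
    if imdb_score >= 8.0:              # small budget, but very well rated
        return "medium"
    return "low"

print(predict_gross_bucket(8.1, 150_000_000))  # → high
print(predict_gross_bucket(6.0, 20_000_000))   # → low
```

A fitted `DecisionTreeRegressor` discovers rules of exactly this shape automatically, choosing the features and thresholds that best split the training data.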

Cross validation is a way of testing whether our model generalizes well. We split the data into two groups, training and testing (normally with far more training data than testing, e.g. a 70/30 split). We use the training data to build the decision tree, then use the testing data to see how well the model does. If it doesn't do as well as we expected, we may have overfitted the model to the training data, and may need to prune the tree by changing specific parameters. This is the basic rundown of cross validation.

You can learn more about these machine learning concepts and their Python implementation from the sklearn documentation for Decision Trees and Cross Validation.
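The split described above can be sketched in plain Python (a toy 70/30 holdout on ten invented samples; scikit-learn's `train_test_split` does this, plus more, for real data):

```python
import random

def holdout_split(samples, test_fraction=0.3, seed=42):
    """Shuffle samples and split them into train/test lists (toy sketch)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # first n_test shuffled samples go to the test set, the rest to training
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(list(range(10)))
print(len(train), len(test))  # → 7 3
```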

Let's train a decision tree classifier on our data.

In [25]:

```
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
regressor = tree.DecisionTreeRegressor(min_samples_split=3)
y_pred = regressor.fit(X_train, y_train).predict(X_val)

# score the held-out predictions with R^2
r2 = r2_score(y_val, y_pred)
r2
```

Out[25]:

Predicting `real_domestic_gross` yielded the most interesting results, relatively speaking, while the other models didn't quite fit our data.

Our tutorial introduced only some of the packages and libraries available in Python. More details are available below:

- BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
- Requests: http://docs.python-requests.org/en/master/
- pandas: http://pandas.pydata.org/
- NumPy: http://www.numpy.org/
- ggplot: http://ggplot.yhathq.com/
- Seaborn: https://seaborn.pydata.org/
- Matplotlib: https://matplotlib.org/
- scikit-learn: http://scikit-learn.org