Finding the Hollywood Formula: IMDB Movie Dataset Analysis

Motivation

As a group, we are huge movie fanatics and enjoy great films such as The Godfather, Casablanca, and any Tarantino flick. As data scientists, we wanted to dig deeper into the business side of movies and explore the economics behind what makes a successful movie. In short, we wanted to examine whether there are any trends among films that lead them to become successful at the box office, and whether a film's box office success correlates with its ratings. A useful analysis would help us predict how well a film will do at the box office before it screens, without having to rely on critics or our own instinct. Essentially, we want to determine whether there is a "Hollywood formula" for making a successful movie.

Background

We found an interesting dataset of more than 5000 data points consisting of 28 attributes describing IMDB movies here: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset.

You can read more about the data set here: https://blog.nycdatascience.com/student-works/machine-learning/movie-rating-prediction/.

We will focus our analysis on domestic gross, which is how much the film earned domestically at the box office during its initial run. This figure is in nominal terms and will need to be transformed into real terms. It also excludes international earnings, as well as revenue from DVD rentals, television runs, etc. We will restrict our analysis to films produced within the USA.

Original Problem

Kaggle user chuansun76 was trying to solve the following problem:

  1. Given that thousands of movies are produced each year, is there a better way for us to tell the greatness of a movie without relying on critics or our own instincts?
  2. Will the number of human faces in a movie poster correlate with the movie rating?

Our Problem

We decided to tackle the problem by trying to answer the following questions:

  1. Do the genre, IMDB score, and popularity of the cast impact a film's success at the box office?
  2. Are there any movies with a high gross-to-budget ratio (ROI), and why?

I. Data Collection

We provide two ways to obtain the data scraped by chuansun76: downloading it directly online, or simply reading the downloaded file from our input folder.

First let's import the packages and libraries we'll need.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import csv

We'll use the Python packages BeautifulSoup and Requests to download the dataset. Then we'll read the CSV into a pandas DataFrame for further analysis along with NumPy.

In [2]:
DATA_URL = "https://raw.githubusercontent.com/sundeepblue/movie_rating_prediction/master/movie_metadata.csv"
FILE_PATH = "input/movie_metadata.csv"

def load_data_online(data_url):
    data = None
    SUCCESS = 200
    
    r = requests.get(data_url)
    if r.status_code == SUCCESS:
        # Decode data and read it into a DataFrame
        content = r.content.decode('utf-8')
        cr = csv.reader(content.splitlines(), delimiter=',')
        my_list = list(cr)
        data = pd.DataFrame(my_list[1:], columns=my_list[0])
    # data stays None if the request failed
    return data
    
# movies_table = load_data_online(DATA_URL)
movies_table = pd.read_csv(FILE_PATH)
movies_table.head()
Out[2]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0

5 rows × 28 columns

II. Data Processing

For our analysis we will drop any rows with missing values for gross, budget, title_year, and country, as those data points won't contribute to our analysis. We will also focus on movies produced domestically, so we will drop all rows for movies produced outside the United States.

In [3]:
# replace na values with 0
movies_table["gross"].fillna(0, inplace=True)
movies_table["budget"].fillna(0, inplace=True)
movies_table["title_year"].fillna(0, inplace=True)
movies_table["country"].fillna("NaN", inplace=True)

# only consider movies made in the USA. Drop all other rows
movies_table.drop(movies_table[~(movies_table["country"].str.contains("USA"))].index, inplace=True)

movies_table.head()
Out[3]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
5 Color Andrew Stanton 462.0 132.0 475.0 530.0 Samantha Morton 640.0 73058679.0 Action|Adventure|Sci-Fi ... 738.0 English USA PG-13 263700000.0 2012.0 632.0 6.6 2.35 24000
6 Color Sam Raimi 392.0 156.0 0.0 4000.0 James Franco 24000.0 336530303.0 Action|Adventure|Romance ... 1902.0 English USA PG-13 258000000.0 2007.0 11000.0 6.2 2.35 0

5 rows × 28 columns

To be able to compare movies across different years we will need to convert gross and budget values into real dollar amounts, in terms of 2016 purchasing power. To accomplish this we will use the Consumer Price Index (CPI) to adjust for inflation.

Let's scrape CPI values for every year from 1913 to 2016, inclusive. Using The US Inflation Calculator as our source, we'll traverse all the rows in the table to build our DataFrame.

In [4]:
url = "http://www.usinflationcalculator.com/inflation/consumer-price-index-and-annual-percent-changes-from-1913-to-2008/"

r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

table = soup.find('table')
rows = table.tbody.findAll('tr')

years = []
cpis = []

for row in rows:
    year = row.findAll('td')[0].get_text()
    if year.isdigit() and int(year) < 2017:
        years.append(int(year))
        # column 13 of each row holds the average annual CPI
        cpis.append(float(row.findAll('td')[13].get_text()))

cpi_table = pd.DataFrame({
    "year": years,
    "avg_annual_cpi": cpis
})

cpi_table.head()
Out[4]:
avg_annual_cpi year
0 9.9 1913
1 10.0 1914
2 10.1 1915
3 10.9 1916
4 12.8 1917

Let's define a function to translate the nominal dollars into real dollars in 2016 using the CPI. We'll use this equation to calculate the real value:

Past dollars in terms of recent dollars = (Dollar amount × Ending-period CPI) ÷ Beginning-period CPI.

In [5]:
def get_real_value(nominal_amt, old_cpi, new_cpi):
    real_value = (nominal_amt * new_cpi) / old_cpi
    return real_value
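
As a quick sanity check, assuming the 1913 CPI of 9.9 from the table above and a 2016 CPI of roughly 240, \$1,000 in 1913 dollars comes out to about \$24,000 in 2016 dollars:

    get_real_value(1000, 9.9, 240.0)   # ≈ 24242.42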

Let's drop all rows in movies_table with a budget, gross, or year of 0, as those rows won't contribute to our analysis:

In [6]:
movies_table.drop(movies_table[(movies_table["budget"] == 0) | (movies_table["gross"] == 0) | 
                                (movies_table["title_year"] == 0)].index, inplace=True)

We're interested in real 2016 dollars, so let's make things easier for ourselves and store the 2016 CPI in a constant.

In [7]:
CPI_2016 = float(cpi_table[cpi_table['year'] == 2016]['avg_annual_cpi'])

Now we're ready to transform the budget and gross for each movie into real 2016 dollar terms:

In [8]:
real_domestic_gross = []
real_budget_values = []

# must transform gross and budget values into real 2016 dollar terms
for index, row in movies_table.iterrows():
    gross = row['gross']
    budget = row['budget']
    year = row['title_year']
    cpi = float(cpi_table[cpi_table['year'] == int(year)]['avg_annual_cpi'])
    
    real_gross = get_real_value(gross, cpi, CPI_2016)
    real_budget = get_real_value(budget, cpi, CPI_2016)
    real_domestic_gross.append(real_gross)
    real_budget_values.append(real_budget)

movies_table["real_domestic_gross"] = real_domestic_gross
movies_table["real_budget"] = real_budget_values   

We'll also drop the nominal value columns, as those won't contribute to our analysis.

In [9]:
# drop the gross and budget cols because we won't use the nominal values
movies_table.drop(labels='gross', axis=1, inplace=True)
movies_table.drop(labels='budget', axis=1, inplace=True)

movies_table.head() 
Out[9]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes genres actor_1_name ... language country content_rating title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes real_domestic_gross real_budget
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 Action|Adventure|Fantasy|Sci-Fi CCH Pounder ... English USA PG-13 2009.0 936.0 7.9 1.78 33000 8.507937e+08 2.651368e+08
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 Action|Adventure|Fantasy Johnny Depp ... English USA PG-13 2007.0 5000.0 7.1 2.35 0 3.582208e+08 3.473329e+08
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 Action|Thriller Tom Hardy ... English USA PG-13 2012.0 23000.0 8.5 2.35 164000 4.684551e+08 2.613385e+08
5 Color Andrew Stanton 462.0 132.0 475.0 530.0 Samantha Morton 640.0 Action|Adventure|Sci-Fi Daryl Sabara ... English USA PG-13 2012.0 632.0 6.6 2.35 24000 7.637218e+07 2.756598e+08
6 Color Sam Raimi 392.0 156.0 0.0 4000.0 James Franco 24000.0 Action|Adventure|Romance J.K. Simmons ... English USA PG-13 2007.0 11000.0 6.2 2.35 0 3.896268e+08 2.987063e+08

5 rows × 28 columns

Let's also calculate the return on investment (ROI) and absolute profit for each movie. The ROI shows how much profit a studio earned relative to its initial budget for the movie, and will be useful in evaluating the success of a film in an economic sense. We will store the ROI values as percentages.

In [10]:
profits = []
roi_vals = []

for index, row in movies_table.iterrows():
    budget = row['real_budget']
    profit = row['real_domestic_gross'] - budget
    # ROI = net profit / initial budget, expressed as a percentage
    roi = (profit / budget) * 100
    
    profits.append(profit)
    roi_vals.append(roi)
    
movies_table['profit'] = profits
movies_table['roi'] = roi_vals

movies_table.head()
Out[10]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes genres actor_1_name ... content_rating title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes real_domestic_gross real_budget profit roi
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 Action|Adventure|Fantasy|Sci-Fi CCH Pounder ... PG-13 2009.0 936.0 7.9 1.78 33000 8.507937e+08 2.651368e+08 5.856569e+08 220.888543
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 Action|Adventure|Fantasy Johnny Depp ... PG-13 2007.0 5000.0 7.1 2.35 0 3.582208e+08 3.473329e+08 1.088790e+07 3.134717
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 Action|Thriller Tom Hardy ... PG-13 2012.0 23000.0 8.5 2.35 164000 4.684551e+08 2.613385e+08 2.071167e+08 79.252257
5 Color Andrew Stanton 462.0 132.0 475.0 530.0 Samantha Morton 640.0 Action|Adventure|Sci-Fi Daryl Sabara ... PG-13 2012.0 632.0 6.6 2.35 24000 7.637218e+07 2.756598e+08 -1.992877e+08 -72.294775
6 Color Sam Raimi 392.0 156.0 0.0 4000.0 James Franco 24000.0 Action|Adventure|Romance J.K. Simmons ... PG-13 2007.0 11000.0 6.2 2.35 0 3.896268e+08 2.987063e+08 9.092051e+07 30.438102

5 rows × 30 columns
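
Incidentally, the loop in the cell above can be collapsed into two vectorized column operations that should produce the same columns:

    movies_table['profit'] = movies_table['real_domestic_gross'] - movies_table['real_budget']
    movies_table['roi'] = movies_table['profit'] / movies_table['real_budget'] * 100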

Now that we have tidied and processed the data, adding new data that will aid our analysis, we can proceed to the next step of the data science pipeline.

III. Exploratory Analysis and Data Visualization

We'll use ggplot, Seaborn, and matplotlib for making our visualizations.

In [11]:
from ggplot import *
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Analyzing IMDB score vs. gross

Let's see if there is a relationship between a film's IMDB score and its box office gross by creating a simple scatter plot of a sample of the data.

In [12]:
x = 'imdb_score'
y = 'real_domestic_gross'

# choose a subset to represent the dataset
subset_data = movies_table.sample(500)[[x, y]]
subset_data.sort_values(['imdb_score'], ascending=True, inplace=True)
subset_data.dropna(inplace=True)

subset_data.plot(x=x, y=y, kind='scatter')
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x11223f908>

The sample plot is rather top-heavy: most points cluster at an IMDB score above 5 and a real domestic gross below 0.2 (i.e., \$200 million, since the y-axis is scaled by 10^9). It is also interesting that this sample almost suggests an exponential relationship between a film's IMDB score and its box office gross.
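
One quick way to probe that impression is to replot the sample with a logarithmic y-axis, where an exponential relationship would look roughly linear:

    subset_data.plot(x=x, y=y, kind='scatter', logy=True)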

Let's also create violin plots, which combine box plots and density plots. We can use the violin plot to better visualize the spread and density of the data distribution.

In [13]:
ggplot(aes(x=x, y=y), data=subset_data) +\
    geom_violin() +\
    labs(title="Imdb score vs. gross profit",
         x = x,
         y = y)
Out[13]:
<ggplot: (-9223372036567387125)>

We notice that films with higher IMDB scores have a wider range of real domestic gross. This makes some intuitive sense: better-scoring films can perform very differently in terms of revenue. The long thin tails suggest that only a few do really well, while most sit below \$0.5 billion, where the violins are densest. We also see that low-scoring films, especially the worst, have a smaller range and don't make much money at all.

Analyzing IMDB score vs. gross by genre

Next, we wanted to analyze whether genre plays a role in the correlation between IMDB score and gross. In our original dataset each movie can have multiple genres, so we split each row into multiple rows, one per genre the movie was classified under.

In [14]:
# Subset data #2: build one row per (movie, genre) pair
values = []
for idx, row in movies_table.iterrows():
    genres_val = row['genres']
    
    if not pd.isnull(genres_val) and not pd.isnull(row['real_domestic_gross']):
        for genre in genres_val.split('|'):
            values.append([row[x], row[y], genre])    
        
subset_data2 = pd.DataFrame(values, columns=[x, y, 'genre'])
subset_data2 = subset_data2.groupby(['imdb_score', 'genre'], as_index=False).mean()
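
As an aside, on newer versions of pandas (0.25+) the same row-per-genre expansion can be written without an explicit loop; a minimal sketch:

    exploded = (movies_table.dropna(subset=['genres', y])
                .assign(genre=lambda df: df['genres'].str.split('|'))
                .explode('genre'))
    alt_subset = exploded.groupby([x, 'genre'], as_index=False)[y].mean()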

Now let's graph the IMDB score vs. gross relationship for each genre in our dataset.

In [15]:
# Plot the data
g = sns.lmplot(x=x, y=y, data=subset_data2, col="genre", hue="genre", scatter=True, fit_reg=True, col_wrap=3)
plt.show()

For many of the genres there's a small positive linear relationship between the two. "War" and "History" both have an outlier (perhaps the same film?) that skews their line of best fit. It should also be noted that there's not enough data for the "Short" and "Film-Noir" genres to make a proper analysis. All this suggests that genre alone is not a good indicator of real domestic gross.
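If we wanted to blunt the effect of such outliers, seaborn's lmplot can fit a robust regression line instead of an ordinary one (this option requires statsmodels to be installed); a hedged sketch:

    g = sns.lmplot(x=x, y=y, data=subset_data2, col="genre", hue="genre",
                   robust=True, fit_reg=True, col_wrap=3)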

Analyzing film earnings

Next, let's make use of the ROI we calculated earlier and try to analyze which films brought studios the most bang for their buck.

In [16]:
movies_by_roi = movies_table.sort_values('roi', ascending=False)

for index, row in movies_by_roi.head().iterrows():
    print(row["movie_title"], row["roi"])
Paranormal Activity  719348.5533333333
Tarnation  271466.05504587153
The Blair Witch Project  234116.85666666666
The Brothers McMullen  40886.40000000001
The Texas Chain Saw Massacre  36842.72853517215

Out of all the movies in our dataset, Paranormal Activity, the 2009 indie horror film, yielded the greatest ROI. With a budget of around \$15,000, it grossed \$107,917,283 at the box office during its theatrical run. We found this statistic quite amazing. It is also interesting that three of the top 5 ROI films in our dataset are horror movies.

Now, let's rank the movies by greatest absolute profit.

In [17]:
movies_by_profit = movies_table.sort_values('profit', ascending=False)

for index, row in movies_by_profit.head().iterrows():
    print(row["movie_title"], row["profit"])
Gone with the Wind  3361449602.0105033
Snow White and the Seven Dwarfs  3048847005.4440975
Star Wars: Episode IV - A New Hope  1781975398.5091584
Pinocchio  1400612278.5714285
Fantasia  1270665631.4285712

What were their respective IMDB scores?

In [18]:
movies_by_profit[["movie_title","profit","imdb_score"]].head()
Out[18]:
movie_title profit imdb_score
3970 Gone with the Wind 3.361450e+09 8.2
4449 Snow White and the Seven Dwarfs 3.048847e+09 7.7
3024 Star Wars: Episode IV - A New Hope 1.781975e+09 8.7
1143 Pinocchio 1.400612e+09 7.5
4225 Fantasia 1.270666e+09 7.8

Next, let's rank the movies by greatest real domestic gross at the box office.

In [19]:
movies_by_gross = movies_table.sort_values('real_domestic_gross', ascending=False)

for index, row in movies_by_gross.head().iterrows():
    print(row["movie_title"], row["real_domestic_gross"])
Gone with the Wind  3430119230.7155395
Snow White and the Seven Dwarfs  3082181310.999653
Star Wars: Episode IV - A New Hope  1825541025.5718646
Pinocchio  1445185007.142857
Fantasia  1309752485.7142856

Let's rank the movies by greatest IMDB scores.

In [20]:
movies_by_score = movies_table.sort_values('imdb_score', ascending=False)

for index, row in movies_by_score.head().iterrows():
    print(row["movie_title"], row["imdb_score"], row["real_domestic_gross"])
The Shawshank Redemption  9.3 45898454.45535088
The Godfather  9.2 774119909.896268
The Dark Knight  9.0 594509077.2187428
The Godfather: Part II  9.0 278953369.168357
Pulp Fiction  8.9 174790523.0094467

It is interesting to see that critically acclaimed movies aren't necessarily the highest-earning movies. It is also amazing that, once we adjust for inflation, movies made in the 1930s and 1940s are among the highest domestic-grossing movies of all time.

IV. Analysis and ML

Linear regression on real domestic gross

These are the attributes we have to work with:

In [21]:
movies_table.columns
Out[21]:
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'genres', 'actor_1_name', 'movie_title',
       'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name',
       'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link',
       'num_user_for_reviews', 'language', 'country', 'content_rating',
       'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio',
       'movie_facebook_likes', 'real_domestic_gross', 'real_budget', 'profit',
       'roi'],
      dtype='object')

We'll use statsmodels for the regressions and scikit-learn for machine learning.

In [22]:
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import r2_score

Let's do linear regression with ordinary least squares.

In [23]:
new_data = movies_table.dropna()

x_columns = ['imdb_score', 'real_budget', 'num_critic_for_reviews', 'director_facebook_likes',
             'actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes',
             'movie_facebook_likes', 'cast_total_facebook_likes']
y_column = 'real_domestic_gross'


# passing a DataFrame lets statsmodels pick up the predictor names automatically
X = new_data[x_columns]
X_OLS = sm.add_constant(X)
y = new_data[y_column].values


model = sm.OLS(y, X_OLS)
results = model.fit()
results.summary()
Out[23]:
OLS Regression Results
Dep. Variable: y R-squared: 0.210
Model: OLS Adj. R-squared: 0.207
Method: Least Squares F-statistic: 87.69
Date: Fri, 19 May 2017 Prob (F-statistic): 4.54e-145
Time: 10:13:22 Log-Likelihood: -59975.
No. Observations: 2987 AIC: 1.200e+05
Df Residuals: 2977 BIC: 1.200e+05
Df Model: 9
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
const -1.803e+08 1.51e+07 -11.901 0.000 -2.1e+08 -1.51e+08
imdb_score 3.272e+07 2.43e+06 13.492 0.000 2.8e+07 3.75e+07
real_budget 0.9645 0.051 18.867 0.000 0.864 1.065
num_critic_for_reviews -1.381e+04 2.89e+04 -0.478 0.633 -7.05e+04 4.28e+04
director_facebook_likes 589.3956 718.648 0.820 0.412 -819.702 1998.494
actor_1_facebook_likes -9357.4739 2016.709 -4.640 0.000 -1.33e+04 -5403.190
actor_2_facebook_likes -9581.0271 2131.094 -4.496 0.000 -1.38e+04 -5402.461
actor_3_facebook_likes -1.01e+04 3312.501 -3.049 0.002 -1.66e+04 -3604.240
movie_facebook_likes 124.1345 154.513 0.803 0.422 -178.829 427.098
cast_total_facebook_likes 9090.3678 2012.483 4.517 0.000 5144.370 1.3e+04
Omnibus: 5361.692 Durbin-Watson: 1.739
Prob(Omnibus): 0.000 Jarque-Bera (JB): 9071676.065
Skew: 12.645 Prob(JB): 0.00
Kurtosis: 271.793 Cond. No. 4.76e+08

Which variables are significant? Which aren't?

By looking at the p-values and using a significance level of 5%, we can see that imdb_score, real_budget, the three actor_*_facebook_likes variables, and cast_total_facebook_likes are all significant. On the other hand, num_critic_for_reviews, director_facebook_likes, and movie_facebook_likes aren't significant because their p-values are all greater than 0.05.
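
These can also be read off programmatically from the fitted results object; a small sketch, assuming the results variable above:

    pvals = results.pvalues.drop('const')
    print("significant:", list(pvals[pvals < 0.05].index))
    print("not significant:", list(pvals[pvals >= 0.05].index))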

Interpretations for significant variables

An increase of IMDB score of 1 increases the real domestic gross on average by about \$32.7 million, holding all other predictors constant.

An increase of budget by \$1 increases the real domestic gross on average by about \$0.96, holding all other predictors constant.

An increase of actor 1's facebook likes by 1 decreases the real domestic gross by \$9357.47, holding all other predictors constant.

An increase of actor 2's facebook likes by 1 decreases the real domestic gross by \$9581.03, holding all other predictors constant.

An increase of actor 3's facebook likes by 1 decreases the real domestic gross by \$10100.00, holding all other predictors constant.

An increase of the cast total facebook likes by 1 increases the real domestic gross by \$9090.37, holding all other predictors constant.

Most of the results make sense: a higher score, budget, and cast total facebook likes lead to a higher real domestic gross. What was most surprising was that an increase in an individual lead actor's facebook likes is associated with a lower real domestic gross. This could be due to collinearity with the cast total facebook likes.
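
One quick way to check that hypothesis is to inspect the correlations between the like-count predictors; a small sketch:

    like_cols = ['actor_1_facebook_likes', 'actor_2_facebook_likes',
                 'actor_3_facebook_likes', 'cast_total_facebook_likes']
    new_data[like_cols].corr()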

Linear regression on ROI

Let's now try linear regression on the ROI data we collected.

In [24]:
y2_column = 'roi'

X_OLS = sm.add_constant(X)
y2 = new_data[y2_column].values

model = sm.OLS(y2, X_OLS)
results = model.fit()
results.summary()
Out[24]:
OLS Regression Results
Dep. Variable: y R-squared: 0.008
Model: OLS Adj. R-squared: 0.005
Method: Least Squares F-statistic: 2.555
Date: Fri, 19 May 2017 Prob (F-statistic): 0.00636
Time: 10:13:22 Log-Likelihood: -32906.
No. Observations: 2987 AIC: 6.583e+04
Df Residuals: 2977 BIC: 6.589e+04
Df Model: 9
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
const -31.6642 1756.858 -0.018 0.986 -3476.443 3413.114
imdb_score 20.3356 281.325 0.072 0.942 -531.275 571.946
real_budget -1.814e-05 5.93e-06 -3.058 0.002 -2.98e-05 -6.51e-06
num_critic_for_reviews 13.3255 3.352 3.976 0.000 6.754 19.897
director_facebook_likes -0.0420 0.083 -0.504 0.614 -0.205 0.121
actor_1_facebook_likes 0.1299 0.234 0.555 0.579 -0.329 0.589
actor_2_facebook_likes 0.1147 0.247 0.464 0.643 -0.370 0.599
actor_3_facebook_likes 0.2002 0.384 0.521 0.602 -0.553 0.954
movie_facebook_likes -0.0418 0.018 -2.330 0.020 -0.077 -0.007
cast_total_facebook_likes -0.1436 0.233 -0.615 0.538 -0.601 0.314
Omnibus: 8866.613 Durbin-Watson: 2.005
Prob(Omnibus): 0.000 Jarque-Bera (JB): 453170322.937
Skew: 41.534 Prob(JB): 0.00
Kurtosis: 1909.368 Cond. No. 4.76e+08

Performing linear regression on ROI yielded strange results. The significant variables in this model were real_budget, num_critic_for_reviews, and movie_facebook_likes. For real_budget, holding all other predictors constant, each additional dollar of budget decreases ROI by about 1.8e-05 percentage points, which works out to roughly 18 percentage points per \$1 million of budget. Also, each additional movie facebook like, holding all other predictors constant, is associated with a roughly 0.04 percentage-point reduction in the film's ROI. This seems counter-intuitive, because we expected that a film with more facebook likes would see more success at the box office.

In conclusion, performing linear regression on ROI didn't yield the best results and the model doesn't fit our data very well.

Decision Trees and Cross Validation

Next we are going to use decision trees and cross validation.

Decision trees are a type of machine learning model, used for both classification and regression, that predicts a value based on a set of decision rules. Decision trees learn these rules (if-then-else statements) from the training data.

Cross validation is a way of testing whether our model generalizes well. We split the data into two groups, training and testing (normally there is much more training data than testing, so the split is usually skewed towards something like 70/30). We use the training data to build the decision tree, then use the testing data to see how well the model does. If it doesn't do as well as we expected, we may have overfitted the model to the training data, and we may need to prune the tree by changing specific parameters. This is the basic rundown of cross validation.

You can learn more about these machine learning concepts and their Python implementation from the sklearn documentation for Decision Trees and Cross Validation.
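
For reference, scikit-learn can run a full k-fold cross validation in one call; a minimal sketch, assuming the X and y defined in the regression section above:

    from sklearn.model_selection import cross_val_score
    
    # 5-fold CV: fit on 4/5 of the data, score R^2 on the held-out fifth, repeat 5 times
    scores = cross_val_score(tree.DecisionTreeRegressor(min_samples_split=3),
                             X, y, cv=5, scoring='r2')
    print(scores.mean(), scores.std())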

Let's train a decision tree regressor on our data.

In [25]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

# fit a regression tree on the training split and score it on the held-out split
regressor = tree.DecisionTreeRegressor(min_samples_split=3)
y_pred = regressor.fit(X_train, y_train).predict(X_val)
r2 = r2_score(y_val, y_pred)
r2
Out[25]:
-0.086923009596375334

The R-squared score here is actually negative, which means the model fits our data worse than simply predicting the mean. All in all, linear regression on real_domestic_gross yielded the most interesting results, while the other models didn't fit our data well.
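
For context, a model that always predicts the training mean would score roughly zero on this metric; a quick sketch using the split above:

    baseline = np.full(len(y_val), y_train.mean())
    r2_score(y_val, baseline)   # close to 0, so our tree does worse than guessing the mean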

V. Insight

Our analysis seems to indicate that IMDB score, budget, and cast facebook likes all correlate positively with a film's real domestic gross at the box office. More research is needed to build models that fit our data better and predict a film's gross in a more sensible and intuitive way. Some next steps could be to construct a model based on film genre and see whether, and to what extent, genre impacts a film's gross and ROI. Also, what features do films with a high ROI and gross have in common? All in all, this study shed some interesting light on the economics and patterns behind films produced in the USA.

VI. Resources

Our tutorial introduced only some of the packages and libraries available in Python. Many more details are available below: