As a group, we are huge movie fanatics and enjoy great films such as The Godfather, Casablanca, and any Tarantino flick. As data scientists, we wanted to dig deeper into the business side of movies and explore the economics behind what makes a successful movie. Specifically, we wanted to examine whether there are any trends among films that lead them to become successful at the box office, and whether a film's box office success correlates with its ratings. A useful analysis would help us predict how well a film will do at the box office before it screens, without having to rely on critics or our own instincts. Essentially, we want to determine if there is a "Hollywood formula" for making a successful movie.

We found an interesting dataset of more than 5,000 IMDB movies, each described by 28 attributes, here: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset.

You can read more about the data set here: https://blog.nycdatascience.com/student-works/machine-learning/movie-rating-prediction/.

We will focus our analysis on domestic gross: how much a film earned domestically at the box office during its initial run. This figure is in nominal terms and will need to be transformed into real terms. It also excludes international earnings, as well as revenue from DVD rentals, television runs, etc. We will restrict our analysis to films produced within the USA.

Kaggle user **chuansun76** was trying to solve the following problem:

- Given that thousands of movies are produced each year, is there a better way for us to tell the greatness of a movie without relying on critics or our own instincts?
- Does the number of human faces in a movie poster correlate with the movie's rating?

We decided to tackle the problem by trying to answer the following questions:

- Do the genre, IMDB score, and popularity of the cast impact a film's success at the box office?
- Are there any movies with a high gross-to-budget ratio (ROI), and why?

We provide two ways to obtain the data scraped by chuansun76: downloading it directly online, or simply reading the downloaded file from our inputs folder.

First let's import the packages and libraries we'll need.

In [1]:

```
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
```

In [2]:

```
import csv  # needed to parse the raw HTTP response body

DATA_URL = "https://raw.githubusercontent.com/sundeepblue/movie_rating_prediction/master/movie_metadata.csv"
FILE_PATH = "input/movie_metadata.csv"

def load_data_online(data_url):
    """Download the CSV over HTTP and parse it into a DataFrame."""
    data = None
    SUCCESS = 200
    r = requests.get(data_url)
    if r.status_code == SUCCESS:
        # Decode data and read it into a DataFrame
        content = r.content.decode('utf-8')
        cr = csv.reader(content.splitlines(), delimiter=',')
        rows = list(cr)
        data = pd.DataFrame(rows[1:], columns=rows[0])
    return data

# movies_table = load_data_online(DATA_URL)
movies_table = pd.read_csv(FILE_PATH)
movies_table.head()
```

Out[2]:

For our analysis we will drop any rows with null values for `gross`, `budget`, `title_year`, and `country`, as those data points won't contribute to our analysis. We will also focus on movies produced domestically, so we will drop all rows for movies produced outside the United States.

In [3]:

```
# replace NA values with 0 (or "NaN" for country) so we can filter on them
movies_table["gross"].fillna(0, inplace=True)
movies_table["budget"].fillna(0, inplace=True)
movies_table["title_year"].fillna(0, inplace=True)
movies_table["country"].fillna("NaN", inplace=True)

# only consider movies made in the USA; drop all other rows
movies_table.drop(movies_table[~(movies_table["country"].str.contains("USA"))].index, inplace=True)
movies_table.head()
```

Out[3]:

To compare movies across different years, we will need to convert gross and budget values into real dollar amounts, in terms of 2016 purchasing power. To accomplish this we will use the Consumer Price Index (CPI) to adjust for inflation.

Let's scrape the average annual CPI for every year from 1913 through 2016. Using The US Inflation Calculator as our source, we'll traverse all the rows in its table to build our DataFrame.

In [4]:

```
url = "http://www.usinflationcalculator.com/inflation/consumer-price-index-and-annual-percent-changes-from-1913-to-2008/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# The CPI figures live in the first table on the page;
# column 13 holds the annual average for each year.
table = soup.find('table')
rows = table.tbody.findAll('tr')

years = []
cpis = []
for row in rows:
    year = row.findAll('td')[0].get_text()
    if year.isdigit() and int(year) < 2017:
        years.append(int(year))
        cpis.append(float(row.findAll('td')[13].get_text()))

cpi_table = pd.DataFrame({
    "year": years,
    "avg_annual_cpi": cpis
})
cpi_table.head()
```

Out[4]:

Let's define a function to translate the nominal dollars into real dollars in 2016 using the CPI. We'll use this equation to calculate the real value:

Past dollars in terms of recent dollars = (Dollar amount × Ending-period CPI) ÷ Beginning-period CPI.

In [5]:

```
def get_real_value(nominal_amt, old_cpi, new_cpi):
    """Convert a nominal dollar amount into real terms via the CPI ratio."""
    real_value = (nominal_amt * new_cpi) / old_cpi
    return real_value
```
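As a quick sanity check of the formula, here's the function exercised on made-up numbers (assuming a 1980 annual-average CPI of roughly 82.4 and a 2016 CPI of roughly 240 — approximate figures for illustration, not the scraped values):

```python
# Reproduced here so the snippet is self-contained.
def get_real_value(nominal_amt, old_cpi, new_cpi):
    real_value = (nominal_amt * new_cpi) / old_cpi
    return real_value

# $100 in 1980 dollars, expressed in 2016 purchasing power:
print(round(get_real_value(100, 82.4, 240.0), 2))  # → 291.26
```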

Next, let's drop any rows in `movies_table` with a `budget`, `gross`, or `title_year` of 0, as those rows won't contribute to our analysis:

In [6]:

```
movies_table.drop(movies_table[(movies_table["budget"] == 0) | (movies_table["gross"] == 0) |
(movies_table["title_year"] == 0)].index, inplace=True)
```

We will always be converting into *2016* dollars, so let's make things easier for ourselves and set the 2016 CPI to a constant.

In [7]:

```
CPI_2016 = cpi_table.loc[cpi_table['year'] == 2016, 'avg_annual_cpi'].iloc[0]
```

Now we're ready to transform the `budget` and `gross` for each movie into real 2016 dollar terms:

In [8]:

```
real_domestic_gross = []
real_budget_values = []

# transform gross and budget values into real 2016 dollar terms
for index, row in movies_table.iterrows():
    gross = row['gross']
    budget = row['budget']
    year = int(row['title_year'])
    cpi = cpi_table.loc[cpi_table['year'] == year, 'avg_annual_cpi'].iloc[0]
    real_domestic_gross.append(get_real_value(gross, cpi, CPI_2016))
    real_budget_values.append(get_real_value(budget, cpi, CPI_2016))

movies_table["real_domestic_gross"] = real_domestic_gross
movies_table["real_budget"] = real_budget_values
```

We'll also drop the nominal value columns, as those won't contribute to our analysis.

In [9]:

```
# drop the gross and budget cols because we won't use the nominal values
movies_table.drop(labels='gross', axis=1, inplace=True)
movies_table.drop(labels='budget', axis=1, inplace=True)
movies_table.head()
```

Out[9]:

Next, let's compute each film's profit and return on investment (ROI). **We will store the ROI values as percentages.**

In [10]:

```
profits = []
roi_vals = []
for index, row in movies_table.iterrows():
    budget = row['real_budget']
    profit = row['real_domestic_gross'] - budget
    # ROI = profit / cost, converted to a percentage
    roi = (profit / budget) * 100
    profits.append(profit)
    roi_vals.append(roi)

movies_table['profit'] = profits
movies_table['roi'] = roi_vals
movies_table.head()
```

Out[10]:

We'll use ggplot, Seaborn, and matplotlib for making our visualizations.

In [11]:

```
from ggplot import *
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
```

Let's see if there is a relationship between a film's IMDB score and its real domestic gross by creating a simple scatter plot of a sample of the data.

In [12]:

```
x = 'imdb_score'
y = 'real_domestic_gross'
# choose a subset to represent the dataset
subset_data = movies_table.sample(500)[[x, y]]
subset_data.sort_values(['imdb_score'], ascending=True, inplace=True)
subset_data.dropna(inplace=True)
subset_data.plot(x=x, y=y, kind='scatter')
```

Out[12]:

In [13]:

```
ggplot(aes(x=x, y=y), data=subset_data) +\
    geom_violin() +\
    labs(title="IMDB score vs. real domestic gross",
         x=x,
         y=y)
```

Out[13]:

Next, we wanted to analyze whether genre played a role in the correlation between IMDB score and gross. In our original dataset each movie can have multiple genres, so we split each row into one row per genre the movie was classified under.

In [14]:

```
# Subset data #2: one row per (movie, genre) pair
values = []
for idx, row in movies_table.iterrows():
    genres_val = row['genres']
    if not pd.isnull(genres_val) and not pd.isnull(row['real_domestic_gross']):
        for genre in genres_val.split('|'):
            values.append([row[x], row[y], genre])

subset_data2 = pd.DataFrame(values, columns=[x, y, 'genre'])
subset_data2 = subset_data2.groupby(['imdb_score', 'genre'], as_index=False).mean()
```

Now let's graph the imdb score vs. gross relationship for each genre in our dataset.

In [15]:

```
# Plot the data: one regression panel per genre
g = sns.lmplot(x=x, y=y, data=subset_data2, col="genre", hue="genre", scatter=True, fit_reg=True, col_wrap=3)
plt.show()
```

Next, let's make use of the ROI we calculated earlier and try to analyze which films brought studios the most bang for their buck.

In [16]:

```
movies_by_roi = movies_table.sort_values('roi', ascending=False)
for index, row in movies_by_roi.head().iterrows():
print(row["movie_title"], row["roi"])
```

Now, let's sort the movies by greatest absolute profit.

In [17]:

```
movies_by_profit = movies_table.sort_values('profit', ascending=False)
for index, row in movies_by_profit.head().iterrows():
print(row["movie_title"], row["profit"])
```

What were their respective IMDB scores?

In [18]:

```
movies_by_profit[["movie_title","profit","imdb_score"]].head()
```

Out[18]:

Next, let's sort the movies by greatest real domestic gross at the box office.

In [19]:

```
movies_by_gross = movies_table.sort_values('real_domestic_gross', ascending=False)
for index, row in movies_by_gross.head().iterrows():
print(row["movie_title"], row["real_domestic_gross"])
```

Let's sort the movies by highest IMDB score.

In [20]:

```
movies_by_score = movies_table.sort_values('imdb_score', ascending=False)
for index, row in movies_by_score.head().iterrows():
print(row["movie_title"], row["imdb_score"], row["real_domestic_gross"])
```

These are the attributes we have to work with:

In [21]:

```
movies_table.columns
```

Out[21]:

We'll use statsmodels and scikit-learn for the regression and machine-learning portions of our analysis.

In [22]:

```
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import r2_score
```

Let's do linear regression with ordinary least squares.

In [23]:

```
new_data = movies_table.dropna()
x_columns = ['imdb_score', 'real_budget', 'num_critic_for_reviews', 'director_facebook_likes',
             'actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes',
             'movie_facebook_likes', 'cast_total_facebook_likes']
y_column = 'real_domestic_gross'

X = new_data[x_columns].values
X_OLS = sm.add_constant(X)
y = new_data[y_column].values

model = sm.OLS(y, X_OLS)
model.data.xnames = ['const'] + x_columns  # label the coefficients in the summary
results = model.fit()
results.summary()
```

Out[23]:

**Which variables are significant? Which aren't?**

Looking at the p-values with a significance level of 5%, we can see that `imdb_score`, `real_budget`, `actor_1_facebook_likes`, `actor_2_facebook_likes`, `actor_3_facebook_likes`, and `cast_total_facebook_likes` are all significant variables. On the other hand, `num_critic_for_reviews`, `director_facebook_likes`, and `movie_facebook_likes` aren't significant because their p-values are all greater than 0.05.

**Interpretations for significant variables**

An increase of 1 in the IMDB score *increases* the real domestic gross on average by \$3,000,000, holding all other predictors constant.

An increase of \$1 in the budget has a near-zero effect on the real domestic gross (the coefficient rounds to \$0), holding all other predictors constant.

An increase of 1 in actor 1's Facebook likes *decreases* the real domestic gross by \$9,357.47, holding all other predictors constant.

An increase of 1 in actor 2's Facebook likes *decreases* the real domestic gross by \$9,571.03, holding all other predictors constant.

An increase of 1 in actor 3's Facebook likes *decreases* the real domestic gross by \$10,100.00, holding all other predictors constant.

An increase of 1 in the cast's total Facebook likes *increases* the real domestic gross by \$9,090.37, holding all other predictors constant.

Most of the results make sense: a higher score, budget, and cast total Facebook likes lead to a higher real domestic gross. What was most surprising was that an increase in an individual lead actor's Facebook likes was associated with a *lower* gross. This could be due to correlation with the cast total Facebook likes.
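That suspicion about correlated predictors is easy to check. Here's a hedged sketch using a few made-up rows in place of `movies_table` (all column values invented for illustration): because `cast_total_facebook_likes` is roughly the sum of the individual actor-likes columns, the pairwise correlation is very high, which can make individual coefficient signs unstable in OLS.

```python
# Hypothetical stand-ins for the actor-likes columns (invented values).
actor_1_likes = [12000, 40000, 800, 26000, 15000]
actor_2_likes = [5000, 11000, 600, 9000, 7000]
actor_3_likes = [1000, 2000, 300, 4000, 2500]
# cast_total is (roughly) the sum of the individual actors plus minor cast
cast_total = [a + b + c + 500
              for a, b, c in zip(actor_1_likes, actor_2_likes, actor_3_likes)]

def pearson(xs, ys):
    """Plain-Python Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(round(pearson(actor_1_likes, cast_total), 3))  # very close to 1.0
```

The same check could be run against the real columns with `movies_table[...].corr()`.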

Let's now try linear regression on the ROI data we collected.

In [24]:

```
y2_column = 'roi'

X = new_data[x_columns].values
X_OLS = sm.add_constant(X)
y2 = new_data[y2_column].values

model = sm.OLS(y2, X_OLS)
model.data.xnames = ['const'] + x_columns
results = model.fit()
results.summary()
```

Out[24]:

Performing linear regression on ROI yielded strange results. For example, two significant variables that stood out in this model were `real_budget` and `movie_facebook_likes`. For `real_budget`, holding all other predictors constant, a one-dollar increase leads to over an 18,000% decrease in a film's ROI, which seems very inaccurate. And each additional `movie_facebook_likes` a film receives, holding all other predictors constant, reduces the film's ROI by 0.05%. This is counter-intuitive, because we expected a film with more Facebook likes to see more success at the box office.

In conclusion, performing linear regression on ROI didn't yield the best results and the model doesn't fit our data very well.

Next we are going to use **decision trees** and **cross validation**.

Decision trees are a type of machine learning model that learns a set of decision rules (if-then-else statements) from data and uses those rules to predict values.
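To make the idea concrete, here is a hand-written set of decision rules of the kind a tree learner produces; the features, thresholds, and output buckets are all invented for illustration, not learned from our data:

```python
def predict_gross_bucket(imdb_score, real_budget):
    """A toy, hand-crafted 'decision tree': nested if-then-else rules."""
    if real_budget > 100_000_000:      # big-budget branch
        if imdb_score >= 7.0:
            return "high"
        return "medium"
    if imdb_score >= 8.0:              # small budget, but very well rated
        return "medium"
    return "low"

print(predict_gross_bucket(8.1, 150_000_000))  # → high
print(predict_gross_bucket(6.0, 20_000_000))   # → low
```

A fitted `DecisionTreeRegressor` discovers rules of exactly this shape automatically, choosing the features and thresholds that best split the training data.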

Cross validation is a way of testing whether our model generalizes well. We split the data into two groups, training and testing (normally with far more training data than testing, e.g. a 70/30 split). We use the training data to build the decision tree, then use the testing data to see how well the model does. If it doesn't do as well as we expected, we may have overfitted the model to the training data, and may need to prune the tree by changing specific parameters. This is the basic rundown of cross validation.

You can learn more about these machine learning concepts and their Python implementation from the sklearn documentation for Decision Trees and Cross Validation.
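The split described above can be sketched in plain Python (a toy 70/30 holdout on ten invented samples; scikit-learn's `train_test_split` does this, plus more, for real data):

```python
import random

def holdout_split(samples, test_fraction=0.3, seed=42):
    """Shuffle samples and split them into train/test lists (toy sketch)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # first n_test shuffled samples go to the test set, the rest to training
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(list(range(10)))
print(len(train), len(test))  # → 7 3
```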

Let's train a decision tree classifier on our data.

In [25]:

```
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
regressor = tree.DecisionTreeRegressor(min_samples_split=3)
y_pred = regressor.fit(X_train, y_train).predict(X_val)

# score the held-out predictions with R^2
r2 = r2_score(y_val, y_pred)
r2
```

Out[25]:

Predicting `real_domestic_gross` yielded the most interesting results, relatively speaking, while the other models didn't quite fit our data.

Our tutorial introduced only some of the packages and libraries available in Python. More details are available below:

- BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
- Requests: http://docs.python-requests.org/en/master/
- pandas: http://pandas.pydata.org/
- NumPy: http://www.numpy.org/
- ggplot: http://ggplot.yhathq.com/
- Seaborn: https://seaborn.pydata.org/
- Matplotlib: https://matplotlib.org/
- scikit-learn: http://scikit-learn.org