Movie Industry Exploration with Data
EDA on film industry using open source information
Overview
The goal of this analysis was to distill open-source information on movies into some interesting insights to better understand the movie industry from a business perspective.
I set out to answer four main questions that I thought would generate interesting findings:
- What types of movies (genres) are superior from an investment standpoint?
- What budget should be used to make a profitable movie?
- Which actors, actresses and directors deliver the best returns on a movie?
- When are the optimal times of year to release a movie?
The link to the repo for viewing the entirety of the project is here.
Data Considerations
Sources
I found that TMDB and IMDB were both comprehensive and fairly accessible sources of data. TMDB has a well documented API. IMDB doesn’t have a public API, but provides several large datasets for free.
Data scraping
I started with an initial dataset that was downloaded from TMDB’s API. The dataset was very useful, as it provided a comprehensive list of movies. However, it lacked a few fields that I wanted, such as genres, revenue, budget and release date, so I queried the TMDB API again for each movie to fill them in. I wrote a simple script that loaded the original csv file into a dataframe, added the necessary information to it, and exported a new file.
The method to retrieve the extra information from the API is here:
import requests

# failed requests are logged here (assumed to be module-level lists)
errors, ids = [], []

def id_to_info(id):
    r = requests.get('https://api.themoviedb.org/3/movie/{movie_id}?api_key=key&language=en-US&sort_by=revenue.desc&include_all_movies=true'.format(movie_id=id))
    if r.status_code == 200:
        data = r.json()
        # replace missing values with defaults; each field is checked independently
        if data['runtime'] is None:
            data['runtime'] = 0
        if data['imdb_id'] is None:
            data['imdb_id'] = 'nan'
        return {'revenue': data['revenue'], 'budget': data['budget'],
                'runtime': data['runtime'], 'imdb_id': data['imdb_id']}
    else:
        # record the failed status code and movie id, then return placeholders
        errors.append(r.status_code)
        ids.append(id)
        return {'revenue': 0, 'budget': 0, 'runtime': 0, 'imdb_id': 'nan'}
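The script that applied this lookup to the whole dataframe is not shown. A minimal sketch of how it might look follows, assuming the dataframe has an `id` column holding TMDB movie ids. The names `enrich_movies` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for `id_to_info` so the example runs without network access:

```python
import pandas as pd

def enrich_movies(movies, fetch_info):
    """Return a copy of `movies` with the fields from `fetch_info` added as columns."""
    # call the lookup once per movie id; each call returns a dict of extra fields
    extra = pd.DataFrame(movies['id'].map(fetch_info).tolist(), index=movies.index)
    return pd.concat([movies, extra], axis=1)

# hypothetical stand-in for id_to_info, returning made-up values
def fake_fetch(movie_id):
    return {'revenue': movie_id * 10, 'budget': movie_id * 5,
            'runtime': 100, 'imdb_id': 'tt' + str(movie_id)}

movies = pd.DataFrame({'id': [1, 2], 'title': ['A', 'B']})
enriched = enrich_movies(movies, fake_fetch)
```

The enriched dataframe could then be exported with `enriched.to_csv(...)` under whatever filename suits the project.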
Data Cleaning
After loading the data and doing some initial investigation, it was clear that some of the columns needed reformatting. For example, entries in the genre ids and dates columns were stored as strings and needed to be converted to lists of numbers and datetime objects respectively, so the data could be properly manipulated.
from datetime import datetime

# convert the genre ids from strings into lists of numbers
def convert_list(lst):
    try:
        lst1 = lst.strip('][').split(', ')
        return [int(x) for x in lst1]
    except (AttributeError, ValueError):
        # non-string or malformed entries (for example an empty list) fall back to 0
        return 0

# convert the genre column to lists of numbers
tmdb_movies['genre_ids'] = tmdb_movies.genre_ids.apply(convert_list)

# convert the release date column to datetime
tmdb_movies.release_date = tmdb_movies.release_date.apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
Inflation
Movie data from TMDB was useful, but it occurred to me while looking at older movies that the information was not adjusted for inflation. I’ve come across other reports ranking movies while taking inflation into account, so I decided to explore the idea myself! In order to compare budget and revenue for movies made in the past, all monetary information was adjusted to 2020 dollars. I gathered historical inflation rates from the US Inflation Calculator website.
Then, the rates from the website were added to a dictionary. Each key-value pair was a year and its corresponding inflation rate:
# dictionary format ir_dict
{year: inflation_rate}
The method from there was to take the target dataframe of movie information and loop through each row, taking the year a movie was made, and then adjusting the revenue and budget information to 2020 dollars. Below is the calculation I used to do so:
for curr_year in range(year, 2021, 1):
    inflation_rate = ir_dict[curr_year]
    new_revenue = (current_revenue * float(inflation_rate/100)) + current_revenue
    current_revenue = new_revenue
    new_budget = (current_budget * float(inflation_rate/100)) + current_budget
    current_budget = new_budget
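The same compounding can be wrapped in a small function, which makes it easier to check on a tiny example. This is a sketch that mirrors the loop above (including the release year's own rate). The function name `adjust_for_inflation` is hypothetical, and the rates are made up for illustration, not real inflation figures:

```python
def adjust_for_inflation(amount, year, ir_dict, target_year=2020):
    """Compound `amount` from `year` through `target_year` using yearly rates in percent."""
    adjusted = amount
    for curr_year in range(year, target_year + 1):
        adjusted *= 1 + ir_dict[curr_year] / 100
    return adjusted

# made-up rates for illustration only
example_rates = {2018: 2.0, 2019: 2.0, 2020: 1.0}

# 100 * 1.02 * 1.02 * 1.01
adjusted = adjust_for_inflation(100.0, 2018, example_rates)
```

Because revenue and budget are scaled by the same factor, the return multiple (revenue / budget) of a movie is unchanged by this adjustment. Only comparisons of dollar amounts across years are affected.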
Q: What types of movies are recommended?
This is the first question I sought to answer. I did this by grouping movies by genre and comparing the revenue and returns for each group. Returns were calculated as a multiple: revenue divided by budget.
Sources
I used TMDB information, and only considered movies with a budget higher than $1,000 and revenue greater than $0. I also dropped rows where no genre information was included.
What genres generate the highest revenue?
As a first part to this question, I decided to group the movies by genre, and then look at the typical revenue for movies in each category. It is worth noting that the number of movies varied widely between genres, and that a few categories, such as TV movies with only 3 movies in the dataset, were left out of the following visualization as a result. I decided to plot a bar graph displaying revenue for each movie type. I also included the budget for each movie type, so that visualizing the profit margin was possible. Due to several extreme outliers in budget and revenue, the median was chosen as the measure of central tendency.

As seen in the plot, Adventure, Sci-Fi and Animation movies generated the highest median revenue, and horror movies and dramas generated the least. Any of these top three would seem to be a contender for a recommendation to Microsoft.
However, when plotting the median return on investment for each genre (the revenue divided by the budget) other winners emerged:

Horror movies, though generating low revenue ($55MM), actually delivered great returns (2.5X!).
When looking at the revenue and returns side by side, it was clear that Adventure movies may be a good bet. With the highest median revenue ($434MM) and an ROI of 2.30X, they seem to produce both high gross dollars and a higher return multiple than other high grossing genres. The median budget for adventure movies also looks to be a bit high (explored further later), but for a budding studio with purportedly grand plans, Adventure movies seem like they might be the way to go for our pals at Microsoft!
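The per-genre medians discussed in this section could be computed along these lines. This is a sketch on toy data rather than the original notebook code. It assumes `genre_ids` holds lists of genre ids (as produced in the cleaning step), and the titles, ids and dollar amounts are made up:

```python
import pandas as pd

# toy data for illustration only
movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genre_ids': [[12, 878], [12], [27], [27, 18]],
    'budget': [100_000_000, 50_000_000, 10_000_000, 500],
    'revenue': [300_000_000, 100_000_000, 40_000_000, 1_000],
})

# keep movies with a budget above $1,000 and revenue above $0
filtered = movies[(movies['budget'] > 1000) & (movies['revenue'] > 0)].copy()

# return multiple: revenue divided by budget
filtered['roi'] = filtered['revenue'] / filtered['budget']

# one row per (movie, genre) so a movie counts toward each of its genres
exploded = filtered.explode('genre_ids')
exploded['genre_ids'] = exploded['genre_ids'].astype(int)

# median budget, revenue and ROI for each genre
by_genre = exploded.groupby('genre_ids')[['budget', 'revenue', 'roi']].median()
```

Movie D is dropped by the budget filter, so its genres contribute nothing to the medians.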
What genre combinations generate the highest revenue?
In addition to looking at individual genres, I decided to look at dual genre combinations, as movies often fall into more than one category.
Within the TMDB dataset, there were 171 dual-genre combinations. Some combinations occurred frequently, such as (Adventure, Science Fiction), while others (Documentary, TV Movies) did not yield any results. I only considered genre combinations with at least 10 movies.
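For reference, 171 is the number of unordered pairs that can be formed from 19 genres, which is presumably where the count comes from. A sketch of generating the pairs and keeping only the frequent ones, using toy genre lists and a lowered threshold of 2 movies instead of 10 (the helper `count_combo` is hypothetical):

```python
from itertools import combinations

def count_combo(genre_lists, combination):
    """Count movies whose genre list contains every genre in `combination`."""
    return sum(all(g in genres for g in combination) for genres in genre_lists)

# every unordered pair from 19 placeholder genre ids
all_pairs = list(combinations(range(19), 2))

# toy genre lists for illustration only
toy_genres = [[12, 878], [12, 878, 28], [27], [12, 28]]
counts = {pair: count_combo(toy_genres, pair)
          for pair in combinations([12, 27, 28, 878], 2)}

# keep combinations that meet the minimum movie count (2 here, 10 in the analysis)
frequent = {pair: n for pair, n in counts.items() if n >= 2}
```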
I first decided to group the data into a dataframe to view the numerical information in summary form. I did this by creating a number of mini-methods I could use to find all of the records that fell under each genre combination. For example:
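# sort each combination of genre by profits
# create a dataframe from the combinations list
def median_revenue_combo(df, combination):
    # keep only movies whose genre list contains every genre in the combination
    sub_df = df[df.genre_ids.apply(lambda x: all(item in x for item in combination))]
    # median revenue in millions of dollars, rounded to 2 decimals
    return round((sub_df['revenue'].median()/1000000),2)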
It then made it easy to visualize the data by plotting from the dataframe:


As seen here, Adventure/Sci-Fi movies have the runaway lead on revenue, and also generate good average returns (2.9X).
Further investigation could take this a step further: are there any genre combinations that have been rising in value over time and would perhaps poise Microsoft to leverage a trend?
Q: How much money should be spent making a movie?
To answer this question, I also started with the TMDB dataset and again adjusted the numbers for inflation. I chose to only include movies with budgets ≥ $10,000,000.
What’s the average budget for the top 25 most successful high budget movies?
High budget movies here were defined as movies that spent $50MM and up, and then success was measured by return multiple (revenue / budget).
The mean budget for these top 25 movies was $109MM, while the median was $86MM. To explain the large difference here, see the graph below:

Most of the top 25 movies were made with inflation-adjusted budgets of less than $100MM, but a few of the highest budget movies cost significantly more than $100MM. Moreover, looking at returns, the highest budget movies like Avatar did not necessarily yield the best return multiples. Movies such as Deadpool cost significantly less and generated better returns.
Therefore, when making a high budget movie, it is not true that simply spending more will lead to the best returns. It is also certainly possible to generate great returns without the types of spending seen in the highest grossing and most expensive movies like Avatar and The Force Awakens. In conclusion, if making a high budget movie with nothing else specified, using the median of $82MM is recommended.
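The selection described in this section, and the gap between mean and median, can be illustrated with a short sketch. The budgets and revenues below are made up (in $MM), and `top_by_roi` is a hypothetical helper name:

```python
import pandas as pd

def top_by_roi(df, min_budget, n):
    """Return the n movies with the highest return multiple among high budget movies."""
    high = df[df['budget'] >= min_budget].copy()
    high['roi'] = high['revenue'] / high['budget']
    return high.nlargest(n, 'roi')

# toy data in $MM, for illustration only
movies = pd.DataFrame({
    'budget':  [60, 70, 80, 250, 40],
    'revenue': [300, 280, 240, 500, 400],
})

top = top_by_roi(movies, min_budget=50, n=4)
mean_budget = top['budget'].mean()
median_budget = top['budget'].median()
```

A single very expensive movie pulls the mean well above the median, which is why the median is the more representative figure for a skewed set of budgets.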
What if this is broken down by genre?
However, just as with revenue, differences in budget between genres were also expected. I first grouped movies by genre, and then sorted them in descending order by return on investment. I finally took the mean budget from the top 25 movies.
The results:

As seen here, budgets vary quite a bit by genre. This led to the conclusion that the budget needs to be adjusted depending on what type of movie is made.
If using the genre recommendation of making an Adventure movie, a budget of around $135MM is recommended.
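A per-genre version of the same idea could look like the sketch below: sort by return multiple, take the top n rows within each genre, then average their budgets. The data is made up, n is 2 rather than 25, and `mean_budget_of_top` is a hypothetical helper name:

```python
import pandas as pd

def mean_budget_of_top(df, n):
    """Mean budget of the n highest-ROI movies within each genre."""
    exploded = df.explode('genre_ids')
    exploded['genre_ids'] = exploded['genre_ids'].astype(int)
    # sort once, then keep the first n rows of each genre group
    top = exploded.sort_values('roi', ascending=False).groupby('genre_ids').head(n)
    return top.groupby('genre_ids')['budget'].mean()

# toy data for illustration only (budget in $MM)
movies = pd.DataFrame({
    'genre_ids': [[12], [12], [12], [27], [27, 12]],
    'roi':       [5.0, 4.0, 1.0, 6.0, 3.0],
    'budget':    [100, 200, 900, 10, 30],
})

budgets = mean_budget_of_top(movies, n=2)
```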
Does spending more lead to better results?
I also decided to inspect the correlations between budget and various numeric results.
One of the metrics I wanted to look at was IMDB ratings. Luckily, TMDB movie information includes IMDB ids, so dataframes can be merged easily:
# add the ratings column
two_c = q2.merge(title_ratings[['tconst', 'avg_rating']],
                 how='inner',
                 left_on='imdb_id',
                 right_on='tconst')
Here are 4 scatterplots with regression lines, plotting budget vs revenue, popularity, ROI multiple and rating. It appears as though spending more could generally lead to higher revenue, but there wasn’t much of a relationship between the other variables.

The budget vs. revenue plot implies that spending more money generally yields more revenue. Further analysis into the relationship between these two variables could lead to a conclusion on whether this effect tapers off above a certain budget, or if the effect differs based on genre.
It was concluded that spending more may lead to higher revenue, but not necessarily better popularity, returns or a higher rating.
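The scatterplots can be complemented with correlation coefficients. A minimal sketch with pandas, using made-up numbers chosen so that revenue tracks budget closely while rating does not:

```python
import pandas as pd

# toy data for illustration only
df = pd.DataFrame({
    'budget':     [10, 20, 30, 40, 50],
    'revenue':    [25, 45, 60, 85, 100],
    'avg_rating': [7.0, 5.5, 8.0, 6.0, 7.0],
})

# Pearson correlation of each metric with budget
corr_with_budget = df.corr()['budget'].drop('budget')
```

A coefficient near 1 indicates a strong linear relationship and one near 0 indicates little or none. Correlation alone does not show that a bigger budget causes higher revenue.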
Q: Who should be recruited?
To explore this question, I decided to look at actors, actresses and directors using data from several IMDB data sources. Several datasets were provided to us, and the originals are available on IMDB’s website.
Actors and actresses by rating
I first wanted to discover who the top actors were, based on the average rating of the movies they starred in. A number of merged dataframes and groupings later, this dataframe emerged. Actors who were in at least 2 movies and received at least 1MM votes were considered. Microsoft is presumably going to want to see the best of the best, after all.
# top actors have more than 1,000,000 votes and are in at least 2 movies
top_actors = actors[(actors.num_movies > 1) & (actors.sum_votes > 1000000)]
top_actors.head(10)
This is what emerged:

Leonardo DiCaprio and Robert Downey Jr. both seem to have quite a few votes, and both seem to star in very well rated movies.
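The merges and groupings that produce the `actors` dataframe are not shown. One way the aggregation might look is sketched below. It assumes one row per (actor, movie) with that movie's rating and vote count, and the column names `actor`, `num_votes` and `mean_rating` as well as all values are made up:

```python
import pandas as pd

# toy stand-in for the merged cast and ratings data
cast = pd.DataFrame({
    'actor':      ['X', 'X', 'Y', 'Z', 'Z'],
    'avg_rating': [8.0, 7.0, 9.0, 6.0, 6.5],
    'num_votes':  [700_000, 600_000, 2_000_000, 400_000, 300_000],
})

# one row per actor: movie count, total votes and mean rating
actors = cast.groupby('actor').agg(
    num_movies=('avg_rating', 'count'),
    sum_votes=('num_votes', 'sum'),
    mean_rating=('avg_rating', 'mean'),
).reset_index()

# same filter as above: more than 1 movie and more than 1,000,000 votes
top_actors = actors[(actors.num_movies > 1) & (actors.sum_votes > 1000000)]
top_actors = top_actors.sort_values('mean_rating', ascending=False)
```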
Actors and actresses by revenue
However, Microsoft will also want actors whose movies sell!
Taking all actors and actresses, and summing the revenue for the movies they starred in yielded this list of top grossing actors/actresses.

Leonardo DiCaprio’s name has disappeared, but Robert Downey Jr. is at the top! If Robert Downey Jr. has room on his calendar, he seems to be a great (albeit expensive) bet.
Top directors
When looking at top directors, I followed the same method as I did with actors and actresses. I found the following:

The Russo brothers look to be a top choice!
Other questions that were explored included:
Who are the top 10 most underrated actors/actresses?
Who are the top 10 most underrated directors?
Q: When should a movie be released?
For this question, I used movie budget and revenue data from The Numbers.
The final result:

It looks like Microsoft should plan for a late Spring release or, failing that, a late Fall release!
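The steps behind this result are not described, so the following is only one plausible approach: group movies by release month and compare a summary metric. The sketch uses the median return multiple, which is an assumption, and all dates and amounts are made up:

```python
import pandas as pd

# toy data for illustration only (amounts in $MM)
releases = pd.DataFrame({
    'release_date': pd.to_datetime(['2019-05-03', '2018-05-25',
                                    '2019-11-22', '2017-01-13']),
    'budget':  [100, 200, 150, 50],
    'revenue': [400, 600, 450, 60],
})

releases['roi'] = releases['revenue'] / releases['budget']
releases['month'] = releases['release_date'].dt.month

# median return multiple for each release month
by_month = releases.groupby('month')['roi'].median()
best_month = by_month.idxmax()
```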
Final thoughts
Exploring movie data with an initial set was an interesting way of practicing data scraping, cleaning, and exploratory analysis.
Doing so disproved some assumptions I had about the industry and confirmed others. Because I mainly looked at higher budget movies and more popular movie industry professionals, further exploration into less mainstream work could yield interesting results. Taking a deeper look at lower grossing but otherwise popular genres, such as dramas, could also provide more insights into the movie industry. Moreover, my definition of high budget ($50MM and up) was somewhat arbitrary and not based on any industry standard.
It should also be noted that my method for adjusting monetary information for inflation does not account for the change in movie ticket prices.
In a future exercise, exploring information from other data sources, such as Rotten Tomatoes, could be really interesting. I’d also love to compare streaming movies with theatrical releases, given how much new content is being released these days by companies like Netflix. Exploring larger datasets and building something predictive could also be enjoyable. Whatever the future holds, this was one small step into a new world for this aspiring data scientist!