"We Have to Go Back!" - A data-driven look at the LOST actors' careers

In [172]:
from IPython.display import YouTubeVideo, Image, HTML

For some reason I've been thinking a lot about LOST lately, enough that I rewatched the pilot a few nights ago. I got to thinking: how have all of the actors fared in their post-LOST careers? Despite its trials and tribulations, did acting on LOST give them a sense of purpose, just like Jack felt with the Island? Is it time yet for a career-revitalizing LOST reboot?

Normally these questions are relegated to some very simple slide show listicle. However, we don't have to settle for that! We've got data! We can perform a far more interesting analysis than googling "Matthew Fox."

The Data

I scraped all of this data from IMDB, following the process below:

  1. Get each actor that appears on the main page for LOST in IMDB (the top 15 actors by episode appearances).
  2. Go to each actor's page and grab the list of movies and TV shows they have been in.
  3. Go to each of those movie/TV pages and grab the title, score, and year info.

Additional Notes

Minor Roles: To eliminate minor roles, I only counted roles where the actor appeared on the main cast list for that movie or TV show. For example, Jorge Garcia was in two episodes of How I Met Your Mother but doesn't appear on the main HIMYM IMDB cast page, so that role is not counted.

Year info: For TV shows, the year included is the year that TV show premiered. It's not the year in which an actor might have appeared on the show. For example: everyone who appeared in LOST will have a year of 2004, regardless of when they actually started on the show. Actors this impacts:

  • Michael Emerson (first appeared 2006)
  • Elizabeth Mitchell (2006)
  • Ken Leung (2008)

Language: I'll use actors to refer to both actors and actresses throughout this exploration. I'll use the term media to refer to the general collection of TV or Movies.

In [173]:
import pandas as pd
%matplotlib inline

Data Exploration

We'll first read in the dataset and see what our data looks like.

In [161]:
we_have_to_go_back = pd.read_csv('./data/LOST_clean.csv')
print("Total rows:", len(we_have_to_go_back))
we_have_to_go_back.head()  # preview the first five rows
Total rows: 353
   actor         title               score  start_year  type
0  Jorge Garcia  The Wedding Ringer  6.8    2015        Movie
1  Jorge Garcia  Cooties             5.3    2014        Movie
2  Jorge Garcia  iSteve              5.4    2013        Movie
3  Jorge Garcia  The Ordained        6.9    2013        TV Movie
4  Jorge Garcia  Alcatraz            7.1    2012        TV Series

We have 353 total rows listing the actor, the title of the media, the IMDB score, the year that media first aired, and the type of media. Let's first take a look at what different types of media we're working with.

In [162]:
Movie            177
TV Movie          79
TV Series         55
Other/Unknown     23
Video Game        19
dtype: int64
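
A tally like the one above can be produced with pandas' value_counts. This is only a sketch: it assumes the column is named type, as in the preview, and it uses the five Jorge Garcia rows from that preview rather than the full CSV.

```python
import pandas as pd

# Five rows copied from the preview above, standing in for the full CSV.
sample = pd.DataFrame({
    'actor': ['Jorge Garcia'] * 5,
    'title': ['The Wedding Ringer', 'Cooties', 'iSteve', 'The Ordained', 'Alcatraz'],
    'score': [6.8, 5.3, 5.4, 6.9, 7.1],
    'start_year': [2015, 2014, 2013, 2013, 2012],
    'type': ['Movie', 'Movie', 'Movie', 'TV Movie', 'TV Series'],
})

# Count how many rows fall under each media type, most common first.
type_counts = sample['type'].value_counts()
print(type_counts)
```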

We're only going to include the data from Television or Film, and exclude Other/Unknown and Video Game.

In [163]:
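# Keep only film and television rows. The .copy() avoids a
# SettingWithCopyWarning when columns are added to this subset later.
screen_types = ['TV Series', 'Movie', 'TV Movie']
big_and_small_screen = we_have_to_go_back[we_have_to_go_back['type'].isin(screen_types)].copy()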

Now that we've got a clean dataset, let's get a little more information about the scores. LOST's IMDB score is an 8.5, but we have no context to understand whether that's high or low. (Sidebar: Here is a good analysis of the distribution of all IMDB scores)

We've also got to remove the duplicates for this next step. LOST is listed 15 times (once for each actor) hence the spike around 8.5. We'll assume a duplicate is an item with the same title and score.

Let's look at the distribution with a histogram, and also print out some summary statistics.

In [164]:
count    296.000000
mean       6.399324
std        1.059773
min        2.900000
25%        5.800000
50%        6.500000
75%        7.100000
max        9.000000
dtype: float64
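
One way to get deduplicated summary statistics like these is drop_duplicates followed by describe. The sketch below is an assumption about the approach, not the original cell, and it runs on a few rows copied from tables elsewhere in this post instead of the full dataset.

```python
import pandas as pd

# Rows copied from tables elsewhere in this post; not the full dataset.
sample = pd.DataFrame({
    'actor': ['Naveen Andrews', 'Harold Perrineau', 'Ken Leung',
              'Harold Perrineau', "Terry O'Quinn", 'Jorge Garcia'],
    'title': ['Lost', 'Lost', 'Lost', 'Oz',
              'Guts and Glory: The Rise and Fall of Oliver North', 'Alcatraz'],
    'score': [8.5, 8.5, 8.5, 8.9, 9.0, 7.1],
})

# Treat rows with the same title and score as one piece of media.
unique_media = sample.drop_duplicates(subset=['title', 'score'])
unique_scores = unique_media['score']

print(unique_scores.describe())
# With matplotlib available, unique_scores.hist() would draw the histogram.
```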

Initial Scores Recap:

Comparing LOST's 8.5 score to these numbers shows us a few things:

  • It's higher scored than average (6.4)
  • It's higher scored than the median (6.5)
  • It's higher scored than at least 75% of the scores (75th percentile: 7.1)

Also notice the top scored media for any actor is 9.0. Out of curiosity, let's take a look at the top 5 scored items in our dataset:

In [165]:
# DataFrame.sort() was removed from pandas; sort_values() is the replacement.
big_and_small_screen.sort_values('score', ascending=False).head(5)
     actor             title                                              score  start_year  type
171  Terry O'Quinn     Guts and Glory: The Rise and Fall of Oliver North  9.0    1989        TV Movie
295  Harold Perrineau  Oz                                                 8.9    1997        TV Series
23   Naveen Andrews    Lost                                               8.5    2004        TV Series
284  Harold Perrineau  Lost                                               8.5    2004        TV Series
337  Ken Leung         Lost                                               8.5    2004        TV Series

Don't tell Terry O'Quinn what he can't do, because he can clearly star in a highly rated 1989 TV Movie.

Even in our listing of top 5 scores, we already see LOST appearing in there. It seems time to ask the ultimate question:

Is LOST the best rated thing that these actors have starred in?

The next cell finds the maximum score for each actor, then prints the row that score appears on.

Top Scored Media by Actor

In [167]:
     actor               title                                              score  start_year  type
80   Daniel Dae Kim      Lost                                               8.5    2004        TV Series
255  Dominic Monaghan    Lost                                               8.5    2004        TV Series
315  Elizabeth Mitchell  Lost                                               8.5    2004        TV Series
193  Emilie de Ravin     Lost                                               8.5    2004        TV Series
116  Evangeline Lilly    Lost                                               8.5    2004        TV Series
295  Harold Perrineau    Oz                                                 8.9    1997        TV Series
233  Henry Ian Cusick    Lost                                               8.5    2004        TV Series
6    Jorge Garcia        Lost                                               8.5    2004        TV Series
67   Josh Holloway       Lost                                               8.5    2004        TV Series
337  Ken Leung           Lost                                               8.5    2004        TV Series
50   Matthew Fox         Lost                                               8.5    2004        TV Series
205  Michael Emerson     Person of Interest                                 8.5    2011        TV Series
23   Naveen Andrews      Lost                                               8.5    2004        TV Series
171  Terry O'Quinn       Guts and Glory: The Rise and Fall of Oliver North  9.0    1989        TV Movie
102  Yunjin Kim          Lost                                               8.5    2004        TV Series
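
A table like this can be built by grouping on actor and looking up the row with each actor's highest score. The sketch below is one possible approach rather than the original cell, run on four rows taken from this post. Note that idxmax keeps the first row it meets on a tie, which may be why Person of Interest rather than Lost is listed for Michael Emerson.

```python
import pandas as pd

# Four rows copied from this post; not the full dataset.
sample = pd.DataFrame({
    'actor': ['Harold Perrineau', 'Harold Perrineau', 'Jorge Garcia', 'Jorge Garcia'],
    'title': ['Oz', 'Lost', 'Lost', 'Alcatraz'],
    'score': [8.9, 8.5, 8.5, 7.1],
    'start_year': [1997, 2004, 2004, 2012],
    'type': ['TV Series'] * 4,
})

# idxmax returns the row label of each actor's highest score;
# on a tie it keeps the first row it encounters.
best_rows = sample.loc[sample.groupby('actor')['score'].idxmax()]
print(best_rows)
```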

Of the 15 most frequent actors on LOST, only 2 have ever had a major role in something that scores higher than LOST. Note that Person of Interest, Michael Emerson's top entry, is rated the same as LOST, so we're excluding him from the club.

Did LOST help actors get more major roles?

We can also explore how many appearances each actor has had before and after LOST. In order to do that, we'll flag every entry as post-LOST if it started after 2004, then count the number of titles that come before or after.

In [168]:
# side note: not happy with this code... there must be a better way.
import numpy as np  # needed for np.size below

# Flag titles that premiered after LOST's 2004 start year.
big_and_small_screen['post_lost'] = big_and_small_screen['start_year'] > 2004
# Count titles per actor on each side of 2004.
before_and_after = pd.pivot_table(big_and_small_screen, columns=['post_lost'],
                                  values=['start_year'], index=['actor'], aggfunc=np.size).reset_index()
before_and_after['more_after_lost'] = (before_and_after['start_year'][True] - before_and_after['start_year'][False] > 0)
    actor               start_year   more_after_lost
post_lost               False  True
0   Daniel Dae Kim      9      4     False
1   Dominic Monaghan    6      10    True
2   Elizabeth Mitchell  15     8     False
3   Emilie de Ravin     3      12    True
4   Evangeline Lilly    1      3     True
5   Harold Perrineau    16     21    True
6   Henry Ian Cusick    8      9     True
7   Jorge Garcia        7      8     True
8   Josh Holloway       6      8     True
9   Ken Leung           12     8     False
10  Matthew Fox         7      5     False
11  Michael Emerson     9      10    True
12  Naveen Andrews      18     9     False
13  Terry O'Quinn       61     5     False
14  Yunjin Kim          5      8     True
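
As a possibly tidier alternative to the pivot table, pd.crosstab can produce the same kind of before/after counts with flat column names. This is a sketch on a handful of Jorge Garcia rows from this post, so its counts will not match the full table above; the column names before_lost and after_lost are made up for the example.

```python
import pandas as pd

# Jorge Garcia's preview rows plus his Lost entry; not the full dataset.
sample = pd.DataFrame({
    'actor': ['Jorge Garcia'] * 6,
    'title': ['The Wedding Ringer', 'Cooties', 'iSteve',
              'The Ordained', 'Alcatraz', 'Lost'],
    'start_year': [2015, 2014, 2013, 2013, 2012, 2004],
})

# True for anything that premiered after LOST's 2004 start year.
post_lost = (sample['start_year'] > 2004).rename('post_lost')

# crosstab counts rows for every actor / post_lost combination.
counts = pd.crosstab(sample['actor'], post_lost)
# Make sure both columns exist even if one side has no rows at all.
counts = counts.reindex(columns=[False, True], fill_value=0)
counts.columns = ['before_lost', 'after_lost']  # hypothetical names
counts['more_after_lost'] = counts['after_lost'] > counts['before_lost']
print(counts)
```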

9 out of 15 actors had more major roles after 2004. This is a pretty naive comparison, though, since a recurring role on a TV show is only going to count for 1, while an actor who chooses to go to the big screen is going to have multiple movies they're starring in. It also doesn't take into account things like Terry O'Quinn's massive 61 roles before LOST.

On that note, let's see if there's a difference in what type of media the actors starred in before and after LOST. We'll count the number of Movies, TV Series, and TV Movies to each actor's name before and after LOST, then see which of those categories is the highest.

Before LOST: Most Major Roles by Type

In [169]:
# Rows from 2004 or earlier (this includes LOST itself).
pre_LOST_roles = big_and_small_screen[~big_and_small_screen['post_lost']]
# Number of titles per actor and media type.
actor_type_counts = pre_LOST_roles.groupby(['actor','type']).size().reset_index(name='occurrences')
    actor               type       occurrences
0   Daniel Dae Kim      Movie      4
5   Dominic Monaghan    TV Series  3
6   Elizabeth Mitchell  Movie      6
10  Emilie de Ravin     TV Series  2
11  Evangeline Lilly    TV Series  1
12  Harold Perrineau    Movie      13
16  Henry Ian Cusick    TV Movie   4
18  Jorge Garcia        Movie      5
21  Josh Holloway       Movie      4
24  Ken Leung           Movie      9
29  Matthew Fox         TV Series  4
30  Michael Emerson     Movie      6
33  Naveen Andrews      Movie      12
37  Terry O'Quinn       TV Movie   32
39  Yunjin Kim          Movie      4

Note that these also include starring in LOST itself. Henry Ian Cusick is in good company with Terry O'Quinn as a major TV Movie actor! Alright!

Let's quickly tally up the types:

In [170]:
Movie        9
TV Series    4
TV Movie     2
dtype: int64
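
Picking each actor's most common type and then tallying those picks can be sketched as below. This is an assumed approach, not the original cells. The frame has the same actor/type/occurrences columns as above; the top count for each actor comes from the table, while the smaller counts are invented for illustration.

```python
import pandas as pd

# Same shape as actor_type_counts. The top counts (4, 3, 4) come from
# the table above; the other counts are made up for this example.
actor_type_counts = pd.DataFrame({
    'actor': ['Daniel Dae Kim', 'Daniel Dae Kim', 'Dominic Monaghan',
              'Dominic Monaghan', 'Matthew Fox', 'Matthew Fox'],
    'type': ['Movie', 'TV Series', 'Movie', 'TV Series', 'Movie', 'TV Series'],
    'occurrences': [4, 3, 2, 3, 1, 4],
})

# Keep the row with the largest count for each actor.
top_rows = actor_type_counts.groupby('actor')['occurrences'].idxmax()
top_types = actor_type_counts.loc[top_rows]
print(top_types)

# Tally how many actors have each type as their most common one.
tally = top_types['type'].value_counts()
print(tally)
```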

After LOST: Most Major Roles by Type

In [171]:
# Rows that premiered after 2004.
post_LOST_roles = big_and_small_screen[big_and_small_screen['post_lost']]
# Number of titles per actor and media type.
actor_type_counts = post_LOST_roles.groupby(['actor','type']).size().reset_index(name='occurrences')
    actor               type       occurrences
0   Daniel Dae Kim      Movie      3
2   Dominic Monaghan    Movie      6
5   Elizabeth Mitchell  Movie      4
8   Emilie de Ravin     Movie      7
11  Evangeline Lilly    Movie      3
12  Harold Perrineau    Movie      15
15  Henry Ian Cusick    Movie      8
17  Jorge Garcia        Movie      6
20  Josh Holloway       Movie      6
23  Ken Leung           Movie      3
26  Matthew Fox         Movie      5
27  Michael Emerson     Movie      6
30  Naveen Andrews      Movie      5
34  Terry O'Quinn       TV Series  3
35  Yunjin Kim          Movie      6

Everyone but Terry O'Quinn seemed to go to the big screen after LOST.


Our quick, rudimentary analysis gave us some wonderful insight into the acting lives of 15 actors from LOST. Here's what we've learned:

  • The average score for something in which any LOST actor had a major role is 6.4.
  • Of the 15 main LOST actors, LOST was the highest scored media for 13 of them (tied with Person of Interest in Michael Emerson's case).
  • 9 of the 15 actors had more major roles after LOST than before it.
  • 14 of the 15 actors appear to have gone to the big screen after LOST.

Some other stuff we could do