Create Dummies from Array of Objects in Pandas

While I was navigating interesting datasets in Kaggle, I’ve stambled upon the movies dataset.

And the main dataset(movies_metadata.csv) contains metadata for all 45,000 movies listed in the Full MovieLens Dataset.

genres column specifically reminded me that a challenging problem Data Scientists usually run into. Dealing with unstructured data.

Here’s an example value from genres column.

[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]

This value suggests that, this particular movie represents 3 different genres. But since the data is not structured, it is for instance really challenging to look at questions like

  • How many different genres are there?
  • What proportion of movies have Comedy genres in their list?
  • Do popularity/revenue/votes change depending on genres? and many more.

In this post, I will show the steps to turn this array of objects to dummy columns for each genre.

Steps

  1. Convert string object to list object in Pandas DataFrame
  2. Only select necessary info within the list object and create a clean column
  3. Convert newly created column into dummies
  4. Merge it with the original DataFrame

1. Convert string object to list object in Pandas DataFrame

Let’s read the data and look at the genres field on the first row:

import pandas as pd

df = pd.read_csv('./data/movies_metadata.csv')
df.iloc[0]['genres']
"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

Let’s try to access the first element of the first genres value. (It should be {'id': 16, 'name': 'Animation'})

df.iloc[0]['genres'][0]
'['

As you see, we got [, because that column is interpreted as string object.

df[['genres']].dtypes
genres    object
dtype: object

2. Only select necessary info within the list object and create a clean column

How to convert string object into a list in python?

Thankfully, there is a class called ast (Abstract Syntax Trees) in python which helps Python applications to process trees of the Python abstract syntax grammar.

In there, we are specifically interested in ast.literal_eval function. Here in the documentation it says

Safely evaluate an expression node or a string containing a Python literal or container display. The string or node provided may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None. This can be used for safely evaluating strings containing Python values from untrusted sources without the need to parse the values oneself. It is not capable of evaluating arbitrarily complex expressions, for example involving operators or indexing.

So let’s apply ast.literal_eval function to our Pandas DataFrame to see how does it change our string object.

from ast import literal_eval
df['new_genres'] = df['genres'].apply(literal_eval)
df.head().iloc[0]['new_genres']
[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]

If you refer back to the first cell we run (df.iloc[0]['genres']) you see that the content is the same, but the output looks very different in terms of structure, suggesting list object rather than string object.

Let’s verify that;

print(f"Type of the genres column:\t{type(df.iloc[0]['genres'])}")
print(f"Type of the new_genres column:\t{type(df.iloc[0]['new_genres'])}")
Type of the genres column:	<class 'str'>
Type of the new_genres column:	<class 'list'>

Looks great!

Let’s think about how do we want to use the information within this column…

For this particular exercise, we are more interested in creating dummy columns with the genre names. So ideally we want something like the following:

[Animation, Comedy, Family]

We can use python’s inline list iterator to generate that object for the first row:

[element['name'] for element in df.iloc[0]['new_genres']]
['Animation', 'Comedy', 'Family']

That’s exactly what we want in Step 2, let’s apply this to all the rows in the DataFrame. We should also check the edge case here and if the new_genres column is not a list for any reason we should return an empty list

df['genre_names'] = df['new_genres']\
                        .apply(lambda x: [e['name'] for e in x] if isinstance(x,list) else [])
df.head()[['genres','genre_names']]
genres genre_names
0 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... [Animation, Comedy, Family]
1 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... [Adventure, Fantasy, Family]
2 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... [Romance, Comedy]
3 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... [Comedy, Drama, Romance]
4 [{'id': 35, 'name': 'Comedy'}] [Comedy]

3. Convert newly created column into dummies

Now we have much clean genre data, but they are still not easily accessible for further data exploration.

Remember, our goal is to create dummy columns for the genres and assign them to the corresponding movies.

There are 2 different way to do this, the first one runs much faster. (92ms vs 12.7s)

genres_df = pd.DataFrame(df['genre_names'].tolist())
genres_df.head()
0 1 2 3 4 5 6 7
0 Animation Comedy Family None None None None None
1 Adventure Fantasy Family None None None None None
2 Romance Comedy None None None None None None
3 Comedy Drama Romance None None None None None
4 Comedy None None None None None None None
genres_df = df['genre_names'].apply(pd.Series)
genres_df.head()
0 1 2 3 4 5 6 7
0 Animation Comedy Family NaN NaN NaN NaN NaN
1 Adventure Fantasy Family NaN NaN NaN NaN NaN
2 Romance Comedy NaN NaN NaN NaN NaN NaN
3 Comedy Drama Romance NaN NaN NaN NaN NaN
4 Comedy NaN NaN NaN NaN NaN NaN NaN

So now, we have N number of columns (number of distinct genres) for each movie, and non null columns represent each of the movies’ genres.

For instance our first movie is the following:

genres_df.head(1)
0 1 2 3 4 5 6 7
0 Animation Comedy Family None None None None None

In order to create dummies for each of these movies, we need to stack to reshape our DataFrame.

stacked_genres = genres_df.stack()
stacked_genres
0      0    Animation
       1       Comedy
       2       Family
1      0    Adventure
       1      Fantasy
              ...    
45461  1       Family
45462  0        Drama
45463  0       Action
       1        Drama
       2     Thriller
Length: 91106, dtype: object

So, for each of the genres for each movie we now have a row. Here are the rows corresponding our first movie:

stacked_genres[0]
0    Animation
1       Comedy
2       Family
dtype: object

So we converted our cleaned list [Animation, Comedy, Family] to 3 rows, each represents only 1 particular genre. We are now ready to convert this to dummies!

raw_dummies = pd.get_dummies(stacked_genres)
raw_dummies.head()[['Animation','Comedy','Family','Crime','Documentary']]
Animation Comedy Family Crime Documentary
0 0 1 0 0 0 0
1 0 1 0 0 0
2 0 0 1 0 0
1 0 0 0 0 0 0
1 0 0 0 0 0

Here’s how the new DataFrame looks like, for simplicity I only selected a few columns. But in actual DataFrame, we have all the genres.

And if you focus on the first 3 rows, you see that Animation,Comedy and Family columns are 1, and the rest are 0.

Again, remember our goal, we need these dummy columns but we only need one row per movie. So how do we do that?

We just aggregate the newly created DataFrame by summing on index level.

genre_dummies = raw_dummies.sum(level=0)
genre_dummies[['Animation','Comedy','Family','Crime','Documentary']].head()
Animation Comedy Family Crime Documentary
0 1 1 1 0 0
1 0 0 1 0 0
2 0 1 0 0 0
3 0 1 0 0 0
4 0 1 0 0 0

4. Merge it with the original DataFrame

That’s the simplest step after using ast.literal_eval, stack, and get_dummies functions.

ndf = pd.concat([df, genre_dummies], axis=1)
ndf.head()[['original_title','genres','Animation','Comedy','Family','Crime','Documentary']]
original_title genres Animation Comedy Family Crime Documentary
0 Toy Story [{'id': 16, 'name': 'Animation'}, {'id': 35, '... 1.0 1.0 1.0 0.0 0.0
1 Jumanji [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... 0.0 0.0 1.0 0.0 0.0
2 Grumpier Old Men [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... 0.0 1.0 0.0 0.0 0.0
3 Waiting to Exhale [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... 0.0 1.0 0.0 0.0 0.0
4 Father of the Bride Part II [{'id': 35, 'name': 'Comedy'}] 0.0 1.0 0.0 0.0 0.0