Marvelous data¶

Exploring data with pandas

In [85]:

speaker = {'name':'Mai Giménez', 
           'twitter': '@adahopper',
           'weapons': ['python', 'bash', 'C++']}

print('\n'.join(["{}: {}".format(k, v) for k,v in speaker.items()]))

name: Mai Giménez
weapons: ['python', 'bash', 'C++']
twitter: @adahopper

In [2]:

from IPython.display import Image
Image(filename='marvel_logo.jpg')

Out[2]:

Marvel is an American comic book publisher founded by Martin Goodman in 1939, although Marvel as we know it today dates from 1961, with the publication of The Fantastic Four and other superhero stories created by Stan Lee, Jack Kirby, Steve Ditko, ...

Marvel publishes well-known characters such as:

Spider-Man
X-Men
Captain America
Guardians of the Galaxy
...

[Wikipedia]

And all this data is ours!

Collecting data¶

In [3]:

from IPython.core.display import HTML
MARVEL_DEV_SITE = "http://developer.marvel.com/"
HTML("<iframe src={} width=800 height=600></iframe>".format(MARVEL_DEV_SITE))

Out[3]:

Pandas time!¶

Pandas is an open-source library, released under the BSD license, for efficient data analysis in Python.

What pandas is good at:

Efficient data structures (DataFrames) for working with indexed data.
Tools for reading and writing data efficiently. It can work with several formats:
- CSV.
- Text files.
- Microsoft Excel.
- SQL databases.
- HDF5.
- ...
Flexible reshaping and pivoting of data sets.
Intelligent label-based selection, complex indexing, and subsetting of large data sets.
Columns can be inserted and deleted: data sets are mutable.
Easy grouping and merging of data sets.
Time series functionality: efficient handling of date ranges.
...

In [4]:

PANDAS_DEV_SITE = "http://pandas.pydata.org/"
HTML("<iframe src={} width=800 height=600></iframe>".format(PANDAS_DEV_SITE))

Out[4]:

In [5]:

import pandas as pd
import sys
import datetime as dt
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

%matplotlib inline

print("Python version:     ", sys.version)
print("Pandas version:     ", pd.__version__)
print("Numpy version:      ", np.__version__)
print("Matplotlib version: ", matplotlib.__version__)

Python version:      3.3.4 (default, Jul 25 2014, 00:04:27) 
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)]
Pandas version:      0.14.1
Numpy version:       1.8.2
Matplotlib version:  1.4.0

Reading the data¶

Marvel only lets us fetch up to 100 characters/comics per request. There is a Python library for accessing the Marvel API directly, pymarvel, developed by Garrett Pennington for Python 2 and ported to Python 3 as pymarvel3.

The first thing we should do is collect the information from the web and store it locally. But someone else has already had that idea, and we are not going to reinvent the wheel. @asamiller has built a node.js app that crawls the Marvel API and stores the data using Orchestrate. The code is available on GitHub.

In [6]:

from os.path import join, abspath, isfile 
from os import listdir, getcwd, pardir

MARVELOUSDB_PATH = join(abspath(join(getcwd(), pardir)),"marvelousdb","data")
MARVELOUSDB_CHARACTERS = join(MARVELOUSDB_PATH,"characters")
MARVELOUSDB_COMICS = join(MARVELOUSDB_PATH,"comics")

In [7]:

characters_json_db = [join(MARVELOUSDB_CHARACTERS,json_file) for json_file in listdir(MARVELOUSDB_CHARACTERS)]
comics_json_db = [join(MARVELOUSDB_COMICS,json_file) for json_file in listdir(MARVELOUSDB_COMICS)]
print("MarvelousDB holds a backup of {} characters and {} comics".format(len(characters_json_db),
                                                                         len(comics_json_db)))

MarvelousDB holds a backup of 1402 characters and 30180 comics

DataFrame¶

A DataFrame is a two-dimensional structure whose data is labeled by column. The columns of a DataFrame can hold different types. Think of a DataFrame as a spreadsheet or a SQL table.

You can build a DataFrame from:

Dictionaries of 1D ndarrays, lists, dictionaries or (pandas) Series.
A 2D ndarray.
Another DataFrame.
...

When you create a DataFrame you can also specify the index (row labels) and the columns. If you do not pass these labels as arguments, pandas builds the DataFrame using sensible defaults.
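As a minimal sketch of the construction options just described, the cell below builds one DataFrame with explicit row labels and one without. The names and values are made up for illustration and do not come from the Marvel data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Marvel dataset.
heroes = pd.DataFrame(
    {"alias": ["Spider-Man", "Storm"], "comics": [3, 5]},
    index=["peter", "ororo"],  # explicit row labels
)

# Without an index argument pandas falls back to 0, 1, 2, ...
default_index = pd.DataFrame({"alias": ["Spider-Man", "Storm"]})

print(heroes.loc["ororo", "comics"])
print(list(default_index.index))
```

Passing a dictionary of lists gives one column per key; the index argument only changes how rows are addressed, not the data itself.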

In our case we will read all the JSON files and build a DataFrame. Because the JSON files contain hierarchical information we need to normalize (flatten) the data, but pandas has functions that do that for us.
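To make the flattening concrete, here is a small sketch on a hypothetical nested record. It assumes a recent pandas (1.0 or later), where the function is exposed as pd.json_normalize; the cells in this notebook use the older pd.io.json.json_normalize path.

```python
import pandas as pd

# Hypothetical record shaped like a nested API response.
record = {
    "name": "Example Hero",
    "comics": {"available": 3, "returned": 3},
    "wiki": {"eyes": "Blue"},
}

# Nested keys are flattened into dotted column names.
# Assumes pandas >= 1.0; older releases expose pd.io.json.json_normalize.
flat = pd.json_normalize(record)
print(sorted(flat.columns))
```

This is where column names such as comics.available and wiki.eyes come from.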

Idiomatic¶

In [8]:

import json

In [9]:

json_to_dataframe = []
for json_file in characters_json_db:
    with open(json_file, 'r') as jf:
        json_character = json.load(jf)
        json_plain = pd.io.json.json_normalize(json_character)
        json_to_dataframe.append(json_plain)
        
characters_df = pd.concat(json_to_dataframe)
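A possible variation on the loop above, sketched under the assumption of a recent pandas (pd.json_normalize, 1.0 or later): normalize the whole list of records in one call. The two files written here are hypothetical stand-ins for the MarvelousDB backup.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Two hypothetical records standing in for the MarvelousDB files.
    for i, eyes in enumerate(["Blue", "Green"]):
        (folder / f"{i}.json").write_text(
            json.dumps({"id": i, "wiki": {"eyes": eyes}}), encoding="utf-8")

    records = [json.loads(p.read_text(encoding="utf-8"))
               for p in sorted(folder.glob("*.json"))]

# json_normalize also accepts a list of dicts (assumes pandas >= 1.0).
loaded = pd.json_normalize(records)
print(loaded.shape)
print(loaded.index.tolist())
```

One side effect of this variant is a running index (0, 1, ...), whereas concatenating one-row frames leaves every row labeled 0.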

Non idiomatic¶

In [10]:

df = pd.concat([pd.io.json.json_normalize(json.loads(''.join(open(json_file,'r').readlines()))) 
                for json_file in characters_json_db])

We can apply logical operations to every element of a DataFrame; they are vectorized operations.

In [12]:

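# Note: the built-in all() iterates over the column labels of the comparison,
# not over its cells, so this only checks that every label is truthy.
all(df == characters_df)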

Out[12]:

True
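A small sketch of what the element-wise comparison returns and of two ways to reduce it to a single answer. The frames are toy data; the point is that the built-in all() and a cell-by-cell check are not the same thing.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1.0, np.nan], "y": ["a", "b"]})
b = a.copy()

cellwise = a == b                # element-wise comparison, one boolean per cell
print(cellwise["x"].tolist())    # NaN never compares equal to NaN

# The built-in all() iterates over the column labels, not the cells,
# so it is True here even though one cell compared as False.
print(all(cellwise))

# Reducing over both axes checks every cell; equals() additionally
# treats NaNs in matching positions as equal.
print(cellwise.all().all(), a.equals(b))
```

When the frames may contain nulls, equals() is usually the check that matches the intent of "are these two DataFrames the same?".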

In [13]:

comics_df = pd.concat([pd.io.json.json_normalize(json.loads(''.join(open(json_file,'r').readlines()))) 
                       for json_file in comics_json_db if isfile(json_file)])

So what does a DataFrame look like?

In [14]:

characters_df.head()

Out[14]:

comics.available	comics.collectionURI	comics.items	comics.returned	description	events.available	events.collectionURI	events.items	events.returned	id	...	wiki.specieshistory	wiki.team_name	wiki.teamicon	wiki.technology	wiki.tie-ins	wiki.title_graphic	wiki.universe	wiki.weapons	wiki.weaponss	wiki.weight
36	http://gateway.marvel.com/v1/public/characters...	[{'id': 36737, 'resourceURI': 'http://gateway....	36	AIM is a terrorist organization bent on destro...	0	http://gateway.marvel.com/v1/public/characters...	[]	0	1009144	...	NaN	NaN	NaN	NaN	NaN	NaN	[[Marvel Universe]]	NaN	NaN	NaN
43	http://gateway.marvel.com/v1/public/characters...	[{'id': 34050, 'resourceURI': 'http://gateway....	43	Formerly known as Emil Blonsky, a spy of Sovie...	2	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	2	1009146	...	NaN	NaN	NaN	NaN	NaN	NaN	Marvel Universe	None	NaN	(Abomination) 980 lbs.; (Blonsky) 180 lbs.
43	http://gateway.marvel.com/v1/public/characters...	[{'id': 36489, 'resourceURI': 'http://gateway....	43		4	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	4	1009148	...	NaN	NaN	NaN	NaN	NaN	NaN	[[Marvel Universe]]	He uses a prison ball-and-chain as a weapon, a...	NaN	365 lbs. (variable)
8	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	8		1	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	1	1009149	...	NaN	NaN	NaN	NaN	NaN	NaN	[[Marvel Universe]]	Unrevealed	NaN	Unrevealed
20	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	20		0	http://gateway.marvel.com/v1/public/characters...	[]	0	1009150	...	NaN	NaN	NaN	NaN	NaN	NaN	[[Marvel Universe]]	Agent Zero carries a wide array of weapons inc...	NaN	230 lbs.

5 rows × 89 columns

In [15]:

comics_df.tail()

Out[15]:

characters.available	characters.collectionURI	characters.items	characters.returned	collectedIssues	collections	creators.available	creators.collectionURI	creators.items	creators.returned	...	stories.items	stories.returned	textObjects	thumbnail.extension	thumbnail.path	title	urls	variants
0	http://gateway.marvel.com/v1/public/comics/999...	[]	0	[]	[]	0	http://gateway.marvel.com/v1/public/comics/999...	[]	0	...	[{'resourceURI': 'http://gateway.marvel.com/v1...	2	[]	jpg	http://i.annihil.us/u/prod/marvel/i/mg/b/40/im...	Love Romances (1949) #94	[{'type': 'detail', 'url': 'http://marvel.com/...	[]
0	http://gateway.marvel.com/v1/public/comics/999...	[]	0	[]	[]	1	http://gateway.marvel.com/v1/public/comics/999...	[{'resourceURI': 'http://gateway.marvel.com/v1...	1	...	[{'resourceURI': 'http://gateway.marvel.com/v1...	1	[]	jpg	http://i.annihil.us/u/prod/marvel/i/mg/b/40/im...	Love Romances (1949) #96	[{'type': 'detail', 'url': 'http://marvel.com/...	[]
0	http://gateway.marvel.com/v1/public/comics/999...	[]	0	[]	[]	2	http://gateway.marvel.com/v1/public/comics/999...	[{'resourceURI': 'http://gateway.marvel.com/v1...	2	...	[{'resourceURI': 'http://gateway.marvel.com/v1...	1	[]	jpg	http://i.annihil.us/u/prod/marvel/i/mg/b/40/im...	Love Romances (1949) #97	[{'type': 'detail', 'url': 'http://marvel.com/...	[]
0	http://gateway.marvel.com/v1/public/comics/999...	[]	0	[]	[]	2	http://gateway.marvel.com/v1/public/comics/999...	[{'resourceURI': 'http://gateway.marvel.com/v1...	2	...	[{'resourceURI': 'http://gateway.marvel.com/v1...	1	[]	jpg	http://i.annihil.us/u/prod/marvel/i/mg/b/40/im...	Love Romances (1949) #99	[{'type': 'detail', 'url': 'http://marvel.com/...	[]
3	http://gateway.marvel.com/v1/public/comics/999...	[{'resourceURI': 'http://gateway.marvel.com/v1...	3	[]	[]	8	http://gateway.marvel.com/v1/public/comics/999...	[{'resourceURI': 'http://gateway.marvel.com/v1...	8	...	[{'resourceURI': 'http://gateway.marvel.com/v1...	2	[]	jpg	http://i.annihil.us/u/prod/marvel/i/mg/c/a0/4b...	Magneto Rex (1999) #1	[{'type': 'detail', 'url': 'http://marvel.com/...	[]

5 rows × 43 columns

Pandas DataFrames are built on top of numpy, so finding out the dimensions of a DataFrame works exactly as in numpy. Easy, right?

In [16]:

characters_df.shape

Out[16]:

(1402, 89)

In [17]:

comics_df.shape

Out[17]:

(30179, 43)

Let's see what we can find out about the characters.

In [18]:

', '.join(characters_df.columns.values)

Out[18]:

'comics.available, comics.collectionURI, comics.items, comics.returned, description, events.available, events.collectionURI, events.items, events.returned, id, modified, name, resourceURI, series.available, series.collectionURI, series.items, series.returned, stories.available, stories.collectionURI, stories.items, stories.returned, thumbnail.extension, thumbnail.path, urls, wiki.Date_of_birth, wiki.Place_of_birth, wiki.abilities, wiki.aliases, wiki.appearance, wiki.base_of_operations, wiki.bio, wiki.bio_text, wiki.blurb, wiki.builder, wiki.categories, wiki.categorytext, wiki.citizenship, wiki.creator, wiki.creators, wiki.current_members, wiki.debut, wiki.distinguishing_features, wiki.dstinguishing_features, wiki.education, wiki.event_text, wiki.eyes, wiki.features, wiki.former_members, wiki.govenment, wiki.government, wiki.groups, wiki.hair, wiki.height, wiki.home_world, wiki.identity, wiki.key_characters, wiki.key_issues, wiki.leader, wiki.location, wiki.main_image, wiki.members, wiki.object_text, wiki.occupation, wiki.origin, wiki.other_members, wiki.owner, wiki.paraphernalia, wiki.place_of_birth, wiki.place_of_creation, wiki.place_text, wiki.points_of_interest, wiki.power, wiki.powers, wiki.real_name, wiki.relatives, wiki.significant_citizens, wiki.significant_issues, wiki.skin, wiki.special_limitations, wiki.specieshistory, wiki.team_name, wiki.teamicon, wiki.technology, wiki.tie-ins, wiki.title_graphic, wiki.universe, wiki.weapons, wiki.weaponss, wiki.weight'

Actually, we should not celebrate too soon because (spoiler) many of the fields are empty.

In [19]:

characters_df.dropna()

Out[19]:

	comics.available	comics.collectionURI	comics.items	comics.returned	description	events.available	events.collectionURI	events.items	events.returned	id	...	wiki.specieshistory	wiki.team_name	wiki.teamicon	wiki.technology	wiki.tie-ins	wiki.title_graphic	wiki.universe	wiki.weapons	wiki.weaponss	wiki.weight

0 rows × 89 columns

And what about the comics?

In [20]:

comics_df.dropna().shape

Out[20]:

(17516, 43)

With a single instruction we can deal with every null in a DataFrame.
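As a rough illustration of how that single instruction behaves, the sketch below runs dropna on a toy frame with three different settings. The data is invented; only the dropna parameters (how, axis) are real pandas options.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["A", "B", "C"],
    "height": [1.8, np.nan, np.nan],
    "weight": [80.0, 70.0, np.nan],
})

# Default: drop every row that has at least one null.
print(df_demo.dropna().shape)

# how="all": drop a row only when all of its values are null.
print(df_demo.dropna(how="all").shape)

# axis=1: drop columns (instead of rows) that contain nulls.
print(df_demo.dropna(axis=1).columns.tolist())
```

The default is the strictest setting, which is why the characters table above ends up with zero rows.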

Stan Lee¶

Stanley Martin Lieber, better known as Stan Lee, was born on December 28, 1922 in New York City. He is an American comic book writer and editor, the creator of characters notable for their complexity and realism.

He is the co-creator, together with artists such as Steve Ditko and Jack Kirby, of superheroes such as the Fantastic Four, Spider-Man, Hulk, Iron Man, Thor, The Avengers, Daredevil, Doctor Strange, X-Men and many other characters, growing Marvel Comics from a small publishing house into a large multimedia corporation. To this day, Marvel comics are distinguished by always carrying «Stan Lee presents» in their opening credits. He also hosts a show on History Channel in which he looks for real superhumans. [Wikipedia]

Let's see how many characters he has created, and who tops the list of creators according to the Marvel API.

Series¶

A Series is a labeled one-dimensional array, like a table with a single column. It can hold any type of data:

Integers.
Strings.
Floating-point numbers.
Python objects.
...

Its values are labeled by the index; if the index we pass consists of dates, a TimeSeries instance is created. Well thought out, right?

Selecting a single column of a DataFrame produces a Series.
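A short sketch of the three points above, using invented values: a labeled Series, a Series obtained by selecting one column, and a Series indexed by dates. In recent pandas versions the date-indexed case is a regular Series with a DatetimeIndex rather than a separate class.

```python
import pandas as pd

# A Series with an explicit (label) index and a name.
s = pd.Series([167, 230], index=["Spider-Man", "Agent Zero"], name="weight_lbs")
print(s["Agent Zero"])

# Selecting one column of a DataFrame returns a Series named after the column.
frame = pd.DataFrame({"eyes": ["Hazel", "Blue"], "hair": ["Brown", "Black"]})
column = frame["eyes"]
print(type(column).__name__, column.name)

# With dates as labels the index becomes a DatetimeIndex.
dated = pd.Series([1, 2], index=pd.to_datetime(["1961-11-01", "1962-08-01"]))
print(type(dated.index).__name__)
```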

In [21]:

#Stan Lee 
creators_serie = characters_df['wiki.creators'].dropna()
creators_serie.describe()

Out[21]:

count                               119
unique                               37
top       this has not been updated yet
freq                                 44
dtype: object

In [22]:

# Rename the series and the index
creators_serie.name = 'Creadorers de personajes'
creators_serie.index.name = 'creators'

# We can use head or, since this is a Series, take a slice as we would with a list
# creators_serie.head()
creators_serie[:20]

Out[22]:

creators
0            this has not been updated yet
0                                         
0                                         
0                                         
0                                         
0                                         
0                                         
0                  Peter David & Sam Keith
0               Bill Mantlo and Ed Hanigan
0                                         
0                 Stan Lee and Steve Ditko
0             Grant Morrison & Igor Kordey
0                          Chris Claremont
0                                         
0           Chris Claremont & Dave Cockrum
0            this has not been updated yet
0            this has not been updated yet
0                                         
0                     Stan Lee, Jack Kirby
0                           Grant Morrison
Name: Creadorers de personajes, dtype: object

Using masks to extract information¶

In [23]:

default_string = creators_serie != "this has not been updated yet"
default_string.head()
#creators_serie[ creators_serie != "this has not been updated yet" ]

Out[23]:

creators
0           False
0            True
0            True
0            True
0            True
Name: Creadorers de personajes, dtype: bool

In [24]:

empty_string = creators_serie != ""
empty_string[:10]

Out[24]:

creators
0            True
0           False
0           False
0           False
0           False
0           False
0           False
0            True
0            True
0           False
Name: Creadorers de personajes, dtype: bool

In [25]:

default_string and empty_string

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-544bf713079b> in <module>()
----> 1 default_string and empty_string

/Users/ada/Dev/.virtualenvs/marvel/lib/python3.3/site-packages/pandas/core/generic.py in __nonzero__(self)
    690         raise ValueError("The truth value of a {0} is ambiguous. "
    691                          "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
--> 692                          .format(self.__class__.__name__))
    693 
    694     __bool__ = __nonzero__

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Although we might expect the reserved word and to combine two Series, it does not work because the operation is not applied element by element. Pandas anticipates that we need this and provides operators that work element-wise: & (and), | (or), ~ (not).

In [26]:

creators_mask = default_string & empty_string
creators_mask[:10]

Out[26]:

creators
0           False
0           False
0           False
0           False
0           False
0           False
0           False
0            True
0            True
0           False
Name: Creadorers de personajes, dtype: bool

In [27]:

creators_serie[creators_mask].head()

Out[27]:

creators
0                Peter David & Sam Keith
0             Bill Mantlo and Ed Hanigan
0               Stan Lee and Steve Ditko
0           Grant Morrison & Igor Kordey
0                        Chris Claremont
Name: Creadorers de personajes, dtype: object
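The same kind of mask can also be written inline. The sketch below uses a toy Series; note the parentheses around each comparison, which are needed because & and | bind more tightly than != in Python.

```python
import pandas as pd

names = pd.Series(["Stan Lee", "", "this has not been updated yet", "Jack Kirby"])

not_default = names != "this has not been updated yet"
not_empty = names != ""

both = not_default & not_empty      # element-wise AND
either = not_default | not_empty    # element-wise OR
neither = ~either                   # element-wise NOT

# Inline form: each comparison needs its own parentheses.
inline = names[(names != "") & (names != "this has not been updated yet")]

print(names[both].tolist())
print(inline.tolist())
```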

We now have a good part of the information we want, but let's split up the authors who work together so we can count how many characters each one has created.

In [28]:

import re
creators = [re.split('&|and|,', line) for line in creators_serie[creators_mask]]
clean_cretors =  pd.Series([c for creator in creators for c in creator])
clean_cretors.head()

Out[28]:

0    Peter David 
1       Sam Keith
2    Bill Mantlo 
3      Ed Hanigan
4       Stan Lee 
dtype: object

In [29]:

clean_cretors.value_counts().head()

Out[29]:

Chris Claremont     10
Stan Lee             7
 John Byrne          6
Chris Claremont      5
 Jack Kirby          4
dtype: int64

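Wow, Stan Lee, it looks like Chris Claremont beats you!

Obviously this is a problem of missing data. That is why we must be very careful about how much we trust our results. A corpus with errors leads to wrong conclusions, and we need to be aware of that. Note also that the ranking above lists Chris Claremont twice: the split leaves stray whitespace around some names, so the same creator is counted under more than one label.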
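One possible way to tighten the split, sketched on invented credit lines: match the separators together with their surrounding whitespace, and treat "and" as a separator only when it is a whole word. The pattern and the sample names are assumptions for illustration, not the cleaning the notebook actually ran.

```python
import re
import pandas as pd

raw = pd.Series([
    "Peter David & Sam Keith",
    "Stan Lee and Steve Ditko",
    "Stan Lee, Jack Kirby",
    "Alex Example and Sandy Example",   # hypothetical names; "Sandy" contains "and"
])

# Split on '&', ',' or the whole word 'and', swallowing surrounding whitespace.
separator = re.compile(r"\s*(?:&|,|\band\b)\s*")

names = pd.Series([n for line in raw for n in separator.split(line) if n])
counts = names.value_counts()
print(counts["Stan Lee"])

# For comparison, the plain pattern also cuts inside "Sandy".
print(re.split("&|and|,", "Sandy Example"))
```

With the names stripped at split time, value_counts no longer reports the same person under two slightly different labels.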

Exploring the superheroes¶

Cleaning the data: removing groups¶

Marvel does not distinguish characters from groups of characters. That is, "The Avengers" is a character just as "Iron Man" could be, but the wiki has a field that lets us tell groups from characters: "Current members". So let's keep only the characters.

Normally we would want to drop the rows that contain nulls, and pandas implements a function for that, dropna, which we have already seen. But here we want to keep the rows whose current_members column is null, because if there are no members it is a single character.

In [30]:

characters_df.dropna(subset=['wiki.current_members'])['name']

Out[30]:

0                         A.I.M.
0                       Avengers
0    Brotherhood of Evil Mutants
0                         Exiles
0                 Fantastic Four
0                    Force Works
0                  Hellfire Club
0                          Hydra
0                 Imperial Guard
0                      Marauders
0                        Reavers
0                   S.H.I.E.L.D.
0                Serpent Society
0                        X-Force
0                          X-Men
...
0                         Sinister Six
0                          ClanDestine
0                            New X-Men
0                      Masters of Evil
0                         Generation X
0              Guardians of the Galaxy
0                               U-Foes
0                            Sentinels
0                          New Mutants
0             Lightning Lords of Nepal
0           Nine-Fold Daughters of Xao
0          Confederates of the Curious
0                             X-Babies
0                        Lethal Legion
0    Brotherhood of Mutants (Ultimate)
Name: name, Length: 70, dtype: object

In [31]:

%timeit (~characters_df['wiki.current_members'].isnull())

import numpy as np
%timeit (np.invert(characters_df['wiki.current_members'].isnull()))

1000 loops, best of 3: 218 µs per loop
1000 loops, best of 3: 233 µs per loop

In [32]:

not_groups_mask = characters_df['wiki.current_members'].isnull()
not_groups_mask.head()

Out[32]:

0    False
0     True
0     True
0     True
0     True
Name: wiki.current_members, dtype: bool

In [33]:

characters_df=characters_df[not_groups_mask]
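The relationship between dropna and the null mask can be sketched on a toy frame (invented rows, with a plain current_members column standing in for wiki.current_members):

```python
import numpy as np
import pandas as pd

chars = pd.DataFrame({
    "name": ["Avengers", "Iron Man", "X-Men", "Storm"],
    "current_members": ["Iron Man, Thor", np.nan, "Storm, Cyclops", np.nan],
})

is_group = chars["current_members"].notnull()

groups = chars[is_group]        # same rows as dropna(subset=["current_members"])
individuals = chars[~is_group]  # rows where the column is null

print(groups["name"].tolist())
print(individuals["name"].tolist())
```

dropna keeps the rows that have a value; negating the mask keeps exactly the rows dropna would discard.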

In [34]:

characters_df[:3]

Out[34]:

comics.available	comics.collectionURI	comics.items	comics.returned	description	events.available	events.collectionURI	events.items	events.returned	id	...	wiki.specieshistory	wiki.team_name	wiki.teamicon	wiki.technology	wiki.tie-ins	wiki.title_graphic	wiki.universe	wiki.weapons	wiki.weaponss	wiki.weight
43	http://gateway.marvel.com/v1/public/characters...	[{'id': 34050, 'resourceURI': 'http://gateway....	43	Formerly known as Emil Blonsky, a spy of Sovie...	2	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	2	1009146	...	NaN	NaN	NaN	NaN	NaN	NaN	Marvel Universe	None	NaN	(Abomination) 980 lbs.; (Blonsky) 180 lbs.
43	http://gateway.marvel.com/v1/public/characters...	[{'id': 36489, 'resourceURI': 'http://gateway....	43		4	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	4	1009148	...	NaN	NaN	NaN	NaN	NaN	NaN	[[Marvel Universe]]	He uses a prison ball-and-chain as a weapon, a...	NaN	365 lbs. (variable)
8	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	8		1	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	1	1009149	...	NaN	NaN	NaN	NaN	NaN	NaN	[[Marvel Universe]]	Unrevealed	NaN	Unrevealed

3 rows × 89 columns

Let's clean the data, keep the fields that may be useful, and index the DataFrame by the name of the superhero or superheroine, because pandas has done what it could but the numbers are not very intuitive.

In [35]:

# Group the fields so it is clear what we want to work with
physical_data = ['wiki.hair', 'wiki.weight', 'wiki.height', 'wiki.eyes']
cultural_data = ['wiki.education', 'wiki.citizenship', 'wiki.place_of_birth', 'wiki.occupation']
personal_data = ['wiki.bio', 'wiki.bio_text', 'wiki.categories']

data_keys = (physical_data + cultural_data + personal_data + ['name','comics.available'])

Remember dropna()? Well, it can do much more.

In [36]:

clean_df = characters_df.dropna(subset = data_keys)
clean_df = clean_df[data_keys].set_index('name')
clean_df.shape

Out[36]:

(762, 12)
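A minimal sketch of the same three steps (drop rows with nulls in selected columns, keep only those columns, index by name) on an invented frame. The wiki.skin column is included only to show that nulls outside the subset are ignored.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "name": ["A", "B", "C"],
    "wiki.eyes": ["Blue", np.nan, "Green"],
    "wiki.hair": ["Black", "Brown", "Red"],
    "wiki.skin": [np.nan, np.nan, np.nan],   # empty, deliberately left out of keys
})
keys = ["name", "wiki.eyes", "wiki.hair"]

# Only nulls in the listed columns cause a row to be dropped.
tidy = raw.dropna(subset=keys)[keys].set_index("name")

print(tidy.shape)
print(tidy.loc["C", "wiki.eyes"])
```

After set_index, rows can be looked up by character name with .loc instead of by position.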

Racial, cultural and gender representation in Marvel comics¶

For example, it would be very interesting to know how many ethnicities are represented in Marvel comics, and there is a skin field in the wiki, but...

In [37]:

characters_df['wiki.skin'].dropna()

Out[37]:

0    White (as GAmbit), Black (as Death)
Name: wiki.skin, dtype: object

Still, let's explore what we have.

In [38]:

clean_df[personal_data].head()

Out[38]:

	wiki.bio	wiki.bio_text	wiki.categories
name
Abomination (Emil Blonsky)	Formerly known as Emil Blonsky, a spy of Sovie...	Formerly known as Emil Blonsky, a spy of Sovie...	[Avengers, Deceased, Hulk, International, Vill...
Absorbing Man	Crusher Creel's life was little more than that...	Crusher Creel's life was little more than that...	[Avengers, Civil War, Villains]
Abyss	Sealed in a coffin-like prison, Abyss was take...	Sealed in a coffin-like prison, Abyss was take...	[Cosmic, Magic, Villains]
Agent Zero	Born in the former East Germany, Christoph Nor...	Born in the former East Germany, Christoph Nor...	[Heroes, X-Men, Villains, International, Mutants]
Annihilus	Untold millennia ago, the Tyannans, a technolo...	Untold millennia ago, the Tyannans, a technolo...	[Annihilation, Cosmic, Fantastic Four, Villains]

In [39]:

clean_df[cultural_data].head()

Out[39]:

	wiki.education	wiki.citizenship	wiki.place_of_birth	wiki.occupation
name
Abomination (Emil Blonsky)	Unrevealed	Citizen of Croatia; former citizen of Yugoslavia	Zagreb, Yugoslavia	Professional Criminal, Former Spy
Absorbing Man	High school dropout	U.S.A. with a criminal record	New York City, New York	Professional criminal; former boxer
Abyss	Unrevealed	Unrevealed	Unrevealed	Cosmic sorcerer
Agent Zero	Unrevealed	German	Unrevealed location in former East Germany	Mercenary, former government operative, freedo...
Annihilus	Unrevealed	Arthros	Planet of [[Arthros]], Sector 17A, [[Negative ...	Conqueror, scavenger

In [40]:

clean_df[cultural_data].describe()

Out[40]:

	wiki.education	wiki.citizenship	wiki.place_of_birth	wiki.occupation
count	762	762	762	762
unique	357	262	412	636
top	Unrevealed	U.S.A.	Unrevealed	Adventurer
freq	236	230	156	31

In [41]:

clean_df[physical_data].head()

Out[41]:

	wiki.hair	wiki.weight	wiki.height	wiki.eyes
name
Abomination (Emil Blonsky)	(Abomination) None; (Blonsky) Blond	(Abomination) 980 lbs.; (Blonsky) 180 lbs.	(Abomination) 6'8"; (Blonsky) 5'10"	(Abomination) Green; (Blonsky) Blue
Absorbing Man	Bald	365 lbs. (variable)	6'4" (variable)	Blue
Abyss	Unrevealed	Unrevealed	Unrevealed	Unrevealed
Agent Zero	(Originally) Brown; (currently) Black	230 lbs.	6'3"	Blue
Annihilus	None	200 lbs.	5'11"	Green

What would you say the typical Marvel character looks like? (pandas knows)

In [42]:

clean_df[physical_data].describe()

Out[42]:

	wiki.hair	wiki.weight	wiki.height	wiki.eyes
count	762	762	762	762
unique	223	307	213	165
top	Black	Unrevealed	Unrevealed	Blue
freq	165	48	44	236

So the archetypal Marvel character has black hair and blue eyes, is from the U.S.A. and works as an adventurer. I already like the profession.
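For text columns, describe() reports count, unique, top (the most frequent value) and freq (how often it occurs). The sketch below, on invented weights, shows how a placeholder such as "Unrevealed" can win that ranking and how filtering it out changes the answer.

```python
import pandas as pd

weights = pd.Series(["Unrevealed", "Unrevealed", "Unrevealed",
                     "230 lbs.", "230 lbs.", "180 lbs."])

summary = weights.describe()
print(summary["top"], summary["freq"])

# Leave the placeholder out before asking for the most common value.
known = weights[weights != "Unrevealed"]
print(known.value_counts().index[0])
```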

Which character appears in the most comics?¶

In [43]:

clean_df['comics.available'].describe()

Out[43]:

count     762.000000
mean       53.292651
std       179.820372
min         0.000000
25%         2.000000
50%        10.000000
75%        33.750000
max      2575.000000
dtype: float64

2575.000000? That must be a mistake, right? Who is the one holding down so many jobs?

In [44]:

clean_df[clean_df['comics.available'] == 2575.000000]

Out[44]:

	wiki.hair	wiki.weight	wiki.height	wiki.eyes	wiki.education	wiki.citizenship	wiki.place_of_birth	wiki.occupation	wiki.bio	wiki.bio_text	wiki.categories	comics.available
name
Spider-Man	Brown	167 lbs.	5'10"	Hazel	College graduate (biophysics major), doctorate...	U.S.A.	Forest Hills, New York	Scientist and inventor; former freelance photo...	The bite of an irradiated spider granted high-...	The bite of an irradiated spider granted high-...	[Avengers, Civil War, Heroes, Marvel Knights, ...	2575

I don't know whether it is a mistake, but if it isn't, that crybaby Spider-Man appears in a lot of comics.
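Instead of copying the maximum by hand, the label of the largest value can be asked for directly. This is a sketch with invented counts; sort_values assumes a reasonably recent pandas (very old releases such as the one used in this notebook spelled it differently).

```python
import pandas as pd

# Hypothetical appearance counts indexed by character name.
appearances = pd.Series({"Hero A": 10, "Hero B": 2575, "Hero C": 43})

top_name = appearances.idxmax()          # index label of the maximum
print(top_name, appearances[top_name])

# The two most frequent characters, highest first.
print(appearances.sort_values(ascending=False).head(2).index.tolist())
```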

Distribution of heroes and villains by gender¶

Before we play with the data (some more), there is a column we can get a lot out of: "wiki.categories".

In [45]:

clean_df.iloc[1]

Out[45]:

wiki.hair                                                           Bald
wiki.weight                                          365 lbs. (variable)
wiki.height                                              6'4" (variable)
wiki.eyes                                                           Blue
wiki.education                                       High school dropout
wiki.citizenship                           U.S.A. with a criminal record
wiki.place_of_birth                              New York City, New York
wiki.occupation                      Professional criminal; former boxer
wiki.bio               Crusher Creel's life was little more than that...
wiki.bio_text          Crusher Creel's life was little more than that...
wiki.categories                          [Avengers, Civil War, Villains]
comics.available                                                      43
Name: Absorbing Man, dtype: object

A priori we have no information about which characters are men, women or aliens. But Marvel must have guessed that we might be interested in the role of women in comics and included a category, "Women", that will make our life a lot easier. Let's create two new columns in the DataFrame:

Women: simply True or False depending on whether the character is female.
Villan: likewise True or False depending on whether the character is a villain.

In [46]:

women = clean_df['wiki.categories'].map(lambda x: 'Women' in x)
clean_df['Women'] = women 
women[:5]

Out[46]:

name
Abomination (Emil Blonsky)    False
Absorbing Man                 False
Abyss                         False
Agent Zero                    False
Annihilus                     False
Name: wiki.categories, dtype: bool

In [47]:

# ~ is an element-wise negation
print("Women: #{}, men #{}".format(clean_df[women].shape[0],clean_df[~women].shape[0]))

Women: #199, men #563

That is, we have 199 female characters and 563 characters without the Women tag, which we treat as male, so only 26% of the characters are female.

In [48]:

villan = clean_df['wiki.categories'].map(lambda x: 'Villains' in x)
clean_df['Villan'] = villan 

In [49]:

print("Villains: #{}, Heroes #{}".format(clean_df[villan].shape[0],clean_df[~villan].shape[0]))

Villains: #231, Heroes #531

The villains have plenty of work too, because apparently they make up only 30.31% of the characters.

Let's see how the hero and villain roles are split between men and women.

In [50]:

men = ~women
gender_data = {'Women':{'Heroes':0,'Villans':0},'Men':{'Heroes':0,'Villans':0}}
# Women and villans
gender_data['Women']['Villans'] = clean_df[villan & women].shape[0]
# Women and heroes
gender_data['Women']['Heroes'] = clean_df[~villan & women].shape[0]

# Men and villans
gender_data['Men']['Villans'] = clean_df[villan & men].shape[0]
# Men and heroes
gender_data['Men']['Heroes'] = clean_df[~villan & men].shape[0]
gender_data

Out[50]:

{'Women': {'Villans': 30, 'Heroes': 169},
 'Men': {'Villans': 201, 'Heroes': 362}}
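The four counts can also be obtained in one step with pd.crosstab. The sketch below works on an invented categories column and reuses the Women/Villan flag names from the cells above; it is an alternative way to build the table, not what the notebook ran.

```python
import pandas as pd

demo = pd.DataFrame({
    "categories": [["Heroes", "Women"], ["Villains"], ["Villains", "Women"],
                   ["Heroes"], ["Heroes"]],
})

# Boolean flags derived from the list-valued column.
demo["Women"] = demo["categories"].map(lambda cats: "Women" in cats)
demo["Villan"] = demo["categories"].map(lambda cats: "Villains" in cats)

# Readable labels for the two axes of the table.
gender = demo["Women"].map({True: "Women", False: "Men"})
role = demo["Villan"].map({True: "Villans", False: "Heroes"})

table = pd.crosstab(gender, role)
print(table)
```

The result is a small DataFrame, so the usual selection tools apply, for example table.loc["Women", "Heroes"].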

In [51]:

n_groups = 2
opacity = 0.3
men_data = (gender_data['Men']['Villans'], gender_data['Men']['Heroes'])
women_data = (gender_data['Women']['Villans'], gender_data['Women']['Heroes'])

fig, ax = plt.subplots()

index = np.arange(n_groups)
bar_width = 0.4


rects1 = plt.bar(index, men_data, bar_width,
                 alpha=opacity,
                 color='b',
                 label='Men')

rects2 = plt.bar(index + bar_width, women_data, bar_width,
                 alpha=opacity,
                 color='r',
                 label='Women')

plt.xlabel('Role')
plt.ylabel('Number of characters')
plt.title('Distribution by gender and role')
# Tick labels follow the order of men_data/women_data: (villains, heroes)
plt.xticks(index + bar_width, ('Villains', 'Heroes'))
plt.legend(loc=0, borderaxespad=1.)

plt.show()

Exploring the comics¶

In [52]:

comics_df.dtypes

Out[52]:

characters.available          int64
characters.collectionURI     object
characters.items             object
characters.returned           int64
collectedIssues              object
collections                  object
creators.available            int64
creators.collectionURI       object
creators.items               object
creators.returned             int64
dates                        object
description                  object
diamondCode                  object
digitalId                     int64
ean                          object
events.available              int64
events.collectionURI         object
events.items                 object
events.returned               int64
format                       object
id                            int64
images                       object
isbn                         object
issn                         object
issueNumber                 float64
modified                     object
pageCount                     int64
prices                       object
resourceURI                  object
series.name                  object
series.resourceURI           object
stories.available             int64
stories.collectionURI        object
stories.items                object
stories.returned              int64
textObjects                  object
thumbnail.extension          object
thumbnail.path               object
title                        object
upc                          object
urls                         object
variantDescription           object
variants                     object
dtype: object

The prices field still holds a JSON object. Bad! We can't analyze it like this.

The object type in dtype comes from NumPy and describes the type of one element of an ndarray. Every element of an ndarray must have the same size in bytes. An int64 or a float64 needs 8 bytes, but a string has no fixed length, so what pandas stores is a pointer to the Python object.
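A quick way to see this difference is to compare the dtypes of a numeric Series and a Series of strings. This is a minimal sketch on toy values (the character names are made up for illustration), and the object dtype is requested explicitly so the example does not depend on the pandas version.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, 3], dtype='int64')
names = pd.Series(['Hulk', 'Spider-Man', 'Thor'], dtype=object)  # toy values

# Fixed-width numeric type: every element takes the same number of bytes.
print(numbers.dtype, numbers.dtype.itemsize)

# Object dtype: the array holds references to Python objects of any length.
print(names.dtype, type(names.iloc[0]))
```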

But no worries! We are going to expand it into a Series, pull out the print price (and the digital price, when there is one), and fix this column of the DataFrame.

In [53]:

prices = comics_df.prices

In [54]:

prices_serie = prices.apply(pd.Series)

In [55]:

prices_serie[20:30]

Out[55]:

	0	1
0	{'price': 1.5, 'type': 'printPrice'}	NaN
0	{'price': 1.5, 'type': 'printPrice'}	NaN
0	{'price': 1.5, 'type': 'printPrice'}	NaN
0	{'price': 1.5, 'type': 'printPrice'}	{'price': 0.99, 'type': 'digitalPurchasePrice'}
0	{'price': 1.5, 'type': 'printPrice'}	{'price': 0.99, 'type': 'digitalPurchasePrice'}
0	{'price': 1.25, 'type': 'printPrice'}	NaN
0	{'price': 1.5, 'type': 'printPrice'}	{'price': 0.99, 'type': 'digitalPurchasePrice'}
0	{'price': 1.5, 'type': 'printPrice'}	{'price': 0.99, 'type': 'digitalPurchasePrice'}
0	{'price': 1.5, 'type': 'printPrice'}	NaN
0	{'price': 1.5, 'type': 'printPrice'}	NaN
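Expanding with `apply(pd.Series)` relies on the print price always being the first entry of the list. A possible alternative, sketched here on toy data, is to select each price by its `type` key instead. The helper `price_of` is hypothetical and not part of the notebook; the two type names are the ones visible in the output above.

```python
import pandas as pd

# Toy stand-in for comics_df.prices: each cell is a list of dicts.
prices = pd.Series([
    [{'price': 1.5, 'type': 'printPrice'}],
    [{'price': 1.5, 'type': 'printPrice'},
     {'price': 0.99, 'type': 'digitalPurchasePrice'}],
])

def price_of(entries, price_type):
    """Return the price with the given type, or NaN if it is absent (hypothetical helper)."""
    for entry in entries:
        if entry['type'] == price_type:
            return entry['price']
    return float('nan')

print_price = prices.apply(price_of, args=('printPrice',))
digital_price = prices.apply(price_of, args=('digitalPurchasePrice',))
print(pd.DataFrame({'print': print_price, 'digital': digital_price}))
```

Selecting by key keeps working even if the order of the entries changes, at the cost of a Python-level loop per row.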

In [60]:

print_price = prices_serie[0].apply(pd.Series)['price']

In [61]:

digital_price = prices_serie[1].apply(pd.Series).price

In [62]:

digital_price.value_counts()

Out[62]:

1.99     7040
3.99      137
2.99       88
0.99       44
0.00       44
7.99        3
6.99        3
4.99        3
19.99       1
dtype: int64

In [63]:

digital_price.count()

Out[63]:

Only 24.4% of the comics have been published digitally.
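A percentage like this is the number of non-missing digital prices divided by the total number of rows. The sketch below shows the arithmetic on a small made-up Series, not on the real data.

```python
import numpy as np
import pandas as pd

# Toy data: 3 comics with a digital price out of 8.
digital_price = pd.Series([1.99, np.nan, np.nan, 0.99, np.nan, 1.99, np.nan, np.nan])

# count() ignores NaN, len() does not.
share = digital_price.count() / len(digital_price)
print("{:.1%}".format(share))  # prints 37.5%
```

`digital_price.notna().mean()` gives the same fraction in a single call.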

We drop the dirty column and add the clean data.

In [64]:

# `del comics_df['prices']` would also work (attribute-style `del comics_df.prices` does not)
comics_df = comics_df.drop('prices', axis=1)

In [65]:

comics_df['print price'] = print_price
comics_df['digital price'] = digital_price
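As a small illustration of the two ways of removing a column, here is a sketch on a toy DataFrame. `drop` returns a new DataFrame and leaves the original untouched, while `del` modifies it in place.

```python
import pandas as pd

# Toy frame with a "dirty" nested column, for illustration only.
toy = pd.DataFrame({
    'title': ['A', 'B'],
    'prices': [[{'price': 1.5}], [{'price': 2.0}]],
})

without_prices = toy.drop(columns=['prices'])  # new DataFrame, toy is unchanged
still_there = 'prices' in toy.columns

del toy['prices']                              # removes the column in place
print(list(without_prices.columns), list(toy.columns))
```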

The dates have exactly the same problem as the prices. Let's clean the data (data munging again).

In [66]:

dates = comics_df.dates
dates_serie = dates.apply(pd.Series)[0].apply(pd.Series)

In [67]:

on_sale_date = dates_serie.date.astype('datetime64[ns]')
on_sale_date.head()

Out[67]:

0   2004-11-24 05:00:00
0   2003-10-08 04:00:00
0   2005-11-02 05:00:00
0   1999-06-01 04:00:00
0   1999-07-01 04:00:00
Name: date, dtype: datetime64[ns]

In [68]:

comics_df['On sale Date'] = on_sale_date

In [69]:

start = comics_df['On sale Date'].min()
end =  comics_df['On sale Date'].max()

yearly_range = pd.date_range(start, end, freq='365D6H')

In [70]:

comics_per_year = comics_df.groupby(on_sale_date.map(lambda x: x.year)).size()
comics_per_year.plot()

Out[70]:

<matplotlib.axes._subplots.AxesSubplot at 0x10c742b10>

In [71]:

really_old = comics_df[on_sale_date==start]
print(start)

1753-07-29 03:43:41.128654848

WTF! Marvel is waaay older than we thought.

In [72]:

really_old.dates.iloc[1]

Out[72]:

[{'date': '-0001-11-30T00:00:00-0500', 'type': 'onsaleDate'},
 {'date': '-0001-11-30T00:00:00-0500', 'type': 'focDate'}]

Or it is a formatting problem. Either way, we don't want this data: it is noise.
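One way to keep malformed dates like this from turning into bogus timestamps is to parse with `pd.to_datetime` and `errors='coerce'`, which turns unparseable values into `NaT`. This is a sketch of an alternative, not what the notebook does. The explicit format string and the first sample value are assumptions based on the raw dates shown above.

```python
import pandas as pd

raw_dates = pd.Series([
    '2004-11-24T00:00:00-0500',   # assumed well-formed value
    '-0001-11-30T00:00:00-0500',  # the malformed value seen above
])

# Values that do not match the format become NaT instead of a nonsense date.
parsed = pd.to_datetime(raw_dates, format='%Y-%m-%dT%H:%M:%S%z',
                        errors='coerce', utc=True)

valid = parsed.dropna()
print(parsed.isna().tolist(), len(valid))
```

The noisy rows can then be discarded with `dropna` rather than by comparing against the minimum date.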

In [73]:

#back_to_future_comics = comics_df[on_sale_date==end]
print(end)
back_to_future_comic = comics_df[comics_df['On sale Date'] == end]
back_to_future_comic.title

2020-12-31 05:00:00

Out[73]:

0    Ant-Man: So (Trade Paperback)
Name: title, dtype: object

In [74]:

print("We are going to drop {} records.".format(really_old['On sale Date'].shape[0]))

We are going to drop 945 records.

In [75]:

comics_df = comics_df[comics_df['On sale Date'] != start]

In [76]:

comics_per_year = comics_df.groupby(comics_df['On sale Date'].map(lambda x: x.year)).size()
comics_per_year.plot()

Out[76]:

<matplotlib.axes._subplots.AxesSubplot at 0x10d57e650>

Muuuch better.
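The year-by-year count can also be written with the `.dt` accessor instead of mapping a lambda over every timestamp. The sketch below uses a toy DataFrame with made-up titles; only the column name `On sale Date` comes from the notebook.

```python
import pandas as pd

# Toy data with the same column name as the notebook.
comics = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'On sale Date': pd.to_datetime(
        ['1999-06-01', '1999-07-01', '2003-10-08', '2004-11-24']),
})

# .dt.year extracts the year of every timestamp at once.
comics_per_year = comics.groupby(comics['On sale Date'].dt.year).size()
print(comics_per_year)
```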

In [77]:

comics_df = comics_df.fillna(0)

Remember dropna? Well, there is also a fillna.
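One thing worth keeping in mind is that filling missing values with 0 changes any average computed afterwards, because `mean()` skips NaN but counts zeros. A minimal sketch on made-up prices:

```python
import numpy as np
import pandas as pd

# Toy data: two comics with a digital price, two without.
digital_price = pd.Series([1.99, np.nan, 1.99, np.nan])

mean_skipping_nan = digital_price.mean()          # averages only the two real prices
mean_with_zeros = digital_price.fillna(0).mean()  # the zeros pull the average down

print(mean_skipping_nan, mean_with_zeros)
```

So after `fillna(0)`, a yearly mean of the digital price mixes "how much it costs" with "how many comics have a digital edition".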

In [78]:

comics_group = comics_df.groupby(comics_df['On sale Date'].map(lambda x: x.year))

In [79]:

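# Select several columns with a list (double brackets); tuple-style selection is no longer supported
price_per_year = comics_group[['print price', 'pageCount', 'digital price']].mean()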

In [80]:

price_per_year

Out[80]:

	print price	pageCount	digital price
On sale Date
1939	0.100000	68.000000	0.000000
1940	0.100000	68.000000	0.000000
1941	0.100000	68.000000	0.361818
1942	0.100000	68.000000	0.000000
1943	0.092308	59.282051	0.000000
1944	0.090909	51.636364	0.000000
1945	0.088889	37.481481	0.000000
1946	0.089655	46.620690	0.000000
1947	0.081250	40.250000	0.000000
1948	0.056000	26.200000	0.000000
1949	0.050000	16.000000	0.000000
1950	0.100000	41.333333	0.000000
1951	0.090909	32.727273	0.000000
1952	0.078947	28.421053	0.000000
1953	0.080000	27.600000	0.000000
1954	0.065714	26.742857	0.000000
1955	0.070968	26.709677	0.000000
1956	0.075000	26.100000	0.000000
1957	0.078261	28.173913	0.000000
1958	0.053333	26.400000	0.000000
1959	0.073913	15.739130	0.000000
1960	0.070000	20.080000	0.019800
1961	0.084433	8.860825	0.082062
1962	0.105114	20.545455	0.361818
1963	0.090636	19.309091	0.814091
1964	0.096439	25.606061	0.964848
1965	0.094173	27.683453	1.030791
1966	0.099161	28.867133	0.827972
1967	0.100811	27.885135	0.510541
1968	0.107081	28.118012	0.401553
...	...	...	...
1986	0.575697	28.701195	0.356773
1987	0.573469	26.800000	0.203061
1988	0.770741	30.162963	0.243222
1989	0.921711	33.800000	0.130921
1990	0.812363	29.879121	0.183132
1991	0.809682	29.862069	0.187215
1992	0.871466	27.570681	0.197906
1993	1.074494	29.283117	0.320468
1994	1.154888	27.363128	0.188994
1995	1.290229	25.124183	0.312190
1996	1.108352	25.260536	0.152490
1997	1.127034	27.027586	0.466621
1998	0.901168	28.963504	0.275985
1999	2.720977	68.877193	0.403985
2000	0.697008	25.819672	0.481189
2001	0.415556	20.518519	0.853735
2002	0.909750	27.633333	0.784944
2003	3.073240	42.234676	0.616865
2004	3.908775	16.430834	0.615702
2005	4.454624	8.347863	0.675239
2006	4.576349	0.092105	0.713520
2007	4.660382	0.000000	0.657806
2008	7.574411	50.649746	0.585883
2009	8.891997	77.159628	0.505063
2010	8.374445	74.563660	0.553129
2011	9.318335	82.327663	0.653859
2012	9.536820	81.411090	0.912588
2013	9.095204	77.181307	0.755102
2014	8.369410	66.957640	0.006036
2020	19.990000	136.000000	0.000000

77 rows × 3 columns

In [81]:

price_per_year.plot()

Out[81]:

<matplotlib.axes._subplots.AxesSubplot at 0x10cf0a9d0>

In [82]:

plt.figure()
with pd.plotting.plot_params.use('x_compat', True):
    price_per_year['print price'].plot(color='r')
    price_per_year['digital price'].plot(color='g')

Thank you!¶
