Introduction to the Pandas library in Python

By Matt Niksch

Why are we here?

  1. Pandas is the most popular data table library in Python
  2. You can use it to interact with the many data science and machine learning Python tools
  3. Shifting from Excel to python/pandas can help you move towards automating repeated analyses
  4. Python is a powerful general purpose scripting language, so pandas can be integrated with many other tools
In [1]:
# This is a Jupyter Notebook file
# The section above this one is called "markdown"
# This section is Python code (although the #'s indicate comments)

Before we get started

If you'd like to follow along:

  1. HTML and Notebook versions of this file are available at https://mattniksch.com/blog/
  2. To run everything locally on your own computer, install Python 3 and then enter the following in a terminal (you don't need to type anything after the # signs):

pip3 install --upgrade pip #Good idea, but not required

pip3 install pandas #This is the main tool we're talking about today

pip3 install jupyter #If you'd like to run this inside a jupyter notebook; otherwise, you can enter all of these commands inside of IDLE

Note: if you're using Anaconda instead of vanilla Python, you can skip all of that

After all of that is done, launch a Python interactive session in either IDLE or in Jupyter by typing:

jupyter notebook #After you type this, you'll need to start a new Python notebook

Within Jupyter, each cell can be either "Markdown" (like this one) or "Code". For either one, hit Shift+Enter to run the cell


Now, on to the actual introduction:

The most common data structure used in pandas is the DataFrame, which you can generally think of as a grid

You can create a DataFrame a few different ways

In [2]:
# Before we start, we need to import the libraries we're working with into local memory
import pandas as pd # everyone shortens the library name this way (to make it easier to type)
import numpy as np # this is a companion library to pandas that is used for some numeric work
In [3]:
# First way to create a dataframe: list of lists for data with an extra list to define columns:
dn_df = pd.DataFrame([['Donald','Duck'],['Mickey','Mouse'],['Minnie','Mouse']],columns=['First','Last'])
dn_df #Most people put _df at the end of their DataFrames as a reminder
Out[3]:
First Last
0 Donald Duck
1 Mickey Mouse
2 Minnie Mouse
In [4]:
type(dn_df)
Out[4]:
pandas.core.frame.DataFrame
In [5]:
# Alternatively, with a list of dictionaries:
pd.DataFrame([{'First':'Donald','Last':'Duck'},{'First':'Mickey','Last':'Mouse'},{'First':'Minnie','Last':'Mouse'}])
Out[5]:
First Last
0 Donald Duck
1 Mickey Mouse
2 Minnie Mouse
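
A third option, not shown in the cells above, is a dictionary of lists, where each key names a column. The sketch below is a minimal illustration; the variable names (characters, by_column_df, by_row_df) are made up for this example.

```python
import pandas as pd

# Each key becomes a column name; each list holds that column's values
characters = {
    'First': ['Donald', 'Mickey', 'Minnie'],
    'Last': ['Duck', 'Mouse', 'Mouse'],
}
by_column_df = pd.DataFrame(characters)

# The same table built row by row, as in the list-of-lists cell above
by_row_df = pd.DataFrame(
    [['Donald', 'Duck'], ['Mickey', 'Mouse'], ['Minnie', 'Mouse']],
    columns=['First', 'Last'],
)

# Both routes should describe the same grid
print(by_column_df.equals(by_row_df))
```

Which form is more convenient usually depends on whether your data arrives one row at a time or one column at a time.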
In [6]:
# That's fine for small files or for transforming data you have locally; in most instances, we'll probably start with a file
df_df = pd.read_csv('https://s3.amazonaws.com/mattniksch-python-pandas-intro/data_fellows.csv',encoding='latin1')
In [7]:
# Let's look at how big it is and then look at the top part
print(len(df_df))
df_df.head(2)
45
Out[7]:
First Last Gender Organization City Region/State Country Cohort Track
0 Lucy Abbot Female Braven San Francisco California United States 5 Data Science
1 Arpit Arora Male Pratham Books Bangalore Karnataka India 5 Data Science
In [8]:
df_df.columns
Out[8]:
Index(['First', 'Last', 'Gender', 'Organization', 'City', 'Region/State',
       'Country', 'Cohort', 'Track'],
      dtype='object')
In [9]:
# You can use normal Python control structures on the Index object above (it's an "iterable")
for column in df_df.columns:
    print('{} is the name of one of the columns in this DataFrame'.format(column))
First is the name of one of the columns in this DataFrame
Last is the name of one of the columns in this DataFrame
Gender is the name of one of the columns in this DataFrame
Organization is the name of one of the columns in this DataFrame
City is the name of one of the columns in this DataFrame
Region/State is the name of one of the columns in this DataFrame
Country is the name of one of the columns in this DataFrame
Cohort is the name of one of the columns in this DataFrame
Track is the name of one of the columns in this DataFrame

We're going to go through a range of common tasks here, but this site has a nice quick summary of some of the most used functions

In [10]:
# Let's add a LastFirst field to our DataFrame:
df_df['LastFirst'] = df_df['Last']+', '+df_df['First']
df_df.tail()
Out[10]:
First Last Gender Organization City Region/State Country Cohort Track LastFirst
40 Nick Lucius Male City of Chicago Chicago Illinois United States 1 Data Science Lucius, Nick
41 William Bowser Male Hello Tractor Abuja FCT Nigeria 1 Data Science Bowser, William
42 Bonnie Fan Female Chicago Transit Authority Chicago Illinois United States 1 Data Science Fan, Bonnie
43 Catherine Ojo Female Worldreader Barcelona Catalonia Spain 1 Data Science Ojo, Catherine
44 Dan Wei Female Carbon Lighthouse San Francisco California United States 1 Data Science Wei, Dan

You can see the index column on the left of the output above. All DataFrames have an index, and the default is to assign integers starting at 0.

You can assign it to one of the columns at import, though, or reassign things later

In [11]:
# Use the index_col argument; here index_col=3 and index_col=['Organization'] are equivalent
# Note that I can chain .tail() to the creation of the DataFrame
# You can do this kind of thing a lot in Python, but sometimes it's clearer to use multiple lines
pd.read_csv('https://s3.amazonaws.com/mattniksch-python-pandas-intro/data_fellows.csv',index_col=3, encoding='latin1').tail()
Out[11]:
First Last Gender City Region/State Country Cohort Track
Organization
City of Chicago Nick Lucius Male Chicago Illinois United States 1 Data Science
Hello Tractor William Bowser Male Abuja FCT Nigeria 1 Data Science
Chicago Transit Authority Bonnie Fan Female Chicago Illinois United States 1 Data Science
Worldreader Catherine Ojo Female Barcelona Catalonia Spain 1 Data Science
Carbon Lighthouse Dan Wei Female San Francisco California United States 1 Data Science
In [12]:
# Organization is probably a weird choice for index; let's change it to LastFirst
df = df_df.set_index('LastFirst')
df.head()
Out[12]:
First Last Gender Organization City Region/State Country Cohort Track
LastFirst
Abbot, Lucy Lucy Abbot Female Braven San Francisco California United States 5 Data Science
Arora, Arpit Arpit Arora Male Pratham Books Bangalore Karnataka India 5 Data Science
Crowe, Kevin Kevin Crowe Male Milwaukee Journal Sentinel Milwaukee Wisconsin United States 5 Data Science
Elszasz, Justin Justin Elszasz Male City of Baltimore Baltimore Maryland United States 5 Data Science
Hersher, Monica Monica Hersher Female IDinsight New Dehli National Capital Territory India 5 Data Science
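
Reassigning the index works in both directions. The sketch below uses a tiny made-up table (people_df and the other names are illustrative) to show that set_index returns a new DataFrame and that reset_index moves the index back into an ordinary column.

```python
import pandas as pd

people_df = pd.DataFrame({
    'First': ['Donald', 'Mickey'],
    'Last': ['Duck', 'Mouse'],
})
people_df['LastFirst'] = people_df['Last'] + ', ' + people_df['First']

# set_index returns a new DataFrame; people_df itself keeps its integer index
indexed_df = people_df.set_index('LastFirst')

# reset_index moves the index back into a regular column
restored_df = indexed_df.reset_index()

print(list(indexed_df.index))
print(list(people_df.index))
print(list(restored_df.columns))
```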

Here are some ways to access different slices of your data:


In [13]:
# Entire row:
df.loc['Niksch, Matt']
Out[13]:
First                                       Matt
Last                                      Niksch
Gender                                      Male
Organization    Noble Network of Charter Schools
City                                     Chicago
Region/State                            Illinois
Country                            United States
Cohort                                         2
Track                               Data Science
Name: Niksch, Matt, dtype: object
In [14]:
# Entire column:
df['Organization']
Out[14]:
LastFirst
Abbot, Lucy                                                           Braven
Arora, Arpit                                                   Pratham Books
Crowe, Kevin                                      Milwaukee Journal Sentinel
Elszasz, Justin                                            City of Baltimore
Hersher, Monica                                                    IDinsight
Hudlow, Jonathan                                  Love Justice International
Jairam, Nesha                 Georgia Division of Family & Children Services
Lavoe, Francis                                                    PEG Africa
Murray, Kate                                                         mRelief
Vega Rodriguez, Juan                                               CentroNia
McAllister, Scott                            General Services Administration
Jimenez Lara, Daniela             Laboratorio Nacional de Políticas Públicas
Filippou, Georgios                                World Vision International
Sundberg, Johnna                                               One Acre Fund
Turlakova, Marina                                                    uAspire
Bouacha, Nora                                             Heartland Alliance
Sinclair, Rajiv                                          Invisible Institute
Nkusi, Sandra                                    Ontario Trillium Foundation
Bromberg, Ben                                             Pencils of Promise
Williams, Adam                                                     Litterati
Ghen, Michael                                            Benefits Data Trust
Benard, Claire                  National Council for Voluntary Organisations
Liadsky, Daniel                                     Canadian Urban Institute
McGowan, Jim                                              American Red Cross
Zander, Keith                                             OneGoal Graduation
Oliva-Altamirano, Paola                                        Our Community
Mitchell, Robert                                      Skid Row Housing Trust
Geraghty, Ryan                                               412 Food Rescue
Bunting, Tom                                                       Ingenuity
Alele, Peter                                                     Vital Signs
Doshi, Kruti                          Cook County Health & Hospitals Systems
Jahani, Kosar                                                     Samasource
Niksch, Matt                                Noble Network of Charter Schools
Gagne, Elizabeth                       New York City Department of Education
Gohil, Deepali                                     Northern Rangelands Trust
Stevens, Matt                                            Benefits Data Trust
Tess, Joanna               Coordination of Healthcare for Complex Kids Pr...
Anselmo, Nicki                    Cesar Chavez Multicultural Academic Center
Aye, Nyi Nyi                                                    Koe Koe Tech
Jimenez, Rebeca Moreno                              UNHCR--UN Refugee Agency
Lucius, Nick                                                 City of Chicago
Bowser, William                                                Hello Tractor
Fan, Bonnie                                        Chicago Transit Authority
Ojo, Catherine                                                   Worldreader
Wei, Dan                                                   Carbon Lighthouse
Name: Organization, dtype: object

Note that the index carried along for this version

This isn't a simple list, but instead a pandas Series object, which is kind of like the little sibling of a DataFrame

We'll talk about that more in a bit, but as a quick note, here are a few other ways to get a column:

df.Organization #you can skip the brackets if there are no special characters

df.loc[:, 'Organization'] #Similar to the rest of Python, you can use a : to specify "everything" in a dimension

In [15]:
type(df.loc[:,'Organization'])
Out[15]:
pandas.core.series.Series
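
To check that the three ways of getting a column really return the same Series, here is a small sketch on a made-up two-row table (small_df is a hypothetical name, not the fellows data).

```python
import pandas as pd

small_df = pd.DataFrame(
    {'Organization': ['Braven', 'mRelief'],
     'Region/State': ['California', 'Illinois']},
    index=['Abbot, Lucy', 'Murray, Kate'],
)

by_brackets = small_df['Organization']
by_attribute = small_df.Organization
by_loc = small_df.loc[:, 'Organization']

print(by_brackets.equals(by_attribute) and by_brackets.equals(by_loc))

# 'Region/State' contains a slash, so it needs the bracket or .loc form
print(small_df['Region/State'].tolist())
```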
In [16]:
# Here, I'm first getting a Series and then asking for the value at a specific index
df['Organization']['Niksch, Matt']
Out[16]:
'Noble Network of Charter Schools'
In [17]:
# Here, I'm asking for a point in the grid by specifying both axes
df.loc['Niksch, Matt','Gender']
Out[17]:
'Male'
In [18]:
# .loc is used above when you know the values; .iloc is used when you know the location in the grid
# In this example, I'm using the : as the second argument to say I want everything
# Note that position 3 is Justin, the 4th person in the table above, because positions are counted from zero
df.iloc[3,:]
Out[18]:
First                      Justin
Last                      Elszasz
Gender                       Male
Organization    City of Baltimore
City                    Baltimore
Region/State             Maryland
Country             United States
Cohort                          5
Track                Data Science
Name: Elszasz, Justin, dtype: object
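
One difference between .loc and .iloc that is easy to miss is how they slice. The sketch below uses a made-up four-row table (grid_df is illustrative) to compare the two.

```python
import pandas as pd

grid_df = pd.DataFrame(
    {'City': ['San Francisco', 'Bangalore', 'Milwaukee', 'Baltimore'],
     'Cohort': [5, 5, 5, 5]},
    index=['Abbot, Lucy', 'Arora, Arpit', 'Crowe, Kevin', 'Elszasz, Justin'],
)

# Label-based: ask for the cell by index value and column name
by_label = grid_df.loc['Elszasz, Justin', 'City']

# Position-based: row 3 is the fourth row, column 0 is the first column
by_position = grid_df.iloc[3, 0]

# Slices differ: .loc includes the end label, .iloc excludes the end position
label_slice = grid_df.loc['Abbot, Lucy':'Crowe, Kevin']
position_slice = grid_df.iloc[0:2]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```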

In addition to slicing the data, pandas has a number of descriptive functions

In [19]:
df.describe()
Out[19]:
Cohort
count 45.000000
mean 3.155556
std 1.429487
min 1.000000
25% 2.000000
50% 3.000000
75% 4.000000
max 5.000000
In [20]:
# That seems like a pretty weak description for all of those columns, but maybe that's because of the types of data we have
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 45 entries, Abbot, Lucy to Wei, Dan
Data columns (total 9 columns):
First           45 non-null object
Last            45 non-null object
Gender          45 non-null object
Organization    45 non-null object
City            45 non-null object
Region/State    45 non-null object
Country         45 non-null object
Cohort          45 non-null int64
Track           45 non-null object
dtypes: int64(1), object(8)
memory usage: 2.7+ KB
In [21]:
# By default, .describe() on a whole DataFrame skips non-numeric columns when numeric ones exist
# (passing include='all' overrides that). It also works on an individual Series:
df['Gender'].describe()
Out[21]:
count       45
unique       2
top       Male
freq        24
Name: Gender, dtype: object
In [22]:
df['City'].describe()
Out[22]:
count          45
unique         26
top       Chicago
freq           12
Name: City, dtype: object
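
If you want the text columns summarized alongside the numeric ones, describe() accepts include='all', and value_counts() gives the full frequency table behind "top" and "freq". The sketch below uses a small made-up table (mixed_df is illustrative).

```python
import pandas as pd

mixed_df = pd.DataFrame({
    'City': ['Chicago', 'Chicago', 'Toronto', 'Nairobi'],
    'Cohort': [1, 2, 3, 3],
})

# Default: only the numeric column (Cohort) is summarized
numeric_summary = mixed_df.describe()

# include='all' adds count/unique/top/freq rows for the text column too
full_summary = mixed_df.describe(include='all')

# value_counts lists how often each value appears, most frequent first
city_counts = mixed_df['City'].value_counts()

print(list(numeric_summary.columns))
print(list(full_summary.columns))
print(city_counts)
```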

Back to manipulation:

In [23]:
# You can sort:
df2 = df.sort_values(['Cohort','Last', 'First'], ascending=[True,False,False])
df2.index
Out[23]:
Index(['Wei, Dan', 'Ojo, Catherine', 'Lucius, Nick', 'Jimenez, Rebeca Moreno',
       'Fan, Bonnie', 'Bowser, William', 'Aye, Nyi Nyi', 'Anselmo, Nicki',
       'Tess, Joanna', 'Stevens, Matt', 'Niksch, Matt', 'Jahani, Kosar',
       'Gohil, Deepali', 'Gagne, Elizabeth', 'Doshi, Kruti', 'Alele, Peter',
       'Zander, Keith', 'Oliva-Altamirano, Paola', 'Mitchell, Robert',
       'McGowan, Jim', 'Liadsky, Daniel', 'Geraghty, Ryan', 'Bunting, Tom',
       'Benard, Claire', 'Williams, Adam', 'Turlakova, Marina',
       'Sundberg, Johnna', 'Sinclair, Rajiv', 'Nkusi, Sandra',
       'McAllister, Scott', 'Jimenez Lara, Daniela', 'Ghen, Michael',
       'Filippou, Georgios', 'Bromberg, Ben', 'Bouacha, Nora',
       'Vega Rodriguez, Juan', 'Murray, Kate', 'Lavoe, Francis',
       'Jairam, Nesha', 'Hudlow, Jonathan', 'Hersher, Monica',
       'Elszasz, Justin', 'Crowe, Kevin', 'Arora, Arpit', 'Abbot, Lucy'],
      dtype='object', name='LastFirst')
In [24]:
# You can grab a subset of columns:
df2[['Cohort','Organization']]
Out[24]:
Cohort Organization
LastFirst
Wei, Dan 1 Carbon Lighthouse
Ojo, Catherine 1 Worldreader
Lucius, Nick 1 City of Chicago
Jimenez, Rebeca Moreno 1 UNHCR--UN Refugee Agency
Fan, Bonnie 1 Chicago Transit Authority
Bowser, William 1 Hello Tractor
Aye, Nyi Nyi 1 Koe Koe Tech
Anselmo, Nicki 1 Cesar Chavez Multicultural Academic Center
Tess, Joanna 2 Coordination of Healthcare for Complex Kids Pr...
Stevens, Matt 2 Benefits Data Trust
Niksch, Matt 2 Noble Network of Charter Schools
Jahani, Kosar 2 Samasource
Gohil, Deepali 2 Northern Rangelands Trust
Gagne, Elizabeth 2 New York City Department of Education
Doshi, Kruti 2 Cook County Health & Hospitals Systems
Alele, Peter 2 Vital Signs
Zander, Keith 3 OneGoal Graduation
Oliva-Altamirano, Paola 3 Our Community
Mitchell, Robert 3 Skid Row Housing Trust
McGowan, Jim 3 American Red Cross
Liadsky, Daniel 3 Canadian Urban Institute
Geraghty, Ryan 3 412 Food Rescue
Bunting, Tom 3 Ingenuity
Benard, Claire 3 National Council for Voluntary Organisations
Williams, Adam 4 Litterati
Turlakova, Marina 4 uAspire
Sundberg, Johnna 4 One Acre Fund
Sinclair, Rajiv 4 Invisible Institute
Nkusi, Sandra 4 Ontario Trillium Foundation
McAllister, Scott 4 General Services Administration
Jimenez Lara, Daniela 4 Laboratorio Nacional de Políticas Públicas
Ghen, Michael 4 Benefits Data Trust
Filippou, Georgios 4 World Vision International
Bromberg, Ben 4 Pencils of Promise
Bouacha, Nora 4 Heartland Alliance
Vega Rodriguez, Juan 5 CentroNia
Murray, Kate 5 mRelief
Lavoe, Francis 5 PEG Africa
Jairam, Nesha 5 Georgia Division of Family & Children Services
Hudlow, Jonathan 5 Love Justice International
Hersher, Monica 5 IDinsight
Elszasz, Justin 5 City of Baltimore
Crowe, Kevin 5 Milwaukee Journal Sentinel
Arora, Arpit 5 Pratham Books
Abbot, Lucy 5 Braven
In [25]:
# Alternatively, you can drop columns:
df2.drop(['Gender','Organization'], axis=1).head(8)
# Note that this doesn't change df2 unless I set inplace=True as one of the arguments
Out[25]:
First Last City Region/State Country Cohort Track
LastFirst
Wei, Dan Dan Wei San Francisco California United States 1 Data Science
Ojo, Catherine Catherine Ojo Barcelona Catalonia Spain 1 Data Science
Lucius, Nick Nick Lucius Chicago Illinois United States 1 Data Science
Jimenez, Rebeca Moreno Rebeca Moreno Jimenez Geneva Geneva Switzerland 1 Data Science
Fan, Bonnie Bonnie Fan Chicago Illinois United States 1 Data Science
Bowser, William William Bowser Abuja FCT Nigeria 1 Data Science
Aye, Nyi Nyi Nyi Nyi Aye Yangon Yangon Myanmar 1 Data Science
Anselmo, Nicki Nicki Anselmo Chicago Illinois United States 1 Data Science
In [26]:
# or rows:
df_test = df2.copy()
df_test.drop(['Wei, Dan']).head(8)
Out[26]:
First Last Gender Organization City Region/State Country Cohort Track
LastFirst
Ojo, Catherine Catherine Ojo Female Worldreader Barcelona Catalonia Spain 1 Data Science
Lucius, Nick Nick Lucius Male City of Chicago Chicago Illinois United States 1 Data Science
Jimenez, Rebeca Moreno Rebeca Moreno Jimenez Female UNHCR--UN Refugee Agency Geneva Geneva Switzerland 1 Data Science
Fan, Bonnie Bonnie Fan Female Chicago Transit Authority Chicago Illinois United States 1 Data Science
Bowser, William William Bowser Male Hello Tractor Abuja FCT Nigeria 1 Data Science
Aye, Nyi Nyi Nyi Nyi Aye Male Koe Koe Tech Yangon Yangon Myanmar 1 Data Science
Anselmo, Nicki Nicki Anselmo Female Cesar Chavez Multicultural Academic Center Chicago Illinois United States 1 Data Science
Tess, Joanna Joanna Tess Female Coordination of Healthcare for Complex Kids Pr... Chicago Illinois United States 2 Data Science
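
Because drop() returns a new DataFrame, the usual pattern is to assign the result to a name. The sketch below uses a made-up two-row table (roster_df is illustrative) and the columns= and index= keywords, which do the same job as the axis argument used above.

```python
import pandas as pd

roster_df = pd.DataFrame(
    {'Gender': ['Female', 'Female'], 'City': ['San Francisco', 'Barcelona']},
    index=['Wei, Dan', 'Ojo, Catherine'],
)

# drop() returns a new DataFrame and leaves roster_df untouched
without_gender = roster_df.drop(columns=['Gender'])  # same as drop(['Gender'], axis=1)
without_dan = roster_df.drop(index=['Wei, Dan'])     # same as drop(['Wei, Dan'])

print(list(roster_df.columns))
print(list(without_gender.columns))
print(list(without_dan.index))
```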
In [27]:
# You can match rows based on values:
df2[df2['Region/State']=='California']
Out[27]:
First Last Gender Organization City Region/State Country Cohort Track
LastFirst
Wei, Dan Dan Wei Female Carbon Lighthouse San Francisco California United States 1 Data Science
Jahani, Kosar Kosar Jahani Female Samasource San Francisco California United States 2 Data Science
Mitchell, Robert Robert Mitchell Male Skid Row Housing Trust Los Angeles California United States 3 Data Science
Williams, Adam Adam Williams Male Litterati San Francisco California United States 4 Data Security
Abbot, Lucy Lucy Abbot Female Braven San Francisco California United States 5 Data Science
In [28]:
# The above looks a little weird. That's because the inner expression is creating a Series
# of True/False values; you can combine these with the & (and) and | (or) operators,
# wrapping each condition in parentheses:
df2[(df2['Cohort']==1) & (df2['First'].str.startswith('N'))]
Out[28]:
First Last Gender Organization City Region/State Country Cohort Track
LastFirst
Lucius, Nick Nick Lucius Male City of Chicago Chicago Illinois United States 1 Data Science
Aye, Nyi Nyi Nyi Nyi Aye Male Koe Koe Tech Yangon Yangon Myanmar 1 Data Science
Anselmo, Nicki Nicki Anselmo Female Cesar Chavez Multicultural Academic Center Chicago Illinois United States 1 Data Science
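
It can help to look at the True/False Series on its own before using it as a filter. The sketch below uses a made-up four-row table (fellows_df and the mask names are illustrative) and also shows ~ for "not".

```python
import pandas as pd

fellows_df = pd.DataFrame({
    'First': ['Nick', 'William', 'Nicki', 'Matt'],
    'City': ['Chicago', 'Abuja', 'Chicago', 'Chicago'],
    'Cohort': [1, 1, 1, 2],
})

# A comparison on a column produces a Series of True/False values
is_cohort_1 = fellows_df['Cohort'] == 1
starts_with_n = fellows_df['First'].str.startswith('N')

both = fellows_df[is_cohort_1 & starts_with_n]        # and
either = fellows_df[is_cohort_1 | starts_with_n]      # or
neither = fellows_df[~(is_cohort_1 | starts_with_n)]  # not

print(is_cohort_1.tolist())
print(both['First'].tolist())
print(either['First'].tolist())
print(neither['First'].tolist())
```

Naming the masks first tends to make longer filters easier to read than one long bracketed expression.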

Let's use some of these slicing tricks to do a "Pivot Table" type task

We're going to figure out the % Female and % Chicago for each cohort

In [29]:
g_count = df2[['Cohort','Gender','City']].groupby(['Gender','Cohort']).count()
g_count
Out[29]:
City
Gender Cohort
Female 1 5
2 5
3 2
4 5
5 4
Male 1 3
2 3
3 6
4 6
5 6
In [30]:
# There are a few things happening here, but notably, the groupby() function created a multi-level index
# Let's play with that a little:
g_count.index
Out[30]:
MultiIndex(levels=[['Female', 'Male'], [1, 2, 3, 4, 5]],
           labels=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]],
           names=['Gender', 'Cohort'])
In [31]:
g_count.loc['Female']
Out[31]:
City
Cohort
1 5
2 5
3 2
4 5
5 4
In [32]:
g_count.loc['Female'].rename(columns={'City':'# Female'})
Out[32]:
# Female
Cohort
1 5
2 5
3 2
4 5
5 4
In [33]:
g_count_f = g_count.loc['Female'].rename(columns={'City':'# Female'}).T
g_count_f
Out[33]:
Cohort 1 2 3 4 5
# Female 5 5 2 5 4
In [34]:
g_count_m = g_count.loc['Male'].rename(columns={'City':'# Male'})
g_count_m
Out[34]:
# Male
Cohort
1 3
2 3
3 6
4 6
5 6
In [35]:
gender_df = pd.concat([g_count_f, g_count_m.T])
gender_df
Out[35]:
Cohort 1 2 3 4 5
# Female 5 5 2 5 4
# Male 3 3 6 6 6
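
The .loc, rename, transpose, and concat steps above can probably be collapsed with unstack(), which moves one level of a multi-level index into the columns. This is a sketch of that alternative on made-up rows (sample_df and its values are illustrative, not the fellows data).

```python
import pandas as pd

sample_df = pd.DataFrame({
    'Cohort': [1, 1, 1, 2, 2],
    'Gender': ['Female', 'Female', 'Male', 'Female', 'Male'],
    'City': ['Chicago', 'Abuja', 'Chicago', 'Nairobi', 'Chicago'],
})

# Count rows per (Gender, Cohort) pair; the result has a two-level index
counts = sample_df.groupby(['Gender', 'Cohort'])['City'].count()

# unstack moves the Cohort level into the columns: one row per gender
wide = counts.unstack('Cohort')

print(wide)
```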
In [36]:
gender_df.sum()
Out[36]:
Cohort
1     8
2     8
3     8
4    11
5    10
dtype: int64
In [37]:
percent_female = gender_df.loc['# Female'] / gender_df.sum()
percent_female.name = '% Female'
percent_female
Out[37]:
Cohort
1    0.625000
2    0.625000
3    0.250000
4    0.454545
5    0.400000
Name: % Female, dtype: float64
In [38]:
# Now, let's do % Chicago
df3=df2[['Cohort','City']]
percent_chicago = df3[df3['City']=='Chicago'].groupby(['Cohort']).count()/df3.groupby(['Cohort']).count()
percent_chicago.rename(columns={'City':'% Chicago'},inplace=True)
percent_chicago
Out[38]:
% Chicago
Cohort
1 0.375000
2 0.375000
3 0.375000
4 0.181818
5 0.100000
In [39]:
pd.concat([percent_female, percent_chicago], axis=1)
Out[39]:
% Female % Chicago
Cohort
1 0.625000 0.375000
2 0.625000 0.375000
3 0.250000 0.375000
4 0.454545 0.181818
5 0.400000 0.100000
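
A shorter route to the same kind of table relies on the fact that the mean of a True/False column is the share of True values. The sketch below is an alternative to the steps above, not a replacement for them; it runs on made-up rows (sample_df, is_female, and is_chicago are illustrative names).

```python
import pandas as pd

sample_df = pd.DataFrame({
    'Cohort': [1, 1, 1, 1, 2, 2, 2, 2],
    'Gender': ['Female', 'Female', 'Male', 'Male',
               'Female', 'Male', 'Male', 'Male'],
    'City': ['Chicago', 'Abuja', 'Chicago', 'Yangon',
             'Nairobi', 'Chicago', 'Chicago', 'Chicago'],
})

# Build True/False helper columns, then average them within each cohort
shares = (
    sample_df
    .assign(is_female=sample_df['Gender'] == 'Female',
            is_chicago=sample_df['City'] == 'Chicago')
    .groupby('Cohort')[['is_female', 'is_chicago']]
    .mean()
    .rename(columns={'is_female': '% Female', 'is_chicago': '% Chicago'})
)

print(shares)
```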

Bonus: Let's create a Pig Latin descriptor using the first name, last name, and city

In [40]:
def platin(word):
    """Applies pig latin translation to a provided word"""
    if word[0].lower() in ['a','e','i','o','u']:
        return word + 'way'
    else:
        return word[1].upper()+word[2:]+word[0].lower()+'ay'
    
def make_pig_latin_phrase(x):
    """This is an 'apply' function for use with pandas"""
    first, last, city = x #each row arrives as a pandas Series, which unpacks like a tuple
    return platin(first)+' '+platin(last)+' is from '+platin(city)
In [41]:
df[['First','Last','City']].apply(make_pig_latin_phrase,axis=1)
Out[41]:
LastFirst
Abbot, Lucy                        Ucylay Abbotway is from An Franciscosay
Arora, Arpit                         Arpitway Aroraway is from Angalorebay
Crowe, Kevin                           Evinkay Rowecay is from Ilwaukeemay
Elszasz, Justin                    Ustinjay Elszaszway is from Altimorebay
Hersher, Monica                     Onicamay Ersherhay is from Ew Dehlinay
Hudlow, Jonathan                     Onathanjay Udlowhay is from Angkokbay
Jairam, Nesha                          Eshanay Airamjay is from Atlantaway
Lavoe, Francis                          Rancisfay Avoelay is from Accraway
Murray, Kate                             Atekay Urraymay is from Hicagocay
Vega Rodriguez, Juan          Uanjay Ega Rodriguezvay is from Ashingtonway
McAllister, Scott                Cottsay CAllistermay is from Ashingtonway
Jimenez Lara, Daniela       Anieladay Imenez Larajay is from Exico Citymay
Filippou, Georgios                Eorgiosgay Ilippoufay is from Imassollay
Sundberg, Johnna                    Ohnnajay Undbergsay is from Akamegakay
Turlakova, Marina                    Arinamay Urlakovatay is from Ostonbay
Bouacha, Nora                           Oranay Ouachabay is from Hicagocay
Sinclair, Rajiv                       Ajivray Inclairsay is from Hicagocay
Nkusi, Sandra                           Andrasay Kusinay is from Orontotay
Bromberg, Ben                          Enbay Rombergbay is from Ew Yorknay
Williams, Adam                  Adamway Illiamsway is from An Franciscosay
Ghen, Michael                      Ichaelmay Hengay is from Hiladelphiapay
Benard, Claire                          Lairecay Enardbay is from Ondonlay
Liadsky, Daniel                       Anielday Iadskylay is from Orontotay
McGowan, Jim                             Imjay CGowanmay is from Hicagocay
Zander, Keith                           Eithkay Anderzay is from Hicagocay
Oliva-Altamirano, Paola    Aolapay Oliva-Altamiranoway is from Elbournemay
Mitchell, Robert                 Obertray Itchellmay is from Os Angeleslay
Geraghty, Ryan                      Yanray Eraghtygay is from Ittsburghpay
Bunting, Tom                             Omtay Untingbay is from Hicagocay
Alele, Peter                            Eterpay Aleleway is from Airobinay
Doshi, Kruti                             Rutikay Oshiday is from Hicagocay
Jahani, Kosar                     Osarkay Ahanijay is from An Franciscosay
Niksch, Matt                             Attmay Ikschnay is from Hicagocay
Gagne, Elizabeth                   Elizabethway Agnegay is from Ew Yorknay
Gohil, Deepali                         Eepaliday Ohilgay is from Airobinay
Stevens, Matt                      Attmay Tevenssay is from Hiladelphiapay
Tess, Joanna                             Oannajay Esstay is from Hicagocay
Anselmo, Nicki                        Ickinay Anselmoway is from Hicagocay
Aye, Nyi Nyi                             Yi Nyinay Ayeway is from Angonyay
Jimenez, Rebeca Moreno          Ebeca Morenoray Imenezjay is from Enevagay
Lucius, Nick                             Icknay Uciuslay is from Hicagocay
Bowser, William                        Illiamway Owserbay is from Abujaway
Fan, Bonnie                               Onniebay Anfay is from Hicagocay
Ojo, Catherine                      Atherinecay Ojoway is from Arcelonabay
Wei, Dan                               Anday Eiway is from An Franciscosay
dtype: object
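
In the output above, multi-word values such as "San Francisco" are treated as a single word ("An Franciscosay"). If you'd rather translate each word separately, one possible variant is sketched below; platin_word and platin_phrase are hypothetical names, and the slice word[1:2] is used so a one-letter word doesn't raise an IndexError.

```python
def platin_word(word):
    """Pig latin for a single word; assumes word is a non-empty string."""
    if word[0].lower() in 'aeiou':
        return word + 'way'
    # Slicing (word[1:2]) returns '' instead of failing on one-letter words
    return word[1:2].upper() + word[2:] + word[0].lower() + 'ay'


def platin_phrase(phrase):
    """Translate each whitespace-separated word on its own."""
    return ' '.join(platin_word(word) for word in phrase.split())


print(platin_phrase('San Francisco'))
print(platin_phrase('Chicago'))
```

Single-word inputs should come out the same as with the original function, so this could be swapped into make_pig_latin_phrase if desired.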
In [42]:
# A simpler example for apply:
df4 = df.copy()
df4['id'] = df4.index
df4['id'].apply(lambda x: x+' is an index!')
Out[42]:
LastFirst
Abbot, Lucy                            Abbot, Lucy is an index!
Arora, Arpit                          Arora, Arpit is an index!
Crowe, Kevin                          Crowe, Kevin is an index!
Elszasz, Justin                    Elszasz, Justin is an index!
Hersher, Monica                    Hersher, Monica is an index!
Hudlow, Jonathan                  Hudlow, Jonathan is an index!
Jairam, Nesha                        Jairam, Nesha is an index!
Lavoe, Francis                      Lavoe, Francis is an index!
Murray, Kate                          Murray, Kate is an index!
Vega Rodriguez, Juan          Vega Rodriguez, Juan is an index!
McAllister, Scott                McAllister, Scott is an index!
Jimenez Lara, Daniela        Jimenez Lara, Daniela is an index!
Filippou, Georgios              Filippou, Georgios is an index!
Sundberg, Johnna                  Sundberg, Johnna is an index!
Turlakova, Marina                Turlakova, Marina is an index!
Bouacha, Nora                        Bouacha, Nora is an index!
Sinclair, Rajiv                    Sinclair, Rajiv is an index!
Nkusi, Sandra                        Nkusi, Sandra is an index!
Bromberg, Ben                        Bromberg, Ben is an index!
Williams, Adam                      Williams, Adam is an index!
Ghen, Michael                        Ghen, Michael is an index!
Benard, Claire                      Benard, Claire is an index!
Liadsky, Daniel                    Liadsky, Daniel is an index!
McGowan, Jim                          McGowan, Jim is an index!
Zander, Keith                        Zander, Keith is an index!
Oliva-Altamirano, Paola    Oliva-Altamirano, Paola is an index!
Mitchell, Robert                  Mitchell, Robert is an index!
Geraghty, Ryan                      Geraghty, Ryan is an index!
Bunting, Tom                          Bunting, Tom is an index!
Alele, Peter                          Alele, Peter is an index!
Doshi, Kruti                          Doshi, Kruti is an index!
Jahani, Kosar                        Jahani, Kosar is an index!
Niksch, Matt                          Niksch, Matt is an index!
Gagne, Elizabeth                  Gagne, Elizabeth is an index!
Gohil, Deepali                      Gohil, Deepali is an index!
Stevens, Matt                        Stevens, Matt is an index!
Tess, Joanna                          Tess, Joanna is an index!
Anselmo, Nicki                      Anselmo, Nicki is an index!
Aye, Nyi Nyi                          Aye, Nyi Nyi is an index!
Jimenez, Rebeca Moreno      Jimenez, Rebeca Moreno is an index!
Lucius, Nick                          Lucius, Nick is an index!
Bowser, William                    Bowser, William is an index!
Fan, Bonnie                            Fan, Bonnie is an index!
Ojo, Catherine                      Ojo, Catherine is an index!
Wei, Dan                                  Wei, Dan is an index!
Name: id, dtype: object
In [43]:
# An Index has no .apply method, so when working on an index we use "map" instead
df4.index.map(lambda x: x+' is an index!')
Out[43]:
Index(['Abbot, Lucy is an index!', 'Arora, Arpit is an index!',
       'Crowe, Kevin is an index!', 'Elszasz, Justin is an index!',
       'Hersher, Monica is an index!', 'Hudlow, Jonathan is an index!',
       'Jairam, Nesha is an index!', 'Lavoe, Francis is an index!',
       'Murray, Kate is an index!', 'Vega Rodriguez, Juan is an index!',
       'McAllister, Scott is an index!', 'Jimenez Lara, Daniela is an index!',
       'Filippou, Georgios is an index!', 'Sundberg, Johnna is an index!',
       'Turlakova, Marina is an index!', 'Bouacha, Nora is an index!',
       'Sinclair, Rajiv is an index!', 'Nkusi, Sandra is an index!',
       'Bromberg, Ben is an index!', 'Williams, Adam is an index!',
       'Ghen, Michael is an index!', 'Benard, Claire is an index!',
       'Liadsky, Daniel is an index!', 'McGowan, Jim is an index!',
       'Zander, Keith is an index!', 'Oliva-Altamirano, Paola is an index!',
       'Mitchell, Robert is an index!', 'Geraghty, Ryan is an index!',
       'Bunting, Tom is an index!', 'Alele, Peter is an index!',
       'Doshi, Kruti is an index!', 'Jahani, Kosar is an index!',
       'Niksch, Matt is an index!', 'Gagne, Elizabeth is an index!',
       'Gohil, Deepali is an index!', 'Stevens, Matt is an index!',
       'Tess, Joanna is an index!', 'Anselmo, Nicki is an index!',
       'Aye, Nyi Nyi is an index!', 'Jimenez, Rebeca Moreno is an index!',
       'Lucius, Nick is an index!', 'Bowser, William is an index!',
       'Fan, Bonnie is an index!', 'Ojo, Catherine is an index!',
       'Wei, Dan is an index!'],
      dtype='object', name='LastFirst')
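
For a Series, .apply and .map behave the same way when given a plain function, and for simple string work you can often skip both. The sketch below compares the options on a made-up two-item Series (names and idx are illustrative).

```python
import pandas as pd

names = pd.Series(['Abbot, Lucy', 'Arora, Arpit'], name='id')

# Series.apply and Series.map both call the function once per value
applied = names.apply(lambda x: x + ' is an index!')
mapped = names.map(lambda x: x + ' is an index!')

# For simple string work, the + operator already works on the whole Series
vectorized = names + ' is an index!'

print(applied.tolist() == mapped.tolist() == vectorized.tolist())

# An Index offers .map (but not .apply); here it measures each label's length
idx = pd.Index(['Abbot, Lucy', 'Arora, Arpit'], name='LastFirst')
print(idx.map(len).tolist())
```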