Introduction to the Pandas library in Python

By Matt Niksch

Why are we here?

Pandas is the most popular data table library in Python
You can use it to interact with the many data science and machine learning Python tools
Shifting from Excel to Python/pandas can help you move towards automating repeated analyses
Python is a powerful general-purpose scripting language, so pandas can be integrated with many other tools

# This is a Jupyter Notebook file
# The section above this one is called "markdown"
# This section is Python code (although the #'s indicate comments)

Before we get started

If you'd like to follow along:

HTML and Notebook versions of this file are available at https://mattniksch.com/blog/
To run everything locally on your own computer, install Python 3 and then enter the following in a terminal (the text after each # sign is a comment, so you don't need to type it):

pip3 install --upgrade pip #Good idea, but not required

pip3 install pandas #This is the main tool we're talking about today

pip3 install jupyter #If you'd like to run this inside a jupyter notebook; otherwise, you can enter all of these commands inside of IDLE

Note: if you're using Anaconda instead of vanilla Python, you can skip all of that

After all of that is done, launch a Python interactive session in IDLE, or start Jupyter by typing:

jupyter notebook #After you type this, you'll need to start a new Python notebook

Within Jupyter, each cell is either "Markdown" (like this one) or "Code". For either type, press Shift+Enter to run the cell.

Now, on to the actual introduction:

The most common data structure used in pandas is the DataFrame, which you can generally think of as a grid

You can create a DataFrame a few different ways

# Before we start, we need to import the libraries we're working with into local memory
import pandas as pd # everyone shortens the library name this way (to make it easier to type)
import numpy as np # this is a companion library to pandas that is used for some numeric work

# First way to create a dataframe: list of lists for data with an extra list to define columns:
dn_df = pd.DataFrame([['Donald','Duck'],['Mickey','Mouse'],['Minnie','Mouse']],columns=['First','Last'])
dn_df #Most people put _df at the end of their DataFrames as a reminder

type(dn_df)

pandas.core.frame.DataFrame

# Alternatively, with a list of dictionaries:
pd.DataFrame([{'First':'Donald','Last':'Duck'},{'First':'Mickey','Last':'Mouse'},{'First':'Minnie','Last':'Mouse'}])
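A third constructor form that comes up a lot is a dictionary of lists, where each key is a column name. This is a minimal sketch rather than part of the original walkthrough; `characters_df` is just an illustrative name.

```python
import pandas as pd

# Each key becomes a column name; each list holds that column's values.
characters_df = pd.DataFrame({
    'First': ['Donald', 'Mickey', 'Minnie'],
    'Last': ['Duck', 'Mouse', 'Mouse'],
})
print(characters_df.shape)  # (number of rows, number of columns)
```

All of the lists need to be the same length, since each one fills a full column.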

# That's fine for small files or for transforming data you have locally; in most instances, we'll probably start with a file
df_df = pd.read_csv('https://s3.amazonaws.com/mattniksch-python-pandas-intro/data_fellows.csv',encoding='latin1')

# Let's look at how big it is and then look at the top part
print(len(df_df))
df_df.head(2)

45
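If you'd like to try `read_csv` without a network connection, it also accepts any file-like object. The sketch below feeds it a small in-memory string; `csv_text` and `sample_df` are made-up names, and the three columns are a cut-down stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk (only three of the real columns).
csv_text = """First,Last,Cohort
Lucy,Abbot,5
Arpit,Arora,5
"""

# read_csv treats the first line as the header and infers a type for each column.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(len(sample_df))
print(sample_df.dtypes)
```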

df_df.columns

Index(['First', 'Last', 'Gender', 'Organization', 'City', 'Region/State',
       'Country', 'Cohort', 'Track'],
      dtype='object')

# You can use normal Python control structures on the Index object above (it's an "iterable")
for column in df_df.columns:
    print('{} is the name of one of the columns in this DataFrame'.format(column))

First is the name of one of the columns in this DataFrame
Last is the name of one of the columns in this DataFrame
Gender is the name of one of the columns in this DataFrame
Organization is the name of one of the columns in this DataFrame
City is the name of one of the columns in this DataFrame
Region/State is the name of one of the columns in this DataFrame
Country is the name of one of the columns in this DataFrame
Cohort is the name of one of the columns in this DataFrame
Track is the name of one of the columns in this DataFrame

We're going to go through a range of common tasks here, but this site has a nice quick summary of some of the most used functions

# Let's add a LastFirst field to our DataFrame:
df_df['LastFirst'] = df_df['Last']+', '+df_df['First']
df_df.tail()
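The `+` above works element by element across the whole column, so there's no loop to write. As a rough sketch on toy data (the names `names_df`, `with_plus`, and `with_cat` are mine), here is the same operation next to `Series.str.cat`, which does the same join with an explicit separator:

```python
import pandas as pd

names_df = pd.DataFrame({'First': ['Donald', 'Mickey'], 'Last': ['Duck', 'Mouse']})

# The + operator concatenates strings row by row across whole columns...
with_plus = names_df['Last'] + ', ' + names_df['First']

# ...and Series.str.cat joins two string columns with a separator.
with_cat = names_df['Last'].str.cat(names_df['First'], sep=', ')

print(with_plus.tolist())
print(with_cat.tolist())
```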

You can see the index column on the left of the output above. All DataFrames have an index, and the default is to assign integers starting at 0.

You can instead use one of the columns as the index at import, or reassign it later

# Use the index_col argument; here index_col=3 and index_col=['Organization'] are equivalent
# Note that I can chain .tail() to the creation of the DataFrame
# You can do this kind of thing a lot in Python, but sometimes it's clearer to use multiple lines
pd.read_csv('https://s3.amazonaws.com/mattniksch-python-pandas-intro/data_fellows.csv',index_col=3, encoding='latin1').tail()

# Organization is probably a weird choice for index; let's change it to LastFirst
df = df_df.set_index('LastFirst')
df.head()
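One thing worth seeing on a small example: `set_index` returns a new DataFrame and leaves the original alone, and `reset_index` moves the index back into a regular column. This is a sketch on toy data, with `toy_df`, `indexed`, and `restored` as placeholder names.

```python
import pandas as pd

toy_df = pd.DataFrame({'First': ['Donald', 'Mickey'], 'Last': ['Duck', 'Mouse']})

indexed = toy_df.set_index('Last')   # new DataFrame; toy_df itself is unchanged
restored = indexed.reset_index()     # the index becomes an ordinary column again

print(list(toy_df.columns))
print(list(indexed.columns), indexed.index.name)
print(list(restored.columns))
```

Note that the restored column comes back as the first column rather than in its original position.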

Here are some ways to access different slices of your data:

# Entire row:
df.loc['Niksch, Matt']

First                                       Matt
Last                                      Niksch
Gender                                      Male
Organization    Noble Network of Charter Schools
City                                     Chicago
Region/State                            Illinois
Country                            United States
Cohort                                         2
Track                               Data Science
Name: Niksch, Matt, dtype: object

# Entire column:
df['Organization']

LastFirst
Abbot, Lucy                                                           Braven
Arora, Arpit                                                   Pratham Books
Crowe, Kevin                                      Milwaukee Journal Sentinel
Elszasz, Justin                                            City of Baltimore
Hersher, Monica                                                    IDinsight
Hudlow, Jonathan                                  Love Justice International
Jairam, Nesha                 Georgia Division of Family & Children Services
Lavoe, Francis                                                    PEG Africa
Murray, Kate                                                         mRelief
Vega Rodriguez, Juan                                               CentroNia
McAllister, Scott                            General Services Administration
Jimenez Lara, Daniela             Laboratorio Nacional de Políticas Públicas
Filippou, Georgios                                World Vision International
Sundberg, Johnna                                               One Acre Fund
Turlakova, Marina                                                    uAspire
Bouacha, Nora                                             Heartland Alliance
Sinclair, Rajiv                                          Invisible Institute
Nkusi, Sandra                                    Ontario Trillium Foundation
Bromberg, Ben                                             Pencils of Promise
Williams, Adam                                                     Litterati
Ghen, Michael                                            Benefits Data Trust
Benard, Claire                  National Council for Voluntary Organisations
Liadsky, Daniel                                     Canadian Urban Institute
McGowan, Jim                                              American Red Cross
Zander, Keith                                             OneGoal Graduation
Oliva-Altamirano, Paola                                        Our Community
Mitchell, Robert                                      Skid Row Housing Trust
Geraghty, Ryan                                               412 Food Rescue
Bunting, Tom                                                       Ingenuity
Alele, Peter                                                     Vital Signs
Doshi, Kruti                          Cook County Health & Hospitals Systems
Jahani, Kosar                                                     Samasource
Niksch, Matt                                Noble Network of Charter Schools
Gagne, Elizabeth                       New York City Department of Education
Gohil, Deepali                                     Northern Rangelands Trust
Stevens, Matt                                            Benefits Data Trust
Tess, Joanna               Coordination of Healthcare for Complex Kids Pr...
Anselmo, Nicki                    Cesar Chavez Multicultural Academic Center
Aye, Nyi Nyi                                                    Koe Koe Tech
Jimenez, Rebeca Moreno                              UNHCR--UN Refugee Agency
Lucius, Nick                                                 City of Chicago
Bowser, William                                                Hello Tractor
Fan, Bonnie                                        Chicago Transit Authority
Ojo, Catherine                                                   Worldreader
Wei, Dan                                                   Carbon Lighthouse
Name: Organization, dtype: object

Note that the index is carried along with the column

This isn't a simple list, but instead a pandas Series object, which is kind of like the little sibling of a DataFrame

We'll talk about that more in a bit, but as a quick note, here are a few other ways to get a column:

df.Organization #you can skip the brackets if the column name is a valid Python identifier and doesn't clash with a DataFrame attribute or method

df.loc[:, 'Organization'] #Similar to the rest of Python, you can use a : to specify "everything" in a dimension

type(df.loc[:,'Organization'])

pandas.core.series.Series

# Here, I'm first getting a Series and then asking for the value at a specific index
df['Organization']['Niksch, Matt']

'Noble Network of Charter Schools'

# Here, I'm asking for a point in the grid by specifying both axes
df.loc['Niksch, Matt','Gender']

'Male'

# .loc (used above) selects by index and column labels; .iloc selects by integer position in the grid
# In this example, I'm using the : as the second argument to say I want every column
# Note that position 3 is Justin, the 4th person in the table above, because positions are zero-indexed
df.iloc[3,:]

First                      Justin
Last                      Elszasz
Gender                       Male
Organization    City of Baltimore
City                    Baltimore
Region/State             Maryland
Country             United States
Cohort                          5
Track                Data Science
Name: Elszasz, Justin, dtype: object
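One difference between the two accessors that tends to surprise people: a label slice with `.loc` includes both endpoints, while a position slice with `.iloc` excludes the end, like a regular Python list. Here's a small sketch on toy data (the index labels 'a', 'b', 'c' are made up for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {'First': ['Donald', 'Mickey', 'Minnie'], 'Last': ['Duck', 'Mouse', 'Mouse']},
    index=['a', 'b', 'c'],
)

by_label = toy_df.loc['a':'b']    # label slice: both 'a' and 'b' are included
by_position = toy_df.iloc[0:1]    # position slice: stops before position 1

print(len(by_label), len(by_position))
print(toy_df.loc['b', 'First'], toy_df.iloc[1, 0])  # same cell, two ways
```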

In addition to slicing the data, pandas has a number of descriptive functions

df.describe()

# That seems like a pretty weak description for all of those columns, but maybe that's because of the types of data we have
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45 entries, Abbot, Lucy to Wei, Dan
Data columns (total 9 columns):
First           45 non-null object
Last            45 non-null object
Gender          45 non-null object
Organization    45 non-null object
City            45 non-null object
Region/State    45 non-null object
Country         45 non-null object
Cohort          45 non-null int64
Track           45 non-null object
dtypes: int64(1), object(8)
memory usage: 2.7+ KB
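The `.info()` output explains the short summary: only Cohort is numeric. As a sketch on toy data (column values are placeholders), `select_dtypes` shows which columns count as numeric, and passing `include='all'` asks `.describe()` to cover the other columns too:

```python
import pandas as pd

toy_df = pd.DataFrame({'First': ['Donald', 'Mickey'], 'Cohort': [1, 2]})

# Only the numeric columns survive this filter.
numeric_part = toy_df.select_dtypes(include='number')
print(list(numeric_part.columns))

# The default summary matches that filter; include='all' widens it.
print(list(toy_df.describe().columns))
print(list(toy_df.describe(include='all').columns))
```

With `include='all'`, statistics that don't apply to a column (like the mean of a name) show up as NaN.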

# By default, .describe() on a whole DataFrame with mixed types only summarizes the numeric columns
# However, it works on the individual non-numeric Series:
df['Gender'].describe()

count       45
unique       2
top       Male
freq        24
Name: Gender, dtype: object

df['City'].describe()

count          45
unique         26
top       Chicago
freq           12
Name: City, dtype: object
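`.describe()` only reports the single most frequent value. If you want the full breakdown, `value_counts` is the usual companion. A minimal sketch, using a short made-up list of cities rather than the real data:

```python
import pandas as pd

# Toy values for illustration only.
cities = pd.Series(['Chicago', 'Chicago', 'Toronto', 'Nairobi'], name='City')

counts = cities.value_counts()                 # sorted, most frequent first
shares = cities.value_counts(normalize=True)   # fractions instead of counts

print(counts.index[0], counts.iloc[0])
print(shares.iloc[0])
```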

Back to manipulation:

# You can sort:
df2 = df.sort_values(['Cohort','Last', 'First'], ascending=[True,False,False])
df2.index

Index(['Wei, Dan', 'Ojo, Catherine', 'Lucius, Nick', 'Jimenez, Rebeca Moreno',
       'Fan, Bonnie', 'Bowser, William', 'Aye, Nyi Nyi', 'Anselmo, Nicki',
       'Tess, Joanna', 'Stevens, Matt', 'Niksch, Matt', 'Jahani, Kosar',
       'Gohil, Deepali', 'Gagne, Elizabeth', 'Doshi, Kruti', 'Alele, Peter',
       'Zander, Keith', 'Oliva-Altamirano, Paola', 'Mitchell, Robert',
       'McGowan, Jim', 'Liadsky, Daniel', 'Geraghty, Ryan', 'Bunting, Tom',
       'Benard, Claire', 'Williams, Adam', 'Turlakova, Marina',
       'Sundberg, Johnna', 'Sinclair, Rajiv', 'Nkusi, Sandra',
       'McAllister, Scott', 'Jimenez Lara, Daniela', 'Ghen, Michael',
       'Filippou, Georgios', 'Bromberg, Ben', 'Bouacha, Nora',
       'Vega Rodriguez, Juan', 'Murray, Kate', 'Lavoe, Francis',
       'Jairam, Nesha', 'Hudlow, Jonathan', 'Hersher, Monica',
       'Elszasz, Justin', 'Crowe, Kevin', 'Arora, Arpit', 'Abbot, Lucy'],
      dtype='object', name='LastFirst')
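The `ascending` list lines up with the column list position by position. It can be easier to see on three rows; this is a toy sketch, not the real table:

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Cohort': [2, 1, 1],
    'Last': ['Niksch', 'Fan', 'Wei'],
})

# Cohort goes low to high; within a cohort, Last goes Z to A.
sorted_df = toy_df.sort_values(['Cohort', 'Last'], ascending=[True, False])

print(sorted_df['Last'].tolist())
print(toy_df['Last'].tolist())  # the original order is untouched
```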

# You can grab a subset of columns:
df2[['Cohort','Organization']]

# Alternatively, you can drop columns:
df2.drop(['Gender','Organization'], axis=1).head(8)
# Note that this doesn't change df2 unless I set inplace=True as one of the arguments

# or rows:
df_test = df2.copy()
df_test.drop(['Wei, Dan']).head(8)
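To make the "doesn't change the original" point concrete, here's a small sketch on toy data. It also uses the `columns=` keyword, which does the same thing as passing `axis=1`; the index labels 'd' and 'm' are made up.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {'First': ['Donald', 'Mickey'], 'Last': ['Duck', 'Mouse']},
    index=['d', 'm'],
)

without_row = toy_df.drop(['d'])              # drops a row by its index label
without_col = toy_df.drop(columns=['Last'])   # same as drop(['Last'], axis=1)

# toy_df keeps its full shape; the results are new DataFrames.
print(toy_df.shape, without_row.shape, without_col.shape)
```

If you want to keep the result, the most common pattern is simply to assign it to a name, as above.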

# You can match rows based on values:
df2[df2['Region/State']=='California']

# The above looks a little weird. That's because the inner expression is creating a Series
# of True/False values; you can combine these using the & (and) and | (or) operators:
df2[(df2['Cohort']==1) & (df2['First'].str.startswith('N'))]
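It can help to look at the True/False Series on its own before using it as a filter. The sketch below does that on three toy rows, and also shows `~` for "not" and `.isin()` for matching several values; `mask` is just a name I picked.

```python
import pandas as pd

toy_df = pd.DataFrame({'First': ['Nick', 'Bonnie', 'Nora'], 'Cohort': [1, 1, 4]})

# The comparison needs parentheses because & binds more tightly than ==.
mask = (toy_df['Cohort'] == 1) & toy_df['First'].str.startswith('N')
print(mask.tolist())

print(toy_df[mask]['First'].tolist())    # rows where the mask is True
print(toy_df[~mask]['First'].tolist())   # ~ flips the mask
print(len(toy_df[toy_df['Cohort'].isin([1, 4])]))  # match any value in a list
```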

Let's use some of these slicing tricks to do a "Pivot Table" type task

We're going to figure out the % Female and % Chicago for each cohort

g_count = df2[['Cohort','Gender','City']].groupby(['Gender','Cohort']).count()
g_count

# There are a few things happening here, but notably, the groupby() function created a multi-level index
# Let's play with that a little:
g_count.index

MultiIndex(levels=[['Female', 'Male'], [1, 2, 3, 4, 5]],
           labels=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]],
           names=['Gender', 'Cohort'])

g_count.loc['Female']

g_count.loc['Female'].rename(columns={'City':'# Female'})

g_count_f = g_count.loc['Female'].rename(columns={'City':'# Female'}).T
g_count_f

g_count_m = g_count.loc['Male'].rename(columns={'City':'# Male'})
g_count_m

gender_df = pd.concat([g_count_f, g_count_m.T])
gender_df
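The rename, transpose, and concat steps are a good tour of the tools, but pandas can also pivot one level of a multi-level index into columns directly with `unstack`. This is an alternative sketch on toy data, not the approach used in this notebook:

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Cohort': [1, 1, 1, 2, 2],
    'Gender': ['Female', 'Male', 'Female', 'Male', 'Male'],
})

# size() counts rows per (Gender, Cohort) pair and returns a Series
# with a two-level index.
counts = toy_df.groupby(['Gender', 'Cohort']).size()

# unstack moves the Cohort level into columns; missing pairs become 0.
wide = counts.unstack('Cohort', fill_value=0)
print(wide)
```

The result has one row per gender and one column per cohort, the same layout as the concatenated table.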

gender_df.sum()

Cohort
1     8
2     8
3     8
4    11
5    10
dtype: int64

percent_female = gender_df.loc['# Female'] / gender_df.sum()
percent_female.name = '% Female'
percent_female

Cohort
1    0.625000
2    0.625000
3    0.250000
4    0.454545
5    0.400000
Name: % Female, dtype: float64

# Now, let's do % Chicago
df3=df2[['Cohort','City']]
percent_chicago = df3[df3['City']=='Chicago'].groupby(['Cohort']).count()/df3.groupby(['Cohort']).count()
percent_chicago.rename(columns={'City':'% Chicago'},inplace=True)
percent_chicago

pd.concat([percent_female, percent_chicago], axis=1)
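For percentages specifically, there's a shortcut worth knowing: the mean of a True/False Series is the fraction of True values. The sketch below uses that on toy data (not the fellows file) to build both columns in one step; `summary` is a placeholder name.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Cohort': [1, 1, 2, 2],
    'Gender': ['Female', 'Male', 'Female', 'Female'],
    'City': ['Toronto', 'Toronto', 'Chicago', 'Chicago'],
})

# Build a True/False Series, group it by cohort, and average it.
summary = pd.DataFrame({
    '% Female': (toy_df['Gender'] == 'Female').groupby(toy_df['Cohort']).mean(),
    '% Chicago': (toy_df['City'] == 'Chicago').groupby(toy_df['Cohort']).mean(),
})
print(summary)
```

One side effect of this version is that a cohort with nobody from Chicago shows up as 0.0 instead of a missing value, because no rows are filtered out before grouping.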

Bonus: Let's create a Pig Latin descriptor using the First, Last, and City columns

def platin(word):
    """Applies pig latin translation to a provided word"""
    if word[0].lower() in ['a','e','i','o','u']:
        return word + 'way'
    else:
        return word[1].upper()+word[2:]+word[0].lower()+'ay'
    
def make_pig_latin_phrase(x):
    """This is an 'apply' function for use with pandas"""
    first, last, city = x #each row arrives as a pandas Series; unpacking it yields its three values
    return platin(first)+' '+platin(last)+' is from '+platin(city)

df[['First','Last','City']].apply(make_pig_latin_phrase,axis=1)

LastFirst
Abbot, Lucy                        Ucylay Abbotway is from An Franciscosay
Arora, Arpit                         Arpitway Aroraway is from Angalorebay
Crowe, Kevin                           Evinkay Rowecay is from Ilwaukeemay
Elszasz, Justin                    Ustinjay Elszaszway is from Altimorebay
Hersher, Monica                     Onicamay Ersherhay is from Ew Dehlinay
Hudlow, Jonathan                     Onathanjay Udlowhay is from Angkokbay
Jairam, Nesha                          Eshanay Airamjay is from Atlantaway
Lavoe, Francis                          Rancisfay Avoelay is from Accraway
Murray, Kate                             Atekay Urraymay is from Hicagocay
Vega Rodriguez, Juan          Uanjay Ega Rodriguezvay is from Ashingtonway
McAllister, Scott                Cottsay CAllistermay is from Ashingtonway
Jimenez Lara, Daniela       Anieladay Imenez Larajay is from Exico Citymay
Filippou, Georgios                Eorgiosgay Ilippoufay is from Imassollay
Sundberg, Johnna                    Ohnnajay Undbergsay is from Akamegakay
Turlakova, Marina                    Arinamay Urlakovatay is from Ostonbay
Bouacha, Nora                           Oranay Ouachabay is from Hicagocay
Sinclair, Rajiv                       Ajivray Inclairsay is from Hicagocay
Nkusi, Sandra                           Andrasay Kusinay is from Orontotay
Bromberg, Ben                          Enbay Rombergbay is from Ew Yorknay
Williams, Adam                  Adamway Illiamsway is from An Franciscosay
Ghen, Michael                      Ichaelmay Hengay is from Hiladelphiapay
Benard, Claire                          Lairecay Enardbay is from Ondonlay
Liadsky, Daniel                       Anielday Iadskylay is from Orontotay
McGowan, Jim                             Imjay CGowanmay is from Hicagocay
Zander, Keith                           Eithkay Anderzay is from Hicagocay
Oliva-Altamirano, Paola    Aolapay Oliva-Altamiranoway is from Elbournemay
Mitchell, Robert                 Obertray Itchellmay is from Os Angeleslay
Geraghty, Ryan                      Yanray Eraghtygay is from Ittsburghpay
Bunting, Tom                             Omtay Untingbay is from Hicagocay
Alele, Peter                            Eterpay Aleleway is from Airobinay
Doshi, Kruti                             Rutikay Oshiday is from Hicagocay
Jahani, Kosar                     Osarkay Ahanijay is from An Franciscosay
Niksch, Matt                             Attmay Ikschnay is from Hicagocay
Gagne, Elizabeth                   Elizabethway Agnegay is from Ew Yorknay
Gohil, Deepali                         Eepaliday Ohilgay is from Airobinay
Stevens, Matt                      Attmay Tevenssay is from Hiladelphiapay
Tess, Joanna                             Oannajay Esstay is from Hicagocay
Anselmo, Nicki                        Ickinay Anselmoway is from Hicagocay
Aye, Nyi Nyi                             Yi Nyinay Ayeway is from Angonyay
Jimenez, Rebeca Moreno          Ebeca Morenoray Imenezjay is from Enevagay
Lucius, Nick                             Icknay Uciuslay is from Hicagocay
Bowser, William                        Illiamway Owserbay is from Abujaway
Fan, Bonnie                               Onniebay Anfay is from Hicagocay
Ojo, Catherine                      Atherinecay Ojoway is from Arcelonabay
Wei, Dan                               Anday Eiway is from An Franciscosay
dtype: object
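With `axis=1`, the function is called once per row, and each row is passed in as a Series whose labels are the column names. A small sketch to confirm that on toy data; `describe_row` is a hypothetical helper written just for this example.

```python
import pandas as pd

toy_df = pd.DataFrame({'First': ['Donald', 'Mickey'], 'Last': ['Duck', 'Mouse']})

def describe_row(row):
    """Hypothetical helper: build a label from one row and report its type."""
    # Values can be looked up by column name instead of unpacked by position.
    return '{} {} ({})'.format(row['First'], row['Last'], type(row).__name__)

labels = toy_df.apply(describe_row, axis=1)
print(labels.tolist())
```

Looking values up by name keeps the function working even if the column order changes.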

# A simpler example for apply:
df4 = df.copy()
df4['id'] = df4.index
df4['id'].apply(lambda x: x+' is an index!')

LastFirst
Abbot, Lucy                            Abbot, Lucy is an index!
Arora, Arpit                          Arora, Arpit is an index!
Crowe, Kevin                          Crowe, Kevin is an index!
Elszasz, Justin                    Elszasz, Justin is an index!
Hersher, Monica                    Hersher, Monica is an index!
Hudlow, Jonathan                  Hudlow, Jonathan is an index!
Jairam, Nesha                        Jairam, Nesha is an index!
Lavoe, Francis                      Lavoe, Francis is an index!
Murray, Kate                          Murray, Kate is an index!
Vega Rodriguez, Juan          Vega Rodriguez, Juan is an index!
McAllister, Scott                McAllister, Scott is an index!
Jimenez Lara, Daniela        Jimenez Lara, Daniela is an index!
Filippou, Georgios              Filippou, Georgios is an index!
Sundberg, Johnna                  Sundberg, Johnna is an index!
Turlakova, Marina                Turlakova, Marina is an index!
Bouacha, Nora                        Bouacha, Nora is an index!
Sinclair, Rajiv                    Sinclair, Rajiv is an index!
Nkusi, Sandra                        Nkusi, Sandra is an index!
Bromberg, Ben                        Bromberg, Ben is an index!
Williams, Adam                      Williams, Adam is an index!
Ghen, Michael                        Ghen, Michael is an index!
Benard, Claire                      Benard, Claire is an index!
Liadsky, Daniel                    Liadsky, Daniel is an index!
McGowan, Jim                          McGowan, Jim is an index!
Zander, Keith                        Zander, Keith is an index!
Oliva-Altamirano, Paola    Oliva-Altamirano, Paola is an index!
Mitchell, Robert                  Mitchell, Robert is an index!
Geraghty, Ryan                      Geraghty, Ryan is an index!
Bunting, Tom                          Bunting, Tom is an index!
Alele, Peter                          Alele, Peter is an index!
Doshi, Kruti                          Doshi, Kruti is an index!
Jahani, Kosar                        Jahani, Kosar is an index!
Niksch, Matt                          Niksch, Matt is an index!
Gagne, Elizabeth                  Gagne, Elizabeth is an index!
Gohil, Deepali                      Gohil, Deepali is an index!
Stevens, Matt                        Stevens, Matt is an index!
Tess, Joanna                          Tess, Joanna is an index!
Anselmo, Nicki                      Anselmo, Nicki is an index!
Aye, Nyi Nyi                          Aye, Nyi Nyi is an index!
Jimenez, Rebeca Moreno      Jimenez, Rebeca Moreno is an index!
Lucius, Nick                          Lucius, Nick is an index!
Bowser, William                    Bowser, William is an index!
Fan, Bonnie                            Fan, Bonnie is an index!
Ojo, Catherine                      Ojo, Catherine is an index!
Wei, Dan                                  Wei, Dan is an index!
Name: id, dtype: object

# When working on an index, we use "map"
df4.index.map(lambda x: x+' is an index!')

Index(['Abbot, Lucy is an index!', 'Arora, Arpit is an index!',
       'Crowe, Kevin is an index!', 'Elszasz, Justin is an index!',
       'Hersher, Monica is an index!', 'Hudlow, Jonathan is an index!',
       'Jairam, Nesha is an index!', 'Lavoe, Francis is an index!',
       'Murray, Kate is an index!', 'Vega Rodriguez, Juan is an index!',
       'McAllister, Scott is an index!', 'Jimenez Lara, Daniela is an index!',
       'Filippou, Georgios is an index!', 'Sundberg, Johnna is an index!',
       'Turlakova, Marina is an index!', 'Bouacha, Nora is an index!',
       'Sinclair, Rajiv is an index!', 'Nkusi, Sandra is an index!',
       'Bromberg, Ben is an index!', 'Williams, Adam is an index!',
       'Ghen, Michael is an index!', 'Benard, Claire is an index!',
       'Liadsky, Daniel is an index!', 'McGowan, Jim is an index!',
       'Zander, Keith is an index!', 'Oliva-Altamirano, Paola is an index!',
       'Mitchell, Robert is an index!', 'Geraghty, Ryan is an index!',
       'Bunting, Tom is an index!', 'Alele, Peter is an index!',
       'Doshi, Kruti is an index!', 'Jahani, Kosar is an index!',
       'Niksch, Matt is an index!', 'Gagne, Elizabeth is an index!',
       'Gohil, Deepali is an index!', 'Stevens, Matt is an index!',
       'Tess, Joanna is an index!', 'Anselmo, Nicki is an index!',
       'Aye, Nyi Nyi is an index!', 'Jimenez, Rebeca Moreno is an index!',
       'Lucius, Nick is an index!', 'Bowser, William is an index!',
       'Fan, Bonnie is an index!', 'Ojo, Catherine is an index!',
       'Wei, Dan is an index!'],
      dtype='object', name='LastFirst')
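For something as simple as adding a suffix, you often don't need a function at all, because `+` with a single string is applied to every element. Here's a sketch comparing the two on a toy index (the two labels are placeholders):

```python
import pandas as pd

idx = pd.Index(['Duck, Donald', 'Mouse, Mickey'], name='LastFirst')

mapped = idx.map(lambda x: x + ' is an index!')   # one function call per label
vectorized = idx + ' is an index!'                # same result without a lambda

print(list(mapped))
print(list(vectorized))
```

`map` (or `apply` on a Series) is still the tool to reach for when the logic doesn't fit a built-in operator.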

Output of df.describe():
	Cohort
count	45.000000
mean	3.155556
std	1.429487
min	1.000000
25%	2.000000
50%	3.000000
75%	4.000000
max	5.000000

Output of percent_chicago:
	% Chicago
Cohort
1	0.375000
2	0.375000
3	0.375000
4	0.181818
5	0.100000

Output of df_df.head(2):
	First	Last	Gender	Organization	City	Region/State	Country	Cohort	Track
0	Lucy	Abbot	Female	Braven	San Francisco	California	United States	5	Data Science
1	Arpit	Arora	Male	Pratham Books	Bangalore	Karnataka	India	5	Data Science

Output of df_df.tail() after adding LastFirst:
	First	Last	Gender	Organization	City	Region/State	Country	Cohort	Track	LastFirst
40	Nick	Lucius	Male	City of Chicago	Chicago	Illinois	United States	1	Data Science	Lucius, Nick
41	William	Bowser	Male	Hello Tractor	Abuja	FCT	Nigeria	1	Data Science	Bowser, William
42	Bonnie	Fan	Female	Chicago Transit Authority	Chicago	Illinois	United States	1	Data Science	Fan, Bonnie
43	Catherine	Ojo	Female	Worldreader	Barcelona	Catalonia	Spain	1	Data Science	Ojo, Catherine
44	Dan	Wei	Female	Carbon Lighthouse	San Francisco	California	United States	1	Data Science	Wei, Dan

Output of pd.read_csv(..., index_col=3).tail():
	First	Last	Gender	City	Region/State	Country	Cohort	Track
Organization
City of Chicago	Nick	Lucius	Male	Chicago	Illinois	United States	1	Data Science
Hello Tractor	William	Bowser	Male	Abuja	FCT	Nigeria	1	Data Science
Chicago Transit Authority	Bonnie	Fan	Female	Chicago	Illinois	United States	1	Data Science
Worldreader	Catherine	Ojo	Female	Barcelona	Catalonia	Spain	1	Data Science
Carbon Lighthouse	Dan	Wei	Female	San Francisco	California	United States	1	Data Science

Output of df.head() after set_index('LastFirst'):
	First	Last	Gender	Organization	City	Region/State	Country	Cohort	Track
LastFirst
Abbot, Lucy	Lucy	Abbot	Female	Braven	San Francisco	California	United States	5	Data Science
Arora, Arpit	Arpit	Arora	Male	Pratham Books	Bangalore	Karnataka	India	5	Data Science
Crowe, Kevin	Kevin	Crowe	Male	Milwaukee Journal Sentinel	Milwaukee	Wisconsin	United States	5	Data Science
Elszasz, Justin	Justin	Elszasz	Male	City of Baltimore	Baltimore	Maryland	United States	5	Data Science
Hersher, Monica	Monica	Hersher	Female	IDinsight	New Dehli	National Capital Territory	India	5	Data Science

Output of pd.concat([percent_female, percent_chicago], axis=1):
	% Female	% Chicago
Cohort
1	0.625000	0.375000
2	0.625000	0.375000
3	0.250000	0.375000
4	0.454545	0.181818
5	0.400000	0.100000

Output of dn_df (list of lists):
	First	Last
0	Donald	Duck
1	Mickey	Mouse
2	Minnie	Mouse

Output of the list-of-dictionaries DataFrame:
	First	Last
0	Donald	Duck
1	Mickey	Mouse
2	Minnie	Mouse