E2. To be completed after lesson 10
[2]:
import os

import pandas as pd
Exercise 2.1
In this exercise, we again work with a subset of the Palmer penguins data set. I load it and view its first few rows now.
[3]:
df = pd.read_csv(os.path.join(data_path, "penguins_subset.csv"), header=[0, 1])
df.head()
[3]:
 | Gentoo | Gentoo | Gentoo | Gentoo | Adelie | Adelie | Adelie | Adelie | Chinstrap | Chinstrap | Chinstrap | Chinstrap |
---|---|---|---|---|---|---|---|---|---|---|---|---|
 | bill_depth_mm | bill_length_mm | flipper_length_mm | body_mass_g | bill_depth_mm | bill_length_mm | flipper_length_mm | body_mass_g | bill_depth_mm | bill_length_mm | flipper_length_mm | body_mass_g |
0 | 16.3 | 48.4 | 220.0 | 5400.0 | 18.5 | 36.8 | 193.0 | 3500.0 | 18.3 | 47.6 | 195.0 | 3850.0 |
1 | 15.8 | 46.3 | 215.0 | 5050.0 | 16.9 | 37.0 | 185.0 | 3000.0 | 16.7 | 42.5 | 187.0 | 3350.0 |
2 | 14.2 | 47.5 | 209.0 | 4600.0 | 19.5 | 42.0 | 200.0 | 4050.0 | 16.6 | 40.9 | 187.0 | 3200.0 |
3 | 15.7 | 48.7 | 208.0 | 5350.0 | 18.3 | 42.7 | 196.0 | 4075.0 | 20.0 | 52.8 | 205.0 | 4550.0 |
4 | 14.1 | 48.7 | 210.0 | 4450.0 | 18.0 | 35.7 | 202.0 | 3550.0 | 18.7 | 45.4 | 188.0 | 3525.0 |
Explain in words what each of the following code cells does as we work toward tidying this data frame. For each cell, I show the top of the data frame.
[4]:
df.columns.names = ['species', 'quantity']
df.head()
[4]:
species | Gentoo | Gentoo | Gentoo | Gentoo | Adelie | Adelie | Adelie | Adelie | Chinstrap | Chinstrap | Chinstrap | Chinstrap |
---|---|---|---|---|---|---|---|---|---|---|---|---|
quantity | bill_depth_mm | bill_length_mm | flipper_length_mm | body_mass_g | bill_depth_mm | bill_length_mm | flipper_length_mm | body_mass_g | bill_depth_mm | bill_length_mm | flipper_length_mm | body_mass_g |
0 | 16.3 | 48.4 | 220.0 | 5400.0 | 18.5 | 36.8 | 193.0 | 3500.0 | 18.3 | 47.6 | 195.0 | 3850.0 |
1 | 15.8 | 46.3 | 215.0 | 5050.0 | 16.9 | 37.0 | 185.0 | 3000.0 | 16.7 | 42.5 | 187.0 | 3350.0 |
2 | 14.2 | 47.5 | 209.0 | 4600.0 | 19.5 | 42.0 | 200.0 | 4050.0 | 16.6 | 40.9 | 187.0 | 3200.0 |
3 | 15.7 | 48.7 | 208.0 | 5350.0 | 18.3 | 42.7 | 196.0 | 4075.0 | 20.0 | 52.8 | 205.0 | 4550.0 |
4 | 14.1 | 48.7 | 210.0 | 4450.0 | 18.0 | 35.7 | 202.0 | 3550.0 | 18.7 | 45.4 | 188.0 | 3525.0 |
[5]:
df = df.stack(level='species')
df.head()
[5]:
 | quantity | bill_depth_mm | bill_length_mm | body_mass_g | flipper_length_mm |
---|---|---|---|---|---|
 | species | | | | |
0 | Adelie | 18.5 | 36.8 | 3500.0 | 193.0 |
0 | Chinstrap | 18.3 | 47.6 | 3850.0 | 195.0 |
0 | Gentoo | 16.3 | 48.4 | 5400.0 | 220.0 |
1 | Adelie | 16.9 | 37.0 | 3000.0 | 185.0 |
1 | Chinstrap | 16.7 | 42.5 | 3350.0 | 187.0 |
[6]:
df = df.reset_index(level='species')
df.head()
[6]:
quantity | species | bill_depth_mm | bill_length_mm | body_mass_g | flipper_length_mm |
---|---|---|---|---|---|
0 | Adelie | 18.5 | 36.8 | 3500.0 | 193.0 |
0 | Chinstrap | 18.3 | 47.6 | 3850.0 | 195.0 |
0 | Gentoo | 16.3 | 48.4 | 5400.0 | 220.0 |
1 | Adelie | 16.9 | 37.0 | 3000.0 | 185.0 |
1 | Chinstrap | 16.7 | 42.5 | 3350.0 | 187.0 |
[7]:
df = df.reset_index(drop=True)
df.head()
[7]:
quantity | species | bill_depth_mm | bill_length_mm | body_mass_g | flipper_length_mm |
---|---|---|---|---|---|
0 | Adelie | 18.5 | 36.8 | 3500.0 | 193.0 |
1 | Chinstrap | 18.3 | 47.6 | 3850.0 | 195.0 |
2 | Gentoo | 16.3 | 48.4 | 5400.0 | 220.0 |
3 | Adelie | 16.9 | 37.0 | 3000.0 | 185.0 |
4 | Chinstrap | 16.7 | 42.5 | 3350.0 | 187.0 |
[8]:
df.columns.name = None
df.head()
[8]:
 | species | bill_depth_mm | bill_length_mm | body_mass_g | flipper_length_mm |
---|---|---|---|---|---|
0 | Adelie | 18.5 | 36.8 | 3500.0 | 193.0 |
1 | Chinstrap | 18.3 | 47.6 | 3850.0 | 195.0 |
2 | Gentoo | 16.3 | 48.4 | 5400.0 | 220.0 |
3 | Adelie | 16.9 | 37.0 | 3000.0 | 185.0 |
4 | Chinstrap | 16.7 | 42.5 | 3350.0 | 187.0 |
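The cells above depend on a local CSV file, so a small stand-in may help if you want to inspect each intermediate result yourself. The sketch below is not part of the original exercise: the frame `toy` and its values are a hypothetical miniature with the same two-level column layout, and the steps mirror cells [5] through [8].

```python
import pandas as pd

# Hypothetical miniature of the penguin table: two species, two quantities.
columns = pd.MultiIndex.from_product(
    [["Gentoo", "Adelie"], ["bill_depth_mm", "body_mass_g"]],
    names=["species", "quantity"],
)
toy = pd.DataFrame(
    [[16.3, 5400.0, 18.5, 3500.0],
     [15.8, 5050.0, 16.9, 3000.0]],
    columns=columns,
)

# Same sequence of calls as cells [5] to [7].
stacked = toy.stack(level="species")            # species moves from the columns to the row index
with_species = stacked.reset_index(level="species")  # species becomes an ordinary column
tidy = with_species.reset_index(drop=True)      # old row labels are discarded

# Same call as cell [8]: clear the label attached to the column index.
tidy.columns.name = None

print(tidy)
```

Printing `stacked` and `with_species` separately shows what changes at each step. Recent pandas releases may emit a `FutureWarning` for `stack`, and the row order can differ from the output shown for pandas 1.4.3.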
Exercise 2.2
What is the difference between merging and concatenating data frames?
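For experimenting while you answer, here is a minimal sketch that applies `pd.concat` and `pd.merge` to the same pair of frames. The frames `left` and `right`, the `id` key, and all values are hypothetical and do not come from the penguin data.

```python
import pandas as pd

# Hypothetical toy frames that share an "id" column.
left = pd.DataFrame({"id": [1, 2, 3], "mass_g": [3500.0, 3850.0, 5400.0]})
right = pd.DataFrame({"id": [2, 3, 4], "species": ["Chinstrap", "Gentoo", "Adelie"]})

# concat places the frames one after the other along the row axis;
# a column missing from one frame is filled with NaN for that frame's rows.
stacked = pd.concat([left, right], ignore_index=True)

# merge pairs up rows whose "id" values match; the default inner join
# keeps only ids that appear in both frames.
merged = pd.merge(left, right, on="id")

print(stacked)
print(merged)
```

Comparing the shapes of `stacked` and `merged` is a useful starting point for the written answer.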
Exercise 2.3
Describe the difference between categorical and quantitative variables. How are they fundamentally different in the way we plot them?
Exercise 2.4
Give pros and cons of using a histogram to display repeated measurements. Then give pros and cons of using an ECDF.
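As a reminder of what an ECDF contains, the following sketch computes one in plain Python. The helper `ecdf` is a hypothetical name rather than a function from the lessons; the input values are the five Gentoo flipper lengths displayed in Exercise 2.1.

```python
def ecdf(values):
    """Return the sorted values and, for each, the fraction of points <= it."""
    x = sorted(values)
    n = len(x)
    # The i-th smallest point (counting from 1) sits at height i / n.
    y = [(i + 1) / n for i in range(n)]
    return x, y


# Flipper lengths in mm from the first five Gentoo rows shown earlier.
flipper_length_mm = [220.0, 215.0, 209.0, 208.0, 210.0]
x, y = ecdf(flipper_length_mm)

for xi, yi in zip(x, y):
    print(xi, yi)
```

Every measurement appears as its own point and no bin width has to be chosen, which is worth keeping in mind when weighing the two kinds of plot.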
Exercise 2.5
Write down any questions or points of confusion that you have.
Computing environment
[9]:
%load_ext watermark
%watermark -v -p pandas,jupyterlab
Python implementation: CPython
Python version : 3.9.13
IPython version : 8.4.0
pandas : 1.4.3
jupyterlab: 3.4.4