E2. To be completed after lesson 10

Data set download

[2]:

import os

import pandas as pd

Exercise 2.1

In the lesson exercise, we will again work with a subset of the Palmer penguin data set. I will load it and view it now.

[3]:

# data_path is assumed to have been defined earlier (its definition is not shown here)
df = pd.read_csv(os.path.join(data_path, "penguins_subset.csv"), header=[0, 1])

df.head()

[3]:

	Gentoo				Adelie				Chinstrap
	bill_depth_mm	bill_length_mm	flipper_length_mm	body_mass_g	bill_depth_mm	bill_length_mm	flipper_length_mm	body_mass_g	bill_depth_mm	bill_length_mm	flipper_length_mm	body_mass_g
0	16.3	48.4	220.0	5400.0	18.5	36.8	193.0	3500.0	18.3	47.6	195.0	3850.0
1	15.8	46.3	215.0	5050.0	16.9	37.0	185.0	3000.0	16.7	42.5	187.0	3350.0
2	14.2	47.5	209.0	4600.0	19.5	42.0	200.0	4050.0	16.6	40.9	187.0	3200.0
3	15.7	48.7	208.0	5350.0	18.3	42.7	196.0	4075.0	20.0	52.8	205.0	4550.0
4	14.1	48.7	210.0	4450.0	18.0	35.7	202.0	3550.0	18.7	45.4	188.0	3525.0

Explain in words what each of the following code cells does as we work toward tidying this data frame. For each cell, I show the top of the data frame.

[4]:

df.columns.names = ['species', 'quantity']

df.head()

[4]:

species	Gentoo				Adelie				Chinstrap
quantity	bill_depth_mm	bill_length_mm	flipper_length_mm	body_mass_g	bill_depth_mm	bill_length_mm	flipper_length_mm	body_mass_g	bill_depth_mm	bill_length_mm	flipper_length_mm	body_mass_g
0	16.3	48.4	220.0	5400.0	18.5	36.8	193.0	3500.0	18.3	47.6	195.0	3850.0
1	15.8	46.3	215.0	5050.0	16.9	37.0	185.0	3000.0	16.7	42.5	187.0	3350.0
2	14.2	47.5	209.0	4600.0	19.5	42.0	200.0	4050.0	16.6	40.9	187.0	3200.0
3	15.7	48.7	208.0	5350.0	18.3	42.7	196.0	4075.0	20.0	52.8	205.0	4550.0
4	14.1	48.7	210.0	4450.0	18.0	35.7	202.0	3550.0	18.7	45.4	188.0	3525.0

[5]:

df = df.stack(level='species')

df.head()

[5]:

	quantity	bill_depth_mm	bill_length_mm	body_mass_g	flipper_length_mm
	species
0	Adelie	18.5	36.8	3500.0	193.0
	Chinstrap	18.3	47.6	3850.0	195.0
	Gentoo	16.3	48.4	5400.0	220.0
1	Adelie	16.9	37.0	3000.0	185.0
	Chinstrap	16.7	42.5	3350.0	187.0

[6]:

df = df.reset_index(level='species')

df.head()

[6]:

quantity	species	bill_depth_mm	bill_length_mm	body_mass_g	flipper_length_mm
0	Adelie	18.5	36.8	3500.0	193.0
0	Chinstrap	18.3	47.6	3850.0	195.0
0	Gentoo	16.3	48.4	5400.0	220.0
1	Adelie	16.9	37.0	3000.0	185.0
1	Chinstrap	16.7	42.5	3350.0	187.0

[7]:

df = df.reset_index(drop=True)

df.head()

[7]:

quantity	species	bill_depth_mm	bill_length_mm	body_mass_g	flipper_length_mm
0	Adelie	18.5	36.8	3500.0	193.0
1	Chinstrap	18.3	47.6	3850.0	195.0
2	Gentoo	16.3	48.4	5400.0	220.0
3	Adelie	16.9	37.0	3000.0	185.0
4	Chinstrap	16.7	42.5	3350.0	187.0

[8]:

df.columns.name = None

df.head()

[8]:

	species	bill_depth_mm	bill_length_mm	body_mass_g	flipper_length_mm
0	Adelie	18.5	36.8	3500.0	193.0
1	Chinstrap	18.3	47.6	3850.0	195.0
2	Gentoo	16.3	48.4	5400.0	220.0
3	Adelie	16.9	37.0	3000.0	185.0
4	Chinstrap	16.7	42.5	3350.0	187.0
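To experiment with these steps away from the real data, it may help to rerun them on a tiny data frame. The sketch below is only an illustration: the values and the two-species, two-quantity layout are made up, and the names `wide` and `tidy` are hypothetical. It chains the same calls used in the cells above.

```python
import pandas as pd

# Hypothetical toy data with the same two-level column layout (species, quantity)
columns = pd.MultiIndex.from_product(
    [["Gentoo", "Adelie"], ["bill_depth_mm", "body_mass_g"]],
    names=["species", "quantity"],
)
wide = pd.DataFrame(
    [[16.3, 5400.0, 18.5, 3500.0], [15.8, 5050.0, 16.9, 3000.0]],
    columns=columns,
)

# Same sequence of calls as in the cells above, chained together
tidy = (
    wide.stack(level="species")
    .reset_index(level="species")
    .reset_index(drop=True)
)
tidy.columns.name = None

print(tidy)
```

Printing the intermediate result after each call, rather than only the final `tidy`, is a useful way to check your written explanation of each step.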

Exercise 2.2

What is the difference between merging and concatenating data frames?
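As a starting point for thinking about this question, the sketch below applies `pd.concat()` and `pd.merge()` to small made-up data frames. The column name `penguin_id` and all values are hypothetical and are not part of the penguin data set used above.

```python
import pandas as pd

# Hypothetical toy frames; column names and values are made up for illustration
measurements = pd.DataFrame(
    {"penguin_id": [1, 2, 3], "body_mass_g": [5400.0, 3500.0, 3850.0]}
)
more_measurements = pd.DataFrame(
    {"penguin_id": [4, 5], "body_mass_g": [5050.0, 3000.0]}
)
species = pd.DataFrame(
    {"penguin_id": [1, 2, 3], "species": ["Gentoo", "Adelie", "Chinstrap"]}
)

# Two frames with the same columns, combined with concat
stacked = pd.concat([measurements, more_measurements], ignore_index=True)

# Two frames that share a key column, combined with merge
merged = pd.merge(measurements, species, on="penguin_id")

print(stacked)
print(merged)
```

Comparing the shapes and columns of `stacked` and `merged` with those of the inputs may help when putting the difference into words.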

Exercise 2.3

Describe the difference between categorical and quantitative variables. How are they fundamentally different in the way we plot them?

Exercise 2.4

Give pros and cons of using a histogram to display repeated measurements. Then give pros and cons of using an empirical cumulative distribution function (ECDF).
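To make the comparison concrete, the following sketch computes both summaries for a handful of made-up measurements using NumPy. The values and the choice of three bins are assumptions for illustration only, and this is just one common way to compute an ECDF.

```python
import numpy as np

# Made-up values standing in for repeated measurements
x = np.array([3500.0, 3000.0, 4050.0, 4075.0, 3550.0])

# ECDF: for each sorted value, the fraction of measurements less than or equal to it
x_sorted = np.sort(x)
ecdf = np.arange(1, len(x_sorted) + 1) / len(x_sorted)

# Histogram: number of measurements falling in each bin; the result depends on the bins chosen
counts, bin_edges = np.histogram(x, bins=3)

print(x_sorted, ecdf)
print(counts, bin_edges)
```

Rerunning the histogram line with a different `bins` value while leaving the ECDF untouched is one way to explore the trade-offs the exercise asks about.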

Exercise 2.5

Write down any questions or points of confusion that you have.

Computing environment

[9]:

%load_ext watermark
%watermark -v -p pandas,jupyterlab

Python implementation: CPython
Python version       : 3.9.13
IPython version      : 8.4.0

pandas    : 1.4.3
jupyterlab: 3.4.4