SpreadSheet Munging Strategies in Python - Small Multiples

SpreadSheet Munging Strategies in Python - Small Multiples#

Small Multiples#

updated : April 14, 2022

This is part of a series of blog posts about extracting data from spreadsheets using Python. It is based on the book written by Duncan Garmonsway, which was written primarily for R users. LInks to the other posts are on the homepage.

Small multiples refer to mini tables embedded in a spreadsheet, or multiple spreadsheets. Ideally, this tables should be lumped into one dataframe for meaningful analysis. The examples below show different scenarios and how we can reshape the data

Case 1 : Small Multiples with all Headers Present for Each Multiple#

In this spreadsheet, each table is a separate subject. It would be better to aggregate all the subjects and underlying data into one table.

import pandas as pd
import janitor
import numpy as np

excel_file = pd.ExcelFile("Data_files/worked-examples.xlsx", engine='openpyxl')

df = excel_file.parse(sheet_name="small-multiples", header=None)
df

	0	1	2	3	4	5	6
0	Classics	NaN	NaN	NaN	History	NaN	NaN
1	Name	Score	Grade	NaN	Name	Score	Grade
2	Matilda	1	F	NaN	Matilda	3	D
3	Olivia	2	D	NaN	Olivia	4	C
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	Music	NaN	NaN	NaN	Drama	NaN	NaN
6	Name	Score	Grade	NaN	Name	Score	Grade
7	Matilda	5	B	NaN	Matilda	7	A
8	Olivia	6	B	NaN	Olivia	8	A

Observations :

There is a completely empty column that splits the tables, and a completely empty row as well. We’ll use the coordinates in our data reshaping
For each table, the subject is directly above. We’ll use the empty cells adjacent to it as a criteria to create a subject column

#get the first three columns before the completely null column
before_null = df.iloc[:,:3]
#get the columns after the completely null column
after_null =  df.iloc[:,4:].set_axis([0,1,2],axis=1)

(pd          
.concat((before_null, after_null))
.set_axis(['Name','Score','Grade'],axis=1)
.query('Name != "Name"')
.dropna(subset=['Name'])
.assign(subject = lambda x: np.where(x.Score.isna(), 
                                        x.Name, 
                                        np.nan)
        )
.assign(subject = lambda df: df.subject.ffill())
.dropna()
.reset_index(drop = True)
)

	Name	Score	Grade	subject
0	Matilda	1	F	Classics
1	Olivia	2	D	Classics
2	Matilda	5	B	Music
3	Olivia	6	B	Music
4	Matilda	3	D	History
5	Olivia	4	C	History
6	Matilda	7	A	Drama
7	Olivia	8	A	Drama

The image below illustrates the main concepts of the above code.

"solution visual for case1"

Case 2 : Same table in several worksheets/files (using the sheet/file name)#

For this case, our data is in different worksheets. We can iterate through each worksheet and combine the dataframes into one.

extract = [excel_file.parse(sheet_name=sheetname, index_col=0) 
           for sheetname in ("humanities", "performance")]

extract

[          Matilda  Nicholas
 Classics        1         3
 History         3         5,
        Matilda  Nicholas
 Music        5         9
 Drama        7        12]

Combine the individual dataframes into one:

(pd
.concat(extract)
.rename_axis(index = 'subject', columns='student')
.stack()
.rename('scores')
.reset_index()
)

	subject	student	scores
0	Classics	Matilda	1
1	Classics	Nicholas	3
2	History	Matilda	3
3	History	Nicholas	5
4	Music	Matilda	5
5	Music	Nicholas	9
6	Drama	Matilda	7
7	Drama	Nicholas	12

The image below illustrates the core concepts of the above solution:

"function description for case 2"

Case 3 : Same table in several worksheets/files but in different positions#

This is similar to Case 2, with the core data been the same. Here we need to pick rows from Subject downwards only, as that is the only relevant data:

extract = {sheetname : excel_file.parse(sheet_name=sheetname, header = None, index_col=0)
                                 .loc['Subject':]
                                 # use the subject row as column names
                                 .pipe(lambda df: df.set_axis(df.loc['Subject'], axis = 1))
                                 .drop(index='Subject')
                                 .rename_axis(index = 'subject', columns = 'student')
           for sheetname in ("female", "male")}

extract

{'female': student  Matilda Olivia
 subject                
 Classics       1      2
 History        3      4,
 'male': student  Nicholas Paul
 subject               
 Classics        3    0
 History         5    1}

Combine the individual dataframes into one:

(pd
.concat(extract, names=['sex'])
.stack()
.rename('scores')
.reset_index()
)

	sex	subject	student	scores
0	female	Classics	Matilda	1
1	female	Classics	Olivia	2
2	female	History	Matilda	3
3	female	History	Olivia	4
4	male	Classics	Nicholas	3
5	male	Classics	Paul	0
6	male	History	Nicholas	5
7	male	History	Paul	1

The image below explains the main concepts of the solution above :

"visual explanation of function for case3"

Case 4 : Implied multiples#

For this case, we have the fields at the top, followed by the subjects and grade for each subject. the student names is the very first column.
The goal is to get the subjects,grades and scores per field, per student and combine into one.

(excel_file
.parse(sheet_name='implied-multiples', header=None)
.ffill(axis=1)
.replace({np.nan : 'field'})
.set_index(0)
.T
.melt(id_vars = ['field','Name'],
      var_name = 'student',
      value_name = 'scores'
)
.assign(grade = lambda x: x.loc[x.Name == "Grade", 'scores'])
#scores are above grades per student
#hence the bfill
.bfill()
.query('Name != "Grade"')
.rename(columns={'Name':'subject'})
)

	field	subject	student	scores	grade
0	Humanities	Classics	Matilda	1	F
2	Humanities	History	Matilda	3	D
4	Performance	Music	Matilda	5	B
6	Performance	Drama	Matilda	7	A
8	Humanities	Classics	Olivia	2	D
10	Humanities	History	Olivia	4	C
12	Performance	Music	Olivia	6	B
14	Performance	Drama	Olivia	8	A

And a visual illustration of the steps is shown below:

"visual explanation of code for case4"