SpreadSheet Munging Strategies in Python - Pivot Tables - Complex Unpivoting#

Pivot Tables - Complex Unpivoting#

updated : September 26, 2024

This is part of a series of blog posts about extracting data from spreadsheets using Python. It is based on the book written by Duncan Garmonsway, which was written primarily for R users. Links to the other posts are on the sidebar.

The key takeaway is this - you understand your data layout; use the tools to achieve your end goal. xlsx_cells offers a way to get the cells in a spreadsheet into individual rows, with some metadata. The final outcome however relies on your understanding of the data layout and its proper application.

We’ve dealt with pivot tables in one of the previous posts. Here, we take the complexity up a notch. Let’s dive in to the various scenarios.

Case 1 : Centre-aligned headers#

pivot-centre-aligned.png

In this case, the headers are not aligned completely with the subjects or names. If the data is read into Pandas, columns B and C are set as the index of the dataframe, and a forward/backward fill applied, “Humanities” could be wrongly assigned to “Music” or “Performance” to “ Literature” (“Music” should be paired with “Performance”, while “Humanities” should be paired with “Literature”). Same goes for the header columns - if rows 2 and 3 are read in as header columns, and a forward/backward fill applied, “Female” may be wrongly assigned to “Lenny”, while “Male” could be wrongly assigned to “Olivia”.

highlight-borders.png

The solution is to get the coordinates for the horizontal and vertical borders, and use that to correctly pair the header rows and header columns. We’ll use xlsx_cells function from pyjanitor to get the coordinates - under the hood, it uses openpyxl:

# pip install pyjanitor
import pandas as pd
import janitor as jn
import numpy as np
import sys
print("pandas version: ", pd.__version__)
print("janitor version: ", jn.__version__)
print("python version: ", sys.version)
print("numpy version: ", np.__version__)
pandas version:  2.2.2
janitor version:  0.29.1
python version:  3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
numpy version:  2.0.2
excel_file = pd.ExcelFile("Data_files/worked-examples.xlsx")
frame = jn.xlsx_cells(
    excel_file,
    sheetnames="pivot-centre-aligned",
    border=True,
    include_blank_cells=False,
)
frame = frame.astype({"row": np.int8, "column": np.int8})
frame.head()
value internal_value coordinate row column data_type is_date number_format border
0 Female Female E2 2 5 s False General {'left': {'style': None, 'color': None}, 'righ...
1 Male Male I2 2 9 s False General {'left': {'style': None, 'color': None}, 'righ...
2 Leah Leah D3 3 4 s False General {'left': {'style': 'thin', 'color': None}, 'ri...
3 Matilda Matilda E3 3 5 s False General {'left': {'style': None, 'color': None}, 'righ...
4 Olivia Olivia F3 3 6 s False General {'left': {'style': None, 'color': None}, 'righ...

Observations:

  1. The first row is the gender; the next row are the names.

  2. The male and female students are separated by a vertical border. We’ll use the border position to accurately reshape the data.

  3. The first column are the fields; the next column are the subjects.

  4. Just like the gender/names; a border separates the fields, horizontally.

  5. The scores are all integers.

# get the fields
# the very first column
min_col = frame.column.min()
fields = frame.loc[frame.column == min_col, ["value"]]
fields = fields.rename(columns={"value": "field"})
# the horizontal borders to properly pair field and subject
# a good option is the bottom border
booleans = frame.border.str.get("bottom").str.get("style")
rows = frame.loc[booleans.notna(), "row"].array
# align with the first cell per window
# the first subject is one below the border
# hence the +1
rows = np.unique(rows)[: len(fields)] + 1
fields["row"] = rows
subjects = frame.loc[frame.column == (min_col + 1), ["value", "row"]]
subjects = subjects.rename(columns={"value": "subject"})
# merge with fields:
fields_and_subjects = subjects.merge(fields, on="row", how="left").ffill()
fields_and_subjects
subject row field
0 Classics 4 Humanities
1 History 5 Humanities
2 Literature 6 Humanities
3 Philosophy 7 Humanities
4 Languages 8 Humanities
5 Music 9 Performance
6 Dance 10 Performance
7 Drama 11 Performance
# gender is the very first row
min_row = frame.row.min()
gender = frame.loc[frame.row == min_row, ["value"]]
gender = gender.rename(columns={"value": "gender"})
# the vertical borders we are interested in
# are the ones paired with the numbers
# a good option is the left border
border_booleans = frame.border.str.get("left").str.get("style").notna()
number_booleans = frame.data_type.eq("n")
booleans = border_booleans & number_booleans
columns = frame.loc[booleans, "column"].array
# align with the first cell per window
columns = np.unique(columns)[: len(gender)]
gender["column"] = columns
# nemes are the very next row after gender
names = frame.loc[frame.row == (min_row + 1), ["value", "column"]]
names = names.rename(columns={"value": "name"})
# merge with gender
names_and_gender = names.merge(gender, on="column", how="left").ffill()
names
name column
2 Leah 4
3 Matilda 5
4 Olivia 6
5 Lenny 7
6 Max 8
7 Nicholas 9
8 Paul 10
# grab the scores
data = frame.loc[number_booleans, ["value", "row", "column"]]
data = data.rename(columns={"value": "score"})
# merge with fields_and subjects
# merge with names_and_gender
outcome = (
    data.merge(fields_and_subjects, on="row", how="left")
    .merge(names_and_gender, on="column", how="left")
    .loc[:, ["name", "gender", "field", "subject", "score"]]
)

outcome
name gender field subject score
0 Leah Female Humanities Classics 3
1 Matilda Female Humanities Classics 1
2 Olivia Female Humanities Classics 2
3 Lenny Male Humanities Classics 4
4 Max Male Humanities Classics 3
5 Nicholas Male Humanities Classics 3
6 Paul Male Humanities Classics 0
7 Leah Female Humanities History 8
8 Matilda Female Humanities History 3
9 Olivia Female Humanities History 4
10 Lenny Male Humanities History 7
11 Max Male Humanities History 5
12 Nicholas Male Humanities History 5
13 Paul Male Humanities History 1
14 Leah Female Humanities Literature 1
15 Matilda Female Humanities Literature 1
16 Olivia Female Humanities Literature 9
17 Lenny Male Humanities Literature 3
18 Max Male Humanities Literature 12
19 Nicholas Male Humanities Literature 7
20 Paul Male Humanities Literature 5
21 Leah Female Humanities Philosophy 5
22 Matilda Female Humanities Philosophy 10
23 Olivia Female Humanities Philosophy 10
24 Lenny Male Humanities Philosophy 8
25 Max Male Humanities Philosophy 2
26 Nicholas Male Humanities Philosophy 5
27 Paul Male Humanities Philosophy 12
28 Leah Female Humanities Languages 5
29 Matilda Female Humanities Languages 4
30 Olivia Female Humanities Languages 5
31 Lenny Male Humanities Languages 9
32 Max Male Humanities Languages 8
33 Nicholas Male Humanities Languages 3
34 Paul Male Humanities Languages 8
35 Leah Female Performance Music 4
36 Matilda Female Performance Music 10
37 Olivia Female Performance Music 10
38 Lenny Male Performance Music 2
39 Max Male Performance Music 4
40 Nicholas Male Performance Music 5
41 Paul Male Performance Music 6
42 Leah Female Performance Dance 4
43 Matilda Female Performance Dance 5
44 Olivia Female Performance Dance 6
45 Lenny Male Performance Dance 4
46 Max Male Performance Dance 12
47 Nicholas Male Performance Dance 9
48 Paul Male Performance Dance 2
49 Leah Female Performance Drama 2
50 Matilda Female Performance Drama 7
51 Olivia Female Performance Drama 8
52 Lenny Male Performance Drama 6
53 Max Male Performance Drama 1
54 Nicholas Male Performance Drama 12
55 Paul Male Performance Drama 3

Case 2: Repeated rows/columns of headers within the table#

pivot-repeated-headers.png

Observations :
The row header (Term1, Term2, Term3) is repeated in four locations; we only need one.
The index columns are clearly delineated; the pairing of subjects and names is assured.

For this, we skip xlsx_cells, and take advantage of pandas’ MultiIndexes:

(
    excel_file.parse(
        sheet_name="pivot-repeated-headers", header=[0, 1], index_col=[0, 1, 2]
    )
    .droplevel(axis=0, level=0)
    .droplevel(axis=1, level=0)
    .transform(pd.to_numeric, errors="coerce")
    .dropna()
    .astype(np.int8)
    .stack(future_stack=True)
    .rename("scores")
    .rename_axis(index=["subject", "name", "term"])
    .reset_index()
)
subject name term scores
0 Classics Matilda Term 1 1
1 Classics Matilda Term 2 7
2 Classics Matilda Term 3 4
3 Classics Nicholas Term 1 2
4 Classics Nicholas Term 2 9
5 Classics Nicholas Term 3 2
6 Classics Olivia Term 1 8
7 Classics Olivia Term 2 4
8 Classics Olivia Term 3 9
9 Classics Paul Term 1 9
10 Classics Paul Term 2 4
11 Classics Paul Term 3 4
12 History Matilda Term 1 6
13 History Matilda Term 2 6
14 History Matilda Term 3 4
15 History Nicholas Term 1 2
16 History Nicholas Term 2 4
17 History Nicholas Term 3 6
18 History Olivia Term 1 5
19 History Olivia Term 2 0
20 History Olivia Term 3 0
21 History Paul Term 1 2
22 History Paul Term 2 6
23 History Paul Term 3 7
24 Music Matilda Term 1 0
25 Music Matilda Term 2 5
26 Music Matilda Term 3 1
27 Music Nicholas Term 1 1
28 Music Nicholas Term 2 1
29 Music Nicholas Term 3 2
30 Music Olivia Term 1 6
31 Music Olivia Term 2 1
32 Music Olivia Term 3 3
33 Music Paul Term 1 2
34 Music Paul Term 2 0
35 Music Paul Term 3 1
36 Drama Matilda Term 1 4
37 Drama Matilda Term 2 3
38 Drama Matilda Term 3 6
39 Drama Nicholas Term 1 9
40 Drama Nicholas Term 2 3
41 Drama Nicholas Term 3 2
42 Drama Olivia Term 1 2
43 Drama Olivia Term 2 7
44 Drama Olivia Term 3 6
45 Drama Paul Term 1 2
46 Drama Paul Term 2 7
47 Drama Paul Term 3 1

Case 3 : Headers amongst the data#

pivot-header-within-data.png

In this scenario, we have the subjects as a row header, mixed with the data(classics, history, music, drama). Note that the Term1,Term2,Term3 row is repeated.

Let’s use xlsx_cells to get the font data (specifically, the cells with bold font) and with that location, correctly align the data:

frame = jn.xlsx_cells(
    excel_file,
    sheetnames="pivot-header-within-data",
    font=True,
    border=True,
    include_blank_cells=False,
)
frame = frame.astype({"row": np.int8, "column": np.int8})
frame.head()
value internal_value coordinate row column data_type is_date number_format font border
0 Classics Classics C2 2 3 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... {'left': {'style': 'thin', 'color': None}, 'ri...
1 Term 1 Term 1 C3 3 3 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... {'left': {'style': 'thin', 'color': None}, 'ri...
2 Term 2 Term 2 D3 3 4 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... {'left': {'style': None, 'color': None}, 'righ...
3 Term 3 Term 3 E3 3 5 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... {'left': {'style': None, 'color': None}, 'righ...
4 Matilda Matilda B4 4 2 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... {'left': {'style': 'thin', 'color': None}, 'ri...

Observations:

  • The subjects are in bold font

  • The subjects are one row above the Terms

  • The names are aligned with the scores row wise

  • The term is one row above the scores; this implies the shortest distance between the scores and the subjects is 2

subjects = frame.loc[frame.font.str.get("b"), ["value", "row"]]
subjects = subjects.rename(columns={"value": "subject"})
terms = frame.loc[frame.row.isin(subjects.row + 1), ["value", "row", "column"]]
terms = terms.rename(columns={"value": "term"})
subjects["row"] += 1  # align with the terms
# merge term and subjects:
terms_subjects = subjects.merge(terms, on="row")
terms_subjects["row"] += 1  # align with the immediate next cell
# names have a left border
border_booleans = frame.border.str.get("left").str.get("style").notna()
string_booleans = frame.data_type.eq("s")
booleans = border_booleans & string_booleans
names = frame.loc[booleans, ["value", "row"]]
names = names.rename(columns={"value": "name"})
number_booleans = frame.data_type.eq("n")
numbers = frame.loc[number_booleans, ["value", "row", "column"]]
# sort columns so that merge with term and subject will be in the proper order
numbers = numbers.rename(columns={"value": "score"}).sort_values(["column", "row"])
# merge into one DataFrame
# numbers.merge(names, on='row', how='left')
outcome = (
    numbers.merge(terms_subjects, on=["column", "row"], how="left")
    .ffill()
    .merge(names, on="row")
    .loc[:, ["name", "subject", "term", "score"]]
)
outcome
name subject term score
0 Matilda Classics Term 1 7
1 Nicholas Classics Term 1 9
2 Olivia Classics Term 1 5
3 Paul Classics Term 1 1
4 Matilda History Term 1 0
5 Nicholas History Term 1 4
6 Olivia History Term 1 1
7 Paul History Term 1 3
8 Matilda Music Term 1 6
9 Nicholas Music Term 1 5
10 Olivia Music Term 1 1
11 Paul Music Term 1 1
12 Matilda Drama Term 1 7
13 Nicholas Drama Term 1 6
14 Olivia Drama Term 1 1
15 Paul Drama Term 1 5
16 Matilda Classics Term 2 2
17 Nicholas Classics Term 2 7
18 Olivia Classics Term 2 9
19 Paul Classics Term 2 4
20 Matilda History Term 2 0
21 Nicholas History Term 2 5
22 Olivia History Term 2 3
23 Paul History Term 2 5
24 Matilda Music Term 2 2
25 Nicholas Music Term 2 7
26 Olivia Music Term 2 6
27 Paul Music Term 2 3
28 Matilda Drama Term 2 7
29 Nicholas Drama Term 2 7
30 Olivia Drama Term 2 3
31 Paul Drama Term 2 7
32 Matilda Classics Term 3 1
33 Nicholas Classics Term 3 6
34 Olivia Classics Term 3 5
35 Paul Classics Term 3 9
36 Matilda History Term 3 5
37 Nicholas History Term 3 9
38 Olivia History Term 3 3
39 Paul History Term 3 6
40 Matilda Music Term 3 4
41 Nicholas Music Term 3 6
42 Olivia Music Term 3 0
43 Paul Music Term 3 3
44 Matilda Drama Term 3 0
45 Nicholas Drama Term 3 8
46 Olivia Drama Term 3 3
47 Paul Drama Term 3 2

Another route, without xlsx_cells:

(
    excel_file.parse(sheet_name="pivot-header-within-data")
    .dropna(axis=1, how="all")
    .set_axis(["name", "Term 1", "Term 2", "Term3"], axis="columns")
    .assign(
        subject=lambda f: f["Term 1"]
        .where(~f["Term 1"].str.startswith("Term", na=False) & f.name.isna())
        .ffill()
    )
    .dropna()
    .melt(id_vars=["name", "subject"], var_name="term", value_name="score")
)
name subject term score
0 Matilda Classics Term 1 7
1 Nicholas Classics Term 1 9
2 Olivia Classics Term 1 5
3 Paul Classics Term 1 1
4 Matilda History Term 1 0
5 Nicholas History Term 1 4
6 Olivia History Term 1 1
7 Paul History Term 1 3
8 Matilda Music Term 1 6
9 Nicholas Music Term 1 5
10 Olivia Music Term 1 1
11 Paul Music Term 1 1
12 Matilda Drama Term 1 7
13 Nicholas Drama Term 1 6
14 Olivia Drama Term 1 1
15 Paul Drama Term 1 5
16 Matilda Classics Term 2 2
17 Nicholas Classics Term 2 7
18 Olivia Classics Term 2 9
19 Paul Classics Term 2 4
20 Matilda History Term 2 0
21 Nicholas History Term 2 5
22 Olivia History Term 2 3
23 Paul History Term 2 5
24 Matilda Music Term 2 2
25 Nicholas Music Term 2 7
26 Olivia Music Term 2 6
27 Paul Music Term 2 3
28 Matilda Drama Term 2 7
29 Nicholas Drama Term 2 7
30 Olivia Drama Term 2 3
31 Paul Drama Term 2 7
32 Matilda Classics Term3 1
33 Nicholas Classics Term3 6
34 Olivia Classics Term3 5
35 Paul Classics Term3 9
36 Matilda History Term3 5
37 Nicholas History Term3 9
38 Olivia History Term3 3
39 Paul History Term3 6
40 Matilda Music Term3 4
41 Nicholas Music Term3 6
42 Olivia Music Term3 0
43 Paul Music Term3 3
44 Matilda Drama Term3 0
45 Nicholas Drama Term3 8
46 Olivia Drama Term3 3
47 Paul Drama Term3 2

Comments#