SpreadSheet Munging Strategies in Python - Pivot Tables - Complex Unpivoting

SpreadSheet Munging Strategies in Python - Pivot Tables - Complex Unpivoting#

Pivot Tables - Complex Unpivoting#

updated : September 26, 2024

This is part of a series of blog posts about extracting data from spreadsheets using Python. It is based on the book written by Duncan Garmonsway, which was written primarily for R users. Links to the other posts are on the sidebar.

The key takeaway is this - you understand your data layout; use the tools to achieve your end goal. xlsx_cells offers a way to get the cells in a spreadsheet into individual rows, with some metadata. The final outcome however relies on your understanding of the data layout and its proper application.

We’ve dealt with pivot tables in one of the previous posts. Here, we take the complexity up a notch. Let’s dive in to the various scenarios.

Case 1 : Centre-aligned headers#

In this case, the headers are not aligned completely with the subjects or names. If the data is read into Pandas, columns B and C are set as the index of the dataframe, and a forward/backward fill applied, “Humanities” could be wrongly assigned to “Music” or “Performance” to “ Literature” (“Music” should be paired with “Performance”, while “Humanities” should be paired with “Literature”). Same goes for the header columns - if rows 2 and 3 are read in as header columns, and a forward/backward fill applied, “Female” may be wrongly assigned to “Lenny”, while “Male” could be wrongly assigned to “Olivia”.

The solution is to get the coordinates for the horizontal and vertical borders, and use that to correctly pair the header rows and header columns. We’ll use xlsx_cells function from pyjanitor to get the coordinates - under the hood, it uses openpyxl:

# pip install pyjanitor
import pandas as pd
import janitor as jn
import numpy as np
import sys

print("pandas version: ", pd.__version__)
print("janitor version: ", jn.__version__)
print("python version: ", sys.version)
print("numpy version: ", np.__version__)

pandas version:  2.2.2
janitor version:  0.29.1
python version:  3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
numpy version:  2.0.2

excel_file = pd.ExcelFile("Data_files/worked-examples.xlsx")

frame = jn.xlsx_cells(
    excel_file,
    sheetnames="pivot-centre-aligned",
    border=True,
    include_blank_cells=False,
)
frame = frame.astype({"row": np.int8, "column": np.int8})
frame.head()

	value	internal_value	coordinate	row	column	data_type	is_date	number_format	border
0	Female	Female	E2	2	5	s	False	General	{'left': {'style': None, 'color': None}, 'righ...
1	Male	Male	I2	2	9	s	False	General	{'left': {'style': None, 'color': None}, 'righ...
2	Leah	Leah	D3	3	4	s	False	General	{'left': {'style': 'thin', 'color': None}, 'ri...
3	Matilda	Matilda	E3	3	5	s	False	General	{'left': {'style': None, 'color': None}, 'righ...
4	Olivia	Olivia	F3	3	6	s	False	General	{'left': {'style': None, 'color': None}, 'righ...

Observations:

The first row is the gender; the next row are the names.
The male and female students are separated by a vertical border. We’ll use the border position to accurately reshape the data.
The first column are the fields; the next column are the subjects.
Just like the gender/names; a border separates the fields, horizontally.
The scores are all integers.

# get the fields
# the very first column
min_col = frame.column.min()
fields = frame.loc[frame.column == min_col, ["value"]]
fields = fields.rename(columns={"value": "field"})
# the horizontal borders to properly pair field and subject
# a good option is the bottom border
booleans = frame.border.str.get("bottom").str.get("style")
rows = frame.loc[booleans.notna(), "row"].array
# align with the first cell per window
# the first subject is one below the border
# hence the +1
rows = np.unique(rows)[: len(fields)] + 1
fields["row"] = rows
subjects = frame.loc[frame.column == (min_col + 1), ["value", "row"]]
subjects = subjects.rename(columns={"value": "subject"})
# merge with fields:
fields_and_subjects = subjects.merge(fields, on="row", how="left").ffill()
fields_and_subjects

	subject	row	field
0	Classics	4	Humanities
1	History	5	Humanities
2	Literature	6	Humanities
3	Philosophy	7	Humanities
4	Languages	8	Humanities
5	Music	9	Performance
6	Dance	10	Performance
7	Drama	11	Performance

# gender is the very first row
min_row = frame.row.min()
gender = frame.loc[frame.row == min_row, ["value"]]
gender = gender.rename(columns={"value": "gender"})
# the vertical borders we are interested in
# are the ones paired with the numbers
# a good option is the left border
border_booleans = frame.border.str.get("left").str.get("style").notna()
number_booleans = frame.data_type.eq("n")
booleans = border_booleans & number_booleans
columns = frame.loc[booleans, "column"].array
# align with the first cell per window
columns = np.unique(columns)[: len(gender)]
gender["column"] = columns
# nemes are the very next row after gender
names = frame.loc[frame.row == (min_row + 1), ["value", "column"]]
names = names.rename(columns={"value": "name"})
# merge with gender
names_and_gender = names.merge(gender, on="column", how="left").ffill()
names

	name	column
2	Leah	4
3	Matilda	5
4	Olivia	6
5	Lenny	7
6	Max	8
7	Nicholas	9
8	Paul	10

# grab the scores
data = frame.loc[number_booleans, ["value", "row", "column"]]
data = data.rename(columns={"value": "score"})
# merge with fields_and subjects
# merge with names_and_gender
outcome = (
    data.merge(fields_and_subjects, on="row", how="left")
    .merge(names_and_gender, on="column", how="left")
    .loc[:, ["name", "gender", "field", "subject", "score"]]
)

outcome

	name	gender	field	subject	score
0	Leah	Female	Humanities	Classics	3
1	Matilda	Female	Humanities	Classics	1
2	Olivia	Female	Humanities	Classics	2
3	Lenny	Male	Humanities	Classics	4
4	Max	Male	Humanities	Classics	3
5	Nicholas	Male	Humanities	Classics	3
6	Paul	Male	Humanities	Classics	0
7	Leah	Female	Humanities	History	8
8	Matilda	Female	Humanities	History	3
9	Olivia	Female	Humanities	History	4
10	Lenny	Male	Humanities	History	7
11	Max	Male	Humanities	History	5
12	Nicholas	Male	Humanities	History	5
13	Paul	Male	Humanities	History	1
14	Leah	Female	Humanities	Literature	1
15	Matilda	Female	Humanities	Literature	1
16	Olivia	Female	Humanities	Literature	9
17	Lenny	Male	Humanities	Literature	3
18	Max	Male	Humanities	Literature	12
19	Nicholas	Male	Humanities	Literature	7
20	Paul	Male	Humanities	Literature	5
21	Leah	Female	Humanities	Philosophy	5
22	Matilda	Female	Humanities	Philosophy	10
23	Olivia	Female	Humanities	Philosophy	10
24	Lenny	Male	Humanities	Philosophy	8
25	Max	Male	Humanities	Philosophy	2
26	Nicholas	Male	Humanities	Philosophy	5
27	Paul	Male	Humanities	Philosophy	12
28	Leah	Female	Humanities	Languages	5
29	Matilda	Female	Humanities	Languages	4
30	Olivia	Female	Humanities	Languages	5
31	Lenny	Male	Humanities	Languages	9
32	Max	Male	Humanities	Languages	8
33	Nicholas	Male	Humanities	Languages	3
34	Paul	Male	Humanities	Languages	8
35	Leah	Female	Performance	Music	4
36	Matilda	Female	Performance	Music	10
37	Olivia	Female	Performance	Music	10
38	Lenny	Male	Performance	Music	2
39	Max	Male	Performance	Music	4
40	Nicholas	Male	Performance	Music	5
41	Paul	Male	Performance	Music	6
42	Leah	Female	Performance	Dance	4
43	Matilda	Female	Performance	Dance	5
44	Olivia	Female	Performance	Dance	6
45	Lenny	Male	Performance	Dance	4
46	Max	Male	Performance	Dance	12
47	Nicholas	Male	Performance	Dance	9
48	Paul	Male	Performance	Dance	2
49	Leah	Female	Performance	Drama	2
50	Matilda	Female	Performance	Drama	7
51	Olivia	Female	Performance	Drama	8
52	Lenny	Male	Performance	Drama	6
53	Max	Male	Performance	Drama	1
54	Nicholas	Male	Performance	Drama	12
55	Paul	Male	Performance	Drama	3

Case 2: Repeated rows/columns of headers within the table#

Observations :
The row header (Term1, Term2, Term3) is repeated in four locations; we only need one.
The index columns are clearly delineated; the pairing of subjects and names is assured.

For this, we skip xlsx_cells, and take advantage of pandas’ MultiIndexes:

(
    excel_file.parse(
        sheet_name="pivot-repeated-headers", header=[0, 1], index_col=[0, 1, 2]
    )
    .droplevel(axis=0, level=0)
    .droplevel(axis=1, level=0)
    .transform(pd.to_numeric, errors="coerce")
    .dropna()
    .astype(np.int8)
    .stack(future_stack=True)
    .rename("scores")
    .rename_axis(index=["subject", "name", "term"])
    .reset_index()
)

	subject	name	term	scores
0	Classics	Matilda	Term 1	1
1	Classics	Matilda	Term 2	7
2	Classics	Matilda	Term 3	4
3	Classics	Nicholas	Term 1	2
4	Classics	Nicholas	Term 2	9
5	Classics	Nicholas	Term 3	2
6	Classics	Olivia	Term 1	8
7	Classics	Olivia	Term 2	4
8	Classics	Olivia	Term 3	9
9	Classics	Paul	Term 1	9
10	Classics	Paul	Term 2	4
11	Classics	Paul	Term 3	4
12	History	Matilda	Term 1	6
13	History	Matilda	Term 2	6
14	History	Matilda	Term 3	4
15	History	Nicholas	Term 1	2
16	History	Nicholas	Term 2	4
17	History	Nicholas	Term 3	6
18	History	Olivia	Term 1	5
19	History	Olivia	Term 2	0
20	History	Olivia	Term 3	0
21	History	Paul	Term 1	2
22	History	Paul	Term 2	6
23	History	Paul	Term 3	7
24	Music	Matilda	Term 1	0
25	Music	Matilda	Term 2	5
26	Music	Matilda	Term 3	1
27	Music	Nicholas	Term 1	1
28	Music	Nicholas	Term 2	1
29	Music	Nicholas	Term 3	2
30	Music	Olivia	Term 1	6
31	Music	Olivia	Term 2	1
32	Music	Olivia	Term 3	3
33	Music	Paul	Term 1	2
34	Music	Paul	Term 2	0
35	Music	Paul	Term 3	1
36	Drama	Matilda	Term 1	4
37	Drama	Matilda	Term 2	3
38	Drama	Matilda	Term 3	6
39	Drama	Nicholas	Term 1	9
40	Drama	Nicholas	Term 2	3
41	Drama	Nicholas	Term 3	2
42	Drama	Olivia	Term 1	2
43	Drama	Olivia	Term 2	7
44	Drama	Olivia	Term 3	6
45	Drama	Paul	Term 1	2
46	Drama	Paul	Term 2	7
47	Drama	Paul	Term 3	1

Case 3 : Headers amongst the data#

In this scenario, we have the subjects as a row header, mixed with the data(classics, history, music, drama). Note that the Term1,Term2,Term3 row is repeated.

Let’s use xlsx_cells to get the font data (specifically, the cells with bold font) and with that location, correctly align the data:

frame = jn.xlsx_cells(
    excel_file,
    sheetnames="pivot-header-within-data",
    font=True,
    border=True,
    include_blank_cells=False,
)
frame = frame.astype({"row": np.int8, "column": np.int8})
frame.head()

	value	internal_value	coordinate	row	column	data_type	is_date	number_format	font	border
0	Classics	Classics	C2	2	3	s	False	General	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...	{'left': {'style': 'thin', 'color': None}, 'ri...
1	Term 1	Term 1	C3	3	3	s	False	General	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...	{'left': {'style': 'thin', 'color': None}, 'ri...
2	Term 2	Term 2	D3	3	4	s	False	General	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...	{'left': {'style': None, 'color': None}, 'righ...
3	Term 3	Term 3	E3	3	5	s	False	General	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...	{'left': {'style': None, 'color': None}, 'righ...
4	Matilda	Matilda	B4	4	2	s	False	General	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...	{'left': {'style': 'thin', 'color': None}, 'ri...

Observations:

The subjects are in bold font
The subjects are one row above the Terms
The names are aligned with the scores row wise
The term is one row above the scores; this implies the shortest distance between the scores and the subjects is 2

subjects = frame.loc[frame.font.str.get("b"), ["value", "row"]]
subjects = subjects.rename(columns={"value": "subject"})
terms = frame.loc[frame.row.isin(subjects.row + 1), ["value", "row", "column"]]
terms = terms.rename(columns={"value": "term"})
subjects["row"] += 1  # align with the terms
# merge term and subjects:
terms_subjects = subjects.merge(terms, on="row")
terms_subjects["row"] += 1  # align with the immediate next cell
# names have a left border
border_booleans = frame.border.str.get("left").str.get("style").notna()
string_booleans = frame.data_type.eq("s")
booleans = border_booleans & string_booleans
names = frame.loc[booleans, ["value", "row"]]
names = names.rename(columns={"value": "name"})
number_booleans = frame.data_type.eq("n")
numbers = frame.loc[number_booleans, ["value", "row", "column"]]
# sort columns so that merge with term and subject will be in the proper order
numbers = numbers.rename(columns={"value": "score"}).sort_values(["column", "row"])
# merge into one DataFrame
# numbers.merge(names, on='row', how='left')
outcome = (
    numbers.merge(terms_subjects, on=["column", "row"], how="left")
    .ffill()
    .merge(names, on="row")
    .loc[:, ["name", "subject", "term", "score"]]
)
outcome

	name	subject	term	score
0	Matilda	Classics	Term 1	7
1	Nicholas	Classics	Term 1	9
2	Olivia	Classics	Term 1	5
3	Paul	Classics	Term 1	1
4	Matilda	History	Term 1	0
5	Nicholas	History	Term 1	4
6	Olivia	History	Term 1	1
7	Paul	History	Term 1	3
8	Matilda	Music	Term 1	6
9	Nicholas	Music	Term 1	5
10	Olivia	Music	Term 1	1
11	Paul	Music	Term 1	1
12	Matilda	Drama	Term 1	7
13	Nicholas	Drama	Term 1	6
14	Olivia	Drama	Term 1	1
15	Paul	Drama	Term 1	5
16	Matilda	Classics	Term 2	2
17	Nicholas	Classics	Term 2	7
18	Olivia	Classics	Term 2	9
19	Paul	Classics	Term 2	4
20	Matilda	History	Term 2	0
21	Nicholas	History	Term 2	5
22	Olivia	History	Term 2	3
23	Paul	History	Term 2	5
24	Matilda	Music	Term 2	2
25	Nicholas	Music	Term 2	7
26	Olivia	Music	Term 2	6
27	Paul	Music	Term 2	3
28	Matilda	Drama	Term 2	7
29	Nicholas	Drama	Term 2	7
30	Olivia	Drama	Term 2	3
31	Paul	Drama	Term 2	7
32	Matilda	Classics	Term 3	1
33	Nicholas	Classics	Term 3	6
34	Olivia	Classics	Term 3	5
35	Paul	Classics	Term 3	9
36	Matilda	History	Term 3	5
37	Nicholas	History	Term 3	9
38	Olivia	History	Term 3	3
39	Paul	History	Term 3	6
40	Matilda	Music	Term 3	4
41	Nicholas	Music	Term 3	6
42	Olivia	Music	Term 3	0
43	Paul	Music	Term 3	3
44	Matilda	Drama	Term 3	0
45	Nicholas	Drama	Term 3	8
46	Olivia	Drama	Term 3	3
47	Paul	Drama	Term 3	2

Another route, without xlsx_cells:

(
    excel_file.parse(sheet_name="pivot-header-within-data")
    .dropna(axis=1, how="all")
    .set_axis(["name", "Term 1", "Term 2", "Term3"], axis="columns")
    .assign(
        subject=lambda f: f["Term 1"]
        .where(~f["Term 1"].str.startswith("Term", na=False) & f.name.isna())
        .ffill()
    )
    .dropna()
    .melt(id_vars=["name", "subject"], var_name="term", value_name="score")
)

	name	subject	term	score
0	Matilda	Classics	Term 1	7
1	Nicholas	Classics	Term 1	9
2	Olivia	Classics	Term 1	5
3	Paul	Classics	Term 1	1
4	Matilda	History	Term 1	0
5	Nicholas	History	Term 1	4
6	Olivia	History	Term 1	1
7	Paul	History	Term 1	3
8	Matilda	Music	Term 1	6
9	Nicholas	Music	Term 1	5
10	Olivia	Music	Term 1	1
11	Paul	Music	Term 1	1
12	Matilda	Drama	Term 1	7
13	Nicholas	Drama	Term 1	6
14	Olivia	Drama	Term 1	1
15	Paul	Drama	Term 1	5
16	Matilda	Classics	Term 2	2
17	Nicholas	Classics	Term 2	7
18	Olivia	Classics	Term 2	9
19	Paul	Classics	Term 2	4
20	Matilda	History	Term 2	0
21	Nicholas	History	Term 2	5
22	Olivia	History	Term 2	3
23	Paul	History	Term 2	5
24	Matilda	Music	Term 2	2
25	Nicholas	Music	Term 2	7
26	Olivia	Music	Term 2	6
27	Paul	Music	Term 2	3
28	Matilda	Drama	Term 2	7
29	Nicholas	Drama	Term 2	7
30	Olivia	Drama	Term 2	3
31	Paul	Drama	Term 2	7
32	Matilda	Classics	Term3	1
33	Nicholas	Classics	Term3	6
34	Olivia	Classics	Term3	5
35	Paul	Classics	Term3	9
36	Matilda	History	Term3	5
37	Nicholas	History	Term3	9
38	Olivia	History	Term3	3
39	Paul	History	Term3	6
40	Matilda	Music	Term3	4
41	Nicholas	Music	Term3	6
42	Olivia	Music	Term3	0
43	Paul	Music	Term3	3
44	Matilda	Drama	Term3	0
45	Nicholas	Drama	Term3	8
46	Olivia	Drama	Term3	3
47	Paul	Drama	Term3	2