SpreadSheet Munging Strategies in Python - Pivot Tables - Complex Unpivoting#

Pivot Tables - Complex Unpivoting#

updated : April 15, 2022

This is part of a series of blog posts about extracting data from spreadsheets using Python. It is based on the book written by Duncan Garmonsway, which was written primarily for R users. LInks to the other posts are on the homepage.

We’ve dealt with pivot tables in one of the previous posts. Here, we take the complexity up a notch. Let’s dive in to the various scenarios.

Case 1 : Centre-aligned headers#

pivot-centre-aligned.png

In this case, the headers are not aligned completely with the subjects or names. If the data is read into Pandas, columns B and C are set as the index of the dataframe, and a forward/backward fill applied, “Humanities” could be wrongly assigned to “Music” or “Performance” to “ Literature” (“Music” should be paired with “Performance”, while “Humanities” should be paired with “Literature”). Same goes for the header columns - if rows 2 and 3 are read in as header columns, and a forward/backward fill applied, “Female” may be wrongly assigned to “Lenny”, while “Male” could be wrongly assigned to “Olivia”.

highlight-borders.png

The solution is to get the coordinates for the horizontal and vertical borders, and use that to correctly pair the header rows and header columns. We’ll use xlsx_cells function from pyjanitor to get the coordinates - under the hood, it uses openpyxl:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
import numpy as np
from janitor import xlsx_cells
excel_file = pd.ExcelFile("Data_files/worked-examples.xlsx")
frame = xlsx_cells(excel_file, sheetnames = 'pivot-centre-aligned', border=True, include_blank_cells = False)
frame
value internal_value coordinate row column data_type is_date number_format border
0 Female Female E2 2 5 s False General {'left': {'style': None, 'color': None}, 'righ...
1 Male Male I2 2 9 s False General {'left': {'style': None, 'color': None}, 'righ...
2 Leah Leah D3 3 4 s False General {'left': {'style': 'thin', 'color': None}, 'ri...
3 Matilda Matilda E3 3 5 s False General {'left': {'style': None, 'color': None}, 'righ...
4 Olivia Olivia F3 3 6 s False General {'left': {'style': None, 'color': None}, 'righ...
... ... ... ... ... ... ... ... ... ...
70 8 8 F11 11 6 n False General {'left': {'style': None, 'color': None}, 'righ...
71 6 6 G11 11 7 n False General {'left': {'style': 'thin', 'color': None}, 'ri...
72 1 1 H11 11 8 n False General {'left': {'style': None, 'color': None}, 'righ...
73 12 12 I11 11 9 n False General {'left': {'style': None, 'color': None}, 'righ...
74 3 3 J11 11 10 n False General {'left': {'style': None, 'color': None}, 'righ...

75 rows Ă— 9 columns

You can view a single row in the border column:

frame.border[0]
{'left': {'style': None, 'color': None},
 'right': {'style': None, 'color': None},
 'top': {'style': 'thin', 'color': None},
 'bottom': {'style': None, 'color': None},
 'diagonal': {'style': None, 'color': None},
 'vertical': None,
 'horizontal': None,
 'diagonal_direction': None,
 'diagonalUp': False,
 'diagonalDown': False,
 'outline': True,
 'start': None,
 'end': None}

The fields and subjects columns are in columns 2 and 3 respectively, the gender and students columns are in rows 2 and 3 respectively; the relevant borders top and left respectively.

Once we have the relevant border coordinates, we can correctly link the fields and subjects, as well as the gender and students columns:

frame['gender'] = frame.border.str.get('left').str.get('style')
frame['fields'] = frame.border.str.get('top').str.get('style')
# the border is above the relevant sections, use that to our advantage to pull the relevant data
index = (frame
        .loc[frame.column.isin([2,3]), ['value', 'row', 'column', 'fields']]
        .assign(fields = lambda df: np.where(df.column.eq(2) & df.fields.isna(), df.value, df.fields))
        # pyjanitor ... syntactic sugar for pd.Series.ffill
        .fill_direction(fields = 'down')
        .replace({'thin':None})
        .fill_direction(fields = 'up')
        .query("value != fields")
        .rename(columns = {'value':'subjects'})
        )
index
/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/pandas_flavor/register.py:157: FutureWarning: This function will be deprecated in a 1.x release. Please use `pd.DataFrame.assign` instead.
  return method(self._obj, *args, **kwargs)
subjects row column fields
9 Classics 4 3 Humanities
18 History 5 3 Humanities
26 Literature 6 3 Humanities
34 Philosophy 7 3 Humanities
42 Languages 8 3 Humanities
50 Music 9 3 Performance
59 Dance 10 3 Performance
67 Drama 11 3 Performance

Same idea used for the index is applied to the columns:

columns = (frame
        .loc[frame.row.isin([2,3]), ['value', 'row', 'column', 'gender']]
        # sorting ensures the right order
        # to allow correct reshaping
        .sort_values(['column', 'row'])
        .assign(gender = lambda df: np.where(df.row.eq(2) & df.gender.isna(), df.value, df.gender))
        .fill_direction(gender = 'down')
        .replace({'thin':None})
        .fill_direction(gender = 'up')
        .query("value != gender")
        .rename(columns = {'value':'students'})
        )
columns
/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/pandas_flavor/register.py:157: FutureWarning: This function will be deprecated in a 1.x release. Please use `pd.DataFrame.assign` instead.
  return method(self._obj, *args, **kwargs)
students row column gender
2 Leah 3 4 Female
3 Matilda 3 5 Female
4 Olivia 3 6 Female
5 Lenny 3 7 Male
6 Max 3 8 Male
7 Nicholas 3 9 Male
8 Paul 3 10 Male
scores = (frame
        .loc[frame.data_type == 'n', ['value', 'row', 'column']]
        .rename(columns={'value':'scores'})
        )
scores
scores row column
10 3 4 4
11 1 4 5
12 2 4 6
13 4 4 7
14 3 4 8
15 3 4 9
16 0 4 10
19 8 5 4
20 3 5 5
21 4 5 6
22 7 5 7
23 5 5 8
24 5 5 9
25 1 5 10
27 1 6 4
28 1 6 5
29 9 6 6
30 3 6 7
31 12 6 8
32 7 6 9
33 5 6 10
35 5 7 4
36 10 7 5
37 10 7 6
38 8 7 7
39 2 7 8
40 5 7 9
41 12 7 10
43 5 8 4
44 4 8 5
45 5 8 6
46 9 8 7
47 8 8 8
48 3 8 9
49 8 8 10
51 4 9 4
52 10 9 5
53 10 9 6
54 2 9 7
55 4 9 8
56 5 9 9
57 6 9 10
60 4 10 4
61 5 10 5
62 6 10 6
63 4 10 7
64 12 10 8
65 9 10 9
66 2 10 10
68 2 11 4
69 7 11 5
70 8 11 6
71 6 11 7
72 1 11 8
73 12 11 9
74 3 11 10

We can merge the dataframes into one:

(index
.drop(columns='column')
.merge(scores, on = 'row')
.merge(columns.drop(columns='row'), on = 'column')
.loc[:, ['students', 'gender', 'fields', 'subjects', 'scores']]
)
students gender fields subjects scores
0 Leah Female Humanities Classics 3
1 Matilda Female Humanities Classics 1
2 Olivia Female Humanities Classics 2
3 Lenny Male Humanities Classics 4
4 Max Male Humanities Classics 3
5 Nicholas Male Humanities Classics 3
6 Paul Male Humanities Classics 0
7 Leah Female Humanities History 8
8 Matilda Female Humanities History 3
9 Olivia Female Humanities History 4
10 Lenny Male Humanities History 7
11 Max Male Humanities History 5
12 Nicholas Male Humanities History 5
13 Paul Male Humanities History 1
14 Leah Female Humanities Literature 1
15 Matilda Female Humanities Literature 1
16 Olivia Female Humanities Literature 9
17 Lenny Male Humanities Literature 3
18 Max Male Humanities Literature 12
19 Nicholas Male Humanities Literature 7
20 Paul Male Humanities Literature 5
21 Leah Female Humanities Philosophy 5
22 Matilda Female Humanities Philosophy 10
23 Olivia Female Humanities Philosophy 10
24 Lenny Male Humanities Philosophy 8
25 Max Male Humanities Philosophy 2
26 Nicholas Male Humanities Philosophy 5
27 Paul Male Humanities Philosophy 12
28 Leah Female Humanities Languages 5
29 Matilda Female Humanities Languages 4
30 Olivia Female Humanities Languages 5
31 Lenny Male Humanities Languages 9
32 Max Male Humanities Languages 8
33 Nicholas Male Humanities Languages 3
34 Paul Male Humanities Languages 8
35 Leah Female Performance Music 4
36 Matilda Female Performance Music 10
37 Olivia Female Performance Music 10
38 Lenny Male Performance Music 2
39 Max Male Performance Music 4
40 Nicholas Male Performance Music 5
41 Paul Male Performance Music 6
42 Leah Female Performance Dance 4
43 Matilda Female Performance Dance 5
44 Olivia Female Performance Dance 6
45 Lenny Male Performance Dance 4
46 Max Male Performance Dance 12
47 Nicholas Male Performance Dance 9
48 Paul Male Performance Dance 2
49 Leah Female Performance Drama 2
50 Matilda Female Performance Drama 7
51 Olivia Female Performance Drama 8
52 Lenny Male Performance Drama 6
53 Max Male Performance Drama 1
54 Nicholas Male Performance Drama 12
55 Paul Male Performance Drama 3

Case 2: Repeated rows/columns of headers within the table#

pivot-repeated-headers.png

Observations :
The row header (Term1, Term2, Term3) is repeated in four locations; we only need one.
The index columns are clearly delineated; the pairing of subjects and names is assured.

For this, we can use Pandas’ directly, and take advantage of MultiIndexes:

(excel_file
.parse(sheet_name= "pivot-repeated-headers", header=[0, 1], index_col=[0,1,2])
.droplevel(axis = 0, level = 0)
.droplevel(axis = 1, level = 0)
.query("`Term 1` != 'Term 1'")
.rename_axis(index = ['subject', 'name'])
.melt(var_name = 'Term', value_name = 'scores', ignore_index = False)
.reset_index()
.fill_direction(subject = 'down')
)
/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/pandas_flavor/register.py:157: FutureWarning: This function will be deprecated in a 1.x release. Please use `pd.DataFrame.assign` instead.
  return method(self._obj, *args, **kwargs)
subject name Term scores
0 Classics Matilda Term 1 1
1 Classics Nicholas Term 1 2
2 Classics Olivia Term 1 8
3 Classics Paul Term 1 9
4 History Matilda Term 1 6
5 History Nicholas Term 1 2
6 History Olivia Term 1 5
7 History Paul Term 1 2
8 Music Matilda Term 1 0
9 Music Nicholas Term 1 1
10 Music Olivia Term 1 6
11 Music Paul Term 1 2
12 Drama Matilda Term 1 4
13 Drama Nicholas Term 1 9
14 Drama Olivia Term 1 2
15 Drama Paul Term 1 2
16 Classics Matilda Term 2 7
17 Classics Nicholas Term 2 9
18 Classics Olivia Term 2 4
19 Classics Paul Term 2 4
20 History Matilda Term 2 6
21 History Nicholas Term 2 4
22 History Olivia Term 2 0
23 History Paul Term 2 6
24 Music Matilda Term 2 5
25 Music Nicholas Term 2 1
26 Music Olivia Term 2 1
27 Music Paul Term 2 0
28 Drama Matilda Term 2 3
29 Drama Nicholas Term 2 3
30 Drama Olivia Term 2 7
31 Drama Paul Term 2 7
32 Classics Matilda Term 3 4
33 Classics Nicholas Term 3 2
34 Classics Olivia Term 3 9
35 Classics Paul Term 3 4
36 History Matilda Term 3 4
37 History Nicholas Term 3 6
38 History Olivia Term 3 0
39 History Paul Term 3 7
40 Music Matilda Term 3 1
41 Music Nicholas Term 3 2
42 Music Olivia Term 3 3
43 Music Paul Term 3 1
44 Drama Matilda Term 3 6
45 Drama Nicholas Term 3 2
46 Drama Olivia Term 3 6
47 Drama Paul Term 3 1

Case 3 : Headers amongst the data#

pivot-header-within-data.png

In this scenario, we have the subjects as a row header, mixed with the data. Note that the Term1,Term2,Term3 row is repeated.

Just like in case 1, we can use xlsx_cells to get the font data (specifically, the cells with bold font) and with that location, correctly align the data:

frame = xlsx_cells(excel_file, sheetnames="pivot-header-within-data", font=True, include_blank_cells=False)
frame['bold'] = frame.font.str.get('b')
frame
value internal_value coordinate row column data_type is_date number_format font bold
0 Classics Classics C2 2 3 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... True
1 Term 1 Term 1 C3 3 3 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... False
2 Term 2 Term 2 D3 3 4 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... False
3 Term 3 Term 3 E3 3 5 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... False
4 Matilda Matilda B4 4 2 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... False
... ... ... ... ... ... ... ... ... ... ...
75 3 3 E24 24 5 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... False
76 Paul Paul B25 25 2 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... False
77 5 5 C25 25 3 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... False
78 7 7 D25 25 4 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... False
79 2 2 E25 25 5 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... False

80 rows Ă— 10 columns

name_subject = (frame
                .loc[frame.data_type.eq('s') & ~frame.value.str.startswith('Term', na = False), ['value', 'bold', 'row', 'column']]
                .assign(subject = lambda df: np.where(df.bold, df.value, np.nan))
                .fill_direction(subject = 'down')
                .query('value != subject')
                .rename(columns={'value':'name'})
                .drop(columns = 'bold')
                )

name_subject
/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/pandas_flavor/register.py:157: FutureWarning: This function will be deprecated in a 1.x release. Please use `pd.DataFrame.assign` instead.
  return method(self._obj, *args, **kwargs)
name row column subject
4 Matilda 4 2 Classics
8 Nicholas 5 2 Classics
12 Olivia 6 2 Classics
16 Paul 7 2 Classics
24 Matilda 10 2 History
28 Nicholas 11 2 History
32 Olivia 12 2 History
36 Paul 13 2 History
44 Matilda 16 2 Music
48 Nicholas 17 2 Music
52 Olivia 18 2 Music
56 Paul 19 2 Music
64 Matilda 22 2 Drama
68 Nicholas 23 2 Drama
72 Olivia 24 2 Drama
76 Paul 25 2 Drama

Get the terms:

terms = (frame
        .loc[frame.value.str.startswith('Term', na = False), ['value', 'column']]
        .drop_duplicates()
        .rename(columns={'value':'term'})
        )
terms
term column
1 Term 1 3
2 Term 2 4
3 Term 3 5
scores = (frame
        .loc[frame.data_type == 'n', ['value', 'row', 'column']]
        .rename(columns={'value':'scores'})
        )
scores
scores row column
5 7 4 3
6 2 4 4
7 1 4 5
9 9 5 3
10 7 5 4
11 6 5 5
13 5 6 3
14 9 6 4
15 5 6 5
17 1 7 3
18 4 7 4
19 9 7 5
25 0 10 3
26 0 10 4
27 5 10 5
29 4 11 3
30 5 11 4
31 9 11 5
33 1 12 3
34 3 12 4
35 3 12 5
37 3 13 3
38 5 13 4
39 6 13 5
45 6 16 3
46 2 16 4
47 4 16 5
49 5 17 3
50 7 17 4
51 6 17 5
53 1 18 3
54 6 18 4
55 0 18 5
57 1 19 3
58 3 19 4
59 3 19 5
65 7 22 3
66 7 22 4
67 0 22 5
69 6 23 3
70 7 23 4
71 8 23 5
73 1 24 3
74 3 24 4
75 3 24 5
77 5 25 3
78 7 25 4
79 2 25 5

Merge the dataframes into one:

(name_subject
.drop(columns='column')
.merge(scores, on = 'row')
.merge(terms, on = 'column')
.drop(columns = ['row', 'column'])
)
name subject scores term
0 Matilda Classics 7 Term 1
1 Matilda Classics 2 Term 2
2 Matilda Classics 1 Term 3
3 Nicholas Classics 9 Term 1
4 Nicholas Classics 7 Term 2
5 Nicholas Classics 6 Term 3
6 Olivia Classics 5 Term 1
7 Olivia Classics 9 Term 2
8 Olivia Classics 5 Term 3
9 Paul Classics 1 Term 1
10 Paul Classics 4 Term 2
11 Paul Classics 9 Term 3
12 Matilda History 0 Term 1
13 Matilda History 0 Term 2
14 Matilda History 5 Term 3
15 Nicholas History 4 Term 1
16 Nicholas History 5 Term 2
17 Nicholas History 9 Term 3
18 Olivia History 1 Term 1
19 Olivia History 3 Term 2
20 Olivia History 3 Term 3
21 Paul History 3 Term 1
22 Paul History 5 Term 2
23 Paul History 6 Term 3
24 Matilda Music 6 Term 1
25 Matilda Music 2 Term 2
26 Matilda Music 4 Term 3
27 Nicholas Music 5 Term 1
28 Nicholas Music 7 Term 2
29 Nicholas Music 6 Term 3
30 Olivia Music 1 Term 1
31 Olivia Music 6 Term 2
32 Olivia Music 0 Term 3
33 Paul Music 1 Term 1
34 Paul Music 3 Term 2
35 Paul Music 3 Term 3
36 Matilda Drama 7 Term 1
37 Matilda Drama 7 Term 2
38 Matilda Drama 0 Term 3
39 Nicholas Drama 6 Term 1
40 Nicholas Drama 7 Term 2
41 Nicholas Drama 8 Term 3
42 Olivia Drama 1 Term 1
43 Olivia Drama 3 Term 2
44 Olivia Drama 3 Term 3
45 Paul Drama 5 Term 1
46 Paul Drama 7 Term 2
47 Paul Drama 2 Term 3