SpreadSheet Munging Strategies in Python - Pivot Tables - Simple Unpivoting#

Pivot Tables - Simple Unpivoting#

updated : April 14, 2022

This is part of a series of blog posts about extracting data from spreadsheets using Python. It is based on the book written by Duncan Garmonsway, which was written primarily for R users. LInks to the other posts are on the homepage.

Pivot tables offer a summarized version of the data, and are usually succinct and visually appealing. However, for analysis, it is not so good. How do we convert pivot tables to tidy data? This section covers that. We’ll see various kinds of pivot tables, and work through each of them.

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
import numpy as np
excel_file = pd.ExcelFile("Data_files/worked-examples.xlsx", engine='openpyxl')

Case 1: Multiple Row and Column headers#

pivot-annotations.png

pivot-annotations-headerrowandcolumn.png

Observations:

  1. There are two header rows - gender and names.

  2. There are also two header columns on the left - Fields(Humanities, Performance) and Subjects(Classics, History, …)

  3. For this case we wont be concerned with the colour formats. We’ll look at that in a later section.

df = (excel_file
     .parse('pivot-annotations', header = [1,2], index_col=[0,1,2])
     .droplevel(axis = 0, level = 0)
     )

df
Female Male
Matilda Olivia Nicholas Paul
Humanities Classics 1 2 3 0
History 3 4 5 1
Performance Music 5 6 9 2
Drama 7 8 12 3
df.index.names = ['field', 'subject']
df.columns.names = ['gender', 'student']

(df
.stack(['gender', 'student'])
.rename('scores')
.reset_index()
.ffill()
)
/tmp/ipykernel_551338/1922738328.py:4: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
  (df
field subject gender student scores
0 Humanities Classics Female Matilda 1.0
1 Humanities Classics Female Olivia 2.0
2 Humanities Classics Male Nicholas 3.0
3 Humanities Classics Male Paul 0.0
4 Humanities History Female Matilda 3.0
5 Humanities History Female Olivia 4.0
6 Humanities History Male Nicholas 5.0
7 Humanities History Male Paul 1.0
8 Performance Music Female Matilda 5.0
9 Performance Music Female Olivia 6.0
10 Performance Music Male Nicholas 9.0
11 Performance Music Male Paul 2.0
12 Performance Drama Female Matilda 7.0
13 Performance Drama Female Olivia 8.0
14 Performance Drama Male Nicholas 12.0
15 Performance Drama Male Paul 3.0

Case 2 : Multiple rows or columns of headers, with meaningful formatting#

pivot-annotations.png

Same as above, except we have to capture the format details. To get the colour information, we’ll make use of the xlsx_cells function from pyjanitor - under the hood, it uses openpyxl:

from janitor import xlsx_cells
df = xlsx_cells(excel_file, sheetnames='pivot-annotations', include_blank_cells=False, fill=True)
# the colours are stored in a dictionary form in a cell
# using Pandas' string functions make it easy to traverse
# and pick the exact metadata we are interested in.
df['fill_colour'] = df.fill.str.get('fgColor').str.get('rgb')
df
value internal_value coordinate row column data_type is_date number_format fill fill_colour
0 Female Female D2 2 4 s False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
1 Male Male F2 2 6 s False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
2 Matilda Matilda D3 3 4 s False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
3 Olivia Olivia E3 3 5 s False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
4 Nicholas Nicholas F3 3 6 s False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
5 Paul Paul G3 3 7 s False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
6 Humanities Humanities B4 4 2 s False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
7 Classics Classics C4 4 3 s False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
8 1 1 D4 4 4 n False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
9 2 2 E4 4 5 n False General {'patternType': 'solid', 'fgColor': {'rgb': 'F... FFFFFF00
10 3 3 F4 4 6 n False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
11 0 0 G4 4 7 n False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
12 History History C5 5 3 s False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
13 3 3 D5 5 4 n False General {'patternType': 'solid', 'fgColor': {'rgb': 'F... FFFFFF00
14 4 4 E5 5 5 n False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
15 5 5 F5 5 6 n False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
16 1 1 G5 5 7 n False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
17 Performance Performance B6 6 2 s False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
18 Music Music C6 6 3 s False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
19 5 5 D6 6 4 n False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
20 6 6 E6 6 5 n False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
21 9 9 F6 6 6 n False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
22 2 2 G6 6 7 n False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
23 Drama Drama C7 7 3 s False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
24 7 7 D7 7 4 n False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
25 8 8 E7 7 5 n False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000
26 12 12 F7 7 6 n False General {'patternType': 'solid', 'fgColor': {'rgb': 'F... FFFFFF00
27 3 3 G7 7 7 n False General {'patternType': None, 'fgColor': {'rgb': '0000... 00000000

From the excel file we can tell that the fields and subjects columns are columns 2 and 3 respectively, the gender row is 2, while the students are on row 3:

index = df.loc[df.column.isin([2,3]), ['value', 'row', 'column']]
index = (index
        .assign(fields = np.where(index.column.eq(2), index.value, np.nan), 
                subjects = lambda df: np.where(df.fields.isna(), df.value, np.nan))
        # pyjanitor ... syntactic sugar for pd.Series.ffill     
        .fill_direction(fields='down')
        .dropna()
        .drop(columns='value')
        )
index
/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/pandas_flavor/register.py:157: FutureWarning: This function will be deprecated in a 1.x release. Please use `pd.DataFrame.assign` instead.
  return method(self._obj, *args, **kwargs)
row column fields subjects
7 4 3 Humanities Classics
12 5 3 Humanities History
18 6 3 Performance Music
23 7 3 Performance Drama
columns = df.loc[df.row.isin([2,3]), ['value', 'row', 'column']]
columns = (columns
            .sort_values(['column'])
            .assign(gender = lambda df: np.where(df.row.eq(2), df.value, np.nan), 
                    students = lambda df: np.where(df.gender.isna(), df.value, np.nan))
            .fill_direction(gender='down')
            .dropna()
            .drop(columns='value')
        )
columns
/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/pandas_flavor/register.py:157: FutureWarning: This function will be deprecated in a 1.x release. Please use `pd.DataFrame.assign` instead.
  return method(self._obj, *args, **kwargs)
row column gender students
2 3 4 Female Matilda
3 3 5 Female Olivia
4 3 6 Male Nicholas
5 3 7 Male Paul
scores = (df
        .loc[df.data_type == 'n', ['value', 'row', 'column', 'fill_colour']]
        .rename(columns={'value':'scores'})
        )
scores
scores row column fill_colour
8 1 4 4 00000000
9 2 4 5 FFFFFF00
10 3 4 6 00000000
11 0 4 7 00000000
13 3 5 4 FFFFFF00
14 4 5 5 00000000
15 5 5 6 00000000
16 1 5 7 00000000
19 5 6 4 00000000
20 6 6 5 00000000
21 9 6 6 00000000
22 2 6 7 00000000
24 7 7 4 00000000
25 8 7 5 00000000
26 12 7 6 FFFFFF00
27 3 7 7 00000000

All that’s left is to align them on their row and column positions:

(index
.drop(columns='column')
.merge(scores, on = 'row')
.merge(columns.drop(columns='row'), on = 'column')
.loc[:, ['students', 'gender', 'fields', 'subjects', 'scores', 'fill_colour']]
)
students gender fields subjects scores fill_colour
0 Matilda Female Humanities Classics 1 00000000
1 Olivia Female Humanities Classics 2 FFFFFF00
2 Nicholas Male Humanities Classics 3 00000000
3 Paul Male Humanities Classics 0 00000000
4 Matilda Female Humanities History 3 FFFFFF00
5 Olivia Female Humanities History 4 00000000
6 Nicholas Male Humanities History 5 00000000
7 Paul Male Humanities History 1 00000000
8 Matilda Female Performance Music 5 00000000
9 Olivia Female Performance Music 6 00000000
10 Nicholas Male Performance Music 9 00000000
11 Paul Male Performance Music 2 00000000
12 Matilda Female Performance Drama 7 00000000
13 Olivia Female Performance Drama 8 00000000
14 Nicholas Male Performance Drama 12 FFFFFF00
15 Paul Male Performance Drama 3 00000000

Case 3 : Mixed headers and notes in the same row/column, distinguished by formatting#

pivot-notes.png

In this case, the highlighted texts should be excluded, as they are just notes and not relevant. Same goes for the 0=absent in G2 cell, which is italicized. red is identifed as FFFF0000, while italics should return True if it exists.

We’ll use the same steps as in the last solution:

df = xlsx_cells(excel_file, sheetnames='pivot-notes', include_blank_cells=False, font=True)

You can view a single row in the font column:

df.font[0]
{'name': 'Calibri',
 'family': 2.0,
 'sz': 11.0,
 'b': False,
 'i': False,
 'u': None,
 'strike': None,
 'color': {'rgb': 'FF000000', 'type': 'rgb', 'tint': 0.0},
 'vertAlign': None,
 'charset': 1,
 'outline': None,
 'shadow': None,
 'condense': None,
 'extend': None,
 'scheme': None}
df = df.assign(font_colour = df.font.str.get('color').str.get('rgb'), 
               italics = df.font.str.get('i'))
df
value internal_value coordinate row column data_type is_date number_format font font_colour italics
0 Female Female D2 2 4 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
1 Male Male F2 2 6 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
2 0 = absent 0 = absent G2 2 7 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 True
3 Matilda Matilda D3 3 4 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
4 Olivia Olivia E3 3 5 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
5 Nicholas Nicholas F3 3 6 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
6 Paul Paul G3 3 7 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
7 Humanities Humanities B4 4 2 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
8 Classics Classics C4 4 3 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
9 1 1 D4 4 4 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
10 2 2 E4 4 5 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
11 3 3 F4 4 6 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
12 0 0 G4 4 7 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
13 Excl. project work Excl. project work B5 5 2 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FFFF0000 False
14 History History C5 5 3 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
15 3 3 D5 5 4 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
16 4 4 E5 5 5 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
17 5 5 F5 5 6 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
18 1 1 G5 5 7 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
19 Performance Performance B6 6 2 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
20 Music Music C6 6 3 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
21 5 5 D6 6 4 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
22 6 6 E6 6 5 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
23 9 9 F6 6 6 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
24 2 2 G6 6 7 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
25 Incl. written exam Incl. written exam B7 7 2 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FFFF0000 False
26 Drama Drama C7 7 3 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
27 7 7 D7 7 4 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
28 8 8 E7 7 5 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
29 12 12 F7 7 6 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
30 3 3 G7 7 7 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... FF000000 False
index = df.loc[df.column.isin([2,3]), ['value', 'row', 'column', 'font_colour']]
index = (index
        .loc[lambda df: df.font_colour.ne('FFFF0000')]
        .assign(fields = lambda df: np.where(df.column.eq(2), df.value, np.nan), 
                subjects = lambda df: np.where(df.fields.isna(), df.value, np.nan))   
        .fill_direction(fields='down')
        .dropna()
        .drop(columns=['value', 'font_colour'])
        )
index
/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/pandas_flavor/register.py:157: FutureWarning: This function will be deprecated in a 1.x release. Please use `pd.DataFrame.assign` instead.
  return method(self._obj, *args, **kwargs)
row column fields subjects
8 4 3 Humanities Classics
14 5 3 Humanities History
20 6 3 Performance Music
26 7 3 Performance Drama
columns = df.loc[df.row.isin([2,3]), ['value', 'row', 'column', 'italics']]
columns = (columns
            .sort_values(['column'])
            .loc[lambda df: ~df.italics]
            .assign(gender = lambda df: np.where(df.row.eq(2), df.value, np.nan), 
                    students = lambda df: np.where(df.gender.isna(), df.value, np.nan))
            .fill_direction(gender = 'down')
            .dropna()
            .drop(columns=['value', 'italics'])
        )
columns
/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/pandas_flavor/register.py:157: FutureWarning: This function will be deprecated in a 1.x release. Please use `pd.DataFrame.assign` instead.
  return method(self._obj, *args, **kwargs)
row column gender students
3 3 4 Female Matilda
4 3 5 Female Olivia
5 3 6 Male Nicholas
6 3 7 Male Paul
scores = (df
        .loc[df.data_type == 'n', ['value', 'row', 'column']]
        .rename(columns={'value':'scores'})
        )
scores
scores row column
9 1 4 4
10 2 4 5
11 3 4 6
12 0 4 7
15 3 5 4
16 4 5 5
17 5 5 6
18 1 5 7
21 5 6 4
22 6 6 5
23 9 6 6
24 2 6 7
27 7 7 4
28 8 7 5
29 12 7 6
30 3 7 7
(index
.drop(columns='column')
.merge(scores, on = 'row')
.merge(columns.drop(columns='row'), on = 'column')
.loc[:, ['students', 'gender', 'fields', 'subjects', 'scores']]
)
students gender fields subjects scores
0 Matilda Female Humanities Classics 1
1 Olivia Female Humanities Classics 2
2 Nicholas Male Humanities Classics 3
3 Paul Male Humanities Classics 0
4 Matilda Female Humanities History 3
5 Olivia Female Humanities History 4
6 Nicholas Male Humanities History 5
7 Paul Male Humanities History 1
8 Matilda Female Performance Music 5
9 Olivia Female Performance Music 6
10 Nicholas Male Performance Music 9
11 Paul Male Performance Music 2
12 Matilda Female Performance Drama 7
13 Olivia Female Performance Drama 8
14 Nicholas Male Performance Drama 12
15 Paul Male Performance Drama 3

Case 4: Mixed levels of headers in the same row/column, distinguished by formatting#

pivot-hierarchy.png

The formatting here refers to the indent - notice how Humanities and Performance are closer to the border compared to the subjects. Same goes for Matilda and Nicholas. Getting the indent information would require the use of Openpyxl. This data can however be handled easily by Pandas - for the header columns, we can use the adjacent null cells as identifiers:

(excel_file
.parse(sheet_name='pivot-hierarchy', header=[0,1])
.droplevel(axis = 1, level = 0)
.remove_empty()
.assign(fields = lambda df: np.where(df.iloc[:, -1].isna(), df.iloc[:, 0], np.nan),
        subjects = lambda df: np.where(df.fields.isna(), df.iloc[:, 0], np.nan)
        )
.fill_direction(fields = 'down')
.dropna()
.iloc[:, 1:]
.melt(id_vars = ['fields', 'subjects'], var_name = 'students', value_name = 'scores')
)
/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/pandas_flavor/register.py:157: FutureWarning: This function will be deprecated in a 1.x release. Please use `pd.DataFrame.assign` instead.
  return method(self._obj, *args, **kwargs)
fields subjects students scores
0 Humanities Classics Matilda 1.0
1 Humanities History Matilda 3.0
2 Performance Music Matilda 5.0
3 Performance Drama Matilda 7.0
4 Humanities Classics Nicholas 3.0
5 Humanities History Nicholas 5.0
6 Performance Music Nicholas 9.0
7 Performance Drama Nicholas 12.0