SpreadSheet Munging Strategies in Python - Meaningful Formats#
Meaningful Formats#
updated : April 15, 2022
This is part of a series of blog posts about extracting data from spreadsheets using Python. It is based on the book written by Duncan Garmonsway, which was written primarily for R users. LInks to the other posts are on the homepage.
This is an excerpt from Duncan’s book regarding meaningful formats :
Sometimes whole rows in a table are highlighted by formatting them with, say, a bright yellow fill. The highlighting could mean “this observation should be ignored”, or “this product is no longer available”. Different colours could mean different levels of a hierarchy, e.g. green for “pass” and red for “fail”.
The examples below highlight various ways of dealing with meaningfully formatted spreadsheets. Let’s dive in.
Case 1 : Meaningfully Formatted Rows#
The goal is to capture the color as part of our final output. We’ll make use of the xlsx_cells function from pyjanitor - under the hood, it uses openpyxl:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
import numpy as np
from janitor import xlsx_cells
filename = "Data_files/worked-examples.xlsx"
frame = xlsx_cells(filename, sheetnames = 'highlights', fill=True)
frame
value | internal_value | coordinate | row | column | data_type | is_date | number_format | fill | |
---|---|---|---|---|---|---|---|---|---|
0 | Age | Age | A1 | 1 | 1 | s | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... |
1 | Height | Height | B1 | 1 | 2 | s | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... |
2 | 1 | 1 | A2 | 2 | 1 | n | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... |
3 | 2 | 2 | B2 | 2 | 2 | n | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... |
4 | 3 | 3 | A3 | 3 | 1 | n | False | General | {'patternType': 'solid', 'fgColor': {'rgb': 'F... |
5 | 4 | 4 | B3 | 3 | 2 | n | False | General | {'patternType': 'solid', 'fgColor': {'rgb': 'F... |
6 | 5 | 5 | A4 | 4 | 1 | n | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... |
7 | 6 | 6 | B4 | 4 | 2 | n | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... |
You can view a single row in the fill column:
frame.fill[0]
{'patternType': None,
'fgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0},
'bgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}}
The colour information can be accessed using panda’s string methods:
frame['fill_colour'] = frame.fill.str.get('fgColor').str.get('rgb')
frame['fill_colour']
0 00000000
1 00000000
2 00000000
3 00000000
4 FFFFFF00
5 FFFFFF00
6 00000000
7 00000000
Name: fill_colour, dtype: object
headers = frame.loc[frame.data_type == 's', ['value', 'column']].set_index('column')['value']
numbers = frame.loc[frame.data_type == 'n', ['value', 'row', 'column', 'fill_colour']]
# remove duplicate rows, since the fill colour is the same per row,
fill_colour = numbers.loc[:, ['row', 'fill_colour']].drop_duplicates(subset=['row'])
del numbers['fill_colour']
headers
column
1 Age
2 Height
Name: value, dtype: object
numbers
value | row | column | |
---|---|---|---|
2 | 1 | 2 | 1 |
3 | 2 | 2 | 2 |
4 | 3 | 3 | 1 |
5 | 4 | 3 | 2 |
6 | 5 | 4 | 1 |
7 | 6 | 4 | 2 |
fill_colour
row | fill_colour | |
---|---|---|
2 | 2 | 00000000 |
4 | 3 | FFFFFF00 |
6 | 4 | 00000000 |
Some further reshaping to combine the headers, numbers and colour:
(numbers
.pivot(index='row', columns='column')
.droplevel(axis = 1, level = 0)
.rename(columns=headers)
.merge(fill_colour, on = 'row')
.drop(columns='row')
)
Age | Height | fill_colour | |
---|---|---|---|
0 | 1 | 2 | 00000000 |
1 | 3 | 4 | FFFFFF00 |
2 | 5 | 6 | 00000000 |
Case 2 : Meaningfully Formatted Cells#
In the pic above, the table has different colors for different cells. xlsx_cells comes in handy again, and the focus here, just as in Case 1, is the cell’s fill attribute.
frame = xlsx_cells(filename, sheetnames = 'annotations', fill=True)
frame['fill_colour'] = frame.fill.str.get('fgColor').str.get('rgb')
frame
value | internal_value | coordinate | row | column | data_type | is_date | number_format | fill | fill_colour | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Name | Name | A1 | 1 | 1 | s | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | 00000000 |
1 | Age | Age | B1 | 1 | 2 | s | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | 00000000 |
2 | Height | Height | C1 | 1 | 3 | s | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | 00000000 |
3 | Matilda | Matilda | A2 | 2 | 1 | s | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | 00000000 |
4 | 1 | 1 | B2 | 2 | 2 | n | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | 00000000 |
5 | 2 | 2 | C2 | 2 | 3 | n | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | 00000000 |
6 | Nicholas | Nicholas | A3 | 3 | 1 | s | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | 00000000 |
7 | 3 | 3 | B3 | 3 | 2 | n | False | General | {'patternType': 'solid', 'fgColor': {'rgb': 'F... | FFFFFF00 |
8 | 4 | 4 | C3 | 3 | 3 | n | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | 00000000 |
9 | Olivia | Olivia | A4 | 4 | 1 | s | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | 00000000 |
10 | 5 | 5 | B4 | 4 | 2 | n | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | 00000000 |
11 | 6 | 6 | C4 | 4 | 3 | n | False | General | {'patternType': 'solid', 'fgColor': {'rgb': 'F... | FF92D050 |
headers = frame.loc[frame.row.eq(1), ['value', 'row', 'column']].rename(columns={'value':'measure'})
students = frame.loc[frame.column.eq(1) & frame.row.gt(1), ['value', 'row', 'column']].rename(columns = {'value':'name'})
numbers = frame.loc[frame.data_type == 'n', ['value', 'row', 'column', 'fill_colour']]
headers
measure | row | column | |
---|---|---|---|
0 | Name | 1 | 1 |
1 | Age | 1 | 2 |
2 | Height | 1 | 3 |
numbers
value | row | column | fill_colour | |
---|---|---|---|---|
4 | 1 | 2 | 2 | 00000000 |
5 | 2 | 2 | 3 | 00000000 |
7 | 3 | 3 | 2 | FFFFFF00 |
8 | 4 | 3 | 3 | 00000000 |
10 | 5 | 4 | 2 | 00000000 |
11 | 6 | 4 | 3 | FF92D050 |
students
name | row | column | |
---|---|---|---|
3 | Matilda | 2 | 1 |
6 | Nicholas | 3 | 1 |
9 | Olivia | 4 | 1 |
We can combine the dataframes into one, based on positions:
(students
.drop(columns='column')
.merge(numbers, on='row')
.merge(headers.drop(columns='row'), on = 'column')
.loc[:, ['name', 'measure', 'value', 'fill_colour']]
)
name | measure | value | fill_colour | |
---|---|---|---|---|
0 | Matilda | Age | 1 | 00000000 |
1 | Matilda | Height | 2 | 00000000 |
2 | Nicholas | Age | 3 | FFFFFF00 |
3 | Nicholas | Height | 4 | 00000000 |
4 | Olivia | Age | 5 | 00000000 |
5 | Olivia | Height | 6 | FF92D050 |
Case 3 : Layered Meaningful Formatting#
In this case, we have formats for fill and font.
The format applies to the entire row.
frame = xlsx_cells(filename, sheetnames = 'combined-highlights', fill=True, font = True)
frame['fill_colour'] = frame.fill.str.get('fgColor').str.get('rgb')
frame['font_colour'] = frame.font.str.get('color').str.get('rgb')
frame
value | internal_value | coordinate | row | column | data_type | is_date | number_format | fill | font | fill_colour | font_colour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Name | Name | A1 | 1 | 1 | s | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | 00000000 | FF000000 |
1 | Weight | Weight | B1 | 1 | 2 | s | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | 00000000 | FF000000 |
2 | Price | Price | C1 | 1 | 3 | s | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | 00000000 | FF000000 |
3 | Knife | Knife | A2 | 2 | 1 | s | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | 00000000 | FF000000 |
4 | 7 | 7 | B2 | 2 | 2 | n | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | 00000000 | FF000000 |
5 | 8 | 8 | C2 | 2 | 3 | n | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | 00000000 | FF000000 |
6 | Fork | Fork | A3 | 3 | 1 | s | False | General | {'patternType': 'solid', 'fgColor': {'rgb': 'F... | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | FFFFFF00 | FF000000 |
7 | 5 | 5 | B3 | 3 | 2 | n | False | General | {'patternType': 'solid', 'fgColor': {'rgb': 'F... | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | FFFFFF00 | None |
8 | 6 | 6 | C3 | 3 | 3 | n | False | General | {'patternType': 'solid', 'fgColor': {'rgb': 'F... | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | FFFFFF00 | None |
9 | Spoon | Spoon | A4 | 4 | 1 | s | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | 00000000 | FFFF0000 |
10 | 3 | 3 | B4 | 4 | 2 | n | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | 00000000 | FFFF0000 |
11 | 4 | 4 | C4 | 4 | 3 | n | False | General | {'patternType': None, 'fgColor': {'rgb': '0000... | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | 00000000 | FFFF0000 |
12 | Teaspoon | Teaspoon | A5 | 5 | 1 | s | False | General | {'patternType': 'solid', 'fgColor': {'rgb': 'F... | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | FFFFFF00 | FFFF0000 |
13 | 1 | 1 | B5 | 5 | 2 | n | False | General | {'patternType': 'solid', 'fgColor': {'rgb': 'F... | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | FFFFFF00 | FFFF0000 |
14 | 2 | 2 | C5 | 5 | 3 | n | False | General | {'patternType': 'solid', 'fgColor': {'rgb': 'F... | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | FFFFFF00 | FFFF0000 |
headers = frame.loc[frame.row.eq(1), ['value', 'row', 'column']].rename(columns={'value':'measure'})
kitchen_items = frame.loc[frame.column.eq(1) & frame.row.gt(1), ['value', 'row', 'column']].rename(columns = {'value':'kitchen_item'})
numbers = frame.loc[frame.data_type == 'n', ['value', 'row', 'column', 'fill_colour', 'font_colour']]
# drop duplicates, since the format applies across the row
colour = numbers.loc[:, ['row', 'fill_colour', 'font_colour']].drop_duplicates(subset=['row'])
numbers = numbers.drop(columns=['fill_colour', 'font_colour'])
headers
measure | row | column | |
---|---|---|---|
0 | Name | 1 | 1 |
1 | Weight | 1 | 2 |
2 | Price | 1 | 3 |
kitchen_items
kitchen_item | row | column | |
---|---|---|---|
3 | Knife | 2 | 1 |
6 | Fork | 3 | 1 |
9 | Spoon | 4 | 1 |
12 | Teaspoon | 5 | 1 |
numbers
value | row | column | |
---|---|---|---|
4 | 7 | 2 | 2 |
5 | 8 | 2 | 3 |
7 | 5 | 3 | 2 |
8 | 6 | 3 | 3 |
10 | 3 | 4 | 2 |
11 | 4 | 4 | 3 |
13 | 1 | 5 | 2 |
14 | 2 | 5 | 3 |
colour
row | fill_colour | font_colour | |
---|---|---|---|
4 | 2 | 00000000 | FF000000 |
7 | 3 | FFFFFF00 | None |
10 | 4 | 00000000 | FFFF0000 |
13 | 5 | FFFFFF00 | FFFF0000 |
We can combine the dataframes into one, based on positions:
(kitchen_items
.drop(columns='column')
.merge(numbers, on='row')
.merge(headers.drop(columns='row'), on = 'column')
.pivot(index = ['kitchen_item', 'row'], columns='measure', values='value')
.reset_index(level='kitchen_item')
.merge(colour, on = 'row')
.drop(columns='row')
)
kitchen_item | Price | Weight | fill_colour | font_colour | |
---|---|---|---|---|---|
0 | Fork | 6 | 5 | FFFFFF00 | None |
1 | Knife | 8 | 7 | 00000000 | FF000000 |
2 | Spoon | 4 | 3 | 00000000 | FFFF0000 |
3 | Teaspoon | 2 | 1 | FFFFFF00 | FFFF0000 |
Case 4 : Hierarchies in Formatting#
In the pic above, notice the differences in the formatting (bold, italic, bold and italic, none). In Duncan’s book, he offers a suggestion on what these different levels of hierarchy might mean :
none - good
italic - satisfactory
bold - poor
bold and italic - fail
Again, the formatting is applied to the entire row.
frame = xlsx_cells(filename, sheetnames = "highlight-hierarchy", font=True)
frame['bold'] = frame.font.str.get('b')
frame['italics'] = frame.font.str.get('i')
frame
value | internal_value | coordinate | row | column | data_type | is_date | number_format | font | bold | italics | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Name | Name | A1 | 1 | 1 | s | False | General | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | True | False |
1 | Score | Score | B1 | 1 | 2 | s | False | General | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | True | False |
2 | Matilda | Matilda | A2 | 2 | 1 | s | False | General | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | False | False |
3 | 7 | 7 | B2 | 2 | 2 | n | False | General | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | False | False |
4 | Nicholas | Nicholas | A3 | 3 | 1 | s | False | General | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | False | True |
5 | 5 | 5 | B3 | 3 | 2 | n | False | General | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | False | True |
6 | Olivia | Olivia | A4 | 4 | 1 | s | False | General | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | True | False |
7 | 3 | 3 | B4 | 4 | 2 | n | False | General | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | True | False |
8 | Paul | Paul | A5 | 5 | 1 | s | False | General | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | True | True |
9 | 1 | 1 | B5 | 5 | 2 | n | False | General | {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... | True | True |
headers = frame.loc[frame.row.eq(1), ['value', 'row', 'column']].rename(columns={'value':'measure'})
students = frame.loc[frame.column.eq(1) & frame.row.gt(1), ['value', 'row', 'column']].rename(columns = {'value':'name'})
numbers = frame.loc[frame.data_type == 'n', ['value', 'row', 'column', 'bold', 'italics']].rename(columns={'value':'score'})
headers
measure | row | column | |
---|---|---|---|
0 | Name | 1 | 1 |
1 | Score | 1 | 2 |
students
name | row | column | |
---|---|---|---|
2 | Matilda | 2 | 1 |
4 | Nicholas | 3 | 1 |
6 | Olivia | 4 | 1 |
8 | Paul | 5 | 1 |
numbers
score | row | column | bold | italics | |
---|---|---|---|---|---|
3 | 7 | 2 | 2 | False | False |
5 | 5 | 3 | 2 | False | True |
7 | 3 | 4 | 2 | True | False |
9 | 1 | 5 | 2 | True | True |
(students
.drop(columns='column')
.merge(numbers, on='row')
.merge(headers.drop(columns='row'), on = 'column')
# the case_when function is from pyjanitor
# and is helpful for conditional evaluation
# it is similar to np.select/np.where/python's ifelse
.case_when(
'not bold and not italics', 'good',
'not bold and italics', 'satisfactory',
'bold and not italics', 'poor',
'fail', # default
column_name = 'grade')
.loc[:, ['name', 'bold', 'italics', 'score', 'grade']]
)
/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/pandas_flavor/register.py:157: DeprecationWarning: The last argument in the variable arguments has been assigned as the default. Note however that this will be deprecated in a future release; use an even number of boolean conditions and values, and pass the default argument to the `default` parameter instead.
return method(self._obj, *args, **kwargs)
name | bold | italics | score | grade | |
---|---|---|---|---|---|
0 | Matilda | False | False | 7 | good |
1 | Nicholas | False | True | 5 | satisfactory |
2 | Olivia | True | False | 3 | poor |
3 | Paul | True | True | 1 | fail |