SpreadSheet Munging Strategies in Python - Meaningful Formats

SpreadSheet Munging Strategies in Python - Meaningful Formats#

Meaningful Formats#

updated : September 24, 2024

This is part of a series of blog posts about extracting data from spreadsheets using Python. It is based on the book written by Duncan Garmonsway, which was written primarily for R users. Links to the other posts are on the sidebar.

The key takeaway is this - you understand your data layout; use the tools to achieve your end goal. xlsx_cells offers a way to get the cells in a spreadsheet into individual rows, with some metadata. The final outcome however relies on your understanding of the data layout and its proper application.

This is an excerpt from Duncan’s book regarding meaningful formats :

Sometimes whole rows in a table are highlighted by formatting them with, say, a bright yellow fill. The highlighting could mean “this observation should be ignored”, or “this product is no longer available”. Different colours could mean different levels of a hierarchy, e.g. green for “pass” and red for “fail”.

The examples below highlight various ways of dealing with meaningfully formatted spreadsheets. Let’s dive in.

Case 1 : Meaningfully Formatted Rows#

The goal is to capture the color as part of our final output. We’ll make use of the xlsx_cells function from pyjanitor - under the hood, it uses openpyxl:

# pip install pyjanitor
import pandas as pd
import janitor as jn
import sys
import numpy as np

filename = "Data_files/worked-examples.xlsx"

print('pandas version: ', pd.__version__)
print('janitor version: ', jn.__version__)
print('python version: ', sys.version)
print('numpy version: ', np.__version__)

pandas version:  2.2.2
janitor version:  0.29.1
python version:  3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
numpy version:  2.0.2

frame = jn.xlsx_cells(filename, sheetnames="highlights", fill=True)
frame

	value	internal_value	coordinate	row	column	data_type	is_date	number_format	fill
0	Age	Age	A1	1	1	s	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...
1	Height	Height	B1	1	2	s	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...
2	1	1	A2	2	1	n	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...
3	2	2	B2	2	2	n	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...
4	3	3	A3	3	1	n	False	General	{'patternType': 'solid', 'fgColor': {'rgb': 'F...
5	4	4	B3	3	2	n	False	General	{'patternType': 'solid', 'fgColor': {'rgb': 'F...
6	5	5	A4	4	1	n	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...
7	6	6	B4	4	2	n	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...

You can view a single row in the fill column:

frame.fill[0]

{'patternType': None,
 'fgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0},
 'bgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}}

Steps:

get the headers
get the data(numbers)
get the colours
combine to create the final DataFrame

headers:

headers = frame.loc[frame.row == "1", "value"].array
headers

<NumpyExtensionArray>
['Age', 'Height']
Length: 2, dtype: object

data:

booleans = frame.data_type == "n"
data = frame.loc[booleans, "value"].array
# There are only two columns in the spreadsheet;
# we'll use that knowledge in reshaping the integer values:
data = data.astype(np.int8).reshape((-1, 2))
data

array([[1, 2],
       [3, 4],
       [5, 6]], dtype=int8)

colours:

fill_colour = frame.fill.str.get("fgColor").str.get("rgb").array
# keep only rows related to the integers:
fill_colour = fill_colour[booleans]
# the number of columns is 2,
# so we only need to jump 2 steps
# to get the corresponding colour per row
fill_colour = fill_colour[::2]
fill_colour

<NumpyExtensionArray>
['00000000', 'FFFFFF00', '00000000']
Length: 3, dtype: object

outcome = pd.DataFrame(data, columns=headers).assign(fill_colour=fill_colour)
outcome

	Age	Height	fill_colour
0	1	2	00000000
1	3	4	FFFFFF00
2	5	6	00000000

Case 2 : Meaningfully Formatted Cells#

In the pic above, the table has different colors for different cells. xlsx_cells comes in handy again, and the focus here, just as in Case 1, is the cell’s fill attribute.

frame = jn.xlsx_cells(filename, sheetnames="annotations", fill=True).astype(
    {"row": np.int8, "column": np.int8}
)
frame

	value	internal_value	coordinate	row	column	data_type	is_date	number_format	fill
0	Name	Name	A1	1	1	s	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...
1	Age	Age	B1	1	2	s	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...
2	Height	Height	C1	1	3	s	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...
3	Matilda	Matilda	A2	2	1	s	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...
4	1	1	B2	2	2	n	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...
5	2	2	C2	2	3	n	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...
6	Nicholas	Nicholas	A3	3	1	s	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...
7	3	3	B3	3	2	n	False	General	{'patternType': 'solid', 'fgColor': {'rgb': 'F...
8	4	4	C3	3	3	n	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...
9	Olivia	Olivia	A4	4	1	s	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...
10	5	5	B4	4	2	n	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...
11	6	6	C4	4	3	n	False	General	{'patternType': 'solid', 'fgColor': {'rgb': 'F...

Steps:

get the headers
get the data in separate columns(strings, numbers)
get the colours
combine to create the final DataFrame

headers:

# all the headers are on the first row
# retain the column positions,
# so we can link them back to the data and colours
headers = frame.loc[frame.row.eq(1), ["value", "column"]]
headers.columns = ["measure", "column"]
headers

	measure	column
0	Name	1
1	Age	2
2	Height	3

data:

# the string values have a data type of 's'
# and the row numbers are greater than 1
# we'll keep the row number so we can correctly
# match with the measurements(age, height)
booleans = frame.row.gt(1) & frame.data_type.eq("s")
students = frame.loc[booleans, ["value", "row"]]
students.columns = ["name", "row"]
students

	name	row
3	Matilda	2
6	Nicholas	3
9	Olivia	4

# the scores values have a data type of 'n'
# and the row numbers are greater than 1
# we'll keep the row and column columns
#  so we can correctly
# match with the student names
booleans = frame.row.gt(1) & frame.data_type.eq("n")
numbers = frame.loc[booleans, ["value", "row", "column"]]
numbers

	value	row	column
4	1	2	2
5	2	2	3
7	3	3	2
8	4	3	3
10	5	4	2
11	6	4	3

colours:

fill_colour = frame.fill.str.get("fgColor").str.get("rgb")
# keep only rows for numbers:
fill_colour = fill_colour[booleans]
fill_colour

   00000000
   00000000
   FFFFFF00
   00000000
  00000000
  FF92D050
Name: fill, dtype: object

combine into a single DataFrame:

outcome = (
    numbers.assign(fill_colour=fill_colour)
    .merge(students, on="row")
    .merge(headers, on="column")
    .loc[:, ["name", "measure", "value", "fill_colour"]]
)

outcome

	name	measure	value	fill_colour
0	Matilda	Age	1	00000000
1	Matilda	Height	2	00000000
2	Nicholas	Age	3	FFFFFF00
3	Nicholas	Height	4	00000000
4	Olivia	Age	5	00000000
5	Olivia	Height	6	FF92D050

Case 3 : Layered Meaningful Formatting#

In this case, we have formats for fill and font.

The format applies to the entire row.

Steps:

get the headers
get the data in separate columns(kitchen_items, price, weight)
get the colours (fill and font)
combine to create the final DataFrame

frame = jn.xlsx_cells(filename, sheetnames="combined-highlights", fill=True, font=True)
frame = frame.astype({"row": np.int8, "column": np.int8})
frame

	value	internal_value	coordinate	row	column	data_type	is_date	number_format	fill	font
0	Name	Name	A1	1	1	s	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1	Weight	Weight	B1	1	2	s	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2	Price	Price	C1	1	3	s	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
3	Knife	Knife	A2	2	1	s	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
4	7	7	B2	2	2	n	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
5	8	8	C2	2	3	n	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
6	Fork	Fork	A3	3	1	s	False	General	{'patternType': 'solid', 'fgColor': {'rgb': 'F...	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
7	5	5	B3	3	2	n	False	General	{'patternType': 'solid', 'fgColor': {'rgb': 'F...	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
8	6	6	C3	3	3	n	False	General	{'patternType': 'solid', 'fgColor': {'rgb': 'F...	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
9	Spoon	Spoon	A4	4	1	s	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
10	3	3	B4	4	2	n	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
11	4	4	C4	4	3	n	False	General	{'patternType': None, 'fgColor': {'rgb': '0000...	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
12	Teaspoon	Teaspoon	A5	5	1	s	False	General	{'patternType': 'solid', 'fgColor': {'rgb': 'F...	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
13	1	1	B5	5	2	n	False	General	{'patternType': 'solid', 'fgColor': {'rgb': 'F...	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
14	2	2	C5	5	3	n	False	General	{'patternType': 'solid', 'fgColor': {'rgb': 'F...	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...

headers:

# headers are on the first row
# we'll rely on the `column` column
# to align with the weight and price extracts
headers = frame.loc[frame.row.eq(1), ["value", "column"]]
headers

	value	column
0	Name	1
1	Weight	2
2	Price	3

data:

# kitchen_items are string datatype
# they are on rows > 1
booleans = frame.row.gt(1) & frame.data_type.eq("s")
kitchen_items = frame.loc[booleans, "value"].array
kitchen_items

<NumpyExtensionArray>
['Knife', 'Fork', 'Spoon', 'Teaspoon']
Length: 4, dtype: object

# price and weight are number datatypes
# they are on rows > 1
booleans = frame.row.gt(1) & frame.data_type.eq("n")
# retrieve column positions:
columns = frame.loc[booleans, "column"]
numbers = frame.loc[booleans, "value"].astype(np.int8).array
# we know there are only two number columns - weight and price
# we'll use that knowledge in reshaping:
numbers = numbers.reshape((-1, 2))
numbers

<NumpyExtensionArray>
[
[np.int8(7), np.int8(8)],
[np.int8(5), np.int8(6)],
[np.int8(3), np.int8(4)],
[np.int8(1), np.int8(2)]
]
Shape: (4, 2), dtype: int8

colors:

fill_colour = frame.fill.str.get("fgColor").str.get("rgb")
font_colour = frame.font.str.get("color").str.get("rgb")
# keep only rows aligned with the numbers:
fill_colour = fill_colour[booleans].array
font_colour = font_colour[booleans].array
# we know there are only two number columns
# this means we only need to jump two steps per colour:
fill_colour = fill_colour[::2]
font_colour = font_colour[::2]

fill_colour

<NumpyExtensionArray>
['00000000', 'FFFFFF00', '00000000', 'FFFFFF00']
Length: 4, dtype: object

font_colour

<NumpyExtensionArray>
['FF000000', None, 'FFFF0000', 'FFFF0000']
Length: 4, dtype: object

Combine into one DataFrame:

# filter headers for only rows
# that align with the weight and price extracts
headers = headers.loc[headers.column.isin(columns), "value"].array
outcome = pd.DataFrame(numbers, columns=headers)
outcome = outcome.assign(
    kitchen_item=kitchen_items, fill_colour=fill_colour, font_colour=font_colour
).loc[:, ["kitchen_item", "Price", "Weight", "fill_colour", "font_colour"]]
outcome

	kitchen_item	Price	Weight	fill_colour	font_colour
0	Knife	8	7	00000000	FF000000
1	Fork	6	5	FFFFFF00	None
2	Spoon	4	3	00000000	FFFF0000
3	Teaspoon	2	1	FFFFFF00	FFFF0000

Note that an alternative would be to use the row and column positions to reshape the data - similar to the solution in case 2

Case 4 : Hierarchies in Formatting#

In the pic above, notice the differences in the formatting (bold, italic, bold and italic, none). In Duncan’s book, he offers a suggestion on what these different levels of hierarchy might mean :

none - good
italic - satisfactory
bold - poor
bold and italic - fail

Again, the formatting is applied to the entire row.

frame = jn.xlsx_cells(filename, sheetnames="highlight-hierarchy", font=True)
frame = frame.astype({'row':np.int8, 'column':np.int8})
frame

	value	internal_value	coordinate	row	column	data_type	is_date	number_format	font
0	Name	Name	A1	1	1	s	False	General	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1	Score	Score	B1	1	2	s	False	General	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2	Matilda	Matilda	A2	2	1	s	False	General	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
3	7	7	B2	2	2	n	False	General	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
4	Nicholas	Nicholas	A3	3	1	s	False	General	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
5	5	5	B3	3	2	n	False	General	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
6	Olivia	Olivia	A4	4	1	s	False	General	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
7	3	3	B4	4	2	n	False	General	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
8	Paul	Paul	A5	5	1	s	False	General	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
9	1	1	B5	5	2	n	False	General	{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...

Steps:

get the data in separate columns(score, grade)
get the fonts (bold, italics)
combine to create the final DataFrame

data:

booleans = frame.row.gt(1) & frame.data_type.eq('s')
students = frame.loc[booleans, 'value'].array
students

<NumpyExtensionArray>
['Matilda', 'Nicholas', 'Olivia', 'Paul']
Length: 4, dtype: object

booleans = frame.row.gt(1) & frame.data_type.eq('n')
score = frame.loc[booleans, 'value'].array.astype(np.int8)
score

array([7, 5, 3, 1], dtype=int8)

fonts:

bold = frame.font.str.get("b")[booleans].to_numpy()
bold

array([False, False,  True,  True])

italics = frame.font.str.get("i")[booleans][booleans].to_numpy()
italics

array([False,  True, False,  True])

condlist = [~bold & ~italics, ~bold & italics, bold & ~italics]
choicelist = ['good','satisfactory','poor']
grade = np.select(condlist, choicelist, default='fail')
grade

array(['good', 'satisfactory', 'poor', 'fail'], dtype='<U12')

Combine into a single DataFrame:

outcome = {'name':students, 'bold':bold,'italics':italics,'score':score,'grade':grade}
outcome = pd.DataFrame(outcome)
outcome

	name	bold	italics	score	grade
0	Matilda	False	False	7	good
1	Nicholas	False	True	5	satisfactory
2	Olivia	True	False	3	poor
3	Paul	True	True	1	fail

SpreadSheet Munging Strategies in Python - Meaningful Formats

Contents

SpreadSheet Munging Strategies in Python - Meaningful Formats#

Meaningful Formats#

Case 1 : Meaningfully Formatted Rows#

Case 2 : Meaningfully Formatted Cells#

Case 3 : Layered Meaningful Formatting#

Case 4 : Hierarchies in Formatting#

Comments#