SpreadSheet Munging Strategies in Python - Meaningful Formats#

Meaningful Formats#

updated : September 24, 2024

This is part of a series of blog posts about extracting data from spreadsheets using Python. It is based on the book written by Duncan Garmonsway, which was written primarily for R users. Links to the other posts are on the sidebar.

The key takeaway is this - you understand your data layout; use the tools to achieve your end goal. xlsx_cells offers a way to get the cells in a spreadsheet into individual rows, with some metadata. The final outcome however relies on your understanding of the data layout and its proper application.

This is an excerpt from Duncan’s book regarding meaningful formats :

Sometimes whole rows in a table are highlighted by formatting them with, say, a bright yellow fill. The highlighting could mean “this observation should be ignored”, or “this product is no longer available”. Different colours could mean different levels of a hierarchy, e.g. green for “pass” and red for “fail”.

The examples below highlight various ways of dealing with meaningfully formatted spreadsheets. Let’s dive in.

Case 1 : Meaningfully Formatted Rows#

meaningfully_formatted_rows.png

The goal is to capture the color as part of our final output. We’ll make use of the xlsx_cells function from pyjanitor - under the hood, it uses openpyxl:

# pip install pyjanitor
import pandas as pd
import janitor as jn
import sys
import numpy as np

filename = "Data_files/worked-examples.xlsx"
print('pandas version: ', pd.__version__)
print('janitor version: ', jn.__version__)
print('python version: ', sys.version)
print('numpy version: ', np.__version__)
pandas version:  2.2.2
janitor version:  0.29.1
python version:  3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
numpy version:  2.0.2
frame = jn.xlsx_cells(filename, sheetnames="highlights", fill=True)
frame
value internal_value coordinate row column data_type is_date number_format fill
0 Age Age A1 1 1 s False General {'patternType': None, 'fgColor': {'rgb': '0000...
1 Height Height B1 1 2 s False General {'patternType': None, 'fgColor': {'rgb': '0000...
2 1 1 A2 2 1 n False General {'patternType': None, 'fgColor': {'rgb': '0000...
3 2 2 B2 2 2 n False General {'patternType': None, 'fgColor': {'rgb': '0000...
4 3 3 A3 3 1 n False General {'patternType': 'solid', 'fgColor': {'rgb': 'F...
5 4 4 B3 3 2 n False General {'patternType': 'solid', 'fgColor': {'rgb': 'F...
6 5 5 A4 4 1 n False General {'patternType': None, 'fgColor': {'rgb': '0000...
7 6 6 B4 4 2 n False General {'patternType': None, 'fgColor': {'rgb': '0000...

You can view a single row in the fill column:

frame.fill[0]
{'patternType': None,
 'fgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0},
 'bgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}}

Steps:

  • get the headers

  • get the data(numbers)

  • get the colours

  • combine to create the final DataFrame

headers:

headers = frame.loc[frame.row == "1", "value"].array
headers
<NumpyExtensionArray>
['Age', 'Height']
Length: 2, dtype: object

data:

booleans = frame.data_type == "n"
data = frame.loc[booleans, "value"].array
# There are only two columns in the spreadsheet;
# we'll use that knowledge in reshaping the integer values:
data = data.astype(np.int8).reshape((-1, 2))
data
array([[1, 2],
       [3, 4],
       [5, 6]], dtype=int8)

colours:

fill_colour = frame.fill.str.get("fgColor").str.get("rgb").array
# keep only rows related to the integers:
fill_colour = fill_colour[booleans]
# the number of columns is 2,
# so we only need to jump 2 steps
# to get the corresponding colour per row
fill_colour = fill_colour[::2]
fill_colour
<NumpyExtensionArray>
['00000000', 'FFFFFF00', '00000000']
Length: 3, dtype: object
outcome = pd.DataFrame(data, columns=headers).assign(fill_colour=fill_colour)
outcome
Age Height fill_colour
0 1 2 00000000
1 3 4 FFFFFF00
2 5 6 00000000

Case 2 : Meaningfully Formatted Cells#

meaningfully_formatted_cells.png

In the pic above, the table has different colors for different cells. xlsx_cells comes in handy again, and the focus here, just as in Case 1, is the cell’s fill attribute.

frame = jn.xlsx_cells(filename, sheetnames="annotations", fill=True).astype(
    {"row": np.int8, "column": np.int8}
)
frame
value internal_value coordinate row column data_type is_date number_format fill
0 Name Name A1 1 1 s False General {'patternType': None, 'fgColor': {'rgb': '0000...
1 Age Age B1 1 2 s False General {'patternType': None, 'fgColor': {'rgb': '0000...
2 Height Height C1 1 3 s False General {'patternType': None, 'fgColor': {'rgb': '0000...
3 Matilda Matilda A2 2 1 s False General {'patternType': None, 'fgColor': {'rgb': '0000...
4 1 1 B2 2 2 n False General {'patternType': None, 'fgColor': {'rgb': '0000...
5 2 2 C2 2 3 n False General {'patternType': None, 'fgColor': {'rgb': '0000...
6 Nicholas Nicholas A3 3 1 s False General {'patternType': None, 'fgColor': {'rgb': '0000...
7 3 3 B3 3 2 n False General {'patternType': 'solid', 'fgColor': {'rgb': 'F...
8 4 4 C3 3 3 n False General {'patternType': None, 'fgColor': {'rgb': '0000...
9 Olivia Olivia A4 4 1 s False General {'patternType': None, 'fgColor': {'rgb': '0000...
10 5 5 B4 4 2 n False General {'patternType': None, 'fgColor': {'rgb': '0000...
11 6 6 C4 4 3 n False General {'patternType': 'solid', 'fgColor': {'rgb': 'F...

Steps:

  • get the headers

  • get the data in separate columns(strings, numbers)

  • get the colours

  • combine to create the final DataFrame

headers:

# all the headers are on the first row
# retain the column positions,
# so we can link them back to the data and colours
headers = frame.loc[frame.row.eq(1), ["value", "column"]]
headers.columns = ["measure", "column"]
headers
measure column
0 Name 1
1 Age 2
2 Height 3

data:

# the string values have a data type of 's'
# and the row numbers are greater than 1
# we'll keep the row number so we can correctly
# match with the measurements(age, height)
booleans = frame.row.gt(1) & frame.data_type.eq("s")
students = frame.loc[booleans, ["value", "row"]]
students.columns = ["name", "row"]
students
name row
3 Matilda 2
6 Nicholas 3
9 Olivia 4
# the scores values have a data type of 'n'
# and the row numbers are greater than 1
# we'll keep the row and column columns
#  so we can correctly
# match with the student names
booleans = frame.row.gt(1) & frame.data_type.eq("n")
numbers = frame.loc[booleans, ["value", "row", "column"]]
numbers
value row column
4 1 2 2
5 2 2 3
7 3 3 2
8 4 3 3
10 5 4 2
11 6 4 3

colours:

fill_colour = frame.fill.str.get("fgColor").str.get("rgb")
# keep only rows for numbers:
fill_colour = fill_colour[booleans]
fill_colour
4     00000000
5     00000000
7     FFFFFF00
8     00000000
10    00000000
11    FF92D050
Name: fill, dtype: object

combine into a single DataFrame:

outcome = (
    numbers.assign(fill_colour=fill_colour)
    .merge(students, on="row")
    .merge(headers, on="column")
    .loc[:, ["name", "measure", "value", "fill_colour"]]
)

outcome
name measure value fill_colour
0 Matilda Age 1 00000000
1 Matilda Height 2 00000000
2 Nicholas Age 3 FFFFFF00
3 Nicholas Height 4 00000000
4 Olivia Age 5 00000000
5 Olivia Height 6 FF92D050

Case 3 : Layered Meaningful Formatting#

layered_meaningful_formatting.png

In this case, we have formats for fill and font.

The format applies to the entire row.

Steps:

  • get the headers

  • get the data in separate columns(kitchen_items, price, weight)

  • get the colours (fill and font)

  • combine to create the final DataFrame

frame = jn.xlsx_cells(filename, sheetnames="combined-highlights", fill=True, font=True)
frame = frame.astype({"row": np.int8, "column": np.int8})
frame
value internal_value coordinate row column data_type is_date number_format fill font
0 Name Name A1 1 1 s False General {'patternType': None, 'fgColor': {'rgb': '0000... {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1 Weight Weight B1 1 2 s False General {'patternType': None, 'fgColor': {'rgb': '0000... {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2 Price Price C1 1 3 s False General {'patternType': None, 'fgColor': {'rgb': '0000... {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
3 Knife Knife A2 2 1 s False General {'patternType': None, 'fgColor': {'rgb': '0000... {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
4 7 7 B2 2 2 n False General {'patternType': None, 'fgColor': {'rgb': '0000... {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
5 8 8 C2 2 3 n False General {'patternType': None, 'fgColor': {'rgb': '0000... {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
6 Fork Fork A3 3 1 s False General {'patternType': 'solid', 'fgColor': {'rgb': 'F... {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
7 5 5 B3 3 2 n False General {'patternType': 'solid', 'fgColor': {'rgb': 'F... {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
8 6 6 C3 3 3 n False General {'patternType': 'solid', 'fgColor': {'rgb': 'F... {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
9 Spoon Spoon A4 4 1 s False General {'patternType': None, 'fgColor': {'rgb': '0000... {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
10 3 3 B4 4 2 n False General {'patternType': None, 'fgColor': {'rgb': '0000... {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
11 4 4 C4 4 3 n False General {'patternType': None, 'fgColor': {'rgb': '0000... {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
12 Teaspoon Teaspoon A5 5 1 s False General {'patternType': 'solid', 'fgColor': {'rgb': 'F... {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
13 1 1 B5 5 2 n False General {'patternType': 'solid', 'fgColor': {'rgb': 'F... {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
14 2 2 C5 5 3 n False General {'patternType': 'solid', 'fgColor': {'rgb': 'F... {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...

headers:

# headers are on the first row
# we'll rely on the `column` column
# to align with the weight and price extracts
headers = frame.loc[frame.row.eq(1), ["value", "column"]]
headers
value column
0 Name 1
1 Weight 2
2 Price 3

data:

# kitchen_items are string datatype
# they are on rows > 1
booleans = frame.row.gt(1) & frame.data_type.eq("s")
kitchen_items = frame.loc[booleans, "value"].array
kitchen_items
<NumpyExtensionArray>
['Knife', 'Fork', 'Spoon', 'Teaspoon']
Length: 4, dtype: object
# price and weight are number datatypes
# they are on rows > 1
booleans = frame.row.gt(1) & frame.data_type.eq("n")
# retrieve column positions:
columns = frame.loc[booleans, "column"]
numbers = frame.loc[booleans, "value"].astype(np.int8).array
# we know there are only two number columns - weight and price
# we'll use that knowledge in reshaping:
numbers = numbers.reshape((-1, 2))
numbers
<NumpyExtensionArray>
[
[np.int8(7), np.int8(8)],
[np.int8(5), np.int8(6)],
[np.int8(3), np.int8(4)],
[np.int8(1), np.int8(2)]
]
Shape: (4, 2), dtype: int8

colors:

fill_colour = frame.fill.str.get("fgColor").str.get("rgb")
font_colour = frame.font.str.get("color").str.get("rgb")
# keep only rows aligned with the numbers:
fill_colour = fill_colour[booleans].array
font_colour = font_colour[booleans].array
# we know there are only two number columns
# this means we only need to jump two steps per colour:
fill_colour = fill_colour[::2]
font_colour = font_colour[::2]
fill_colour
<NumpyExtensionArray>
['00000000', 'FFFFFF00', '00000000', 'FFFFFF00']
Length: 4, dtype: object
font_colour
<NumpyExtensionArray>
['FF000000', None, 'FFFF0000', 'FFFF0000']
Length: 4, dtype: object

Combine into one DataFrame:

# filter headers for only rows
# that align with the weight and price extracts
headers = headers.loc[headers.column.isin(columns), "value"].array
outcome = pd.DataFrame(numbers, columns=headers)
outcome = outcome.assign(
    kitchen_item=kitchen_items, fill_colour=fill_colour, font_colour=font_colour
).loc[:, ["kitchen_item", "Price", "Weight", "fill_colour", "font_colour"]]
outcome
kitchen_item Price Weight fill_colour font_colour
0 Knife 8 7 00000000 FF000000
1 Fork 6 5 FFFFFF00 None
2 Spoon 4 3 00000000 FFFF0000
3 Teaspoon 2 1 FFFFFF00 FFFF0000

Note that an alternative would be to use the row and column positions to reshape the data - similar to the solution in case 2

Case 4 : Hierarchies in Formatting#

hierarchies_in_formatting.png

In the pic above, notice the differences in the formatting (bold, italic, bold and italic, none). In Duncan’s book, he offers a suggestion on what these different levels of hierarchy might mean :

  • none - good

  • italic - satisfactory

  • bold - poor

  • bold and italic - fail

    Again, the formatting is applied to the entire row.

frame = jn.xlsx_cells(filename, sheetnames="highlight-hierarchy", font=True)
frame = frame.astype({'row':np.int8, 'column':np.int8})
frame
value internal_value coordinate row column data_type is_date number_format font
0 Name Name A1 1 1 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1 Score Score B1 1 2 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2 Matilda Matilda A2 2 1 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
3 7 7 B2 2 2 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
4 Nicholas Nicholas A3 3 1 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
5 5 5 B3 3 2 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
6 Olivia Olivia A4 4 1 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
7 3 3 B4 4 2 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
8 Paul Paul A5 5 1 s False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
9 1 1 B5 5 2 n False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...

Steps:

  • get the data in separate columns(score, grade)

  • get the fonts (bold, italics)

  • combine to create the final DataFrame

data:

booleans = frame.row.gt(1) & frame.data_type.eq('s')
students = frame.loc[booleans, 'value'].array
students
<NumpyExtensionArray>
['Matilda', 'Nicholas', 'Olivia', 'Paul']
Length: 4, dtype: object
booleans = frame.row.gt(1) & frame.data_type.eq('n')
score = frame.loc[booleans, 'value'].array.astype(np.int8)
score
array([7, 5, 3, 1], dtype=int8)

fonts:

bold = frame.font.str.get("b")[booleans].to_numpy()
bold
array([False, False,  True,  True])
italics = frame.font.str.get("i")[booleans][booleans].to_numpy()
italics
array([False,  True, False,  True])
condlist = [~bold & ~italics, ~bold & italics, bold & ~italics]
choicelist = ['good','satisfactory','poor']
grade = np.select(condlist, choicelist, default='fail')
grade
array(['good', 'satisfactory', 'poor', 'fail'], dtype='<U12')

Combine into a single DataFrame:

outcome = {'name':students, 'bold':bold,'italics':italics,'score':score,'grade':grade}
outcome = pd.DataFrame(outcome)
outcome
name bold italics score grade
0 Matilda False False 7 good
1 Nicholas False True 5 satisfactory
2 Olivia True False 3 poor
3 Paul True True 1 fail

Comments#