{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# SpreadSheet Munging Strategies in Python - Meaningful Formats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## __Meaningful Formats__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*updated : September 24, 2024*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is part of a series of blog posts about extracting data from spreadsheets using Python. It is based on the [book](https://nacnudus.github.io/spreadsheet-munging-strategies/index.html) written by [Duncan Garmonsway](https://twitter.com/nacnudus?lang=en), which was written primarily for R users. Links to the other posts are on the sidebar.\n", "\n", "The key takeaway is this - you understand your data layout; use the tools to achieve your end goal. [xlsx_cells](https://pyjanitor-devs.github.io/pyjanitor/api/io/#janitor.io.xlsx_cells) offers a way to get the cells in a spreadsheet into individual rows, with some metadata. The final outcome however relies on your understanding of the data layout and its proper application." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an excerpt from [Duncan's book](https://nacnudus.github.io/spreadsheet-munging-strategies/tidy-formatted-rows.html) regarding meaningful formats :\n", ">Sometimes whole rows in a table are highlighted by formatting them with, say, a bright yellow fill. The highlighting could mean “this observation should be ignored”, or “this product is no longer available”. Different colours could mean different levels of a hierarchy, e.g. green for “pass” and red for “fail”.\n", "\n", "The examples below highlight various ways of dealing with meaningfully formatted spreadsheets. Let's dive in." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Case 1 : Meaningfully Formatted Rows**\n", "![meaningfully_formatted_rows.png](Images/meaningfully_formatted_rows.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal is to capture the color as part of our final output. We'll make use of the [xlsx_cells](https://pyjanitor-devs.github.io/pyjanitor/api/io/#janitor.io.xlsx_cells) function from [pyjanitor](https://pyjanitor-devs.github.io/pyjanitor/) - under the hood, it uses [openpyxl](https://openpyxl.readthedocs.io/en/stable/index.html):" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "# pip install pyjanitor\n", "import pandas as pd\n", "import janitor as jn\n", "import sys\n", "import numpy as np\n", "\n", "filename = \"Data_files/worked-examples.xlsx\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0.29.1'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('pandas version: ', pd.__version__)\n", "print('janitor version: ', jn.__version__)\n", "print('python version: ', sys.version)\n", "print('numpy version: ', np.__version__)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valueinternal_valuecoordinaterowcolumndata_typeis_datenumber_formatfill
0AgeAgeA111sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
1HeightHeightB112sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
211A221nFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
322B222nFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
433A331nFalseGeneral{'patternType': 'solid', 'fgColor': {'rgb': 'F...
544B332nFalseGeneral{'patternType': 'solid', 'fgColor': {'rgb': 'F...
655A441nFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
766B442nFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
\n", "
" ], "text/plain": [ " value internal_value coordinate row column data_type is_date \\\n", "0 Age Age A1 1 1 s False \n", "1 Height Height B1 1 2 s False \n", "2 1 1 A2 2 1 n False \n", "3 2 2 B2 2 2 n False \n", "4 3 3 A3 3 1 n False \n", "5 4 4 B3 3 2 n False \n", "6 5 5 A4 4 1 n False \n", "7 6 6 B4 4 2 n False \n", "\n", " number_format fill \n", "0 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "1 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "2 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "3 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "4 General {'patternType': 'solid', 'fgColor': {'rgb': 'F... \n", "5 General {'patternType': 'solid', 'fgColor': {'rgb': 'F... \n", "6 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "7 General {'patternType': None, 'fgColor': {'rgb': '0000... " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = jn.xlsx_cells(filename, sheetnames=\"highlights\", fill=True)\n", "frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can view a single row in the fill column:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'patternType': None,\n", " 'fgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0},\n", " 'bgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.fill[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Steps:\n", "- get the headers\n", "- get the data(numbers)\n", "- get the colours\n", "- combine to create the final DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "headers:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "['Age', 'Height']\n", "Length: 2, dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "headers = frame.loc[frame.row == \"1\", \"value\"].array\n", "headers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "data:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1, 2],\n", " [3, 4],\n", " [5, 6]], dtype=int8)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans = frame.data_type == \"n\"\n", "data = frame.loc[booleans, \"value\"].array\n", "# There are only two columns in the spreadsheet;\n", "# we'll use that knowledge in reshaping the integer values:\n", "data = data.astype(np.int8).reshape((-1, 2))\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "colours:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "['00000000', 'FFFFFF00', '00000000']\n", "Length: 3, dtype: object" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_colour = frame.fill.str.get(\"fgColor\").str.get(\"rgb\").array\n", "# keep only rows related to the integers:\n", "fill_colour = fill_colour[booleans]\n", "# the number of columns is 2,\n", "# so we only need to jump 2 steps\n", "# to get the corresponding colour per row\n", "fill_colour = fill_colour[::2]\n", "fill_colour" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeHeightfill_colour
01200000000
134FFFFFF00
25600000000
\n", "
" ], "text/plain": [ " Age Height fill_colour\n", "0 1 2 00000000\n", "1 3 4 FFFFFF00\n", "2 5 6 00000000" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outcome = pd.DataFrame(data, columns=headers).assign(fill_colour=fill_colour)\n", "outcome" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(case-2)=\n", "### **Case 2 : Meaningfully Formatted Cells**\n", "![meaningfully_formatted_cells.png](Images/meaningfully_formatted_cells.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the pic above, the table has different colors for different cells. [xlsx_cells](https://pyjanitor-devs.github.io/pyjanitor/api/io/#janitor.io.xlsx_cells) comes in handy again, and the focus here, just as in Case 1, is the cell's fill attribute." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valueinternal_valuecoordinaterowcolumndata_typeis_datenumber_formatfill
0NameNameA111sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
1AgeAgeB112sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
2HeightHeightC113sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
3MatildaMatildaA221sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
411B222nFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
522C223nFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
6NicholasNicholasA331sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
733B332nFalseGeneral{'patternType': 'solid', 'fgColor': {'rgb': 'F...
844C333nFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
9OliviaOliviaA441sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
1055B442nFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
1166C443nFalseGeneral{'patternType': 'solid', 'fgColor': {'rgb': 'F...
\n", "
" ], "text/plain": [ " value internal_value coordinate row column data_type is_date \\\n", "0 Name Name A1 1 1 s False \n", "1 Age Age B1 1 2 s False \n", "2 Height Height C1 1 3 s False \n", "3 Matilda Matilda A2 2 1 s False \n", "4 1 1 B2 2 2 n False \n", "5 2 2 C2 2 3 n False \n", "6 Nicholas Nicholas A3 3 1 s False \n", "7 3 3 B3 3 2 n False \n", "8 4 4 C3 3 3 n False \n", "9 Olivia Olivia A4 4 1 s False \n", "10 5 5 B4 4 2 n False \n", "11 6 6 C4 4 3 n False \n", "\n", " number_format fill \n", "0 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "1 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "2 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "3 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "4 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "5 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "6 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "7 General {'patternType': 'solid', 'fgColor': {'rgb': 'F... \n", "8 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "9 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "10 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "11 General {'patternType': 'solid', 'fgColor': {'rgb': 'F... " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = jn.xlsx_cells(filename, sheetnames=\"annotations\", fill=True).astype(\n", " {\"row\": np.int8, \"column\": np.int8}\n", ")\n", "frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Steps:\n", "- get the headers\n", "- get the data in separate columns(strings, numbers)\n", "- get the colours\n", "- combine to create the final DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "headers:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurecolumn
0Name1
1Age2
2Height3
\n", "
" ], "text/plain": [ " measure column\n", "0 Name 1\n", "1 Age 2\n", "2 Height 3" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# all the headers are on the first row\n", "# retain the column positions,\n", "# so we can link them back to the data and colours\n", "headers = frame.loc[frame.row.eq(1), [\"value\", \"column\"]]\n", "headers.columns = [\"measure\", \"column\"]\n", "headers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "data:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namerow
3Matilda2
6Nicholas3
9Olivia4
\n", "
" ], "text/plain": [ " name row\n", "3 Matilda 2\n", "6 Nicholas 3\n", "9 Olivia 4" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the string values have a data type of 's'\n", "# and the row numbers are greater than 1\n", "# we'll keep the row number so we can correctly\n", "# match with the measurements(age, height)\n", "booleans = frame.row.gt(1) & frame.data_type.eq(\"s\")\n", "students = frame.loc[booleans, [\"value\", \"row\"]]\n", "students.columns = [\"name\", \"row\"]\n", "students" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valuerowcolumn
4122
5223
7332
8433
10542
11643
\n", "
" ], "text/plain": [ " value row column\n", "4 1 2 2\n", "5 2 2 3\n", "7 3 3 2\n", "8 4 3 3\n", "10 5 4 2\n", "11 6 4 3" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the scores values have a data type of 'n'\n", "# and the row numbers are greater than 1\n", "# we'll keep the row and column columns\n", "# so we can correctly\n", "# match with the student names\n", "booleans = frame.row.gt(1) & frame.data_type.eq(\"n\")\n", "numbers = frame.loc[booleans, [\"value\", \"row\", \"column\"]]\n", "numbers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "colours:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4 00000000\n", "5 00000000\n", "7 FFFFFF00\n", "8 00000000\n", "10 00000000\n", "11 FF92D050\n", "Name: fill, dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_colour = frame.fill.str.get(\"fgColor\").str.get(\"rgb\")\n", "# keep only rows for numbers:\n", "fill_colour = fill_colour[booleans]\n", "fill_colour" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "combine into a single DataFrame:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namemeasurevaluefill_colour
0MatildaAge100000000
1MatildaHeight200000000
2NicholasAge3FFFFFF00
3NicholasHeight400000000
4OliviaAge500000000
5OliviaHeight6FF92D050
\n", "
" ], "text/plain": [ " name measure value fill_colour\n", "0 Matilda Age 1 00000000\n", "1 Matilda Height 2 00000000\n", "2 Nicholas Age 3 FFFFFF00\n", "3 Nicholas Height 4 00000000\n", "4 Olivia Age 5 00000000\n", "5 Olivia Height 6 FF92D050" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outcome = (\n", " numbers.assign(fill_colour=fill_colour)\n", " .merge(students, on=\"row\")\n", " .merge(headers, on=\"column\")\n", " .loc[:, [\"name\", \"measure\", \"value\", \"fill_colour\"]]\n", ")\n", "\n", "outcome" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Case 3 : Layered Meaningful Formatting**\n", "![layered_meaningful_formatting.png](Images/layered_meaningful_formatting.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, we have formats for fill and font. \n", "\n", "The format applies to the entire row." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Steps:\n", "- get the headers\n", "- get the data in separate columns(kitchen_items, price, weight)\n", "- get the colours (fill and font)\n", "- combine to create the final DataFrame" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valueinternal_valuecoordinaterowcolumndata_typeis_datenumber_formatfillfont
0NameNameA111sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1WeightWeightB112sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2PricePriceC113sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
3KnifeKnifeA221sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
477B222nFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
588C223nFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
6ForkForkA331sFalseGeneral{'patternType': 'solid', 'fgColor': {'rgb': 'F...{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
755B332nFalseGeneral{'patternType': 'solid', 'fgColor': {'rgb': 'F...{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
866C333nFalseGeneral{'patternType': 'solid', 'fgColor': {'rgb': 'F...{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
9SpoonSpoonA441sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1033B442nFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1144C443nFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
12TeaspoonTeaspoonA551sFalseGeneral{'patternType': 'solid', 'fgColor': {'rgb': 'F...{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1311B552nFalseGeneral{'patternType': 'solid', 'fgColor': {'rgb': 'F...{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1422C553nFalseGeneral{'patternType': 'solid', 'fgColor': {'rgb': 'F...{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
\n", "
" ], "text/plain": [ " value internal_value coordinate row column data_type is_date \\\n", "0 Name Name A1 1 1 s False \n", "1 Weight Weight B1 1 2 s False \n", "2 Price Price C1 1 3 s False \n", "3 Knife Knife A2 2 1 s False \n", "4 7 7 B2 2 2 n False \n", "5 8 8 C2 2 3 n False \n", "6 Fork Fork A3 3 1 s False \n", "7 5 5 B3 3 2 n False \n", "8 6 6 C3 3 3 n False \n", "9 Spoon Spoon A4 4 1 s False \n", "10 3 3 B4 4 2 n False \n", "11 4 4 C4 4 3 n False \n", "12 Teaspoon Teaspoon A5 5 1 s False \n", "13 1 1 B5 5 2 n False \n", "14 2 2 C5 5 3 n False \n", "\n", " number_format fill \\\n", "0 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "1 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "2 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "3 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "4 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "5 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "6 General {'patternType': 'solid', 'fgColor': {'rgb': 'F... \n", "7 General {'patternType': 'solid', 'fgColor': {'rgb': 'F... \n", "8 General {'patternType': 'solid', 'fgColor': {'rgb': 'F... \n", "9 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "10 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "11 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "12 General {'patternType': 'solid', 'fgColor': {'rgb': 'F... \n", "13 General {'patternType': 'solid', 'fgColor': {'rgb': 'F... \n", "14 General {'patternType': 'solid', 'fgColor': {'rgb': 'F... \n", "\n", " font \n", "0 {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "1 {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "2 {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "3 {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "4 {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "5 {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "6 {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "7 {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "8 {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "9 {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "10 {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "11 {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "12 {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "13 {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "14 {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = jn.xlsx_cells(filename, sheetnames=\"combined-highlights\", fill=True, font=True)\n", "frame = frame.astype({\"row\": np.int8, \"column\": np.int8})\n", "frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "headers:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valuecolumn
0Name1
1Weight2
2Price3
\n", "
" ], "text/plain": [ " value column\n", "0 Name 1\n", "1 Weight 2\n", "2 Price 3" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# headers are on the first row\n", "# we'll rely on the `column` column\n", "# to align with the weight and price extracts\n", "headers = frame.loc[frame.row.eq(1), [\"value\", \"column\"]]\n", "headers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "data:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "['Knife', 'Fork', 'Spoon', 'Teaspoon']\n", "Length: 4, dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# kitchen_items are string datatype\n", "# they are on rows > 1\n", "booleans = frame.row.gt(1) & frame.data_type.eq(\"s\")\n", "kitchen_items = frame.loc[booleans, \"value\"].array\n", "kitchen_items" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "[\n", "[np.int8(7), np.int8(8)],\n", "[np.int8(5), np.int8(6)],\n", "[np.int8(3), np.int8(4)],\n", "[np.int8(1), np.int8(2)]\n", "]\n", "Shape: (4, 2), dtype: int8" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# price and weight are number datatypes\n", "# they are on rows > 1\n", "booleans = frame.row.gt(1) & frame.data_type.eq(\"n\")\n", "# retrieve column positions:\n", "columns = frame.loc[booleans, \"column\"]\n", "numbers = frame.loc[booleans, \"value\"].astype(np.int8).array\n", "# we know there are only two number columns - weight and price\n", "# we'll use that knowledge in reshaping:\n", "numbers = numbers.reshape((-1, 2))\n", "numbers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "colors:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "fill_colour = frame.fill.str.get(\"fgColor\").str.get(\"rgb\")\n", "font_colour = frame.font.str.get(\"color\").str.get(\"rgb\")\n", "# keep only rows aligned with the numbers:\n", "fill_colour = fill_colour[booleans].array\n", "font_colour = font_colour[booleans].array\n", "# we know there are only two number columns\n", "# this means we only need to jump two steps per colour:\n", "fill_colour = fill_colour[::2]\n", "font_colour = font_colour[::2]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "['00000000', 'FFFFFF00', '00000000', 'FFFFFF00']\n", "Length: 4, dtype: object" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_colour" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "['FF000000', None, 'FFFF0000', 'FFFF0000']\n", "Length: 4, dtype: object" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "font_colour" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combine into one DataFrame:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
kitchen_itemPriceWeightfill_colourfont_colour
0Knife8700000000FF000000
1Fork65FFFFFF00None
2Spoon4300000000FFFF0000
3Teaspoon21FFFFFF00FFFF0000
\n", "
" ], "text/plain": [ " kitchen_item Price Weight fill_colour font_colour\n", "0 Knife 8 7 00000000 FF000000\n", "1 Fork 6 5 FFFFFF00 None\n", "2 Spoon 4 3 00000000 FFFF0000\n", "3 Teaspoon 2 1 FFFFFF00 FFFF0000" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# filter headers for only rows\n", "# that align with the weight and price extracts\n", "headers = headers.loc[headers.column.isin(columns), \"value\"].array\n", "outcome = pd.DataFrame(numbers, columns=headers)\n", "outcome = outcome.assign(\n", " kitchen_item=kitchen_items, fill_colour=fill_colour, font_colour=font_colour\n", ").loc[:, [\"kitchen_item\", \"Price\", \"Weight\", \"fill_colour\", \"font_colour\"]]\n", "outcome" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that an alternative would be to use the `row` and `column` positions to reshape the data - similar to the solution in [case 2](case-2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Case 4 : Hierarchies in Formatting**\n", "![hierarchies_in_formatting.png](Images/hierarchies_in_formatting.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the pic above, notice the differences in the formatting (bold, italic, bold and italic, none). In Duncan's book, he offers a suggestion on what these different levels of hierarchy might mean :
\n", "- none - good\n", "- italic - satisfactory\n", "- bold - poor\n", "- bold and italic - fail\n", "

\n", "Again, the formatting is applied to the entire row." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valueinternal_valuecoordinaterowcolumndata_typeis_datenumber_formatfont
0NameNameA111sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1ScoreScoreB112sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2MatildaMatildaA221sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
377B222nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
4NicholasNicholasA331sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
555B332nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
6OliviaOliviaA441sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
733B442nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
8PaulPaulA551sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
911B552nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
\n", "
" ], "text/plain": [ " value internal_value coordinate row column data_type is_date \\\n", "0 Name Name A1 1 1 s False \n", "1 Score Score B1 1 2 s False \n", "2 Matilda Matilda A2 2 1 s False \n", "3 7 7 B2 2 2 n False \n", "4 Nicholas Nicholas A3 3 1 s False \n", "5 5 5 B3 3 2 n False \n", "6 Olivia Olivia A4 4 1 s False \n", "7 3 3 B4 4 2 n False \n", "8 Paul Paul A5 5 1 s False \n", "9 1 1 B5 5 2 n False \n", "\n", " number_format font \n", "0 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "1 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "2 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "3 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "4 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "5 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "6 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "7 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "8 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "9 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = jn.xlsx_cells(filename, sheetnames=\"highlight-hierarchy\", font=True)\n", "frame = frame.astype({'row':np.int8, 'column':np.int8})\n", "frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Steps:\n", "- get the data in separate columns(score, grade)\n", "- get the fonts (bold, italics)\n", "- combine to create the final DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "data:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "['Matilda', 'Nicholas', 'Olivia', 'Paul']\n", "Length: 4, dtype: object" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans = frame.row.gt(1) & frame.data_type.eq('s')\n", "students = frame.loc[booleans, 'value'].array\n", "students" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([7, 5, 3, 1], dtype=int8)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans = frame.row.gt(1) & frame.data_type.eq('n')\n", "score = frame.loc[booleans, 'value'].array.astype(np.int8)\n", "score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "fonts:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([False, False, True, True])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bold = frame.font.str.get(\"b\")[booleans].to_numpy()\n", "bold" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([False, True, False, True])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "italics = frame.font.str.get(\"i\")[booleans][booleans].to_numpy()\n", "italics" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['good', 'satisfactory', 'poor', 'fail'], dtype='\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namebolditalicsscoregrade
0MatildaFalseFalse7good
1NicholasFalseTrue5satisfactory
2OliviaTrueFalse3poor
3PaulTrueTrue1fail
\n", "" ], "text/plain": [ " name bold italics score grade\n", "0 Matilda False False 7 good\n", "1 Nicholas False True 5 satisfactory\n", "2 Olivia True False 3 poor\n", "3 Paul True True 1 fail" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outcome = {'name':students, 'bold':bold,'italics':italics,'score':score,'grade':grade}\n", "outcome = pd.DataFrame(outcome)\n", "outcome" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comments\n", "" ] } ], "metadata": { "jupytext": { "formats": "ipynb,Rmd" }, "kernelspec": { "display_name": "blogger", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 4 }