{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# SpreadSheet Munging Strategies in Python - Pivot Tables - Simple Unpivoting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# __Pivot Tables - Simple Unpivoting__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*updated : September 25, 2024*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is part of a series of blog posts about extracting data from spreadsheets using Python. It is based on the [book](https://nacnudus.github.io/spreadsheet-munging-strategies/index.html) written by [Duncan Garmonsway](https://twitter.com/nacnudus?lang=en), which was written primarily for R users. Links to the other posts are on the sidebar.\n", "\n", "The key takeaway is this - you understand your data layout; use the tools to achieve your end goal. [xlsx_cells](https://pyjanitor-devs.github.io/pyjanitor/api/io/#janitor.io.xlsx_cells) offers a way to get the cells in a spreadsheet into individual rows, with some metadata. The final outcome however relies on your understanding of the data layout and its proper application." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pivot tables offer a summarized version of the data, and are usually succinct and visually appealing. However, for analysis, it is not so good. How do we convert pivot tables to tidy data? This section covers that. We'll see various kinds of pivot tables, and work through each of them." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# pip install pyjanitor\n", "import pandas as pd\n", "import janitor as jn\n", "import sys\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pandas version: 2.2.2\n", "janitor version: 0.29.1\n", "python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]\n", "numpy version: 2.0.2\n" ] } ], "source": [ "print(\"pandas version: \", pd.__version__)\n", "print(\"janitor version: \", jn.__version__)\n", "print(\"python version: \", sys.version)\n", "print(\"numpy version: \", np.__version__)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "excel_file = pd.ExcelFile(\"Data_files/worked-examples.xlsx\", engine=\"openpyxl\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Case 1: Multiple Row and Column headers__\n", "![pivot-annotations.png](Images/pivot-annotations.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![pivot-annotations-headerrowandcolumn.png](Images/formattedrowsandcolumns.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observations**:\n", "1. There are two header rows - gender and names.\n", "2. There are also two header columns on the left - Fields(Humanities, Performance) and Subjects(Classics, History, ...)\n", "3. For this case we wont be concerned with the colour formats. We'll look at that in a later section." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FemaleMale
MatildaOliviaNicholasPaul
HumanitiesClassics1230
History3451
PerformanceMusic5692
Drama78123
\n", "
" ], "text/plain": [ " Female Male \n", " Matilda Olivia Nicholas Paul\n", "Humanities Classics 1 2 3 0\n", " History 3 4 5 1\n", "Performance Music 5 6 9 2\n", " Drama 7 8 12 3" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = excel_file.parse(\n", " \"pivot-annotations\", header=[1, 2], index_col=[0, 1, 2]\n", ").droplevel(axis=0, level=0)\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fieldsubjectgenderstudentscores
0HumanitiesClassicsFemaleMatilda1
1HumanitiesClassicsFemaleOlivia2
2HumanitiesClassicsMaleNicholas3
3HumanitiesClassicsMalePaul0
4HumanitiesHistoryFemaleMatilda3
5HumanitiesHistoryFemaleOlivia4
6HumanitiesHistoryMaleNicholas5
7HumanitiesHistoryMalePaul1
8PerformanceMusicFemaleMatilda5
9PerformanceMusicFemaleOlivia6
10PerformanceMusicMaleNicholas9
11PerformanceMusicMalePaul2
12PerformanceDramaFemaleMatilda7
13PerformanceDramaFemaleOlivia8
14PerformanceDramaMaleNicholas12
15PerformanceDramaMalePaul3
\n", "
" ], "text/plain": [ " field subject gender student scores\n", "0 Humanities Classics Female Matilda 1\n", "1 Humanities Classics Female Olivia 2\n", "2 Humanities Classics Male Nicholas 3\n", "3 Humanities Classics Male Paul 0\n", "4 Humanities History Female Matilda 3\n", "5 Humanities History Female Olivia 4\n", "6 Humanities History Male Nicholas 5\n", "7 Humanities History Male Paul 1\n", "8 Performance Music Female Matilda 5\n", "9 Performance Music Female Olivia 6\n", "10 Performance Music Male Nicholas 9\n", "11 Performance Music Male Paul 2\n", "12 Performance Drama Female Matilda 7\n", "13 Performance Drama Female Olivia 8\n", "14 Performance Drama Male Nicholas 12\n", "15 Performance Drama Male Paul 3" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index.names = [\"field\", \"subject\"]\n", "df.columns.names = [\"gender\", \"student\"]\n", "\n", "(\n", " df.stack([\"gender\", \"student\"], future_stack=True)\n", " .rename(\"scores\")\n", " .reset_index()\n", " .ffill()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Case 2 : Multiple rows or columns of headers, with meaningful formatting__\n", "![pivot-annotations.png](Images/pivot-annotations.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Same as above, except we have to capture the format details. To get the colour information, we'll make use of the [xlsx_cells](https://pyjanitor-devs.github.io/pyjanitor/api/io/#janitor.io.xlsx_cells) function from [pyjanitor](https://pyjanitor-devs.github.io/pyjanitor/) - under the hood, it uses [openpyxl](https://openpyxl.readthedocs.io/en/stable/index.html).\n", "\n", "**Observations**:\n", "1. The first row is the gender; the next row are the names.\n", "2. The first column are the fields; the next column are the subjects\n", "3. The scores are all integers.\n", "4. The fonts are attached only to the scores.\n", "\n", "\n", "Plan of action - get the respective fields and combine based on row/column positions:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valueinternal_valuecoordinaterowcolumndata_typeis_datenumber_formatfill
0FemaleFemaleD224sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
1MaleMaleF226sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
2MatildaMatildaD334sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
3OliviaOliviaE335sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
4NicholasNicholasF336sFalseGeneral{'patternType': None, 'fgColor': {'rgb': '0000...
\n", "
" ], "text/plain": [ " value internal_value coordinate row column data_type is_date \\\n", "0 Female Female D2 2 4 s False \n", "1 Male Male F2 2 6 s False \n", "2 Matilda Matilda D3 3 4 s False \n", "3 Olivia Olivia E3 3 5 s False \n", "4 Nicholas Nicholas F3 3 6 s False \n", "\n", " number_format fill \n", "0 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "1 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "2 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "3 General {'patternType': None, 'fgColor': {'rgb': '0000... \n", "4 General {'patternType': None, 'fgColor': {'rgb': '0000... " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = jn.xlsx_cells(\n", " excel_file, sheetnames=\"pivot-annotations\", include_blank_cells=False, fill=True\n", ").astype({\"row\": np.int8, \"column\": np.int8})\n", "frame.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the gender and names:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namecolumngender
0Matilda4Female
1Olivia5Female
2Nicholas6Male
3Paul7Male
\n", "
" ], "text/plain": [ " name column gender\n", "0 Matilda 4 Female\n", "1 Olivia 5 Female\n", "2 Nicholas 6 Male\n", "3 Paul 7 Male" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rows = frame.row.unique()[:2]\n", "gender = frame.loc[frame.row == rows[0], [\"value\", \"column\"]]\n", "gender = gender.rename(columns={\"value\": \"gender\"})\n", "names = frame.loc[frame.row == rows[1], [\"value\", \"column\"]]\n", "names = names.rename(columns={\"value\": \"name\"})\n", "gender_and_names = names.merge(gender, on=\"column\", how=\"left\").ffill()\n", "gender_and_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the field and subjects:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subjectrowfield
0Classics4Humanities
1History5Humanities
2Music6Performance
3Drama7Performance
\n", "
" ], "text/plain": [ " subject row field\n", "0 Classics 4 Humanities\n", "1 History 5 Humanities\n", "2 Music 6 Performance\n", "3 Drama 7 Performance" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns = np.unique(frame.column)[:2]\n", "field = frame.loc[frame.column == columns[0], [\"value\", \"row\"]]\n", "field = field.rename(columns={\"value\": \"field\"})\n", "subjects = frame.loc[frame.column == columns[1], [\"value\", \"row\"]]\n", "subjects = subjects.rename(columns={\"value\": \"subject\"})\n", "field_and_subjects = subjects.merge(field, on=\"row\", how=\"left\").ffill()\n", "field_and_subjects" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the scores and colours:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scorerowcolumnfont
814400000000
9245FFFFFF00
1034600000000
1104700000000
13354FFFFFF00
1445500000000
1555600000000
1615700000000
1956400000000
2066500000000
2196600000000
2226700000000
2477400000000
2587500000000
261276FFFFFF00
2737700000000
\n", "
" ], "text/plain": [ " score row column font\n", "8 1 4 4 00000000\n", "9 2 4 5 FFFFFF00\n", "10 3 4 6 00000000\n", "11 0 4 7 00000000\n", "13 3 5 4 FFFFFF00\n", "14 4 5 5 00000000\n", "15 5 5 6 00000000\n", "16 1 5 7 00000000\n", "19 5 6 4 00000000\n", "20 6 6 5 00000000\n", "21 9 6 6 00000000\n", "22 2 6 7 00000000\n", "24 7 7 4 00000000\n", "25 8 7 5 00000000\n", "26 12 7 6 FFFFFF00\n", "27 3 7 7 00000000" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# get the scores\n", "booleans = frame.data_type.eq(\"n\")\n", "data = frame.loc[booleans, [\"value\", \"row\", \"column\"]].rename(\n", " columns={\"value\": \"score\"}\n", ")\n", "# get the colours\n", "fonts = frame.fill.str.get(\"fgColor\").str.get(\"rgb\")\n", "fonts = fonts[booleans]\n", "data[\"font\"] = fonts\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combine into a single DataFrame:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenderfieldsubjectscorefont
0MatildaFemaleHumanitiesClassics100000000
1OliviaFemaleHumanitiesClassics2FFFFFF00
2NicholasMaleHumanitiesClassics300000000
3PaulMaleHumanitiesClassics000000000
4MatildaFemaleHumanitiesHistory3FFFFFF00
5OliviaFemaleHumanitiesHistory400000000
6NicholasMaleHumanitiesHistory500000000
7PaulMaleHumanitiesHistory100000000
8MatildaFemalePerformanceMusic500000000
9OliviaFemalePerformanceMusic600000000
10NicholasMalePerformanceMusic900000000
11PaulMalePerformanceMusic200000000
12MatildaFemalePerformanceDrama700000000
13OliviaFemalePerformanceDrama800000000
14NicholasMalePerformanceDrama12FFFFFF00
15PaulMalePerformanceDrama300000000
\n", "
" ], "text/plain": [ " name gender field subject score font\n", "0 Matilda Female Humanities Classics 1 00000000\n", "1 Olivia Female Humanities Classics 2 FFFFFF00\n", "2 Nicholas Male Humanities Classics 3 00000000\n", "3 Paul Male Humanities Classics 0 00000000\n", "4 Matilda Female Humanities History 3 FFFFFF00\n", "5 Olivia Female Humanities History 4 00000000\n", "6 Nicholas Male Humanities History 5 00000000\n", "7 Paul Male Humanities History 1 00000000\n", "8 Matilda Female Performance Music 5 00000000\n", "9 Olivia Female Performance Music 6 00000000\n", "10 Nicholas Male Performance Music 9 00000000\n", "11 Paul Male Performance Music 2 00000000\n", "12 Matilda Female Performance Drama 7 00000000\n", "13 Olivia Female Performance Drama 8 00000000\n", "14 Nicholas Male Performance Drama 12 FFFFFF00\n", "15 Paul Male Performance Drama 3 00000000" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " data.merge(field_and_subjects, on=\"row\", how=\"left\")\n", " .merge(gender_and_names, on=\"column\", how=\"left\")\n", " .loc[:, [\"name\", \"gender\", \"field\", \"subject\", \"score\", \"font\"]]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Case 3 : Mixed headers and notes in the same row/column, distinguished by formatting__\n", "![pivot-notes.png](Images/pivot-notes.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, the highlighted texts should be excluded, as they are just notes and not relevant. Same goes for the 0=absent in G2 cell, which is *italicized*. red is identifed as `FFFF0000`, while italics should return True if it exists.\n", "\n", "We'll use the same steps as in the last solution:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valueinternal_valuecoordinaterowcolumndata_typeis_datenumber_formatfont
0FemaleFemaleD224sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1MaleMaleF226sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
20 = absent0 = absentG227sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
3MatildaMatildaD334sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
4OliviaOliviaE335sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
5NicholasNicholasF336sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
6PaulPaulG337sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
7HumanitiesHumanitiesB442sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
8ClassicsClassicsC443sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
911D444nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1022E445nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1133F446nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1200G447nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
13Excl. project workExcl. project workB552sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
14HistoryHistoryC553sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1533D554nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1644E555nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1755F556nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1811G557nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
19PerformancePerformanceB662sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
20MusicMusicC663sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2155D664nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2266E665nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2399F666nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2422G667nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
25Incl. written examIncl. written examB772sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
26DramaDramaC773sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2777D774nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2888E775nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
291212F776nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
3033G777nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
\n", "
" ], "text/plain": [ " value internal_value coordinate row column data_type \\\n", "0 Female Female D2 2 4 s \n", "1 Male Male F2 2 6 s \n", "2 0 = absent 0 = absent G2 2 7 s \n", "3 Matilda Matilda D3 3 4 s \n", "4 Olivia Olivia E3 3 5 s \n", "5 Nicholas Nicholas F3 3 6 s \n", "6 Paul Paul G3 3 7 s \n", "7 Humanities Humanities B4 4 2 s \n", "8 Classics Classics C4 4 3 s \n", "9 1 1 D4 4 4 n \n", "10 2 2 E4 4 5 n \n", "11 3 3 F4 4 6 n \n", "12 0 0 G4 4 7 n \n", "13 Excl. project work Excl. project work B5 5 2 s \n", "14 History History C5 5 3 s \n", "15 3 3 D5 5 4 n \n", "16 4 4 E5 5 5 n \n", "17 5 5 F5 5 6 n \n", "18 1 1 G5 5 7 n \n", "19 Performance Performance B6 6 2 s \n", "20 Music Music C6 6 3 s \n", "21 5 5 D6 6 4 n \n", "22 6 6 E6 6 5 n \n", "23 9 9 F6 6 6 n \n", "24 2 2 G6 6 7 n \n", "25 Incl. written exam Incl. written exam B7 7 2 s \n", "26 Drama Drama C7 7 3 s \n", "27 7 7 D7 7 4 n \n", "28 8 8 E7 7 5 n \n", "29 12 12 F7 7 6 n \n", "30 3 3 G7 7 7 n \n", "\n", " is_date number_format font \n", "0 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "1 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "2 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "3 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "4 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "5 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "6 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "7 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "8 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "9 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "10 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "11 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "12 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "13 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "14 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "15 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "16 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "17 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "18 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "19 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "20 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "21 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "22 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "23 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "24 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "25 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "26 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "27 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "28 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "29 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "30 False General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = jn.xlsx_cells(\n", " excel_file, sheetnames=\"pivot-notes\", include_blank_cells=False, font=True\n", ")\n", "frame.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namecolumngender
0Matilda4Female
1Olivia5Female
2Nicholas6Male
3Paul7Male
\n", "
" ], "text/plain": [ " name column gender\n", "0 Matilda 4 Female\n", "1 Olivia 5 Female\n", "2 Nicholas 6 Male\n", "3 Paul 7 Male" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# exclude the italicised cell\n", "italics = frame.font.str.get(\"i\")\n", "rows = frame.row.unique()[:2]\n", "booleans = frame.row == rows[0]\n", "booleans &= ~italics\n", "gender = frame.loc[booleans, [\"value\", \"column\"]]\n", "gender = gender.rename(columns={\"value\": \"gender\"})\n", "names = frame.loc[frame.row == rows[1], [\"value\", \"column\"]]\n", "names = names.rename(columns={\"value\": \"name\"})\n", "gender_and_names = names.merge(gender, on=\"column\", how=\"left\").ffill()\n", "gender_and_names" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subjectrowfield
0Classics4Humanities
1History5Humanities
2Music6Performance
3Drama7Performance
\n", "
" ], "text/plain": [ " subject row field\n", "0 Classics 4 Humanities\n", "1 History 5 Humanities\n", "2 Music 6 Performance\n", "3 Drama 7 Performance" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# exclude the cells with red font -> FFFF0000\n", "red = frame.font.str.get(\"color\").str.get(\"rgb\") != \"FFFF0000\"\n", "columns = np.unique(frame.column)[:2]\n", "booleans = frame.column == columns[0]\n", "booleans &= red\n", "field = frame.loc[booleans, [\"value\", \"row\"]]\n", "field = field.rename(columns={\"value\": \"field\"})\n", "subjects = frame.loc[frame.column == columns[1], [\"value\", \"row\"]]\n", "subjects = subjects.rename(columns={\"value\": \"subject\"})\n", "field_and_subjects = subjects.merge(field, on=\"row\", how=\"left\").ffill()\n", "field_and_subjects" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scorerowcolumn
9144
10245
11346
12047
15354
16455
17556
18157
21564
22665
23966
24267
27774
28875
291276
30377
\n", "
" ], "text/plain": [ " score row column\n", "9 1 4 4\n", "10 2 4 5\n", "11 3 4 6\n", "12 0 4 7\n", "15 3 5 4\n", "16 4 5 5\n", "17 5 5 6\n", "18 1 5 7\n", "21 5 6 4\n", "22 6 6 5\n", "23 9 6 6\n", "24 2 6 7\n", "27 7 7 4\n", "28 8 7 5\n", "29 12 7 6\n", "30 3 7 7" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans = frame.data_type.eq(\"n\")\n", "data = frame.loc[booleans, [\"value\", \"row\", \"column\"]].rename(\n", " columns={\"value\": \"score\"}\n", ")\n", "data" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenderfieldsubjectscore
0MatildaFemaleHumanitiesClassics1
1OliviaFemaleHumanitiesClassics2
2NicholasMaleHumanitiesClassics3
3PaulMaleHumanitiesClassics0
4MatildaFemaleHumanitiesHistory3
5OliviaFemaleHumanitiesHistory4
6NicholasMaleHumanitiesHistory5
7PaulMaleHumanitiesHistory1
8MatildaFemalePerformanceMusic5
9OliviaFemalePerformanceMusic6
10NicholasMalePerformanceMusic9
11PaulMalePerformanceMusic2
12MatildaFemalePerformanceDrama7
13OliviaFemalePerformanceDrama8
14NicholasMalePerformanceDrama12
15PaulMalePerformanceDrama3
\n", "
" ], "text/plain": [ " name gender field subject score\n", "0 Matilda Female Humanities Classics 1\n", "1 Olivia Female Humanities Classics 2\n", "2 Nicholas Male Humanities Classics 3\n", "3 Paul Male Humanities Classics 0\n", "4 Matilda Female Humanities History 3\n", "5 Olivia Female Humanities History 4\n", "6 Nicholas Male Humanities History 5\n", "7 Paul Male Humanities History 1\n", "8 Matilda Female Performance Music 5\n", "9 Olivia Female Performance Music 6\n", "10 Nicholas Male Performance Music 9\n", "11 Paul Male Performance Music 2\n", "12 Matilda Female Performance Drama 7\n", "13 Olivia Female Performance Drama 8\n", "14 Nicholas Male Performance Drama 12\n", "15 Paul Male Performance Drama 3" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " data.merge(field_and_subjects, on=\"row\", how=\"left\")\n", " .merge(gender_and_names, on=\"column\", how=\"left\")\n", " .loc[:, [\"name\", \"gender\", \"field\", \"subject\", \"score\"]]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Case 4: Mixed levels of headers in the same row/column, distinguished by formatting__\n", "![pivot-hierarchy.png](Images/pivot-hierarchy.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The formatting here refers to the indent - notice how Humanities and Performance are closer to the border compared to the subjects. Same goes for Matilda and Nicholas. Getting the indent information would require the use of Openpyxl. We also know that the fields is just one row above the respective subjects. Let's use that knowlege in extracting our data:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valueinternal_valuecoordinaterowcolumndata_typeis_datenumber_formatalignment
0MatildaMatildaC223sFalseGeneral{'horizontal': 'general', 'vertical': 'bottom'...
1NicholasNicholasD224sFalseGeneral{'horizontal': 'general', 'vertical': 'bottom'...
2HumanitiesHumanitiesB332sFalseGeneral{'horizontal': 'general', 'vertical': 'bottom'...
3ClassicsClassicsB442sFalseGeneral{'horizontal': 'left', 'vertical': 'bottom', '...
411C443nFalseGeneral{'horizontal': 'general', 'vertical': 'bottom'...
533D444nFalseGeneral{'horizontal': 'general', 'vertical': 'bottom'...
6HistoryHistoryB552sFalseGeneral{'horizontal': 'left', 'vertical': 'bottom', '...
733C553nFalseGeneral{'horizontal': 'general', 'vertical': 'bottom'...
855D554nFalseGeneral{'horizontal': 'general', 'vertical': 'bottom'...
9PerformancePerformanceB662sFalseGeneral{'horizontal': 'general', 'vertical': 'bottom'...
10MusicMusicB772sFalseGeneral{'horizontal': 'left', 'vertical': 'bottom', '...
1155C773nFalseGeneral{'horizontal': 'general', 'vertical': 'bottom'...
1299D774nFalseGeneral{'horizontal': 'general', 'vertical': 'bottom'...
13DramaDramaB882sFalseGeneral{'horizontal': 'left', 'vertical': 'bottom', '...
1477C883nFalseGeneral{'horizontal': 'general', 'vertical': 'bottom'...
151212D884nFalseGeneral{'horizontal': 'general', 'vertical': 'bottom'...
\n", "
" ], "text/plain": [ " value internal_value coordinate row column data_type is_date \\\n", "0 Matilda Matilda C2 2 3 s False \n", "1 Nicholas Nicholas D2 2 4 s False \n", "2 Humanities Humanities B3 3 2 s False \n", "3 Classics Classics B4 4 2 s False \n", "4 1 1 C4 4 3 n False \n", "5 3 3 D4 4 4 n False \n", "6 History History B5 5 2 s False \n", "7 3 3 C5 5 3 n False \n", "8 5 5 D5 5 4 n False \n", "9 Performance Performance B6 6 2 s False \n", "10 Music Music B7 7 2 s False \n", "11 5 5 C7 7 3 n False \n", "12 9 9 D7 7 4 n False \n", "13 Drama Drama B8 8 2 s False \n", "14 7 7 C8 8 3 n False \n", "15 12 12 D8 8 4 n False \n", "\n", " number_format alignment \n", "0 General {'horizontal': 'general', 'vertical': 'bottom'... \n", "1 General {'horizontal': 'general', 'vertical': 'bottom'... \n", "2 General {'horizontal': 'general', 'vertical': 'bottom'... \n", "3 General {'horizontal': 'left', 'vertical': 'bottom', '... \n", "4 General {'horizontal': 'general', 'vertical': 'bottom'... \n", "5 General {'horizontal': 'general', 'vertical': 'bottom'... \n", "6 General {'horizontal': 'left', 'vertical': 'bottom', '... \n", "7 General {'horizontal': 'general', 'vertical': 'bottom'... \n", "8 General {'horizontal': 'general', 'vertical': 'bottom'... \n", "9 General {'horizontal': 'general', 'vertical': 'bottom'... \n", "10 General {'horizontal': 'left', 'vertical': 'bottom', '... \n", "11 General {'horizontal': 'general', 'vertical': 'bottom'... \n", "12 General {'horizontal': 'general', 'vertical': 'bottom'... \n", "13 General {'horizontal': 'left', 'vertical': 'bottom', '... \n", "14 General {'horizontal': 'general', 'vertical': 'bottom'... \n", "15 General {'horizontal': 'general', 'vertical': 'bottom'... " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = jn.xlsx_cells(\n", " excel_file, sheetnames=\"pivot-hierarchy\", include_blank_cells=False, alignment=True\n", ").astype({\"row\": np.int8, \"column\": np.int8})\n", "frame.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the field and subjects:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subjectrowfield
0Classics4Humanities
1History5Humanities
2Music7Performance
3Drama8Performance
\n", "
" ], "text/plain": [ " subject row field\n", "0 Classics 4 Humanities\n", "1 History 5 Humanities\n", "2 Music 7 Performance\n", "3 Drama 8 Performance" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# fields and subjects are in the first column:\n", "booleans = frame.alignment.str.get(\"indent\").eq(0) & frame.column.eq(frame.column.min())\n", "field = frame.loc[booleans, [\"value\", \"row\"]].assign(row=lambda f: f.row + 1)\n", "field = field.rename(columns={\"value\": \"field\"})\n", "booleans = frame.alignment.str.get(\"indent\").gt(0) & frame.column.eq(frame.column.min())\n", "subject = frame.loc[booleans, [\"value\", \"row\"]]\n", "subject = subject.rename(columns={\"value\": \"subject\"})\n", "field_and_subjects = subject.merge(field, on=\"row\", how=\"left\").ffill()\n", "field_and_subjects" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the names:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namecolumn
0Matilda3
1Nicholas4
\n", "
" ], "text/plain": [ " name column\n", "0 Matilda 3\n", "1 Nicholas 4" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# names are the very first row:\n", "names = frame.loc[frame.row == frame.row.min(), [\"value\", \"column\"]]\n", "names = names.rename(columns={\"value\": \"name\"})\n", "names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the scores" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scorerowcolumn
4143
5344
7353
8554
11573
12974
14783
151284
\n", "
" ], "text/plain": [ " score row column\n", "4 1 4 3\n", "5 3 4 4\n", "7 3 5 3\n", "8 5 5 4\n", "11 5 7 3\n", "12 9 7 4\n", "14 7 8 3\n", "15 12 8 4" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores = frame.loc[frame.data_type == \"n\", [\"value\", \"row\", \"column\"]]\n", "scores = scores.rename(columns={\"value\": \"score\"})\n", "scores" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namefieldsubjectscore
0MatildaHumanitiesClassics1
1NicholasHumanitiesClassics3
2MatildaHumanitiesHistory3
3NicholasHumanitiesHistory5
4MatildaPerformanceMusic5
5NicholasPerformanceMusic9
6MatildaPerformanceDrama7
7NicholasPerformanceDrama12
\n", "
" ], "text/plain": [ " name field subject score\n", "0 Matilda Humanities Classics 1\n", "1 Nicholas Humanities Classics 3\n", "2 Matilda Humanities History 3\n", "3 Nicholas Humanities History 5\n", "4 Matilda Performance Music 5\n", "5 Nicholas Performance Music 9\n", "6 Matilda Performance Drama 7\n", "7 Nicholas Performance Drama 12" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " scores.merge(names, on=\"column\", how=\"left\")\n", " .merge(field_and_subjects, on=\"row\", how=\"left\")\n", " .loc[:, [\"name\", \"field\", \"subject\", \"score\"]]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another route, without [xlsx_cells](https://pyjanitor-devs.github.io/pyjanitor/api/io/#janitor.io.xlsx_cells):" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fieldsubjectnamescore
0HumanitiesClassicsMatilda1
1HumanitiesHistoryMatilda3
2PerformanceMusicMatilda5
3PerformanceDramaMatilda7
4HumanitiesClassicsNicholas3
5HumanitiesHistoryNicholas5
6PerformanceMusicNicholas9
7PerformanceDramaNicholas12
\n", "
" ], "text/plain": [ " field subject name score\n", "0 Humanities Classics Matilda 1\n", "1 Humanities History Matilda 3\n", "2 Performance Music Matilda 5\n", "3 Performance Drama Matilda 7\n", "4 Humanities Classics Nicholas 3\n", "5 Humanities History Nicholas 5\n", "6 Performance Music Nicholas 9\n", "7 Performance Drama Nicholas 12" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " excel_file.parse(sheet_name=\"pivot-hierarchy\", header=[0, 1])\n", " .droplevel(axis=1, level=0)\n", " .remove_empty()\n", " .set_axis([\"subject\", \"Matilda\", \"Nicholas\"], axis=\"columns\")\n", " .assign(field=lambda f: f.subject.where(f.Matilda.isna()).ffill())\n", " .dropna()\n", " .melt(id_vars=[\"field\", \"subject\"], var_name=\"name\", value_name=\"score\")\n", " .astype({\"score\": np.int8})\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comments\n", "" ] } ], "metadata": { "kernelspec": { "display_name": "blogger", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 4 }