{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# SpreadSheet Munging Strategies in Python - Small Multiples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Small Multiples**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*updated : September 25, 2024*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is part of a series of blog posts about extracting data from spreadsheets using Python. It is based on the [book](https://nacnudus.github.io/spreadsheet-munging-strategies/index.html) written by [Duncan Garmonsway](https://twitter.com/nacnudus?lang=en), which was written primarily for R users. Links to the other posts are on the sidebar.\n", "\n", "The key takeaway is this - you understand your data layout; use the tools to achieve your end goal. [xlsx_cells](https://pyjanitor-devs.github.io/pyjanitor/api/io/#janitor.io.xlsx_cells) offers a way to get the cells in a spreadsheet into individual rows, with some metadata. The final outcome however relies on your understanding of the data layout and its proper application." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Small multiples refer to mini tables embedded in a spreadsheet, or multiple spreadsheets. Ideally, this tables should be lumped into one dataframe for meaningful analysis. The examples below show different scenarios and how we can reshape the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Case 1 : Small Multiples with all Headers Present for Each Multiple__\n", "\n", "![small-multiples.png](Images/small-multiples.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this spreadsheet, each table is a separate subject. It would be better to aggregate all the subjects and underlying data into one table." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# pip install pyjanitor\n", "import pandas as pd\n", "import janitor as jn\n", "import sys\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pandas version: 2.2.2\n", "janitor version: 0.29.1\n", "python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]\n", "numpy version: 2.0.2\n" ] } ], "source": [ "print('pandas version: ', pd.__version__)\n", "print('janitor version: ', jn.__version__)\n", "print('python version: ', sys.version)\n", "print('numpy version: ', np.__version__)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "excel_file = pd.ExcelFile(\"Data_files/worked-examples.xlsx\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valueinternal_valuecoordinaterowcolumndata_typeis_datenumber_formatfont
0ClassicsClassicsA111sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1HistoryHistoryE115sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2NameNameA221sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
3ScoreScoreB222sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
4GradeGradeC223sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
5NameNameE225sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
6ScoreScoreF226sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
7GradeGradeG227sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
8MatildaMatildaA331sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
911B332nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
10FFC333sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
11MatildaMatildaE335sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1233F336nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
13DDG337sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
14OliviaOliviaA441sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1522B442nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
16DDC443sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
17OliviaOliviaE445sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1844F446nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
19CCG447sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
20MusicMusicA661sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
21DramaDramaE665sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
22NameNameA771sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
23ScoreScoreB772sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
24GradeGradeC773sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
25NameNameE775sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
26ScoreScoreF776sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
27GradeGradeG777sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
28MatildaMatildaA881sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2955B882nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
30BBC883sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
31MatildaMatildaE885sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
3277F886nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
33AAG887sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
34OliviaOliviaA991sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
3566B992nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
36BBC993sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
37OliviaOliviaE995sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
3888F996nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
39AAG997sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
\n", "
" ], "text/plain": [ " value internal_value coordinate row column data_type is_date \\\n", "0 Classics Classics A1 1 1 s False \n", "1 History History E1 1 5 s False \n", "2 Name Name A2 2 1 s False \n", "3 Score Score B2 2 2 s False \n", "4 Grade Grade C2 2 3 s False \n", "5 Name Name E2 2 5 s False \n", "6 Score Score F2 2 6 s False \n", "7 Grade Grade G2 2 7 s False \n", "8 Matilda Matilda A3 3 1 s False \n", "9 1 1 B3 3 2 n False \n", "10 F F C3 3 3 s False \n", "11 Matilda Matilda E3 3 5 s False \n", "12 3 3 F3 3 6 n False \n", "13 D D G3 3 7 s False \n", "14 Olivia Olivia A4 4 1 s False \n", "15 2 2 B4 4 2 n False \n", "16 D D C4 4 3 s False \n", "17 Olivia Olivia E4 4 5 s False \n", "18 4 4 F4 4 6 n False \n", "19 C C G4 4 7 s False \n", "20 Music Music A6 6 1 s False \n", "21 Drama Drama E6 6 5 s False \n", "22 Name Name A7 7 1 s False \n", "23 Score Score B7 7 2 s False \n", "24 Grade Grade C7 7 3 s False \n", "25 Name Name E7 7 5 s False \n", "26 Score Score F7 7 6 s False \n", "27 Grade Grade G7 7 7 s False \n", "28 Matilda Matilda A8 8 1 s False \n", "29 5 5 B8 8 2 n False \n", "30 B B C8 8 3 s False \n", "31 Matilda Matilda E8 8 5 s False \n", "32 7 7 F8 8 6 n False \n", "33 A A G8 8 7 s False \n", "34 Olivia Olivia A9 9 1 s False \n", "35 6 6 B9 9 2 n False \n", "36 B B C9 9 3 s False \n", "37 Olivia Olivia E9 9 5 s False \n", "38 8 8 F9 9 6 n False \n", "39 A A G9 9 7 s False \n", "\n", " number_format font \n", "0 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "1 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "2 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "3 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "4 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "5 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "6 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "7 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "8 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "9 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "10 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "11 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "12 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "13 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "14 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "15 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "16 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "17 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "18 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "19 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "20 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "21 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "22 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "23 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "24 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "25 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "26 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "27 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "28 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "29 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "30 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "31 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "32 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "33 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "34 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "35 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "36 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "37 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "38 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "39 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = jn.xlsx_cells(\n", " excel_file,\n", " sheetnames=\"small-multiples\",\n", " font=True,\n", " include_blank_cells=False,\n", ").astype({\"row\": np.int8, \"column\": np.int8})\n", "frame.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We know this much:\n", "- the data are in three columns (`Name`, `Score`, `Grade` are the headers)\n", "- the data is not in bold/italic fonts\n", "\n", "Let's apply this knowledge to grab the data:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valuerowcolumn
8Matilda31
9132
10F33
11Matilda35
12336
13D37
14Olivia41
15242
16D43
17Olivia45
18446
19C47
28Matilda81
29582
30B83
31Matilda85
32786
33A87
34Olivia91
35692
36B93
37Olivia95
38896
39A97
\n", "
" ], "text/plain": [ " value row column\n", "8 Matilda 3 1\n", "9 1 3 2\n", "10 F 3 3\n", "11 Matilda 3 5\n", "12 3 3 6\n", "13 D 3 7\n", "14 Olivia 4 1\n", "15 2 4 2\n", "16 D 4 3\n", "17 Olivia 4 5\n", "18 4 4 6\n", "19 C 4 7\n", "28 Matilda 8 1\n", "29 5 8 2\n", "30 B 8 3\n", "31 Matilda 8 5\n", "32 7 8 6\n", "33 A 8 7\n", "34 Olivia 9 1\n", "35 6 9 2\n", "36 B 9 3\n", "37 Olivia 9 5\n", "38 8 9 6\n", "39 A 9 7" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "italics = frame.font.str.get(\"i\")\n", "bold = frame.font.str.get(\"b\")\n", "strings = frame.data_type.eq(\"s\")\n", "booleans = ~italics & ~bold\n", "data = frame.loc[booleans, [\"value\", \"row\", \"column\"]]\n", "data" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "[\n", "['Matilda', '1', 'F'],\n", "['Matilda', '3', 'D'],\n", "['Olivia', '2', 'D'],\n", "['Olivia', '4', 'C'],\n", "['Matilda', '5', 'B'],\n", "['Matilda', '7', 'A'],\n", "['Olivia', '6', 'B'],\n", "['Olivia', '8', 'A']\n", "]\n", "Shape: (8, 3), dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we know there are three columns per subset of data\n", "extract = data.value.array.reshape((-1, 3))\n", "extract" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "[np.int8(3), np.int8(3), np.int8(4), np.int8(4), np.int8(8), np.int8(8),\n", " np.int8(9), np.int8(9)]\n", "Length: 8, dtype: int8" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# apply same concept to reshape the row and column\n", "# required to pair the subjects to the data\n", "rows = data.row.array.reshape((-1, 3))[:, 0]\n", "rows" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "[np.int8(1), np.int8(5), np.int8(1), np.int8(5), np.int8(1), np.int8(5),\n", " np.int8(1), np.int8(5)]\n", "Length: 8, dtype: int8" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns = data.column.array.reshape((-1, 3))[:, 0]\n", "columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the headers, and create a DataFrame of `extract`, along with `rows`:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Name', 'Score', 'Grade'], dtype=object)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "headers = frame.loc[bold, \"value\"].unique()\n", "headers" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameScoreGraderowcolumn
0Matilda1F31
2Olivia2D41
4Matilda5B81
6Olivia6B91
1Matilda3D35
3Olivia4C45
5Matilda7A85
7Olivia8A95
\n", "
" ], "text/plain": [ " Name Score Grade row column\n", "0 Matilda 1 F 3 1\n", "2 Olivia 2 D 4 1\n", "4 Matilda 5 B 8 1\n", "6 Olivia 6 B 9 1\n", "1 Matilda 3 D 3 5\n", "3 Olivia 4 C 4 5\n", "5 Matilda 7 A 8 5\n", "7 Olivia 8 A 9 5" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "extract = (\n", " pd.DataFrame(extract, columns=headers)\n", " .assign(row=rows, column=columns)\n", " # order necessary for when patching subjects\n", " # into the DatFrame\n", " .astype({\"Score\": np.int8})\n", " .sort_values([\"column\", \"row\"])\n", ")\n", "extract" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All that is left is to pair the subjects(italicized) with the `extract` DataFrame. \n", "\n", "We know that the subject has a difference of two, row-wise to the first entry in the sub data:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subjectrowcolumn
0Classics31
1History35
20Music81
21Drama85
\n", "
" ], "text/plain": [ " subject row column\n", "0 Classics 3 1\n", "1 History 3 5\n", "20 Music 8 1\n", "21 Drama 8 5" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subjects = (\n", " frame.loc[italics, [\"value\", \"row\", \"column\"]]\n", " .assign(row=lambda f: f.row.add(2))\n", " .rename(columns={\"value\": \"subject\"})\n", ")\n", "subjects" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSubjectScoreGrade
0MatildaClassics1F
1OliviaClassics2D
2MatildaMusic5B
3OliviaMusic6B
4MatildaHistory3D
5OliviaHistory4C
6MatildaDrama7A
7OliviaDrama8A
\n", "
" ], "text/plain": [ " Name Subject Score Grade\n", "0 Matilda Classics 1 F\n", "1 Olivia Classics 2 D\n", "2 Matilda Music 5 B\n", "3 Olivia Music 6 B\n", "4 Matilda History 3 D\n", "5 Olivia History 4 C\n", "6 Matilda Drama 7 A\n", "7 Olivia Drama 8 A" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outcome = (\n", " extract\n", " # notice the join order\n", " # join on the columns first, before the row\n", " # to ensure correct output\n", " .merge(subjects, how=\"left\", on=[\"column\", \"row\"])\n", " .assign(Subject=lambda f: f.subject.ffill())\n", " .loc[:, [\"Name\", \"Subject\", \"Score\", \"Grade\"]]\n", ")\n", "\n", "outcome" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Case 2 : Same table in several worksheets/files (using the sheet/file name)__\n", "\n", "![humanities.png](Images/humanities.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![performance.png](Images/performance.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this case, our data is in different worksheets. We can iterate through each worksheet and combine the dataframes into one.\n", "\n", "We do not need [xlsx_cells](https://pyjanitor-devs.github.io/pyjanitor/api/io/#janitor.io.xlsx_cells) for this - [pandas](https://pandas.pydata.org/docs/user_guide/index.html) is sufficient:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[ Matilda Nicholas\n", " Classics 1 3\n", " History 3 5,\n", " Matilda Nicholas\n", " Music 5 9\n", " Drama 7 12]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "extract = [\n", " excel_file.parse(sheet_name=sheetname, index_col=0)\n", " for sheetname in (\"humanities\", \"performance\")\n", "]\n", "\n", "extract" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combine the individual dataframes into one:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subjectstudentscores
0ClassicsMatilda1
1ClassicsNicholas3
2HistoryMatilda3
3HistoryNicholas5
4MusicMatilda5
5MusicNicholas9
6DramaMatilda7
7DramaNicholas12
\n", "
" ], "text/plain": [ " subject student scores\n", "0 Classics Matilda 1\n", "1 Classics Nicholas 3\n", "2 History Matilda 3\n", "3 History Nicholas 5\n", "4 Music Matilda 5\n", "5 Music Nicholas 9\n", "6 Drama Matilda 7\n", "7 Drama Nicholas 12" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " pd.concat(extract)\n", " .rename_axis(index=\"subject\", columns=\"student\")\n", " .stack(future_stack=True)\n", " .rename(\"scores\")\n", " .reset_index()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The image below illustrates the core concepts of the above solution:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![\"function description for case 2\"](Images/case2.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Case 3 : Same table in several worksheets/files but in different positions__\n", "\n", "![female.png](Images/female.png) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![male.png](Images/male.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is similar to Case 2, with the core data been the same. Here we need to pick rows from `Subject` downwards only, as that is the only relevant data:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'female': subject student\n", " Classics Matilda 1\n", " Olivia 2\n", " History Matilda 3\n", " Olivia 4\n", " dtype: object,\n", " 'male': subject student \n", " Classics Nicholas 3\n", " Paul 0\n", " History Nicholas 5\n", " Paul 1\n", " dtype: object}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "extract = {\n", " sheetname: excel_file.parse(sheet_name=sheetname, header=None, index_col=0).loc[\n", " \"Subject\":\n", " ]\n", " # use the subject row as column names\n", " .row_to_names(0) # pyjanitor method\n", " .drop(index=\"Subject\")\n", " .rename_axis(index=\"subject\", columns=\"student\")\n", " .stack(future_stack=True)\n", " for sheetname in (\"female\", \"male\")\n", "}\n", "\n", "extract" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combine the individual dataframes into one:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexsubjectstudentscores
0femaleClassicsMatilda1
1femaleClassicsOlivia2
2femaleHistoryMatilda3
3femaleHistoryOlivia4
4maleClassicsNicholas3
5maleClassicsPaul0
6maleHistoryNicholas5
7maleHistoryPaul1
\n", "
" ], "text/plain": [ " sex subject student scores\n", "0 female Classics Matilda 1\n", "1 female Classics Olivia 2\n", "2 female History Matilda 3\n", "3 female History Olivia 4\n", "4 male Classics Nicholas 3\n", "5 male Classics Paul 0\n", "6 male History Nicholas 5\n", "7 male History Paul 1" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pd.concat(extract, names=[\"sex\"]).rename(\"scores\").reset_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The image below explains the main concepts of the solution above : " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![\"visual explanation of function for case3\"](Images/case3.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Case 4 : Implied multiples__\n", "\n", "![implied-multiples.png](Images/implied-multiples.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this case, we have the fields at the top, followed by the subjects and grade for each subject. the student names is the very first column.
\n", "The goal is to get the subjects,grades and scores per field, per student and combine into one." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valueinternal_valuecoordinaterowcolumndata_typeis_datenumber_formatfont
0HumanitiesHumanitiesB112sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1PerformancePerformanceF116sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2NameNameA221sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
3ClassicsClassicsB222sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
4GradeGradeC223sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
5HistoryHistoryD224sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
6GradeGradeE225sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
7MusicMusicF226sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
8GradeGradeG227sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
9DramaDramaH228sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
10GradeGradeI229sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
11MatildaMatildaA331sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1211B332nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
13FFC333sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1433D334nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
15DDE335sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1655F336nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
17BBG337sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
1877H338nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
19AAI339sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
20OliviaOliviaA441sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2122B442nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
22DDC443sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2344D444nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
24CCE445sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2566F446nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
26BBG447sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
2788H448nFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
28AAI449sFalseGeneral{'name': 'Calibri', 'family': 2.0, 'sz': 11.0,...
\n", "
" ], "text/plain": [ " value internal_value coordinate row column data_type is_date \\\n", "0 Humanities Humanities B1 1 2 s False \n", "1 Performance Performance F1 1 6 s False \n", "2 Name Name A2 2 1 s False \n", "3 Classics Classics B2 2 2 s False \n", "4 Grade Grade C2 2 3 s False \n", "5 History History D2 2 4 s False \n", "6 Grade Grade E2 2 5 s False \n", "7 Music Music F2 2 6 s False \n", "8 Grade Grade G2 2 7 s False \n", "9 Drama Drama H2 2 8 s False \n", "10 Grade Grade I2 2 9 s False \n", "11 Matilda Matilda A3 3 1 s False \n", "12 1 1 B3 3 2 n False \n", "13 F F C3 3 3 s False \n", "14 3 3 D3 3 4 n False \n", "15 D D E3 3 5 s False \n", "16 5 5 F3 3 6 n False \n", "17 B B G3 3 7 s False \n", "18 7 7 H3 3 8 n False \n", "19 A A I3 3 9 s False \n", "20 Olivia Olivia A4 4 1 s False \n", "21 2 2 B4 4 2 n False \n", "22 D D C4 4 3 s False \n", "23 4 4 D4 4 4 n False \n", "24 C C E4 4 5 s False \n", "25 6 6 F4 4 6 n False \n", "26 B B G4 4 7 s False \n", "27 8 8 H4 4 8 n False \n", "28 A A I4 4 9 s False \n", "\n", " number_format font \n", "0 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "1 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "2 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "3 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "4 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "5 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "6 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "7 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "8 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "9 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "10 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "11 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "12 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "13 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "14 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "15 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "16 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "17 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "18 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "19 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "20 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "21 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "22 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "23 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "24 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "25 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "26 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "27 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... \n", "28 General {'name': 'Calibri', 'family': 2.0, 'sz': 11.0,... " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = frame = jn.xlsx_cells(\n", " excel_file,\n", " sheetnames=\"implied-multiples\",\n", " font=True,\n", " include_blank_cells=False,\n", ").astype({\"row\": np.int8, \"column\": np.int8})\n", "\n", "frame.head()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fieldcolumn
0Humanities2
1Performance6
\n", "
" ], "text/plain": [ " field column\n", "0 Humanities 2\n", "1 Performance 6" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# fields are on the very first row\n", "fields = frame.loc[frame.row == 1, [\"value\", \"column\"]]\n", "fields = fields.rename(columns={\"value\": \"field\"})\n", "fields" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the subjects(bold fonts) and merge with fields:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subjectcolumnfield
0Classics2Humanities
1Classics3Humanities
2History4Humanities
3History5Humanities
4Music6Performance
5Music7Performance
6Drama8Performance
7Drama9Performance
\n", "
" ], "text/plain": [ " subject column field\n", "0 Classics 2 Humanities\n", "1 Classics 3 Humanities\n", "2 History 4 Humanities\n", "3 History 5 Humanities\n", "4 Music 6 Performance\n", "5 Music 7 Performance\n", "6 Drama 8 Performance\n", "7 Drama 9 Performance" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans = frame.font.str.get(\"b\") & frame.column.gt(1)\n", "subjects = (\n", " frame.loc[booleans, [\"value\", \"column\"]]\n", " .rename(columns={\"value\": \"subject\"})\n", " .merge(fields, on=\"column\", how=\"left\")\n", " # subject will be repeated for every paired score and grade\n", " # field will be repeated for every paired subject\n", " .assign(subject=lambda f: f.subject.where(f.subject != \"Grade\"))\n", " .ffill()\n", ")\n", "\n", "subjects" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the data (third row and below):" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scoregraderowcolumn
01F32
13D34
25B36
37A38
42D42
54C44
66B46
78A48
\n", "
" ], "text/plain": [ " score grade row column\n", "0 1 F 3 2\n", "1 3 D 3 4\n", "2 5 B 3 6\n", "3 7 A 3 8\n", "4 2 D 4 2\n", "5 4 C 4 4\n", "6 6 B 4 6\n", "7 8 A 4 8" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = frame.loc[frame.row.gt(2) & frame.column.gt(1), [\"value\", \"row\", \"column\"]]\n", "rows = data.loc[:, \"row\"]\n", "columns = data.loc[:, \"column\"]\n", "# We know the columns are in pairs\n", "# (subject, followed by grade):\n", "data = data.value.array.reshape((-1, 2))\n", "# apply the same logic to row:\n", "rows = rows.array.reshape((-1, 2))[:, 0]\n", "# and column:\n", "columns = columns.array.reshape((-1, 2))[:, 0]\n", "data = pd.DataFrame(data, columns=[\"score\", \"grade\"]).assign(row=rows, column=columns)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the student names, and join to data:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
studentfieldsubjectscoregrade
0MatildaHumanitiesClassics1F
1MatildaHumanitiesHistory3D
2MatildaPerformanceMusic5B
3MatildaPerformanceDrama7A
4OliviaHumanitiesClassics2D
5OliviaHumanitiesHistory4C
6OliviaPerformanceMusic6B
7OliviaPerformanceDrama8A
\n", "
" ], "text/plain": [ " student field subject score grade\n", "0 Matilda Humanities Classics 1 F\n", "1 Matilda Humanities History 3 D\n", "2 Matilda Performance Music 5 B\n", "3 Matilda Performance Drama 7 A\n", "4 Olivia Humanities Classics 2 D\n", "5 Olivia Humanities History 4 C\n", "6 Olivia Performance Music 6 B\n", "7 Olivia Performance Drama 8 A" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outcome = (\n", " frame.loc[frame.column.eq(1) & frame.row.gt(2), [\"value\", \"row\"]]\n", " .rename(columns={\"value\": \"student\"})\n", " .merge(data, on=\"row\")\n", " .merge(subjects, on=\"column\")\n", " .loc[:, [\"student\", \"field\", \"subject\", \"score\", \"grade\"]]\n", ")\n", "outcome" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another route, without [xlsx_cells](https://pyjanitor-devs.github.io/pyjanitor/api/io/#janitor.io.xlsx_cells):" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fieldsubjectstudentscoresgrade
0HumanitiesClassicsMatilda1F
2HumanitiesHistoryMatilda3D
4PerformanceMusicMatilda5B
6PerformanceDramaMatilda7A
8HumanitiesClassicsOlivia2D
10HumanitiesHistoryOlivia4C
12PerformanceMusicOlivia6B
14PerformanceDramaOlivia8A
\n", "
" ], "text/plain": [ " field subject student scores grade\n", "0 Humanities Classics Matilda 1 F\n", "2 Humanities History Matilda 3 D\n", "4 Performance Music Matilda 5 B\n", "6 Performance Drama Matilda 7 A\n", "8 Humanities Classics Olivia 2 D\n", "10 Humanities History Olivia 4 C\n", "12 Performance Music Olivia 6 B\n", "14 Performance Drama Olivia 8 A" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " excel_file.parse(sheet_name=\"implied-multiples\", header=None)\n", " .ffill(axis=1)\n", " .fillna(\"field\")\n", " .set_index(0)\n", " .T.melt(id_vars=[\"field\", \"Name\"], var_name=\"student\", value_name=\"scores\")\n", " .assign(grade=lambda x: x.loc[x.Name == \"Grade\", \"scores\"])\n", " # scores are above grades per student\n", " # hence the bfill\n", " .bfill()\n", " .query('Name != \"Grade\"')\n", " .rename(columns={\"Name\": \"subject\"})\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And a visual illustration of the steps is shown below: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![\"visual explanation of code for case4\"](Images/case4.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comments\n", "" ] } ], "metadata": { "jupytext": { "formats": "ipynb,md" }, "kernelspec": { "display_name": "blogger", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 4 }