{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Access Excel Tables with Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Access Excel Tables with Python**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This post is about extracting data from Excel tables into Python. Source data is with permission from [ExcelisFun](https://www.youtube.com/watch?v=nBu1Bqa1jjs&t=1780s)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Excel Tables are a great way of grouping related data, as it makes analysis easier. Usually, these tables will have names to identify them, as well as some other cool features. An example image is shown below: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![ExcelTables.png](extract_defined_tables/image1.png)
\n", "Source : [support.office.com](https://support.office.com/en-us/article/overview-of-excel-tables-7ab0bb7d-3a9e-4b56-a3c9-6c94334e492c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![ExcelTables.png](extract_defined_tables/image2.png)
\n", "Data Source : [ExcelisFun](https://www.youtube.com/watch?v=nBu1Bqa1jjs&t=1780s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the image above, there are a couple of Excel tables, with defined names - SalesRep, Products, Category, and Supplier tables. How do we read this into Python?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Option 1 - The Naive way:**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's read it into pandas " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "filename = \"Data_files/016-MSPTDA-Excel.xlsx\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SalesRepIDSalesRepRegionUnnamed: 3ProductIDProductsRetailPriceCategoryIDSupplierIDUnnamed: 9CategoryID.1CategoryUnnamed: 12SupplierID.1SupplierCityStateE-mail
01Sioux RadcoolinatorNWNaN1.0Quad43.953.0GBNaN1.0BeginnerNaNGBGel BoomerangsOaklandCAgel@gel-boomerang.com
12Tyrone SmitheNENaN2.0Yanaki27.951.0CONaN2.0AdvancedNaNCOColorado BoomerangsGunnisonCOPollock@coloradoboomerang.com
23Chantel ZoyaSWNaN3.0Eagle19.952.0CCNaN3.0FreestyleNaNCCChannel CraftRichlandWADino@CC.com
34Chin PhamSENaN4.0Bellen26.951.0GBNaN4.0CompetitionNaNDBDarnell BoomsBurlingtonVTDarnell@Darnell.com
45Diego VasqueMWNaN5.0Aspen24.951.0CONaN5.0Long DistanceNaNNaNNaNNaNNaNNaN
\n", "
" ], "text/plain": [ " SalesRepID SalesRep Region Unnamed: 3 ProductID Products \\\n", "0 1 Sioux Radcoolinator NW NaN 1.0 Quad \n", "1 2 Tyrone Smithe NE NaN 2.0 Yanaki \n", "2 3 Chantel Zoya SW NaN 3.0 Eagle \n", "3 4 Chin Pham SE NaN 4.0 Bellen \n", "4 5 Diego Vasque MW NaN 5.0 Aspen \n", "\n", " RetailPrice CategoryID SupplierID Unnamed: 9 CategoryID.1 \\\n", "0 43.95 3.0 GB NaN 1.0 \n", "1 27.95 1.0 CO NaN 2.0 \n", "2 19.95 2.0 CC NaN 3.0 \n", "3 26.95 1.0 GB NaN 4.0 \n", "4 24.95 1.0 CO NaN 5.0 \n", "\n", " Category Unnamed: 12 SupplierID.1 Supplier City \\\n", "0 Beginner NaN GB Gel Boomerangs Oakland \n", "1 Advanced NaN CO Colorado Boomerangs Gunnison \n", "2 Freestyle NaN CC Channel Craft Richland \n", "3 Competition NaN DB Darnell Booms Burlington \n", "4 Long Distance NaN NaN NaN NaN \n", "\n", " State E-mail \n", "0 CA gel@gel-boomerang.com \n", "1 CO Pollock@coloradoboomerang.com \n", "2 WA Dino@CC.com \n", "3 VT Darnell@Darnell.com \n", "4 NaN NaN " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_excel(filename, sheet_name = \"Tables\", engine='openpyxl')\n", "\n", "#view the first five rows: \n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how Pandas did not identify the tables - it just pulled in everything, even the empty columns. Also note the mangling of column names(*SupplierID.1*, *CategoryID.1*). This is not good enough. Yes, we could fix it, probably use the empty rows as a means of splitting the dataframe into new dataframes, but that is not wise. How do we truly know where one table starts and the other ends? Surely there has to be a better way. Thankfully there is." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Option 2 - The better way :**" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Check on [stackoverflow](https://stackoverflow.com/questions/48657867/manipulate-existing-excel-table-using-openpyxl) and you'll see solutions and conversations regarding this. The better way is via [Openpyxl](https://openpyxl.readthedocs.io/en/stable/usage.html), a python module dedicated to working with Excel files. It has a method - ```.tables``` that allows access to defined tables in the spreadsheet. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# import library\n", "from openpyxl import load_workbook\n", "\n", "# read file\n", "wb = load_workbook(filename)\n", "\n", "# access specific sheet\n", "ws = wb[\"Tables\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can access the tables in the worksheet through the ``tables`` method - this returns a dictionary :" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'dSalesReps': 'A1:C26',\n", " 'dProduct': 'E1:I17',\n", " 'dCategory': 'K1:L6',\n", " 'dSupplier': 'N1:R5'}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "{key : value for key, value in ws.tables.items()}" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "From the result above, we can see the name of each table (*name=dSalesReps*) and the span of the data (*ref='A1:C26*). Let's get our data out :" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "mapping = {}\n", "\n", "for entry, data_boundary in ws.tables.items():\n", " #parse the data within the ref boundary\n", " data = ws[data_boundary]\n", " #extract the data \n", " #the inner list comprehension gets the values for each cell in the table\n", " content = [[cell.value for cell in ent] \n", " for ent in data\n", " ]\n", " \n", " header = content[0]\n", " \n", " #the contents ... excluding the header\n", " rest = content[1:]\n", " \n", " #create dataframe with the column names\n", " #and pair table name with dataframe\n", " df = pd.DataFrame(rest, columns = header)\n", " mapping[entry] = df" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'dSalesReps': SalesRepID SalesRep Region\n", " 0 1 Sioux Radcoolinator NW\n", " 1 2 Tyrone Smithe NE\n", " 2 3 Chantel Zoya SW\n", " 3 4 Chin Pham SE\n", " 4 5 Diego Vasque MW\n", " 5 6 Vannessa Deloach W\n", " 6 7 Shon Stein NW\n", " 7 8 Tomi Benton NE\n", " 8 9 Ghislaine Stidham SW\n", " 9 10 Yoshiko Murillo SE\n", " 10 11 Hoyt Potts MW\n", " 11 12 Alysha Dewitt W\n", " 12 13 Claudine Dupuis NW\n", " 13 14 Shanta Spring NE\n", " 14 15 Ramonita Babcock SW\n", " 15 16 Janyce Betancourt SE\n", " 16 17 Rhiannon Cathey MW\n", " 17 18 Dominica Ordonez W\n", " 18 19 Rana Burchfield NW\n", " 19 20 Neida Ashe NE\n", " 20 21 Marylouise Halverson MW\n", " 21 22 Naoma Bloom NW\n", " 22 23 JoJo Jones W\n", " 23 24 Dean Washington W\n", " 24 25 Kiki Lim W,\n", " 'dProduct': ProductID Products RetailPrice CategoryID SupplierID\n", " 0 1 Quad 43.95 3 GB\n", " 1 2 Yanaki 27.95 1 CO\n", " 2 3 Eagle 19.95 2 CC\n", " 3 4 Bellen 26.95 1 GB\n", " 4 5 Aspen 24.95 1 CO\n", " 5 6 Carlota 23.95 3 GB\n", " 6 7 Sunshine 19.95 4 GB\n", " 7 8 Sunset 22.95 4 GB\n", " 8 9 Beaut 35.95 5 CO\n", " 9 10 Kangaroo 25.00 2 CC\n", " 10 11 Elevate 48.95 5 GB\n", " 11 12 Flattop 25.95 2 CO\n", " 12 13 Vrang 12.95 4 DB\n", " 13 14 TriFly 21.95 3 DB\n", " 14 15 NaturalElbow 35.00 4 DB\n", " 15 16 LongRang 41.00 5 CC,\n", " 'dCategory': CategoryID Category\n", " 0 1 Beginner\n", " 1 2 Advanced\n", " 2 3 Freestyle\n", " 3 4 Competition\n", " 4 5 Long Distance,\n", " 'dSupplier': SupplierID Supplier City State \\\n", " 0 GB Gel Boomerangs Oakland CA \n", " 1 CO Colorado Boomerangs Gunnison CO \n", " 2 CC Channel Craft Richland WA \n", " 3 DB Darnell Booms Burlington VT \n", " \n", " E-mail \n", " 0 gel@gel-boomerang.com \n", " 1 Pollock@coloradoboomerang.com \n", " 2 Dino@CC.com \n", " 3 Darnell@Darnell.com }" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mapping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can safely extract our tables from the dictionary :" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "dSalesReps, dProduct, dCategory, dSupplier = mapping.values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's view some of the dataframes" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SalesRepIDSalesRepRegion
01Sioux RadcoolinatorNW
12Tyrone SmitheNE
23Chantel ZoyaSW
34Chin PhamSE
45Diego VasqueMW
\n", "
" ], "text/plain": [ " SalesRepID SalesRep Region\n", "0 1 Sioux Radcoolinator NW\n", "1 2 Tyrone Smithe NE\n", "2 3 Chantel Zoya SW\n", "3 4 Chin Pham SE\n", "4 5 Diego Vasque MW" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dSalesReps.head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SupplierIDSupplierCityStateE-mail
0GBGel BoomerangsOaklandCAgel@gel-boomerang.com
1COColorado BoomerangsGunnisonCOPollock@coloradoboomerang.com
2CCChannel CraftRichlandWADino@CC.com
3DBDarnell BoomsBurlingtonVTDarnell@Darnell.com
\n", "
" ], "text/plain": [ " SupplierID Supplier City State \\\n", "0 GB Gel Boomerangs Oakland CA \n", "1 CO Colorado Boomerangs Gunnison CO \n", "2 CC Channel Craft Richland WA \n", "3 DB Darnell Booms Burlington VT \n", "\n", " E-mail \n", "0 gel@gel-boomerang.com \n", "1 Pollock@coloradoboomerang.com \n", "2 Dino@CC.com \n", "3 Darnell@Darnell.com " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dSupplier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There we have it, our excel tables successfully extracted and assigned to variables. We can proceed from here and run our computations in Pandas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Option 3 - An even simpler form :**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The steps above can be simplified by using the [xlsx_table](https://pyjanitor-devs.github.io/pyjanitor/api/io/#janitor.io.xlsx_table) function from [pyjanitor](https://pyjanitor-devs.github.io/pyjanitor/). Simply install [pyjanitor](https://github.com/pyjanitor-devs/pyjanitor) with pip, and read in the Excel table:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CategoryIDCategory
01Beginner
12Advanced
23Freestyle
34Competition
45Long Distance
\n", "
" ], "text/plain": [ " CategoryID Category\n", "0 1 Beginner\n", "1 2 Advanced\n", "2 3 Freestyle\n", "3 4 Competition\n", "4 5 Long Distance" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pip install pyjanitor\n", "from janitor import xlsx_table\n", "\n", "# read in the specific table:\n", "xlsx_table(filename, table='dCategory')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All the tables can be read at once; it will be loaded as a dictionary, and the relevant table accessed:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['dSalesReps', 'dProduct', 'dCategory', 'dSupplier'])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tables = xlsx_table(filename)\n", "\n", "# view all the keys, which are the table names:\n", "tables.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Access a specific table, using Python's dictionary syntax:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SalesRepIDSalesRepRegion
01Sioux RadcoolinatorNW
12Tyrone SmitheNE
23Chantel ZoyaSW
34Chin PhamSE
45Diego VasqueMW
56Vannessa DeloachW
67Shon SteinNW
78Tomi BentonNE
89Ghislaine StidhamSW
910Yoshiko MurilloSE
1011Hoyt PottsMW
1112Alysha DewittW
1213Claudine DupuisNW
1314Shanta SpringNE
1415Ramonita BabcockSW
1516Janyce BetancourtSE
1617Rhiannon CatheyMW
1718Dominica OrdonezW
1819Rana BurchfieldNW
1920Neida AsheNE
2021Marylouise HalversonMW
2122Naoma BloomNW
2223JoJo JonesW
2324Dean WashingtonW
2425Kiki LimW
\n", "
" ], "text/plain": [ " SalesRepID SalesRep Region\n", "0 1 Sioux Radcoolinator NW\n", "1 2 Tyrone Smithe NE\n", "2 3 Chantel Zoya SW\n", "3 4 Chin Pham SE\n", "4 5 Diego Vasque MW\n", "5 6 Vannessa Deloach W\n", "6 7 Shon Stein NW\n", "7 8 Tomi Benton NE\n", "8 9 Ghislaine Stidham SW\n", "9 10 Yoshiko Murillo SE\n", "10 11 Hoyt Potts MW\n", "11 12 Alysha Dewitt W\n", "12 13 Claudine Dupuis NW\n", "13 14 Shanta Spring NE\n", "14 15 Ramonita Babcock SW\n", "15 16 Janyce Betancourt SE\n", "16 17 Rhiannon Cathey MW\n", "17 18 Dominica Ordonez W\n", "18 19 Rana Burchfield NW\n", "19 20 Neida Ashe NE\n", "20 21 Marylouise Halverson MW\n", "21 22 Naoma Bloom NW\n", "22 23 JoJo Jones W\n", "23 24 Dean Washington W\n", "24 25 Kiki Lim W" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tables['dSalesReps']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "you can also read it into a polars DataFrame:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
CategoryIDCategory
i64str
1"Beginner"
2"Advanced"
3"Freestyle"
4"Competition"
5"Long Distance"
" ], "text/plain": [ "shape: (5, 2)\n", "┌────────────┬───────────────┐\n", "│ CategoryID ┆ Category │\n", "│ --- ┆ --- │\n", "│ i64 ┆ str │\n", "╞════════════╪═══════════════╡\n", "│ 1 ┆ Beginner │\n", "│ 2 ┆ Advanced │\n", "│ 3 ┆ Freestyle │\n", "│ 4 ┆ Competition │\n", "│ 5 ┆ Long Distance │\n", "└────────────┴───────────────┘" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xlsx_table(filename, table='dCategory', engine='polars')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comments\n", "" ] } ], "metadata": { "jupytext": { "formats": "ipynb,md" }, "kernelspec": { "display_name": "blogger", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 4 }