Access Excel Tables with Python

Access Excel Tables with Python#

This post is about extracting data from Excel tables into Python. Source data is with permission from ExcelisFun

Excel Tables are a great way of grouping related data, as it makes analysis easier. Usually, these tables will have names to identify them, as well as some other cool features. An example image is shown below:

Source : support.office.com

Data Source : ExcelisFun

In the image above, there are a couple of Excel tables, with defined names - SalesRep, Products, Category, and Supplier tables. How do we read this into Python?

Option 1 - The Naive way:#

Let’s read it into pandas

import pandas as pd

filename = "Data_files/016-MSPTDA-Excel.xlsx"

df = pd.read_excel(filename, sheet_name = "Tables", engine='openpyxl')

#view the first five rows: 
df.head()

	SalesRepID	SalesRep	Region	Unnamed: 3	ProductID	Products	RetailPrice	CategoryID	SupplierID	Unnamed: 9	CategoryID.1	Category	Unnamed: 12	SupplierID.1	Supplier	City	State	E-mail
0	1	Sioux Radcoolinator	NW	NaN	1.0	Quad	43.95	3.0	GB	NaN	1.0	Beginner	NaN	GB	Gel Boomerangs	Oakland	CA	gel@gel-boomerang.com
1	2	Tyrone Smithe	NE	NaN	2.0	Yanaki	27.95	1.0	CO	NaN	2.0	Advanced	NaN	CO	Colorado Boomerangs	Gunnison	CO	Pollock@coloradoboomerang.com
2	3	Chantel Zoya	SW	NaN	3.0	Eagle	19.95	2.0	CC	NaN	3.0	Freestyle	NaN	CC	Channel Craft	Richland	WA	Dino@CC.com
3	4	Chin Pham	SE	NaN	4.0	Bellen	26.95	1.0	GB	NaN	4.0	Competition	NaN	DB	Darnell Booms	Burlington	VT	Darnell@Darnell.com
4	5	Diego Vasque	MW	NaN	5.0	Aspen	24.95	1.0	CO	NaN	5.0	Long Distance	NaN	NaN	NaN	NaN	NaN	NaN

Notice how Pandas did not identify the tables - it just pulled in everything, even the empty columns. Also note the mangling of column names(SupplierID.1, CategoryID.1). This is not good enough. Yes, we could fix it, probably use the empty rows as a means of splitting the dataframe into new dataframes, but that is not wise. How do we truly know where one table starts and the other ends? Surely there has to be a better way. Thankfully there is.

Option 2 - The better way :#

Check on stackoverflow and you’ll see solutions and conversations regarding this. The better way is via Openpyxl, a python module dedicated to working with Excel files. It has a method - .tables that allows access to defined tables in the spreadsheet.

# import library
from openpyxl import load_workbook

# read file
wb = load_workbook(filename)

# access specific sheet
ws = wb["Tables"]

We can access the tables in the worksheet through the tables method - this returns a dictionary :

{key : value for key, value in ws.tables.items()}

{'dSalesReps': 'A1:C26',
 'dProduct': 'E1:I17',
 'dCategory': 'K1:L6',
 'dSupplier': 'N1:R5'}

From the result above, we can see the name of each table (name=dSalesReps) and the span of the data (ref=’A1:C26). Let’s get our data out :

mapping = {}

for entry, data_boundary in ws.tables.items():
    #parse the data within the ref boundary
    data = ws[data_boundary]
    #extract the data 
    #the inner list comprehension gets the values for each cell in the table
    content = [[cell.value for cell in ent] 
               for ent in data
          ]
    
    header = content[0]
    
    #the contents ... excluding the header
    rest = content[1:]
    
    #create dataframe with the column names
    #and pair table name with dataframe
    df = pd.DataFrame(rest, columns = header)
    mapping[entry] = df

mapping

{'dSalesReps':     SalesRepID              SalesRep Region
          1   Sioux Radcoolinator     NW
          2         Tyrone Smithe     NE
          3          Chantel Zoya     SW
          4             Chin Pham     SE
          5          Diego Vasque     MW
          6      Vannessa Deloach      W
          7            Shon Stein     NW
          8           Tomi Benton     NE
          9     Ghislaine Stidham     SW
         10       Yoshiko Murillo     SE
        11            Hoyt Potts     MW
        12         Alysha Dewitt      W
        13       Claudine Dupuis     NW
        14         Shanta Spring     NE
        15      Ramonita Babcock     SW
        16     Janyce Betancourt     SE
        17       Rhiannon Cathey     MW
        18      Dominica Ordonez      W
        19       Rana Burchfield     NW
        20            Neida Ashe     NE
        21  Marylouise Halverson     MW
        22           Naoma Bloom     NW
        23            JoJo Jones      W
        24       Dean Washington      W
        25              Kiki Lim      W,
 'dProduct':     ProductID      Products  RetailPrice  CategoryID SupplierID
         1          Quad        43.95           3         GB
         2        Yanaki        27.95           1         CO
         3         Eagle        19.95           2         CC
         4        Bellen        26.95           1         GB
         5         Aspen        24.95           1         CO
         6       Carlota        23.95           3         GB
         7      Sunshine        19.95           4         GB
         8        Sunset        22.95           4         GB
         9         Beaut        35.95           5         CO
        10      Kangaroo        25.00           2         CC
       11       Elevate        48.95           5         GB
       12       Flattop        25.95           2         CO
       13         Vrang        12.95           4         DB
       14        TriFly        21.95           3         DB
       15  NaturalElbow        35.00           4         DB
       16      LongRang        41.00           5         CC,
 'dCategory':    CategoryID       Category
         1       Beginner
         2       Advanced
         3      Freestyle
         4    Competition
         5  Long Distance,
 'dSupplier':   SupplierID             Supplier        City State  \
       GB       Gel Boomerangs     Oakland    CA   
       CO  Colorado Boomerangs    Gunnison    CO   
       CC        Channel Craft    Richland    WA   
       DB        Darnell Booms  Burlington    VT   
 
                           E-mail  
        gel@gel-boomerang.com  
Pollock@coloradoboomerang.com  
                  Dino@CC.com  
          Darnell@Darnell.com  }

We can safely extract our tables from the dictionary :

dSalesReps, dProduct, dCategory, dSupplier = mapping.values()

Let’s view some of the dataframes

dSalesReps.head()

	SalesRepID	SalesRep	Region
0	1	Sioux Radcoolinator	NW
1	2	Tyrone Smithe	NE
2	3	Chantel Zoya	SW
3	4	Chin Pham	SE
4	5	Diego Vasque	MW

dSupplier

	SupplierID	Supplier	City	State	E-mail
0	GB	Gel Boomerangs	Oakland	CA	gel@gel-boomerang.com
1	CO	Colorado Boomerangs	Gunnison	CO	Pollock@coloradoboomerang.com
2	CC	Channel Craft	Richland	WA	Dino@CC.com
3	DB	Darnell Booms	Burlington	VT	Darnell@Darnell.com

There we have it, our excel tables successfully extracted and assigned to variables. We can proceed from here and run our computations in Pandas.

Option 3 - An even simpler form :#

The steps above can be simplified by using the xlsx_table function from pyjanitor. Simply install pyjanitor with pip, and read in the Excel table:

# pip install pyjanitor
from janitor import xlsx_table

# read in the specific table:
xlsx_table(filename, table='dCategory')

	CategoryID	Category
0	1	Beginner
1	2	Advanced
2	3	Freestyle
3	4	Competition
4	5	Long Distance

All the tables can be read at once; it will be loaded as a dictionary, and the relevant table accessed:

tables = xlsx_table(filename)

# view all the keys, which are the table names:
tables.keys()

dict_keys(['dSalesReps', 'dProduct', 'dCategory', 'dSupplier'])

Access a specific table, using Python’s dictionary syntax:

tables['dSalesReps']

	SalesRepID	SalesRep	Region
0	1	Sioux Radcoolinator	NW
1	2	Tyrone Smithe	NE
2	3	Chantel Zoya	SW
3	4	Chin Pham	SE
4	5	Diego Vasque	MW
5	6	Vannessa Deloach	W
6	7	Shon Stein	NW
7	8	Tomi Benton	NE
8	9	Ghislaine Stidham	SW
9	10	Yoshiko Murillo	SE
10	11	Hoyt Potts	MW
11	12	Alysha Dewitt	W
12	13	Claudine Dupuis	NW
13	14	Shanta Spring	NE
14	15	Ramonita Babcock	SW
15	16	Janyce Betancourt	SE
16	17	Rhiannon Cathey	MW
17	18	Dominica Ordonez	W
18	19	Rana Burchfield	NW
19	20	Neida Ashe	NE
20	21	Marylouise Halverson	MW
21	22	Naoma Bloom	NW
22	23	JoJo Jones	W
23	24	Dean Washington	W
24	25	Kiki Lim	W

you can also read it into a polars DataFrame:

xlsx_table(filename, table='dCategory', engine='polars')

shape: (5, 2)

CategoryID	Category
i64	str
1	"Beginner"
2	"Advanced"
3	"Freestyle"
4	"Competition"
5	"Long Distance"

Access Excel Tables with Python

Contents

Access Excel Tables with Python#