Reshape Data in Polars Efficiently from Wide to Long Form - Part I#

Introduction#

You’ve got data in wide form (lots of columns) and you wish to transform it into long form (fewer columns, more rows), to make it tidy or for ease of analysis. In Polars, this is achieved with the unpivot method. There are scenarios, however, where some processing has to be done before and/or after unpivoting. This article shows how to reshape data for such scenarios.

Most of the examples here are based on the tidyr pivoting vignette.

Polars’ eager API is used in the examples below; for more performance, or in production, it is recommended to use the lazy API.

import polars as pl
import polars.selectors as cs
import sys

print("polars version :", pl.__version__)
print("python version :", sys.version)
polars version : 1.7.1
python version : 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]

String data in column names#

relig_income = pl.read_csv("./Data_files/relig_income.csv")
relig_income
shape: (18, 11)
religion                 <$10k  $10-20k  $20-30k  $30-40k  $40-50k  $50-75k  $75-100k  $100-150k  >150k  Don't know/refused
str                        i64      i64      i64      i64      i64      i64       i64        i64    i64                 i64
"Agnostic"                  27       34       60       81       76      137       122        109     84                  96
"Atheist"                   12       27       37       52       35       70        73         59     74                  76
"Buddhist"                  27       21       30       34       33       58        62         39     53                  54
"Catholic"                 418      617      732      670      638     1116       949        792    633                1489
"Don’t know/refused"        15       14       15       11       10       35        21         17     18                 116
…                            …        …        …        …        …        …         …          …      …                   …
"Orthodox"                  13       17       23       32       32       47        38         42     46                  73
"Other Christian"            9        7       11       13       13       14        18         14     12                  18
"Other Faiths"              20       33       40       46       49       63        46         40     41                  71
"Other World Religions"      5        2        3        4        2        7         3          4      4                   8
"Unaffiliated"             217      299      374      365      341      528       407        321    258                 597

The relig_income dataset stores counts based on a survey which (among other things) asked people about their religion and annual income. This dataset contains three variables:

  • religion, stored in the rows,

  • income spread across the column names, and

  • count stored in the cell values.

We can flip this into long form, and ensure each variable has its own column - this can be achieved with the unpivot method:

relig_income.unpivot(index="religion", variable_name="income", value_name="count")
shape: (180, 3)
religion                 income                count
str                      str                   i64
"Agnostic"               "<$10k"               27
"Atheist"                "<$10k"               12
"Buddhist"               "<$10k"               27
"Catholic"               "<$10k"               418
"Don’t know/refused"     "<$10k"               15
…                        …                     …
"Orthodox"               "Don't know/refused"  73
"Other Christian"        "Don't know/refused"  18
"Other Faiths"           "Don't know/refused"  71
"Other World Religions"  "Don't know/refused"  8
"Unaffiliated"           "Don't know/refused"  597

No preprocessing required here.

Numeric data in column names#

billboard = pl.read_csv("./Data_files/billboard.csv", null_values="NA")
billboard = billboard.with_columns(cs.starts_with("wk").cast(pl.Int32))
billboard
shape: (317, 81)
year  artist              track                      time    date.entered  wk1  wk2   wk3  …  wk76
i64   str                 str                        str     str           i32  i32   i32     i32
2000  "2 Pac"             "Baby Don't Cry (Keep..."  "4:22"  "2000-02-26"  87   82    72   …  null
2000  "2Ge+her"           "The Hardest Part Of ..."  "3:15"  "2000-09-02"  91   87    92   …  null
2000  "3 Doors Down"      "Kryptonite"               "3:53"  "2000-04-08"  81   70    68   …  null
2000  "3 Doors Down"      "Loser"                    "4:24"  "2000-10-21"  76   76    72   …  null
2000  "504 Boyz"          "Wobble Wobble"            "3:35"  "2000-04-15"  57   34    25   …  null
…     …                   …                          …       …             …    …     …    …  …
2000  "Yankee Grey"       "Another Nine Minutes"     "3:10"  "2000-04-29"  86   83    77   …  null
2000  "Yearwood, Trisha"  "Real Live Woman"          "3:55"  "2000-04-01"  85   83    83   …  null
2000  "Ying Yang Twins"   "Whistle While You Tw..."  "4:19"  "2000-03-18"  95   94    91   …  null
2000  "Zombie Nation"     "Kernkraft 400"            "3:30"  "2000-09-02"  99   99    null …  null
2000  "matchbox twenty"   "Bent"                     "4:12"  "2000-04-29"  60   37    29   …  null

The billboard dataset records the billboard rank of songs in the year 2000. It has a form similar to the relig_income data, but the data encoded in the column names is really a number, not a string.

Let’s unpivot the data into a tidy form:

(
    billboard.unpivot(
        index=~cs.starts_with("wk"), variable_name="week", value_name="rank"
    )
    # irrelevant nulls
    .drop_nulls(subset="rank")
)
shape: (5_307, 7)
year  artist          track                      time    date.entered  week    rank
i64   str             str                        str     str           str     i32
2000  "2 Pac"         "Baby Don't Cry (Keep..."  "4:22"  "2000-02-26"  "wk1"   87
2000  "2Ge+her"       "The Hardest Part Of ..."  "3:15"  "2000-09-02"  "wk1"   91
2000  "3 Doors Down"  "Kryptonite"               "3:53"  "2000-04-08"  "wk1"   81
2000  "3 Doors Down"  "Loser"                    "4:24"  "2000-10-21"  "wk1"   76
2000  "504 Boyz"      "Wobble Wobble"            "3:35"  "2000-04-15"  "wk1"   57
…     …               …                          …       …             …       …
2000  "Creed"         "Higher"                   "5:16"  "1999-09-11"  "wk63"  50
2000  "Lonestar"      "Amazed"                   "4:25"  "1999-06-05"  "wk63"  45
2000  "Creed"         "Higher"                   "5:16"  "1999-09-11"  "wk64"  50
2000  "Lonestar"      "Amazed"                   "4:25"  "1999-06-05"  "wk64"  50
2000  "Creed"         "Higher"                   "5:16"  "1999-09-11"  "wk65"  49

It would be nice to easily determine how long each song stayed in the charts, but to do that, we’ll need to convert the week variable to an integer.

This is where the post-processing comes into play after the unpivoting - extract the number from the string column, and cast the extracted column to an integer dtype:

(
    billboard.unpivot(
        index=~cs.starts_with("wk"), variable_name="week", value_name="rank"
    )
    # drop the irrelevant nulls
    .drop_nulls(subset="rank").with_columns(
        # skip the "wk" prefix, then cast the remaining digits
        week=pl.col("week").str.slice(2).cast(pl.Int16)
    )
)
shape: (5_307, 7)
year  artist          track                      time    date.entered  week  rank
i64   str             str                        str     str           i16   i32
2000  "2 Pac"         "Baby Don't Cry (Keep..."  "4:22"  "2000-02-26"  1     87
2000  "2Ge+her"       "The Hardest Part Of ..."  "3:15"  "2000-09-02"  1     91
2000  "3 Doors Down"  "Kryptonite"               "3:53"  "2000-04-08"  1     81
2000  "3 Doors Down"  "Loser"                    "4:24"  "2000-10-21"  1     76
2000  "504 Boyz"      "Wobble Wobble"            "3:35"  "2000-04-15"  1     57
…     …               …                          …       …             …     …
2000  "Creed"         "Higher"                   "5:16"  "1999-09-11"  63    50
2000  "Lonestar"      "Amazed"                   "4:25"  "1999-06-05"  63    45
2000  "Creed"         "Higher"                   "5:16"  "1999-09-11"  64    50
2000  "Lonestar"      "Amazed"                   "4:25"  "1999-06-05"  64    50
2000  "Creed"         "Higher"                   "5:16"  "1999-09-11"  65    49

Many variables in column names#

who = pl.read_csv("./Data_files/who.csv", null_values="NA")
who
shape: (7_240, 60)
country        iso2  iso3   year  new_sp_m014  new_sp_m1524  …  newrel_f65
str            str   str    i64   i64          i64              i64
"Afghanistan"  "AF"  "AFG"  1980  null         null          …  null
"Afghanistan"  "AF"  "AFG"  1981  null         null          …  null
"Afghanistan"  "AF"  "AFG"  1982  null         null          …  null
"Afghanistan"  "AF"  "AFG"  1983  null         null          …  null
"Afghanistan"  "AF"  "AFG"  1984  null         null          …  null
…              …     …      …     …            …             …  …
"Zimbabwe"     "ZW"  "ZWE"  2009  125          578           …  null
"Zimbabwe"     "ZW"  "ZWE"  2010  150          710           …  null
"Zimbabwe"     "ZW"  "ZWE"  2011  152          784           …  null
"Zimbabwe"     "ZW"  "ZWE"  2012  120          783           …  null
"Zimbabwe"     "ZW"  "ZWE"  2013  null         null          …  725

A more challenging situation occurs when you have multiple variables crammed into the column names. For example, take the who dataset above.

country, iso2, iso3, and year are already variables, so they can be left as is. But the columns from new_sp_m014 to newrel_f65 encode four variables in their names:

  • the new_/new prefix indicates these are counts of new cases. This dataset only contains new cases, so we’ll ignore it here because it’s constant.

  • sp/sn/ep/rel describes how the case was diagnosed.

  • m/f gives the gender.

  • 014/1524/2534/3544/4554/5564/65 supplies the age range.

Sticking to the same pattern as before, we unpivot into long form; however, after unpivoting, we extract the relevant fields into new columns using Polars’ efficient string methods - for this use case a regex is a natural fit:

regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
expression = pl.col("variable").str.extract_groups(regex)
(
    who.unpivot(
        index=["country", "iso2", "iso3", "year"],
        value_name="count",
    )
    .with_columns(variable=expression)
    .unnest("variable")
)
shape: (405_440, 8)
country        iso2  iso3   year  diagnosis  gender  age    count
str            str   str    i64   str        str     str    i64
"Afghanistan"  "AF"  "AFG"  1980  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1981  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1982  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1983  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1984  "sp"       "m"     "014"  null
…              …     …      …     …          …       …      …
"Zimbabwe"     "ZW"  "ZWE"  2009  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2010  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2011  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2012  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2013  "rel"      "f"     "65"   725

Multiple observations per row#

So far, we have been working with data frames that have one observation per row, but many important pivoting problems involve multiple observations per row. You can usually recognise this case because the name of the column that you want to appear in the output is part of the column name in the input. In this section, you’ll learn how to pivot this sort of data.

household = pl.read_csv("./Data_files/household.csv", null_values="NA")
household
shape: (5, 5)
family  dob_child1    dob_child2    name_child1  name_child2
i64     str           str           str          str
1       "1998-11-26"  "2000-01-29"  "Susan"      "Jose"
2       "1996-06-22"  null          "Mark"       null
3       "2002-07-11"  "2004-04-05"  "Sam"        "Seth"
4       "2004-10-10"  "2009-08-27"  "Craig"      "Khai"
5       "2000-12-05"  "2005-02-28"  "Parker"     "Gracie"

Note that we have two pieces of information (or values) for each child: their name and their dob (date of birth). These need to go into separate columns in the result.

Steps involved:

  • unpivot into long form

  • split the string column into individual columns

  • pivot the dataframe, to keep dob and name as headers

expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["header", "child"])
(
    household.unpivot(index="family")
    .drop_nulls("value")
    .with_columns(variable=expression)
    .unnest("variable")
    .pivot(index=["family", "child"], on="header", values="value")
)
shape: (9, 4)
family  child     dob           name
i64     str       str           str
1       "child1"  "1998-11-26"  "Susan"
2       "child1"  "1996-06-22"  "Mark"
3       "child1"  "2002-07-11"  "Sam"
4       "child1"  "2004-10-10"  "Craig"
5       "child1"  "2000-12-05"  "Parker"
1       "child2"  "2000-01-29"  "Jose"
3       "child2"  "2004-04-05"  "Seth"
4       "child2"  "2009-08-27"  "Khai"
5       "child2"  "2005-02-28"  "Gracie"

In the household data above, we had just one header to retain from the column names; what if we had multiple?

manufacturer = pl.DataFrame(
    {
        "Sony | TV | Model | value": ["A222", "A234", "A4345"],
        "Sony | TV | Quantity | value": [5, 5, 4],
        "Sony | TV | Max-quant | value": [10, 9, 9],
        "Panasonic | TV | Model | value": ["T232", "S3424", "X3421"],
        "Panasonic | TV | Quantity | value": [1, 5, 1],
        "Panasonic | TV | Max-quant | value": [10, 12, 11],
        "Sanyo | Radio | Model | value": ["S111", "S1s1", "S1s2"],
        "Sanyo | Radio | Quantity | value": [4, 2, 4],
        "Sanyo | Radio | Max-quant | value": [9, 9, 10],
    }
)

manufacturer
shape: (3, 9)
Sony | TV | Model | value  Sony | TV | Quantity | value  Sony | TV | Max-quant | value  …  Sanyo | Radio | Max-quant | value
str                        i64                           i64                               i64
"A222"                     5                             10                             …  9
"A234"                     5                             9                              …  9
"A4345"                    4                             9                              …  10

Again, we stick to the same pattern as with the household reshaping - unpivot into long form, split into individual columns, and finally pivot the relevant column:

expression = pl.col("variable").str.split("|")
expression = expression.list.to_struct(
    fields=["Manufacturer", "Device", "header", "counts"]
)
(
    manufacturer.unpivot(index=None)
    .with_columns(variable=expression)
    .unnest("variable")
    .with_columns(cs.exclude("value").str.strip_chars())
    .with_row_index(name="index")
    # ensure the pivoting operation does not encounter duplicate entries
    .with_columns(pl.cum_count("index").over("Manufacturer", "Device", "header"))
    .pivot(index=["index", "Manufacturer", "Device"], on="header", values="value")
    .drop("index")
    .with_columns(cs.starts_with("Q", "Max").as_expr().cast(pl.Int16))
)
shape: (9, 5)
Manufacturer  Device   Model    Quantity  Max-quant
str           str      str      i16       i16
"Sony"        "TV"     "A222"   5         10
"Sony"        "TV"     "A234"   5         9
"Sony"        "TV"     "A4345"  4         9
"Panasonic"   "TV"     "T232"   1         10
"Panasonic"   "TV"     "S3424"  5         12
"Panasonic"   "TV"     "X3421"  1         11
"Sanyo"       "Radio"  "S111"   4         9
"Sanyo"       "Radio"  "S1s1"   2         9
"Sanyo"       "Radio"  "S1s2"   4         10

What if our header is actually composed of different parts of a column name? This is best explained with an example:

df_mean = pl.DataFrame(
    {
        "x_1_mean": [10],
        "x_2_mean": [20],
        "y_1_mean": [30],
        "y_2_mean": [40],
        "unit": [50],
    }
)

df_mean
shape: (1, 5)
x_1_mean  x_2_mean  y_1_mean  y_2_mean  unit
i64       i64       i64       i64       i64
10        20        30        40        50

For the dataframe above, we wish to lump similar columns together - specifically, we want to lump the x* columns into one column, and likewise for the y* columns.

Steps:

  • unpivot into long form

  • split the string and unnest into individual columns

  • combine the relevant columns into a single header column

  • pivot the dataframe to get the desired output

out = df_mean.unpivot(index="unit")
expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["letter", "num", "agg"])
out = out.with_columns(variable=expression).unnest("variable")
expression = pl.concat_str("letter", "agg").alias("header")
(
    out.select(cs.exclude("letter", "agg"), expression)
    .pivot(index=["num", "unit"], on="header", values="value")
    .cast({"num": pl.Int8})
)
shape: (2, 4)
num  unit  xmean  ymean
i8   i64   i64    i64
1    50    10     30
2    50    20     40

What if we want multiple header columns and multiple value columns? Let’s have a look at this scenario:

df = pl.DataFrame(
    {
        "City": ["Houston", "Austin", "Hoover"],
        "State": ["Texas", "Texas", "Alabama"],
        "Name": ["Aria", "Penelope", "Niko"],
        "Mango": [4, 10, 90],
        "Orange": [10, 8, 14],
        "Watermelon": [40, 99, 43],
        "Gin": [16, 200, 34],
        "Vodka": [20, 33, 18],
    },
)

df
shape: (3, 8)
City       State      Name        Mango  Orange  Watermelon  Gin  Vodka
str        str        str         i64    i64     i64         i64  i64
"Houston"  "Texas"    "Aria"      4      10      40          16   20
"Austin"   "Texas"    "Penelope"  10     8       99          200  33
"Hoover"   "Alabama"  "Niko"      90     14      43          34   18

The desired output is below:

      City    State       Fruit  Pounds  Drink  Ounces
0  Houston    Texas       Mango       4    Gin    16.0
1   Austin    Texas       Mango      10    Gin   200.0
2   Hoover  Alabama       Mango      90    Gin    34.0
3  Houston    Texas      Orange      10  Vodka    20.0
4   Austin    Texas      Orange       8  Vodka    33.0
5   Hoover  Alabama      Orange      14  Vodka    18.0
6  Houston    Texas  Watermelon      40    nan     NaN
7   Austin    Texas  Watermelon      99    nan     NaN
8   Hoover  Alabama  Watermelon      43    nan     NaN

From the expected output above, the goal is to pair the Fruits (headers) with the Drinks (values) in separate columns. How can we achieve this in Polars?

We need to create a pairing between the fruits and the drinks:

('Mango', 'Gin'), ('Orange', 'Vodka'), ('Watermelon', None)

We can reuse an idea from earlier, by pairing the relevant columns, using structs:

index = ["City", "State"]
expression = [
    pl.struct(Pounds="Mango", Ounces="Gin").alias("Mango|Gin"),
    pl.struct(Pounds="Orange", Ounces="Vodka").alias("Orange|Vodka"),
    pl.struct(Pounds="Watermelon").alias("Watermelon"),
]
(
    df.select(index + expression)
    .unpivot(index=index)
    .unnest("value")
    .with_columns(
        variable=pl.col("variable")
        .str.split("|")
        .list.to_struct(fields=["Fruit", "Drink"])
    )
    .unnest("variable")
)
shape: (9, 6)
City       State      Fruit         Drink    Pounds  Ounces
str        str        str           str      i64     i64
"Houston"  "Texas"    "Mango"       "Gin"    4       16
"Austin"   "Texas"    "Mango"       "Gin"    10      200
"Hoover"   "Alabama"  "Mango"       "Gin"    90      34
"Houston"  "Texas"    "Orange"      "Vodka"  10      20
"Austin"   "Texas"    "Orange"      "Vodka"  8       33
"Hoover"   "Alabama"  "Orange"      "Vodka"  14      18
"Houston"  "Texas"    "Watermelon"  null     40      null
"Austin"   "Texas"    "Watermelon"  null     99      null
"Hoover"   "Alabama"  "Watermelon"  null     43      null

Note the steps:

  • create the structs to pair the relevant columns - preprocessing

  • unpivot

  • unnest the value struct - post-processing

  • split the variable column - post-processing

  • unnest again - post-processing

Summary#

This blog post showed various ways to reshape data into long form in Polars. There are scenarios, however, where we may need to tweak our approach to significantly improve performance. In another blog post, we’ll see how to make this reshaping more efficient for such conditions.
