Reshape Data in Polars Efficiently from Wide to Long Form - Part I#
Introduction#
You’ve got data in a wide form (lots of columns), and you wish to transform it into a long form (fewer columns, more rows), to make it tidy or for ease of analysis. In Polars, this is achieved with the unpivot method. There are scenarios, however, where some processing has to be done before and/or after unpivoting. This article shows how to reshape data for such scenarios.
Most of the examples here are based on the tidyr pivoting vignette.
Polars’ eager API is used in the examples below; for better performance, or in production, the lazy API is recommended (a minimal lazy sketch follows the first example below).
import polars as pl
import polars.selectors as cs
import sys
print("polars version :", pl.__version__)
print("python version :", sys.version)
polars version : 1.7.1
python version : 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
String data in column names#
relig_income = pl.read_csv("./Data_files/relig_income.csv")
relig_income
religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k | $75-100k | $100-150k | >150k | Don't know/refused |
---|---|---|---|---|---|---|---|---|---|---|
str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
"Agnostic" | 27 | 34 | 60 | 81 | 76 | 137 | 122 | 109 | 84 | 96 |
"Atheist" | 12 | 27 | 37 | 52 | 35 | 70 | 73 | 59 | 74 | 76 |
"Buddhist" | 27 | 21 | 30 | 34 | 33 | 58 | 62 | 39 | 53 | 54 |
"Catholic" | 418 | 617 | 732 | 670 | 638 | 1116 | 949 | 792 | 633 | 1489 |
"Don’t know/refused" | 15 | 14 | 15 | 11 | 10 | 35 | 21 | 17 | 18 | 116 |
… | … | … | … | … | … | … | … | … | … | … |
"Orthodox" | 13 | 17 | 23 | 32 | 32 | 47 | 38 | 42 | 46 | 73 |
"Other Christian" | 9 | 7 | 11 | 13 | 13 | 14 | 18 | 14 | 12 | 18 |
"Other Faiths" | 20 | 33 | 40 | 46 | 49 | 63 | 46 | 40 | 41 | 71 |
"Other World Religions" | 5 | 2 | 3 | 4 | 2 | 7 | 3 | 4 | 4 | 8 |
"Unaffiliated" | 217 | 299 | 374 | 365 | 341 | 528 | 407 | 321 | 258 | 597 |
The relig_income dataset stores counts based on a survey which (among other things) asked people about their religion and annual income. This dataset contains three variables:
religion, stored in the rows,
income spread across the column names, and
count stored in the cell values.
We can flip this into long form, and ensure each variable has its own column - this can be achieved with the unpivot method:
relig_income.unpivot(index="religion", variable_name="income", value_name="count")
religion | income | count |
---|---|---|
str | str | i64 |
"Agnostic" | "<$10k" | 27 |
"Atheist" | "<$10k" | 12 |
"Buddhist" | "<$10k" | 27 |
"Catholic" | "<$10k" | 418 |
"Don’t know/refused" | "<$10k" | 15 |
… | … | … |
"Orthodox" | "Don't know/refused" | 73 |
"Other Christian" | "Don't know/refused" | 18 |
"Other Faiths" | "Don't know/refused" | 71 |
"Other World Religions" | "Don't know/refused" | 8 |
"Unaffiliated" | "Don't know/refused" | 597 |
No preprocessing required here.
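As mentioned in the introduction, the lazy API is recommended for better performance. Here is a minimal lazy sketch of the same reshaping - an illustration, assuming the same CSV path as above; unpivot is available on LazyFrame as well:
(
    pl.scan_csv("./Data_files/relig_income.csv")  # lazy scan: nothing is read yet
    .unpivot(index="religion", variable_name="income", value_name="count")
    .collect()  # materialise the query into a DataFrame
)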
Numeric data in column names#
billboard = pl.read_csv("./Data_files/billboard.csv", null_values="NA")
billboard = billboard.with_columns(cs.starts_with("wk").cast(pl.Int32))
billboard
year | artist | track | time | date.entered | wk1 | wk2 | wk3 | wk4 | wk5 | wk6 | wk7 | wk8 | wk9 | wk10 | wk11 | wk12 | wk13 | wk14 | wk15 | wk16 | wk17 | wk18 | wk19 | wk20 | wk21 | wk22 | wk23 | wk24 | wk25 | wk26 | wk27 | wk28 | wk29 | wk30 | wk31 | wk32 | … | wk40 | wk41 | wk42 | wk43 | wk44 | wk45 | wk46 | wk47 | wk48 | wk49 | wk50 | wk51 | wk52 | wk53 | wk54 | wk55 | wk56 | wk57 | wk58 | wk59 | wk60 | wk61 | wk62 | wk63 | wk64 | wk65 | wk66 | wk67 | wk68 | wk69 | wk70 | wk71 | wk72 | wk73 | wk74 | wk75 | wk76 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | … | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 |
2000 | "2 Pac" | "Baby Don't Cry (Keep..." | "4:22" | "2000-02-26" | 87 | 82 | 72 | 77 | 87 | 94 | 99 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "2Ge+her" | "The Hardest Part Of ..." | "3:15" | "2000-09-02" | 91 | 87 | 92 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "3 Doors Down" | "Kryptonite" | "3:53" | "2000-04-08" | 81 | 70 | 68 | 67 | 66 | 57 | 54 | 53 | 51 | 51 | 51 | 51 | 47 | 44 | 38 | 28 | 22 | 18 | 18 | 14 | 12 | 7 | 6 | 6 | 6 | 5 | 5 | 4 | 4 | 4 | 4 | 3 | … | 15 | 14 | 13 | 14 | 16 | 17 | 21 | 22 | 24 | 28 | 33 | 42 | 42 | 49 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "3 Doors Down" | "Loser" | "4:24" | "2000-10-21" | 76 | 76 | 72 | 69 | 67 | 65 | 55 | 59 | 62 | 61 | 61 | 59 | 61 | 66 | 72 | 76 | 75 | 67 | 73 | 70 | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "504 Boyz" | "Wobble Wobble" | "3:35" | "2000-04-15" | 57 | 34 | 25 | 17 | 17 | 31 | 36 | 49 | 53 | 57 | 64 | 70 | 75 | 76 | 78 | 85 | 92 | 96 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2000 | "Yankee Grey" | "Another Nine Minutes" | "3:10" | "2000-04-29" | 86 | 83 | 77 | 74 | 83 | 79 | 88 | 95 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "Yearwood, Trisha" | "Real Live Woman" | "3:55" | "2000-04-01" | 85 | 83 | 83 | 82 | 81 | 91 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "Ying Yang Twins" | "Whistle While You Tw..." | "4:19" | "2000-03-18" | 95 | 94 | 91 | 85 | 84 | 78 | 74 | 78 | 85 | 89 | 97 | 96 | 99 | 99 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "Zombie Nation" | "Kernkraft 400" | "3:30" | "2000-09-02" | 99 | 99 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "matchbox twenty" | "Bent" | "4:12" | "2000-04-29" | 60 | 37 | 29 | 24 | 22 | 21 | 18 | 16 | 13 | 12 | 8 | 6 | 1 | 2 | 3 | 2 | 2 | 3 | 4 | 5 | 4 | 4 | 6 | 9 | 12 | 13 | 19 | 20 | 20 | 24 | 29 | 28 | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
The billboard dataset records the billboard rank of songs in the year 2000. It has a form similar to the relig_income data, but the data encoded in the column names is really a number, not a string.
Let’s unpivot the data into a tidy form:
(
billboard.unpivot(
index=~cs.starts_with("wk"), variable_name="week", value_name="rank"
)
# irrelevant nulls
.drop_nulls(subset="rank")
)
year | artist | track | time | date.entered | week | rank |
---|---|---|---|---|---|---|
i64 | str | str | str | str | str | i32 |
2000 | "2 Pac" | "Baby Don't Cry (Keep..." | "4:22" | "2000-02-26" | "wk1" | 87 |
2000 | "2Ge+her" | "The Hardest Part Of ..." | "3:15" | "2000-09-02" | "wk1" | 91 |
2000 | "3 Doors Down" | "Kryptonite" | "3:53" | "2000-04-08" | "wk1" | 81 |
2000 | "3 Doors Down" | "Loser" | "4:24" | "2000-10-21" | "wk1" | 76 |
2000 | "504 Boyz" | "Wobble Wobble" | "3:35" | "2000-04-15" | "wk1" | 57 |
… | … | … | … | … | … | … |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | "wk63" | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | "wk63" | 45 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | "wk64" | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | "wk64" | 50 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | "wk65" | 49 |
It would be nice to easily determine how long each song stayed in the charts, but to do that, we’ll need to convert the week variable to an integer.
This is where post-processing comes into play after the unpivoting - extract the number from the string column, and cast the extracted column to an integer dtype:
(
billboard.unpivot(
index=~cs.starts_with("wk"), variable_name="week", value_name="rank"
)
# irrelevant nulls
.drop_nulls(subset="rank").with_columns(
        week=pl.col("week").str.slice(2).cast(pl.Int16)  # drop the leading "wk"
)
)
year | artist | track | time | date.entered | week | rank |
---|---|---|---|---|---|---|
i64 | str | str | str | str | i16 | i32 |
2000 | "2 Pac" | "Baby Don't Cry (Keep..." | "4:22" | "2000-02-26" | 1 | 87 |
2000 | "2Ge+her" | "The Hardest Part Of ..." | "3:15" | "2000-09-02" | 1 | 91 |
2000 | "3 Doors Down" | "Kryptonite" | "3:53" | "2000-04-08" | 1 | 81 |
2000 | "3 Doors Down" | "Loser" | "4:24" | "2000-10-21" | 1 | 76 |
2000 | "504 Boyz" | "Wobble Wobble" | "3:35" | "2000-04-15" | 1 | 57 |
… | … | … | … | … | … | … |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 63 | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | 63 | 45 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 64 | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | 64 | 50 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 65 | 49 |
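Now that week is an integer, the earlier question - how long each song stayed in the charts - reduces to a simple aggregation. A minimal sketch (an addition for illustration; the row order of a group_by result is not guaranteed):
tidy = (
    billboard.unpivot(
        index=~cs.starts_with("wk"), variable_name="week", value_name="rank"
    )
    .drop_nulls(subset="rank")
    .with_columns(week=pl.col("week").str.slice(2).cast(pl.Int16))
)
# each surviving row is one charted week, so the group size is the stay length
tidy.group_by("artist", "track").agg(pl.len().alias("weeks_on_chart"))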
Many variables in column names#
who = pl.read_csv("./Data_files/who.csv", null_values="NA")
who
country | iso2 | iso3 | year | new_sp_m014 | new_sp_m1524 | new_sp_m2534 | new_sp_m3544 | new_sp_m4554 | new_sp_m5564 | new_sp_m65 | new_sp_f014 | new_sp_f1524 | new_sp_f2534 | new_sp_f3544 | new_sp_f4554 | new_sp_f5564 | new_sp_f65 | new_sn_m014 | new_sn_m1524 | new_sn_m2534 | new_sn_m3544 | new_sn_m4554 | new_sn_m5564 | new_sn_m65 | new_sn_f014 | new_sn_f1524 | new_sn_f2534 | new_sn_f3544 | new_sn_f4554 | new_sn_f5564 | new_sn_f65 | new_ep_m014 | new_ep_m1524 | new_ep_m2534 | new_ep_m3544 | new_ep_m4554 | new_ep_m5564 | new_ep_m65 | new_ep_f014 | new_ep_f1524 | new_ep_f2534 | new_ep_f3544 | new_ep_f4554 | new_ep_f5564 | new_ep_f65 | newrel_m014 | newrel_m1524 | newrel_m2534 | newrel_m3544 | newrel_m4554 | newrel_m5564 | newrel_m65 | newrel_f014 | newrel_f1524 | newrel_f2534 | newrel_f3544 | newrel_f4554 | newrel_f5564 | newrel_f65 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | str | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
"Afghanistan" | "AF" | "AFG" | 1980 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1981 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1982 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1983 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1984 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
"Zimbabwe" | "ZW" | "ZWE" | 2009 | 125 | 578 | null | 3471 | 681 | 293 | 192 | 180 | 873 | null | 3028 | 419 | 229 | 126 | 1560 | 860 | null | 6496 | 1655 | 882 | 861 | 1425 | 1334 | null | 7023 | 1551 | 729 | 514 | 244 | 266 | 0 | 1922 | 491 | 231 | 223 | 210 | 394 | 0 | 1944 | 438 | 182 | 138 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2010 | 150 | 710 | 2208 | 1682 | 761 | 350 | 252 | 173 | 974 | 2185 | 1283 | 490 | 265 | 171 | 1826 | 821 | 3342 | 3270 | 1545 | 882 | 864 | 1732 | 1282 | 4013 | 2851 | 1377 | 789 | 563 | 270 | 243 | 902 | 868 | 418 | 229 | 192 | 220 | 319 | 1058 | 677 | 338 | 181 | 146 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2011 | 152 | 784 | 2467 | 2071 | 780 | 377 | 278 | 174 | 1084 | 2161 | 1386 | 448 | 274 | 160 | 1364 | 596 | 2473 | 2813 | 1264 | 702 | 728 | 1271 | 947 | 2754 | 2216 | 962 | 587 | 495 | 250 | 195 | 746 | 796 | 342 | 172 | 172 | 209 | 318 | 802 | 640 | 284 | 137 | 129 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2012 | 120 | 783 | 2421 | 2086 | 796 | 360 | 271 | 173 | 939 | 2053 | 1286 | 483 | 231 | 161 | 1169 | 613 | 2302 | 2657 | 1154 | 708 | 796 | 1008 | 888 | 2287 | 1957 | 829 | 516 | 432 | 233 | 214 | 658 | 789 | 331 | 178 | 182 | 208 | 319 | 710 | 579 | 228 | 140 | 143 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2013 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 1315 | 1642 | 5331 | 5363 | 2349 | 1206 | 1208 | 1252 | 2069 | 4649 | 3526 | 1453 | 811 | 725 |
A more challenging situation occurs when you have multiple variables crammed into the column names. For example, take the who dataset above.
country, iso2, iso3, and year are already variables, so they can be left as is. But the columns from new_sp_m014 to newrel_f65 encode four variables in their names:
the new_/new prefix indicates these are counts of new cases. This dataset only contains new cases, so we’ll ignore it here because it’s constant.
sp/rel/ep describe how the case was diagnosed.
m/f gives the gender.
014/1524/2534/3544/4554/5564/65 supplies the age range.
Sticking to the same pattern as before, we unpivot into long form; however, after unpivoting, we extract the relevant fields into new columns using Polars’ efficient string methods - for this use case a regex is a natural fit:
regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
expression = pl.col("variable").str.extract_groups(regex)
(
who.unpivot(
index=["country", "iso2", "iso3", "year"],
value_name="count",
)
.with_columns(variable=expression)
.unnest("variable")
)
country | iso2 | iso3 | year | diagnosis | gender | age | count |
---|---|---|---|---|---|---|---|
str | str | str | i64 | str | str | str | i64 |
"Afghanistan" | "AF" | "AFG" | 1980 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1981 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1982 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1983 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1984 | "sp" | "m" | "014" | null |
… | … | … | … | … | … | … | … |
"Zimbabwe" | "ZW" | "ZWE" | 2009 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2010 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2011 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2012 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2013 | "rel" | "f" | "65" | 725 |
Multiple observations per row#
So far, we have been working with data frames that have one observation per row, but many important pivoting problems involve multiple observations per row. You can usually recognise this case because the name of the column that you want to appear in the output is part of the column name in the input. In this section, you’ll learn how to pivot this sort of data.
household = pl.read_csv("./Data_files/household.csv", null_values="NA")
household
family | dob_child1 | dob_child2 | name_child1 | name_child2 |
---|---|---|---|---|
i64 | str | str | str | str |
1 | "1998-11-26" | "2000-01-29" | "Susan" | "Jose" |
2 | "1996-06-22" | null | "Mark" | null |
3 | "2002-07-11" | "2004-04-05" | "Sam" | "Seth" |
4 | "2004-10-10" | "2009-08-27" | "Craig" | "Khai" |
5 | "2000-12-05" | "2005-02-28" | "Parker" | "Gracie" |
Note that we have two pieces of information (or values) for each child: their name and their dob (date of birth). These need to go into separate columns in the result.
Steps involved:
unpivot into long form
split the string into individual columns
pivot the dataframe, to keep dob and name as headers
expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["header", "child"])
(
household.unpivot(index="family")
.drop_nulls("value")
.with_columns(variable=expression)
.unnest("variable")
.pivot(index=["family", "child"], on="header", values="value")
)
family | child | dob | name |
---|---|---|---|
i64 | str | str | str |
1 | "child1" | "1998-11-26" | "Susan" |
2 | "child1" | "1996-06-22" | "Mark" |
3 | "child1" | "2002-07-11" | "Sam" |
4 | "child1" | "2004-10-10" | "Craig" |
5 | "child1" | "2000-12-05" | "Parker" |
1 | "child2" | "2000-01-29" | "Jose" |
3 | "child2" | "2004-04-05" | "Seth" |
4 | "child2" | "2009-08-27" | "Khai" |
5 | "child2" | "2005-02-28" | "Gracie" |
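A possible follow-up (an illustration, not part of the original pipeline): dob is still a string after the pivot; parsing it into a proper Date dtype makes date arithmetic possible downstream:
(
    household.unpivot(index="family")
    .drop_nulls("value")
    .with_columns(variable=expression)
    .unnest("variable")
    .pivot(index=["family", "child"], on="header", values="value")
    # parse the ISO-formatted strings into a Date column
    .with_columns(pl.col("dob").str.to_date("%Y-%m-%d"))
)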
In the household data above, we had just one header to retain from the column names; what if we had multiple?
manufacturer = pl.DataFrame(
{
"Sony | TV | Model | value": ["A222", "A234", "A4345"],
"Sony | TV | Quantity | value": [5, 5, 4],
"Sony | TV | Max-quant | value": [10, 9, 9],
"Panasonic | TV | Model | value": ["T232", "S3424", "X3421"],
"Panasonic | TV | Quantity | value": [1, 5, 1],
"Panasonic | TV | Max-quant | value": [10, 12, 11],
"Sanyo | Radio | Model | value": ["S111", "S1s1", "S1s2"],
"Sanyo | Radio | Quantity | value": [4, 2, 4],
"Sanyo | Radio | Max-quant | value": [9, 9, 10],
}
)
manufacturer
Sony | TV | Model | value | Sony | TV | Quantity | value | Sony | TV | Max-quant | value | Panasonic | TV | Model | value | Panasonic | TV | Quantity | value | Panasonic | TV | Max-quant | value | Sanyo | Radio | Model | value | Sanyo | Radio | Quantity | value | Sanyo | Radio | Max-quant | value |
---|---|---|---|---|---|---|---|---|
str | i64 | i64 | str | i64 | i64 | str | i64 | i64 |
"A222" | 5 | 10 | "T232" | 1 | 10 | "S111" | 4 | 9 |
"A234" | 5 | 9 | "S3424" | 5 | 12 | "S1s1" | 2 | 9 |
"A4345" | 4 | 9 | "X3421" | 1 | 11 | "S1s2" | 4 | 10 |
Again, we stick to the same pattern as with the household reshaping - unpivot into long form, split into individual columns, and finally pivot the relevant column:
expression = pl.col("variable").str.split("|")
expression = expression.list.to_struct(
fields=["Manufacturer", "Device", "header", "counts"]
)
(
manufacturer.unpivot(index=None)
.with_columns(variable=expression)
.unnest("variable")
.with_columns(cs.exclude("value").str.strip_chars())
.with_row_index(name="index")
# ensure the pivoting operation does not encounter duplicate entries
.with_columns(pl.cum_count("index").over("Manufacturer", "Device", "header"))
.pivot(index=["index", "Manufacturer", "Device"], on="header", values="value")
.drop("index")
.with_columns(cs.starts_with("Q", "Max").as_expr().cast(pl.Int16))
)
Manufacturer | Device | Model | Quantity | Max-quant |
---|---|---|---|---|
str | str | str | i16 | i16 |
"Sony" | "TV" | "A222" | 5 | 10 |
"Sony" | "TV" | "A234" | 5 | 9 |
"Sony" | "TV" | "A4345" | 4 | 9 |
"Panasonic" | "TV" | "T232" | 1 | 10 |
"Panasonic" | "TV" | "S3424" | 5 | 12 |
"Panasonic" | "TV" | "X3421" | 1 | 11 |
"Sanyo" | "Radio" | "S111" | 4 | 9 |
"Sanyo" | "Radio" | "S1s1" | 2 | 9 |
"Sanyo" | "Radio" | "S1s2" | 4 | 10 |
What if our header is actually composed of different parts of a column name? This is best explained with an example:
df_mean = pl.DataFrame(
{
"x_1_mean": [10],
"x_2_mean": [20],
"y_1_mean": [30],
"y_2_mean": [40],
"unit": [50],
}
)
df_mean
x_1_mean | x_2_mean | y_1_mean | y_2_mean | unit |
---|---|---|---|---|
i64 | i64 | i64 | i64 | i64 |
10 | 20 | 30 | 40 | 50 |
For the dataframe above, we wish to lump similar columns together - specifically, we want to lump the x* columns into one column, and likewise for the y* columns.
Steps:
unpivot into long form
split the string and unnest into individual columns
combine the relevant columns into a single header column
pivot the dataframe to get the desired output
out = df_mean.unpivot(index="unit")
expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["letter", "num", "agg"])
out = out.with_columns(variable=expression).unnest("variable")
expression = pl.concat_str("letter", "agg").alias("header")
(
out.select(cs.exclude("letter", "agg"), expression)
.pivot(index=["num", "unit"], on="header", values="value")
.cast({"num": pl.Int8})
)
num | unit | xmean | ymean |
---|---|---|---|
i8 | i64 | i64 | i64 |
1 | 50 | 10 | 30 |
2 | 50 | 20 | 40 |
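A small design note: pl.concat_str joins the parts directly, which is why the headers come out as xmean and ymean. If you would rather keep a separator in the headers (x_mean, y_mean), concat_str accepts a separator argument:
expression = pl.concat_str("letter", "agg", separator="_").alias("header")  # "x_mean", "y_mean"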
What if we want multiple header columns and multiple value columns? Let’s have a look at this scenario:
df = pl.DataFrame(
{
"City": ["Houston", "Austin", "Hoover"],
"State": ["Texas", "Texas", "Alabama"],
"Name": ["Aria", "Penelope", "Niko"],
"Mango": [4, 10, 90],
"Orange": [10, 8, 14],
"Watermelon": [40, 99, 43],
"Gin": [16, 200, 34],
"Vodka": [20, 33, 18],
},
)
df
City | State | Name | Mango | Orange | Watermelon | Gin | Vodka |
---|---|---|---|---|---|---|---|
str | str | str | i64 | i64 | i64 | i64 | i64 |
"Houston" | "Texas" | "Aria" | 4 | 10 | 40 | 16 | 20 |
"Austin" | "Texas" | "Penelope" | 10 | 8 | 99 | 200 | 33 |
"Hoover" | "Alabama" | "Niko" | 90 | 14 | 43 | 34 | 18 |
The desired output is below:
City State Fruit Pounds Drink Ounces
0 Houston Texas Mango 4 Gin 16.0
1 Austin Texas Mango 10 Gin 200.0
2 Hoover Alabama Mango 90 Gin 34.0
3 Houston Texas Orange 10 Vodka 20.0
4 Austin Texas Orange 8 Vodka 33.0
5 Hoover Alabama Orange 14 Vodka 18.0
6 Houston Texas Watermelon 40 nan NaN
7 Austin Texas Watermelon 99 nan NaN
8 Hoover Alabama Watermelon 43 nan NaN
From the expected output above, the goal is to pair the Fruits (headers) with the Drinks (values), in separate columns. How can we achieve this in Polars?
We need to create a pairing between the fruits and the drinks:
('Mango', 'Gin'), ('Orange', 'Vodka'), ('Watermelon', None)
We can reuse an idea from earlier, by pairing the relevant columns, using structs:
index = ["City", "State"]
expression = [
pl.struct(Pounds="Mango", Ounces="Gin").alias("Mango|Gin"),
pl.struct(Pounds="Orange", Ounces="Vodka").alias("Orange|Vodka"),
pl.struct(Pounds="Watermelon").alias("Watermelon"),
]
(
df.select(index + expression)
.unpivot(index=index)
.unnest("value")
.with_columns(
variable=pl.col("variable")
.str.split("|")
.list.to_struct(fields=["Fruit", "Drink"])
)
.unnest("variable")
)
City | State | Fruit | Drink | Pounds | Ounces |
---|---|---|---|---|---|
str | str | str | str | i64 | i64 |
"Houston" | "Texas" | "Mango" | "Gin" | 4 | 16 |
"Austin" | "Texas" | "Mango" | "Gin" | 10 | 200 |
"Hoover" | "Alabama" | "Mango" | "Gin" | 90 | 34 |
"Houston" | "Texas" | "Orange" | "Vodka" | 10 | 20 |
"Austin" | "Texas" | "Orange" | "Vodka" | 8 | 33 |
"Hoover" | "Alabama" | "Orange" | "Vodka" | 14 | 18 |
"Houston" | "Texas" | "Watermelon" | null | 40 | null |
"Austin" | "Texas" | "Watermelon" | null | 99 | null |
"Hoover" | "Alabama" | "Watermelon" | null | 43 | null |
Note the steps:
create the structs to pair the relevant columns - preprocessing
unpivot
unnest - post-processing
string splitting - post-processing
unnest - post-processing
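For a longer pairing, the struct expressions need not be written out by hand. A sketch (an illustration, not in the original) that builds the same list of expressions from the pairing tuples:
pairs = [("Mango", "Gin"), ("Orange", "Vodka"), ("Watermelon", None)]
expression = [
    # an unpaired fruit gets a struct with only the Pounds field
    pl.struct(Pounds=fruit).alias(fruit)
    if drink is None
    else pl.struct(Pounds=fruit, Ounces=drink).alias(f"{fruit}|{drink}")
    for fruit, drink in pairs
]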
Summary#
This blog post shows various ways to reshape data into long form in Polars. There are scenarios, however, where we may need to tweak our approach to significantly improve performance. In another blog post, we’ll see how to make this reshaping more efficient in such scenarios.