Reshape Data in Polars Efficiently from Wide to Long Form - Part III
Introduction
This post continues from a previous post, where I highlighted steps to significantly improve performance when reshaping a DataFrame from wide to long form. The solutions proffered in that post, however, were quite verbose compared to the naive solutions, and were tailored to the specific problems.
In this article I introduce two functions - pivot_longer and pivot_longer_spec - from the pyjanitor library. These functions, ports of the functions of the same name in R, offer conciseness while still being performant.
I am a contributor to the pyjanitor library.
We will reuse most of the examples from a previous blog post to highlight the various usages of pivot_longer and pivot_longer_spec. First things first, let’s install and import the pyjanitor library:
# pip install pyjanitor
import janitor as jn
import janitor.polars
import polars as pl
import polars.selectors as cs
from janitor.polars import pivot_longer_spec
pl.__version__
'1.7.1'
jn.__version__
'0.29.1'
String data in column names
relig_income = pl.read_csv("./Data_files/relig_income.csv")
relig_income
religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k | $75-100k | $100-150k | >150k | Don't know/refused |
---|---|---|---|---|---|---|---|---|---|---|
str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
"Agnostic" | 27 | 34 | 60 | 81 | 76 | 137 | 122 | 109 | 84 | 96 |
"Atheist" | 12 | 27 | 37 | 52 | 35 | 70 | 73 | 59 | 74 | 76 |
"Buddhist" | 27 | 21 | 30 | 34 | 33 | 58 | 62 | 39 | 53 | 54 |
"Catholic" | 418 | 617 | 732 | 670 | 638 | 1116 | 949 | 792 | 633 | 1489 |
"Don’t know/refused" | 15 | 14 | 15 | 11 | 10 | 35 | 21 | 17 | 18 | 116 |
… | … | … | … | … | … | … | … | … | … | … |
"Orthodox" | 13 | 17 | 23 | 32 | 32 | 47 | 38 | 42 | 46 | 73 |
"Other Christian" | 9 | 7 | 11 | 13 | 13 | 14 | 18 | 14 | 12 | 18 |
"Other Faiths" | 20 | 33 | 40 | 46 | 49 | 63 | 46 | 40 | 41 | 71 |
"Other World Religions" | 5 | 2 | 3 | 4 | 2 | 7 | 3 | 4 | 4 | 8 |
"Unaffiliated" | 217 | 299 | 374 | 365 | 341 | 528 | 407 | 321 | 258 | 597 |
This is an easy use case - pivot_longer simply calls unpivot under the hood:
relig_income.pivot_longer(
index="religion",
# all columns will be collated
# under names_to
names_to="income",
# all values in the columns
# will be collated into values_to
values_to="count",
)
religion | income | count |
---|---|---|
str | str | i64 |
"Agnostic" | "<$10k" | 27 |
"Atheist" | "<$10k" | 12 |
"Buddhist" | "<$10k" | 27 |
"Catholic" | "<$10k" | 418 |
"Don’t know/refused" | "<$10k" | 15 |
… | … | … |
"Orthodox" | "Don't know/refused" | 73 |
"Other Christian" | "Don't know/refused" | 18 |
"Other Faiths" | "Don't know/refused" | 71 |
"Other World Religions" | "Don't know/refused" | 8 |
"Unaffiliated" | "Don't know/refused" | 597 |
Numeric data in column names
billboard = pl.read_csv("./Data_files/billboard.csv", null_values="NA")
billboard = billboard.with_columns(cs.starts_with("wk").cast(pl.Int32))
billboard
year | artist | track | time | date.entered | wk1 | wk2 | wk3 | wk4 | wk5 | wk6 | wk7 | wk8 | wk9 | wk10 | wk11 | wk12 | wk13 | wk14 | wk15 | wk16 | wk17 | wk18 | wk19 | wk20 | wk21 | wk22 | wk23 | wk24 | wk25 | wk26 | wk27 | wk28 | wk29 | wk30 | wk31 | wk32 | … | wk40 | wk41 | wk42 | wk43 | wk44 | wk45 | wk46 | wk47 | wk48 | wk49 | wk50 | wk51 | wk52 | wk53 | wk54 | wk55 | wk56 | wk57 | wk58 | wk59 | wk60 | wk61 | wk62 | wk63 | wk64 | wk65 | wk66 | wk67 | wk68 | wk69 | wk70 | wk71 | wk72 | wk73 | wk74 | wk75 | wk76 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | … | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 |
2000 | "2 Pac" | "Baby Don't Cry (Keep..." | "4:22" | "2000-02-26" | 87 | 82 | 72 | 77 | 87 | 94 | 99 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "2Ge+her" | "The Hardest Part Of ..." | "3:15" | "2000-09-02" | 91 | 87 | 92 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "3 Doors Down" | "Kryptonite" | "3:53" | "2000-04-08" | 81 | 70 | 68 | 67 | 66 | 57 | 54 | 53 | 51 | 51 | 51 | 51 | 47 | 44 | 38 | 28 | 22 | 18 | 18 | 14 | 12 | 7 | 6 | 6 | 6 | 5 | 5 | 4 | 4 | 4 | 4 | 3 | … | 15 | 14 | 13 | 14 | 16 | 17 | 21 | 22 | 24 | 28 | 33 | 42 | 42 | 49 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "3 Doors Down" | "Loser" | "4:24" | "2000-10-21" | 76 | 76 | 72 | 69 | 67 | 65 | 55 | 59 | 62 | 61 | 61 | 59 | 61 | 66 | 72 | 76 | 75 | 67 | 73 | 70 | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "504 Boyz" | "Wobble Wobble" | "3:35" | "2000-04-15" | 57 | 34 | 25 | 17 | 17 | 31 | 36 | 49 | 53 | 57 | 64 | 70 | 75 | 76 | 78 | 85 | 92 | 96 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2000 | "Yankee Grey" | "Another Nine Minutes" | "3:10" | "2000-04-29" | 86 | 83 | 77 | 74 | 83 | 79 | 88 | 95 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "Yearwood, Trisha" | "Real Live Woman" | "3:55" | "2000-04-01" | 85 | 83 | 83 | 82 | 81 | 91 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "Ying Yang Twins" | "Whistle While You Tw..." | "4:19" | "2000-03-18" | 95 | 94 | 91 | 85 | 84 | 78 | 74 | 78 | 85 | 89 | 97 | 96 | 99 | 99 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "Zombie Nation" | "Kernkraft 400" | "3:30" | "2000-09-02" | 99 | 99 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "matchbox twenty" | "Bent" | "4:12" | "2000-04-29" | 60 | 37 | 29 | 24 | 22 | 21 | 18 | 16 | 13 | 12 | 8 | 6 | 1 | 2 | 3 | 2 | 2 | 3 | 4 | 5 | 4 | 4 | 6 | 9 | 12 | 13 | 19 | 20 | 20 | 24 | 29 | 28 | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
The aim here is to flip the data into long form, extract the numbers from the wk columns, and convert them to an integer dtype. Let’s see below how pivot_longer handles this:
(
billboard.pivot_longer(
index=["year", "artist", "track", "time", "date.entered"],
names_to="week",
values_to="rank",
names_pattern=r".+(\d)",
names_transform=pl.col("week").cast(pl.Int32),
).drop_nulls(subset="rank")
)
year | artist | track | time | date.entered | week | rank |
---|---|---|---|---|---|---|
i64 | str | str | str | str | i32 | i32 |
2000 | "2 Pac" | "Baby Don't Cry (Keep..." | "4:22" | "2000-02-26" | 1 | 87 |
2000 | "2Ge+her" | "The Hardest Part Of ..." | "3:15" | "2000-09-02" | 1 | 91 |
2000 | "3 Doors Down" | "Kryptonite" | "3:53" | "2000-04-08" | 1 | 81 |
2000 | "3 Doors Down" | "Loser" | "4:24" | "2000-10-21" | 1 | 76 |
2000 | "504 Boyz" | "Wobble Wobble" | "3:35" | "2000-04-15" | 1 | 57 |
… | … | … | … | … | … | … |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 3 | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | 3 | 45 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 4 | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | 4 | 50 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 5 | 49 |
Two things to note here:

- The names_pattern parameter offers a flexible way to extract/separate labels within column names. In this case we are only interested in the integers attached to the wk columns, and the regex achieves this easily. Only the capture groups are retained; the non-groups are discarded. The other option for splitting columns is the names_sep parameter.
- The names_transform parameter offers a way to convert the data type of the flipped columns. It uses a polars expression to achieve this.
You can use the names_sep parameter to achieve something similar to the above, though the regex is a better fit here. Splitting on wk leaves an empty string before the separator, so we capture it into a throwaway column and drop it:
(
billboard.pivot_longer(
index=["year", "artist", "track", "time", "date.entered"],
names_to=["", "week"],
values_to="rank",
names_sep="wk",
names_transform=pl.col("week").cast(pl.Int32),
)
.drop("")
.drop_nulls(subset="rank")
)
year | artist | track | time | date.entered | week | rank |
---|---|---|---|---|---|---|
i64 | str | str | str | str | i32 | i32 |
2000 | "2 Pac" | "Baby Don't Cry (Keep..." | "4:22" | "2000-02-26" | 1 | 87 |
2000 | "2Ge+her" | "The Hardest Part Of ..." | "3:15" | "2000-09-02" | 1 | 91 |
2000 | "3 Doors Down" | "Kryptonite" | "3:53" | "2000-04-08" | 1 | 81 |
2000 | "3 Doors Down" | "Loser" | "4:24" | "2000-10-21" | 1 | 76 |
2000 | "504 Boyz" | "Wobble Wobble" | "3:35" | "2000-04-15" | 1 | 57 |
… | … | … | … | … | … | … |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 63 | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | 63 | 45 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 64 | 50 |
2000 | "Lonestar" | "Amazed" | "4:25" | "1999-06-05" | 64 | 50 |
2000 | "Creed" | "Higher" | "5:16" | "1999-09-11" | 65 | 49 |
Many variables in column names
who = pl.read_csv("./Data_files/who.csv", null_values="NA")
who
country | iso2 | iso3 | year | new_sp_m014 | new_sp_m1524 | new_sp_m2534 | new_sp_m3544 | new_sp_m4554 | new_sp_m5564 | new_sp_m65 | new_sp_f014 | new_sp_f1524 | new_sp_f2534 | new_sp_f3544 | new_sp_f4554 | new_sp_f5564 | new_sp_f65 | new_sn_m014 | new_sn_m1524 | new_sn_m2534 | new_sn_m3544 | new_sn_m4554 | new_sn_m5564 | new_sn_m65 | new_sn_f014 | new_sn_f1524 | new_sn_f2534 | new_sn_f3544 | new_sn_f4554 | new_sn_f5564 | new_sn_f65 | new_ep_m014 | new_ep_m1524 | new_ep_m2534 | new_ep_m3544 | new_ep_m4554 | new_ep_m5564 | new_ep_m65 | new_ep_f014 | new_ep_f1524 | new_ep_f2534 | new_ep_f3544 | new_ep_f4554 | new_ep_f5564 | new_ep_f65 | newrel_m014 | newrel_m1524 | newrel_m2534 | newrel_m3544 | newrel_m4554 | newrel_m5564 | newrel_m65 | newrel_f014 | newrel_f1524 | newrel_f2534 | newrel_f3544 | newrel_f4554 | newrel_f5564 | newrel_f65 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | str | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
"Afghanistan" | "AF" | "AFG" | 1980 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1981 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1982 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1983 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Afghanistan" | "AF" | "AFG" | 1984 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
"Zimbabwe" | "ZW" | "ZWE" | 2009 | 125 | 578 | null | 3471 | 681 | 293 | 192 | 180 | 873 | null | 3028 | 419 | 229 | 126 | 1560 | 860 | null | 6496 | 1655 | 882 | 861 | 1425 | 1334 | null | 7023 | 1551 | 729 | 514 | 244 | 266 | 0 | 1922 | 491 | 231 | 223 | 210 | 394 | 0 | 1944 | 438 | 182 | 138 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2010 | 150 | 710 | 2208 | 1682 | 761 | 350 | 252 | 173 | 974 | 2185 | 1283 | 490 | 265 | 171 | 1826 | 821 | 3342 | 3270 | 1545 | 882 | 864 | 1732 | 1282 | 4013 | 2851 | 1377 | 789 | 563 | 270 | 243 | 902 | 868 | 418 | 229 | 192 | 220 | 319 | 1058 | 677 | 338 | 181 | 146 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2011 | 152 | 784 | 2467 | 2071 | 780 | 377 | 278 | 174 | 1084 | 2161 | 1386 | 448 | 274 | 160 | 1364 | 596 | 2473 | 2813 | 1264 | 702 | 728 | 1271 | 947 | 2754 | 2216 | 962 | 587 | 495 | 250 | 195 | 746 | 796 | 342 | 172 | 172 | 209 | 318 | 802 | 640 | 284 | 137 | 129 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2012 | 120 | 783 | 2421 | 2086 | 796 | 360 | 271 | 173 | 939 | 2053 | 1286 | 483 | 231 | 161 | 1169 | 613 | 2302 | 2657 | 1154 | 708 | 796 | 1008 | 888 | 2287 | 1957 | 829 | 516 | 432 | 233 | 214 | 658 | 789 | 331 | 178 | 182 | 208 | 319 | 710 | 579 | 228 | 140 | 143 | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
"Zimbabwe" | "ZW" | "ZWE" | 2013 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 1315 | 1642 | 5331 | 5363 | 2349 | 1206 | 1208 | 1252 | 2069 | 4649 | 3526 | 1453 | 811 | 725 |
The aim is to separate the columns into diagnosis, gender and age columns. We can reuse the idea from the previous exercise and pass a regular expression to the names_pattern parameter:
who.pivot_longer(
index=["country", "iso2", "iso3", "year"],
names_to=("diagnosis", "gender", "age"),
names_pattern=r"new_?(.*)_(.)(.*)",
values_to="count",
)
country | iso2 | iso3 | year | diagnosis | gender | age | count |
---|---|---|---|---|---|---|---|
str | str | str | i64 | str | str | str | i64 |
"Afghanistan" | "AF" | "AFG" | 1980 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1981 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1982 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1983 | "sp" | "m" | "014" | null |
"Afghanistan" | "AF" | "AFG" | 1984 | "sp" | "m" | "014" | null |
… | … | … | … | … | … | … | … |
"Zimbabwe" | "ZW" | "ZWE" | 2009 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2010 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2011 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2012 | "rel" | "f" | "65" | null |
"Zimbabwe" | "ZW" | "ZWE" | 2013 | "rel" | "f" | "65" | 725 |
Multiple observations per row
household = pl.read_csv("./Data_files/household.csv", null_values="NA")
household
family | dob_child1 | dob_child2 | name_child1 | name_child2 |
---|---|---|---|---|
i64 | str | str | str | str |
1 | "1998-11-26" | "2000-01-29" | "Susan" | "Jose" |
2 | "1996-06-22" | null | "Mark" | null |
3 | "2002-07-11" | "2004-04-05" | "Sam" | "Seth" |
4 | "2004-10-10" | "2009-08-27" | "Craig" | "Khai" |
5 | "2000-12-05" | "2005-02-28" | "Parker" | "Gracie" |
In the DataFrame above, we wish to split the dob and name parts of the column names into separate columns. How does pivot_longer handle this?
household.pivot_longer(index="family", names_to=(".value", "child"), names_sep="_")
family | child | dob | name |
---|---|---|---|
i64 | str | str | str |
1 | "child1" | "1998-11-26" | "Susan" |
2 | "child1" | "1996-06-22" | "Mark" |
3 | "child1" | "2002-07-11" | "Sam" |
4 | "child1" | "2004-10-10" | "Craig" |
5 | "child1" | "2000-12-05" | "Parker" |
1 | "child2" | "2000-01-29" | "Jose" |
2 | "child2" | null | null |
3 | "child2" | "2004-04-05" | "Seth" |
4 | "child2" | "2009-08-27" | "Khai" |
5 | "child2" | "2005-02-28" | "Gracie" |
If you want to retain a part of the column name as a header, use the .value placeholder; the label paired with .value remains as a header, while the unpaired parts, if any, are flipped into rows.

Let’s break it down a bit. The names_to parameter is (.value, child), which pairs with (dob, child1), (dob, child2), (name, child1), (name, child2) after the column names have been split by the names_sep argument. .value is paired with dob and name, while child is paired with child1 and child2. In tabular form, it can be portrayed as below:
.value names_sep child
dob _ child1
dob _ child2
name _ child1
name _ child2
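Since the reshaped dob column is still a string, a natural follow-up - a minimal sketch, assuming the ISO "%Y-%m-%d" format shown above - is to parse it into a proper date dtype:

(
    household.pivot_longer(index="family", names_to=(".value", "child"), names_sep="_")
    # parse the dob strings into pl.Date values
    .with_columns(pl.col("dob").str.to_date(format="%Y-%m-%d"))
)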
Let’s look at another example:
manufacturer = pl.DataFrame(
{
"Sony | TV | Model | value": ["A222", "A234", "A4345"],
"Sony | TV | Quantity | value": [5, 5, 4],
"Sony | TV | Max-quant | value": [10, 9, 9],
"Panasonic | TV | Model | value": ["T232", "S3424", "X3421"],
"Panasonic | TV | Quantity | value": [1, 5, 1],
"Panasonic | TV | Max-quant | value": [10, 12, 11],
"Sanyo | Radio | Model | value": ["S111", "S1s1", "S1s2"],
"Sanyo | Radio | Quantity | value": [4, 2, 4],
"Sanyo | Radio | Max-quant | value": [9, 9, 10],
}
)
manufacturer
Sony | TV | Model | value | Sony | TV | Quantity | value | Sony | TV | Max-quant | value | Panasonic | TV | Model | value | Panasonic | TV | Quantity | value | Panasonic | TV | Max-quant | value | Sanyo | Radio | Model | value | Sanyo | Radio | Quantity | value | Sanyo | Radio | Max-quant | value |
---|---|---|---|---|---|---|---|---|
str | i64 | i64 | str | i64 | i64 | str | i64 | i64 |
"A222" | 5 | 10 | "T232" | 1 | 10 | "S111" | 4 | 9 |
"A234" | 5 | 9 | "S3424" | 5 | 12 | "S1s1" | 2 | 9 |
"A4345" | 4 | 9 | "X3421" | 1 | 11 | "S1s2" | 4 | 10 |
We wish to retain Model, Quantity and Max-quant as headers, and flip the remaining parts of the columns (the trailing value label is captured into its own column and then dropped):
(
manufacturer.pivot_longer(
index=None,
names_to=("Manufacturer", "Device", ".value", "value"),
names_sep=" | ",
).drop("value")
)
Manufacturer | Device | Model | Quantity | Max-quant |
---|---|---|---|---|
str | str | str | i64 | i64 |
"Sanyo" | "Radio" | "S111" | 4 | 9 |
"Sanyo" | "Radio" | "S1s1" | 2 | 9 |
"Sanyo" | "Radio" | "S1s2" | 4 | 10 |
"Panasonic" | "TV" | "T232" | 1 | 10 |
"Panasonic" | "TV" | "S3424" | 5 | 12 |
"Panasonic" | "TV" | "X3421" | 1 | 11 |
"Sony" | "TV" | "A222" | 5 | 10 |
"Sony" | "TV" | "A234" | 5 | 9 |
"Sony" | "TV" | "A4345" | 4 | 9 |
(
manufacturer.pivot_longer(
index=None,
names_to=("Manufacturer", "Device", ".value"),
names_pattern=r"([a-zA-Z]+)\s?\|\s?([a-zA-Z]+)\s?\|\s?([a-zA-Z-]+)\s?\|.*",
)
)
Manufacturer | Device | Model | Quantity | Max-quant |
---|---|---|---|---|
str | str | str | i64 | i64 |
"Sony" | "TV" | "A222" | 5 | 10 |
"Sony" | "TV" | "A234" | 5 | 9 |
"Sony" | "TV" | "A4345" | 4 | 9 |
"Sanyo" | "Radio" | "S111" | 4 | 9 |
"Sanyo" | "Radio" | "S1s1" | 2 | 9 |
"Sanyo" | "Radio" | "S1s2" | 4 | 10 |
"Panasonic" | "TV" | "T232" | 1 | 10 |
"Panasonic" | "TV" | "S3424" | 5 | 12 |
"Panasonic" | "TV" | "X3421" | 1 | 11 |
What if our header is actually composed of different parts of a column name? This is best explained with an example:
df_mean = pl.DataFrame(
{
"x_1_mean": [10],
"x_2_mean": [20],
"y_1_mean": [30],
"y_2_mean": [40],
"unit": [50],
}
)
df_mean
x_1_mean | x_2_mean | y_1_mean | y_2_mean | unit |
---|---|---|---|---|
i64 | i64 | i64 | i64 | i64 |
10 | 20 | 30 | 40 | 50 |
For the DataFrame above, we wish to lump similar columns together - specifically, we want to combine the x* columns into one column, and likewise the y* columns. With pivot_longer we pass multiple .value placeholders to handle this (you can pass as many .value entries as the reshaping requires):
(
df_mean.pivot_longer(
index="unit",
names_to=(".value", "number", ".value"),
names_sep="_",
names_transform=pl.col("number").cast(pl.Int8),
)
)
unit | number | xmean | ymean |
---|---|---|---|
i64 | i8 | i64 | i64 |
50 | 1 | 10 | 30 |
50 | 2 | 20 | 40 |
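The same reshape should also be expressible with names_pattern instead of names_sep - a sketch, assuming three capture groups, where the two .value groups are concatenated to form the headers (xmean, ymean):

df_mean.pivot_longer(
    index="unit",
    names_to=(".value", "number", ".value"),
    # group 1 -> x/y, group 2 -> the number, group 3 -> the mean suffix
    names_pattern=r"(.)_(\d)_(mean)",
    names_transform=pl.col("number").cast(pl.Int8),
)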
Describe the output with pivot_longer_spec
pivot_longer offers a powerful, flexible and concise way to reshape data from wide to long form. However, there are situations where pivot_longer is inadequate for the transformation, or you need more control over the reshaping. This is where pivot_longer_spec shines - you describe how the new DataFrame should look, based on the source DataFrame.
Let’s look at an example, adapted from a data.table vignette:
iris = pl.DataFrame(
{
"Sepal.Length": [5.1, 5.9],
"Sepal.Width": [3.5, 3.0],
"Petal.Length": [1.4, 5.1],
"Petal.Width": [0.2, 1.8],
"Species": ["setosa", "virginica"],
}
)
iris
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
f64 | f64 | f64 | f64 | str |
5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
For the DataFrame above, we wish to consolidate the columns into part and dimension columns - the Sepal/Petal prefixes go into the part column, while the Length/Width suffixes go into the dimension column. Let’s translate the specification into tabular form:
source_column_name new_column_name part dimension
Sepal.Length measurement Sepal Length
Sepal.Width measurement Sepal Width
Petal.Length measurement Petal Length
Petal.Width measurement Petal Width
The table above shows a clear relation between the source and target; let’s take this relationship and create a polars DataFrame of this specification:
spec = {
".name": ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"],
".value": ["measurement", "measurement", "measurement", "measurement"],
"part": ["Sepal", "Sepal", "Petal", "Petal"],
"dimension": ["Length", "Width", "Length", "Width"],
}
spec = pl.DataFrame(spec)
spec
.name | .value | part | dimension |
---|---|---|---|
str | str | str | str |
"Sepal.Length" | "measurement" | "Sepal" | "Length" |
"Sepal.Width" | "measurement" | "Sepal" | "Width" |
"Petal.Length" | "measurement" | "Petal" | "Length" |
"Petal.Width" | "measurement" | "Petal" | "Width" |
pivot_longer_spec takes the specification above and returns the expected new DataFrame:
pivot_longer_spec(df=iris, spec=spec)
Species | part | dimension | measurement |
---|---|---|---|
str | str | str | f64 |
"setosa" | "Petal" | "Length" | 1.4 |
"virginica" | "Petal" | "Length" | 5.1 |
"setosa" | "Sepal" | "Width" | 3.5 |
"virginica" | "Sepal" | "Width" | 3.0 |
"setosa" | "Petal" | "Width" | 0.2 |
"virginica" | "Petal" | "Width" | 1.8 |
"setosa" | "Sepal" | "Length" | 5.1 |
"virginica" | "Sepal" | "Length" | 5.9 |
Let’s take it further - the Sepal/Petal prefixes go into a part column, while the Length/Width suffixes remain as headers:
source_column_name new_column_name part
Sepal.Length Length Sepal
Sepal.Width Width Sepal
Petal.Length Length Petal
Petal.Width Width Petal
Again, the relationship is clear. Let’s create the spec, and then unpivot:
spec = {
".name": ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"],
"part": ["Sepal", "Sepal", "Petal", "Petal"],
".value": ["Length", "Width", "Length", "Width"],
}
spec = pl.DataFrame(spec)
spec
.name | part | .value |
---|---|---|
str | str | str |
"Sepal.Length" | "Sepal" | "Length" |
"Sepal.Width" | "Sepal" | "Width" |
"Petal.Length" | "Petal" | "Length" |
"Petal.Width" | "Petal" | "Width" |
pivot_longer_spec(df=iris, spec=spec)
Species | part | Length | Width |
---|---|---|---|
str | str | f64 | f64 |
"setosa" | "Petal" | 1.4 | 0.2 |
"virginica" | "Petal" | 5.1 | 1.8 |
"setosa" | "Sepal" | 5.1 | 3.5 |
"virginica" | "Sepal" | 5.9 | 3.0 |
The idea is simple - you know your data well, so describe the new DataFrame based on the current DataFrame, and pivot_longer_spec will take care of the transformation.
Performance
pivot_longer is not just simple, convenient, and straightforward; it is efficient as well. We can refer to Part II of this series to compare speeds:
from zipfile import ZipFile
zip_file = ZipFile("./Data_files/household_large.csv.zip")
zip_file = zip_file.read("household_large.csv")
household_large = pl.read_csv(zip_file, null_values="NA")
household_large
family | name_child0 | dob_child0 | name_child1 | dob_child1 | name_child2 | dob_child2 | name_child3 | dob_child3 | name_child4 | dob_child4 | name_child5 | dob_child5 | name_child6 | dob_child6 | name_child7 | dob_child7 | name_child8 | dob_child8 | name_child9 | dob_child9 | name_child10 | dob_child10 | name_child11 | dob_child11 | name_child12 | dob_child12 | name_child13 | dob_child13 | name_child14 | dob_child14 | name_child15 | dob_child15 | name_child16 | dob_child16 | name_child17 | dob_child17 | name_child18 | dob_child18 | name_child19 | dob_child19 | name_child20 | dob_child20 | name_child21 | dob_child21 | name_child22 | dob_child22 | name_child23 | dob_child23 | name_child24 | dob_child24 | name_child25 | dob_child25 | name_child26 | dob_child26 | name_child27 | dob_child27 | name_child28 | dob_child28 | name_child29 | dob_child29 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str |
0 | "Brittany Mcdonald" | "1989-08-24" | "Karen Mendoza MD" | "1938-12-01" | "Stephanie Casey" | "1959-08-07" | "Rhonda Hartman" | "1984-01-17" | "Elizabeth Sampson" | "1947-01-25" | "Jacqueline Barrett" | "1949-02-15" | "Rebecca Harvey" | "1974-12-03" | "John Hopkins" | "2009-11-27" | "Diane Ali" | "1993-03-04" | "Emily Jones" | "1933-01-27" | "Rickey Bates" | "1938-06-03" | "Walter Stuart" | "1971-12-04" | "Megan Smith" | "1992-06-05" | "Amanda Morris" | "1976-02-01" | "Jennifer Greer" | "1971-12-12" | "Megan Haney" | "1935-09-11" | "Alyssa Garcia" | "1963-09-03" | "Daniel Turner" | "1938-11-17" | "Brittany Simmons" | "1919-11-21" | "Kurt Schwartz" | "1956-02-27" | "Tanya Brandt" | "1943-08-16" | "Jennifer White" | "1960-08-27" | "Eddie Hamilton" | "1988-01-15" | "Sean Chavez" | "1968-12-08" | "Rhonda White" | "1948-08-30" | "Linda Hicks" | "1936-08-03" | "Brittany Beard" | "1996-03-08" | "Nicole Schmidt" | "1948-12-06" | "Nancy Clark" | "1914-12-12" | "Dr. David Walters" | "1922-10-24" |
1 | "Diana Welch" | "1989-02-11" | "Brandon Pittman" | "2019-09-02" | "Alicia Smith" | "1923-06-27" | "Sarah Dean DVM" | "2001-02-10" | "Katherine Landry" | "1935-06-26" | "Sarah Hughes" | "1911-01-19" | "Amanda Peters" | "1922-06-02" | "Shawn Sanders" | "1936-11-17" | "David White" | "1998-02-25" | "Justin Mccormick" | "1994-03-06" | "James Kelly" | "2005-02-13" | "Kimberly Lynch MD" | "1986-02-18" | "Kimberly Hall" | "2001-07-23" | "Susan Anderson" | "1908-10-26" | "Anthony Hall" | "2012-10-10" | "Lisa Gardner" | "1939-08-17" | "Andrew Burke" | "1981-07-13" | "Brittney Evans" | "1993-06-06" | "Taylor Wood" | "1951-07-29" | "Tanya Taylor" | "2019-02-01" | "Ashley Schultz" | "1993-04-15" | "Amanda Sanchez" | "1923-07-14" | "Harry Stevens" | "1986-11-10" | "David Flores" | "1931-09-16" | "Timothy Gray" | "2017-05-07" | "Jay Ramsey" | "1936-05-08" | "Charles Bell" | "2014-12-05" | "Jacob Williams" | "1963-04-17" | "Kimberly Gonzalez" | "1991-11-15" | "Monica Newton" | "1923-10-27" |
2 | "Jamie Richards" | "1973-11-04" | "Kristin Dunn" | "1973-04-02" | "Kathy Vasquez" | "1966-03-20" | "William Johnson" | "2006-03-08" | "Anthony Brewer" | "1987-06-21" | "James Hendricks" | "1981-03-29" | "Emily Torres" | "1990-11-09" | "Jennifer Santos" | "1982-02-13" | "Travis Snyder" | "2018-09-04" | "Misty Hudson" | "1974-01-03" | "Mathew Price" | "1965-07-09" | "Charles Anthony" | "2003-07-20" | "Sean Cook" | "1919-03-26" | "Krista Hall" | "1951-02-12" | "Sarah Hurst" | "1962-09-22" | "Angela May" | "2009-11-07" | "Antonio Ayers" | "1980-12-11" | "Kathryn Beck" | "1944-05-27" | "Barbara Davis" | "1929-04-19" | "Vanessa Gonzalez" | "1944-06-13" | "Tina Benjamin" | "1911-02-01" | "Anthony Wade" | "1938-10-10" | "Rachel Green" | "2001-02-26" | "Laura Wright" | "1950-01-11" | "Kenneth Stewart" | "1976-03-10" | "Megan Glass" | "1987-10-25" | "Jennifer King" | "1962-01-26" | "Cheyenne Ramirez" | "2009-12-21" | "Hayden Williamson" | "1993-12-27" | "Thomas Buchanan" | "2017-06-23" |
3 | "Michele Reed" | "1994-11-05" | "John Perry" | "1915-07-08" | "Benjamin Garcia" | "2006-03-12" | "Mrs. Carolyn Chavez" | "1999-08-03" | "Bailey Stevens" | "2000-04-09" | "Gilbert Lynch" | "1978-10-06" | "Jennifer Burton" | "2000-12-13" | "Jose Nguyen" | "2020-11-15" | "Yvonne Wall" | "1910-10-16" | "Jacob Owens" | "1943-11-14" | "William Gibson" | "1935-09-11" | "Nicholas Garcia" | "2014-12-28" | "Jamie Diaz" | "1952-10-05" | "Brian Burns" | "1934-03-03" | "Ryan Taylor" | "1978-11-15" | "Ashley Miller" | "1915-01-14" | "Francisco Brewer" | "2006-10-21" | "Melissa Ramos" | "1966-05-01" | "Antonio Moore" | "1914-11-28" | "Darin Russell" | "1994-07-29" | "Eric Santos" | "1928-02-18" | "Mariah Cunningham" | "1958-07-21" | "Deanna White" | "1914-08-30" | "Peggy Williams" | "1973-03-17" | "Michael Richardson" | "1968-02-20" | "Debra Haney" | "1972-04-02" | "Alexander Roberts" | "1977-04-14" | "Derrick Martin" | "1999-09-02" | "Teresa Stephens" | "1928-04-11" | "Andrew Murphy" | "1910-03-21" |
4 | "Steven Burgess" | "1997-12-24" | "Robert Willis" | "2011-04-18" | "Amy Forbes" | "1943-06-26" | "Chase Gutierrez" | "1976-06-21" | "Kevin Gonzales" | "1977-10-03" | "Michael Lyons" | "1973-08-06" | "Brian Carlson" | "1964-07-16" | "Jonathan Brown" | "1983-07-06" | "Ryan Lewis" | "2011-09-30" | "Brandon Drake" | "1995-07-27" | "Christine Mueller" | "1945-07-15" | "Joseph Sanders" | "1995-01-22" | "Patricia Robertson" | "1953-01-29" | "Kevin Young" | "1962-08-12" | "Amber Gray" | "1992-05-30" | "Laura Powers" | "1934-10-28" | "Carrie Ross" | "1974-11-03" | "Anthony Taylor" | "2002-03-31" | "Bill Hawkins" | "1923-01-26" | "Erin Mcbride" | "2009-01-12" | "Roger Meza" | "1934-12-10" | "Sandra Davis" | "1918-11-05" | "Kristie Whitehead" | "1979-12-01" | "James Anderson" | "1954-02-02" | "Sara Aguilar" | "2008-03-09" | "Hector Reynolds" | "1990-08-17" | "Cynthia Klein" | "1931-12-16" | "Richard Copeland" | "1978-11-24" | "Rickey Hudson" | "1926-09-29" | "Carmen Smith" | "1963-11-05" |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
99995 | "Donna Vazquez" | "1938-12-26" | "William Curry" | "1943-03-27" | "Brian Green" | "1941-08-20" | "Cathy Krause" | "1910-10-22" | "Christopher Johnson" | "2006-10-22" | "Amanda Anderson" | "1910-07-21" | "Paul Casey" | "1965-02-23" | "Franklin Strickland" | "2004-03-26" | "Brian Murphy" | "1951-06-04" | "Jennifer Martinez" | "1921-08-12" | "Tony Long" | "1928-04-01" | "Anthony Garcia" | "2015-12-03" | "Todd Gonzales" | "2001-10-31" | "Justin Anderson" | "2021-05-04" | "Chad Garrett" | "1939-06-03" | "Mr. Christopher Harper" | "2002-07-22" | "Jennifer Flores" | "1933-06-17" | "Alexis Porter" | "2020-08-27" | "Jason Pitts" | "1969-03-11" | "Adam Sanchez" | "1959-02-02" | "Angela Cooper" | "1972-02-16" | "Elizabeth Thomas" | "1959-10-25" | "Mark Aguilar" | "1989-03-28" | "Chad Chapman" | "1919-03-01" | "Jessica Doyle" | "1938-04-14" | "Kim Miller" | "2017-08-04" | "Angela Hughes" | "1930-10-31" | "Jamie Thompson" | "2017-02-28" | "Brandy Frederick" | "1932-07-31" | "Patrick Solomon" | "1920-03-12" |
99996 | "Monica Davis" | "1987-10-07" | "Laura Donovan" | "1992-07-30" | "Morgan Garcia" | "1997-01-09" | "Stacey Chambers" | "1944-01-10" | "Stephen Fields" | "1935-10-03" | "Ann Barnes" | "1996-01-04" | "Nicholas Jones" | "1967-02-01" | "Karen Dunn" | "2012-07-21" | "Sheryl Hill" | "1989-02-21" | "Kyle Fox" | "2005-01-04" | "Mark Adkins" | "1981-12-01" | "Spencer Blankenship" | "1934-04-05" | "Catherine Smith" | "2008-12-06" | "Andrea Walker" | "1940-07-01" | "Ricky Barr" | "1912-11-17" | "Desiree Parker" | "1939-11-15" | "Amanda Ruiz" | "2018-08-09" | "Sharon Miller" | "2019-05-05" | "Larry Smith" | "1981-02-24" | "James Harris" | "1990-12-14" | "Justin Smith" | "1994-10-18" | "Tamara Roman" | "1921-04-01" | "Bridget Serrano" | "1974-05-13" | "Ann Melendez" | "1986-07-06" | "Susan Huff" | "2018-09-05" | "Antonio Juarez" | "1995-08-11" | "Ashley Lamb" | "1955-05-31" | "Patrick Morse" | "1932-08-08" | "Thomas Gregory" | "1963-04-11" | "Steven Smith" | "1987-09-22" |
99997 | "Vanessa Byrd" | "1967-03-17" | "Tammy Mcpherson" | "1987-11-06" | "Jacob Schwartz" | "2019-07-07" | "Barry Smith" | "1996-03-17" | "Samuel Martinez" | "1980-02-27" | "Erin Flores" | "1959-04-12" | "Melissa Perez" | "2001-07-28" | "Victoria Wilson" | "1975-02-12" | "Timothy Ramirez" | "1912-08-05" | "Kyle Johnson" | "2015-04-01" | "David Roberts" | "2001-10-01" | "Michael Alexander" | "1922-10-18" | "Robert Green" | "2004-03-13" | "Brent Nelson" | "1973-02-24" | "Curtis Smith" | "1978-11-04" | "Walter Lyons" | "2002-03-03" | "Angela Johnson" | "2015-06-24" | "Amanda Smith" | "1913-01-08" | "Jacqueline Smith" | "2012-01-07" | "Brian Ho" | "1948-12-11" | "Robert Lopez" | "1967-02-04" | "Charles Rogers" | "1996-04-25" | "Marcus Dixon" | "1952-03-11" | "Alexis Moore" | "1952-07-26" | "Joshua Robles" | "1956-11-12" | "Rachel Hull" | "1979-05-28" | "Jessica Smith" | "2006-01-17" | "Joshua Hernandez" | "2005-06-26" | "Melanie Crawford" | "1927-07-12" | "Scott Carlson" | "1950-05-23" |
99998 | "Gary Harmon" | "2002-05-09" | "William Reeves" | "1960-01-23" | "Linda Cohen" | "1953-05-31" | "Damon Soto" | "1948-11-15" | "Christina Hull" | "1992-04-07" | "Sean Boone" | "2022-08-10" | "Benjamin Graham" | "1971-12-30" | "Daniel Pacheco" | "1926-08-16" | "Jennifer Hopkins" | "2016-04-15" | "Anne Kelly" | "1988-12-03" | "Cristina Garcia" | "1997-05-03" | "Lindsay Wright" | "1979-12-03" | "Teresa Berg" | "1953-11-01" | "Lisa Murphy" | "1930-09-15" | "Jacob Hernandez" | "2017-04-25" | "Amber Campbell" | "1983-10-31" | "Ashley Mitchell" | "1962-09-07" | "Jacqueline Singleton" | "1970-06-14" | "Melinda Miller" | "1978-05-08" | "Brenda Gibbs" | "1964-08-13" | "Rebecca Cooper" | "2022-10-04" | "Jonathan Kelly" | "1935-11-19" | "Stephen Sanchez" | "2010-05-15" | "Adam Stark" | "1911-07-12" | "Denise George" | "1930-02-08" | "Brenda Lee" | "2003-09-22" | "Tracey Nelson" | "1910-09-10" | "Greg Esparza" | "2000-07-07" | "Tanner Bell" | "1928-01-15" | "Veronica Collins" | "1934-09-23" |
99999 | "Sheila Elliott" | "1965-08-06" | "Charlene Lee" | "2014-06-22" | "Gabriella Brown" | "2015-03-17" | "Joseph Warren" | "2015-03-18" | "Martin Frederick" | "1992-05-01" | "Cameron Pruitt" | "1992-12-19" | "Jennifer Wilson" | "1929-02-17" | "Scott Soto" | "2014-02-18" | "Lisa Oneill" | "1909-10-27" | "Joy Jones" | "1963-11-07" | "Diana Rodriguez" | "1951-12-27" | "Rachel Estes" | "1986-01-22" | "Susan Smith" | "1985-01-30" | "Cody Perez" | "1932-02-02" | "Pamela Stevenson DDS" | "2020-07-06" | "Katelyn Oconnor" | "1975-11-13" | "James Blair" | "1976-03-29" | "Betty Rowe" | "1991-03-26" | "William Briggs" | "1932-10-01" | "James Hudson" | "1998-01-17" | "Shawn Martinez" | "2013-11-03" | "Carla Shelton" | "2018-10-05" | "Emily Burke" | "1965-04-03" | "Matthew Brown" | "1932-06-30" | "Mrs. Jill Rogers MD" | "2000-05-07" | "Martin Schmidt" | "1977-11-22" | "Cody Collins" | "1994-08-01" | "Kevin Duncan" | "2009-10-15" | "Jerry Brown" | "2004-01-29" | "Nancy Benjamin" | "1966-05-18" |
%%timeit
# naive approach from Part II: unpivot everything into long form,
# split the variable column into header/child, then pivot back to wide
expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["header", "child"])
(
    household_large.unpivot(index="family")
    .drop_nulls("value")
    .with_columns(variable=expression)
    .unnest("variable")
    .pivot(index=["family", "child"], on="header", values="value")
)
1.2 s ± 8.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
# optimized approach from Part II: build one struct expression per child,
# then unpivot once - avoiding the expensive pivot back to wide form
index = "family"
columns = pl.Series("header", household_large.columns)
mask = columns.is_in([index]).not_()
columns = columns.filter(mask)
columns_ = columns.str.split("_")
columns_ = columns_.list.to_struct(fields=["header", "child"])
columns_ = columns_.struct.unnest()
columns_ = columns_.with_columns(columns=columns)
# group the source column names by their child suffix
grouped = columns_.group_by("child")
grouped = grouped.agg(pl.all())
expression = [index]
zipped = zip(
    grouped.get_column("header").to_list(),
    grouped.get_column("columns").to_list(),
)
for num, (heads, column_names) in enumerate(zipped):
    # one struct per child, mapping each header to its source column
    expr_ = dict(zip(heads, column_names))
    expr_ = pl.struct(**expr_).alias(str(num))
    expression.append(expr_)
out = household_large.select(expression)
out = out.select(pl.implode("*"))
out = out.unpivot(index=index)
out = out.with_columns(grouped.get_column("child"))
out = out.drop("variable")
out = out.explode(["family", "value"])
out = out.unnest("value")
out
81.3 ms ± 4.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit household_large.pivot_longer(index='family', names_to=('.value','child'), names_sep='_')
55.8 ms ± 959 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
spec = (
pl.DataFrame({".name": household_large.columns[1:]})
.with_columns(
other=pl.col(".name").str.split("_").list.to_struct(fields=[".value", "child"])
)
.unnest("other")
)
pivot_longer_spec(df=household_large, spec=spec)
56.5 ms ± 747 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Your mileage may vary with these speed tests.
Summary
This blog post shows various ways to reshape data into long form, using pivot_longer and pivot_longer_spec. They are meant to be concise, straightforward, and efficient.