Reshape Data in Polars Efficiently from Wide to Long Form - Part II#

Introduction#

In a previous blogpost, we explored various scenarios when reshaping a DataFrame from wide to long form.

This article focuses on reshaping efficiently, especially for large datasets where some post-processing is done after unpivoting.

For the most part, we will reuse the datasets used in the previous blogpost.

Polars’ eager API is used in the examples below; for better performance, or in production, the lazy API is recommended (a minimal sketch follows).
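
As a quick illustration, here is a minimal sketch of the lazy route, assuming the same WHO CSV loaded later in this post - we scan the file instead of reading it eagerly, and only materialise the result at the end:

import polars as pl

# minimal lazy sketch: build the query on a LazyFrame,
# then materialise once at the end with .collect()
lf = pl.scan_csv("./Data_files/who.csv", null_values="NA")
out = lf.unpivot(
    index=["country", "iso2", "iso3", "year"],
    value_name="count",
).collect()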

The examples are broken down line by line, instead of being written as a single chain, for illustration purposes; this way we can clearly identify which sections of the code have the biggest impact on performance.

Caveat: do not optimise prematurely; optimise only if you need to.

import polars as pl
import sys

print("polars version :", pl.__version__)
print("python version :", sys.version)
polars version : 1.7.1
python version : 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]

Many variables in column names#

who = pl.read_csv("./Data_files/who.csv", null_values="NA")
who
shape: (7_240, 60)
country        iso2  iso3   year  new_sp_m014  …  newrel_f65
str            str   str    i64   i64          …  i64
"Afghanistan"  "AF"  "AFG"  1980  null         …  null
"Afghanistan"  "AF"  "AFG"  1981  null         …  null
"Afghanistan"  "AF"  "AFG"  1982  null         …  null
"Afghanistan"  "AF"  "AFG"  1983  null         …  null
"Afghanistan"  "AF"  "AFG"  1984  null         …  null
…
"Zimbabwe"     "ZW"  "ZWE"  2009  …            …  null
"Zimbabwe"     "ZW"  "ZWE"  2010  …            …  null
"Zimbabwe"     "ZW"  "ZWE"  2011  …            …  null
"Zimbabwe"     "ZW"  "ZWE"  2012  …            …  null
"Zimbabwe"     "ZW"  "ZWE"  2013  null         …  725
(54 value columns between new_sp_m014 and newrel_f65 elided)

In the previous blogpost, we flipped to long form using the code below:

regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
expression = pl.col("variable").str.extract_groups(regex)
out_naive = (
    who.unpivot(
        index=["country", "iso2", "iso3", "year"],
        value_name="count",
    )
    .with_columns(variable=expression)
    .unnest("variable")
    .cast({"count": pl.Int32})
)

out_naive
shape: (405_440, 8)
country        iso2  iso3   year  diagnosis  gender  age    count
str            str   str    i64   str        str     str    i32
"Afghanistan"  "AF"  "AFG"  1980  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1981  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1982  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1983  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1984  "sp"       "m"     "014"  null
…
"Zimbabwe"     "ZW"  "ZWE"  2009  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2010  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2011  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2012  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2013  "rel"      "f"     "65"   725
%%timeit
regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
expression = pl.col("variable").str.extract_groups(regex)
(
    who.unpivot(
        index = ["country", "iso2", "iso3", "year"],
        value_name="count",
    )
    .with_columns(variable=expression)
    .unnest("variable")
)
102 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Is this good enough, or can we do better? Let’s investigate with lprun:

%load_ext line_profiler
def wide_to_long(df):
    # the computation is broken down
    # into parts for more clarity
    # on what happens at each stage
    out = df.unpivot(
        index=["country", "iso2", "iso3", "year"],
        value_name="count",
    )
    string = out.get_column("variable")
    regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
    string = string.str.extract_groups(regex)
    out = out.with_columns(variable=string)
    out = out.unnest("variable")
    return out
%lprun -f wide_to_long wide_to_long(df=who)
Timer unit: 1e-09 s

Total time: 0.100968 s
File: /var/folders/yh/003fk4jn5fl9kl3mnm1y0b1h0000gr/T/ipykernel_83318/3562139280.py
Function: wide_to_long at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def wide_to_long(df):
     2                                               # the computation is broken down
     3                                               # into parts for more clarity
     4                                               # on what happens at each stage
     5         2    3317000.0    2e+06      3.3      out = df.unpivot(
     6         1          0.0      0.0      0.0          index=["country", "iso2", "iso3", "year"],
     7         1          0.0      0.0      0.0          value_name="count",
     8                                               )
     9         1      12000.0  12000.0      0.0      string = out.get_column("variable")
    10         1          0.0      0.0      0.0      regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
    11         1   97543000.0    1e+08     96.6      string = string.str.extract_groups(regex)
    12         1      82000.0  82000.0      0.1      out = out.with_columns(variable=string)
    13         1      14000.0  14000.0      0.0      out = out.unnest("variable")
    14         1          0.0      0.0      0.0      return out

The bulk of the time (about 97%) is spent in the string computation, specifically the section where we extract the diagnosis, gender, and age columns with the extract_groups method.

Can we reduce this cost? How? What if we compute the string regex operation on the columns before unpivoting? 🤔

index = ["country", "iso2", "iso3", "year"]
columns = pl.Series(who.columns)
mask = columns.is_in(index).not_()
columns = columns.filter(mask)
regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
columns = columns.str.extract_groups(regex)
columns = columns.struct.unnest()
columns
shape: (56, 3)
diagnosis  gender  age
str        str     str
"sp"       "m"     "014"
"sp"       "m"     "1524"
"sp"       "m"     "2534"
"sp"       "m"     "3544"
"sp"       "m"     "4554"
…
"rel"      "f"     "2534"
"rel"      "f"     "3544"
"rel"      "f"     "4554"
"rel"      "f"     "5564"
"rel"      "f"     "65"

Next, we need to properly pair the extracted columns with the unpivoted data:

# implosion and unpivoting ensures we get a final length
# that is the same as the columns' length
# ensuring we can pair correctly
out = who.select(pl.implode("*"))
out = out.unpivot(index=index, value_name="count")
out = out.drop("variable")
out = out.with_columns(columns)
index.append("count")
out = out.explode(index)
out
shape: (405_440, 8)
country        iso2  iso3   year  count  diagnosis  gender  age
str            str   str    i64   i64    str        str     str
"Afghanistan"  "AF"  "AFG"  1980  null   "sp"       "m"     "014"
"Afghanistan"  "AF"  "AFG"  1981  null   "sp"       "m"     "014"
"Afghanistan"  "AF"  "AFG"  1982  null   "sp"       "m"     "014"
"Afghanistan"  "AF"  "AFG"  1983  null   "sp"       "m"     "014"
"Afghanistan"  "AF"  "AFG"  1984  null   "sp"       "m"     "014"
…
"Zimbabwe"     "ZW"  "ZWE"  2009  null   "rel"      "f"     "65"
"Zimbabwe"     "ZW"  "ZWE"  2010  null   "rel"      "f"     "65"
"Zimbabwe"     "ZW"  "ZWE"  2011  null   "rel"      "f"     "65"
"Zimbabwe"     "ZW"  "ZWE"  2012  null   "rel"      "f"     "65"
"Zimbabwe"     "ZW"  "ZWE"  2013  725    "rel"      "f"     "65"
cols = out_naive.columns
out_naive.sort(by=pl.all()).equals(out.select(cols).sort(by=pl.all()))
True
%%timeit
index = ["country", "iso2", "iso3", "year"]
columns = pl.Series(who.columns)
mask = columns.is_in(index).not_()
columns = columns.filter(mask)
regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
columns = columns.str.extract_groups(regex)
columns = columns.struct.unnest()
out = who.select(pl.implode("*"))
out = out.unpivot(index=index, value_name="count")
out = out.drop('variable')
out = out.with_columns(columns)
index.append('count')
out = out.explode(index)
out
4.2 ms ± 14.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That is a significant improvement in performance - from about 102 ms down to 4.2 ms, roughly 24x faster.

By changing our approach and shifting the computation to a less expensive section - the regex now runs over 56 column names instead of 405,440 unpivoted rows - we claw back significant performance gains.

def wide_to_long_optimised(df):
    index = ["country", "iso2", "iso3", "year"]
    columns = pl.Series(df.columns)
    mask = columns.is_in(index).not_()
    columns = columns.filter(mask)
    regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
    columns = columns.str.extract_groups(regex)
    columns = columns.struct.unnest()
    out = df.select(pl.implode("*"))
    out = out.unpivot(index=index, value_name="count")
    out = out.drop("variable")
    out = out.with_columns(columns)
    index.append("count")
    out = out.explode(index)
    return out
%lprun -f wide_to_long_optimised wide_to_long_optimised(df=who)
Timer unit: 1e-09 s

Total time: 0.00589 s
File: /var/folders/yh/003fk4jn5fl9kl3mnm1y0b1h0000gr/T/ipykernel_83318/509568358.py
Function: wide_to_long_optimised at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def wide_to_long_optimised(df):
     2         1          0.0      0.0      0.0      index = ["country", "iso2", "iso3", "year"]
     3         1      29000.0  29000.0      0.5      columns = pl.Series(df.columns)
     4         1     232000.0 232000.0      3.9      mask = columns.is_in(index).not_()
     5         1       4000.0   4000.0      0.1      columns = columns.filter(mask)
     6         1          0.0      0.0      0.0      regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
     7         1     299000.0 299000.0      5.1      columns = columns.str.extract_groups(regex)
     8         1       5000.0   5000.0      0.1      columns = columns.struct.unnest()
     9         1     439000.0 439000.0      7.5      out = df.select(pl.implode("*"))
    10         1    3119000.0    3e+06     53.0      out = out.unpivot(index=index, value_name="count")
    11         1      34000.0  34000.0      0.6      out = out.drop("variable")
    12         1      60000.0  60000.0      1.0      out = out.with_columns(columns)
    13         1          0.0      0.0      0.0      index.append("count")
    14         1    1669000.0    2e+06     28.3      out = out.explode(index)
    15         1          0.0      0.0      0.0      return out
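
Tying this back to the lazy API recommendation from the introduction, here is a hedged sketch of the optimised reshape as a single lazy chain - the helper name wide_to_long_lazy is ours, and the column preprocessing stays eager since it only touches the 56 header strings:

def wide_to_long_lazy(lf: pl.LazyFrame) -> pl.LazyFrame:
    # a sketch, not a benchmark: same steps as wide_to_long_optimised
    index = ["country", "iso2", "iso3", "year"]
    names = pl.Series(lf.collect_schema().names())
    names = names.filter(names.is_in(index).not_())
    regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
    parts = names.str.extract_groups(regex).struct.unnest()
    return (
        lf.select(pl.implode("*"))
        .unpivot(index=index, value_name="count")
        .drop("variable")
        # three Series of length 56: diagnosis, gender, age
        .with_columns(parts.get_columns())
        .explode(index + ["count"])
    )

# usage: wide_to_long_lazy(who.lazy()).collect()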

Multiple observations per row#

household = pl.read_csv("./Data_files/household.csv", null_values="NA")
household
shape: (5, 5)
family  dob_child1    dob_child2    name_child1  name_child2
i64     str           str           str          str
1       "1998-11-26"  "2000-01-29"  "Susan"      "Jose"
2       "1996-06-22"  null          "Mark"       null
3       "2002-07-11"  "2004-04-05"  "Sam"        "Seth"
4       "2004-10-10"  "2009-08-27"  "Craig"      "Khai"
5       "2000-12-05"  "2005-02-28"  "Parker"     "Gracie"

In the previous blogpost, we kept the dob and name as column headers, while flipping the remaining parts of the columns to rows:

expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["header", "child"])
(
    household.unpivot(index="family")
    .drop_nulls("value")
    .with_columns(variable=expression)
    .unnest("variable")
    .pivot(index=["family", "child"], on="header", values="value")
)
shape: (9, 4)
family  child     dob           name
i64     str       str           str
1       "child1"  "1998-11-26"  "Susan"
2       "child1"  "1996-06-22"  "Mark"
3       "child1"  "2002-07-11"  "Sam"
4       "child1"  "2004-10-10"  "Craig"
5       "child1"  "2000-12-05"  "Parker"
1       "child2"  "2000-01-29"  "Jose"
3       "child2"  "2004-04-05"  "Seth"
4       "child2"  "2009-08-27"  "Khai"
5       "child2"  "2005-02-28"  "Gracie"
%%timeit
expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["header", "child"])
(
    household.unpivot(index="family")
    .drop_nulls("value")
    .with_columns(variable=expression)
    .unnest("variable")
    .pivot(index=["family", "child"], on="header", values="value")
)
222 μs ± 4.05 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Let’s see the performance for a similar dataset, but larger:

from zipfile import ZipFile
zip_file = ZipFile("./Data_files/household_large.csv.zip")
zip_file = zip_file.read("household_large.csv")
household_large = pl.read_csv(zip_file, null_values="NA")
household_large
shape: (100_000, 61)
family  name_child0          dob_child0    …  name_child29         dob_child29
i64     str                  str           …  str                  str
0       "Brittany Mcdonald"  "1989-08-24"  …  "Dr. David Walters"  "1922-10-24"
1       "Diana Welch"        "1989-02-11"  …  "Monica Newton"      "1923-10-27"
2       "Jamie Richards"     "1973-11-04"  …  "Thomas Buchanan"    "2017-06-23"
3       "Michele Reed"       "1994-11-05"  …  "Andrew Murphy"      "1910-03-21"
4       "Steven Burgess"     "1997-12-24"  …  "Carmen Smith"       "1963-11-05"
…
99995   "Donna Vazquez"      "1938-12-26"  …  "Patrick Solomon"    "1920-03-12"
99996   "Monica Davis"       "1987-10-07"  …  "Steven Smith"       "1987-09-22"
99997   "Vanessa Byrd"       "1967-03-17"  …  "Scott Carlson"      "1950-05-23"
99998   "Gary Harmon"        "2002-05-09"  …  "Veronica Collins"   "1934-09-23"
99999   "Sheila Elliott"     "1965-08-06"  …  "Nancy Benjamin"     "1966-05-18"
(the 56 columns between dob_child0 and name_child29 are elided)
household_large.shape
(100000, 61)
expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["header", "child"])
out_naive = (
    household_large.unpivot(index="family")
    .drop_nulls("value")
    .with_columns(variable=expression)
    .unnest("variable")
    .pivot(index=["family", "child"], on="header", values="value")
)
out_naive
shape: (3_000_000, 4)
family  child      name                 dob
i64     str        str                  str
0       "child0"   "Brittany Mcdonald"  "1989-08-24"
1       "child0"   "Diana Welch"        "1989-02-11"
2       "child0"   "Jamie Richards"     "1973-11-04"
3       "child0"   "Michele Reed"       "1994-11-05"
4       "child0"   "Steven Burgess"     "1997-12-24"
…
99995   "child29"  "Patrick Solomon"    "1920-03-12"
99996   "child29"  "Steven Smith"       "1987-09-22"
99997   "child29"  "Scott Carlson"      "1950-05-23"
99998   "child29"  "Veronica Collins"   "1934-09-23"
99999   "child29"  "Nancy Benjamin"     "1966-05-18"
%%timeit
expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["header", "child"])
(
    household_large.unpivot(index="family")
    .drop_nulls("value")
    .with_columns(variable=expression)
    .unnest("variable")
    .pivot(index=["family", "child"], on="header", values="value")
)
1.2 s ± 8.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
def lreshape(df):
    out = df.unpivot(index="family")
    out = out.drop_nulls("value")
    string = out.get_column("variable")
    string = string.str.split("_")
    string = string.list.to_struct(fields=["header", "child"])
    out = out.with_columns(variable=string)
    out = out.unnest("variable")
    out = out.pivot(index=["family", "child"], on="header", values="value")
    return out
%lprun -f lreshape lreshape(household_large)
Timer unit: 1e-09 s

Total time: 1.14191 s
File: /var/folders/yh/003fk4jn5fl9kl3mnm1y0b1h0000gr/T/ipykernel_83318/2283907795.py
Function: lreshape at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def lreshape(df):
     2         1   55695000.0    6e+07      4.9      out = df.unpivot(index="family")
     3         1     268000.0 268000.0      0.0      out = out.drop_nulls("value")
     4         1       9000.0   9000.0      0.0      string = out.get_column("variable")
     5         1  222072000.0    2e+08     19.4      string = string.str.split("_")
     6         1   31940000.0    3e+07      2.8      string = string.list.to_struct(fields=["header", "child"])
     7         1    1298000.0    1e+06      0.1      out = out.with_columns(variable=string)
     8         1      17000.0  17000.0      0.0      out = out.unnest("variable")
     9         1  830615000.0    8e+08     72.7      out = out.pivot(index=["family", "child"], on="header", values="value")
    10         1          0.0      0.0      0.0      return out

The pivot operation takes up almost three quarters of the runtime; it looks like a good place to start, with a view to improving performance, if possible.

To possibly improve performance, we need to find a way to pair the headers (dob, name) with the relevant columns before unpivoting.

We will also process the columns before unpivoting, based on what we learnt from our previous reshaping exercise.

Let’s start with processing the columns:

index = "family"
columns = pl.Series("header", household_large.columns)
mask = columns.is_in([index]).not_()
columns = columns.filter(mask)
columns_ = columns.str.split("_")
columns_ = columns_.list.to_struct(fields=["header", "child"])
columns_ = columns_.struct.unnest()
columns_ = columns_.with_columns(columns=columns)
columns_
shape: (60, 3)
header  child      columns
str     str        str
"name"  "child0"   "name_child0"
"dob"   "child0"   "dob_child0"
"name"  "child1"   "name_child1"
"dob"   "child1"   "dob_child1"
"name"  "child2"   "name_child2"
…
"dob"   "child27"  "dob_child27"
"name"  "child28"  "name_child28"
"dob"   "child28"  "dob_child28"
"name"  "child29"  "name_child29"
"dob"   "child29"  "dob_child29"

The next step is to pair the header and columns, based on child:

# pass maintain_order=True
# if you want to retain the order
grouped = columns_.group_by("child") 
grouped = grouped.agg(pl.all())
grouped
shape: (30, 3)
child      header           columns
str        list[str]        list[str]
"child4"   ["name", "dob"]  ["name_child4", "dob_child4"]
"child2"   ["name", "dob"]  ["name_child2", "dob_child2"]
"child13"  ["name", "dob"]  ["name_child13", "dob_child13"]
"child25"  ["name", "dob"]  ["name_child25", "dob_child25"]
"child19"  ["name", "dob"]  ["name_child19", "dob_child19"]
…
"child11"  ["name", "dob"]  ["name_child11", "dob_child11"]
"child10"  ["name", "dob"]  ["name_child10", "dob_child10"]
"child5"   ["name", "dob"]  ["name_child5", "dob_child5"]
"child24"  ["name", "dob"]  ["name_child24", "dob_child24"]
"child15"  ["name", "dob"]  ["name_child15", "dob_child15"]

From the above we can see a relationship; we’ll create structs of pairs, where the field names are the headers (dob, name) and the values in the structs are the relevant column names:

expression = [index]
zipped = zip(
    grouped.get_column("header").to_list(),
    grouped.get_column("columns").to_list(),
)
for num, (heads, column_names) in enumerate(zipped):
    expr_ = dict(zip(heads, column_names))
    expr_ = pl.struct(**expr_).alias(str(num))
    expression.append(expr_)
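
For illustration, the first generated expression - matching the grouped output above; the exact child may differ on another run, since maintain_order is not set - is roughly equivalent to:

# what the loop builds for the first group, written out by hand:
# field names come from `header`, field values from `columns`
pl.struct(name="name_child4", dob="dob_child4").alias("0")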

Now, we unpivot:

out = household_large.select(expression)
out = out.select(pl.implode('*'))
out = out.unpivot(index=index)
out = out.with_columns(grouped.get_column('child'))
out = out.drop('variable')
out = out.explode(['family','value'])
out = out.unnest('value')
out
shape: (3_000_000, 4)
family  name                      dob           child
i64     str                       str           str
0       "Elizabeth Sampson"       "1947-01-25"  "child4"
1       "Katherine Landry"        "1935-06-26"  "child4"
2       "Anthony Brewer"          "1987-06-21"  "child4"
3       "Bailey Stevens"          "2000-04-09"  "child4"
4       "Kevin Gonzales"          "1977-10-03"  "child4"
…
99995   "Mr. Christopher Harper"  "2002-07-22"  "child15"
99996   "Desiree Parker"          "1939-11-15"  "child15"
99997   "Walter Lyons"            "2002-03-03"  "child15"
99998   "Amber Campbell"          "1983-10-31"  "child15"
99999   "Katelyn Oconnor"         "1975-11-13"  "child15"
cols = out_naive.columns
out_naive.sort(by=pl.all()).equals(out.select(cols).sort(by=pl.all()))
True
%%timeit
index = "family"
columns = pl.Series("header", household_large.columns)
mask = columns.is_in([index]).not_()
columns = columns.filter(mask)
columns_ = columns.str.split("_")
columns_ = columns_.list.to_struct(fields=["header", "child"])
columns_ = columns_.struct.unnest()
columns_ = columns_.with_columns(columns=columns)
grouped = columns_.group_by("child")
grouped = grouped.agg(pl.all())
expression = [index]
zipped = zip(
    grouped.get_column("header").to_list(),
    grouped.get_column("columns").to_list(),
)
for num, (heads, column_names) in enumerate(zipped):
    expr_ = dict(zip(heads, column_names))
    expr_ = pl.struct(**expr_).alias(str(num))
    expression.append(expr_)
out = household_large.select(expression)
out = out.select(pl.implode('*'))
out = out.unpivot(index=index)
out = out.with_columns(grouped.get_column('child'))
out = out.drop('variable')
out = out.explode(['family','value'])
out = out.unnest('value')
out
86.1 ms ± 233 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

By changing our approach (preprocessing the columns, and grouping related columns into structs before unpivoting) we improve the performance significantly - from about 1.2 s down to 86 ms, roughly 14x faster.

def lreshape_optimised(df):
    index = "family"
    columns = pl.Series("header", df.columns)
    mask = columns.is_in([index]).not_()
    columns = columns.filter(mask)
    columns_ = columns.str.split("_")
    columns_ = columns_.list.to_struct(fields=["header", "child"])
    columns_ = columns_.struct.unnest()
    columns_ = columns_.with_columns(columns=columns)
    grouped = columns_.group_by("child")
    grouped = grouped.agg(pl.all())
    expression = [index]
    zipped = zip(
        grouped.get_column("header").to_list(),
        grouped.get_column("columns").to_list(),
    )
    for num, (heads, column_names) in enumerate(zipped):
        expr_ = dict(zip(heads, column_names))
        expr_ = pl.struct(**expr_).alias(str(num))
        expression.append(expr_)
    out = df.select(expression)
    out = out.select(pl.implode('*'))
    out = out.unpivot(index=index)
    out = out.with_columns(grouped.get_column('child'))
    out = out.drop('variable')
    out = out.explode(['family','value'])
    out = out.unnest('value')
    return out
%lprun -f lreshape_optimised lreshape_optimised(household_large)
Timer unit: 1e-09 s

Total time: 0.08356 s
File: /var/folders/yh/003fk4jn5fl9kl3mnm1y0b1h0000gr/T/ipykernel_83318/2813683113.py
Function: lreshape_optimised at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def lreshape_optimised(df):
     2         1          0.0      0.0      0.0      index = "family"
     3         1      34000.0  34000.0      0.0      columns = pl.Series("header", df.columns)
     4         1     165000.0 165000.0      0.2      mask = columns.is_in([index]).not_()
     5         1       6000.0   6000.0      0.0      columns = columns.filter(mask)
     6         1     136000.0 136000.0      0.2      columns_ = columns.str.split("_")
     7         1     157000.0 157000.0      0.2      columns_ = columns_.list.to_struct(fields=["header", "child"])
     8         1       8000.0   8000.0      0.0      columns_ = columns_.struct.unnest()
     9         1     187000.0 187000.0      0.2      columns_ = columns_.with_columns(columns=columns)
    10         1       4000.0   4000.0      0.0      grouped = columns_.group_by("child")
    11         1     274000.0 274000.0      0.3      grouped = grouped.agg(pl.all())
    12         1       1000.0   1000.0      0.0      expression = [index]
    13         2       1000.0    500.0      0.0      zipped = zip(
    14         1      14000.0  14000.0      0.0          grouped.get_column("header").to_list(),
    15         1       6000.0   6000.0      0.0          grouped.get_column("columns").to_list(),
    16                                               )
    17        31       8000.0    258.1      0.0      for num, (heads, column_names) in enumerate(zipped):
    18        30      12000.0    400.0      0.0          expr_ = dict(zip(heads, column_names))
    19        30     193000.0   6433.3      0.2          expr_ = pl.struct(**expr_).alias(str(num))
    20        30       5000.0    166.7      0.0          expression.append(expr_)
    21         1     719000.0 719000.0      0.9      out = df.select(expression)
    22         1    6927000.0    7e+06      8.3      out = out.select(pl.implode('*'))
    23         1   57090000.0    6e+07     68.3      out = out.unpivot(index=index)
    24         1      93000.0  93000.0      0.1      out = out.with_columns(grouped.get_column('child'))
    25         1      23000.0  23000.0      0.0      out = out.drop('variable')
    26         1   17479000.0    2e+07     20.9      out = out.explode(['family','value'])
    27         1      18000.0  18000.0      0.0      out = out.unnest('value')
    28         1          0.0      0.0      0.0      return out
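
Mirroring the earlier WHO example, here is a hedged sketch of this optimised reshape on the lazy API - the helper name lreshape_lazy is ours, and the struct expressions are still built eagerly from the header strings:

def lreshape_lazy(lf: pl.LazyFrame) -> pl.LazyFrame:
    # a sketch: same steps as lreshape_optimised, deferred until collect
    index = "family"
    columns = pl.Series("header", lf.collect_schema().names())
    columns = columns.filter(columns.is_in([index]).not_())
    columns_ = columns.str.split("_").list.to_struct(fields=["header", "child"])
    columns_ = columns_.struct.unnest().with_columns(columns=columns)
    grouped = columns_.group_by("child").agg(pl.all())
    expression = [index]
    zipped = zip(
        grouped.get_column("header").to_list(),
        grouped.get_column("columns").to_list(),
    )
    for num, (heads, column_names) in enumerate(zipped):
        expression.append(pl.struct(**dict(zip(heads, column_names))).alias(str(num)))
    return (
        lf.select(expression)
        .select(pl.implode("*"))
        .unpivot(index=index)
        .with_columns(grouped.get_column("child"))
        .drop("variable")
        .explode(["family", "value"])
        .unnest("value")
    )

# usage: lreshape_lazy(household_large.lazy()).collect()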

Summary#

This blog post highlights how to flip from wide to long form efficiently.

One thing to note is the verbosity of the faster approach. In another blog post, we’ll have a look at two functions from the pyjanitor library that offer conciseness and efficiency when reshaping from wide to long form.
