Reshape Data in Polars Efficiently from Wide to Long Form - Part II#

Introduction#

In a previous blogpost, we explored various scenarios when reshaping a DataFrame from wide to long form.

This article focuses on reshaping efficiently, especially for large datasets where some post-processing is done after unpivoting.

For the most part, we will reuse the datasets used in the previous blogpost.

Polars’ eager API is used in the examples below; for better performance, or in production, the lazy API is recommended (a minimal sketch follows).
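
As a quick illustration, here is a minimal sketch of the lazy route, assuming the same WHO CSV loaded later in this post - we scan the file instead of reading it eagerly, and only materialise the result at the end:

import polars as pl

# minimal lazy sketch: build the query on a LazyFrame,
# then materialise once at the end with .collect()
lf = pl.scan_csv("./Data_files/who.csv", null_values="NA")
out = lf.unpivot(
    index=["country", "iso2", "iso3", "year"],
    value_name="count",
).collect()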

The examples are broken down line by line, instead of being written as a single chain, for illustration purposes; this way we can clearly identify which sections of the code have the biggest impact on performance.

Caveat: do not optimise prematurely; optimise only if you need to.

import polars as pl
import sys

print("polars version :", pl.__version__)
print("python version :", sys.version)
polars version : 1.7.1
python version : 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]

Many variables in column names#

who = pl.read_csv("./Data_files/who.csv", null_values="NA")
who
shape: (7_240, 60)
country        iso2  iso3   year  new_sp_m014  …  newrel_f65
str            str   str    i64   i64          …  i64
"Afghanistan"  "AF"  "AFG"  1980  null         …  null
"Afghanistan"  "AF"  "AFG"  1981  null         …  null
"Afghanistan"  "AF"  "AFG"  1982  null         …  null
"Afghanistan"  "AF"  "AFG"  1983  null         …  null
"Afghanistan"  "AF"  "AFG"  1984  null         …  null
…
"Zimbabwe"     "ZW"  "ZWE"  2009  …            …  null
"Zimbabwe"     "ZW"  "ZWE"  2010  …            …  null
"Zimbabwe"     "ZW"  "ZWE"  2011  …            …  null
"Zimbabwe"     "ZW"  "ZWE"  2012  …            …  null
"Zimbabwe"     "ZW"  "ZWE"  2013  null         …  725
(54 value columns between new_sp_m014 and newrel_f65 elided)

In the previous blogpost, we flipped to long form using the code below:

regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
expression = pl.col("variable").str.extract_groups(regex)
out_naive = (
    who.unpivot(
        index=["country", "iso2", "iso3", "year"],
        value_name="count",
    )
    .with_columns(variable=expression)
    .unnest("variable")
    .cast({"count": pl.Int32})
)

out_naive
shape: (405_440, 8)
country        iso2  iso3   year  diagnosis  gender  age    count
str            str   str    i64   str        str     str    i32
"Afghanistan"  "AF"  "AFG"  1980  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1981  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1982  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1983  "sp"       "m"     "014"  null
"Afghanistan"  "AF"  "AFG"  1984  "sp"       "m"     "014"  null
…
"Zimbabwe"     "ZW"  "ZWE"  2009  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2010  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2011  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2012  "rel"      "f"     "65"   null
"Zimbabwe"     "ZW"  "ZWE"  2013  "rel"      "f"     "65"   725
%%timeit
regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
expression = pl.col("variable").str.extract_groups(regex)
(
    who.unpivot(
        index = ["country", "iso2", "iso3", "year"],
        value_name="count",
    )
    .with_columns(variable=expression)
    .unnest("variable")
)
102 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Is this good enough, or can we do better? Let’s investigate with lprun:

%load_ext line_profiler
def wide_to_long(df):
    # the computation is broken down
    # into parts for more clarity
    # on what happens at each stage
    out = df.unpivot(
        index=["country", "iso2", "iso3", "year"],
        value_name="count",
    )
    string = out.get_column("variable")
    regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
    string = string.str.extract_groups(regex)
    out = out.with_columns(variable=string)
    out = out.unnest("variable")
    return out
%lprun -f wide_to_long wide_to_long(df=who)
Timer unit: 1e-09 s

Total time: 0.100968 s
File: /var/folders/yh/003fk4jn5fl9kl3mnm1y0b1h0000gr/T/ipykernel_83318/3562139280.py
Function: wide_to_long at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def wide_to_long(df):
     2                                               # the computation is broken down
     3                                               # into parts for more clarity
     4                                               # on what happens at each stage
     5         2    3317000.0    2e+06      3.3      out = df.unpivot(
     6         1          0.0      0.0      0.0          index=["country", "iso2", "iso3", "year"],
     7         1          0.0      0.0      0.0          value_name="count",
     8                                               )
     9         1      12000.0  12000.0      0.0      string = out.get_column("variable")
    10         1          0.0      0.0      0.0      regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
    11         1   97543000.0    1e+08     96.6      string = string.str.extract_groups(regex)
    12         1      82000.0  82000.0      0.1      out = out.with_columns(variable=string)
    13         1      14000.0  14000.0      0.0      out = out.unnest("variable")
    14         1          0.0      0.0      0.0      return out

The bulk of the time (about 97%) is spent in the string computation, specifically the section where we extract the diagnosis, gender, and age columns with the extract_groups method.

Can we reduce this cost? How? What if we compute the string regex operation on the columns before unpivoting? 🤔

index = ["country", "iso2", "iso3", "year"]
columns = pl.Series(who.columns)
mask = columns.is_in(index).not_()
columns = columns.filter(mask)
regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
columns = columns.str.extract_groups(regex)
columns = columns.struct.unnest()
columns
shape: (56, 3)
diagnosis  gender  age
str        str     str
"sp"       "m"     "014"
"sp"       "m"     "1524"
"sp"       "m"     "2534"
"sp"       "m"     "3544"
"sp"       "m"     "4554"
…
"rel"      "f"     "2534"
"rel"      "f"     "3544"
"rel"      "f"     "4554"
"rel"      "f"     "5564"
"rel"      "f"     "65"

Next, we need to properly pair the extracted columns with the unpivoted data:

# implosion and unpivoting ensures we get a final length
# that is the same as the columns' length
# ensuring we can pair correctly
out = who.select(pl.implode("*"))
out = out.unpivot(index=index, value_name="count")
out = out.drop("variable")
out = out.with_columns(columns)
index.append("count")
out = out.explode(index)
out
shape: (405_440, 8)
country        iso2  iso3   year  count  diagnosis  gender  age
str            str   str    i64   i64    str        str     str
"Afghanistan"  "AF"  "AFG"  1980  null   "sp"       "m"     "014"
"Afghanistan"  "AF"  "AFG"  1981  null   "sp"       "m"     "014"
"Afghanistan"  "AF"  "AFG"  1982  null   "sp"       "m"     "014"
"Afghanistan"  "AF"  "AFG"  1983  null   "sp"       "m"     "014"
"Afghanistan"  "AF"  "AFG"  1984  null   "sp"       "m"     "014"
…
"Zimbabwe"     "ZW"  "ZWE"  2009  null   "rel"      "f"     "65"
"Zimbabwe"     "ZW"  "ZWE"  2010  null   "rel"      "f"     "65"
"Zimbabwe"     "ZW"  "ZWE"  2011  null   "rel"      "f"     "65"
"Zimbabwe"     "ZW"  "ZWE"  2012  null   "rel"      "f"     "65"
"Zimbabwe"     "ZW"  "ZWE"  2013  725    "rel"      "f"     "65"
cols = out_naive.columns
out_naive.sort(by=pl.all()).equals(out.select(cols).sort(by=pl.all()))
True
%%timeit
index = ["country", "iso2", "iso3", "year"]
columns = pl.Series(who.columns)
mask = columns.is_in(index).not_()
columns = columns.filter(mask)
regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
columns = columns.str.extract_groups(regex)
columns = columns.struct.unnest()
out = who.select(pl.implode("*"))
out = out.unpivot(index=index, value_name="count")
out = out.drop('variable')
out = out.with_columns(columns)
index.append('count')
out = out.explode(index)
out
4.2 ms ± 14.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That is a significant improvement in performance - from about 102 ms down to 4.2 ms, roughly 24x faster.

By changing our approach and shifting the computation to a less expensive section - the regex now runs over 56 column names instead of 405,440 unpivoted rows - we claw back significant performance gains.

def wide_to_long_optimised(df):
    index = ["country", "iso2", "iso3", "year"]
    columns = pl.Series(df.columns)
    mask = columns.is_in(index).not_()
    columns = columns.filter(mask)
    regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
    columns = columns.str.extract_groups(regex)
    columns = columns.struct.unnest()
    out = df.select(pl.implode("*"))
    out = out.unpivot(index=index, value_name="count")
    out = out.drop("variable")
    out = out.with_columns(columns)
    index.append("count")
    out = out.explode(index)
    return out
%lprun -f wide_to_long_optimised wide_to_long_optimised(df=who)
Timer unit: 1e-09 s

Total time: 0.00589 s
File: /var/folders/yh/003fk4jn5fl9kl3mnm1y0b1h0000gr/T/ipykernel_83318/509568358.py
Function: wide_to_long_optimised at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def wide_to_long_optimised(df):
     2         1          0.0      0.0      0.0      index = ["country", "iso2", "iso3", "year"]
     3         1      29000.0  29000.0      0.5      columns = pl.Series(df.columns)
     4         1     232000.0 232000.0      3.9      mask = columns.is_in(index).not_()
     5         1       4000.0   4000.0      0.1      columns = columns.filter(mask)
     6         1          0.0      0.0      0.0      regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
     7         1     299000.0 299000.0      5.1      columns = columns.str.extract_groups(regex)
     8         1       5000.0   5000.0      0.1      columns = columns.struct.unnest()
     9         1     439000.0 439000.0      7.5      out = df.select(pl.implode("*"))
    10         1    3119000.0    3e+06     53.0      out = out.unpivot(index=index, value_name="count")
    11         1      34000.0  34000.0      0.6      out = out.drop("variable")
    12         1      60000.0  60000.0      1.0      out = out.with_columns(columns)
    13         1          0.0      0.0      0.0      index.append("count")
    14         1    1669000.0    2e+06     28.3      out = out.explode(index)
    15         1          0.0      0.0      0.0      return out
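
Tying this back to the lazy API recommendation from the introduction, here is a hedged sketch of the optimised reshape as a single lazy chain - the helper name wide_to_long_lazy is ours, and the column preprocessing stays eager since it only touches the 56 header strings:

def wide_to_long_lazy(lf: pl.LazyFrame) -> pl.LazyFrame:
    # a sketch, not a benchmark: same steps as wide_to_long_optimised
    index = ["country", "iso2", "iso3", "year"]
    names = pl.Series(lf.collect_schema().names())
    names = names.filter(names.is_in(index).not_())
    regex = r"new_?(?<diagnosis>.*)_(?<gender>.)(?<age>.*)"
    parts = names.str.extract_groups(regex).struct.unnest()
    return (
        lf.select(pl.implode("*"))
        .unpivot(index=index, value_name="count")
        .drop("variable")
        # three Series of length 56: diagnosis, gender, age
        .with_columns(parts.get_columns())
        .explode(index + ["count"])
    )

# usage: wide_to_long_lazy(who.lazy()).collect()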

Multiple observations per row#

household = pl.read_csv("./Data_files/household.csv", null_values="NA")
household
shape: (5, 5)
family  dob_child1    dob_child2    name_child1  name_child2
i64     str           str           str          str
1       "1998-11-26"  "2000-01-29"  "Susan"      "Jose"
2       "1996-06-22"  null          "Mark"       null
3       "2002-07-11"  "2004-04-05"  "Sam"        "Seth"
4       "2004-10-10"  "2009-08-27"  "Craig"      "Khai"
5       "2000-12-05"  "2005-02-28"  "Parker"     "Gracie"

In the previous blogpost, we kept the dob and name as column headers, while flipping the remaining parts of the columns to rows:

expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["header", "child"])
(
    household.unpivot(index="family")
    .drop_nulls("value")
    .with_columns(variable=expression)
    .unnest("variable")
    .pivot(index=["family", "child"], on="header", values="value")
)
shape: (9, 4)
family  child     dob           name
i64     str       str           str
1       "child1"  "1998-11-26"  "Susan"
2       "child1"  "1996-06-22"  "Mark"
3       "child1"  "2002-07-11"  "Sam"
4       "child1"  "2004-10-10"  "Craig"
5       "child1"  "2000-12-05"  "Parker"
1       "child2"  "2000-01-29"  "Jose"
3       "child2"  "2004-04-05"  "Seth"
4       "child2"  "2009-08-27"  "Khai"
5       "child2"  "2005-02-28"  "Gracie"
%%timeit
expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["header", "child"])
(
    household.unpivot(index="family")
    .drop_nulls("value")
    .with_columns(variable=expression)
    .unnest("variable")
    .pivot(index=["family", "child"], on="header", values="value")
)
222 μs ± 4.05 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Let’s see the performance for a similar dataset, but larger:

from zipfile import ZipFile
zip_file = ZipFile("./Data_files/household_large.csv.zip")
zip_file = zip_file.read("household_large.csv")
household_large = pl.read_csv(zip_file, null_values="NA")
household_large
shape: (100_000, 61)
family  name_child0          dob_child0    …  name_child29         dob_child29
i64     str                  str           …  str                  str
0       "Brittany Mcdonald"  "1989-08-24"  …  "Dr. David Walters"  "1922-10-24"
1       "Diana Welch"        "1989-02-11"  …  "Monica Newton"      "1923-10-27"
2       "Jamie Richards"     "1973-11-04"  …  "Thomas Buchanan"    "2017-06-23"
3       "Michele Reed"       "1994-11-05"  …  "Andrew Murphy"      "1910-03-21"
4       "Steven Burgess"     "1997-12-24"  …  "Carmen Smith"       "1963-11-05"
…
99995   "Donna Vazquez"      "1938-12-26"  …  "Patrick Solomon"    "1920-03-12"
99996   "Monica Davis"       "1987-10-07"  …  "Steven Smith"       "1987-09-22"
99997   "Vanessa Byrd"       "1967-03-17"  …  "Scott Carlson"      "1950-05-23"
99998   "Gary Harmon"        "2002-05-09"  …  "Veronica Collins"   "1934-09-23"
99999   "Sheila Elliott"     "1965-08-06"  …  "Nancy Benjamin"     "1966-05-18"
(the 56 columns between dob_child0 and name_child29 are elided)
household_large.shape
(100000, 61)
expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["header", "child"])
out_naive = (
    household_large.unpivot(index="family")
    .drop_nulls("value")
    .with_columns(variable=expression)
    .unnest("variable")
    .pivot(index=["family", "child"], on="header", values="value")
)
out_naive
shape: (3_000_000, 4)
family  child      name                 dob
i64     str        str                  str
0       "child0"   "Brittany Mcdonald"  "1989-08-24"
1       "child0"   "Diana Welch"        "1989-02-11"
2       "child0"   "Jamie Richards"     "1973-11-04"
3       "child0"   "Michele Reed"       "1994-11-05"
4       "child0"   "Steven Burgess"     "1997-12-24"
…
99995   "child29"  "Patrick Solomon"    "1920-03-12"
99996   "child29"  "Steven Smith"       "1987-09-22"
99997   "child29"  "Scott Carlson"      "1950-05-23"
99998   "child29"  "Veronica Collins"   "1934-09-23"
99999   "child29"  "Nancy Benjamin"     "1966-05-18"
%%timeit
expression = pl.col("variable").str.split("_")
expression = expression.list.to_struct(fields=["header", "child"])
(
    household_large.unpivot(index="family")
    .drop_nulls("value")
    .with_columns(variable=expression)
    .unnest("variable")
    .pivot(index=["family", "child"], on="header", values="value")
)
1.2 s ± 8.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
def lreshape(df):
    out = df.unpivot(index="family")
    out = out.drop_nulls("value")
    string = out.get_column("variable")
    string = string.str.split("_")
    string = string.list.to_struct(fields=["header", "child"])
    out = out.with_columns(variable=string)
    out = out.unnest("variable")
    out = out.pivot(index=["family", "child"], on="header", values="value")
    return out
%lprun -f lreshape lreshape(household_large)
Timer unit: 1e-09 s

Total time: 1.14191 s
File: /var/folders/yh/003fk4jn5fl9kl3mnm1y0b1h0000gr/T/ipykernel_83318/2283907795.py
Function: lreshape at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def lreshape(df):
     2         1   55695000.0    6e+07      4.9      out = df.unpivot(index="family")
     3         1     268000.0 268000.0      0.0      out = out.drop_nulls("value")
     4         1       9000.0   9000.0      0.0      string = out.get_column("variable")
     5         1  222072000.0    2e+08     19.4      string = string.str.split("_")
     6         1   31940000.0    3e+07      2.8      string = string.list.to_struct(fields=["header", "child"])
     7         1    1298000.0    1e+06      0.1      out = out.with_columns(variable=string)
     8         1      17000.0  17000.0      0.0      out = out.unnest("variable")
     9         1  830615000.0    8e+08     72.7      out = out.pivot(index=["family", "child"], on="header", values="value")
    10         1          0.0      0.0      0.0      return out

The pivot operation takes up almost three quarters of the runtime; it looks like a good place to start, with a view to improving performance, if possible.

To possibly improve performance, we need to find a way to pair the headers (dob, name) with the relevant columns before unpivoting.

We will also process the columns before unpivoting, based on what we learnt from our previous reshaping exercise.

Let’s start with processing the columns:

index = "family"
columns = pl.Series("header", household_large.columns)
mask = columns.is_in([index]).not_()
columns = columns.filter(mask)
columns_ = columns.str.split("_")
columns_ = columns_.list.to_struct(fields=["header", "child"])
columns_ = columns_.struct.unnest()
columns_ = columns_.with_columns(columns=columns)
columns_
shape: (60, 3)
header  child      columns
str     str        str
"name"  "child0"   "name_child0"
"dob"   "child0"   "dob_child0"
"name"  "child1"   "name_child1"
"dob"   "child1"   "dob_child1"
"name"  "child2"   "name_child2"
…
"dob"   "child27"  "dob_child27"
"name"  "child28"  "name_child28"
"dob"   "child28"  "dob_child28"
"name"  "child29"  "name_child29"
"dob"   "child29"  "dob_child29"

The next step is to pair the header and columns, based on child:

# pass maintain_order=True
# if you want to retain the order
grouped = columns_.group_by("child") 
grouped = grouped.agg(pl.all())
grouped
shape: (30, 3)
child      header           columns
str        list[str]        list[str]
"child4"   ["name", "dob"]  ["name_child4", "dob_child4"]
"child2"   ["name", "dob"]  ["name_child2", "dob_child2"]
"child13"  ["name", "dob"]  ["name_child13", "dob_child13"]
"child25"  ["name", "dob"]  ["name_child25", "dob_child25"]
"child19"  ["name", "dob"]  ["name_child19", "dob_child19"]
…
"child11"  ["name", "dob"]  ["name_child11", "dob_child11"]
"child10"  ["name", "dob"]  ["name_child10", "dob_child10"]
"child5"   ["name", "dob"]  ["name_child5", "dob_child5"]
"child24"  ["name", "dob"]  ["name_child24", "dob_child24"]
"child15"  ["name", "dob"]  ["name_child15", "dob_child15"]

From the above we can see a relationship; we’ll create structs of pairs, where the field names are the headers (dob, name) and the values in the structs are the relevant column names:

expression = [index]
zipped = zip(
    grouped.get_column("header").to_list(),
    grouped.get_column("columns").to_list(),
)
for num, (heads, column_names) in enumerate(zipped):
    expr_ = dict(zip(heads, column_names))
    expr_ = pl.struct(**expr_).alias(str(num))
    expression.append(expr_)
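
For illustration, the first generated expression - matching the grouped output above; the exact child may differ on another run, since maintain_order is not set - is roughly equivalent to:

# what the loop builds for the first group, written out by hand:
# field names come from `header`, field values from `columns`
pl.struct(name="name_child4", dob="dob_child4").alias("0")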

Now, we unpivot:

out = household_large.select(expression)
out = out.select(pl.implode('*'))
out = out.unpivot(index=index)
out = out.with_columns(grouped.get_column('child'))
out = out.drop('variable')
out = out.explode(['family','value'])
out = out.unnest('value')
out
shape: (3_000_000, 4)
family  name                      dob           child
i64     str                       str           str
0       "Elizabeth Sampson"       "1947-01-25"  "child4"
1       "Katherine Landry"        "1935-06-26"  "child4"
2       "Anthony Brewer"          "1987-06-21"  "child4"
3       "Bailey Stevens"          "2000-04-09"  "child4"
4       "Kevin Gonzales"          "1977-10-03"  "child4"
…
99995   "Mr. Christopher Harper"  "2002-07-22"  "child15"
99996   "Desiree Parker"          "1939-11-15"  "child15"
99997   "Walter Lyons"            "2002-03-03"  "child15"
99998   "Amber Campbell"          "1983-10-31"  "child15"
99999   "Katelyn Oconnor"         "1975-11-13"  "child15"
cols = out_naive.columns
out_naive.sort(by=pl.all()).equals(out.select(cols).sort(by=pl.all()))
True
%%timeit
index = "family"
columns = pl.Series("header", household_large.columns)
mask = columns.is_in([index]).not_()
columns = columns.filter(mask)
columns_ = columns.str.split("_")
columns_ = columns_.list.to_struct(fields=["header", "child"])
columns_ = columns_.struct.unnest()
columns_ = columns_.with_columns(columns=columns)
grouped = columns_.group_by("child")
grouped = grouped.agg(pl.all())
expression = [index]
zipped = zip(
    grouped.get_column("header").to_list(),
    grouped.get_column("columns").to_list(),
)
for num, (heads, column_names) in enumerate(zipped):
    expr_ = dict(zip(heads, column_names))
    expr_ = pl.struct(**expr_).alias(str(num))
    expression.append(expr_)
out = household_large.select(expression)
out = out.select(pl.implode('*'))
out = out.unpivot(index=index)
out = out.with_columns(grouped.get_column('child'))
out = out.drop('variable')
out = out.explode(['family','value'])
out = out.unnest('value')
out
86.1 ms ± 233 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

By changing our approach (preprocessing the columns, and grouping related columns into structs before unpivoting) we improve the performance significantly - from about 1.2 s down to 86 ms, roughly 14x faster.

def lreshape_optimised(df):
    index = "family"
    columns = pl.Series("header", df.columns)
    mask = columns.is_in([index]).not_()
    columns = columns.filter(mask)
    columns_ = columns.str.split("_")
    columns_ = columns_.list.to_struct(fields=["header", "child"])
    columns_ = columns_.struct.unnest()
    columns_ = columns_.with_columns(columns=columns)
    grouped = columns_.group_by("child")
    grouped = grouped.agg(pl.all())
    expression = [index]
    zipped = zip(
        grouped.get_column("header").to_list(),
        grouped.get_column("columns").to_list(),
    )
    for num, (heads, column_names) in enumerate(zipped):
        expr_ = dict(zip(heads, column_names))
        expr_ = pl.struct(**expr_).alias(str(num))
        expression.append(expr_)
    out = df.select(expression)
    out = out.select(pl.implode('*'))
    out = out.unpivot(index=index)
    out = out.with_columns(grouped.get_column('child'))
    out = out.drop('variable')
    out = out.explode(['family','value'])
    out = out.unnest('value')
    return out
%lprun -f lreshape_optimised lreshape_optimised(household_large)
Timer unit: 1e-09 s

Total time: 0.08356 s
File: /var/folders/yh/003fk4jn5fl9kl3mnm1y0b1h0000gr/T/ipykernel_83318/2813683113.py
Function: lreshape_optimised at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def lreshape_optimised(df):
     2         1          0.0      0.0      0.0      index = "family"
     3         1      34000.0  34000.0      0.0      columns = pl.Series("header", df.columns)
     4         1     165000.0 165000.0      0.2      mask = columns.is_in([index]).not_()
     5         1       6000.0   6000.0      0.0      columns = columns.filter(mask)
     6         1     136000.0 136000.0      0.2      columns_ = columns.str.split("_")
     7         1     157000.0 157000.0      0.2      columns_ = columns_.list.to_struct(fields=["header", "child"])
     8         1       8000.0   8000.0      0.0      columns_ = columns_.struct.unnest()
     9         1     187000.0 187000.0      0.2      columns_ = columns_.with_columns(columns=columns)
    10         1       4000.0   4000.0      0.0      grouped = columns_.group_by("child")
    11         1     274000.0 274000.0      0.3      grouped = grouped.agg(pl.all())
    12         1       1000.0   1000.0      0.0      expression = [index]
    13         2       1000.0    500.0      0.0      zipped = zip(
    14         1      14000.0  14000.0      0.0          grouped.get_column("header").to_list(),
    15         1       6000.0   6000.0      0.0          grouped.get_column("columns").to_list(),
    16                                               )
    17        31       8000.0    258.1      0.0      for num, (heads, column_names) in enumerate(zipped):
    18        30      12000.0    400.0      0.0          expr_ = dict(zip(heads, column_names))
    19        30     193000.0   6433.3      0.2          expr_ = pl.struct(**expr_).alias(str(num))
    20        30       5000.0    166.7      0.0          expression.append(expr_)
    21         1     719000.0 719000.0      0.9      out = df.select(expression)
    22         1    6927000.0    7e+06      8.3      out = out.select(pl.implode('*'))
    23         1   57090000.0    6e+07     68.3      out = out.unpivot(index=index)
    24         1      93000.0  93000.0      0.1      out = out.with_columns(grouped.get_column('child'))
    25         1      23000.0  23000.0      0.0      out = out.drop('variable')
    26         1   17479000.0    2e+07     20.9      out = out.explode(['family','value'])
    27         1      18000.0  18000.0      0.0      out = out.unnest('value')
    28         1          0.0      0.0      0.0      return out
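
Mirroring the earlier WHO example, here is a hedged sketch of this optimised reshape on the lazy API - the helper name lreshape_lazy is ours, and the struct expressions are still built eagerly from the header strings:

def lreshape_lazy(lf: pl.LazyFrame) -> pl.LazyFrame:
    # a sketch: same steps as lreshape_optimised, deferred until collect
    index = "family"
    columns = pl.Series("header", lf.collect_schema().names())
    columns = columns.filter(columns.is_in([index]).not_())
    columns_ = columns.str.split("_").list.to_struct(fields=["header", "child"])
    columns_ = columns_.struct.unnest().with_columns(columns=columns)
    grouped = columns_.group_by("child").agg(pl.all())
    expression = [index]
    zipped = zip(
        grouped.get_column("header").to_list(),
        grouped.get_column("columns").to_list(),
    )
    for num, (heads, column_names) in enumerate(zipped):
        expression.append(pl.struct(**dict(zip(heads, column_names))).alias(str(num)))
    return (
        lf.select(expression)
        .select(pl.implode("*"))
        .unpivot(index=index)
        .with_columns(grouped.get_column("child"))
        .drop("variable")
        .explode(["family", "value"])
        .unnest("value")
    )

# usage: lreshape_lazy(household_large.lazy()).collect()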

Summary#

This blog post highlights how to flip from wide to long form efficiently.

One thing to note is the verbosity of the faster approach. In another blog post, we’ll have a look at two functions from the pyjanitor library that offer conciseness and efficiency when reshaping from wide to long form.
