Pivot_longer : Reshape Data in Pandas Efficiently and with Ease from Wide to Long

Pivot_longer : Reshape Data in Pandas Efficiently and with Ease from Wide to Long#

Introduction#

You’ve got data in a wide form (lots of columns), and you wish to transform it into a long form (fewer columns, more rows). In pandas, depending on the form of the data, this can be achieved with either pd.stack, pd.melt, pd.wide_to_long, and sometimes require a bit of data massaging before arriving at the final, desired form.

This article suggests a simple, efficient, opinionated way, that resolves most wide to long transformations, within a single function, using the pivot_longer function from pyjanitor.

Let’s look at an example, based on real data - this is a subset of the who data from R:

import pandas as pd
import numpy as np
import sys

print(
    " pandas version :",
    pd.__version__,
    "\n",
    "numpy version :",
    np.__version__,
    "\n",
    "python version :",
    sys.version,
)

 pandas version : 2.2.2 
 numpy version : 2.0.2 
 python version : 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]

df = pd.DataFrame({"id": [1], "new_sp_m5564": [2], "newrel_f65": [3]})
df

	id	new_sp_m5564	newrel_f65
0	1	2	3

Looking at the dataframe above, we have multiple variables crammed into the column names. Using the definitions as described here:

    - new_/new prefix indicates these are counts of new cases
    - sp/rel/ep describe how the case was diagnosed
    - m/f gives the gender
    - 5564/65 supplies the age range

Our goal here is to separate the dataframe into individual columns of count, diagnosis,gender, age. How do we pull this off, with the current tools in Pandas?

pd.wide_to_long is not applicable here, as there is no unique separator. Our best option would be to use a regular expression to pull out the individual columns.

With pd.stack:

# split the columns using a regex, before stacking:
DF = df.set_index("id")
regex = r"new_?(.+)_(.)(\d+)"
columns = DF.columns.str.extract(regex)
new_names = ["diagnosis", "gender", "age"]
columns = pd.MultiIndex.from_frame(columns, names=new_names)
DF.columns = columns
DF.stack(new_names, future_stack=True).rename("count").reset_index()

	id	diagnosis	gender	age	count
0	1	sp	m	5564	2
1	1	rel	f	65	3

With pd.melt:

DF = df.melt("id")
out = DF.variable.str.extract(regex).set_axis(new_names, axis=1)
pd.concat([DF.drop(columns="variable"), out], axis=1)

	id	value	diagnosis	gender	age
0	1	2	sp	m	5564
1	1	3	rel	f	65

Note the extra steps we had to take to get to our final form. However, we know the pattern (a regex that matches the columns and the sub groups in the columns that we wish to extract), so why not take advantage of that? This is where pivot_longer from pyjanitor shines:

# pip install pyjanitor
import janitor

df.pivot_longer(index="id", names_to=new_names, names_pattern=regex)

	id	diagnosis	gender	age	value
0	1	sp	m	5564	2
1	1	rel	f	65	3

Straight to the point. If there is a pattern, pivot_longer will figure out what to do and deliver the results. It is a simple, efficient, reshaping abstraction.

Basics#

Let’s walk through some of the parameters in pivot_longer:

index : These are the columns that will not be transformed to long form - the values are repeated as necessary.
column_names: These are the columns that will be transformed to long form.
names_to : These will be the new column names of the reshaped DataFrame.
values_to: These will contain the values collated from the original DataFrame.
names_sep: extract individual columns with a separator.
names_pattern: extract individual columns with a regular expression, or a list of regular expressions.
names_transform: efficiently convert the columns aggregated from the column_names from string to another dtype, as long as it is supported by pd.astype

Keep in mind that pivot_longer works on patterns - deciphering what pattern to use is the key to its simplicity.

The rest of the article outlines the various patterns that pivot_longer works with.

Pattern 1 - Melt all Columns#

This is basic and is similar to what you get with pd.melt.

Let’s use another realistic dataset - relig_income - the original example is from here:

url = "https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/relig_income.csv"
religion = pd.read_csv(url)
religion.head()

	religion	<$10k	$10-20k	$20-30k	$30-40k	$40-50k	$50-75k	$75-100k	$100-150k	>150k	Don't know/refused
0	Agnostic	27	34	60	81	76	137	122	109	84	96
1	Atheist	12	27	37	52	35	70	73	59	74	76
2	Buddhist	27	21	30	34	33	58	62	39	53	54
3	Catholic	418	617	732	670	638	1116	949	792	633	1489
4	Don’t know/refused	15	14	15	11	10	35	21	17	18	116

religion.pivot_longer(index="religion", names_to="income", values_to="count")

	religion	income	count
0	Agnostic	<$10k	27
1	Atheist	<$10k	12
2	Buddhist	<$10k	27
3	Catholic	<$10k	418
4	Don’t know/refused	<$10k	15
...	...	...	...
175	Orthodox	Don't know/refused	73
176	Other Christian	Don't know/refused	18
177	Other Faiths	Don't know/refused	71
178	Other World Religions	Don't know/refused	8
179	Unaffiliated	Don't know/refused	597

180 rows × 3 columns

If there are a lot of columns, we can dynamically select the index, or column_names, using the janitor.select syntax. Let’s apply it to another dataset related to billboard charts - original example is here:

url = "https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/billboard.csv"
billboard = pd.read_csv(url)
billboard.head()

	year	artist	track	time	date.entered	wk1	wk2	wk3	wk4	wk5	...	wk67	wk68	wk69	wk70	wk71	wk72	wk73	wk74	wk75	wk76
0	2000	2 Pac	Baby Don't Cry (Keep...	4:22	2000-02-26	87	82.0	72.0	77.0	87.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	2000	2Ge+her	The Hardest Part Of ...	3:15	2000-09-02	91	87.0	92.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	2000	3 Doors Down	Kryptonite	3:53	2000-04-08	81	70.0	68.0	67.0	66.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	2000	3 Doors Down	Loser	4:24	2000-10-21	76	76.0	72.0	69.0	67.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	2000	504 Boyz	Wobble Wobble	3:35	2000-04-15	57	34.0	25.0	17.0	17.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 81 columns

The columns that will be flipped to long form are the columns that start with wk - these will be passed to the column_names parameter. That is a lot of columns. Let’s see how pivot_longer simplifies this:

billboard.pivot_longer(column_names="wk*", names_to="week")

	year	artist	track	time	date.entered	week	value
0	2000	2 Pac	Baby Don't Cry (Keep...	4:22	2000-02-26	wk1	87.0
1	2000	2Ge+her	The Hardest Part Of ...	3:15	2000-09-02	wk1	91.0
2	2000	3 Doors Down	Kryptonite	3:53	2000-04-08	wk1	81.0
3	2000	3 Doors Down	Loser	4:24	2000-10-21	wk1	76.0
4	2000	504 Boyz	Wobble Wobble	3:35	2000-04-15	wk1	57.0
...	...	...	...	...	...	...	...
24087	2000	Yankee Grey	Another Nine Minutes	3:10	2000-04-29	wk76	NaN
24088	2000	Yearwood, Trisha	Real Live Woman	3:55	2000-04-01	wk76	NaN
24089	2000	Ying Yang Twins	Whistle While You Tw...	4:19	2000-03-18	wk76	NaN
24090	2000	Zombie Nation	Kernkraft 400	3:30	2000-09-02	wk76	NaN
24091	2000	matchbox twenty	Bent	4:12	2000-04-29	wk76	NaN

24092 rows × 7 columns

If we want to use the index parameter instead, that’s easy as well:

billboard.pivot_longer(index=slice("year", "date.entered"), names_to="week")

	year	artist	track	time	date.entered	week	value
0	2000	2 Pac	Baby Don't Cry (Keep...	4:22	2000-02-26	wk1	87.0
1	2000	2Ge+her	The Hardest Part Of ...	3:15	2000-09-02	wk1	91.0
2	2000	3 Doors Down	Kryptonite	3:53	2000-04-08	wk1	81.0
3	2000	3 Doors Down	Loser	4:24	2000-10-21	wk1	76.0
4	2000	504 Boyz	Wobble Wobble	3:35	2000-04-15	wk1	57.0
...	...	...	...	...	...	...	...
24087	2000	Yankee Grey	Another Nine Minutes	3:10	2000-04-29	wk76	NaN
24088	2000	Yearwood, Trisha	Real Live Woman	3:55	2000-04-01	wk76	NaN
24089	2000	Ying Yang Twins	Whistle While You Tw...	4:19	2000-03-18	wk76	NaN
24090	2000	Zombie Nation	Kernkraft 400	3:30	2000-09-02	wk76	NaN
24091	2000	matchbox twenty	Bent	4:12	2000-04-29	wk76	NaN

24092 rows × 7 columns

Pattern 2A - Melt into multiple columns - new column names#

There are scenarios where multiple variables are crammed into the column names - we had a look at that already with the who example.

Let’s look at an example, adapted from data.table vignette:

iris = pd.DataFrame(
    {
        "Sepal.Length": [5.1, 5.9],
        "Sepal.Width": [3.5, 3.0],
        "Petal.Length": [1.4, 5.1],
        "Petal.Width": [0.2, 1.8],
        "Species": ["setosa", "virginica"],
    }
)

iris

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	5.9	3.0	5.1	1.8	virginica

For the data above, we wish to consolidate the columns into part and dimensions - the Sepal/Petal prefixes go into the part column, while Width/Length go into the dimension column.

In this case, we can use either the names_sep or the names_pattern parameter - for this particular scenario, the names_sep option is a better fit:

iris.pivot_longer(index="Species", names_to=["part", "dimension"], names_sep=r".")

	Species	part	dimension	value
0	setosa	Sepal	Length	5.1
1	virginica	Sepal	Length	5.9
2	setosa	Sepal	Width	3.5
3	virginica	Sepal	Width	3.0
4	setosa	Petal	Length	1.4
5	virginica	Petal	Length	5.1
6	setosa	Petal	Width	0.2
7	virginica	Petal	Width	1.8

Using names_pattern:

iris.pivot_longer(
    index="Species", names_to=["part", "dimension"], names_pattern=r"(.+)\.(.+)"
)

	Species	part	dimension	value
0	setosa	Sepal	Length	5.1
1	virginica	Sepal	Length	5.9
2	setosa	Sepal	Width	3.5
3	virginica	Sepal	Width	3.0
4	setosa	Petal	Length	1.4
5	virginica	Petal	Length	5.1
6	setosa	Petal	Width	0.2
7	virginica	Petal	Width	1.8

Pattern 2B - Melt into multiple columns - Use variables within the column names as new names#

Instead of new column names, what if we wanted to use existing variables within the column names?

Let’s look at an example adapted from data.table vignette:

df = pd.DataFrame(
    {
        "family": [1, 2, 3, 4, 5],
        "dob_child1": [
            "1998-11-26",
            "1996-06-22",
            "2002-07-11",
            "2004-10-10",
            "2000-12-05",
        ],
        "dob_child2": [
            "2000-01-29",
            np.nan,
            "2004-04-05",
            "2009-08-27",
            "2005-02-28",
        ],
        "gender_child1": [1, 2, 2, 1, 2],
        "gender_child2": [2.0, np.nan, 2.0, 1.0, 1.0],
    }
)

df

	family	dob_child1	dob_child2	gender_child1	gender_child2
0	1	1998-11-26	2000-01-29	1	2.0
1	2	1996-06-22	NaN	2	NaN
2	3	2002-07-11	2004-04-05	2	2.0
3	4	2004-10-10	2009-08-27	1	1.0
4	5	2000-12-05	2005-02-28	2	1.0

For the data above, we wish to keep dob and gender as column names, while moving child1 and child2 to a new column. How does pivot_longer reshape this?

df.pivot_longer(index="family", names_to=(".value", "child"), names_sep="_")

	family	child	dob	gender
0	1	child1	1998-11-26	1.0
1	2	child1	1996-06-22	2.0
2	3	child1	2002-07-11	2.0
3	4	child1	2004-10-10	1.0
4	5	child1	2000-12-05	2.0
5	1	child2	2000-01-29	2.0
6	2	child2	NaN	NaN
7	3	child2	2004-04-05	2.0
8	4	child2	2009-08-27	1.0
9	5	child2	2005-02-28	1.0

For patterns such as this, where a part of the column name is to be retained as header in the new dataframe, the .value placeholder comes in handy. The .value placeholder tells pivot_longer - hey, this variable in the column name is to be retained as a header. pivot_longer gets the message and acts accordingly.

Take column label dob_child1 - the first part is dob and the second part is child1. This pairs with (.value, child), meaning dob stays as header, and the rest gets lumped into the child column. The names_sep value - _ - determines how the column is split into the new columns.

With names_pattern we can have more control on the output, selecting just the numbers in the child column:

df.pivot_longer(
    index="family", names_to=(".value", "child"), names_pattern=r"(.+)_child(\d)"
)

	family	child	dob	gender
0	1	1	1998-11-26	1.0
1	2	1	1996-06-22	2.0
2	3	1	2002-07-11	2.0
3	4	1	2004-10-10	1.0
4	5	1	2000-12-05	2.0
5	1	2	2000-01-29	2.0
6	2	2	NaN	NaN
7	3	2	2004-04-05	2.0
8	4	2	2009-08-27	1.0
9	5	2	2005-02-28	1.0

We can also efficiently convert the data type of child to integer, using the names_transform option:

df.pivot_longer(
    index="family",
    names_to=(".value", "child"),
    names_pattern=r"(.+)_child(\d)",
    names_transform=np.int8,
)

	family	child	dob	gender
0	1	1	1998-11-26	1.0
1	2	1	1996-06-22	2.0
2	3	1	2002-07-11	2.0
3	4	1	2004-10-10	1.0
4	5	1	2000-12-05	2.0
5	1	2	2000-01-29	2.0
6	2	2	NaN	NaN
7	3	2	2004-04-05	2.0
8	4	2	2009-08-27	1.0
9	5	2	2005-02-28	1.0

The flexibility of .value means you can place it anywhere - you are not limited by position - as long as the right argument is passed to the names_sep or names_pattern parameter, your output should be fine.

Let’s apply this to the iris dataset to test the flexibility of .value:

iris.pivot_longer(index="Species", names_to=("part", ".value"), names_sep=".")

	Species	part	Length	Width
0	setosa	Sepal	5.1	3.5
1	virginica	Sepal	5.9	3.0
2	setosa	Petal	1.4	0.2
3	virginica	Petal	5.1	1.8

You can use multiple .value if required - let’s apply this to a question from Stack Overflow:

df = pd.DataFrame(
    [
        {
            "ID": 1,
            "DateRange1Start": "1/1/90",
            "DateRange1End": "3/1/90",
            "Value1": 4.4,
            "DateRange2Start": "4/5/91",
            "DateRange2End": "6/7/91",
            "Value2": 6.2,
            "DateRange3Start": "5/5/95",
            "DateRange3End": "6/6/96",
            "Value3": 3.3,
        }
    ]
)

df

	ID	DateRange1Start	DateRange1End	Value1	DateRange2Start	DateRange2End	Value2	DateRange3Start	DateRange3End	Value3
0	1	1/1/90	3/1/90	4.4	4/5/91	6/7/91	6.2	5/5/95	6/6/96	3.3

df.pivot_longer(index="ID", names_to=(".value", ".value"), names_pattern=r"(.+)\d(.*)")

	ID	DateRangeStart	DateRangeEnd	Value
0	1	1/1/90	3/1/90	4.4
1	1	4/5/91	6/7/91	6.2
2	1	5/5/95	6/6/96	3.3

Again, as long as the pattern can be identified, pivot_longer will take care of the rest.

One more example, adapted from Stack Overflow:

df = pd.DataFrame(
    {
        "Sony | TV | Model | value": {0: "A222", 1: "A234", 2: "A4345"},
        "Sony | TV | Quantity | value": {0: 5, 1: 5, 2: 4},
        "Sony | TV | Max-quant | value": {0: 10, 1: 9, 2: 9},
        "Panasonic | TV | Model | value": {0: "T232", 1: "S3424", 2: "X3421"},
        "Panasonic | TV | Quantity | value": {0: 1, 1: 5, 2: 1},
        "Panasonic | TV | Max-quant | value": {0: 10, 1: 12, 2: 11},
        "Sanyo | Radio | Model | value": {0: "S111", 1: "S1s1", 2: "S1s2"},
        "Sanyo | Radio | Quantity | value": {0: 4, 1: 2, 2: 4},
        "Sanyo | Radio | Max-quant | value": {0: 9, 1: 9, 2: 10},
    }
)

df

	Sony \| TV \| Model \| value	Sony \| TV \| Quantity \| value	Sony \| TV \| Max-quant \| value	Panasonic \| TV \| Model \| value	Panasonic \| TV \| Quantity \| value	Panasonic \| TV \| Max-quant \| value	Sanyo \| Radio \| Model \| value	Sanyo \| Radio \| Quantity \| value	Sanyo \| Radio \| Max-quant \| value
0	A222	5	10	T232	1	10	S111	4	9
1	A234	5	9	S3424	5	12	S1s1	2	9
2	A4345	4	9	X3421	1	11	S1s2	4	10

df.pivot_longer(
    names_to=("Manufacturer", "Device", ".value"),
    names_pattern=r"(.+)\s\|(.+)\s\|(.+)\s\|.*",
)

	Manufacturer	Device	Model	Quantity	Max-quant
0	Sony	TV	A222	5	10
1	Sony	TV	A234	5	9
2	Sony	TV	A4345	4	9
3	Panasonic	TV	T232	1	10
4	Panasonic	TV	S3424	5	12
5	Panasonic	TV	X3421	1	11
6	Sanyo	Radio	S111	4	9
7	Sanyo	Radio	S1s1	2	9
8	Sanyo	Radio	S1s2	4	10

Pattern 3A - Group similar columns together#

This is best explained with an example:

df_mean = pd.DataFrame(
    {
        "x_1_mean": [10],
        "x_2_mean": [20],
        "y_1_mean": [30],
        "y_2_mean": [40],
        "unit": [50],
    }
)

df_mean

	x_1_mean	x_2_mean	y_1_mean	y_2_mean	unit
0	10	20	30	40	50

For the dataframe above, we wish to lump similar columns together - specifically, we want to lump the x* columns into one, same goes for the y* columns.

For this pattern, we pass a list of names to names_to, while passing a list of regular expressions to names_pattern:

df_mean.pivot_longer(
    index="unit", names_to=["x_mean", "y_mean"], names_pattern=["x", "y"]
)

	unit	x_mean	y_mean
0	50	10	30
1	50	20	40

Using a list of regular expressions for names_pattern implies that you want to group similar items under specified columns in names_to.

Basically, it is a pairing. the first x_1_mean pairs with the first y* mean which it encounters, which is y_1_mean, same goes for the next x*, which is x_2_mean, it pairs with y_2_mean, since that is the next available y*. The pairing is on a first come first serve basis.

Let’s look at another example:

url = "https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/movie_altered.csv"
movies = pd.read_csv(url)
movies.head()

	title	rating	year	duration	director_1	director_fb_likes_1	actor_1	actor_2	actor_3	actor_fb_likes_1	actor_fb_likes_2	actor_fb_likes_3
0	Avatar	PG-13	2009.0	178.0	James Cameron	0.0	CCH Pounder	Joel David Moore	Wes Studi	1000.0	936.0	855.0
1	Pirates of the Caribbean: At World's End	PG-13	2007.0	169.0	Gore Verbinski	563.0	Johnny Depp	Orlando Bloom	Jack Davenport	40000.0	5000.0	1000.0
2	Spectre	PG-13	2015.0	148.0	Sam Mendes	0.0	Christoph Waltz	Rory Kinnear	Stephanie Sigman	11000.0	393.0	161.0
3	The Dark Knight Rises	PG-13	2012.0	164.0	Christopher Nolan	22000.0	Tom Hardy	Christian Bale	Joseph Gordon-Levitt	27000.0	23000.0	23000.0
4	Star Wars: Episode VII - The Force Awakens	NaN	NaN	NaN	Doug Walker	131.0	Doug Walker	Rob Walker	NaN	131.0	12.0	NaN

The goal for the dataframe above is to group the actors into another column, and the facebook likes into a separate column. The pattern here is obvious - we are grouping similar columns, so we pass a list of regular expressions to names_pattern, with an equal number of names to names_to:

movies.pivot_longer(
    index=slice("title", "duration"),
    names_to=["director", "director_fb_likes", "actor", "actor_fb_likes"],
    names_pattern=["director_\\d$", "director_fb_likes", "actor_\\d$", "actor.+"],
)

	title	rating	year	duration	director	director_fb_likes	actor	actor_fb_likes
0	Avatar	PG-13	2009.0	178.0	James Cameron	0.0	CCH Pounder	1000.0
1	Pirates of the Caribbean: At World's End	PG-13	2007.0	169.0	Gore Verbinski	563.0	Johnny Depp	40000.0
2	Spectre	PG-13	2015.0	148.0	Sam Mendes	0.0	Christoph Waltz	11000.0
3	The Dark Knight Rises	PG-13	2012.0	164.0	Christopher Nolan	22000.0	Tom Hardy	27000.0
4	Star Wars: Episode VII - The Force Awakens	NaN	NaN	NaN	Doug Walker	131.0	Doug Walker	131.0
...	...	...	...	...	...	...	...	...
14743	Signed Sealed Delivered	NaN	2013.0	87.0	NaN	NaN	Crystal Lowe	319.0
14744	The Following	TV-14	NaN	43.0	NaN	NaN	Sam Underwood	319.0
14745	A Plague So Pleasant	NaN	2013.0	76.0	NaN	NaN	David Chandler	0.0
14746	Shanghai Calling	PG-13	2012.0	100.0	NaN	NaN	Eliza Coupe	490.0
14747	My Date with Drew	PG	2004.0	90.0	NaN	NaN	Jon Gunn	16.0

14748 rows × 8 columns

Based on the same first come first serve approach, director_1 is paired with director_fb_likes_1, actor_1 is paired with actor_fb_likes_1; the pairing continues for the rest.

This could also have been solved using the .value approach:

(
    movies.pivot_longer(
        index=slice("title", "duration"), names_to=".value", names_pattern="(.+)_\\d"
    )
)

	title	rating	year	duration	director	director_fb_likes	actor	actor_fb_likes
0	Avatar	PG-13	2009.0	178.0	James Cameron	0.0	CCH Pounder	1000.0
1	Pirates of the Caribbean: At World's End	PG-13	2007.0	169.0	Gore Verbinski	563.0	Johnny Depp	40000.0
2	Spectre	PG-13	2015.0	148.0	Sam Mendes	0.0	Christoph Waltz	11000.0
3	The Dark Knight Rises	PG-13	2012.0	164.0	Christopher Nolan	22000.0	Tom Hardy	27000.0
4	Star Wars: Episode VII - The Force Awakens	NaN	NaN	NaN	Doug Walker	131.0	Doug Walker	131.0
...	...	...	...	...	...	...	...	...
14743	Signed Sealed Delivered	NaN	2013.0	87.0	NaN	NaN	Crystal Lowe	319.0
14744	The Following	TV-14	NaN	43.0	NaN	NaN	Sam Underwood	319.0
14745	A Plague So Pleasant	NaN	2013.0	76.0	NaN	NaN	David Chandler	0.0
14746	Shanghai Calling	PG-13	2012.0	100.0	NaN	NaN	Eliza Coupe	490.0
14747	My Date with Drew	PG	2004.0	90.0	NaN	NaN	Jon Gunn	16.0

14748 rows × 8 columns

Again, it is all about identifying the pattern, and using whichever option is more suitable/convenient. One more example:

from pandas import NA

treatments = dict(
    id=range(1, 6),
    A=("A", NA, "A", NA, NA),
    A_date=(1, NA, 2, NA, NA),
    B=(NA, "B", "B", NA, NA),
    B_date=(NA, 3, 2, NA, NA),
    other=(NA, NA, NA, "C", "D"),
    other_date=(NA, NA, NA, 1, 5),
)
treatments = pd.DataFrame(treatments)
treatments

	id	A	A_date	B	B_date	other	other_date
0	1	A	1	<NA>	<NA>	<NA>	<NA>
1	2	<NA>	<NA>	B	3	<NA>	<NA>
2	3	A	2	B	2	<NA>	<NA>
3	4	<NA>	<NA>	<NA>	<NA>	C	1
4	5	<NA>	<NA>	<NA>	<NA>	D	5

The goal is to transform the dataframe into three columns, an id column, a treatment column, and a date column. Since we are grouping similar columns, we pass a list of regular expressions to names_pattern:

treatments.pivot_longer(
    index="id",
    names_to=["treatment", "date"],
    names_pattern=["A$|B$|other$", ".+date$"],
    dropna=True,
)

	id	treatment	date
0	1	A	1
1	3	A	2
2	2	B	3
3	3	B	2
4	4	C	1
5	5	D	5

Another way is to access the dates before the treatments, by reversing the order of the regular expressions:

treatments.pivot_longer(
    index="id",
    names_to=["date", "treatment"],
    names_pattern=[".+date$", ".+"],
    dropna=True,
)

	id	treatment	date
0	1	A	1
1	3	A	2
2	2	B	3
3	3	B	2
4	4	C	1
5	5	D	5

Pattern 3B - Group similar columns and values together#

This is best illustrated with an example:

df = pd.DataFrame(
    {
        "City": ["Houston", "Austin", "Hoover"],
        "State": ["Texas", "Texas", "Alabama"],
        "Name": ["Aria", "Penelope", "Niko"],
        "Mango": [4, 10, 90],
        "Orange": [10, 8, 14],
        "Watermelon": [40, 99, 43],
        "Gin": [16, 200, 34],
        "Vodka": [20, 33, 18],
    },
    columns=["City", "State", "Name", "Mango", "Orange", "Watermelon", "Gin", "Vodka"],
)


df

	City	State	Name	Mango	Orange	Watermelon	Gin	Vodka
0	Houston	Texas	Aria	4	10	40	16	20
1	Austin	Texas	Penelope	10	8	99	200	33
2	Hoover	Alabama	Niko	90	14	43	34	18

The goal in this case is to group similar columns and values together - group the drinks into one column, group the fruits into another column. When that is done, group the values for the drinks into one column, and the values for the fruits into another column. That’s the pattern, and for this pattern we continue with the idea from earlier, by passing a list of regular expressions to names_pattern, an equal number of names to names_to, and an equal number of names to values_to:

df.pivot_longer(
    index=["City", "State"],
    column_names=slice("Mango", "Vodka"),
    names_to=("Fruit", "Drink"),
    values_to=("Pounds", "Ounces"),
    names_pattern=[r"M|O|W", r"G|V"],
)

	City	State	Fruit	Drink	Pounds	Ounces
0	Houston	Texas	Mango	Gin	4	16.0
1	Austin	Texas	Mango	Gin	10	200.0
2	Hoover	Alabama	Mango	Gin	90	34.0
3	Houston	Texas	Orange	Vodka	10	20.0
4	Austin	Texas	Orange	Vodka	8	33.0
5	Hoover	Alabama	Orange	Vodka	14	18.0
6	Houston	Texas	Watermelon	None	40	NaN
7	Austin	Texas	Watermelon	None	99	NaN
8	Hoover	Alabama	Watermelon	None	43	NaN

Simple, straightforward, to the point, and efficient. Let’s look at another example:

df = {
    "State": ["CA", "CA", "FL", "FL"],
    "ItemN": [1, 2, 3, 4],
    "x1": [6, 7, 3, 9],
    "x2": [4, 3, 2, 4],
    "x3": [3, 1, 1, 2],
    "y1": [7, 15, 5, 16],
    "y2": [5, 10, 3, 14],
    "y3": [3, 5, 2, 12],
    "z1": [11, 4, 13, 14],
    "z2": [5, 2, 7, 5],
    "z3": [1, 1, 2, 4],
}

df = pd.DataFrame(df)

df

	State	ItemN	x1	x2	x3	y1	y2	y3	z1	z2	z3
0	CA	1	6	4	3	7	5	3	11	5	1
1	CA	2	7	3	1	15	10	5	4	2	1
2	FL	3	3	2	1	5	3	2	13	7	2
3	FL	4	9	4	2	16	14	12	14	5	4

The goal is to group all the x columns into one, the y columns into one column, and group the associated values into individual columns as well. We are grouping similar columns and values - that’s the pattern - again we pass a list of regular expressions to names_pattern, a list of names to names_to, and a list of names to values_to:

df.pivot_longer(
    index=["State", "ItemN"],
    names_to=["x", "y", "z"],
    values_to=["xvalue", "yvalue", "zvalue"],
    names_pattern=["x", "y", "z"],
)

	State	ItemN	x	y	z	xvalue	yvalue	zvalue
0	CA	1	x1	y1	z1	6	7	11
1	CA	2	x1	y1	z1	7	15	4
2	FL	3	x1	y1	z1	3	5	13
3	FL	4	x1	y1	z1	9	16	14
4	CA	1	x2	y2	z2	4	5	5
5	CA	2	x2	y2	z2	3	10	2
6	FL	3	x2	y2	z2	2	3	7
7	FL	4	x2	y2	z2	4	14	5
8	CA	1	x3	y3	z3	3	3	1
9	CA	2	x3	y3	z3	1	5	1
10	FL	3	x3	y3	z3	1	2	2
11	FL	4	x3	y3	z3	2	12	4

Performance#

pivot_longer is not just simple, convenient, and straightforward, it is efficient as well:

%timeit religion.pivot_longer(index = 'religion', names_to = 'income', values_to = 'count')

488 μs ± 18 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit religion.melt(id_vars = 'religion', var_name = 'income', value_name = 'count')

573 μs ± 3.03 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Sample data, using 1000 columns (you can personally increase the number of columns to 1e6 or 1e8, depending on the power of your CPU):

sparse_wide = pd.read_csv("Data_files/sparse-wide.csv")
sparse_wide.head()

	id	grp	x1	x2	x3	x4	x5	x6	x7	x8	...	x991	x992	x993	x994	x995	x996	x997	x998	x999	x1000
0	1	1000	A	I	A	M	O	E	F	D	...	C	L	G	I	R	X	O	W	L	L
1	2	528	I	X	U	Q	B	X	I	J	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	3	151	J	Q	H	F	B	X	B	F	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	4	343	P	A	K	K	N	R	E	E	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	5	460	L	D	G	X	I	O	C	W	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 1002 columns

%timeit sparse_wide.melt(id_vars = ['id', 'grp'])

50 ms ± 313 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit sparse_wide.pivot_longer(index = ['id', 'grp'])

15.4 ms ± 65 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Let’s test on the movies dataset:

wl = pd.wide_to_long(
    movies,
    stubnames=["director", "director_fb_likes", "actor", "actor_fb_likes"],
    i=["title", "rating", "year", "duration"],
    j="num",
    sep="_",
)
wl

					director	director_fb_likes	actor	actor_fb_likes
title	rating	year	duration	num
Avatar	PG-13	2009.0	178.0	1	James Cameron	0.0	CCH Pounder	1000.0
				2	NaN	NaN	Joel David Moore	936.0
				3	NaN	NaN	Wes Studi	855.0
Pirates of the Caribbean: At World's End	PG-13	2007.0	169.0	1	Gore Verbinski	563.0	Johnny Depp	40000.0
Pirates of the Caribbean: At World's End	PG-13	2007.0	169.0	2	NaN	NaN	Orlando Bloom	5000.0
...	...	...	...	...	...	...	...	...
Shanghai Calling	PG-13	2012.0	100.0	2	NaN	NaN	Daniel Henney	719.0
Shanghai Calling	PG-13	2012.0	100.0	3	NaN	NaN	Eliza Coupe	490.0
My Date with Drew	PG	2004.0	90.0	1	Jon Gunn	16.0	John August	86.0
				2	NaN	NaN	Brian Herzlinger	23.0
				3	NaN	NaN	Jon Gunn	16.0

14748 rows × 4 columns

A more efficient approach than pd.wide_to_long, using pd.stack:

index = ["title", "rating", "year", "duration"]
vid = movies.set_index(index)
vid.columns = vid.columns.str.rsplit("_", n=1, expand=True)
vid.columns.names = [None, "num"]
vid = vid.stack(level="num", future_stack=True)
vid

					director	director_fb_likes	actor	actor_fb_likes
title	rating	year	duration	num
Avatar	PG-13	2009.0	178.0	1	James Cameron	0.0	CCH Pounder	1000.0
				2	NaN	NaN	Joel David Moore	936.0
				3	NaN	NaN	Wes Studi	855.0
Pirates of the Caribbean: At World's End	PG-13	2007.0	169.0	1	Gore Verbinski	563.0	Johnny Depp	40000.0
Pirates of the Caribbean: At World's End	PG-13	2007.0	169.0	2	NaN	NaN	Orlando Bloom	5000.0
...	...	...	...	...	...	...	...	...
Shanghai Calling	PG-13	2012.0	100.0	2	NaN	NaN	Daniel Henney	719.0
Shanghai Calling	PG-13	2012.0	100.0	3	NaN	NaN	Eliza Coupe	490.0
My Date with Drew	PG	2004.0	90.0	1	Jon Gunn	16.0	John August	86.0
				2	NaN	NaN	Brian Herzlinger	23.0
				3	NaN	NaN	Jon Gunn	16.0

14748 rows × 4 columns

out = movies.pivot_longer(
    index=slice("title", "duration"),
    names_to=".value",
    names_pattern="(.+)_\\d",
    sort_by_appearance=True,
)

out

	title	rating	year	duration	director	director_fb_likes	actor	actor_fb_likes
0	Avatar	PG-13	2009.0	178.0	James Cameron	0.0	CCH Pounder	1000.0
1	Avatar	PG-13	2009.0	178.0	NaN	NaN	Joel David Moore	936.0
2	Avatar	PG-13	2009.0	178.0	NaN	NaN	Wes Studi	855.0
3	Pirates of the Caribbean: At World's End	PG-13	2007.0	169.0	Gore Verbinski	563.0	Johnny Depp	40000.0
4	Pirates of the Caribbean: At World's End	PG-13	2007.0	169.0	NaN	NaN	Orlando Bloom	5000.0
...	...	...	...	...	...	...	...	...
14743	Shanghai Calling	PG-13	2012.0	100.0	NaN	NaN	Daniel Henney	719.0
14744	Shanghai Calling	PG-13	2012.0	100.0	NaN	NaN	Eliza Coupe	490.0
14745	My Date with Drew	PG	2004.0	90.0	Jon Gunn	16.0	John August	86.0
14746	My Date with Drew	PG	2004.0	90.0	NaN	NaN	Brian Herzlinger	23.0
14747	My Date with Drew	PG	2004.0	90.0	NaN	NaN	Jon Gunn	16.0

14748 rows × 8 columns

wl = wl.droplevel("num").reset_index()
wl.equals(out)

True

vid = (
    vid.droplevel("num")
    .loc[:, ["director", "director_fb_likes", "actor", "actor_fb_likes"]]
    .reset_index()
)

wl.equals(vid)

True

vid.equals(out)

True

%%timeit 
pd.wide_to_long(
    movies, 
    stubnames=['director', 'director_fb_likes', 'actor', 'actor_fb_likes'], 
    i = ['title', 'rating', 'year', 'duration'], 
    j = 'num', 
    sep='_')

51.8 ms ± 293 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
index = ['title', 'rating', 'year', 'duration']
vid = movies.set_index(index)
vid.columns = vid.columns.str.rsplit("_", n = 1, expand = True)
vid.stack(level = 1, future_stack=True)

5.15 ms ± 174 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
(movies
.pivot_longer(
    index = slice('title', 'duration'),
    names_to = ".value",
    names_pattern = "(.+)_\\d",
    sort_by_appearance = True)
)

1.66 ms ± 56.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
movies.pivot_longer(
    index = slice('title', 'duration'),
    names_to = ['director', 'director_fb_likes', 'actor', 'actor_fb_likes'],
    names_pattern = ['director_\\d$', 'director_fb_likes', 'actor_\\d$', 'actor.+'],
    sort_by_appearance = True
)

1.9 ms ± 65.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Your mileage may vary with these speed tests.

Summary#

This blog post shows various ways to reshape data into long form, using pivot_longer, based on some common patterns. Once the pattern is identified, use the appropriate syntax for pivot_longer. It is meant to be simple, straightforward, and efficient. For more control, have a look at the pivot_longer_spec

Pivot_longer : Reshape Data in Pandas Efficiently and with Ease from Wide to Long

Contents

Pivot_longer : Reshape Data in Pandas Efficiently and with Ease from Wide to Long#

Introduction#

Basics#

Pattern 1 - Melt all Columns#

Pattern 2A - Melt into multiple columns - new column names#

Pattern 2B - Melt into multiple columns - Use variables within the column names as new names#

Pattern 3A - Group similar columns together#

Pattern 3B - Group similar columns and values together#

Performance#

Summary#

Comments#

	State	ItemN	x1	x2	x3	y1	y2	y3	z1	z2	z3
0	CA	1	6	4	3	7	5	3	11	5	1
1	CA	2	7	3	1	15	10	5	4	2	1
2	FL	3	3	2	1	5	3	2	13	7	2
3	FL	4	9	4	2	16	14	12	14	5	4

	State	ItemN	x	y	z	xvalue	yvalue	zvalue
0	CA	1	x1	y1	z1	6	7	11
1	CA	2	x1	y1	z1	7	15	4
2	FL	3	x1	y1	z1	3	5	13
3	FL	4	x1	y1	z1	9	16	14
4	CA	1	x2	y2	z2	4	5	5
5	CA	2	x2	y2	z2	3	10	2
6	FL	3	x2	y2	z2	2	3	7
7	FL	4	x2	y2	z2	4	14	5
8	CA	1	x3	y3	z3	3	3	1
9	CA	2	x3	y3	z3	1	5	1
10	FL	3	x3	y3	z3	1	2	2
11	FL	4	x3	y3	z3	2	12	4

	State	ItemN	x1	x2	x3	y1	y2	y3	z1	z2	z3
0	CA	1	6	4	3	7	5	3	11	5	1
1	CA	2	7	3	1	15	10	5	4	2	1
2	FL	3	3	2	1	5	3	2	13	7	2
3	FL	4	9	4	2	16	14	12	14	5	4

	State	ItemN	x	y	z	xvalue	yvalue	zvalue
0	CA	1	x1	y1	z1	6	7	11
1	CA	2	x1	y1	z1	7	15	4
2	FL	3	x1	y1	z1	3	5	13
3	FL	4	x1	y1	z1	9	16	14
4	CA	1	x2	y2	z2	4	5	5
5	CA	2	x2	y2	z2	3	10	2
6	FL	3	x2	y2	z2	2	3	7
7	FL	4	x2	y2	z2	4	14	5
8	CA	1	x3	y3	z3	3	3	1
9	CA	2	x3	y3	z3	1	5	1
10	FL	3	x3	y3	z3	1	2	2
11	FL	4	x3	y3	z3	2	12	4

	State	ItemN	x1	x2	x3	y1	y2	y3	z1	z2	z3
0	CA	1	6	4	3	7	5	3	11	5	1
1	CA	2	7	3	1	15	10	5	4	2	1
2	FL	3	3	2	1	5	3	2	13	7	2
3	FL	4	9	4	2	16	14	12	14	5	4

	State	ItemN	x	y	z	xvalue	yvalue	zvalue
0	CA	1	x1	y1	z1	6	7	11
1	CA	2	x1	y1	z1	7	15	4
2	FL	3	x1	y1	z1	3	5	13
3	FL	4	x1	y1	z1	9	16	14
4	CA	1	x2	y2	z2	4	5	5
5	CA	2	x2	y2	z2	3	10	2
6	FL	3	x2	y2	z2	2	3	7
7	FL	4	x2	y2	z2	4	14	5
8	CA	1	x3	y3	z3	3	3	1
9	CA	2	x3	y3	z3	1	5	1
10	FL	3	x3	y3	z3	1	2	2
11	FL	4	x3	y3	z3	2	12	4