{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pivot_longer : Reshape Data in Pandas Efficiently and with Ease from Wide to Long"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(introduction)=\n",
"## **Introduction**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You've got data in a wide form (lots of columns), and you wish to transform it into a long form (fewer columns, more rows). In pandas, depending on the form of the data, this can be achieved with either `pd.stack`, `pd.melt`, `pd.wide_to_long`, and sometimes require a bit of data massaging before arriving at the final, desired form.\n",
"\n",
"This article suggests a simple, efficient, opinionated way, that resolves most wide to long transformations, within a single function, using the [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) function from [pyjanitor](https://pyjanitor-devs.github.io/pyjanitor/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at an example, based on real data - this is a subset of the [who](https://github.com/tidyverse/tidyr/blob/main/data-raw/who.csv) data from R:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" pandas version : 2.2.2 \n",
" numpy version : 2.0.2 \n",
" python version : 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import sys\n",
"\n",
"print(\n",
" \" pandas version :\",\n",
" pd.__version__,\n",
" \"\\n\",\n",
" \"numpy version :\",\n",
" np.__version__,\n",
" \"\\n\",\n",
" \"python version :\",\n",
" sys.version,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" new_sp_m5564 | \n",
" newrel_f65 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id new_sp_m5564 newrel_f65\n",
"0 1 2 3"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({\"id\": [1], \"new_sp_m5564\": [2], \"newrel_f65\": [3]})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the dataframe above, we have multiple variables crammed into the column names. Using the definitions as described [here](https://tidyr.tidyverse.org/articles/pivot.html#many-variables-in-column-names): \n",
"\n",
" - new_/new prefix indicates these are counts of new cases\n",
" - sp/rel/ep describe how the case was diagnosed\n",
" - m/f gives the gender\n",
" - 5564/65 supplies the age range\n",
"\n",
"Our goal here is to separate the dataframe into individual columns of `count`, `diagnosis`,`gender`, `age`. How do we pull this off, with the current tools in Pandas? \n",
"\n",
"`pd.wide_to_long` is not applicable here, as there is no unique separator. Our best option would be to use a regular expression to pull out the individual columns."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With `pd.stack`:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" diagnosis | \n",
" gender | \n",
" age | \n",
" count | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" sp | \n",
" m | \n",
" 5564 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" rel | \n",
" f | \n",
" 65 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id diagnosis gender age count\n",
"0 1 sp m 5564 2\n",
"1 1 rel f 65 3"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# split the columns using a regex, before stacking:\n",
"DF = df.set_index(\"id\")\n",
"regex = r\"new_?(.+)_(.)(\\d+)\"\n",
"columns = DF.columns.str.extract(regex)\n",
"new_names = [\"diagnosis\", \"gender\", \"age\"]\n",
"columns = pd.MultiIndex.from_frame(columns, names=new_names)\n",
"DF.columns = columns\n",
"DF.stack(new_names, future_stack=True).rename(\"count\").reset_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With `pd.melt`:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" value | \n",
" diagnosis | \n",
" gender | \n",
" age | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" sp | \n",
" m | \n",
" 5564 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 3 | \n",
" rel | \n",
" f | \n",
" 65 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id value diagnosis gender age\n",
"0 1 2 sp m 5564\n",
"1 1 3 rel f 65"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DF = df.melt(\"id\")\n",
"out = DF.variable.str.extract(regex).set_axis(new_names, axis=1)\n",
"pd.concat([DF.drop(columns=\"variable\"), out], axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note the extra steps we had to take to get to our final form. However, we know the pattern (a regex that matches the columns and the sub groups in the columns that we wish to extract), so why not take advantage of that? This is where [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) from [pyjanitor](https://pyjanitor-devs.github.io/pyjanitor/) shines:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" diagnosis | \n",
" gender | \n",
" age | \n",
" value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" sp | \n",
" m | \n",
" 5564 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" rel | \n",
" f | \n",
" 65 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id diagnosis gender age value\n",
"0 1 sp m 5564 2\n",
"1 1 rel f 65 3"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# pip install pyjanitor\n",
"import janitor\n",
"\n",
"df.pivot_longer(index=\"id\", names_to=new_names, names_pattern=regex)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Straight to the point. If there is a pattern, [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) will figure out what to do and deliver the results. It is a simple, efficient, reshaping abstraction."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Basics**\n",
"\n",
"Let's walk through some of the parameters in [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer):\n",
"\n",
"- index : These are the columns that will not be transformed to long form - the values are repeated as necessary.\n",
"- column_names: These are the columns that will be transformed to long form.\n",
"- names_to : These will be the new column names of the reshaped DataFrame.\n",
"- values_to: These will contain the values collated from the original DataFrame.\n",
"- names_sep: extract individual columns with a separator.\n",
"- names_pattern: extract individual columns with a regular expression, or a list of regular expressions.\n",
"- names_transform: efficiently convert the columns aggregated from the `column_names` from string to another dtype, as long as it is supported by `pd.astype`\n",
"\n",
"Keep in mind that [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) works on patterns - deciphering what pattern to use is the key to its simplicity. \n",
"\n",
"The rest of the article outlines the various patterns that [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) works with."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Pattern 1 - Melt all Columns**\n",
"\n",
"This is basic and is similar to what you get with `pd.melt`.\n",
"\n",
"Let's use another realistic dataset - [relig_income](https://github.com/tidyverse/tidyr/blob/main/data-raw/relig_income.csv) - the original example is from [here](https://tidyr.tidyverse.org/articles/pivot.html#pew):"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" religion | \n",
" <$10k | \n",
" $10-20k | \n",
" $20-30k | \n",
" $30-40k | \n",
" $40-50k | \n",
" $50-75k | \n",
" $75-100k | \n",
" $100-150k | \n",
" >150k | \n",
" Don't know/refused | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Agnostic | \n",
" 27 | \n",
" 34 | \n",
" 60 | \n",
" 81 | \n",
" 76 | \n",
" 137 | \n",
" 122 | \n",
" 109 | \n",
" 84 | \n",
" 96 | \n",
"
\n",
" \n",
" 1 | \n",
" Atheist | \n",
" 12 | \n",
" 27 | \n",
" 37 | \n",
" 52 | \n",
" 35 | \n",
" 70 | \n",
" 73 | \n",
" 59 | \n",
" 74 | \n",
" 76 | \n",
"
\n",
" \n",
" 2 | \n",
" Buddhist | \n",
" 27 | \n",
" 21 | \n",
" 30 | \n",
" 34 | \n",
" 33 | \n",
" 58 | \n",
" 62 | \n",
" 39 | \n",
" 53 | \n",
" 54 | \n",
"
\n",
" \n",
" 3 | \n",
" Catholic | \n",
" 418 | \n",
" 617 | \n",
" 732 | \n",
" 670 | \n",
" 638 | \n",
" 1116 | \n",
" 949 | \n",
" 792 | \n",
" 633 | \n",
" 1489 | \n",
"
\n",
" \n",
" 4 | \n",
" Don’t know/refused | \n",
" 15 | \n",
" 14 | \n",
" 15 | \n",
" 11 | \n",
" 10 | \n",
" 35 | \n",
" 21 | \n",
" 17 | \n",
" 18 | \n",
" 116 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k \\\n",
"0 Agnostic 27 34 60 81 76 137 \n",
"1 Atheist 12 27 37 52 35 70 \n",
"2 Buddhist 27 21 30 34 33 58 \n",
"3 Catholic 418 617 732 670 638 1116 \n",
"4 Don’t know/refused 15 14 15 11 10 35 \n",
"\n",
" $75-100k $100-150k >150k Don't know/refused \n",
"0 122 109 84 96 \n",
"1 73 59 74 76 \n",
"2 62 39 53 54 \n",
"3 949 792 633 1489 \n",
"4 21 17 18 116 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = \"https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/relig_income.csv\"\n",
"religion = pd.read_csv(url)\n",
"religion.head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" religion | \n",
" income | \n",
" count | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Agnostic | \n",
" <$10k | \n",
" 27 | \n",
"
\n",
" \n",
" 1 | \n",
" Atheist | \n",
" <$10k | \n",
" 12 | \n",
"
\n",
" \n",
" 2 | \n",
" Buddhist | \n",
" <$10k | \n",
" 27 | \n",
"
\n",
" \n",
" 3 | \n",
" Catholic | \n",
" <$10k | \n",
" 418 | \n",
"
\n",
" \n",
" 4 | \n",
" Don’t know/refused | \n",
" <$10k | \n",
" 15 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 175 | \n",
" Orthodox | \n",
" Don't know/refused | \n",
" 73 | \n",
"
\n",
" \n",
" 176 | \n",
" Other Christian | \n",
" Don't know/refused | \n",
" 18 | \n",
"
\n",
" \n",
" 177 | \n",
" Other Faiths | \n",
" Don't know/refused | \n",
" 71 | \n",
"
\n",
" \n",
" 178 | \n",
" Other World Religions | \n",
" Don't know/refused | \n",
" 8 | \n",
"
\n",
" \n",
" 179 | \n",
" Unaffiliated | \n",
" Don't know/refused | \n",
" 597 | \n",
"
\n",
" \n",
"
\n",
"
180 rows × 3 columns
\n",
"
"
],
"text/plain": [
" religion income count\n",
"0 Agnostic <$10k 27\n",
"1 Atheist <$10k 12\n",
"2 Buddhist <$10k 27\n",
"3 Catholic <$10k 418\n",
"4 Don’t know/refused <$10k 15\n",
".. ... ... ...\n",
"175 Orthodox Don't know/refused 73\n",
"176 Other Christian Don't know/refused 18\n",
"177 Other Faiths Don't know/refused 71\n",
"178 Other World Religions Don't know/refused 8\n",
"179 Unaffiliated Don't know/refused 597\n",
"\n",
"[180 rows x 3 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"religion.pivot_longer(index=\"religion\", names_to=\"income\", values_to=\"count\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If there are a lot of columns, we can dynamically select the `index`, or `column_names`, using the [janitor.select](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.select.select) syntax. Let's apply it to another [dataset](https://github.com/tidyverse/tidyr/blob/main/data-raw/billboard.csv) related to billboard charts - original example is [here](https://tidyr.tidyverse.org/articles/pivot.html#billboard): "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" artist | \n",
" track | \n",
" time | \n",
" date.entered | \n",
" wk1 | \n",
" wk2 | \n",
" wk3 | \n",
" wk4 | \n",
" wk5 | \n",
" ... | \n",
" wk67 | \n",
" wk68 | \n",
" wk69 | \n",
" wk70 | \n",
" wk71 | \n",
" wk72 | \n",
" wk73 | \n",
" wk74 | \n",
" wk75 | \n",
" wk76 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2000 | \n",
" 2 Pac | \n",
" Baby Don't Cry (Keep... | \n",
" 4:22 | \n",
" 2000-02-26 | \n",
" 87 | \n",
" 82.0 | \n",
" 72.0 | \n",
" 77.0 | \n",
" 87.0 | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 2000 | \n",
" 2Ge+her | \n",
" The Hardest Part Of ... | \n",
" 3:15 | \n",
" 2000-09-02 | \n",
" 91 | \n",
" 87.0 | \n",
" 92.0 | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 2000 | \n",
" 3 Doors Down | \n",
" Kryptonite | \n",
" 3:53 | \n",
" 2000-04-08 | \n",
" 81 | \n",
" 70.0 | \n",
" 68.0 | \n",
" 67.0 | \n",
" 66.0 | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 2000 | \n",
" 3 Doors Down | \n",
" Loser | \n",
" 4:24 | \n",
" 2000-10-21 | \n",
" 76 | \n",
" 76.0 | \n",
" 72.0 | \n",
" 69.0 | \n",
" 67.0 | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 2000 | \n",
" 504 Boyz | \n",
" Wobble Wobble | \n",
" 3:35 | \n",
" 2000-04-15 | \n",
" 57 | \n",
" 34.0 | \n",
" 25.0 | \n",
" 17.0 | \n",
" 17.0 | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 81 columns
\n",
"
"
],
"text/plain": [
" year artist track time date.entered wk1 wk2 \\\n",
"0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 87 82.0 \n",
"1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 91 87.0 \n",
"2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 81 70.0 \n",
"3 2000 3 Doors Down Loser 4:24 2000-10-21 76 76.0 \n",
"4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 57 34.0 \n",
"\n",
" wk3 wk4 wk5 ... wk67 wk68 wk69 wk70 wk71 wk72 wk73 wk74 \\\n",
"0 72.0 77.0 87.0 ... NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1 92.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN \n",
"2 68.0 67.0 66.0 ... NaN NaN NaN NaN NaN NaN NaN NaN \n",
"3 72.0 69.0 67.0 ... NaN NaN NaN NaN NaN NaN NaN NaN \n",
"4 25.0 17.0 17.0 ... NaN NaN NaN NaN NaN NaN NaN NaN \n",
"\n",
" wk75 wk76 \n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
"[5 rows x 81 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = \"https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/billboard.csv\"\n",
"billboard = pd.read_csv(url)\n",
"billboard.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The columns that will be flipped to long form are the columns that start with `wk` - these will be passed to the `column_names` parameter. That is a lot of columns. Let's see how [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) simplifies this:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" artist | \n",
" track | \n",
" time | \n",
" date.entered | \n",
" week | \n",
" value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2000 | \n",
" 2 Pac | \n",
" Baby Don't Cry (Keep... | \n",
" 4:22 | \n",
" 2000-02-26 | \n",
" wk1 | \n",
" 87.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2000 | \n",
" 2Ge+her | \n",
" The Hardest Part Of ... | \n",
" 3:15 | \n",
" 2000-09-02 | \n",
" wk1 | \n",
" 91.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 2000 | \n",
" 3 Doors Down | \n",
" Kryptonite | \n",
" 3:53 | \n",
" 2000-04-08 | \n",
" wk1 | \n",
" 81.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 2000 | \n",
" 3 Doors Down | \n",
" Loser | \n",
" 4:24 | \n",
" 2000-10-21 | \n",
" wk1 | \n",
" 76.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 2000 | \n",
" 504 Boyz | \n",
" Wobble Wobble | \n",
" 3:35 | \n",
" 2000-04-15 | \n",
" wk1 | \n",
" 57.0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 24087 | \n",
" 2000 | \n",
" Yankee Grey | \n",
" Another Nine Minutes | \n",
" 3:10 | \n",
" 2000-04-29 | \n",
" wk76 | \n",
" NaN | \n",
"
\n",
" \n",
" 24088 | \n",
" 2000 | \n",
" Yearwood, Trisha | \n",
" Real Live Woman | \n",
" 3:55 | \n",
" 2000-04-01 | \n",
" wk76 | \n",
" NaN | \n",
"
\n",
" \n",
" 24089 | \n",
" 2000 | \n",
" Ying Yang Twins | \n",
" Whistle While You Tw... | \n",
" 4:19 | \n",
" 2000-03-18 | \n",
" wk76 | \n",
" NaN | \n",
"
\n",
" \n",
" 24090 | \n",
" 2000 | \n",
" Zombie Nation | \n",
" Kernkraft 400 | \n",
" 3:30 | \n",
" 2000-09-02 | \n",
" wk76 | \n",
" NaN | \n",
"
\n",
" \n",
" 24091 | \n",
" 2000 | \n",
" matchbox twenty | \n",
" Bent | \n",
" 4:12 | \n",
" 2000-04-29 | \n",
" wk76 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
24092 rows × 7 columns
\n",
"
"
],
"text/plain": [
" year artist track time date.entered \\\n",
"0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 \n",
"1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 \n",
"2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 \n",
"3 2000 3 Doors Down Loser 4:24 2000-10-21 \n",
"4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 \n",
"... ... ... ... ... ... \n",
"24087 2000 Yankee Grey Another Nine Minutes 3:10 2000-04-29 \n",
"24088 2000 Yearwood, Trisha Real Live Woman 3:55 2000-04-01 \n",
"24089 2000 Ying Yang Twins Whistle While You Tw... 4:19 2000-03-18 \n",
"24090 2000 Zombie Nation Kernkraft 400 3:30 2000-09-02 \n",
"24091 2000 matchbox twenty Bent 4:12 2000-04-29 \n",
"\n",
" week value \n",
"0 wk1 87.0 \n",
"1 wk1 91.0 \n",
"2 wk1 81.0 \n",
"3 wk1 76.0 \n",
"4 wk1 57.0 \n",
"... ... ... \n",
"24087 wk76 NaN \n",
"24088 wk76 NaN \n",
"24089 wk76 NaN \n",
"24090 wk76 NaN \n",
"24091 wk76 NaN \n",
"\n",
"[24092 rows x 7 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"billboard.pivot_longer(column_names=\"wk*\", names_to=\"week\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want to use the `index` parameter instead, that's easy as well: "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" artist | \n",
" track | \n",
" time | \n",
" date.entered | \n",
" week | \n",
" value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2000 | \n",
" 2 Pac | \n",
" Baby Don't Cry (Keep... | \n",
" 4:22 | \n",
" 2000-02-26 | \n",
" wk1 | \n",
" 87.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2000 | \n",
" 2Ge+her | \n",
" The Hardest Part Of ... | \n",
" 3:15 | \n",
" 2000-09-02 | \n",
" wk1 | \n",
" 91.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 2000 | \n",
" 3 Doors Down | \n",
" Kryptonite | \n",
" 3:53 | \n",
" 2000-04-08 | \n",
" wk1 | \n",
" 81.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 2000 | \n",
" 3 Doors Down | \n",
" Loser | \n",
" 4:24 | \n",
" 2000-10-21 | \n",
" wk1 | \n",
" 76.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 2000 | \n",
" 504 Boyz | \n",
" Wobble Wobble | \n",
" 3:35 | \n",
" 2000-04-15 | \n",
" wk1 | \n",
" 57.0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 24087 | \n",
" 2000 | \n",
" Yankee Grey | \n",
" Another Nine Minutes | \n",
" 3:10 | \n",
" 2000-04-29 | \n",
" wk76 | \n",
" NaN | \n",
"
\n",
" \n",
" 24088 | \n",
" 2000 | \n",
" Yearwood, Trisha | \n",
" Real Live Woman | \n",
" 3:55 | \n",
" 2000-04-01 | \n",
" wk76 | \n",
" NaN | \n",
"
\n",
" \n",
" 24089 | \n",
" 2000 | \n",
" Ying Yang Twins | \n",
" Whistle While You Tw... | \n",
" 4:19 | \n",
" 2000-03-18 | \n",
" wk76 | \n",
" NaN | \n",
"
\n",
" \n",
" 24090 | \n",
" 2000 | \n",
" Zombie Nation | \n",
" Kernkraft 400 | \n",
" 3:30 | \n",
" 2000-09-02 | \n",
" wk76 | \n",
" NaN | \n",
"
\n",
" \n",
" 24091 | \n",
" 2000 | \n",
" matchbox twenty | \n",
" Bent | \n",
" 4:12 | \n",
" 2000-04-29 | \n",
" wk76 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
24092 rows × 7 columns
\n",
"
"
],
"text/plain": [
" year artist track time date.entered \\\n",
"0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 \n",
"1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 \n",
"2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 \n",
"3 2000 3 Doors Down Loser 4:24 2000-10-21 \n",
"4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 \n",
"... ... ... ... ... ... \n",
"24087 2000 Yankee Grey Another Nine Minutes 3:10 2000-04-29 \n",
"24088 2000 Yearwood, Trisha Real Live Woman 3:55 2000-04-01 \n",
"24089 2000 Ying Yang Twins Whistle While You Tw... 4:19 2000-03-18 \n",
"24090 2000 Zombie Nation Kernkraft 400 3:30 2000-09-02 \n",
"24091 2000 matchbox twenty Bent 4:12 2000-04-29 \n",
"\n",
" week value \n",
"0 wk1 87.0 \n",
"1 wk1 91.0 \n",
"2 wk1 81.0 \n",
"3 wk1 76.0 \n",
"4 wk1 57.0 \n",
"... ... ... \n",
"24087 wk76 NaN \n",
"24088 wk76 NaN \n",
"24089 wk76 NaN \n",
"24090 wk76 NaN \n",
"24091 wk76 NaN \n",
"\n",
"[24092 rows x 7 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"billboard.pivot_longer(index=slice(\"year\", \"date.entered\"), names_to=\"week\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(pattern-2a)=\n",
"## **Pattern 2A - Melt into multiple columns - new column names**\n",
"\n",
"There are scenarios where multiple variables are crammed into the column names - we had a look at that already with the `who` [example](#introduction).\n",
"\n",
"Let's look at an example, adapted from [data.table vignette](https://rdatatable.gitlab.io/data.table/articles/datatable-reshape.html#a-enhanced-melt):"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sepal.Length | \n",
" Sepal.Width | \n",
" Petal.Length | \n",
" Petal.Width | \n",
" Species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" 1 | \n",
" 5.9 | \n",
" 3.0 | \n",
" 5.1 | \n",
" 1.8 | \n",
" virginica | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n",
"0 5.1 3.5 1.4 0.2 setosa\n",
"1 5.9 3.0 5.1 1.8 virginica"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris = pd.DataFrame(\n",
" {\n",
" \"Sepal.Length\": [5.1, 5.9],\n",
" \"Sepal.Width\": [3.5, 3.0],\n",
" \"Petal.Length\": [1.4, 5.1],\n",
" \"Petal.Width\": [0.2, 1.8],\n",
" \"Species\": [\"setosa\", \"virginica\"],\n",
" }\n",
")\n",
"\n",
"iris"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the data above, we wish to consolidate the columns into `part` and `dimensions` - the `Sepal/Petal` prefixes go into the `part` column, while `Width/Length` go into the `dimension` column. \n",
"\n",
"In this case, we can use either the `names_sep` or the `names_pattern` parameter - for this particular scenario, the `names_sep` option is a better fit: "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Species | \n",
" part | \n",
" dimension | \n",
" value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" setosa | \n",
" Sepal | \n",
" Length | \n",
" 5.1 | \n",
"
\n",
" \n",
" 1 | \n",
" virginica | \n",
" Sepal | \n",
" Length | \n",
" 5.9 | \n",
"
\n",
" \n",
" 2 | \n",
" setosa | \n",
" Sepal | \n",
" Width | \n",
" 3.5 | \n",
"
\n",
" \n",
" 3 | \n",
" virginica | \n",
" Sepal | \n",
" Width | \n",
" 3.0 | \n",
"
\n",
" \n",
" 4 | \n",
" setosa | \n",
" Petal | \n",
" Length | \n",
" 1.4 | \n",
"
\n",
" \n",
" 5 | \n",
" virginica | \n",
" Petal | \n",
" Length | \n",
" 5.1 | \n",
"
\n",
" \n",
" 6 | \n",
" setosa | \n",
" Petal | \n",
" Width | \n",
" 0.2 | \n",
"
\n",
" \n",
" 7 | \n",
" virginica | \n",
" Petal | \n",
" Width | \n",
" 1.8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Species part dimension value\n",
"0 setosa Sepal Length 5.1\n",
"1 virginica Sepal Length 5.9\n",
"2 setosa Sepal Width 3.5\n",
"3 virginica Sepal Width 3.0\n",
"4 setosa Petal Length 1.4\n",
"5 virginica Petal Length 5.1\n",
"6 setosa Petal Width 0.2\n",
"7 virginica Petal Width 1.8"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris.pivot_longer(index=\"Species\", names_to=[\"part\", \"dimension\"], names_sep=r\".\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using names_pattern:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Species | \n",
" part | \n",
" dimension | \n",
" value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" setosa | \n",
" Sepal | \n",
" Length | \n",
" 5.1 | \n",
"
\n",
" \n",
" 1 | \n",
" virginica | \n",
" Sepal | \n",
" Length | \n",
" 5.9 | \n",
"
\n",
" \n",
" 2 | \n",
" setosa | \n",
" Sepal | \n",
" Width | \n",
" 3.5 | \n",
"
\n",
" \n",
" 3 | \n",
" virginica | \n",
" Sepal | \n",
" Width | \n",
" 3.0 | \n",
"
\n",
" \n",
" 4 | \n",
" setosa | \n",
" Petal | \n",
" Length | \n",
" 1.4 | \n",
"
\n",
" \n",
" 5 | \n",
" virginica | \n",
" Petal | \n",
" Length | \n",
" 5.1 | \n",
"
\n",
" \n",
" 6 | \n",
" setosa | \n",
" Petal | \n",
" Width | \n",
" 0.2 | \n",
"
\n",
" \n",
" 7 | \n",
" virginica | \n",
" Petal | \n",
" Width | \n",
" 1.8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Species part dimension value\n",
"0 setosa Sepal Length 5.1\n",
"1 virginica Sepal Length 5.9\n",
"2 setosa Sepal Width 3.5\n",
"3 virginica Sepal Width 3.0\n",
"4 setosa Petal Length 1.4\n",
"5 virginica Petal Length 5.1\n",
"6 setosa Petal Width 0.2\n",
"7 virginica Petal Width 1.8"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris.pivot_longer(\n",
" index=\"Species\", names_to=[\"part\", \"dimension\"], names_pattern=r\"(.+)\\.(.+)\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Pattern 2B - Melt into multiple columns - Use variables within the column names as new names**\n",
"\n",
"Instead of new column names, what if we wanted to use existing variables within the column names? \n",
"\n",
"Let's look at an example adapted from [data.table vignette](https://rdatatable.gitlab.io/data.table/articles/datatable-reshape.html#a-melting-data-tables-wide-to-long):"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" family | \n",
" dob_child1 | \n",
" dob_child2 | \n",
" gender_child1 | \n",
" gender_child2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1998-11-26 | \n",
" 2000-01-29 | \n",
" 1 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 1996-06-22 | \n",
" NaN | \n",
" 2 | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 2002-07-11 | \n",
" 2004-04-05 | \n",
" 2 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 2004-10-10 | \n",
" 2009-08-27 | \n",
" 1 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 2000-12-05 | \n",
" 2005-02-28 | \n",
" 2 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" family dob_child1 dob_child2 gender_child1 gender_child2\n",
"0 1 1998-11-26 2000-01-29 1 2.0\n",
"1 2 1996-06-22 NaN 2 NaN\n",
"2 3 2002-07-11 2004-04-05 2 2.0\n",
"3 4 2004-10-10 2009-08-27 1 1.0\n",
"4 5 2000-12-05 2005-02-28 2 1.0"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(\n",
" {\n",
" \"family\": [1, 2, 3, 4, 5],\n",
" \"dob_child1\": [\n",
" \"1998-11-26\",\n",
" \"1996-06-22\",\n",
" \"2002-07-11\",\n",
" \"2004-10-10\",\n",
" \"2000-12-05\",\n",
" ],\n",
" \"dob_child2\": [\n",
" \"2000-01-29\",\n",
" np.nan,\n",
" \"2004-04-05\",\n",
" \"2009-08-27\",\n",
" \"2005-02-28\",\n",
" ],\n",
" \"gender_child1\": [1, 2, 2, 1, 2],\n",
" \"gender_child2\": [2.0, np.nan, 2.0, 1.0, 1.0],\n",
" }\n",
")\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the data above, we wish to keep `dob` and `gender` as column names, while moving `child1` and `child2` to a new column. How does [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) reshape this?"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" family | \n",
" child | \n",
" dob | \n",
" gender | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" child1 | \n",
" 1998-11-26 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" child1 | \n",
" 1996-06-22 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" child1 | \n",
" 2002-07-11 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" child1 | \n",
" 2004-10-10 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" child1 | \n",
" 2000-12-05 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 5 | \n",
" 1 | \n",
" child2 | \n",
" 2000-01-29 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 6 | \n",
" 2 | \n",
" child2 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 7 | \n",
" 3 | \n",
" child2 | \n",
" 2004-04-05 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 8 | \n",
" 4 | \n",
" child2 | \n",
" 2009-08-27 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 9 | \n",
" 5 | \n",
" child2 | \n",
" 2005-02-28 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" family child dob gender\n",
"0 1 child1 1998-11-26 1.0\n",
"1 2 child1 1996-06-22 2.0\n",
"2 3 child1 2002-07-11 2.0\n",
"3 4 child1 2004-10-10 1.0\n",
"4 5 child1 2000-12-05 2.0\n",
"5 1 child2 2000-01-29 2.0\n",
"6 2 child2 NaN NaN\n",
"7 3 child2 2004-04-05 2.0\n",
"8 4 child2 2009-08-27 1.0\n",
"9 5 child2 2005-02-28 1.0"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pivot_longer(index=\"family\", names_to=(\".value\", \"child\"), names_sep=\"_\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For patterns such as this, where part of the column name should be retained as a header in the new dataframe, the `.value` placeholder comes in handy - it tells [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) that this part of the column name stays as a column header. \n",
"\n",
"Take the column label `dob_child1` - the first part is `dob` and the second part is `child1`. This pairs with `(.value, child)`: `dob` stays as a header, while the rest is lumped into the `child` column. The `names_sep` value - `_` - determines how the column label is split into the new columns.\n",
"\n",
"With `names_pattern` we have more control over the output - here, selecting just the numbers for the `child` column:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" family | \n",
" child | \n",
" dob | \n",
" gender | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 1998-11-26 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 1 | \n",
" 1996-06-22 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 1 | \n",
" 2002-07-11 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 1 | \n",
" 2004-10-10 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 1 | \n",
" 2000-12-05 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 5 | \n",
" 1 | \n",
" 2 | \n",
" 2000-01-29 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 6 | \n",
" 2 | \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 7 | \n",
" 3 | \n",
" 2 | \n",
" 2004-04-05 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 8 | \n",
" 4 | \n",
" 2 | \n",
" 2009-08-27 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 9 | \n",
" 5 | \n",
" 2 | \n",
" 2005-02-28 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" family child dob gender\n",
"0 1 1 1998-11-26 1.0\n",
"1 2 1 1996-06-22 2.0\n",
"2 3 1 2002-07-11 2.0\n",
"3 4 1 2004-10-10 1.0\n",
"4 5 1 2000-12-05 2.0\n",
"5 1 2 2000-01-29 2.0\n",
"6 2 2 NaN NaN\n",
"7 3 2 2004-04-05 2.0\n",
"8 4 2 2009-08-27 1.0\n",
"9 5 2 2005-02-28 1.0"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pivot_longer(\n",
" index=\"family\", names_to=(\".value\", \"child\"), names_pattern=r\"(.+)_child(\\d)\"\n",
")"
]
},
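{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the regular expression is doing, here is a minimal sketch (plain Python, no pyjanitor needed) of how `(.+)_child(\\d)` splits each column label into two captured groups - the first capture feeds `.value` (the retained header), the second feeds the `child` column:\n",
"\n",
"```python\n",
"import re\n",
"\n",
"pattern = re.compile(r'(.+)_child(\\d)')\n",
"\n",
"# each column label yields (header, child_number)\n",
"for label in ['dob_child1', 'gender_child2']:\n",
"    header, child = pattern.match(label).groups()\n",
"    print(label, '->', header, child)\n",
"```\n",
"\n",
"`dob_child1` yields `('dob', '1')` - `dob` becomes a header, and `1` lands in the `child` column."
]
},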
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also efficiently convert the data type of `child` to an integer, using the `names_transform` option:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" family | \n",
" child | \n",
" dob | \n",
" gender | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 1998-11-26 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 1 | \n",
" 1996-06-22 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 1 | \n",
" 2002-07-11 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 1 | \n",
" 2004-10-10 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 1 | \n",
" 2000-12-05 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 5 | \n",
" 1 | \n",
" 2 | \n",
" 2000-01-29 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 6 | \n",
" 2 | \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 7 | \n",
" 3 | \n",
" 2 | \n",
" 2004-04-05 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 8 | \n",
" 4 | \n",
" 2 | \n",
" 2009-08-27 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 9 | \n",
" 5 | \n",
" 2 | \n",
" 2005-02-28 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" family child dob gender\n",
"0 1 1 1998-11-26 1.0\n",
"1 2 1 1996-06-22 2.0\n",
"2 3 1 2002-07-11 2.0\n",
"3 4 1 2004-10-10 1.0\n",
"4 5 1 2000-12-05 2.0\n",
"5 1 2 2000-01-29 2.0\n",
"6 2 2 NaN NaN\n",
"7 3 2 2004-04-05 2.0\n",
"8 4 2 2009-08-27 1.0\n",
"9 5 2 2005-02-28 1.0"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pivot_longer(\n",
" index=\"family\",\n",
" names_to=(\".value\", \"child\"),\n",
" names_pattern=r\"(.+)_child(\\d)\",\n",
" names_transform=np.int8,\n",
")"
]
},
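{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why is `names_transform` efficient? The conversion is applied to the extracted labels - one value per source column - before they are repeated to fill the long column, rather than casting the full-length column afterwards. A rough sketch of the idea with numpy (the actual pyjanitor implementation may differ):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# labels extracted from the column names: one entry per source column\n",
"labels = np.array(['1', '2'])\n",
"\n",
"# names_transform converts this tiny array once...\n",
"converted = labels.astype(np.int8)\n",
"\n",
"# ...and the long column is built by repeating the converted values\n",
"# (here, 5 families per child)\n",
"child = np.repeat(converted, 5)\n",
"print(child.dtype)  # int8\n",
"```"
]
},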
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The flexibility of `.value` means you can place it anywhere - you are not limited by position. As long as the right argument is passed to `names_sep` or `names_pattern`, you will get the correct output.\n",
"\n",
"Let's apply this to the `iris` [dataset](#pattern-2a) to test the flexibility of `.value`:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Species | \n",
" part | \n",
" Length | \n",
" Width | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" setosa | \n",
" Sepal | \n",
" 5.1 | \n",
" 3.5 | \n",
"
\n",
" \n",
" 1 | \n",
" virginica | \n",
" Sepal | \n",
" 5.9 | \n",
" 3.0 | \n",
"
\n",
" \n",
" 2 | \n",
" setosa | \n",
" Petal | \n",
" 1.4 | \n",
" 0.2 | \n",
"
\n",
" \n",
" 3 | \n",
" virginica | \n",
" Petal | \n",
" 5.1 | \n",
" 1.8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Species part Length Width\n",
"0 setosa Sepal 5.1 3.5\n",
"1 virginica Sepal 5.9 3.0\n",
"2 setosa Petal 1.4 0.2\n",
"3 virginica Petal 5.1 1.8"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris.pivot_longer(index=\"Species\", names_to=(\"part\", \".value\"), names_sep=\".\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use multiple `.value` if required - let's apply this to a question from [Stack Overflow](https://stackoverflow.com/questions/12466493/reshaping-multiple-sets-of-measurement-columns-wide-format-into-single-columns):"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" DateRange1Start | \n",
" DateRange1End | \n",
" Value1 | \n",
" DateRange2Start | \n",
" DateRange2End | \n",
" Value2 | \n",
" DateRange3Start | \n",
" DateRange3End | \n",
" Value3 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1/1/90 | \n",
" 3/1/90 | \n",
" 4.4 | \n",
" 4/5/91 | \n",
" 6/7/91 | \n",
" 6.2 | \n",
" 5/5/95 | \n",
" 6/6/96 | \n",
" 3.3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End \\\n",
"0 1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 \n",
"\n",
" Value2 DateRange3Start DateRange3End Value3 \n",
"0 6.2 5/5/95 6/6/96 3.3 "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(\n",
" [\n",
" {\n",
" \"ID\": 1,\n",
" \"DateRange1Start\": \"1/1/90\",\n",
" \"DateRange1End\": \"3/1/90\",\n",
" \"Value1\": 4.4,\n",
" \"DateRange2Start\": \"4/5/91\",\n",
" \"DateRange2End\": \"6/7/91\",\n",
" \"Value2\": 6.2,\n",
" \"DateRange3Start\": \"5/5/95\",\n",
" \"DateRange3End\": \"6/6/96\",\n",
" \"Value3\": 3.3,\n",
" }\n",
" ]\n",
")\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" DateRangeStart | \n",
" DateRangeEnd | \n",
" Value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1/1/90 | \n",
" 3/1/90 | \n",
" 4.4 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 4/5/91 | \n",
" 6/7/91 | \n",
" 6.2 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 5/5/95 | \n",
" 6/6/96 | \n",
" 3.3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID DateRangeStart DateRangeEnd Value\n",
"0 1 1/1/90 3/1/90 4.4\n",
"1 1 4/5/91 6/7/91 6.2\n",
"2 1 5/5/95 6/6/96 3.3"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pivot_longer(index=\"ID\", names_to=(\".value\", \".value\"), names_pattern=r\"(.+)\\d(.*)\")"
]
},
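{
"cell_type": "markdown",
"metadata": {},
"source": [
"With two `.value` entries, the captured groups are concatenated to form the new header. A quick sketch (plain Python) of how `(.+)\\d(.*)` carves up the labels:\n",
"\n",
"```python\n",
"import re\n",
"\n",
"pattern = re.compile(r'(.+)\\d(.*)')\n",
"\n",
"for label in ['DateRange1Start', 'DateRange1End', 'Value1']:\n",
"    first, second = pattern.match(label).groups()\n",
"    print(label, '->', first + second)\n",
"```\n",
"\n",
"`DateRange1Start` becomes `DateRangeStart`, while `Value1` collapses to `Value` - the digit is discarded, and the two captures are glued together."
]
},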
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, as long as the pattern can be identified, `pivot_longer` will take care of the rest. \n",
"\n",
"One more example, adapted from [Stack Overflow](https://stackoverflow.com/questions/64107566/how-to-pivot-longer-and-populate-with-fields-from-column-names-at-the-same-tim?noredirect=1#comment113369419_64107566):"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sony | TV | Model | value | \n",
" Sony | TV | Quantity | value | \n",
" Sony | TV | Max-quant | value | \n",
" Panasonic | TV | Model | value | \n",
" Panasonic | TV | Quantity | value | \n",
" Panasonic | TV | Max-quant | value | \n",
" Sanyo | Radio | Model | value | \n",
" Sanyo | Radio | Quantity | value | \n",
" Sanyo | Radio | Max-quant | value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" A222 | \n",
" 5 | \n",
" 10 | \n",
" T232 | \n",
" 1 | \n",
" 10 | \n",
" S111 | \n",
" 4 | \n",
" 9 | \n",
"
\n",
" \n",
" 1 | \n",
" A234 | \n",
" 5 | \n",
" 9 | \n",
" S3424 | \n",
" 5 | \n",
" 12 | \n",
" S1s1 | \n",
" 2 | \n",
" 9 | \n",
"
\n",
" \n",
" 2 | \n",
" A4345 | \n",
" 4 | \n",
" 9 | \n",
" X3421 | \n",
" 1 | \n",
" 11 | \n",
" S1s2 | \n",
" 4 | \n",
" 10 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sony | TV | Model | value Sony | TV | Quantity | value \\\n",
"0 A222 5 \n",
"1 A234 5 \n",
"2 A4345 4 \n",
"\n",
" Sony | TV | Max-quant | value Panasonic | TV | Model | value \\\n",
"0 10 T232 \n",
"1 9 S3424 \n",
"2 9 X3421 \n",
"\n",
" Panasonic | TV | Quantity | value Panasonic | TV | Max-quant | value \\\n",
"0 1 10 \n",
"1 5 12 \n",
"2 1 11 \n",
"\n",
" Sanyo | Radio | Model | value Sanyo | Radio | Quantity | value \\\n",
"0 S111 4 \n",
"1 S1s1 2 \n",
"2 S1s2 4 \n",
"\n",
" Sanyo | Radio | Max-quant | value \n",
"0 9 \n",
"1 9 \n",
"2 10 "
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(\n",
" {\n",
" \"Sony | TV | Model | value\": {0: \"A222\", 1: \"A234\", 2: \"A4345\"},\n",
" \"Sony | TV | Quantity | value\": {0: 5, 1: 5, 2: 4},\n",
" \"Sony | TV | Max-quant | value\": {0: 10, 1: 9, 2: 9},\n",
" \"Panasonic | TV | Model | value\": {0: \"T232\", 1: \"S3424\", 2: \"X3421\"},\n",
" \"Panasonic | TV | Quantity | value\": {0: 1, 1: 5, 2: 1},\n",
" \"Panasonic | TV | Max-quant | value\": {0: 10, 1: 12, 2: 11},\n",
" \"Sanyo | Radio | Model | value\": {0: \"S111\", 1: \"S1s1\", 2: \"S1s2\"},\n",
" \"Sanyo | Radio | Quantity | value\": {0: 4, 1: 2, 2: 4},\n",
" \"Sanyo | Radio | Max-quant | value\": {0: 9, 1: 9, 2: 10},\n",
" }\n",
")\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Manufacturer | \n",
" Device | \n",
" Model | \n",
" Quantity | \n",
" Max-quant | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Sony | \n",
" TV | \n",
" A222 | \n",
" 5 | \n",
" 10 | \n",
"
\n",
" \n",
" 1 | \n",
" Sony | \n",
" TV | \n",
" A234 | \n",
" 5 | \n",
" 9 | \n",
"
\n",
" \n",
" 2 | \n",
" Sony | \n",
" TV | \n",
" A4345 | \n",
" 4 | \n",
" 9 | \n",
"
\n",
" \n",
" 3 | \n",
" Panasonic | \n",
" TV | \n",
" T232 | \n",
" 1 | \n",
" 10 | \n",
"
\n",
" \n",
" 4 | \n",
" Panasonic | \n",
" TV | \n",
" S3424 | \n",
" 5 | \n",
" 12 | \n",
"
\n",
" \n",
" 5 | \n",
" Panasonic | \n",
" TV | \n",
" X3421 | \n",
" 1 | \n",
" 11 | \n",
"
\n",
" \n",
" 6 | \n",
" Sanyo | \n",
" Radio | \n",
" S111 | \n",
" 4 | \n",
" 9 | \n",
"
\n",
" \n",
" 7 | \n",
" Sanyo | \n",
" Radio | \n",
" S1s1 | \n",
" 2 | \n",
" 9 | \n",
"
\n",
" \n",
" 8 | \n",
" Sanyo | \n",
" Radio | \n",
" S1s2 | \n",
" 4 | \n",
" 10 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Manufacturer Device Model Quantity Max-quant\n",
"0 Sony TV A222 5 10\n",
"1 Sony TV A234 5 9\n",
"2 Sony TV A4345 4 9\n",
"3 Panasonic TV T232 1 10\n",
"4 Panasonic TV S3424 5 12\n",
"5 Panasonic TV X3421 1 11\n",
"6 Sanyo Radio S111 4 9\n",
"7 Sanyo Radio S1s1 2 9\n",
"8 Sanyo Radio S1s2 4 10"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pivot_longer(\n",
" names_to=(\"Manufacturer\", \"Device\", \".value\"),\n",
" names_pattern=r\"(.+)\\s\\|(.+)\\s\\|(.+)\\s\\|.*\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(pattern-3a)=\n",
"## **Pattern 3A - Group similar columns together**\n",
"\n",
"This is best explained with an example:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" x_1_mean | \n",
" x_2_mean | \n",
" y_1_mean | \n",
" y_2_mean | \n",
" unit | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 10 | \n",
" 20 | \n",
" 30 | \n",
" 40 | \n",
" 50 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" x_1_mean x_2_mean y_1_mean y_2_mean unit\n",
"0 10 20 30 40 50"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_mean = pd.DataFrame(\n",
" {\n",
" \"x_1_mean\": [10],\n",
" \"x_2_mean\": [20],\n",
" \"y_1_mean\": [30],\n",
" \"y_2_mean\": [40],\n",
" \"unit\": [50],\n",
" }\n",
")\n",
"\n",
"df_mean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the dataframe above, we wish to lump similar columns together - specifically, the `x*` columns into one column, and the `y*` columns into another. \n",
"\n",
"For this pattern, we pass a list of names to `names_to`, and a matching list of regular expressions to `names_pattern`:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" unit | \n",
" x_mean | \n",
" y_mean | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 50 | \n",
" 10 | \n",
" 30 | \n",
"
\n",
" \n",
" 1 | \n",
" 50 | \n",
" 20 | \n",
" 40 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" unit x_mean y_mean\n",
"0 50 10 30\n",
"1 50 20 40"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_mean.pivot_longer(\n",
" index=\"unit\", names_to=[\"x_mean\", \"y_mean\"], names_pattern=[\"x\", \"y\"]\n",
")"
]
},
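{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the same grouping can be expressed in plain pandas with `pd.lreshape`, though you have to spell out every column in each group yourself - a sketch, rebuilding the `df_mean` dataframe from above:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"df_mean = pd.DataFrame(\n",
"    {\n",
"        'x_1_mean': [10],\n",
"        'x_2_mean': [20],\n",
"        'y_1_mean': [30],\n",
"        'y_2_mean': [40],\n",
"        'unit': [50],\n",
"    }\n",
")\n",
"\n",
"# each new column is built by stacking its listed source columns;\n",
"# columns not mentioned in the groups dict (unit) are repeated\n",
"out = pd.lreshape(\n",
"    df_mean,\n",
"    {'x_mean': ['x_1_mean', 'x_2_mean'], 'y_mean': ['y_1_mean', 'y_2_mean']},\n",
")\n",
"print(out)\n",
"```"
]
},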
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Passing a list of regular expressions to `names_pattern` tells pivot_longer to group the matching columns under the corresponding names in `names_to`. \n",
"\n",
"Essentially, it is a pairing: `x_1_mean` pairs with the first `y*` column it encounters, which is `y_1_mean`; the next `x*` column, `x_2_mean`, pairs with `y_2_mean`, the next available `y*` column. The pairing is on a first-come, first-served basis.\n",
"\n",
"Let's look at another [example](https://github.com/PacktPublishing/Pandas-Cookbook/blob/master/data/movie_altered.csv):"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" title | \n",
" rating | \n",
" year | \n",
" duration | \n",
" director_1 | \n",
" director_fb_likes_1 | \n",
" actor_1 | \n",
" actor_2 | \n",
" actor_3 | \n",
" actor_fb_likes_1 | \n",
" actor_fb_likes_2 | \n",
" actor_fb_likes_3 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Avatar | \n",
" PG-13 | \n",
" 2009.0 | \n",
" 178.0 | \n",
" James Cameron | \n",
" 0.0 | \n",
" CCH Pounder | \n",
" Joel David Moore | \n",
" Wes Studi | \n",
" 1000.0 | \n",
" 936.0 | \n",
" 855.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Pirates of the Caribbean: At World's End | \n",
" PG-13 | \n",
" 2007.0 | \n",
" 169.0 | \n",
" Gore Verbinski | \n",
" 563.0 | \n",
" Johnny Depp | \n",
" Orlando Bloom | \n",
" Jack Davenport | \n",
" 40000.0 | \n",
" 5000.0 | \n",
" 1000.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Spectre | \n",
" PG-13 | \n",
" 2015.0 | \n",
" 148.0 | \n",
" Sam Mendes | \n",
" 0.0 | \n",
" Christoph Waltz | \n",
" Rory Kinnear | \n",
" Stephanie Sigman | \n",
" 11000.0 | \n",
" 393.0 | \n",
" 161.0 | \n",
"
\n",
" \n",
" 3 | \n",
" The Dark Knight Rises | \n",
" PG-13 | \n",
" 2012.0 | \n",
" 164.0 | \n",
" Christopher Nolan | \n",
" 22000.0 | \n",
" Tom Hardy | \n",
" Christian Bale | \n",
" Joseph Gordon-Levitt | \n",
" 27000.0 | \n",
" 23000.0 | \n",
" 23000.0 | \n",
"
\n",
" \n",
" 4 | \n",
" Star Wars: Episode VII - The Force Awakens | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" Doug Walker | \n",
" 131.0 | \n",
" Doug Walker | \n",
" Rob Walker | \n",
" NaN | \n",
" 131.0 | \n",
" 12.0 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" title rating year duration \\\n",
"0 Avatar PG-13 2009.0 178.0 \n",
"1 Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 \n",
"2 Spectre PG-13 2015.0 148.0 \n",
"3 The Dark Knight Rises PG-13 2012.0 164.0 \n",
"4 Star Wars: Episode VII - The Force Awakens NaN NaN NaN \n",
"\n",
" director_1 director_fb_likes_1 actor_1 actor_2 \\\n",
"0 James Cameron 0.0 CCH Pounder Joel David Moore \n",
"1 Gore Verbinski 563.0 Johnny Depp Orlando Bloom \n",
"2 Sam Mendes 0.0 Christoph Waltz Rory Kinnear \n",
"3 Christopher Nolan 22000.0 Tom Hardy Christian Bale \n",
"4 Doug Walker 131.0 Doug Walker Rob Walker \n",
"\n",
" actor_3 actor_fb_likes_1 actor_fb_likes_2 actor_fb_likes_3 \n",
"0 Wes Studi 1000.0 936.0 855.0 \n",
"1 Jack Davenport 40000.0 5000.0 1000.0 \n",
"2 Stephanie Sigman 11000.0 393.0 161.0 \n",
"3 Joseph Gordon-Levitt 27000.0 23000.0 23000.0 \n",
"4 NaN 131.0 12.0 NaN "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = \"https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/movie_altered.csv\"\n",
"movies = pd.read_csv(url)\n",
"movies.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The goal for the dataframe above is to group the directors and actors into their own columns, with their Facebook likes in separate columns. The pattern here is obvious - we are grouping similar columns - so we pass a list of regular expressions to `names_pattern`, with an equal number of names to `names_to`:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" title | \n",
" rating | \n",
" year | \n",
" duration | \n",
" director | \n",
" director_fb_likes | \n",
" actor | \n",
" actor_fb_likes | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Avatar | \n",
" PG-13 | \n",
" 2009.0 | \n",
" 178.0 | \n",
" James Cameron | \n",
" 0.0 | \n",
" CCH Pounder | \n",
" 1000.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Pirates of the Caribbean: At World's End | \n",
" PG-13 | \n",
" 2007.0 | \n",
" 169.0 | \n",
" Gore Verbinski | \n",
" 563.0 | \n",
" Johnny Depp | \n",
" 40000.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Spectre | \n",
" PG-13 | \n",
" 2015.0 | \n",
" 148.0 | \n",
" Sam Mendes | \n",
" 0.0 | \n",
" Christoph Waltz | \n",
" 11000.0 | \n",
"
\n",
" \n",
" 3 | \n",
" The Dark Knight Rises | \n",
" PG-13 | \n",
" 2012.0 | \n",
" 164.0 | \n",
" Christopher Nolan | \n",
" 22000.0 | \n",
" Tom Hardy | \n",
" 27000.0 | \n",
"
\n",
" \n",
" 4 | \n",
" Star Wars: Episode VII - The Force Awakens | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" Doug Walker | \n",
" 131.0 | \n",
" Doug Walker | \n",
" 131.0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 14743 | \n",
" Signed Sealed Delivered | \n",
" NaN | \n",
" 2013.0 | \n",
" 87.0 | \n",
" NaN | \n",
" NaN | \n",
" Crystal Lowe | \n",
" 319.0 | \n",
"
\n",
" \n",
" 14744 | \n",
" The Following | \n",
" TV-14 | \n",
" NaN | \n",
" 43.0 | \n",
" NaN | \n",
" NaN | \n",
" Sam Underwood | \n",
" 319.0 | \n",
"
\n",
" \n",
" 14745 | \n",
" A Plague So Pleasant | \n",
" NaN | \n",
" 2013.0 | \n",
" 76.0 | \n",
" NaN | \n",
" NaN | \n",
" David Chandler | \n",
" 0.0 | \n",
"
\n",
" \n",
" 14746 | \n",
" Shanghai Calling | \n",
" PG-13 | \n",
" 2012.0 | \n",
" 100.0 | \n",
" NaN | \n",
" NaN | \n",
" Eliza Coupe | \n",
" 490.0 | \n",
"
\n",
" \n",
" 14747 | \n",
" My Date with Drew | \n",
" PG | \n",
" 2004.0 | \n",
" 90.0 | \n",
" NaN | \n",
" NaN | \n",
" Jon Gunn | \n",
" 16.0 | \n",
"
\n",
" \n",
"
\n",
"
14748 rows × 8 columns
\n",
"
"
],
"text/plain": [
" title rating year duration \\\n",
"0 Avatar PG-13 2009.0 178.0 \n",
"1 Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 \n",
"2 Spectre PG-13 2015.0 148.0 \n",
"3 The Dark Knight Rises PG-13 2012.0 164.0 \n",
"4 Star Wars: Episode VII - The Force Awakens NaN NaN NaN \n",
"... ... ... ... ... \n",
"14743 Signed Sealed Delivered NaN 2013.0 87.0 \n",
"14744 The Following TV-14 NaN 43.0 \n",
"14745 A Plague So Pleasant NaN 2013.0 76.0 \n",
"14746 Shanghai Calling PG-13 2012.0 100.0 \n",
"14747 My Date with Drew PG 2004.0 90.0 \n",
"\n",
" director director_fb_likes actor actor_fb_likes \n",
"0 James Cameron 0.0 CCH Pounder 1000.0 \n",
"1 Gore Verbinski 563.0 Johnny Depp 40000.0 \n",
"2 Sam Mendes 0.0 Christoph Waltz 11000.0 \n",
"3 Christopher Nolan 22000.0 Tom Hardy 27000.0 \n",
"4 Doug Walker 131.0 Doug Walker 131.0 \n",
"... ... ... ... ... \n",
"14743 NaN NaN Crystal Lowe 319.0 \n",
"14744 NaN NaN Sam Underwood 319.0 \n",
"14745 NaN NaN David Chandler 0.0 \n",
"14746 NaN NaN Eliza Coupe 490.0 \n",
"14747 NaN NaN Jon Gunn 16.0 \n",
"\n",
"[14748 rows x 8 columns]"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies.pivot_longer(\n",
" index=slice(\"title\", \"duration\"),\n",
" names_to=[\"director\", \"director_fb_likes\", \"actor\", \"actor_fb_likes\"],\n",
" names_pattern=[\"director_\\\\d$\", \"director_fb_likes\", \"actor_\\\\d$\", \"actor.+\"],\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the same first-come, first-served approach, `director_1` is paired with `director_fb_likes_1`, and `actor_1` is paired with `actor_fb_likes_1`; the pairing continues for the rest.\n",
"\n",
"This could also have been solved using the `.value` approach:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" title | \n",
" rating | \n",
" year | \n",
" duration | \n",
" director | \n",
" director_fb_likes | \n",
" actor | \n",
" actor_fb_likes | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Avatar | \n",
" PG-13 | \n",
" 2009.0 | \n",
" 178.0 | \n",
" James Cameron | \n",
" 0.0 | \n",
" CCH Pounder | \n",
" 1000.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Pirates of the Caribbean: At World's End | \n",
" PG-13 | \n",
" 2007.0 | \n",
" 169.0 | \n",
" Gore Verbinski | \n",
" 563.0 | \n",
" Johnny Depp | \n",
" 40000.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Spectre | \n",
" PG-13 | \n",
" 2015.0 | \n",
" 148.0 | \n",
" Sam Mendes | \n",
" 0.0 | \n",
" Christoph Waltz | \n",
" 11000.0 | \n",
"
\n",
" \n",
" 3 | \n",
" The Dark Knight Rises | \n",
" PG-13 | \n",
" 2012.0 | \n",
" 164.0 | \n",
" Christopher Nolan | \n",
" 22000.0 | \n",
" Tom Hardy | \n",
" 27000.0 | \n",
"
\n",
" \n",
" 4 | \n",
" Star Wars: Episode VII - The Force Awakens | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" Doug Walker | \n",
" 131.0 | \n",
" Doug Walker | \n",
" 131.0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 14743 | \n",
" Signed Sealed Delivered | \n",
" NaN | \n",
" 2013.0 | \n",
" 87.0 | \n",
" NaN | \n",
" NaN | \n",
" Crystal Lowe | \n",
" 319.0 | \n",
"
\n",
" \n",
" 14744 | \n",
" The Following | \n",
" TV-14 | \n",
" NaN | \n",
" 43.0 | \n",
" NaN | \n",
" NaN | \n",
" Sam Underwood | \n",
" 319.0 | \n",
"
\n",
" \n",
" 14745 | \n",
" A Plague So Pleasant | \n",
" NaN | \n",
" 2013.0 | \n",
" 76.0 | \n",
" NaN | \n",
" NaN | \n",
" David Chandler | \n",
" 0.0 | \n",
"
\n",
" \n",
" 14746 | \n",
" Shanghai Calling | \n",
" PG-13 | \n",
" 2012.0 | \n",
" 100.0 | \n",
" NaN | \n",
" NaN | \n",
" Eliza Coupe | \n",
" 490.0 | \n",
"
\n",
" \n",
" 14747 | \n",
" My Date with Drew | \n",
" PG | \n",
" 2004.0 | \n",
" 90.0 | \n",
" NaN | \n",
" NaN | \n",
" Jon Gunn | \n",
" 16.0 | \n",
"
\n",
" \n",
"
\n",
"
14748 rows × 8 columns
\n",
"
"
],
"text/plain": [
" title rating year duration \\\n",
"0 Avatar PG-13 2009.0 178.0 \n",
"1 Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 \n",
"2 Spectre PG-13 2015.0 148.0 \n",
"3 The Dark Knight Rises PG-13 2012.0 164.0 \n",
"4 Star Wars: Episode VII - The Force Awakens NaN NaN NaN \n",
"... ... ... ... ... \n",
"14743 Signed Sealed Delivered NaN 2013.0 87.0 \n",
"14744 The Following TV-14 NaN 43.0 \n",
"14745 A Plague So Pleasant NaN 2013.0 76.0 \n",
"14746 Shanghai Calling PG-13 2012.0 100.0 \n",
"14747 My Date with Drew PG 2004.0 90.0 \n",
"\n",
" director director_fb_likes actor actor_fb_likes \n",
"0 James Cameron 0.0 CCH Pounder 1000.0 \n",
"1 Gore Verbinski 563.0 Johnny Depp 40000.0 \n",
"2 Sam Mendes 0.0 Christoph Waltz 11000.0 \n",
"3 Christopher Nolan 22000.0 Tom Hardy 27000.0 \n",
"4 Doug Walker 131.0 Doug Walker 131.0 \n",
"... ... ... ... ... \n",
"14743 NaN NaN Crystal Lowe 319.0 \n",
"14744 NaN NaN Sam Underwood 319.0 \n",
"14745 NaN NaN David Chandler 0.0 \n",
"14746 NaN NaN Eliza Coupe 490.0 \n",
"14747 NaN NaN Jon Gunn 16.0 \n",
"\n",
"[14748 rows x 8 columns]"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(\n",
" movies.pivot_longer(\n",
" index=slice(\"title\", \"duration\"), names_to=\".value\", names_pattern=\"(.+)_\\\\d\"\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, it is all about identifying the pattern, and using whichever option is more suitable or convenient. One more [example](https://github.com/tidyverse/tidyr/issues/1378#issue-1300743144):"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" A | \n",
" A_date | \n",
" B | \n",
" B_date | \n",
" other | \n",
" other_date | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" A | \n",
" 1 | \n",
" <NA> | \n",
" <NA> | \n",
" <NA> | \n",
" <NA> | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" <NA> | \n",
" <NA> | \n",
" B | \n",
" 3 | \n",
" <NA> | \n",
" <NA> | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" A | \n",
" 2 | \n",
" B | \n",
" 2 | \n",
" <NA> | \n",
" <NA> | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" <NA> | \n",
" <NA> | \n",
" <NA> | \n",
" <NA> | \n",
" C | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" <NA> | \n",
" <NA> | \n",
" <NA> | \n",
" <NA> | \n",
" D | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id A A_date B B_date other other_date\n",
"0 1 A 1 \n",
"1 2 B 3 \n",
"2 3 A 2 B 2 \n",
"3 4 C 1\n",
"4 5 D 5"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from pandas import NA\n",
"\n",
"treatments = dict(\n",
" id=range(1, 6),\n",
" A=(\"A\", NA, \"A\", NA, NA),\n",
" A_date=(1, NA, 2, NA, NA),\n",
" B=(NA, \"B\", \"B\", NA, NA),\n",
" B_date=(NA, 3, 2, NA, NA),\n",
" other=(NA, NA, NA, \"C\", \"D\"),\n",
" other_date=(NA, NA, NA, 1, 5),\n",
")\n",
"treatments = pd.DataFrame(treatments)\n",
"treatments"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The goal is to transform the dataframe into three columns: an `id` column, a `treatment` column, and a `date` column. Since we are grouping similar columns, we pass a list of regular expressions to `names_pattern`:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" treatment | \n",
" date | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" A | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 3 | \n",
" A | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" B | \n",
" 3 | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" B | \n",
" 2 | \n",
"
\n",
" \n",
" 4 | \n",
" 4 | \n",
" C | \n",
" 1 | \n",
"
\n",
" \n",
" 5 | \n",
" 5 | \n",
" D | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id treatment date\n",
"0 1 A 1\n",
"1 3 A 2\n",
"2 2 B 3\n",
"3 3 B 2\n",
"4 4 C 1\n",
"5 5 D 5"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"treatments.pivot_longer(\n",
" index=\"id\",\n",
" names_to=[\"treatment\", \"date\"],\n",
" names_pattern=[\"A$|B$|other$\", \".+date$\"],\n",
" dropna=True,\n",
")"
]
},
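{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how the two regular expressions partition the columns, here is a small sketch (plain Python) - columns matching the first pattern feed `treatment`, columns matching the second feed `date`:\n",
"\n",
"```python\n",
"import re\n",
"\n",
"columns = ['A', 'A_date', 'B', 'B_date', 'other', 'other_date']\n",
"\n",
"treatment_cols = [c for c in columns if re.search(r'A$|B$|other$', c)]\n",
"date_cols = [c for c in columns if re.search(r'.+date$', c)]\n",
"\n",
"print(treatment_cols)  # ['A', 'B', 'other']\n",
"print(date_cols)       # ['A_date', 'B_date', 'other_date']\n",
"```"
]
},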
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another option is to match the dates before the treatments, by reversing the order of the regular expressions:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" treatment | \n",
" date | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" A | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 3 | \n",
" A | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" B | \n",
" 3 | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" B | \n",
" 2 | \n",
"
\n",
" \n",
" 4 | \n",
" 4 | \n",
" C | \n",
" 1 | \n",
"
\n",
" \n",
" 5 | \n",
" 5 | \n",
" D | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id treatment date\n",
"0 1 A 1\n",
"1 3 A 2\n",
"2 2 B 3\n",
"3 3 B 2\n",
"4 4 C 1\n",
"5 5 D 5"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"treatments.pivot_longer(\n",
" index=\"id\",\n",
" names_to=[\"date\", \"treatment\"],\n",
" names_pattern=[\".+date$\", \".+\"],\n",
" dropna=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Pattern 3B - Group similar columns and values together**\n",
"\n",
"This is best illustrated with an [example](https://stackoverflow.com/q/51519101/7175713):\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" City | \n",
" State | \n",
" Name | \n",
" Mango | \n",
" Orange | \n",
" Watermelon | \n",
" Gin | \n",
" Vodka | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Houston | \n",
" Texas | \n",
" Aria | \n",
" 4 | \n",
" 10 | \n",
" 40 | \n",
" 16 | \n",
" 20 | \n",
"
\n",
" \n",
" 1 | \n",
" Austin | \n",
" Texas | \n",
" Penelope | \n",
" 10 | \n",
" 8 | \n",
" 99 | \n",
" 200 | \n",
" 33 | \n",
"
\n",
" \n",
" 2 | \n",
" Hoover | \n",
" Alabama | \n",
" Niko | \n",
" 90 | \n",
" 14 | \n",
" 43 | \n",
" 34 | \n",
" 18 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" City State Name Mango Orange Watermelon Gin Vodka\n",
"0 Houston Texas Aria 4 10 40 16 20\n",
"1 Austin Texas Penelope 10 8 99 200 33\n",
"2 Hoover Alabama Niko 90 14 43 34 18"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(\n",
" {\n",
" \"City\": [\"Houston\", \"Austin\", \"Hoover\"],\n",
" \"State\": [\"Texas\", \"Texas\", \"Alabama\"],\n",
" \"Name\": [\"Aria\", \"Penelope\", \"Niko\"],\n",
" \"Mango\": [4, 10, 90],\n",
" \"Orange\": [10, 8, 14],\n",
" \"Watermelon\": [40, 99, 43],\n",
" \"Gin\": [16, 200, 34],\n",
" \"Vodka\": [20, 33, 18],\n",
" },\n",
" columns=[\"City\", \"State\", \"Name\", \"Mango\", \"Orange\", \"Watermelon\", \"Gin\", \"Vodka\"],\n",
")\n",
"\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The goal in this case is to group similar columns **and** values together - the fruits into one column and the drinks into another, with the values for the fruits in one column and the values for the drinks in another. For this pattern we extend the idea from [earlier](#pattern-3a), passing a list of regular expressions to `names_pattern`, an equal number of names to `names_to`, **and** an equal number of names to `values_to`:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" City | \n",
" State | \n",
" Fruit | \n",
" Drink | \n",
" Pounds | \n",
" Ounces | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Houston | \n",
" Texas | \n",
" Mango | \n",
" Gin | \n",
" 4 | \n",
" 16.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Austin | \n",
" Texas | \n",
" Mango | \n",
" Gin | \n",
" 10 | \n",
" 200.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Hoover | \n",
" Alabama | \n",
" Mango | \n",
" Gin | \n",
" 90 | \n",
" 34.0 | \n",
"
\n",
" \n",
" 3 | \n",
" Houston | \n",
" Texas | \n",
" Orange | \n",
" Vodka | \n",
" 10 | \n",
" 20.0 | \n",
"
\n",
" \n",
" 4 | \n",
" Austin | \n",
" Texas | \n",
" Orange | \n",
" Vodka | \n",
" 8 | \n",
" 33.0 | \n",
"
\n",
" \n",
" 5 | \n",
" Hoover | \n",
" Alabama | \n",
" Orange | \n",
" Vodka | \n",
" 14 | \n",
" 18.0 | \n",
"
\n",
" \n",
" 6 | \n",
" Houston | \n",
" Texas | \n",
" Watermelon | \n",
" None | \n",
" 40 | \n",
" NaN | \n",
"
\n",
" \n",
" 7 | \n",
" Austin | \n",
" Texas | \n",
" Watermelon | \n",
" None | \n",
" 99 | \n",
" NaN | \n",
"
\n",
" \n",
" 8 | \n",
" Hoover | \n",
" Alabama | \n",
" Watermelon | \n",
" None | \n",
" 43 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" City State Fruit Drink Pounds Ounces\n",
"0 Houston Texas Mango Gin 4 16.0\n",
"1 Austin Texas Mango Gin 10 200.0\n",
"2 Hoover Alabama Mango Gin 90 34.0\n",
"3 Houston Texas Orange Vodka 10 20.0\n",
"4 Austin Texas Orange Vodka 8 33.0\n",
"5 Hoover Alabama Orange Vodka 14 18.0\n",
"6 Houston Texas Watermelon None 40 NaN\n",
"7 Austin Texas Watermelon None 99 NaN\n",
"8 Hoover Alabama Watermelon None 43 NaN"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pivot_longer(\n",
" index=[\"City\", \"State\"],\n",
" column_names=slice(\"Mango\", \"Vodka\"),\n",
" names_to=(\"Fruit\", \"Drink\"),\n",
" values_to=(\"Pounds\", \"Ounces\"),\n",
" names_pattern=[r\"M|O|W\", r\"G|V\"],\n",
")"
]
},
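{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, a rough sketch of the same grouping in plain pandas takes two `melt` calls pasted side by side (the column order differs from `pivot_longer`'s output, and the shorter drinks block is padded with `NaN` when the two are aligned):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"index = ['City', 'State']\n",
"# melt the fruits and the drinks separately ...\n",
"fruits = df.melt(\n",
"    id_vars=index,\n",
"    value_vars=['Mango', 'Orange', 'Watermelon'],\n",
"    var_name='Fruit',\n",
"    value_name='Pounds',\n",
")\n",
"drinks = df.melt(\n",
"    id_vars=index,\n",
"    value_vars=['Gin', 'Vodka'],\n",
"    var_name='Drink',\n",
"    value_name='Ounces',\n",
")\n",
"# ... then paste them column-wise; fruits has 9 rows, drinks has 6,\n",
"# so the drinks block gets padded with NaN, mirroring pivot_longer\n",
"pd.concat([fruits, drinks.drop(columns=index)], axis=1)\n",
"```"
]
},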
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Simple, straightforward, to the point, and efficient. Let's look at another [example](https://stackoverflow.com/q/37422487/7175713):"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" State | \n",
" ItemN | \n",
" x1 | \n",
" x2 | \n",
" x3 | \n",
" y1 | \n",
" y2 | \n",
" y3 | \n",
" z1 | \n",
" z2 | \n",
" z3 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" CA | \n",
" 1 | \n",
" 6 | \n",
" 4 | \n",
" 3 | \n",
" 7 | \n",
" 5 | \n",
" 3 | \n",
" 11 | \n",
" 5 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" CA | \n",
" 2 | \n",
" 7 | \n",
" 3 | \n",
" 1 | \n",
" 15 | \n",
" 10 | \n",
" 5 | \n",
" 4 | \n",
" 2 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" FL | \n",
" 3 | \n",
" 3 | \n",
" 2 | \n",
" 1 | \n",
" 5 | \n",
" 3 | \n",
" 2 | \n",
" 13 | \n",
" 7 | \n",
" 2 | \n",
"
\n",
" \n",
" 3 | \n",
" FL | \n",
" 4 | \n",
" 9 | \n",
" 4 | \n",
" 2 | \n",
" 16 | \n",
" 14 | \n",
" 12 | \n",
" 14 | \n",
" 5 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" State ItemN x1 x2 x3 y1 y2 y3 z1 z2 z3\n",
"0 CA 1 6 4 3 7 5 3 11 5 1\n",
"1 CA 2 7 3 1 15 10 5 4 2 1\n",
"2 FL 3 3 2 1 5 3 2 13 7 2\n",
"3 FL 4 9 4 2 16 14 12 14 5 4"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = {\n",
" \"State\": [\"CA\", \"CA\", \"FL\", \"FL\"],\n",
" \"ItemN\": [1, 2, 3, 4],\n",
" \"x1\": [6, 7, 3, 9],\n",
" \"x2\": [4, 3, 2, 4],\n",
" \"x3\": [3, 1, 1, 2],\n",
" \"y1\": [7, 15, 5, 16],\n",
" \"y2\": [5, 10, 3, 14],\n",
" \"y3\": [3, 5, 2, 12],\n",
" \"z1\": [11, 4, 13, 14],\n",
" \"z2\": [5, 2, 7, 5],\n",
" \"z3\": [1, 1, 2, 4],\n",
"}\n",
"\n",
"df = pd.DataFrame(df)\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The goal is to group all the `x` columns into one column, the `y` columns into another, the `z` columns into a third, and group the associated values into individual columns as well. We are grouping similar columns **and** values - that's the pattern - again we pass a list of regular expressions to `names_pattern`, a list of names to `names_to`, and a list of names to `values_to`:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" State | \n",
" ItemN | \n",
" x | \n",
" y | \n",
" z | \n",
" xvalue | \n",
" yvalue | \n",
" zvalue | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" CA | \n",
" 1 | \n",
" x1 | \n",
" y1 | \n",
" z1 | \n",
" 6 | \n",
" 7 | \n",
" 11 | \n",
"
\n",
" \n",
" 1 | \n",
" CA | \n",
" 2 | \n",
" x1 | \n",
" y1 | \n",
" z1 | \n",
" 7 | \n",
" 15 | \n",
" 4 | \n",
"
\n",
" \n",
" 2 | \n",
" FL | \n",
" 3 | \n",
" x1 | \n",
" y1 | \n",
" z1 | \n",
" 3 | \n",
" 5 | \n",
" 13 | \n",
"
\n",
" \n",
" 3 | \n",
" FL | \n",
" 4 | \n",
" x1 | \n",
" y1 | \n",
" z1 | \n",
" 9 | \n",
" 16 | \n",
" 14 | \n",
"
\n",
" \n",
" 4 | \n",
" CA | \n",
" 1 | \n",
" x2 | \n",
" y2 | \n",
" z2 | \n",
" 4 | \n",
" 5 | \n",
" 5 | \n",
"
\n",
" \n",
" 5 | \n",
" CA | \n",
" 2 | \n",
" x2 | \n",
" y2 | \n",
" z2 | \n",
" 3 | \n",
" 10 | \n",
" 2 | \n",
"
\n",
" \n",
" 6 | \n",
" FL | \n",
" 3 | \n",
" x2 | \n",
" y2 | \n",
" z2 | \n",
" 2 | \n",
" 3 | \n",
" 7 | \n",
"
\n",
" \n",
" 7 | \n",
" FL | \n",
" 4 | \n",
" x2 | \n",
" y2 | \n",
" z2 | \n",
" 4 | \n",
" 14 | \n",
" 5 | \n",
"
\n",
" \n",
" 8 | \n",
" CA | \n",
" 1 | \n",
" x3 | \n",
" y3 | \n",
" z3 | \n",
" 3 | \n",
" 3 | \n",
" 1 | \n",
"
\n",
" \n",
" 9 | \n",
" CA | \n",
" 2 | \n",
" x3 | \n",
" y3 | \n",
" z3 | \n",
" 1 | \n",
" 5 | \n",
" 1 | \n",
"
\n",
" \n",
" 10 | \n",
" FL | \n",
" 3 | \n",
" x3 | \n",
" y3 | \n",
" z3 | \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
"
\n",
" \n",
" 11 | \n",
" FL | \n",
" 4 | \n",
" x3 | \n",
" y3 | \n",
" z3 | \n",
" 2 | \n",
" 12 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" State ItemN x y z xvalue yvalue zvalue\n",
"0 CA 1 x1 y1 z1 6 7 11\n",
"1 CA 2 x1 y1 z1 7 15 4\n",
"2 FL 3 x1 y1 z1 3 5 13\n",
"3 FL 4 x1 y1 z1 9 16 14\n",
"4 CA 1 x2 y2 z2 4 5 5\n",
"5 CA 2 x2 y2 z2 3 10 2\n",
"6 FL 3 x2 y2 z2 2 3 7\n",
"7 FL 4 x2 y2 z2 4 14 5\n",
"8 CA 1 x3 y3 z3 3 3 1\n",
"9 CA 2 x3 y3 z3 1 5 1\n",
"10 FL 3 x3 y3 z3 1 2 2\n",
"11 FL 4 x3 y3 z3 2 12 4"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pivot_longer(\n",
" index=[\"State\", \"ItemN\"],\n",
" names_to=[\"x\", \"y\", \"z\"],\n",
" values_to=[\"xvalue\", \"yvalue\", \"zvalue\"],\n",
" names_pattern=[\"x\", \"y\", \"z\"],\n",
")"
]
},
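{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, pandas ships a little-known helper, `pd.lreshape`, that covers the *values* half of this pattern - it stacks each group of columns into a single value column, but does not keep the original column labels the way `names_to` does. A rough sketch on the same `df`:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# each key becomes a value column, filled by stacking\n",
"# the listed columns; 'State' and 'ItemN' are tiled to match\n",
"pd.lreshape(\n",
"    df,\n",
"    groups={\n",
"        'xvalue': ['x1', 'x2', 'x3'],\n",
"        'yvalue': ['y1', 'y2', 'y3'],\n",
"        'zvalue': ['z1', 'z2', 'z3'],\n",
"    },\n",
")\n",
"```"
]
},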
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Performance**\n",
"\n",
"[pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) is not just simple, convenient, and straightforward; it is efficient as well:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"503 μs ± 4.79 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n"
]
}
],
"source": [
"%timeit religion.pivot_longer(index = 'religion', names_to = 'income', values_to = 'count')"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"582 μs ± 13.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n"
]
}
],
"source": [
"%timeit religion.melt(id_vars = 'religion', var_name = 'income', value_name = 'count')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Sample data](https://grantmcdermott.com/reshape-benchmarks/), using 1000 columns (you can increase the number of columns yourself, depending on how much memory and processing power your machine has):"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" grp | \n",
" x1 | \n",
" x2 | \n",
" x3 | \n",
" x4 | \n",
" x5 | \n",
" x6 | \n",
" x7 | \n",
" x8 | \n",
" ... | \n",
" x991 | \n",
" x992 | \n",
" x993 | \n",
" x994 | \n",
" x995 | \n",
" x996 | \n",
" x997 | \n",
" x998 | \n",
" x999 | \n",
" x1000 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1000 | \n",
" A | \n",
" I | \n",
" A | \n",
" M | \n",
" O | \n",
" E | \n",
" F | \n",
" D | \n",
" ... | \n",
" C | \n",
" L | \n",
" G | \n",
" I | \n",
" R | \n",
" X | \n",
" O | \n",
" W | \n",
" L | \n",
" L | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 528 | \n",
" I | \n",
" X | \n",
" U | \n",
" Q | \n",
" B | \n",
" X | \n",
" I | \n",
" J | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 151 | \n",
" J | \n",
" Q | \n",
" H | \n",
" F | \n",
" B | \n",
" X | \n",
" B | \n",
" F | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 343 | \n",
" P | \n",
" A | \n",
" K | \n",
" K | \n",
" N | \n",
" R | \n",
" E | \n",
" E | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 460 | \n",
" L | \n",
" D | \n",
" G | \n",
" X | \n",
" I | \n",
" O | \n",
" C | \n",
" W | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 1002 columns
\n",
"
"
],
"text/plain": [
" id grp x1 x2 x3 x4 x5 x6 x7 x8 ... x991 x992 x993 x994 x995 x996 x997 \\\n",
"0 1 1000 A I A M O E F D ... C L G I R X O \n",
"1 2 528 I X U Q B X I J ... NaN NaN NaN NaN NaN NaN NaN \n",
"2 3 151 J Q H F B X B F ... NaN NaN NaN NaN NaN NaN NaN \n",
"3 4 343 P A K K N R E E ... NaN NaN NaN NaN NaN NaN NaN \n",
"4 5 460 L D G X I O C W ... NaN NaN NaN NaN NaN NaN NaN \n",
"\n",
" x998 x999 x1000 \n",
"0 W L L \n",
"1 NaN NaN NaN \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"\n",
"[5 rows x 1002 columns]"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sparse_wide = pd.read_csv(\"Data_files/sparse-wide.csv\")\n",
"sparse_wide.head()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"53.8 ms ± 410 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit sparse_wide.melt(id_vars = ['id', 'grp'])"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"15.7 ms ± 201 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%timeit sparse_wide.pivot_longer(index = ['id', 'grp'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's test on the `movies` dataset:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" director | \n",
" director_fb_likes | \n",
" actor | \n",
" actor_fb_likes | \n",
"
\n",
" \n",
" title | \n",
" rating | \n",
" year | \n",
" duration | \n",
" num | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Avatar | \n",
" PG-13 | \n",
" 2009.0 | \n",
" 178.0 | \n",
" 1 | \n",
" James Cameron | \n",
" 0.0 | \n",
" CCH Pounder | \n",
" 1000.0 | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" Joel David Moore | \n",
" 936.0 | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
" NaN | \n",
" Wes Studi | \n",
" 855.0 | \n",
"
\n",
" \n",
" Pirates of the Caribbean: At World's End | \n",
" PG-13 | \n",
" 2007.0 | \n",
" 169.0 | \n",
" 1 | \n",
" Gore Verbinski | \n",
" 563.0 | \n",
" Johnny Depp | \n",
" 40000.0 | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" Orlando Bloom | \n",
" 5000.0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" Shanghai Calling | \n",
" PG-13 | \n",
" 2012.0 | \n",
" 100.0 | \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" Daniel Henney | \n",
" 719.0 | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
" NaN | \n",
" Eliza Coupe | \n",
" 490.0 | \n",
"
\n",
" \n",
" My Date with Drew | \n",
" PG | \n",
" 2004.0 | \n",
" 90.0 | \n",
" 1 | \n",
" Jon Gunn | \n",
" 16.0 | \n",
" John August | \n",
" 86.0 | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" Brian Herzlinger | \n",
" 23.0 | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
" NaN | \n",
" Jon Gunn | \n",
" 16.0 | \n",
"
\n",
" \n",
"
\n",
"
14748 rows × 4 columns
\n",
"
"
],
"text/plain": [
" director \\\n",
"title rating year duration num \n",
"Avatar PG-13 2009.0 178.0 1 James Cameron \n",
" 2 NaN \n",
" 3 NaN \n",
"Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 Gore Verbinski \n",
" 2 NaN \n",
"... ... \n",
"Shanghai Calling PG-13 2012.0 100.0 2 NaN \n",
" 3 NaN \n",
"My Date with Drew PG 2004.0 90.0 1 Jon Gunn \n",
" 2 NaN \n",
" 3 NaN \n",
"\n",
" director_fb_likes \\\n",
"title rating year duration num \n",
"Avatar PG-13 2009.0 178.0 1 0.0 \n",
" 2 NaN \n",
" 3 NaN \n",
"Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 563.0 \n",
" 2 NaN \n",
"... ... \n",
"Shanghai Calling PG-13 2012.0 100.0 2 NaN \n",
" 3 NaN \n",
"My Date with Drew PG 2004.0 90.0 1 16.0 \n",
" 2 NaN \n",
" 3 NaN \n",
"\n",
" actor \\\n",
"title rating year duration num \n",
"Avatar PG-13 2009.0 178.0 1 CCH Pounder \n",
" 2 Joel David Moore \n",
" 3 Wes Studi \n",
"Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 Johnny Depp \n",
" 2 Orlando Bloom \n",
"... ... \n",
"Shanghai Calling PG-13 2012.0 100.0 2 Daniel Henney \n",
" 3 Eliza Coupe \n",
"My Date with Drew PG 2004.0 90.0 1 John August \n",
" 2 Brian Herzlinger \n",
" 3 Jon Gunn \n",
"\n",
" actor_fb_likes \n",
"title rating year duration num \n",
"Avatar PG-13 2009.0 178.0 1 1000.0 \n",
" 2 936.0 \n",
" 3 855.0 \n",
"Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 40000.0 \n",
" 2 5000.0 \n",
"... ... \n",
"Shanghai Calling PG-13 2012.0 100.0 2 719.0 \n",
" 3 490.0 \n",
"My Date with Drew PG 2004.0 90.0 1 86.0 \n",
" 2 23.0 \n",
" 3 16.0 \n",
"\n",
"[14748 rows x 4 columns]"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wl = pd.wide_to_long(\n",
" movies,\n",
" stubnames=[\"director\", \"director_fb_likes\", \"actor\", \"actor_fb_likes\"],\n",
" i=[\"title\", \"rating\", \"year\", \"duration\"],\n",
" j=\"num\",\n",
" sep=\"_\",\n",
")\n",
"wl"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A more efficient approach than `pd.wide_to_long` uses `DataFrame.stack`:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" director | \n",
" director_fb_likes | \n",
" actor | \n",
" actor_fb_likes | \n",
"
\n",
" \n",
" title | \n",
" rating | \n",
" year | \n",
" duration | \n",
" num | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Avatar | \n",
" PG-13 | \n",
" 2009.0 | \n",
" 178.0 | \n",
" 1 | \n",
" James Cameron | \n",
" 0.0 | \n",
" CCH Pounder | \n",
" 1000.0 | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" Joel David Moore | \n",
" 936.0 | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
" NaN | \n",
" Wes Studi | \n",
" 855.0 | \n",
"
\n",
" \n",
" Pirates of the Caribbean: At World's End | \n",
" PG-13 | \n",
" 2007.0 | \n",
" 169.0 | \n",
" 1 | \n",
" Gore Verbinski | \n",
" 563.0 | \n",
" Johnny Depp | \n",
" 40000.0 | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" Orlando Bloom | \n",
" 5000.0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" Shanghai Calling | \n",
" PG-13 | \n",
" 2012.0 | \n",
" 100.0 | \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" Daniel Henney | \n",
" 719.0 | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
" NaN | \n",
" Eliza Coupe | \n",
" 490.0 | \n",
"
\n",
" \n",
" My Date with Drew | \n",
" PG | \n",
" 2004.0 | \n",
" 90.0 | \n",
" 1 | \n",
" Jon Gunn | \n",
" 16.0 | \n",
" John August | \n",
" 86.0 | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" Brian Herzlinger | \n",
" 23.0 | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
" NaN | \n",
" Jon Gunn | \n",
" 16.0 | \n",
"
\n",
" \n",
"
\n",
"
14748 rows × 4 columns
\n",
"
"
],
"text/plain": [
" director \\\n",
"title rating year duration num \n",
"Avatar PG-13 2009.0 178.0 1 James Cameron \n",
" 2 NaN \n",
" 3 NaN \n",
"Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 Gore Verbinski \n",
" 2 NaN \n",
"... ... \n",
"Shanghai Calling PG-13 2012.0 100.0 2 NaN \n",
" 3 NaN \n",
"My Date with Drew PG 2004.0 90.0 1 Jon Gunn \n",
" 2 NaN \n",
" 3 NaN \n",
"\n",
" director_fb_likes \\\n",
"title rating year duration num \n",
"Avatar PG-13 2009.0 178.0 1 0.0 \n",
" 2 NaN \n",
" 3 NaN \n",
"Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 563.0 \n",
" 2 NaN \n",
"... ... \n",
"Shanghai Calling PG-13 2012.0 100.0 2 NaN \n",
" 3 NaN \n",
"My Date with Drew PG 2004.0 90.0 1 16.0 \n",
" 2 NaN \n",
" 3 NaN \n",
"\n",
" actor \\\n",
"title rating year duration num \n",
"Avatar PG-13 2009.0 178.0 1 CCH Pounder \n",
" 2 Joel David Moore \n",
" 3 Wes Studi \n",
"Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 Johnny Depp \n",
" 2 Orlando Bloom \n",
"... ... \n",
"Shanghai Calling PG-13 2012.0 100.0 2 Daniel Henney \n",
" 3 Eliza Coupe \n",
"My Date with Drew PG 2004.0 90.0 1 John August \n",
" 2 Brian Herzlinger \n",
" 3 Jon Gunn \n",
"\n",
" actor_fb_likes \n",
"title rating year duration num \n",
"Avatar PG-13 2009.0 178.0 1 1000.0 \n",
" 2 936.0 \n",
" 3 855.0 \n",
"Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 40000.0 \n",
" 2 5000.0 \n",
"... ... \n",
"Shanghai Calling PG-13 2012.0 100.0 2 719.0 \n",
" 3 490.0 \n",
"My Date with Drew PG 2004.0 90.0 1 86.0 \n",
" 2 23.0 \n",
" 3 16.0 \n",
"\n",
"[14748 rows x 4 columns]"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index = [\"title\", \"rating\", \"year\", \"duration\"]\n",
"vid = movies.set_index(index)\n",
"vid.columns = vid.columns.str.rsplit(\"_\", n=1, expand=True)\n",
"vid.columns.names = [None, \"num\"]\n",
"vid = vid.stack(level=\"num\", future_stack=True)\n",
"vid"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" title | \n",
" rating | \n",
" year | \n",
" duration | \n",
" director | \n",
" director_fb_likes | \n",
" actor | \n",
" actor_fb_likes | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Avatar | \n",
" PG-13 | \n",
" 2009.0 | \n",
" 178.0 | \n",
" James Cameron | \n",
" 0.0 | \n",
" CCH Pounder | \n",
" 1000.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Avatar | \n",
" PG-13 | \n",
" 2009.0 | \n",
" 178.0 | \n",
" NaN | \n",
" NaN | \n",
" Joel David Moore | \n",
" 936.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Avatar | \n",
" PG-13 | \n",
" 2009.0 | \n",
" 178.0 | \n",
" NaN | \n",
" NaN | \n",
" Wes Studi | \n",
" 855.0 | \n",
"
\n",
" \n",
" 3 | \n",
" Pirates of the Caribbean: At World's End | \n",
" PG-13 | \n",
" 2007.0 | \n",
" 169.0 | \n",
" Gore Verbinski | \n",
" 563.0 | \n",
" Johnny Depp | \n",
" 40000.0 | \n",
"
\n",
" \n",
" 4 | \n",
" Pirates of the Caribbean: At World's End | \n",
" PG-13 | \n",
" 2007.0 | \n",
" 169.0 | \n",
" NaN | \n",
" NaN | \n",
" Orlando Bloom | \n",
" 5000.0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 14743 | \n",
" Shanghai Calling | \n",
" PG-13 | \n",
" 2012.0 | \n",
" 100.0 | \n",
" NaN | \n",
" NaN | \n",
" Daniel Henney | \n",
" 719.0 | \n",
"
\n",
" \n",
" 14744 | \n",
" Shanghai Calling | \n",
" PG-13 | \n",
" 2012.0 | \n",
" 100.0 | \n",
" NaN | \n",
" NaN | \n",
" Eliza Coupe | \n",
" 490.0 | \n",
"
\n",
" \n",
" 14745 | \n",
" My Date with Drew | \n",
" PG | \n",
" 2004.0 | \n",
" 90.0 | \n",
" Jon Gunn | \n",
" 16.0 | \n",
" John August | \n",
" 86.0 | \n",
"
\n",
" \n",
" 14746 | \n",
" My Date with Drew | \n",
" PG | \n",
" 2004.0 | \n",
" 90.0 | \n",
" NaN | \n",
" NaN | \n",
" Brian Herzlinger | \n",
" 23.0 | \n",
"
\n",
" \n",
" 14747 | \n",
" My Date with Drew | \n",
" PG | \n",
" 2004.0 | \n",
" 90.0 | \n",
" NaN | \n",
" NaN | \n",
" Jon Gunn | \n",
" 16.0 | \n",
"
\n",
" \n",
"
\n",
"
14748 rows × 8 columns
\n",
"
"
],
"text/plain": [
" title rating year duration \\\n",
"0 Avatar PG-13 2009.0 178.0 \n",
"1 Avatar PG-13 2009.0 178.0 \n",
"2 Avatar PG-13 2009.0 178.0 \n",
"3 Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 \n",
"4 Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 \n",
"... ... ... ... ... \n",
"14743 Shanghai Calling PG-13 2012.0 100.0 \n",
"14744 Shanghai Calling PG-13 2012.0 100.0 \n",
"14745 My Date with Drew PG 2004.0 90.0 \n",
"14746 My Date with Drew PG 2004.0 90.0 \n",
"14747 My Date with Drew PG 2004.0 90.0 \n",
"\n",
" director director_fb_likes actor actor_fb_likes \n",
"0 James Cameron 0.0 CCH Pounder 1000.0 \n",
"1 NaN NaN Joel David Moore 936.0 \n",
"2 NaN NaN Wes Studi 855.0 \n",
"3 Gore Verbinski 563.0 Johnny Depp 40000.0 \n",
"4 NaN NaN Orlando Bloom 5000.0 \n",
"... ... ... ... ... \n",
"14743 NaN NaN Daniel Henney 719.0 \n",
"14744 NaN NaN Eliza Coupe 490.0 \n",
"14745 Jon Gunn 16.0 John August 86.0 \n",
"14746 NaN NaN Brian Herzlinger 23.0 \n",
"14747 NaN NaN Jon Gunn 16.0 \n",
"\n",
"[14748 rows x 8 columns]"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"out = movies.pivot_longer(\n",
" index=slice(\"title\", \"duration\"),\n",
" names_to=\".value\",\n",
" names_pattern=\"(.+)_\\\\d\",\n",
" sort_by_appearance=True,\n",
")\n",
"\n",
"out"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wl = wl.droplevel(\"num\").reset_index()\n",
"wl.equals(out)"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vid = (\n",
" vid.droplevel(\"num\")\n",
" .loc[:, [\"director\", \"director_fb_likes\", \"actor\", \"actor_fb_likes\"]]\n",
" .reset_index()\n",
")\n",
"\n",
"wl.equals(vid)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vid.equals(out)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"52.7 ms ± 784 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%%timeit \n",
"pd.wide_to_long(\n",
" movies, \n",
" stubnames=['director', 'director_fb_likes', 'actor', 'actor_fb_likes'], \n",
" i = ['title', 'rating', 'year', 'duration'], \n",
" j = 'num', \n",
" sep='_')"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5.19 ms ± 122 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"index = ['title', 'rating', 'year', 'duration']\n",
"vid = movies.set_index(index)\n",
"vid.columns = vid.columns.str.rsplit(\"_\", n = 1, expand = True)\n",
"vid.stack(level = 1, future_stack=True)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.79 ms ± 53.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"(movies\n",
".pivot_longer(\n",
" index = slice('title', 'duration'),\n",
" names_to = \".value\",\n",
" names_pattern = \"(.+)_\\\\d\",\n",
" sort_by_appearance = True)\n",
")\n"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.93 ms ± 49.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"movies.pivot_longer(\n",
" index = slice('title', 'duration'),\n",
" names_to = ['director', 'director_fb_likes', 'actor', 'actor_fb_likes'],\n",
" names_pattern = ['director_\\\\d$', 'director_fb_likes', 'actor_\\\\d$', 'actor.+'],\n",
" sort_by_appearance = True\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Your mileage may vary with these speed tests."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-07-30T10:57:18.265994Z",
"iopub.status.busy": "2020-07-30T10:57:18.265751Z",
"iopub.status.idle": "2020-07-30T10:57:18.283104Z",
"shell.execute_reply": "2020-07-30T10:57:18.280684Z",
"shell.execute_reply.started": "2020-07-30T10:57:18.265970Z"
}
},
"source": [
"## **Summary**\n",
"\n",
"This blog post shows various ways to reshape data into long form with `pivot_longer`, based on some common patterns. Once you identify the pattern, use the corresponding `pivot_longer` syntax. It is meant to be simple, straightforward, and efficient. For more control, have a look at [pivot_longer_spec](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer_spec)."
]
},
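{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a taste of that finer control - a minimal sketch, assuming `pivot_longer_spec` follows the tidyr convention of a spec frame whose `.name` column lists the original columns and whose `.value` column names their destinations (the `df`, `spec`, and column names below are illustrative, not taken from the examples above):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"import janitor  # registers pyjanitor's pivot functions\n",
"\n",
"df = pd.DataFrame({'id': [1, 2], 'wk1': [5, 7], 'wk2': [6, 8]})\n",
"\n",
"# each row of the spec maps one original column ('.name') to a\n",
"# destination column ('.value'); extra spec columns ('week' here)\n",
"# become new variables in the reshaped output\n",
"spec = pd.DataFrame(\n",
"    {'.name': ['wk1', 'wk2'], '.value': ['score', 'score'], 'week': [1, 2]}\n",
")\n",
"\n",
"janitor.pivot_longer_spec(df, spec)\n",
"```"
]
},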
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comments\n",
""
]
}
],
"metadata": {
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "blogger",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 4
}