{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Pivot_longer : Reshape Data in Pandas from Wide to Long, Efficiently and with Ease" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(introduction)=\n", "## **Introduction**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You've got data in a wide form (lots of columns), and you wish to transform it into a long form (fewer columns, more rows). In pandas, depending on the form of the data, this can be achieved with `DataFrame.stack`, `pd.melt`, or `pd.wide_to_long`, and it sometimes requires a bit of data massaging before arriving at the final, desired form.\n", "\n", "This article suggests a simple, efficient, opinionated approach that resolves most wide-to-long transformations within a single function, using the [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) function from [pyjanitor](https://pyjanitor-devs.github.io/pyjanitor/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at an example, based on real data - this is a subset of the [who](https://github.com/tidyverse/tidyr/blob/main/data-raw/who.csv) data from R:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " pandas version : 2.2.2 \n", " numpy version : 2.0.2 \n", " python version : 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import sys\n", "\n", "print(\n", " \" pandas version :\",\n", " pd.__version__,\n", " \"\\n\",\n", " \"numpy version :\",\n", " np.__version__,\n", " \"\\n\",\n", " \"python version :\",\n", " sys.version,\n", ")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id new_sp_m5564 newrel_f65\n", "0 1 2 3" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({\"id\": [1], \"new_sp_m5564\": [2], \"newrel_f65\": [3]})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the dataframe above, we have multiple variables crammed into the column names. Using the definitions as described [here](https://tidyr.tidyverse.org/articles/pivot.html#many-variables-in-column-names): \n", "\n", " - new_/new prefix indicates these are counts of new cases\n", " - sp/rel/ep describe how the case was diagnosed\n", " - m/f gives the gender\n", " - 5564/65 supplies the age range\n", "\n", "Our goal here is to separate the dataframe into individual columns of `count`, `diagnosis`, `gender`, `age`. How do we pull this off with the current tools in pandas? \n", "\n", "`pd.wide_to_long` is not applicable here, as there is no unique separator. Our best option would be to use a regular expression to pull out the individual columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `DataFrame.stack`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id diagnosis gender age count\n", "0 1 sp m 5564 2\n", "1 1 rel f 65 3" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# split the columns using a regex, before stacking:\n", "DF = df.set_index(\"id\")\n", "regex = r\"new_?(.+)_(.)(\\d+)\"\n", "columns = DF.columns.str.extract(regex)\n", "new_names = [\"diagnosis\", \"gender\", \"age\"]\n", "columns = pd.MultiIndex.from_frame(columns, names=new_names)\n", "DF.columns = columns\n", "DF.stack(new_names, future_stack=True).rename(\"count\").reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `pd.melt`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id value diagnosis gender age\n", "0 1 2 sp m 5564\n", "1 1 3 rel f 65" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DF = df.melt(\"id\")\n", "out = DF.variable.str.extract(regex).set_axis(new_names, axis=1)\n", "pd.concat([DF.drop(columns=\"variable\"), out], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the extra steps we had to take to get to our final form. However, we know the pattern (a regex that matches the columns and the subgroups in the columns that we wish to extract), so why not take advantage of that? This is where [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) from [pyjanitor](https://pyjanitor-devs.github.io/pyjanitor/) shines:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id diagnosis gender age value\n", "0 1 sp m 5564 2\n", "1 1 rel f 65 3" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pip install pyjanitor\n", "import janitor\n", "\n", "df.pivot_longer(index=\"id\", names_to=new_names, names_pattern=regex)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Straight to the point. If there is a pattern, [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) will figure out what to do and deliver the results. It is a simple, efficient reshaping abstraction." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Basics**\n", "\n", "Let's walk through some of the parameters in [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer):\n", "\n", "- index: These are the columns that will not be transformed to long form - the values are repeated as necessary.\n", "- column_names: These are the columns that will be transformed to long form.\n", "- names_to: These will be the new column names of the reshaped DataFrame.\n", "- values_to: These will contain the values collated from the original DataFrame.\n", "- names_sep: extract individual columns with a separator.\n", "- names_pattern: extract individual columns with a regular expression, or a list of regular expressions.\n", "- names_transform: efficiently convert the columns extracted from the `column_names` from string to another dtype, as long as the dtype is supported by `astype`.\n", "\n", "If only one of `index` or `column_names` is provided, the other is inferred from the remaining columns - see the sketch below.\n", "\n", "Keep in mind that [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) works on patterns - deciphering what pattern to use is the key to its simplicity. \n", "\n", "The rest of the article outlines the various patterns that [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) works with." ] },
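{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of that inference - the `toy` dataframe below is made up purely for illustration. Specifying only `index` reshapes every other column; passing only `column_names=[\"x\", \"y\"]` instead would be equivalent:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a made-up frame, just to illustrate index/column_names inference\n", "toy = pd.DataFrame({\"id\": [1], \"x\": [2], \"y\": [3]})\n", "# only `index` is specified - every other column is reshaped\n", "toy.pivot_longer(index=\"id\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Pattern 1 - Melt all Columns**\n", "\n", "This is basic and is similar to what you get with `pd.melt`.\n", "\n", "Let's use another realistic dataset - [relig_income](https://github.com/tidyverse/tidyr/blob/main/data-raw/relig_income.csv) - the original example is from [here](https://tidyr.tidyverse.org/articles/pivot.html#pew):" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "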
religion<$10k$10-20k$20-30k$30-40k$40-50k$50-75k$75-100k$100-150k>150kDon't know/refused
0Agnostic27346081761371221098496
1Atheist12273752357073597476
2Buddhist27213034335862395354
3Catholic41861773267063811169497926331489
4Don’t know/refused151415111035211718116
\n", "
" ], "text/plain": [ " religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k \\\n", "0 Agnostic 27 34 60 81 76 137 \n", "1 Atheist 12 27 37 52 35 70 \n", "2 Buddhist 27 21 30 34 33 58 \n", "3 Catholic 418 617 732 670 638 1116 \n", "4 Don’t know/refused 15 14 15 11 10 35 \n", "\n", " $75-100k $100-150k >150k Don't know/refused \n", "0 122 109 84 96 \n", "1 73 59 74 76 \n", "2 62 39 53 54 \n", "3 949 792 633 1489 \n", "4 21 17 18 116 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = \"https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/relig_income.csv\"\n", "religion = pd.read_csv(url)\n", "religion.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " religion income count\n", "0 Agnostic <$10k 27\n", "1 Atheist <$10k 12\n", "2 Buddhist <$10k 27\n", "3 Catholic <$10k 418\n", "4 Don’t know/refused <$10k 15\n", ".. ... ... ...\n", "175 Orthodox Don't know/refused 73\n", "176 Other Christian Don't know/refused 18\n", "177 Other Faiths Don't know/refused 71\n", "178 Other World Religions Don't know/refused 8\n", "179 Unaffiliated Don't know/refused 597\n", "\n", "[180 rows x 3 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "religion.pivot_longer(index=\"religion\", names_to=\"income\", values_to=\"count\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If there are a lot of columns, we can dynamically select the `index` or `column_names` using the [janitor.select](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.select.select) syntax." ] },
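{ "cell_type": "markdown", "metadata": {}, "source": [ "For instance - a minimal sketch, assuming the select syntax accepts a compiled regular expression; the `toy` dataframe is made up for illustration:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "toy = pd.DataFrame({\"id\": [1], \"wk1\": [10], \"wk2\": [20]})\n", "# assumption: the select syntax accepts a compiled regex -\n", "# dynamically pick every column whose label starts with \"wk\"\n", "toy.pivot_longer(column_names=re.compile(\"^wk\"), names_to=\"week\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's apply this to another [dataset](https://github.com/tidyverse/tidyr/blob/main/data-raw/billboard.csv) related to billboard charts - original example is [here](https://tidyr.tidyverse.org/articles/pivot.html#billboard): " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "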
yearartisttracktimedate.enteredwk1wk2wk3wk4wk5...wk67wk68wk69wk70wk71wk72wk73wk74wk75wk76
020002 PacBaby Don't Cry (Keep...4:222000-02-268782.072.077.087.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
120002Ge+herThe Hardest Part Of ...3:152000-09-029187.092.0NaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
220003 Doors DownKryptonite3:532000-04-088170.068.067.066.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
320003 Doors DownLoser4:242000-10-217676.072.069.067.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
42000504 BoyzWobble Wobble3:352000-04-155734.025.017.017.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 81 columns

\n", "
" ], "text/plain": [ " year artist track time date.entered wk1 wk2 \\\n", "0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 87 82.0 \n", "1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 91 87.0 \n", "2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 81 70.0 \n", "3 2000 3 Doors Down Loser 4:24 2000-10-21 76 76.0 \n", "4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 57 34.0 \n", "\n", " wk3 wk4 wk5 ... wk67 wk68 wk69 wk70 wk71 wk72 wk73 wk74 \\\n", "0 72.0 77.0 87.0 ... NaN NaN NaN NaN NaN NaN NaN NaN \n", "1 92.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN \n", "2 68.0 67.0 66.0 ... NaN NaN NaN NaN NaN NaN NaN NaN \n", "3 72.0 69.0 67.0 ... NaN NaN NaN NaN NaN NaN NaN NaN \n", "4 25.0 17.0 17.0 ... NaN NaN NaN NaN NaN NaN NaN NaN \n", "\n", " wk75 wk76 \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", "\n", "[5 rows x 81 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = \"https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/billboard.csv\"\n", "billboard = pd.read_csv(url)\n", "billboard.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The columns that will be flipped to long form are the columns that start with `wk` - these will be passed to the `column_names` parameter. That is a lot of columns. Let's see how [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) simplifies this:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " year artist track time date.entered \\\n", "0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 \n", "1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 \n", "2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 \n", "3 2000 3 Doors Down Loser 4:24 2000-10-21 \n", "4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 \n", "... ... ... ... ... ... \n", "24087 2000 Yankee Grey Another Nine Minutes 3:10 2000-04-29 \n", "24088 2000 Yearwood, Trisha Real Live Woman 3:55 2000-04-01 \n", "24089 2000 Ying Yang Twins Whistle While You Tw... 4:19 2000-03-18 \n", "24090 2000 Zombie Nation Kernkraft 400 3:30 2000-09-02 \n", "24091 2000 matchbox twenty Bent 4:12 2000-04-29 \n", "\n", " week value \n", "0 wk1 87.0 \n", "1 wk1 91.0 \n", "2 wk1 81.0 \n", "3 wk1 76.0 \n", "4 wk1 57.0 \n", "... ... ... \n", "24087 wk76 NaN \n", "24088 wk76 NaN \n", "24089 wk76 NaN \n", "24090 wk76 NaN \n", "24091 wk76 NaN \n", "\n", "[24092 rows x 7 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "billboard.pivot_longer(column_names=\"wk*\", names_to=\"week\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to use the `index` parameter instead, that's easy as well: " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " year artist track time date.entered \\\n", "0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 \n", "1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 \n", "2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 \n", "3 2000 3 Doors Down Loser 4:24 2000-10-21 \n", "4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 \n", "... ... ... ... ... ... \n", "24087 2000 Yankee Grey Another Nine Minutes 3:10 2000-04-29 \n", "24088 2000 Yearwood, Trisha Real Live Woman 3:55 2000-04-01 \n", "24089 2000 Ying Yang Twins Whistle While You Tw... 4:19 2000-03-18 \n", "24090 2000 Zombie Nation Kernkraft 400 3:30 2000-09-02 \n", "24091 2000 matchbox twenty Bent 4:12 2000-04-29 \n", "\n", " week value \n", "0 wk1 87.0 \n", "1 wk1 91.0 \n", "2 wk1 81.0 \n", "3 wk1 76.0 \n", "4 wk1 57.0 \n", "... ... ... \n", "24087 wk76 NaN \n", "24088 wk76 NaN \n", "24089 wk76 NaN \n", "24090 wk76 NaN \n", "24091 wk76 NaN \n", "\n", "[24092 rows x 7 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "billboard.pivot_longer(index=slice(\"year\", \"date.entered\"), names_to=\"week\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(pattern-2a)=\n", "## **Pattern 2A - Melt into multiple columns - new column names**\n", "\n", "There are scenarios where multiple variables are crammed into the column names - we had a look at that already with the `who` [example](#introduction).\n", "\n", "Let's look at an example, adapted from [data.table vignette](https://rdatatable.gitlab.io/data.table/articles/datatable-reshape.html#a-enhanced-melt):" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 5.9 3.0 5.1 1.8 virginica" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris = pd.DataFrame(\n", " {\n", " \"Sepal.Length\": [5.1, 5.9],\n", " \"Sepal.Width\": [3.5, 3.0],\n", " \"Petal.Length\": [1.4, 5.1],\n", " \"Petal.Width\": [0.2, 1.8],\n", " \"Species\": [\"setosa\", \"virginica\"],\n", " }\n", ")\n", "\n", "iris" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the data above, we wish to consolidate the columns into `part` and `dimension` - the `Sepal/Petal` prefixes go into the `part` column, while `Width/Length` go into the `dimension` column. \n", "\n", "In this case, we can use either the `names_sep` or the `names_pattern` parameter - for this particular scenario, the `names_sep` option is a better fit: " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Species part dimension value\n", "0 setosa Sepal Length 5.1\n", "1 virginica Sepal Length 5.9\n", "2 setosa Sepal Width 3.5\n", "3 virginica Sepal Width 3.0\n", "4 setosa Petal Length 1.4\n", "5 virginica Petal Length 5.1\n", "6 setosa Petal Width 0.2\n", "7 virginica Petal Width 1.8" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.pivot_longer(index=\"Species\", names_to=[\"part\", \"dimension\"], names_sep=r\"\\.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `names_pattern`:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Species part dimension value\n", "0 setosa Sepal Length 5.1\n", "1 virginica Sepal Length 5.9\n", "2 setosa Sepal Width 3.5\n", "3 virginica Sepal Width 3.0\n", "4 setosa Petal Length 1.4\n", "5 virginica Petal Length 5.1\n", "6 setosa Petal Width 0.2\n", "7 virginica Petal Width 1.8" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.pivot_longer(\n", " index=\"Species\", names_to=[\"part\", \"dimension\"], names_pattern=r\"(.+)\\.(.+)\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Pattern 2B - Melt into multiple columns - Use variables within the column names as new names**\n", "\n", "Instead of new column names, what if we wanted to use existing variables within the column names? \n", "\n", "Let's look at an example adapted from [data.table vignette](https://rdatatable.gitlab.io/data.table/articles/datatable-reshape.html#a-melting-data-tables-wide-to-long):" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " family dob_child1 dob_child2 gender_child1 gender_child2\n", "0 1 1998-11-26 2000-01-29 1 2.0\n", "1 2 1996-06-22 NaN 2 NaN\n", "2 3 2002-07-11 2004-04-05 2 2.0\n", "3 4 2004-10-10 2009-08-27 1 1.0\n", "4 5 2000-12-05 2005-02-28 2 1.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(\n", " {\n", " \"family\": [1, 2, 3, 4, 5],\n", " \"dob_child1\": [\n", " \"1998-11-26\",\n", " \"1996-06-22\",\n", " \"2002-07-11\",\n", " \"2004-10-10\",\n", " \"2000-12-05\",\n", " ],\n", " \"dob_child2\": [\n", " \"2000-01-29\",\n", " np.nan,\n", " \"2004-04-05\",\n", " \"2009-08-27\",\n", " \"2005-02-28\",\n", " ],\n", " \"gender_child1\": [1, 2, 2, 1, 2],\n", " \"gender_child2\": [2.0, np.nan, 2.0, 1.0, 1.0],\n", " }\n", ")\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the data above, we wish to keep `dob` and `gender` as column names, while moving `child1` and `child2` to a new column. How does [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) reshape this?" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " family child dob gender\n", "0 1 child1 1998-11-26 1.0\n", "1 2 child1 1996-06-22 2.0\n", "2 3 child1 2002-07-11 2.0\n", "3 4 child1 2004-10-10 1.0\n", "4 5 child1 2000-12-05 2.0\n", "5 1 child2 2000-01-29 2.0\n", "6 2 child2 NaN NaN\n", "7 3 child2 2004-04-05 2.0\n", "8 4 child2 2009-08-27 1.0\n", "9 5 child2 2005-02-28 1.0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_longer(index=\"family\", names_to=(\".value\", \"child\"), names_sep=\"_\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For patterns such as this, where a part of the column name is to be retained as a header in the new dataframe, the `.value` placeholder comes in handy. The `.value` placeholder tells [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) - `hey, this variable in the column name is to be retained as a header`. [pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) gets the message and acts accordingly. \n", "\n", "Take column label `dob_child1` - the first part is `dob` and the second part is `child1`. This pairs with `(.value, child)`, meaning `dob` stays as a header, and the rest gets lumped into the `child` column. The `names_sep` value - `_` - determines how the column is split into the new columns." ] },
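{ "cell_type": "markdown", "metadata": {}, "source": [ "For comparison, here is a minimal sketch of the same reshape with `pd.wide_to_long` - `stubnames` plays the role of `.value`, while `j` plays the role of `child`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the stubnames (dob, gender) stay as headers, like `.value`;\n", "# everything after the separator lands in the `child` column\n", "pd.wide_to_long(\n", " df,\n", " stubnames=[\"dob\", \"gender\"],\n", " i=\"family\",\n", " j=\"child\",\n", " sep=\"_\",\n", " suffix=r\"child\\d\",\n", ").reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `names_pattern` we can have more control over the output, selecting just the numbers in the `child` column:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "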
familychilddobgender
0111998-11-261.0
1211996-06-222.0
2312002-07-112.0
3412004-10-101.0
4512000-12-052.0
5122000-01-292.0
622NaNNaN
7322004-04-052.0
8422009-08-271.0
9522005-02-281.0
\n", "
" ], "text/plain": [ " family child dob gender\n", "0 1 1 1998-11-26 1.0\n", "1 2 1 1996-06-22 2.0\n", "2 3 1 2002-07-11 2.0\n", "3 4 1 2004-10-10 1.0\n", "4 5 1 2000-12-05 2.0\n", "5 1 2 2000-01-29 2.0\n", "6 2 2 NaN NaN\n", "7 3 2 2004-04-05 2.0\n", "8 4 2 2009-08-27 1.0\n", "9 5 2 2005-02-28 1.0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_longer(\n", " index=\"family\", names_to=(\".value\", \"child\"), names_pattern=r\"(.+)_child(\\d)\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also efficiently convert the data type of `child` to integer, using the `names_transform` option:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " family child dob gender\n", "0 1 1 1998-11-26 1.0\n", "1 2 1 1996-06-22 2.0\n", "2 3 1 2002-07-11 2.0\n", "3 4 1 2004-10-10 1.0\n", "4 5 1 2000-12-05 2.0\n", "5 1 2 2000-01-29 2.0\n", "6 2 2 NaN NaN\n", "7 3 2 2004-04-05 2.0\n", "8 4 2 2009-08-27 1.0\n", "9 5 2 2005-02-28 1.0" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_longer(\n", " index=\"family\",\n", " names_to=(\".value\", \"child\"),\n", " names_pattern=r\"(.+)_child(\\d)\",\n", " names_transform=np.int8,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The flexibility of `.value` means you can place it anywhere - you are not limited by position - as long as the right argument is passed to the `names_sep` or `names_pattern` parameter, your output should be fine.\n", "\n", "Let's apply this to the `iris` [dataset](#pattern-2a) to test the flexibility of `.value`:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Species part Length Width\n", "0 setosa Sepal 5.1 3.5\n", "1 virginica Sepal 5.9 3.0\n", "2 setosa Petal 1.4 0.2\n", "3 virginica Petal 5.1 1.8" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.pivot_longer(index=\"Species\", names_to=(\"part\", \".value\"), names_sep=r\"\\.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use multiple `.value` if required - let's apply this to a question from [Stack Overflow](https://stackoverflow.com/questions/12466493/reshaping-multiple-sets-of-measurement-columns-wide-format-into-single-columns):" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End \\\n", "0 1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 \n", "\n", " Value2 DateRange3Start DateRange3End Value3 \n", "0 6.2 5/5/95 6/6/96 3.3 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(\n", " [\n", " {\n", " \"ID\": 1,\n", " \"DateRange1Start\": \"1/1/90\",\n", " \"DateRange1End\": \"3/1/90\",\n", " \"Value1\": 4.4,\n", " \"DateRange2Start\": \"4/5/91\",\n", " \"DateRange2End\": \"6/7/91\",\n", " \"Value2\": 6.2,\n", " \"DateRange3Start\": \"5/5/95\",\n", " \"DateRange3End\": \"6/6/96\",\n", " \"Value3\": 3.3,\n", " }\n", " ]\n", ")\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ID DateRangeStart DateRangeEnd Value\n", "0 1 1/1/90 3/1/90 4.4\n", "1 1 4/5/91 6/7/91 6.2\n", "2 1 5/5/95 6/6/96 3.3" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_longer(index=\"ID\", names_to=(\".value\", \".value\"), names_pattern=r\"(.+)\\d(.*)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, as long as the pattern can be identified, `pivot_longer` will take care of the rest. \n", "\n", "One more example, adapted from [Stack Overflow](https://stackoverflow.com/questions/64107566/how-to-pivot-longer-and-populate-with-fields-from-column-names-at-the-same-tim?noredirect=1#comment113369419_64107566):" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Sony | TV | Model | value Sony | TV | Quantity | value \\\n", "0 A222 5 \n", "1 A234 5 \n", "2 A4345 4 \n", "\n", " Sony | TV | Max-quant | value Panasonic | TV | Model | value \\\n", "0 10 T232 \n", "1 9 S3424 \n", "2 9 X3421 \n", "\n", " Panasonic | TV | Quantity | value Panasonic | TV | Max-quant | value \\\n", "0 1 10 \n", "1 5 12 \n", "2 1 11 \n", "\n", " Sanyo | Radio | Model | value Sanyo | Radio | Quantity | value \\\n", "0 S111 4 \n", "1 S1s1 2 \n", "2 S1s2 4 \n", "\n", " Sanyo | Radio | Max-quant | value \n", "0 9 \n", "1 9 \n", "2 10 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(\n", " {\n", " \"Sony | TV | Model | value\": {0: \"A222\", 1: \"A234\", 2: \"A4345\"},\n", " \"Sony | TV | Quantity | value\": {0: 5, 1: 5, 2: 4},\n", " \"Sony | TV | Max-quant | value\": {0: 10, 1: 9, 2: 9},\n", " \"Panasonic | TV | Model | value\": {0: \"T232\", 1: \"S3424\", 2: \"X3421\"},\n", " \"Panasonic | TV | Quantity | value\": {0: 1, 1: 5, 2: 1},\n", " \"Panasonic | TV | Max-quant | value\": {0: 10, 1: 12, 2: 11},\n", " \"Sanyo | Radio | Model | value\": {0: \"S111\", 1: \"S1s1\", 2: \"S1s2\"},\n", " \"Sanyo | Radio | Quantity | value\": {0: 4, 1: 2, 2: 4},\n", " \"Sanyo | Radio | Max-quant | value\": {0: 9, 1: 9, 2: 10},\n", " }\n", ")\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Manufacturer Device Model Quantity Max-quant\n", "0 Sony TV A222 5 10\n", "1 Sony TV A234 5 9\n", "2 Sony TV A4345 4 9\n", "3 Panasonic TV T232 1 10\n", "4 Panasonic TV S3424 5 12\n", "5 Panasonic TV X3421 1 11\n", "6 Sanyo Radio S111 4 9\n", "7 Sanyo Radio S1s1 2 9\n", "8 Sanyo Radio S1s2 4 10" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_longer(\n", " names_to=(\"Manufacturer\", \"Device\", \".value\"),\n", " names_pattern=r\"(.+)\\s\\|(.+)\\s\\|(.+)\\s\\|.*\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(pattern-3a)=\n", "## **Pattern 3A - Group similar columns together**\n", "\n", "This is best explained with an example:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " x_1_mean x_2_mean y_1_mean y_2_mean unit\n", "0 10 20 30 40 50" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_mean = pd.DataFrame(\n", " {\n", " \"x_1_mean\": [10],\n", " \"x_2_mean\": [20],\n", " \"y_1_mean\": [30],\n", " \"y_2_mean\": [40],\n", " \"unit\": [50],\n", " }\n", ")\n", "\n", "df_mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the dataframe above, we wish to lump similar columns together - specifically, we want to lump the `x*` columns into one column, and likewise the `y*` columns. \n", "\n", "For this pattern, we pass a list of names to `names_to`, while passing a list of regular expressions to `names_pattern`:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " unit x_mean y_mean\n", "0 50 10 30\n", "1 50 20 40" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_mean.pivot_longer(\n", " index=\"unit\", names_to=[\"x_mean\", \"y_mean\"], names_pattern=[\"x\", \"y\"]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using a list of regular expressions for `names_pattern` implies that you want to group similar items under specified columns in `names_to`. \n", "\n", "Basically, it is a pairing: the first `x*` column, `x_1_mean`, pairs with the first `y*` column it encounters, which is `y_1_mean`; the next `x*` column, `x_2_mean`, pairs with `y_2_mean`, since that is the next available `y*`. The pairing is on a first come, first served basis.\n", "\n", "Let's look at another [example](https://github.com/PacktPublishing/Pandas-Cookbook/blob/master/data/movie_altered.csv):" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title rating year duration \\\n", "0 Avatar PG-13 2009.0 178.0 \n", "1 Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 \n", "2 Spectre PG-13 2015.0 148.0 \n", "3 The Dark Knight Rises PG-13 2012.0 164.0 \n", "4 Star Wars: Episode VII - The Force Awakens NaN NaN NaN \n", "\n", " director_1 director_fb_likes_1 actor_1 actor_2 \\\n", "0 James Cameron 0.0 CCH Pounder Joel David Moore \n", "1 Gore Verbinski 563.0 Johnny Depp Orlando Bloom \n", "2 Sam Mendes 0.0 Christoph Waltz Rory Kinnear \n", "3 Christopher Nolan 22000.0 Tom Hardy Christian Bale \n", "4 Doug Walker 131.0 Doug Walker Rob Walker \n", "\n", " actor_3 actor_fb_likes_1 actor_fb_likes_2 actor_fb_likes_3 \n", "0 Wes Studi 1000.0 936.0 855.0 \n", "1 Jack Davenport 40000.0 5000.0 1000.0 \n", "2 Stephanie Sigman 11000.0 393.0 161.0 \n", "3 Joseph Gordon-Levitt 27000.0 23000.0 23000.0 \n", "4 NaN 131.0 12.0 NaN " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = \"https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/movie_altered.csv\"\n", "movies = pd.read_csv(url)\n", "movies.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal for the dataframe above is to group the directors and actors into their own columns, and the Facebook likes into separate columns. The pattern here is obvious - we are grouping similar columns, so we pass a list of regular expressions to `names_pattern`, with an equal number of names to `names_to`:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title rating year duration \\\n", "0 Avatar PG-13 2009.0 178.0 \n", "1 Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 \n", "2 Spectre PG-13 2015.0 148.0 \n", "3 The Dark Knight Rises PG-13 2012.0 164.0 \n", "4 Star Wars: Episode VII - The Force Awakens NaN NaN NaN \n", "... ... ... ... ... \n", "14743 Signed Sealed Delivered NaN 2013.0 87.0 \n", "14744 The Following TV-14 NaN 43.0 \n", "14745 A Plague So Pleasant NaN 2013.0 76.0 \n", "14746 Shanghai Calling PG-13 2012.0 100.0 \n", "14747 My Date with Drew PG 2004.0 90.0 \n", "\n", " director director_fb_likes actor actor_fb_likes \n", "0 James Cameron 0.0 CCH Pounder 1000.0 \n", "1 Gore Verbinski 563.0 Johnny Depp 40000.0 \n", "2 Sam Mendes 0.0 Christoph Waltz 11000.0 \n", "3 Christopher Nolan 22000.0 Tom Hardy 27000.0 \n", "4 Doug Walker 131.0 Doug Walker 131.0 \n", "... ... ... ... ... \n", "14743 NaN NaN Crystal Lowe 319.0 \n", "14744 NaN NaN Sam Underwood 319.0 \n", "14745 NaN NaN David Chandler 0.0 \n", "14746 NaN NaN Eliza Coupe 490.0 \n", "14747 NaN NaN Jon Gunn 16.0 \n", "\n", "[14748 rows x 8 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies.pivot_longer(\n", " index=slice(\"title\", \"duration\"),\n", " names_to=[\"director\", \"director_fb_likes\", \"actor\", \"actor_fb_likes\"],\n", " names_pattern=[\"director_\\\\d$\", \"director_fb_likes\", \"actor_\\\\d$\", \"actor.+\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the same first come, first served approach, `director_1` is paired with `director_fb_likes_1`, `actor_1` is paired with `actor_fb_likes_1`; the pairing continues for the rest.\n", "\n", "This could also have been solved using the `.value` approach:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title rating year duration \\\n", "0 Avatar PG-13 2009.0 178.0 \n", "1 Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 \n", "2 Spectre PG-13 2015.0 148.0 \n", "3 The Dark Knight Rises PG-13 2012.0 164.0 \n", "4 Star Wars: Episode VII - The Force Awakens NaN NaN NaN \n", "... ... ... ... ... \n", "14743 Signed Sealed Delivered NaN 2013.0 87.0 \n", "14744 The Following TV-14 NaN 43.0 \n", "14745 A Plague So Pleasant NaN 2013.0 76.0 \n", "14746 Shanghai Calling PG-13 2012.0 100.0 \n", "14747 My Date with Drew PG 2004.0 90.0 \n", "\n", " director director_fb_likes actor actor_fb_likes \n", "0 James Cameron 0.0 CCH Pounder 1000.0 \n", "1 Gore Verbinski 563.0 Johnny Depp 40000.0 \n", "2 Sam Mendes 0.0 Christoph Waltz 11000.0 \n", "3 Christopher Nolan 22000.0 Tom Hardy 27000.0 \n", "4 Doug Walker 131.0 Doug Walker 131.0 \n", "... ... ... ... ... \n", "14743 NaN NaN Crystal Lowe 319.0 \n", "14744 NaN NaN Sam Underwood 319.0 \n", "14745 NaN NaN David Chandler 0.0 \n", "14746 NaN NaN Eliza Coupe 490.0 \n", "14747 NaN NaN Jon Gunn 16.0 \n", "\n", "[14748 rows x 8 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " movies.pivot_longer(\n", " index=slice(\"title\", \"duration\"), names_to=\".value\", names_pattern=\"(.+)_\\\\d\"\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, it is all about identifying the pattern, and using whichever option is more suitable/convenient. One more [example](https://github.com/tidyverse/tidyr/issues/1378#issue-1300743144):" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id A A_date B B_date other other_date\n", "0 1 A 1 \n", "1 2 B 3 \n", "2 3 A 2 B 2 \n", "3 4 C 1\n", "4 5 D 5" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pandas import NA\n", "\n", "treatments = dict(\n", " id=range(1, 6),\n", " A=(\"A\", NA, \"A\", NA, NA),\n", " A_date=(1, NA, 2, NA, NA),\n", " B=(NA, \"B\", \"B\", NA, NA),\n", " B_date=(NA, 3, 2, NA, NA),\n", " other=(NA, NA, NA, \"C\", \"D\"),\n", " other_date=(NA, NA, NA, 1, 5),\n", ")\n", "treatments = pd.DataFrame(treatments)\n", "treatments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal is to transform the dataframe into three columns, an `id` column, a `treatment` column, and a `date` column. Since we are grouping similar columns, we pass a list of regular expressions to `names_pattern`:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id treatment date\n", "0 1 A 1\n", "1 3 A 2\n", "2 2 B 3\n", "3 3 B 2\n", "4 4 C 1\n", "5 5 D 5" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "treatments.pivot_longer(\n", " index=\"id\",\n", " names_to=[\"treatment\", \"date\"],\n", " names_pattern=[\"A$|B$|other$\", \".+date$\"],\n", " dropna=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way is to capture the dates before the treatments by reversing the order of the regular expressions:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id treatment date\n", "0 1 A 1\n", "1 3 A 2\n", "2 2 B 3\n", "3 3 B 2\n", "4 4 C 1\n", "5 5 D 5" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "treatments.pivot_longer(\n", " index=\"id\",\n", " names_to=[\"date\", \"treatment\"],\n", " names_pattern=[\".+date$\", \".+\"],\n", " dropna=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Pattern 3B - Group similar columns and values together**\n", "\n", "This is best illustrated with an [example](https://stackoverflow.com/q/51519101/7175713):\n" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityStateNameMangoOrangeWatermelonGinVodka
0HoustonTexasAria410401620
1AustinTexasPenelope1089920033
2HooverAlabamaNiko9014433418
\n", "
" ], "text/plain": [ " City State Name Mango Orange Watermelon Gin Vodka\n", "0 Houston Texas Aria 4 10 40 16 20\n", "1 Austin Texas Penelope 10 8 99 200 33\n", "2 Hoover Alabama Niko 90 14 43 34 18" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(\n", " {\n", " \"City\": [\"Houston\", \"Austin\", \"Hoover\"],\n", " \"State\": [\"Texas\", \"Texas\", \"Alabama\"],\n", " \"Name\": [\"Aria\", \"Penelope\", \"Niko\"],\n", " \"Mango\": [4, 10, 90],\n", " \"Orange\": [10, 8, 14],\n", " \"Watermelon\": [40, 99, 43],\n", " \"Gin\": [16, 200, 34],\n", " \"Vodka\": [20, 33, 18],\n", " },\n", " columns=[\"City\", \"State\", \"Name\", \"Mango\", \"Orange\", \"Watermelon\", \"Gin\", \"Vodka\"],\n", ")\n", "\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal in this case is to group similar columns **and** values together - group the drinks into one column, group the fruits into another column. When that is done, group the values for the drinks into one column, and the values for the fruits into another column. That's the pattern, and for this pattern we continue with the idea from [earlier](#pattern-3a), by passing a list of regular expressions to `names_pattern`, an equal number of names to `names_to`, **and** an equal number of names to `values_to`:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityStateFruitDrinkPoundsOunces
0HoustonTexasMangoGin416.0
1AustinTexasMangoGin10200.0
2HooverAlabamaMangoGin9034.0
3HoustonTexasOrangeVodka1020.0
4AustinTexasOrangeVodka833.0
5HooverAlabamaOrangeVodka1418.0
6HoustonTexasWatermelonNone40NaN
7AustinTexasWatermelonNone99NaN
8HooverAlabamaWatermelonNone43NaN
\n", "
" ], "text/plain": [ " City State Fruit Drink Pounds Ounces\n", "0 Houston Texas Mango Gin 4 16.0\n", "1 Austin Texas Mango Gin 10 200.0\n", "2 Hoover Alabama Mango Gin 90 34.0\n", "3 Houston Texas Orange Vodka 10 20.0\n", "4 Austin Texas Orange Vodka 8 33.0\n", "5 Hoover Alabama Orange Vodka 14 18.0\n", "6 Houston Texas Watermelon None 40 NaN\n", "7 Austin Texas Watermelon None 99 NaN\n", "8 Hoover Alabama Watermelon None 43 NaN" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_longer(\n", " index=[\"City\", \"State\"],\n", " column_names=slice(\"Mango\", \"Vodka\"),\n", " names_to=(\"Fruit\", \"Drink\"),\n", " values_to=(\"Pounds\", \"Ounces\"),\n", " names_pattern=[r\"M|O|W\", r\"G|V\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Simple, straightforward, to the point, and efficient. Let's look at another [example](https://stackoverflow.com/q/37422487/7175713):" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateItemNx1x2x3y1y2y3z1z2z3
0CA16437531151
1CA273115105421
2FL33215321372
3FL49421614121454
\n", "
" ], "text/plain": [ " State ItemN x1 x2 x3 y1 y2 y3 z1 z2 z3\n", "0 CA 1 6 4 3 7 5 3 11 5 1\n", "1 CA 2 7 3 1 15 10 5 4 2 1\n", "2 FL 3 3 2 1 5 3 2 13 7 2\n", "3 FL 4 9 4 2 16 14 12 14 5 4" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = {\n", " \"State\": [\"CA\", \"CA\", \"FL\", \"FL\"],\n", " \"ItemN\": [1, 2, 3, 4],\n", " \"x1\": [6, 7, 3, 9],\n", " \"x2\": [4, 3, 2, 4],\n", " \"x3\": [3, 1, 1, 2],\n", " \"y1\": [7, 15, 5, 16],\n", " \"y2\": [5, 10, 3, 14],\n", " \"y3\": [3, 5, 2, 12],\n", " \"z1\": [11, 4, 13, 14],\n", " \"z2\": [5, 2, 7, 5],\n", " \"z3\": [1, 1, 2, 4],\n", "}\n", "\n", "df = pd.DataFrame(df)\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal is to group all the `x` columns into one, the `y` columns into one column, and group the associated values into individual columns as well. We are grouping similar columns **and** values - that's the pattern - again we pass a list of regular expressions to `names_pattern`, a list of names to `names_to`, and a list of names to `values_to`:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateItemNxyzxvalueyvaluezvalue
0CA1x1y1z16711
1CA2x1y1z17154
2FL3x1y1z13513
3FL4x1y1z191614
4CA1x2y2z2455
5CA2x2y2z23102
6FL3x2y2z2237
7FL4x2y2z24145
8CA1x3y3z3331
9CA2x3y3z3151
10FL3x3y3z3122
11FL4x3y3z32124
\n", "
" ], "text/plain": [ " State ItemN x y z xvalue yvalue zvalue\n", "0 CA 1 x1 y1 z1 6 7 11\n", "1 CA 2 x1 y1 z1 7 15 4\n", "2 FL 3 x1 y1 z1 3 5 13\n", "3 FL 4 x1 y1 z1 9 16 14\n", "4 CA 1 x2 y2 z2 4 5 5\n", "5 CA 2 x2 y2 z2 3 10 2\n", "6 FL 3 x2 y2 z2 2 3 7\n", "7 FL 4 x2 y2 z2 4 14 5\n", "8 CA 1 x3 y3 z3 3 3 1\n", "9 CA 2 x3 y3 z3 1 5 1\n", "10 FL 3 x3 y3 z3 1 2 2\n", "11 FL 4 x3 y3 z3 2 12 4" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_longer(\n", " index=[\"State\", \"ItemN\"],\n", " names_to=[\"x\", \"y\", \"z\"],\n", " values_to=[\"xvalue\", \"yvalue\", \"zvalue\"],\n", " names_pattern=[\"x\", \"y\", \"z\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Performance**\n", "\n", "[pivot_longer](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer) is not just simple, convenient, and straightforward, it is efficient as well:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "503 μs ± 4.79 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" ] } ], "source": [ "%timeit religion.pivot_longer(index = 'religion', names_to = 'income', values_to = 'count')" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "582 μs ± 13.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" ] } ], "source": [ "%timeit religion.melt(id_vars = 'religion', var_name = 'income', value_name = 'count')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Sample data](https://grantmcdermott.com/reshape-benchmarks/), using 1000 columns (you can personally increase the number of columns to 1e6 or 1e8, depending on the power of your CPU):" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idgrpx1x2x3x4x5x6x7x8...x991x992x993x994x995x996x997x998x999x1000
011000AIAMOEFD...CLGIRXOWLL
12528IXUQBXIJ...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
23151JQHFBXBF...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
34343PAKKNREE...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
45460LDGXIOCW...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 1002 columns

\n", "
" ], "text/plain": [ " id grp x1 x2 x3 x4 x5 x6 x7 x8 ... x991 x992 x993 x994 x995 x996 x997 \\\n", "0 1 1000 A I A M O E F D ... C L G I R X O \n", "1 2 528 I X U Q B X I J ... NaN NaN NaN NaN NaN NaN NaN \n", "2 3 151 J Q H F B X B F ... NaN NaN NaN NaN NaN NaN NaN \n", "3 4 343 P A K K N R E E ... NaN NaN NaN NaN NaN NaN NaN \n", "4 5 460 L D G X I O C W ... NaN NaN NaN NaN NaN NaN NaN \n", "\n", " x998 x999 x1000 \n", "0 W L L \n", "1 NaN NaN NaN \n", "2 NaN NaN NaN \n", "3 NaN NaN NaN \n", "4 NaN NaN NaN \n", "\n", "[5 rows x 1002 columns]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sparse_wide = pd.read_csv(\"Data_files/sparse-wide.csv\")\n", "sparse_wide.head()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "53.8 ms ± 410 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "%timeit sparse_wide.melt(id_vars = ['id', 'grp'])" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15.7 ms ± 201 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%timeit sparse_wide.pivot_longer(index = ['id', 'grp'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's test on the `movies` dataset:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
directordirector_fb_likesactoractor_fb_likes
titleratingyeardurationnum
AvatarPG-132009.0178.01James Cameron0.0CCH Pounder1000.0
2NaNNaNJoel David Moore936.0
3NaNNaNWes Studi855.0
Pirates of the Caribbean: At World's EndPG-132007.0169.01Gore Verbinski563.0Johnny Depp40000.0
2NaNNaNOrlando Bloom5000.0
...........................
Shanghai CallingPG-132012.0100.02NaNNaNDaniel Henney719.0
3NaNNaNEliza Coupe490.0
My Date with DrewPG2004.090.01Jon Gunn16.0John August86.0
2NaNNaNBrian Herzlinger23.0
3NaNNaNJon Gunn16.0
\n", "

14748 rows × 4 columns

\n", "
" ], "text/plain": [ " director \\\n", "title rating year duration num \n", "Avatar PG-13 2009.0 178.0 1 James Cameron \n", " 2 NaN \n", " 3 NaN \n", "Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 Gore Verbinski \n", " 2 NaN \n", "... ... \n", "Shanghai Calling PG-13 2012.0 100.0 2 NaN \n", " 3 NaN \n", "My Date with Drew PG 2004.0 90.0 1 Jon Gunn \n", " 2 NaN \n", " 3 NaN \n", "\n", " director_fb_likes \\\n", "title rating year duration num \n", "Avatar PG-13 2009.0 178.0 1 0.0 \n", " 2 NaN \n", " 3 NaN \n", "Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 563.0 \n", " 2 NaN \n", "... ... \n", "Shanghai Calling PG-13 2012.0 100.0 2 NaN \n", " 3 NaN \n", "My Date with Drew PG 2004.0 90.0 1 16.0 \n", " 2 NaN \n", " 3 NaN \n", "\n", " actor \\\n", "title rating year duration num \n", "Avatar PG-13 2009.0 178.0 1 CCH Pounder \n", " 2 Joel David Moore \n", " 3 Wes Studi \n", "Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 Johnny Depp \n", " 2 Orlando Bloom \n", "... ... \n", "Shanghai Calling PG-13 2012.0 100.0 2 Daniel Henney \n", " 3 Eliza Coupe \n", "My Date with Drew PG 2004.0 90.0 1 John August \n", " 2 Brian Herzlinger \n", " 3 Jon Gunn \n", "\n", " actor_fb_likes \n", "title rating year duration num \n", "Avatar PG-13 2009.0 178.0 1 1000.0 \n", " 2 936.0 \n", " 3 855.0 \n", "Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 40000.0 \n", " 2 5000.0 \n", "... ... \n", "Shanghai Calling PG-13 2012.0 100.0 2 719.0 \n", " 3 490.0 \n", "My Date with Drew PG 2004.0 90.0 1 86.0 \n", " 2 23.0 \n", " 3 16.0 \n", "\n", "[14748 rows x 4 columns]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wl = pd.wide_to_long(\n", " movies,\n", " stubnames=[\"director\", \"director_fb_likes\", \"actor\", \"actor_fb_likes\"],\n", " i=[\"title\", \"rating\", \"year\", \"duration\"],\n", " j=\"num\",\n", " sep=\"_\",\n", ")\n", "wl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A more efficient approach than `pd.wide_to_long`, using `pd.stack`:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
directordirector_fb_likesactoractor_fb_likes
titleratingyeardurationnum
AvatarPG-132009.0178.01James Cameron0.0CCH Pounder1000.0
2NaNNaNJoel David Moore936.0
3NaNNaNWes Studi855.0
Pirates of the Caribbean: At World's EndPG-132007.0169.01Gore Verbinski563.0Johnny Depp40000.0
2NaNNaNOrlando Bloom5000.0
...........................
Shanghai CallingPG-132012.0100.02NaNNaNDaniel Henney719.0
3NaNNaNEliza Coupe490.0
My Date with DrewPG2004.090.01Jon Gunn16.0John August86.0
2NaNNaNBrian Herzlinger23.0
3NaNNaNJon Gunn16.0
\n", "

14748 rows × 4 columns

\n", "
" ], "text/plain": [ " director \\\n", "title rating year duration num \n", "Avatar PG-13 2009.0 178.0 1 James Cameron \n", " 2 NaN \n", " 3 NaN \n", "Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 Gore Verbinski \n", " 2 NaN \n", "... ... \n", "Shanghai Calling PG-13 2012.0 100.0 2 NaN \n", " 3 NaN \n", "My Date with Drew PG 2004.0 90.0 1 Jon Gunn \n", " 2 NaN \n", " 3 NaN \n", "\n", " director_fb_likes \\\n", "title rating year duration num \n", "Avatar PG-13 2009.0 178.0 1 0.0 \n", " 2 NaN \n", " 3 NaN \n", "Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 563.0 \n", " 2 NaN \n", "... ... \n", "Shanghai Calling PG-13 2012.0 100.0 2 NaN \n", " 3 NaN \n", "My Date with Drew PG 2004.0 90.0 1 16.0 \n", " 2 NaN \n", " 3 NaN \n", "\n", " actor \\\n", "title rating year duration num \n", "Avatar PG-13 2009.0 178.0 1 CCH Pounder \n", " 2 Joel David Moore \n", " 3 Wes Studi \n", "Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 Johnny Depp \n", " 2 Orlando Bloom \n", "... ... \n", "Shanghai Calling PG-13 2012.0 100.0 2 Daniel Henney \n", " 3 Eliza Coupe \n", "My Date with Drew PG 2004.0 90.0 1 John August \n", " 2 Brian Herzlinger \n", " 3 Jon Gunn \n", "\n", " actor_fb_likes \n", "title rating year duration num \n", "Avatar PG-13 2009.0 178.0 1 1000.0 \n", " 2 936.0 \n", " 3 855.0 \n", "Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 1 40000.0 \n", " 2 5000.0 \n", "... ... \n", "Shanghai Calling PG-13 2012.0 100.0 2 719.0 \n", " 3 490.0 \n", "My Date with Drew PG 2004.0 90.0 1 86.0 \n", " 2 23.0 \n", " 3 16.0 \n", "\n", "[14748 rows x 4 columns]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index = [\"title\", \"rating\", \"year\", \"duration\"]\n", "vid = movies.set_index(index)\n", "vid.columns = vid.columns.str.rsplit(\"_\", n=1, expand=True)\n", "vid.columns.names = [None, \"num\"]\n", "vid = vid.stack(level=\"num\", future_stack=True)\n", "vid" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titleratingyeardurationdirectordirector_fb_likesactoractor_fb_likes
0AvatarPG-132009.0178.0James Cameron0.0CCH Pounder1000.0
1AvatarPG-132009.0178.0NaNNaNJoel David Moore936.0
2AvatarPG-132009.0178.0NaNNaNWes Studi855.0
3Pirates of the Caribbean: At World's EndPG-132007.0169.0Gore Verbinski563.0Johnny Depp40000.0
4Pirates of the Caribbean: At World's EndPG-132007.0169.0NaNNaNOrlando Bloom5000.0
...........................
14743Shanghai CallingPG-132012.0100.0NaNNaNDaniel Henney719.0
14744Shanghai CallingPG-132012.0100.0NaNNaNEliza Coupe490.0
14745My Date with DrewPG2004.090.0Jon Gunn16.0John August86.0
14746My Date with DrewPG2004.090.0NaNNaNBrian Herzlinger23.0
14747My Date with DrewPG2004.090.0NaNNaNJon Gunn16.0
\n", "

14748 rows × 8 columns

\n", "
" ], "text/plain": [ " title rating year duration \\\n", "0 Avatar PG-13 2009.0 178.0 \n", "1 Avatar PG-13 2009.0 178.0 \n", "2 Avatar PG-13 2009.0 178.0 \n", "3 Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 \n", "4 Pirates of the Caribbean: At World's End PG-13 2007.0 169.0 \n", "... ... ... ... ... \n", "14743 Shanghai Calling PG-13 2012.0 100.0 \n", "14744 Shanghai Calling PG-13 2012.0 100.0 \n", "14745 My Date with Drew PG 2004.0 90.0 \n", "14746 My Date with Drew PG 2004.0 90.0 \n", "14747 My Date with Drew PG 2004.0 90.0 \n", "\n", " director director_fb_likes actor actor_fb_likes \n", "0 James Cameron 0.0 CCH Pounder 1000.0 \n", "1 NaN NaN Joel David Moore 936.0 \n", "2 NaN NaN Wes Studi 855.0 \n", "3 Gore Verbinski 563.0 Johnny Depp 40000.0 \n", "4 NaN NaN Orlando Bloom 5000.0 \n", "... ... ... ... ... \n", "14743 NaN NaN Daniel Henney 719.0 \n", "14744 NaN NaN Eliza Coupe 490.0 \n", "14745 Jon Gunn 16.0 John August 86.0 \n", "14746 NaN NaN Brian Herzlinger 23.0 \n", "14747 NaN NaN Jon Gunn 16.0 \n", "\n", "[14748 rows x 8 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out = movies.pivot_longer(\n", " index=slice(\"title\", \"duration\"),\n", " names_to=\".value\",\n", " names_pattern=\"(.+)_\\\\d\",\n", " sort_by_appearance=True,\n", ")\n", "\n", "out" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wl = wl.droplevel(\"num\").reset_index()\n", "wl.equals(out)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vid = (\n", " vid.droplevel(\"num\")\n", " .loc[:, [\"director\", \"director_fb_likes\", \"actor\", \"actor_fb_likes\"]]\n", " .reset_index()\n", ")\n", "\n", "wl.equals(vid)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vid.equals(out)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "52.7 ms ± 784 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "%%timeit \n", "pd.wide_to_long(\n", " movies, \n", " stubnames=['director', 'director_fb_likes', 'actor', 'actor_fb_likes'], \n", " i = ['title', 'rating', 'year', 'duration'], \n", " j = 'num', \n", " sep='_')" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5.19 ms ± 122 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%%timeit\n", "index = ['title', 'rating', 'year', 'duration']\n", "vid = movies.set_index(index)\n", "vid.columns = vid.columns.str.rsplit(\"_\", n = 1, expand = True)\n", "vid.stack(level = 1, future_stack=True)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.79 ms ± 53.9 μs per loop (mean ± std. dev. 
of 7 runs, 1,000 loops each)\n" ] } ], "source": [ "%%timeit\n", "(movies\n", ".pivot_longer(\n", " index = slice('title', 'duration'),\n", " names_to = \".value\",\n", " names_pattern = \"(.+)_\\\\d\",\n", " sort_by_appearance = True)\n", ")\n" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.93 ms ± 49.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%%timeit\n", "movies.pivot_longer(\n", " index = slice('title', 'duration'),\n", " names_to = ['director', 'director_fb_likes', 'actor', 'actor_fb_likes'],\n", " names_pattern = ['director_\\\\d$', 'director_fb_likes', 'actor_\\\\d$', 'actor.+'],\n", " sort_by_appearance = True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your mileage may vary with these speed tests." ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-07-30T10:57:18.265994Z", "iopub.status.busy": "2020-07-30T10:57:18.265751Z", "iopub.status.idle": "2020-07-30T10:57:18.283104Z", "shell.execute_reply": "2020-07-30T10:57:18.280684Z", "shell.execute_reply.started": "2020-07-30T10:57:18.265970Z" } }, "source": [ "## **Summary**\n", "\n", "This blog post shows various ways to reshape data into long form, using `pivot_longer`, based on some common patterns. Once the pattern is identified, use the appropriate syntax for `pivot_longer`. It is meant to be simple, straightforward, and efficient. For more control, have a look at the [pivot_longer_spec](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot.pivot_longer_spec)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comments\n", "" ] } ], "metadata": { "jupytext": { "formats": "ipynb,md" }, "kernelspec": { "display_name": "blogger", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 4 }