{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Data Wrangling with Python Datatable - Selecting Columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This article highlights various ways to select columns in python datatable. The examples used here are based off the excellent [article](https://suzan.rbind.io/2018/01/dplyr-tutorial-1/) by [Susan Baert](https://twitter.com/SuzanBaert).\n", "\n", "The data file can be accessed [here](https://github.com/samukweku/data-wrangling-blog/raw/master/_notebooks/Data_files/msleep.txt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Selecting Columns**" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-10-31T01:26:23.474681Z", "iopub.status.busy": "2020-10-31T01:26:23.474464Z", "iopub.status.idle": "2020-10-31T01:26:23.477682Z", "shell.execute_reply": "2020-10-31T01:26:23.477047Z", "shell.execute_reply.started": "2020-10-31T01:26:23.474658Z" } }, "source": [ "### The Basics" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenusvoreorderconservationsleep_totalsleep_remsleep_cycleawakebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc12.1NANA11.9NA50
1Owl monkeyAotusomniPrimatesNA171.8NA70.01550.48
2Mountain beaverAplodontiaherbiRodentiant14.42.4NA9.6NA1.35
3Greater short-tailed shrewBlarinaomniSoricomorphalc14.92.30.1333339.10.000290.019
4CowBosherbiArtiodactyladomesticated40.70.666667200.423600
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from datatable import dt, f, ltype, stype\n", "import re\n", "\n", "file_path='Data_files/msleep.txt'\n", "DT = dt.fread(file_path)\n", "DT.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can select columns by name or position in the `j` section:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
genus
▪▪▪▪
0Acinonyx
1Aotus
2Aplodontia
3Blarina
4Bos
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, 'genus'].head(5)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
genus
▪▪▪▪
0Acinonyx
1Aotus
2Aplodontia
3Blarina
4Bos
5Bradypus
6Callorhinus
7Calomys
8Canis
9Capreolus
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, 1].head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
genus
▪▪▪▪
0Acinonyx
1Aotus
2Aplodontia
3Blarina
4Bos
5Bradypus
6Callorhinus
7Calomys
8Canis
9Capreolus
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, -10].head()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are selecting a single column, you can pass it into the brackets without specifying the `i` section:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
genus
▪▪▪▪
0Acinonyx
1Aotus
2Aplodontia
3Blarina
4Bos
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT['genus'].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the rest of this article, I will be focusing on column selection by name." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can select columns by passing a list/tuple of the column names:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenussleep_totalawake
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyx12.111.9
1Owl monkeyAotus177
2Mountain beaverAplodontia14.49.6
3Greater short-tailed shrewBlarina14.99.1
4CowBos420
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns_to_select = [\"name\", \"genus\", \"sleep_total\", \"awake\"]\n", "DT[:, columns_to_select].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can pass a list/tuple of booleans:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenussleep_totalsleep_cycleawake
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyx12.1NA11.9
1Owl monkeyAotus17NA7
2Mountain beaverAplodontia14.4NA9.6
3Greater short-tailed shrewBlarina14.90.1333339.1
4CowBos40.66666720
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns_to_select = [True, True, False, False, False, True,False,True,True,False,False]\n", "DT[:, columns_to_select].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can select chunks of columns using python's [slice](https://docs.python.org/3/library/functions.html#slice) syntax or via the ``start:end`` shortcut:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenusvoreorder
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivora
1Owl monkeyAotusomniPrimates
2Mountain beaverAplodontiaherbiRodentia
3Greater short-tailed shrewBlarinaomniSoricomorpha
4CowBosherbiArtiodactyla
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, slice(\"name\", \"order\")].head(5)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenusvoreorder
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivora
1Owl monkeyAotusomniPrimates
2Mountain beaverAplodontiaherbiRodentia
3Greater short-tailed shrewBlarinaomniSoricomorpha
4CowBosherbiArtiodactyla
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, \"name\" : \"order\"].head(5)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multiple chunk selection is possible:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenusvoreordersleep_totalsleep_remsleep_cycle
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivora12.1NANA
1Owl monkeyAotusomniPrimates171.8NA
2Mountain beaverAplodontiaherbiRodentia14.42.4NA
3Greater short-tailed shrewBlarinaomniSoricomorpha14.92.30.133333
4CowBosherbiArtiodactyla40.70.666667
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns_to_select = [slice(\"name\", \"order\"), slice(\"sleep_total\", \"sleep_cycle\")]\n", "DT[:, columns_to_select].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the shortcut notation, for multiple selections, it has to be prefixed with datatable's [f](https://datatable.readthedocs.io/en/latest/manual/f-expressions.html) symbol:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenusvoreordersleep_totalsleep_remsleep_cycle
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivora12.1NANA
1Owl monkeyAotusomniPrimates171.8NA
2Mountain beaverAplodontiaherbiRodentia14.42.4NA
3Greater short-tailed shrewBlarinaomniSoricomorpha14.92.30.133333
4CowBosherbiArtiodactyla40.70.666667
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns_to_select = [f[\"name\" : \"order\", \"sleep_total\" : \"sleep_cycle\"]]\n", "DT[:, columns_to_select].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To deselect/drop columns you can use the [remove](https://datatable.readthedocs.io/en/latest/manual/f-expressions.html#modifying-a-columnset) function:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenusvoreorderbrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoraNA50
1Owl monkeyAotusomniPrimates0.01550.48
2Mountain beaverAplodontiaherbiRodentiaNA1.35
3Greater short-tailed shrewBlarinaomniSoricomorpha0.000290.019
4CowBosherbiArtiodactyla0.423600
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns_to_remove = [f[\"sleep_total\" : \"awake\", \"conservation\"]]\n", "DT[:, f[:].remove(columns_to_remove)].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can deselect a whole chunk, and then re-add a column again; this combines the [remove](https://datatable.readthedocs.io/en/latest/manual/f-expressions.html#modifying-a-columnset) and [extend](https://datatable.readthedocs.io/en/latest/manual/f-expressions.html#modifying-a-columnset) functions:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
brainwtbodywtconservation
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0NA50lc
10.01550.48NA
2NA1.35nt
30.000290.019lc
40.423600domesticated
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, f[:].remove(f[\"name\" : \"awake\"]).extend(f[\"conservation\"])].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting Columns based on Partial Names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use python's string functions to filter for columns with partial matching:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sleep_totalsleep_remsleep_cycle
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.1NANA
1171.8NA
214.42.4NA
314.92.30.133333
440.70.666667
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns_to_select = [name.startswith(\"sleep\") for name in DT.names]\n", "DT[:, columns_to_select].head(5)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sleep_totalsleep_remsleep_cyclebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.1NANANA50
1171.8NA0.01550.48
214.42.4NANA1.35
314.92.30.1333330.000290.019
440.70.6666670.423600
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns_to_select = [\"eep\" in name or name.endswith(\"wt\") for name in DT.names]\n", "DT[:, columns_to_select].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting Columns based on Regex" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python's [re](https://docs.python.org/3/library/re.html) module can be used to select columns based on a regular expression:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
orderconservation
▪▪▪▪▪▪▪▪
0Carnivoralc
1PrimatesNA
2Rodentiant
3Soricomorphalc
4Artiodactyladomesticated
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this returns a list of booleans\n", "columns_to_select = [True if re.search(r\"o.+er\", name) else False for name in DT.names]\n", "DT[:, columns_to_select].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting columns by their data type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can pass a data type in the ``j`` section:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenusvoreorderconservation
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc
1Owl monkeyAotusomniPrimatesNA
2Mountain beaverAplodontiaherbiRodentiant
3Greater short-tailed shrewBlarinaomniSoricomorphalc
4CowBosherbiArtiodactyladomesticated
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, str].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can pass a list of data types:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sleep_totalsleep_remsleep_cycleawakebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.1NANA11.9NA50
1171.8NA70.01550.48
214.42.4NA9.6NA1.35
314.92.30.1333339.10.000290.019
440.70.666667200.423600
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, [int, float]].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also pass datatable's [stype](https://datatable.readthedocs.io/en/latest/api/stype.html#) or [ltype](https://datatable.readthedocs.io/en/latest/api/ltype.html#) data types:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenusvoreorderconservation
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc
1Owl monkeyAotusomniPrimatesNA
2Mountain beaverAplodontiaherbiRodentiant
3Greater short-tailed shrewBlarinaomniSoricomorphalc
4CowBosherbiArtiodactyladomesticated
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, ltype.str].head(5)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sleep_totalsleep_remsleep_cycleawakebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.1NANA11.9NA50
1171.8NA70.01550.48
214.42.4NA9.6NA1.35
314.92.30.1333339.10.000290.019
440.70.666667200.423600
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, stype.float64].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can remove columns based on their data type:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenusvoreorderconservation
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc
1Owl monkeyAotusomniPrimatesNA
2Mountain beaverAplodontiaherbiRodentiant
3Greater short-tailed shrewBlarinaomniSoricomorphalc
4CowBosherbiArtiodactyladomesticated
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns_to_remove = [f[int, float]]\n", "DT[:, f[:].remove(columns_to_remove)].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An alternative is to preselect the columns you intend to keep:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenusvoreorderconservation
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc
1Owl monkeyAotusomniPrimatesNA
2Mountain beaverAplodontiaherbiRodentiant
3Greater short-tailed shrewBlarinaomniSoricomorphalc
4CowBosherbiArtiodactyladomesticated
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# creates a list of booleans\n", "columns_to_select = [\n", " dtype not in (ltype.int, ltype.real)\n", " for _, dtype in zip(DT.names, DT.ltypes) \n", "]\n", "\n", "DT[:, columns_to_select].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You could also iterate through the frame and check each column's type, before recombining with [cbind](https://datatable.readthedocs.io/en/latest/api/dt/cbind.html):" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenusvoreorderconservation
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc
1Owl monkeyAotusomniPrimatesNA
2Mountain beaverAplodontiaherbiRodentiant
3Greater short-tailed shrewBlarinaomniSoricomorphalc
4CowBosherbiArtiodactyladomesticated
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matching_frames = [frame for frame in DT if frame.ltypes[0] not in (ltype.real, ltype.int)]\n", "dt.cbind(matching_frames).head(5)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each column in a frame is treated as a frame, allowing for the list comprehension above.\n", "\n", "You could also pass the `matching frames` to the `j` section of `DT`:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenusvoreorderconservation
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc
1Owl monkeyAotusomniPrimatesNA
2Mountain beaverAplodontiaherbiRodentiant
3Greater short-tailed shrewBlarinaomniSoricomorphalc
4CowBosherbiArtiodactyladomesticated
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, matching_frames].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting columns by logical expressions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ideas expressed in the previous section allows for more nifty column selection. \n", "\n", "Say we wish to select columns that are numeric, and have a mean greater than 10:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sleep_totalawakebodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.111.950
11770.48
214.49.61.35
314.99.10.019
4420600
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# returns a list of booleans\n", "columns_to_select = [\n", " ltype in (ltype.real, ltype.int) and DT[name].mean()[0, 0] > 10\n", " for name, ltype in zip(DT.names, DT.ltypes)\n", "]\n", "DT[:, columns_to_select].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code above preselects the columns before passing it to datatable. Note the use of `[0,0]` to return a scalar value; this allows us to compare with the scalar value `10`.\n", "\n", "Alternatively, in the list comprehension, instead of a list of booleans, you could return the column names:\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sleep_totalawakebodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.111.950
11770.48
214.49.61.35
314.99.10.019
4420600
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns_to_select = [\n", " name\n", " for name, ltype in zip(DT.names, DT.ltypes)\n", " if ltype in (ltype.real, ltype.int) and DT[name].mean()[0, 0] > 10\n", "]\n", "DT[:, columns_to_select].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "You could also iterate through the frame in a list comprehension and check each column, before recombining with [cbind](https://datatable.readthedocs.io/en/latest/api/dt/cbind.html):" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sleep_totalawakebodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.111.950
11770.48
214.49.61.35
314.99.10.019
4420600
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matching_frames = [frame for frame in DT \n", " if frame.ltypes[0] in (ltype.int, ltype.real) \n", " and frame.mean()[0,0] > 10]\n", "dt.cbind(matching_frames).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of recombining with [cbind](https://datatable.readthedocs.io/en/latest/api/dt/cbind.html), you could pass the `matching_frames` to the ``j`` section:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sleep_totalawakebodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.111.950
11770.48
214.49.61.35
314.99.10.019
4420600
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, matching_frames].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at another example, where we select only columns where the number of distinct values is less than 10:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
voreconservation
▪▪▪▪▪▪▪▪
0carnilc
1omniNA
2herbint
3omnilc
4herbidomesticated
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# returns a list of booleans\n", "columns_to_select = [frame.nunique()[0, 0] < 10 for frame in DT]\n", "DT[:, columns_to_select].head(5)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
voreconservation
▪▪▪▪▪▪▪▪
0carnilc
1omniNA
2herbint
3omnilc
4herbidomesticated
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matching_frames = [frame for frame in DT if frame.nunique()[0,0] < 10]\n", "dt.cbind(matching_frames).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or pass `matching_frames` to the `j` section in `DT`:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
voreconservation
▪▪▪▪▪▪▪▪
0carnilc
1omniNA
2herbint
3omnilc
4herbidomesticated
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, matching_frames].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Reordering Columns**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can select columns in the order that you want:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
conservationsleep_totalname
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0lc12.1Cheetah
1NA17Owl monkey
2nt14.4Mountain beaver
3lc14.9Greater short-tailed shrew
4domesticated4Cow
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns_to_select = ['conservation', 'sleep_total', 'name']\n", "DT[:, columns_to_select].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To move some columns to the front, you could write a function to cover that:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "def move_to_the_front(frame, front_columns):\n", " column_names = list(frame.names)\n", " for name in front_columns:\n", " column_names.remove(name)\n", " front_columns.extend(column_names)\n", " return front_columns" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
conservationsleep_totalnamegenusvoreordersleep_remsleep_cycleawakebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0lc12.1CheetahAcinonyxcarniCarnivoraNANA11.9NA50
1NA17Owl monkeyAotusomniPrimates1.8NA70.01550.48
2nt14.4Mountain beaverAplodontiaherbiRodentia2.4NA9.6NA1.35
3lc14.9Greater short-tailed shrewBlarinaomniSoricomorpha2.30.1333339.10.000290.019
4domesticated4CowBosherbiArtiodactyla0.70.666667200.423600
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, move_to_the_front(DT, ['conservation', 'sleep_total'])].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Column Names**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Renaming Columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Columns with new names can be created within the `j` section by passing a dictionary:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sleep_totalanimalextinction_threat
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
012.1Cheetahlc
117Owl monkeyNA
214.4Mountain beavernt
314.9Greater short-tailed shrewlc
44Cowdomesticated
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_names = {\"animal\": f.name, \"extinction_threat\": f.conservation}\n", "DT[:, f.sleep_total.extend(new_names)].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also rename the columns via a dictionary that maps the old column name to the new column name, and assign it to ``DT.names``:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
animalsleep_totalextinction_threat
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0Cheetah12.1lc
1Owl monkey17NA
2Mountain beaver14.4nt
3Greater short-tailed shrew14.9lc
4Cow4domesticated
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT_copy = DT.copy()\n", "DT_copy.names = {\"name\": \"animal\", \"conservation\": \"extinction_threat\"}\n", "DT_copy[:, ['animal', 'sleep_total', 'extinction_threat']].head(5)\n" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
animalgenusvoreorderextinction_threatsleep_totalsleep_remsleep_cycleawakebrainwtbodywt
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc12.1NANA11.9NA50
1Owl monkeyAotusomniPrimatesNA171.8NA70.01550.48
2Mountain beaverAplodontiaherbiRodentiant14.42.4NA9.6NA1.35
3Greater short-tailed shrewBlarinaomniSoricomorphalc14.92.30.1333339.10.000290.019
4CowBosherbiArtiodactyladomesticated40.70.666667200.423600
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT_copy.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reformatting all Column Names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use python's string functions to reformat column names.\n", "\n", "Let's convert all column names to uppercase:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NAMEGENUSVOREORDERCONSERVATIONSLEEP_TOTALSLEEP_REMSLEEP_CYCLEAWAKEBRAINWTBODYWT
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0CheetahAcinonyxcarniCarnivoralc12.1NANA11.9NA50
1Owl monkeyAotusomniPrimatesNA171.8NA70.01550.48
2Mountain beaverAplodontiaherbiRodentiant14.42.4NA9.6NA1.35
3Greater short-tailed shrewBlarinaomniSoricomorphalc14.92.30.1333339.10.000290.019
4CowBosherbiArtiodactyladomesticated40.70.666667200.423600
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT_copy.names = [name.upper() for name in DT.names] # or list(map(str.upper, DT.names))\n", "DT_copy.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Resources: \n", "\n", "- [datatable docs](https://datatable.readthedocs.io/en/latest/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comments\n", "" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 4 }