{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Selecting and Grouping Data with Python Datatable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Datatable**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to [pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) or [SFrame](https://github.com/apple/turicreate); however we put specific emphasis on speed and big data support. As the name suggests, the package is closely related to R's [data.table](https://github.com/Rdatatable/data.table) and attempts to mimic its core algorithms and API." ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-26T01:27:05.586391Z", "iopub.status.busy": "2020-06-26T01:27:05.586173Z", "iopub.status.idle": "2020-06-26T01:27:05.592158Z", "shell.execute_reply": "2020-06-26T01:27:05.591344Z", "shell.execute_reply.started": "2020-06-26T01:27:05.586367Z" } }, "source": [ "I like ``datatable`` primarily because of its simple syntax. Yes, there are significant speed gains, which is the primary aim of the package, but the simplicity is compelling. \n", "

Do note that ``datatable`` is in active development - more features will be added. Check out the [documentation](https://datatable.readthedocs.io/en/latest/?badge=latest) for more information. \n", "

This introduction focuses on how to select rows, select and perform calculations on columns, and perform aggregations by group. \n" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-23T04:49:52.844408Z", "iopub.status.busy": "2020-06-23T04:49:52.844179Z", "iopub.status.idle": "2020-06-23T04:49:52.856885Z", "shell.execute_reply": "2020-06-23T04:49:52.856044Z", "shell.execute_reply.started": "2020-06-23T04:49:52.844385Z" } }, "source": [ "### __Datatable syntax__\n", "\n", "\n", "![datatable_syntax.png](Images/datatable_syntax.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Short syntax right? Let's break it down : \n", "
- ``DT`` refers to the data frame. This is the fundamental building block in ``datatable``. It is a 2-dimensional array with rows and columns, similar to an Excel/SQL table.\n", "
- The ``i`` part is used for subsetting on rows and shares a similar concept with SQL's WHERE clause.\n", "
- The ``j`` part is used to select columns and act on them. \n", "
- `...` are for extra modifiers, e.g grouping, sorting, joining, etc. \n", "

Let's dive into some examples to see how ``datatable`` works. Our data will be the [iris](https://notebooks.azure.com/kvp15comp/projects/FDPMLDS/html/iris.csv) dataset." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#import datatable\n", "from datatable import dt, f, by\n", "\n", "#read in data\n", "DT = dt.fread(\"Data_files/iris.csv\")\n", "\n", "DT.head()" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T00:59:50.199147Z", "iopub.status.busy": "2020-06-24T00:59:50.198788Z", "iopub.status.idle": "2020-06-24T00:59:50.207032Z", "shell.execute_reply": "2020-06-24T00:59:50.206313Z", "shell.execute_reply.started": "2020-06-24T00:59:50.199106Z" } }, "source": [ "**Notes :**\n", "- ``dt`` refers to the datatable module.\n", "- All computations occur within the ``[]`` bracket.\n", "- ``dt.fread`` is a powerful and very fast function for reading in various text files, zip archives, and urls. It can even read in data from the command line.\n", "- ``by`` is a function for grouping.\n", "- ``f`` is a variable that provides a convenient way to reference the data frame's column within the square brackets. It is really useful when performing computations or creating expressions.\n", "- In a jupyter notebook, the tab colours for the columns indicate various data types - ``blue`` is for float column, ``green`` is for integer column, ``red`` is for string column, ``yellow`` is for boolean, while ``black`` is for object column." ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T02:09:49.680410Z", "iopub.status.busy": "2020-06-24T02:09:49.680177Z", "iopub.status.idle": "2020-06-24T02:09:49.684525Z", "shell.execute_reply": "2020-06-24T02:09:49.683668Z", "shell.execute_reply.started": "2020-06-24T02:09:49.680384Z" } }, "source": [ "Some basic information about the data frame : " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(150, 5)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#shape of data\n", "DT.shape" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#column names\n", "DT.names" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(stype.float64, stype.float64, stype.float64, stype.float64, stype.str32)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#data types of the columns\n", "DT.stypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __The i part - Subset Rows__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Select the first three rows**" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#We can use python's slicing syntax to get the rows. \n", "DT[:3, :]" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T01:24:00.010598Z", "iopub.status.busy": "2020-06-24T01:24:00.010346Z", "iopub.status.idle": "2020-06-24T01:24:00.018129Z", "shell.execute_reply": "2020-06-24T01:24:00.017275Z", "shell.execute_reply.started": "2020-06-24T01:24:00.010572Z" } }, "source": [ "**Notes :**\n", "- No new syntax, just our knowledge of python's sequence indexing.\n", "- Note also that nothing was selected in the columns. Passing ``None`` or `...` in the j section will work as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Select the 2nd, 4th and 8th rows**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
04.931.40.2setosa
14.63.11.50.2setosa
253.41.50.2setosa
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Python has a zero based indexing notation,\n", "#so, we will pass in a list [1,3,7]\n", "\n", "DT[[1,3,7], :]" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T01:25:07.520476Z", "iopub.status.busy": "2020-06-24T01:25:07.520247Z", "iopub.status.idle": "2020-06-24T01:25:07.531674Z", "shell.execute_reply": "2020-06-24T01:25:07.530705Z", "shell.execute_reply.started": "2020-06-24T01:25:07.520453Z" } }, "source": [ "#### **Find rows where species == \"versicolor\"**" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T01:26:11.318536Z", "iopub.status.busy": "2020-06-24T01:26:11.318109Z", "iopub.status.idle": "2020-06-24T01:26:11.325278Z", "shell.execute_reply": "2020-06-24T01:26:11.324480Z", "shell.execute_reply.started": "2020-06-24T01:26:11.318490Z" } }, "source": [ "This is an expression. How do we talk to the dataframe to filter for only rows where the ``species`` column is equal to ``versicolor``? Through the ``f`` variable : " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
073.24.71.4versicolor
16.43.24.51.5versicolor
26.93.14.91.5versicolor
35.52.341.3versicolor
46.52.84.61.5versicolor
55.72.84.51.3versicolor
66.33.34.71.6versicolor
74.92.43.31versicolor
86.62.94.61.3versicolor
95.22.73.91.4versicolor
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = DT[f.species == \"versicolor\", :]\n", "result.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notes :**\n", "- We create a boolean expression, using the ``f`` symbol, to select rows that match the condition." ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T01:25:07.520476Z", "iopub.status.busy": "2020-06-24T01:25:07.520247Z", "iopub.status.idle": "2020-06-24T01:25:07.531674Z", "shell.execute_reply": "2020-06-24T01:25:07.530705Z", "shell.execute_reply.started": "2020-06-24T01:25:07.520453Z" } }, "source": [ "#### **Find rows where species == \"versicolor\" and sepal_length == 7**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
073.24.71.4versicolor
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[(f.species == \"versicolor\") & (f.sepal_length == 7), :]" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T01:31:51.142075Z", "iopub.status.busy": "2020-06-24T01:31:51.141749Z", "iopub.status.idle": "2020-06-24T01:31:51.147868Z", "shell.execute_reply": "2020-06-24T01:31:51.146730Z", "shell.execute_reply.started": "2020-06-24T01:31:51.142038Z" } }, "source": [ "**Notes :**\n", "- Again, a filter expression is needed -> we use the ``f`` symbol to access the columns and create our expression. \n", "- Note also that we have two conditions; as such, each condition is wrapped in parentheses to ensure the operator precedence.\n", "- Operator ``&`` for ``and``, ``|`` for ``or``.\n", "- You can chain as many conditions as needed within the ``i`` section." ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T01:36:56.575216Z", "iopub.status.busy": "2020-06-24T01:36:56.574806Z", "iopub.status.idle": "2020-06-24T01:36:56.580729Z", "shell.execute_reply": "2020-06-24T01:36:56.579990Z", "shell.execute_reply.started": "2020-06-24T01:36:56.575170Z" } }, "source": [ "### __The j part - Columns__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Select species, petal_width and petal_length columns**" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciespetal_widthpetal_length
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa0.21.4
1setosa0.21.4
2setosa0.21.3
3setosa0.21.5
4setosa0.21.4
5setosa0.41.7
6setosa0.31.4
7setosa0.21.5
8setosa0.21.4
9setosa0.11.5
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Simply pass a list of the names in the j section \n", "result = DT[:, [\"species\",\"petal_width\",\"petal_length\"]]\n", "\n", "result.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notes :**\n", "- Note that we passed ``:`` in the ``i`` section to indicate that all rows will be selected. ``None`` or ``...`` works fine as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Select the last two columns**" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
petal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪
00.2setosa
10.2setosa
20.2setosa
30.2setosa
40.2setosa
50.4setosa
60.3setosa
70.2setosa
80.2setosa
90.1setosa
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Again, we use python's indexing syntax\n", "result = DT[:, -2:]\n", "\n", "result.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Select only columns whose names end with ``length``**" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthpetal_length
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.11.4
14.91.4
24.71.3
34.61.5
451.4
55.41.7
64.61.4
751.5
84.41.4
94.91.5
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = DT[:, [col.endswith(\"length\") for col in DT.names]]\n", "\n", "result.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notes:**\n", "- For this, we pass a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) to extract the matches. Again, we are using familar python constructs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Calculate the mean value of sepal_length**" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_length
▪▪▪▪▪▪▪▪
05.84333
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, f.sepal_length.mean()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Here, we have to compute a value, and as such, the ``f`` symbol comes into play. \n", "- To get the average, we use the ``mean`` function from the datatable module." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**REMEMBER**: Any time you want to compute a value or create an expression, make use of the ``f`` symbol." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TIP** : We can export the column names and use them, instead of the ``f`` symbol :" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_length
▪▪▪▪▪▪▪▪
05.84333
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sepal_length, *_ = DT.export_names()\n", "\n", "#calculate the mean value of sepal_length\n", "DT[:, sepal_length.mean()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also export only the names we need :" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
petal_lengthpetal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
03.7581.19933
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "petal_width, petal_length = DT[:, [2,3]].export_names()\n", "\n", "#calculate mean value of petal_width and petal_length\n", "DT[:, dt.mean([petal_width.mean(), petal_length])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notes:**\n", "- To get aggregates of more than one column, pass a list of column names prefixed with the `f` symbol to j.\n", "- Alternatively, you can pass a dictionary of key - value pairs, where the keys will be the new column names, and the values are the computed columns.\n", "\n", "For the rest of the tutorial, I will continue using ``f-expressions``, as they are convenient to use, without having to explicitly export column names." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Select only species and petal_length columns for rows where the petal_length is greater than 1.5**" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciespetal_length
▪▪▪▪▪▪▪▪▪▪▪▪
0setosa1.7
1setosa1.6
2setosa1.7
3setosa1.7
4setosa1.7
5setosa1.9
6setosa1.6
7setosa1.6
8setosa1.6
9setosa1.6
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = DT[f.petal_length > 1.5, [\"species\", \"petal_length\"]]\n", "\n", "result.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Select only string columns**" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
species
▪▪▪▪
0setosa
1setosa
2setosa
3setosa
4setosa
5setosa
6setosa
7setosa
8setosa
9setosa
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = DT[:, f[str]]\n", "result.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notes:**\n", "- The above concept is applicable for any data type that we wish to include or exclude from the dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Convert petal_length from float to integer**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Option 1. Direct assignment : " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "stype.float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = DT.copy()\n", "\n", "result[\"petal_length\"].stype" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "stype.int32" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#change column type\n", "result[\"petal_length\"] = dt.int32\n", "\n", "#check new data type\n", "result[:, f.petal_length].stype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Option 2. Use the ``update`` method. The process is done in-place; no assignment is needed :" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "stype.float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res = DT.copy()\n", "\n", "res[\"petal_length\"].stype" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "#update is an inplace operation\n", "res[:, dt.update(petal_length = dt.int32(f.petal_length))]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "stype.int32" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#check data type after update \n", "res['petal_length'].stype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Option 3. Use the ``as_type`` method. This is the recommended way to change the column type." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "stype.int32" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, f.petal_length.as_type(dt.int32)].stype" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T08:18:43.028387Z", "iopub.status.busy": "2020-06-24T08:18:43.028155Z", "iopub.status.idle": "2020-06-24T08:18:43.032364Z", "shell.execute_reply": "2020-06-24T08:18:43.031571Z", "shell.execute_reply.started": "2020-06-24T08:18:43.028365Z" } }, "source": [ "**TIP** : The ``update`` method allows creation of new columns, or updating existing ones." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Multiply petal_length by 2 and add it as a new column ``petal_double``**" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T05:43:22.952457Z", "iopub.status.busy": "2020-06-24T05:43:22.951858Z", "iopub.status.idle": "2020-06-24T05:43:22.957449Z", "shell.execute_reply": "2020-06-24T05:43:22.956528Z", "shell.execute_reply.started": "2020-06-24T05:43:22.952433Z" } }, "source": [ "- Option 1. Direct Assignment :" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciespetal_double
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa2.8
14.931.40.2setosa2.8
24.73.21.30.2setosa2.6
34.63.11.50.2setosa3
453.61.40.2setosa2.8
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sol_1 = DT.copy()\n", "\n", "sol_1[\"petal_double\"] = sol_1[:, 2 * f.petal_length]\n", "\n", "sol_1.head(5)" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T05:43:22.952457Z", "iopub.status.busy": "2020-06-24T05:43:22.951858Z", "iopub.status.idle": "2020-06-24T05:43:22.957449Z", "shell.execute_reply": "2020-06-24T05:43:22.956528Z", "shell.execute_reply.started": "2020-06-24T05:43:22.952433Z" } }, "source": [ "- Option 2. Use the ``extend`` method and pass in a dictionary of key value pairs, where the key is the new column name, and the value is the computed column:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciespetal_double
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa2.8
14.931.40.2setosa2.8
24.73.21.30.2setosa2.6
34.63.11.50.2setosa3
453.61.40.2setosa2.8
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sol_2 = DT.copy()\n", "\n", "#f[:] means the entire columns are selected\n", "sol_2 = sol_2[:, f[:].extend({\"petal_double\" : 2 * f.petal_length})]\n", "\n", "sol_2.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Option 3. The ``update`` method :" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "sol_3 = DT.copy()\n", "\n", "sol_3[:, dt.update(petal_double = 2 * f.petal_length)]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciespetal_double
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa2.8
14.931.40.2setosa2.8
24.73.21.30.2setosa2.6
34.63.11.50.2setosa3
453.61.40.2setosa2.8
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sol_3.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Option 4. Use the ``alias`` method:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciespetal_double
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa2.8
14.931.40.2setosa2.8
24.73.21.30.2setosa2.6
34.63.11.50.2setosa3
453.61.40.2setosa2.8
55.43.91.70.4setosa3.4
64.63.41.40.3setosa2.8
753.41.50.2setosa3
84.42.91.40.2setosa2.8
94.93.11.50.1setosa3
105.43.71.50.2setosa3
114.83.41.60.2setosa3.2
124.831.40.1setosa2.8
134.331.10.1setosa2.2
145.841.20.2setosa2.4
1456.735.22.3virginica10.4
1466.32.551.9virginica10
1476.535.22virginica10.4
1486.23.45.42.3virginica10.8
1495.935.11.8virginica10.2
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, [f[:], (f.petal_length*2).alias('petal_double')]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Drop the sepal_width column from the table**" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T05:50:08.472417Z", "iopub.status.busy": "2020-06-24T05:50:08.472181Z", "iopub.status.idle": "2020-06-24T05:50:08.476546Z", "shell.execute_reply": "2020-06-24T05:50:08.475805Z", "shell.execute_reply.started": "2020-06-24T05:50:08.472394Z" } }, "source": [ "- Option 1. Use the ``del`` keyword :" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "del sol_1[\"sepal_width\"]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('sepal_length', 'petal_length', 'petal_width', 'species', 'petal_double')" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sol_1.names" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthpetal_lengthpetal_widthspeciespetal_double
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.11.40.2setosa2.8
14.91.40.2setosa2.8
24.71.30.2setosa2.6
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sol_1.head(3)" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T05:51:02.184668Z", "iopub.status.busy": "2020-06-24T05:51:02.184442Z", "iopub.status.idle": "2020-06-24T05:51:02.192227Z", "shell.execute_reply": "2020-06-24T05:51:02.188226Z", "shell.execute_reply.started": "2020-06-24T05:51:02.184644Z" } }, "source": [ "Option 2. Use the ``remove`` method :" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('sepal_length', 'petal_length', 'petal_width', 'species', 'petal_double')" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sol_3 = sol_3[:, f[:].remove(f.sepal_width)]\n", "sol_3.names" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthpetal_lengthpetal_widthspeciespetal_double
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.11.40.2setosa2.8
14.91.40.2setosa2.8
24.71.30.2setosa2.6
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sol_3.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Option 3. Use a list comprehension" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.11.40.2setosa
14.91.40.2setosa
24.71.30.2setosa
34.61.50.2setosa
451.40.2setosa
55.41.70.4setosa
64.61.40.3setosa
751.50.2setosa
84.41.40.2setosa
94.91.50.1setosa
105.41.50.2setosa
114.81.60.2setosa
124.81.40.1setosa
134.31.10.1setosa
145.81.20.2setosa
1456.75.22.3virginica
1466.351.9virginica
1476.55.22virginica
1486.25.42.3virginica
1495.95.11.8virginica
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, [name!='sepal_width' for name in DT.names]]" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T01:36:56.575216Z", "iopub.status.busy": "2020-06-24T01:36:56.574806Z", "iopub.status.idle": "2020-06-24T01:36:56.580729Z", "shell.execute_reply": "2020-06-24T01:36:56.579990Z", "shell.execute_reply.started": "2020-06-24T01:36:56.575170Z" } }, "source": [ "### __Extra Modifiers__" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T02:32:38.252411Z", "iopub.status.busy": "2020-06-24T02:32:38.252098Z", "iopub.status.idle": "2020-06-24T02:32:38.259836Z", "shell.execute_reply": "2020-06-24T02:32:38.259023Z", "shell.execute_reply.started": "2020-06-24T02:32:38.252376Z" } }, "source": [ "**Notes :**\n", "- We will focus on ``grouping`` and ``sorting``." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Sort data by sepal_width**" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0523.51versicolor
162.241versicolor
26.22.24.51.5versicolor
362.251.5virginica
44.52.31.30.3setosa
55.52.341.3versicolor
66.32.34.41.3versicolor
752.33.31versicolor
84.92.43.31versicolor
95.52.43.81.1versicolor
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outcome = DT[:, :, dt.sort('sepal_width')]\n", "\n", "outcome.head()" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T05:05:21.874851Z", "iopub.status.busy": "2020-06-24T05:05:21.874559Z", "iopub.status.idle": "2020-06-24T05:05:21.879940Z", "shell.execute_reply": "2020-06-24T05:05:21.879082Z", "shell.execute_reply.started": "2020-06-24T05:05:21.874818Z" } }, "source": [ "- That's all there is to it, just pass in the column name, either as a string or as an ``f`` expression.\n", "- By default, sorting is in ascending order." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Sort by sepal_width ascending and petal_width descending**" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0523.51versicolor
16.22.24.51.5versicolor
262.251.5virginica
362.241versicolor
45.52.341.3versicolor
56.32.34.41.3versicolor
652.33.31versicolor
74.52.31.30.3setosa
85.52.43.81.1versicolor
94.92.43.31versicolor
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outcome = DT[:, :, dt.sort([f.sepal_width, -f.petal_width])]\n", "\n", "outcome.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notes :**\n", "- When sorting on multiple columns, pass a list of the column names to the ``sort`` function.\n", "- The ``-`` sign in front of ``petal_width`` instructs data table to sort in descending order. The ``-`` is symbolic and no actual negation occurs; as such it can be used on string columns.\n", "- Note that at this point, we had to switch to ``f-expression`` to get the desired result." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Group By - Get the average sepal length per species**" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciessepal_length
▪▪▪▪▪▪▪▪▪▪▪▪
0setosa5.006
1versicolor5.936
2virginica6.588
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, f.sepal_length.mean(), by(\"species\")] " ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T04:35:34.612160Z", "iopub.status.busy": "2020-06-24T04:35:34.611775Z", "iopub.status.idle": "2020-06-24T04:35:34.616696Z", "shell.execute_reply": "2020-06-24T04:35:34.615827Z", "shell.execute_reply.started": "2020-06-24T04:35:34.612136Z" } }, "source": [ "**Notes :**\n", "- ``by`` is the primary way to group data in the datatable.\n", "- Since we are computing a value (average), the ``f`` symbol is used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Group By - Get the average sepal length and petal length per species**" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciessepal_lengthpetal_length
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa5.0061.462
1versicolor5.9364.26
2virginica6.5885.552
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, dt.mean([f.sepal_length, f.petal_length]), by(\"species\")]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Group By - Filter the rows for sepal_width greater than or equal to 3, then get the average sepal length and petal length per species**" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciessepal_lengthpetal_length
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa5.029171.46667
1versicolor6.218754.54375
2virginica6.768975.64138
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[f.sepal_width>=3,:][:,dt.mean([f.sepal_length, f.petal_length]), by(\"species\")]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notes :**\n", "- At the moment, computation on ``i``, ``j`` and ``by`` at the same time is not yet implemented. As such, we have to break down the steps into two parts.\n", "- This also shows the chaining capabilities of datatable; simply put the next ``[]``, and ``datatable`` will compute the results from left to right." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Group By - Get the average sepal length and group on sepal width greater than 3**" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_width_gt_3sepal_length
▪▪▪▪▪▪▪▪
005.97229
115.68358
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, dt.mean(f.sepal_length), by((f.sepal_width > 3).alias('sepal_width_gt_3'))]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notes :**\n", "- With ``by``, you can group on boolean expressions. Note the use of ``f-expression`` in the ``by`` function.\n", "- Integers ``0`` and ``1`` signifiy ``False`` and ``True`` respectively." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Group By - Get the count of each species, and label the aggregation column as ``species_count``**" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesspecies_count
▪▪▪▪▪▪▪▪▪▪▪▪
0setosa50
1versicolor50
2virginica50
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, {\"species_count\" : dt.count()}, by(\"species\")]" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T04:52:46.255072Z", "iopub.status.busy": "2020-06-24T04:52:46.254674Z", "iopub.status.idle": "2020-06-24T04:52:46.262214Z", "shell.execute_reply": "2020-06-24T04:52:46.261400Z", "shell.execute_reply.started": "2020-06-24T04:52:46.255025Z" } }, "source": [ "**Notes :**\n", "- As with previous examples, a column can be renamed by passing a dictionary, where the key is the new column name and the value is the aggregation result.\n", "- ``count`` is another aggregation function within datatable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Group By - Get the count of each species where the sepal width is greater than 3**" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciessepal_width_gt_3count
▪▪▪▪▪▪▪▪▪▪▪▪
0setosa08
1setosa142
2versicolor042
3versicolor18
4virginica033
5virginica117
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT[:, dt.count(), by(f.species, (f.sepal_width > 3).alias('sepal_width_gt_3'))]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notes :**\n", "- The ``by`` function can take a combination of columns, including booleans." ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-06-24T01:36:56.575216Z", "iopub.status.busy": "2020-06-24T01:36:56.574806Z", "iopub.status.idle": "2020-06-24T01:36:56.580729Z", "shell.execute_reply": "2020-06-24T01:36:56.579990Z", "shell.execute_reply.started": "2020-06-24T01:36:56.575170Z" } }, "source": [ "### __Summary__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This was an introduction to the syntax of python ``datatable``. We will explore more features in subsequent blog posts. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comments\n", "" ] } ], "metadata": { "jupytext": { "formats": "ipynb,md" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 4 }