{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Selecting and Grouping Data with Python Datatable"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Datatable**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
">This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to [pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) or [SFrame](https://github.com/apple/turicreate); however we put specific emphasis on speed and big data support. As the name suggests, the package is closely related to R's [data.table](https://github.com/Rdatatable/data.table) and attempts to mimic its core algorithms and API."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-26T01:27:05.586391Z",
"iopub.status.busy": "2020-06-26T01:27:05.586173Z",
"iopub.status.idle": "2020-06-26T01:27:05.592158Z",
"shell.execute_reply": "2020-06-26T01:27:05.591344Z",
"shell.execute_reply.started": "2020-06-26T01:27:05.586367Z"
}
},
"source": [
"I like ``datatable`` primarily because of its simple syntax. Yes, there are significant speed gains, which is the primary aim of the package, but the simplicity is compelling. \n",
"
Do note that ``datatable`` is in active development - more features will be added. Check out the [documentation](https://datatable.readthedocs.io/en/latest/?badge=latest) for more information. \n",
"
This introduction focuses on how to select rows, select and perform calculations on columns, and perform aggregations by group. \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-23T04:49:52.844408Z",
"iopub.status.busy": "2020-06-23T04:49:52.844179Z",
"iopub.status.idle": "2020-06-23T04:49:52.856885Z",
"shell.execute_reply": "2020-06-23T04:49:52.856044Z",
"shell.execute_reply.started": "2020-06-23T04:49:52.844385Z"
}
},
"source": [
"### __Datatable syntax__\n",
"\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Short syntax right? Let's break it down : \n",
"
- ``DT`` refers to the data frame. This is the fundamental building block in ``datatable``. It is a 2-dimensional array with rows and columns, similar to an Excel/SQL table.\n",
"
- The ``i`` part is used for subsetting on rows and shares a similar concept with SQL's WHERE clause.\n",
"
- The ``j`` part is used to select columns and act on them. \n",
"
- `...` are for extra modifiers, e.g grouping, sorting, joining, etc. \n",
"
Let's dive into some examples to see how ``datatable`` works. Our data will be the [iris](https://notebooks.azure.com/kvp15comp/projects/FDPMLDS/html/iris.csv) dataset."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width | petal_length | petal_width | species |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
\n",
" 1 | 4.9 | 3 | 1.4 | 0.2 | setosa |
\n",
" 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
\n",
" 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
\n",
" 4 | 5 | 3.6 | 1.4 | 0.2 | setosa |
\n",
" 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
\n",
" 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
\n",
" 7 | 5 | 3.4 | 1.5 | 0.2 | setosa |
\n",
" 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
\n",
" 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#import datatable\n",
"from datatable import dt, f, by\n",
"\n",
"#read in data\n",
"DT = dt.fread(\"Data_files/iris.csv\")\n",
"\n",
"DT.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T00:59:50.199147Z",
"iopub.status.busy": "2020-06-24T00:59:50.198788Z",
"iopub.status.idle": "2020-06-24T00:59:50.207032Z",
"shell.execute_reply": "2020-06-24T00:59:50.206313Z",
"shell.execute_reply.started": "2020-06-24T00:59:50.199106Z"
}
},
"source": [
"**Notes :**\n",
"- ``dt`` refers to the datatable module.\n",
"- All computations occur within the ``[]`` bracket.\n",
"- ``dt.fread`` is a powerful and very fast function for reading in various text files, zip archives, and urls. It can even read in data from the command line.\n",
"- ``by`` is a function for grouping.\n",
"- ``f`` is a variable that provides a convenient way to reference the data frame's column within the square brackets. It is really useful when performing computations or creating expressions.\n",
"- In a jupyter notebook, the tab colours for the columns indicate various data types - ``blue`` is for float column, ``green`` is for integer column, ``red`` is for string column, ``yellow`` is for boolean, while ``black`` is for object column."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T02:09:49.680410Z",
"iopub.status.busy": "2020-06-24T02:09:49.680177Z",
"iopub.status.idle": "2020-06-24T02:09:49.684525Z",
"shell.execute_reply": "2020-06-24T02:09:49.683668Z",
"shell.execute_reply.started": "2020-06-24T02:09:49.680384Z"
}
},
"source": [
"Some basic information about the data frame : "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(150, 5)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#shape of data\n",
"DT.shape"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#column names\n",
"DT.names"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(stype.float64, stype.float64, stype.float64, stype.float64, stype.str32)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#data types of the columns\n",
"DT.stypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __The i part - Subset Rows__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Select the first three rows**"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width | petal_length | petal_width | species |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
\n",
" 1 | 4.9 | 3 | 1.4 | 0.2 | setosa |
\n",
" 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#We can use python's slicing syntax to get the rows. \n",
"DT[:3, :]"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T01:24:00.010598Z",
"iopub.status.busy": "2020-06-24T01:24:00.010346Z",
"iopub.status.idle": "2020-06-24T01:24:00.018129Z",
"shell.execute_reply": "2020-06-24T01:24:00.017275Z",
"shell.execute_reply.started": "2020-06-24T01:24:00.010572Z"
}
},
"source": [
"**Notes :**\n",
"- No new syntax, just our knowledge of python's sequence indexing.\n",
"- Note also that nothing was selected in the columns. Passing ``None`` or `...` in the j section will work as well."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Select the 2nd, 4th and 8th rows**"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width | petal_length | petal_width | species |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 4.9 | 3 | 1.4 | 0.2 | setosa |
\n",
" 1 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
\n",
" 2 | 5 | 3.4 | 1.5 | 0.2 | setosa |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Python has a zero based indexing notation,\n",
"#so, we will pass in a list [1,3,7]\n",
"\n",
"DT[[1,3,7], :]"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T01:25:07.520476Z",
"iopub.status.busy": "2020-06-24T01:25:07.520247Z",
"iopub.status.idle": "2020-06-24T01:25:07.531674Z",
"shell.execute_reply": "2020-06-24T01:25:07.530705Z",
"shell.execute_reply.started": "2020-06-24T01:25:07.520453Z"
}
},
"source": [
"#### **Find rows where species == \"versicolor\"**"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T01:26:11.318536Z",
"iopub.status.busy": "2020-06-24T01:26:11.318109Z",
"iopub.status.idle": "2020-06-24T01:26:11.325278Z",
"shell.execute_reply": "2020-06-24T01:26:11.324480Z",
"shell.execute_reply.started": "2020-06-24T01:26:11.318490Z"
}
},
"source": [
"This is an expression. How do we talk to the dataframe to filter for only rows where the ``species`` column is equal to ``versicolor``? Through the ``f`` variable : "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width | petal_length | petal_width | species |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 7 | 3.2 | 4.7 | 1.4 | versicolor |
\n",
" 1 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
\n",
" 2 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
\n",
" 3 | 5.5 | 2.3 | 4 | 1.3 | versicolor |
\n",
" 4 | 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
\n",
" 5 | 5.7 | 2.8 | 4.5 | 1.3 | versicolor |
\n",
" 6 | 6.3 | 3.3 | 4.7 | 1.6 | versicolor |
\n",
" 7 | 4.9 | 2.4 | 3.3 | 1 | versicolor |
\n",
" 8 | 6.6 | 2.9 | 4.6 | 1.3 | versicolor |
\n",
" 9 | 5.2 | 2.7 | 3.9 | 1.4 | versicolor |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result = DT[f.species == \"versicolor\", :]\n",
"result.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Notes :**\n",
"- We create a boolean expression, using the ``f`` symbol, to select rows that match the condition."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T01:25:07.520476Z",
"iopub.status.busy": "2020-06-24T01:25:07.520247Z",
"iopub.status.idle": "2020-06-24T01:25:07.531674Z",
"shell.execute_reply": "2020-06-24T01:25:07.530705Z",
"shell.execute_reply.started": "2020-06-24T01:25:07.520453Z"
}
},
"source": [
"#### **Find rows where species == \"versicolor\" and sepal_length == 7**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width | petal_length | petal_width | species |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 7 | 3.2 | 4.7 | 1.4 | versicolor |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DT[(f.species == \"versicolor\") & (f.sepal_length == 7), :]"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T01:31:51.142075Z",
"iopub.status.busy": "2020-06-24T01:31:51.141749Z",
"iopub.status.idle": "2020-06-24T01:31:51.147868Z",
"shell.execute_reply": "2020-06-24T01:31:51.146730Z",
"shell.execute_reply.started": "2020-06-24T01:31:51.142038Z"
}
},
"source": [
"**Notes :**\n",
"- Again, a filter expression is needed -> we use the ``f`` symbol to access the columns and create our expression. \n",
"- Note also that we have two conditions; as such, each condition is wrapped in parentheses to ensure the operator precedence.\n",
"- Operator ``&`` for ``and``, ``|`` for ``or``.\n",
"- You can chain as many conditions as needed within the ``i`` section."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T01:36:56.575216Z",
"iopub.status.busy": "2020-06-24T01:36:56.574806Z",
"iopub.status.idle": "2020-06-24T01:36:56.580729Z",
"shell.execute_reply": "2020-06-24T01:36:56.579990Z",
"shell.execute_reply.started": "2020-06-24T01:36:56.575170Z"
}
},
"source": [
"### __The j part - Columns__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Select species, petal_width and petal_length columns**"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species | petal_width | petal_length |
\n",
" | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa | 0.2 | 1.4 |
\n",
" 1 | setosa | 0.2 | 1.4 |
\n",
" 2 | setosa | 0.2 | 1.3 |
\n",
" 3 | setosa | 0.2 | 1.5 |
\n",
" 4 | setosa | 0.2 | 1.4 |
\n",
" 5 | setosa | 0.4 | 1.7 |
\n",
" 6 | setosa | 0.3 | 1.4 |
\n",
" 7 | setosa | 0.2 | 1.5 |
\n",
" 8 | setosa | 0.2 | 1.4 |
\n",
" 9 | setosa | 0.1 | 1.5 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Simply pass a list of the names in the j section \n",
"result = DT[:, [\"species\",\"petal_width\",\"petal_length\"]]\n",
"\n",
"result.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Notes :**\n",
"- Note that we passed ``:`` in the ``i`` section to indicate that all rows will be selected. ``None`` or ``...`` works fine as well."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Select the last two columns**"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | petal_width | species |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 0.2 | setosa |
\n",
" 1 | 0.2 | setosa |
\n",
" 2 | 0.2 | setosa |
\n",
" 3 | 0.2 | setosa |
\n",
" 4 | 0.2 | setosa |
\n",
" 5 | 0.4 | setosa |
\n",
" 6 | 0.3 | setosa |
\n",
" 7 | 0.2 | setosa |
\n",
" 8 | 0.2 | setosa |
\n",
" 9 | 0.1 | setosa |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Again, we use python's indexing syntax\n",
"result = DT[:, -2:]\n",
"\n",
"result.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Select only columns whose names end with ``length``**"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | petal_length |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.1 | 1.4 |
\n",
" 1 | 4.9 | 1.4 |
\n",
" 2 | 4.7 | 1.3 |
\n",
" 3 | 4.6 | 1.5 |
\n",
" 4 | 5 | 1.4 |
\n",
" 5 | 5.4 | 1.7 |
\n",
" 6 | 4.6 | 1.4 |
\n",
" 7 | 5 | 1.5 |
\n",
" 8 | 4.4 | 1.4 |
\n",
" 9 | 4.9 | 1.5 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result = DT[:, [col.endswith(\"length\") for col in DT.names]]\n",
"\n",
"result.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Notes:**\n",
"- For this, we pass a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) to extract the matches. Again, we are using familar python constructs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Calculate the mean value of sepal_length**"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length |
\n",
" | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.84333 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DT[:, f.sepal_length.mean()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Here, we have to compute a value, and as such, the ``f`` symbol comes into play. \n",
"- To get the average, we use the ``mean`` function from the datatable module."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**REMEMBER**: Any time you want to compute a value or create an expression, make use of the ``f`` symbol."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TIP** : We can export the column names and use them, instead of the ``f`` symbol :"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length |
\n",
" | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.84333 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sepal_length, *_ = DT.export_names()\n",
"\n",
"#calculate the mean value of sepal_length\n",
"DT[:, sepal_length.mean()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also export only the names we need :"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | petal_length | petal_width |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 3.758 | 1.19933 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"petal_width, petal_length = DT[:, [2,3]].export_names()\n",
"\n",
"#calculate mean value of petal_width and petal_length\n",
"DT[:, dt.mean([petal_width.mean(), petal_length])]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Notes:**\n",
"- To get aggregates of more than one column, pass a list of column names prefixed with the `f` symbol to j.\n",
"- Alternatively, you can pass a dictionary of key - value pairs, where the keys will be the new column names, and the values are the computed columns.\n",
"\n",
"For the rest of the tutorial, I will continue using ``f-expressions``, as they are convenient to use, without having to explicitly export column names."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Select only species and petal_length columns for rows where the petal_length is greater than 1.5**"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species | petal_length |
\n",
" | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa | 1.7 |
\n",
" 1 | setosa | 1.6 |
\n",
" 2 | setosa | 1.7 |
\n",
" 3 | setosa | 1.7 |
\n",
" 4 | setosa | 1.7 |
\n",
" 5 | setosa | 1.9 |
\n",
" 6 | setosa | 1.6 |
\n",
" 7 | setosa | 1.6 |
\n",
" 8 | setosa | 1.6 |
\n",
" 9 | setosa | 1.6 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result = DT[f.petal_length > 1.5, [\"species\", \"petal_length\"]]\n",
"\n",
"result.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Select only string columns**"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species |
\n",
" | ▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa |
\n",
" 1 | setosa |
\n",
" 2 | setosa |
\n",
" 3 | setosa |
\n",
" 4 | setosa |
\n",
" 5 | setosa |
\n",
" 6 | setosa |
\n",
" 7 | setosa |
\n",
" 8 | setosa |
\n",
" 9 | setosa |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result = DT[:, f[str]]\n",
"result.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Notes:**\n",
"- The above concept is applicable for any data type that we wish to include or exclude from the dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Convert petal_length from float to integer**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Option 1. Direct assignment : "
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"stype.float64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result = DT.copy()\n",
"\n",
"result[\"petal_length\"].stype"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"stype.int32"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#change column type\n",
"result[\"petal_length\"] = dt.int32\n",
"\n",
"#check new data type\n",
"result[:, f.petal_length].stype"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Option 2. Use the ``update`` method. The process is done in-place; no assignment is needed :"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"stype.float64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res = DT.copy()\n",
"\n",
"res[\"petal_length\"].stype"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"#update is an inplace operation\n",
"res[:, dt.update(petal_length = dt.int32(f.petal_length))]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"stype.int32"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#check data type after update \n",
"res['petal_length'].stype"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Option 3. Use the ``as_type`` method. This is the recommended way to change the column type."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"stype.int32"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DT[:, f.petal_length.as_type(dt.int32)].stype"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T08:18:43.028387Z",
"iopub.status.busy": "2020-06-24T08:18:43.028155Z",
"iopub.status.idle": "2020-06-24T08:18:43.032364Z",
"shell.execute_reply": "2020-06-24T08:18:43.031571Z",
"shell.execute_reply.started": "2020-06-24T08:18:43.028365Z"
}
},
"source": [
"**TIP** : The ``update`` method allows creation of new columns, or updating existing ones."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Multiply petal_length by 2 and add it as a new column ``petal_double``**"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T05:43:22.952457Z",
"iopub.status.busy": "2020-06-24T05:43:22.951858Z",
"iopub.status.idle": "2020-06-24T05:43:22.957449Z",
"shell.execute_reply": "2020-06-24T05:43:22.956528Z",
"shell.execute_reply.started": "2020-06-24T05:43:22.952433Z"
}
},
"source": [
"- Option 1. Direct Assignment :"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width | petal_length | petal_width | species | petal_double |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" 1 | 4.9 | 3 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 2.6 |
\n",
" 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 3 |
\n",
" 4 | 5 | 3.6 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sol_1 = DT.copy()\n",
"\n",
"sol_1[\"petal_double\"] = sol_1[:, 2 * f.petal_length]\n",
"\n",
"sol_1.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T05:43:22.952457Z",
"iopub.status.busy": "2020-06-24T05:43:22.951858Z",
"iopub.status.idle": "2020-06-24T05:43:22.957449Z",
"shell.execute_reply": "2020-06-24T05:43:22.956528Z",
"shell.execute_reply.started": "2020-06-24T05:43:22.952433Z"
}
},
"source": [
"- Option 2. Use the ``extend`` method and pass in a dictionary of key value pairs, where the key is the new column name, and the value is the computed column:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width | petal_length | petal_width | species | petal_double |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" 1 | 4.9 | 3 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 2.6 |
\n",
" 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 3 |
\n",
" 4 | 5 | 3.6 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sol_2 = DT.copy()\n",
"\n",
"#f[:] means the entire columns are selected\n",
"sol_2 = sol_2[:, f[:].extend({\"petal_double\" : 2 * f.petal_length})]\n",
"\n",
"sol_2.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Option 3. The ``update`` method :"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"sol_3 = DT.copy()\n",
"\n",
"sol_3[:, dt.update(petal_double = 2 * f.petal_length)]"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width | petal_length | petal_width | species | petal_double |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" 1 | 4.9 | 3 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 2.6 |
\n",
" 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 3 |
\n",
" 4 | 5 | 3.6 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sol_3.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Option 4. Use the ``alias`` method:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width | petal_length | petal_width | species | petal_double |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" 1 | 4.9 | 3 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 2.6 |
\n",
" 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 3 |
\n",
" 4 | 5 | 3.6 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa | 3.4 |
\n",
" 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa | 2.8 |
\n",
" 7 | 5 | 3.4 | 1.5 | 0.2 | setosa | 3 |
\n",
" 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa | 3 |
\n",
" 10 | 5.4 | 3.7 | 1.5 | 0.2 | setosa | 3 |
\n",
" 11 | 4.8 | 3.4 | 1.6 | 0.2 | setosa | 3.2 |
\n",
" 12 | 4.8 | 3 | 1.4 | 0.1 | setosa | 2.8 |
\n",
" 13 | 4.3 | 3 | 1.1 | 0.1 | setosa | 2.2 |
\n",
" 14 | 5.8 | 4 | 1.2 | 0.2 | setosa | 2.4 |
\n",
" ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
\n",
" 145 | 6.7 | 3 | 5.2 | 2.3 | virginica | 10.4 |
\n",
" 146 | 6.3 | 2.5 | 5 | 1.9 | virginica | 10 |
\n",
" 147 | 6.5 | 3 | 5.2 | 2 | virginica | 10.4 |
\n",
" 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 10.8 |
\n",
" 149 | 5.9 | 3 | 5.1 | 1.8 | virginica | 10.2 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DT[:, [f[:], (f.petal_length*2).alias('petal_double')]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Drop the sepal_width column from the table**"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T05:50:08.472417Z",
"iopub.status.busy": "2020-06-24T05:50:08.472181Z",
"iopub.status.idle": "2020-06-24T05:50:08.476546Z",
"shell.execute_reply": "2020-06-24T05:50:08.475805Z",
"shell.execute_reply.started": "2020-06-24T05:50:08.472394Z"
}
},
"source": [
"- Option 1. Use the ``del`` keyword :"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"del sol_1[\"sepal_width\"]"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('sepal_length', 'petal_length', 'petal_width', 'species', 'petal_double')"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sol_1.names"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | petal_length | petal_width | species | petal_double |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.1 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" 1 | 4.9 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" 2 | 4.7 | 1.3 | 0.2 | setosa | 2.6 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sol_1.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T05:51:02.184668Z",
"iopub.status.busy": "2020-06-24T05:51:02.184442Z",
"iopub.status.idle": "2020-06-24T05:51:02.192227Z",
"shell.execute_reply": "2020-06-24T05:51:02.188226Z",
"shell.execute_reply.started": "2020-06-24T05:51:02.184644Z"
}
},
"source": [
"Option 2. Use the ``remove`` method :"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('sepal_length', 'petal_length', 'petal_width', 'species', 'petal_double')"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sol_3 = sol_3[:, f[:].remove(f.sepal_width)]\n",
"sol_3.names"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | petal_length | petal_width | species | petal_double |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.1 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" 1 | 4.9 | 1.4 | 0.2 | setosa | 2.8 |
\n",
" 2 | 4.7 | 1.3 | 0.2 | setosa | 2.6 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sol_3.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Option 3. Use a list comprehension"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | petal_length | petal_width | species |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.1 | 1.4 | 0.2 | setosa |
\n",
" 1 | 4.9 | 1.4 | 0.2 | setosa |
\n",
" 2 | 4.7 | 1.3 | 0.2 | setosa |
\n",
" 3 | 4.6 | 1.5 | 0.2 | setosa |
\n",
" 4 | 5 | 1.4 | 0.2 | setosa |
\n",
" 5 | 5.4 | 1.7 | 0.4 | setosa |
\n",
" 6 | 4.6 | 1.4 | 0.3 | setosa |
\n",
" 7 | 5 | 1.5 | 0.2 | setosa |
\n",
" 8 | 4.4 | 1.4 | 0.2 | setosa |
\n",
" 9 | 4.9 | 1.5 | 0.1 | setosa |
\n",
" 10 | 5.4 | 1.5 | 0.2 | setosa |
\n",
" 11 | 4.8 | 1.6 | 0.2 | setosa |
\n",
" 12 | 4.8 | 1.4 | 0.1 | setosa |
\n",
" 13 | 4.3 | 1.1 | 0.1 | setosa |
\n",
" 14 | 5.8 | 1.2 | 0.2 | setosa |
\n",
" ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
\n",
" 145 | 6.7 | 5.2 | 2.3 | virginica |
\n",
" 146 | 6.3 | 5 | 1.9 | virginica |
\n",
" 147 | 6.5 | 5.2 | 2 | virginica |
\n",
" 148 | 6.2 | 5.4 | 2.3 | virginica |
\n",
" 149 | 5.9 | 5.1 | 1.8 | virginica |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DT[:, [name!='sepal_width' for name in DT.names]]"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T01:36:56.575216Z",
"iopub.status.busy": "2020-06-24T01:36:56.574806Z",
"iopub.status.idle": "2020-06-24T01:36:56.580729Z",
"shell.execute_reply": "2020-06-24T01:36:56.579990Z",
"shell.execute_reply.started": "2020-06-24T01:36:56.575170Z"
}
},
"source": [
"### __Extra Modifiers__"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T02:32:38.252411Z",
"iopub.status.busy": "2020-06-24T02:32:38.252098Z",
"iopub.status.idle": "2020-06-24T02:32:38.259836Z",
"shell.execute_reply": "2020-06-24T02:32:38.259023Z",
"shell.execute_reply.started": "2020-06-24T02:32:38.252376Z"
}
},
"source": [
"**Notes :**\n",
"- We will focus on ``grouping`` and ``sorting``."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Sort data by sepal_width**"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width | petal_length | petal_width | species |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5 | 2 | 3.5 | 1 | versicolor |
\n",
" 1 | 6 | 2.2 | 4 | 1 | versicolor |
\n",
" 2 | 6.2 | 2.2 | 4.5 | 1.5 | versicolor |
\n",
" 3 | 6 | 2.2 | 5 | 1.5 | virginica |
\n",
" 4 | 4.5 | 2.3 | 1.3 | 0.3 | setosa |
\n",
" 5 | 5.5 | 2.3 | 4 | 1.3 | versicolor |
\n",
" 6 | 6.3 | 2.3 | 4.4 | 1.3 | versicolor |
\n",
" 7 | 5 | 2.3 | 3.3 | 1 | versicolor |
\n",
" 8 | 4.9 | 2.4 | 3.3 | 1 | versicolor |
\n",
" 9 | 5.5 | 2.4 | 3.8 | 1.1 | versicolor |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"outcome = DT[:, :, dt.sort('sepal_width')]\n",
"\n",
"outcome.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T05:05:21.874851Z",
"iopub.status.busy": "2020-06-24T05:05:21.874559Z",
"iopub.status.idle": "2020-06-24T05:05:21.879940Z",
"shell.execute_reply": "2020-06-24T05:05:21.879082Z",
"shell.execute_reply.started": "2020-06-24T05:05:21.874818Z"
}
},
"source": [
"- That's all there is to it, just pass in the column name, either as a string or as an ``f`` expression.\n",
"- By default, sorting is in ascending order."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Sort by sepal_width ascending and petal_width descending**"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width | petal_length | petal_width | species |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5 | 2 | 3.5 | 1 | versicolor |
\n",
" 1 | 6.2 | 2.2 | 4.5 | 1.5 | versicolor |
\n",
" 2 | 6 | 2.2 | 5 | 1.5 | virginica |
\n",
" 3 | 6 | 2.2 | 4 | 1 | versicolor |
\n",
" 4 | 5.5 | 2.3 | 4 | 1.3 | versicolor |
\n",
" 5 | 6.3 | 2.3 | 4.4 | 1.3 | versicolor |
\n",
" 6 | 5 | 2.3 | 3.3 | 1 | versicolor |
\n",
" 7 | 4.5 | 2.3 | 1.3 | 0.3 | setosa |
\n",
" 8 | 5.5 | 2.4 | 3.8 | 1.1 | versicolor |
\n",
" 9 | 4.9 | 2.4 | 3.3 | 1 | versicolor |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"outcome = DT[:, :, dt.sort([f.sepal_width, -f.petal_width])]\n",
"\n",
"outcome.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Notes :**\n",
"- When sorting on multiple columns, pass a list of the column names to the ``sort`` function.\n",
"- The ``-`` sign in front of ``petal_width`` instructs data table to sort in descending order. The ``-`` is symbolic and no actual negation occurs; as such it can be used on string columns.\n",
"- Note that at this point, we had to switch to ``f-expression`` to get the desired result."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Group By - Get the average sepal length per species**"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species | sepal_length |
\n",
" | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa | 5.006 |
\n",
" 1 | versicolor | 5.936 |
\n",
" 2 | virginica | 6.588 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DT[:, f.sepal_length.mean(), by(\"species\")] "
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T04:35:34.612160Z",
"iopub.status.busy": "2020-06-24T04:35:34.611775Z",
"iopub.status.idle": "2020-06-24T04:35:34.616696Z",
"shell.execute_reply": "2020-06-24T04:35:34.615827Z",
"shell.execute_reply.started": "2020-06-24T04:35:34.612136Z"
}
},
"source": [
"**Notes :**\n",
"- ``by`` is the primary way to group data in the datatable.\n",
"- Since we are computing a value (average), the ``f`` symbol is used."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Group By - Get the average sepal length and petal length per species**"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species | sepal_length | petal_length |
\n",
" | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa | 5.006 | 1.462 |
\n",
" 1 | versicolor | 5.936 | 4.26 |
\n",
" 2 | virginica | 6.588 | 5.552 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DT[:, dt.mean([f.sepal_length, f.petal_length]), by(\"species\")]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Group By - Filter the rows for sepal_width greater than or equal to 3, then get the average sepal length and petal length per species**"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species | sepal_length | petal_length |
\n",
" | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa | 5.02917 | 1.46667 |
\n",
" 1 | versicolor | 6.21875 | 4.54375 |
\n",
" 2 | virginica | 6.76897 | 5.64138 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DT[f.sepal_width>=3,:][:,dt.mean([f.sepal_length, f.petal_length]), by(\"species\")]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Notes :**\n",
"- At the moment, computation on ``i``, ``j`` and ``by`` at the same time is not yet implemented. As such, we have to break down the steps into two parts.\n",
"- This also shows the chaining capabilities of datatable; simply put the next ``[]``, and ``datatable`` will compute the results from left to right."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Group By - Get the average sepal length and group on sepal width greater than 3**"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_width_gt_3 | sepal_length |
\n",
" | ▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 0 | 5.97229 |
\n",
" 1 | 1 | 5.68358 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DT[:, dt.mean(f.sepal_length), by((f.sepal_width > 3).alias('sepal_width_gt_3'))]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Notes :**\n",
"- With ``by``, you can group on boolean expressions. Note the use of ``f-expression`` in the ``by`` function.\n",
"- Integers ``0`` and ``1`` signifiy ``False`` and ``True`` respectively."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Group By - Get the count of each species, and label the aggregation column as ``species_count``**"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species | species_count |
\n",
" | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa | 50 |
\n",
" 1 | versicolor | 50 |
\n",
" 2 | virginica | 50 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DT[:, {\"species_count\" : dt.count()}, by(\"species\")]"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T04:52:46.255072Z",
"iopub.status.busy": "2020-06-24T04:52:46.254674Z",
"iopub.status.idle": "2020-06-24T04:52:46.262214Z",
"shell.execute_reply": "2020-06-24T04:52:46.261400Z",
"shell.execute_reply.started": "2020-06-24T04:52:46.255025Z"
}
},
"source": [
"**Notes :**\n",
"- As with previous examples, a column can be renamed by passing a dictionary, where the key is the new column name and the value is the aggregation result.\n",
"- ``count`` is another aggregation function within datatable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Group By - Get the count of each species where the sepal width is greater than 3**"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species | sepal_width_gt_3 | count |
\n",
" | ▪▪▪▪ | ▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa | 0 | 8 |
\n",
" 1 | setosa | 1 | 42 |
\n",
" 2 | versicolor | 0 | 42 |
\n",
" 3 | versicolor | 1 | 8 |
\n",
" 4 | virginica | 0 | 33 |
\n",
" 5 | virginica | 1 | 17 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DT[:, dt.count(), by(f.species, (f.sepal_width > 3).alias('sepal_width_gt_3'))]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Notes :**\n",
"- The ``by`` function can take a combination of columns, including booleans."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-06-24T01:36:56.575216Z",
"iopub.status.busy": "2020-06-24T01:36:56.574806Z",
"iopub.status.idle": "2020-06-24T01:36:56.580729Z",
"shell.execute_reply": "2020-06-24T01:36:56.579990Z",
"shell.execute_reply.started": "2020-06-24T01:36:56.575170Z"
}
},
"source": [
"### __Summary__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This was an introduction to the syntax of python ``datatable``. We will explore more features in subsequent blog posts. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comments\n",
""
]
}
],
"metadata": {
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
}
},
"nbformat": 4,
"nbformat_minor": 4
}