{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Replicating .SD in Python Datatable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **.SD - Subset of Data**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I will be using [Jose Morales](https://twitter.com/jmrlsz) excellent [post](https://rpubs.com/josemz/SDbf) to show how .SD's functionality can be replicated in python's [datatable](https://datatable.readthedocs.io/en/latest/index.html). Not all functions can be replicated; R [data.table](https://github.com/Rdatatable/data.table) has a whole lot more functions and features that are not yet implemented in [datatable](https://datatable.readthedocs.io/en/latest/index.html).\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from datatable import dt, by, sort, f, fread" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1.1.0'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt.__version__" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DT = fread('Data_files/iris.csv')\n", "DT.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Number of unique observations per column" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0352343223
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DT[, lapply(.SD, uniqueN)] --> Rdatatable\n", "\n", "DT.nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Mean of all columns by `species`" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciessepal_lengthsepal_widthpetal_lengthpetal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa5.0063.4281.4620.246
1versicolor5.9362.774.261.326
2virginica6.5882.9745.5522.026
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DT[, lapply(.SD, mean), by = species] --> Rdatatable\n", "\n", "DT[:, f[:].mean(), by('species')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __Filtering__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### First two observations by species" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciessepal_lengthsepal_widthpetal_lengthpetal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa5.13.51.40.2
1setosa4.931.40.2
2versicolor73.24.71.4
3versicolor6.43.24.51.5
4virginica6.33.362.5
5virginica5.82.75.11.9
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DT[, .SD[1:2], by = species]\n", "\n", "DT[:2, :, by('species')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In [datatable](https://datatable.readthedocs.io/en/latest/index.html), rows are selected in the `i` section after the grouping, unlike in R's [data.table](https://github.com/Rdatatable/data.table), where rows are selected in `i` before grouping, and rows selected in the `.SD` after grouping." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Last two observations by `species`" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciessepal_lengthsepal_widthpetal_lengthpetal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa5.33.71.50.2
1setosa53.31.40.2
2versicolor5.12.531.1
3versicolor5.72.84.11.3
4virginica6.23.45.42.3
5virginica5.935.11.8
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DT[, tail(.SD, 2), by = species] \n", "\n", "DT[-2:, :, by('species')]" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-08-10T09:32:46.775190Z", "iopub.status.busy": "2020-08-10T09:32:46.774933Z", "iopub.status.idle": "2020-08-10T09:32:46.779722Z", "shell.execute_reply": "2020-08-10T09:32:46.778946Z", "shell.execute_reply.started": "2020-08-10T09:32:46.775157Z" } }, "source": [ "Again, the rows are selected after grouping by using Python's negative index slicing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Select the top two sorted by `sepal length` in descending order" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciessepal_lengthsepal_widthpetal_lengthpetal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa5.841.20.2
1setosa5.74.41.50.4
2versicolor73.24.71.4
3versicolor6.93.14.91.5
4virginica7.93.86.42
5virginica7.73.86.72.2
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DT[order(-sepal_length), head(.SD, 2), by = species] \n", "\n", "DT[:2, :, by('species'), sort(-f.sepal_length)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In [datatable](https://datatable.readthedocs.io/en/latest/index.html), the [sort](https://datatable.readthedocs.io/en/latest/api/dt/sort.html#) function replicates the `order` function in R's [data.table](https://github.com/Rdatatable/data.table). Note the `-` symbol before the sepal_length *f-expression*; this instructs the dataframe to sort in descending order." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Select the top two sorted by the difference between the `sepal length` and `sepal width`" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciessepal_lengthsepal_widthpetal_lengthpetal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa4.63.610.2
1setosa5.24.11.50.1
2versicolor5.434.51.5
3versicolor5.22.73.91.4
4virginica4.92.54.51.7
5virginica5.62.84.92
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DT[order(sepal_length - sepal_width), head(.SD, 2), by = species] \n", "\n", "DT[:2, :, by('species'), sort(f.sepal_length - f.sepal_width)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just like in R's [data.table](https://github.com/Rdatatable/data.table), boolean expressions can be passed to the [sort](https://datatable.readthedocs.io/en/latest/api/dt/sort.html#) function." ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-08-10T09:39:44.509104Z", "iopub.status.busy": "2020-08-10T09:39:44.508762Z", "iopub.status.idle": "2020-08-10T09:39:44.516955Z", "shell.execute_reply": "2020-08-10T09:39:44.516055Z", "shell.execute_reply.started": "2020-08-10T09:39:44.509066Z" } }, "source": [ "#### Filter observations above the mean of `sepal_length` by species" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciessepal_lengthsepal_widthpetal_lengthpetal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa5.13.51.40.2
1setosa5.43.91.70.4
2setosa5.43.71.50.2
3setosa5.841.20.2
4setosa5.74.41.50.4
5setosa5.43.91.30.4
6setosa5.13.51.40.3
7setosa5.73.81.70.3
8setosa5.13.81.50.3
9setosa5.43.41.70.2
10setosa5.13.71.50.4
11setosa5.13.31.70.5
12setosa5.23.51.50.2
13setosa5.23.41.40.2
14setosa5.43.41.50.4
63virginica6.73.15.62.4
64virginica6.93.15.12.3
65virginica6.83.25.92.3
66virginica6.73.35.72.5
67virginica6.735.22.3
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DT[, .SD[sepal_length > mean(sepal_length)], by = species] \n", "DT[:, [f[:], (f.sepal_length > f.sepal_length.mean()).alias('temp')], by('species')][f.temp==1, :-1]" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-08-10T09:43:21.190855Z", "iopub.status.busy": "2020-08-10T09:43:21.190636Z", "iopub.status.idle": "2020-08-10T09:43:21.201057Z", "shell.execute_reply": "2020-08-10T09:43:21.200305Z", "shell.execute_reply.started": "2020-08-10T09:43:21.190833Z" } }, "source": [ "Unlike in R's [data.table](https://github.com/Rdatatable/data.table), boolean expressions can not be applied within the `i` section, in the presence of `by`. The next best thing is to break it down into two steps - create a temporary column to hold the boolean value, and then filter on that column." ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-08-10T09:45:04.539605Z", "iopub.status.busy": "2020-08-10T09:45:04.539188Z", "iopub.status.idle": "2020-08-10T09:45:04.544671Z", "shell.execute_reply": "2020-08-10T09:45:04.543872Z", "shell.execute_reply.started": "2020-08-10T09:45:04.539559Z" } }, "source": [ "#### Filter rows with group size greater than 10 " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciespetal_widthsepal_lengthsepal_widthpetal_length
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa0.25.13.51.4
1setosa0.24.931.4
2setosa0.24.73.21.3
3setosa0.24.63.11.5
4setosa0.253.61.4
5setosa0.253.41.5
6setosa0.24.42.91.4
7setosa0.25.43.71.5
8setosa0.24.83.41.6
9setosa0.25.841.2
10setosa0.25.43.41.7
11setosa0.24.63.61
12setosa0.24.83.41.9
13setosa0.2531.6
14setosa0.25.23.51.5
48virginica1.86.22.84.8
49virginica1.86.134.9
50virginica1.86.43.15.5
51virginica1.8634.8
52virginica1.85.935.1
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DT[, .SD[.N > 10], keyby = .(species, petal_width)] \n", "DT[:, [f[:], (dt.count()>10).alias('temp')], by('species', 'petal_width')][f.temp==1, :-1]" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-08-10T09:46:03.123002Z", "iopub.status.busy": "2020-08-10T09:46:03.122700Z", "iopub.status.idle": "2020-08-10T09:46:03.131610Z", "shell.execute_reply": "2020-08-10T09:46:03.130801Z", "shell.execute_reply.started": "2020-08-10T09:46:03.122970Z" } }, "source": [ "#### Get the row with the max petal_length by species." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciessepal_lengthsepal_widthpetal_lengthpetal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa4.83.41.90.2
1versicolor62.75.11.6
2virginica7.72.66.92.3
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DT[, .SD[which.max(petal_length)], by = species] OR \n", "# DT[, .SD[petal_length == max(petal_length)], by = species] \n", "\n", "# get rid of temp column\n", "\n", "DT[0, :, by('species'), sort(-f.petal_length)]" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-08-10T09:43:21.190855Z", "iopub.status.busy": "2020-08-10T09:43:21.190636Z", "iopub.status.idle": "2020-08-10T09:43:21.201057Z", "shell.execute_reply": "2020-08-10T09:43:21.200305Z", "shell.execute_reply.started": "2020-08-10T09:43:21.190833Z" } }, "source": [ "In the above code, we take advantage of the fact that sorting is done within each group; this allows us to pick the first row per group when `petal_length` is sorted in descending order." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### __.SDCols__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Including columns in `.SD`" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.843333.05733
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# col_idx <- grep(\"^sepal\", names(DT)) --> filter for the specicfic columns\n", "# DT[, lapply(.SD, mean), .SDcols = col_idx]\n", "\n", "# filter for the specific columns with a list comprehension\n", "names = [name for name in DT.names\n", " if name.startswith('sepal')]\n", "\n", "DT[:, f[names].mean()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Removing columns from `.SD`" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.843333.05733
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# col_idx <- grep(\"^(petal|species)\", names(DT))\n", "# DT[, lapply(.SD, mean), .SDcols = -col_idx] --> exclusion occurs within .SDcols\n", "\n", "# here, exclusion occurs within the list comprehension\n", "names = [name for name in DT.names \n", " if not name.startswith(('petal','species'))] \n", "\n", "DT[:, f[names].mean()]" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-08-10T09:54:47.180961Z", "iopub.status.busy": "2020-08-10T09:54:47.180667Z", "iopub.status.idle": "2020-08-10T09:54:47.194089Z", "shell.execute_reply": "2020-08-10T09:54:47.193220Z", "shell.execute_reply.started": "2020-08-10T09:54:47.180935Z" } }, "source": [ "#### Column ranges" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.843333.05733
\n", " \n", "
\n" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DT[, lapply(.SD, mean), .SDcols = sepal_length:sepal_width]\n", "\n", "DT[:, f['sepal_length':'sepal_width'].mean()]" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-08-10T09:56:18.579374Z", "iopub.status.busy": "2020-08-10T09:56:18.578943Z", "iopub.status.idle": "2020-08-10T09:56:18.583241Z", "shell.execute_reply": "2020-08-10T09:56:18.582596Z", "shell.execute_reply.started": "2020-08-10T09:56:18.579328Z" } }, "source": [ "### __Summary__" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-08-10T09:58:00.848013Z", "iopub.status.busy": "2020-08-10T09:58:00.847759Z", "iopub.status.idle": "2020-08-10T09:58:00.862752Z", "shell.execute_reply": "2020-08-10T09:58:00.860433Z", "shell.execute_reply.started": "2020-08-10T09:58:00.847991Z" } }, "source": [ "We've seen how to replicate `.SD` in [datatable](https://datatable.readthedocs.io/en/latest/index.html). There are other functionalities in `.SD` that are not presently possible in Python's [datatable](https://datatable.readthedocs.io/en/latest/index.html). It is possible that in the future, `.SD` will be implemented to allow for custom aggregation functions. That would be truly awesome, as it would allow [numpy](https://numpy.org/doc/stable/index.html) functions and functions from other Python libraries into [datatable](https://datatable.readthedocs.io/en/latest/index.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comments\n", "" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 4 }