{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Replicating .SD in Python Datatable"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **.SD - Subset of Data**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I will be using [Jose Morales](https://twitter.com/jmrlsz) excellent [post](https://rpubs.com/josemz/SDbf) to show how .SD's functionality can be replicated in python's [datatable](https://datatable.readthedocs.io/en/latest/index.html). Not all functions can be replicated; R [data.table](https://github.com/Rdatatable/data.table) has a whole lot more functions and features that are not yet implemented in [datatable](https://datatable.readthedocs.io/en/latest/index.html).\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from datatable import dt, by, sort, f, fread"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'1.1.0'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dt.__version__"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width | petal_length | petal_width | species |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
\n",
" 1 | 4.9 | 3 | 1.4 | 0.2 | setosa |
\n",
" 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
\n",
" 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
\n",
" 4 | 5 | 3.6 | 1.4 | 0.2 | setosa |
\n",
" 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
\n",
" 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
\n",
" 7 | 5 | 3.4 | 1.5 | 0.2 | setosa |
\n",
" 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
\n",
" 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DT = fread('Data_files/iris.csv')\n",
"DT.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Number of unique observations per column"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width | petal_length | petal_width | species |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 35 | 23 | 43 | 22 | 3 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DT[, lapply(.SD, uniqueN)] --> Rdatatable\n",
"\n",
"DT.nunique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Mean of all columns by `species`"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species | sepal_length | sepal_width | petal_length | petal_width |
\n",
" | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa | 5.006 | 3.428 | 1.462 | 0.246 |
\n",
" 1 | versicolor | 5.936 | 2.77 | 4.26 | 1.326 |
\n",
" 2 | virginica | 6.588 | 2.974 | 5.552 | 2.026 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DT[, lapply(.SD, mean), by = species] --> Rdatatable\n",
"\n",
"DT[:, f[:].mean(), by('species')]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __Filtering__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### First two observations by species"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species | sepal_length | sepal_width | petal_length | petal_width |
\n",
" | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 |
\n",
" 1 | setosa | 4.9 | 3 | 1.4 | 0.2 |
\n",
" 2 | versicolor | 7 | 3.2 | 4.7 | 1.4 |
\n",
" 3 | versicolor | 6.4 | 3.2 | 4.5 | 1.5 |
\n",
" 4 | virginica | 6.3 | 3.3 | 6 | 2.5 |
\n",
" 5 | virginica | 5.8 | 2.7 | 5.1 | 1.9 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DT[, .SD[1:2], by = species]\n",
"\n",
"DT[:2, :, by('species')]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In [datatable](https://datatable.readthedocs.io/en/latest/index.html), rows are selected in the `i` section after the grouping, unlike in R's [data.table](https://github.com/Rdatatable/data.table), where rows are selected in `i` before grouping, and rows selected in the `.SD` after grouping."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Last two observations by `species`"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species | sepal_length | sepal_width | petal_length | petal_width |
\n",
" | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa | 5.3 | 3.7 | 1.5 | 0.2 |
\n",
" 1 | setosa | 5 | 3.3 | 1.4 | 0.2 |
\n",
" 2 | versicolor | 5.1 | 2.5 | 3 | 1.1 |
\n",
" 3 | versicolor | 5.7 | 2.8 | 4.1 | 1.3 |
\n",
" 4 | virginica | 6.2 | 3.4 | 5.4 | 2.3 |
\n",
" 5 | virginica | 5.9 | 3 | 5.1 | 1.8 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DT[, tail(.SD, 2), by = species] \n",
"\n",
"DT[-2:, :, by('species')]"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-08-10T09:32:46.775190Z",
"iopub.status.busy": "2020-08-10T09:32:46.774933Z",
"iopub.status.idle": "2020-08-10T09:32:46.779722Z",
"shell.execute_reply": "2020-08-10T09:32:46.778946Z",
"shell.execute_reply.started": "2020-08-10T09:32:46.775157Z"
}
},
"source": [
"Again, the rows are selected after grouping by using Python's negative index slicing."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Select the top two sorted by `sepal length` in descending order"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species | sepal_length | sepal_width | petal_length | petal_width |
\n",
" | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa | 5.8 | 4 | 1.2 | 0.2 |
\n",
" 1 | setosa | 5.7 | 4.4 | 1.5 | 0.4 |
\n",
" 2 | versicolor | 7 | 3.2 | 4.7 | 1.4 |
\n",
" 3 | versicolor | 6.9 | 3.1 | 4.9 | 1.5 |
\n",
" 4 | virginica | 7.9 | 3.8 | 6.4 | 2 |
\n",
" 5 | virginica | 7.7 | 3.8 | 6.7 | 2.2 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DT[order(-sepal_length), head(.SD, 2), by = species] \n",
"\n",
"DT[:2, :, by('species'), sort(-f.sepal_length)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In [datatable](https://datatable.readthedocs.io/en/latest/index.html), the [sort](https://datatable.readthedocs.io/en/latest/api/dt/sort.html#) function replicates the `order` function in R's [data.table](https://github.com/Rdatatable/data.table). Note the `-` symbol before the sepal_length *f-expression*; this instructs the dataframe to sort in descending order."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Select the top two sorted by the difference between the `sepal length` and `sepal width`"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species | sepal_length | sepal_width | petal_length | petal_width |
\n",
" | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa | 4.6 | 3.6 | 1 | 0.2 |
\n",
" 1 | setosa | 5.2 | 4.1 | 1.5 | 0.1 |
\n",
" 2 | versicolor | 5.4 | 3 | 4.5 | 1.5 |
\n",
" 3 | versicolor | 5.2 | 2.7 | 3.9 | 1.4 |
\n",
" 4 | virginica | 4.9 | 2.5 | 4.5 | 1.7 |
\n",
" 5 | virginica | 5.6 | 2.8 | 4.9 | 2 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DT[order(sepal_length - sepal_width), head(.SD, 2), by = species] \n",
"\n",
"DT[:2, :, by('species'), sort(f.sepal_length - f.sepal_width)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just like in R's [data.table](https://github.com/Rdatatable/data.table), boolean expressions can be passed to the [sort](https://datatable.readthedocs.io/en/latest/api/dt/sort.html#) function."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-08-10T09:39:44.509104Z",
"iopub.status.busy": "2020-08-10T09:39:44.508762Z",
"iopub.status.idle": "2020-08-10T09:39:44.516955Z",
"shell.execute_reply": "2020-08-10T09:39:44.516055Z",
"shell.execute_reply.started": "2020-08-10T09:39:44.509066Z"
}
},
"source": [
"#### Filter observations above the mean of `sepal_length` by species"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species | sepal_length | sepal_width | petal_length | petal_width |
\n",
" | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 |
\n",
" 1 | setosa | 5.4 | 3.9 | 1.7 | 0.4 |
\n",
" 2 | setosa | 5.4 | 3.7 | 1.5 | 0.2 |
\n",
" 3 | setosa | 5.8 | 4 | 1.2 | 0.2 |
\n",
" 4 | setosa | 5.7 | 4.4 | 1.5 | 0.4 |
\n",
" 5 | setosa | 5.4 | 3.9 | 1.3 | 0.4 |
\n",
" 6 | setosa | 5.1 | 3.5 | 1.4 | 0.3 |
\n",
" 7 | setosa | 5.7 | 3.8 | 1.7 | 0.3 |
\n",
" 8 | setosa | 5.1 | 3.8 | 1.5 | 0.3 |
\n",
" 9 | setosa | 5.4 | 3.4 | 1.7 | 0.2 |
\n",
" 10 | setosa | 5.1 | 3.7 | 1.5 | 0.4 |
\n",
" 11 | setosa | 5.1 | 3.3 | 1.7 | 0.5 |
\n",
" 12 | setosa | 5.2 | 3.5 | 1.5 | 0.2 |
\n",
" 13 | setosa | 5.2 | 3.4 | 1.4 | 0.2 |
\n",
" 14 | setosa | 5.4 | 3.4 | 1.5 | 0.4 |
\n",
" ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
\n",
" 63 | virginica | 6.7 | 3.1 | 5.6 | 2.4 |
\n",
" 64 | virginica | 6.9 | 3.1 | 5.1 | 2.3 |
\n",
" 65 | virginica | 6.8 | 3.2 | 5.9 | 2.3 |
\n",
" 66 | virginica | 6.7 | 3.3 | 5.7 | 2.5 |
\n",
" 67 | virginica | 6.7 | 3 | 5.2 | 2.3 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DT[, .SD[sepal_length > mean(sepal_length)], by = species] \n",
"DT[:, [f[:], (f.sepal_length > f.sepal_length.mean()).alias('temp')], by('species')][f.temp==1, :-1]"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-08-10T09:43:21.190855Z",
"iopub.status.busy": "2020-08-10T09:43:21.190636Z",
"iopub.status.idle": "2020-08-10T09:43:21.201057Z",
"shell.execute_reply": "2020-08-10T09:43:21.200305Z",
"shell.execute_reply.started": "2020-08-10T09:43:21.190833Z"
}
},
"source": [
"Unlike in R's [data.table](https://github.com/Rdatatable/data.table), boolean expressions can not be applied within the `i` section, in the presence of `by`. The next best thing is to break it down into two steps - create a temporary column to hold the boolean value, and then filter on that column."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-08-10T09:45:04.539605Z",
"iopub.status.busy": "2020-08-10T09:45:04.539188Z",
"iopub.status.idle": "2020-08-10T09:45:04.544671Z",
"shell.execute_reply": "2020-08-10T09:45:04.543872Z",
"shell.execute_reply.started": "2020-08-10T09:45:04.539559Z"
}
},
"source": [
"#### Filter rows with group size greater than 10 "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species | petal_width | sepal_length | sepal_width | petal_length |
\n",
" | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa | 0.2 | 5.1 | 3.5 | 1.4 |
\n",
" 1 | setosa | 0.2 | 4.9 | 3 | 1.4 |
\n",
" 2 | setosa | 0.2 | 4.7 | 3.2 | 1.3 |
\n",
" 3 | setosa | 0.2 | 4.6 | 3.1 | 1.5 |
\n",
" 4 | setosa | 0.2 | 5 | 3.6 | 1.4 |
\n",
" 5 | setosa | 0.2 | 5 | 3.4 | 1.5 |
\n",
" 6 | setosa | 0.2 | 4.4 | 2.9 | 1.4 |
\n",
" 7 | setosa | 0.2 | 5.4 | 3.7 | 1.5 |
\n",
" 8 | setosa | 0.2 | 4.8 | 3.4 | 1.6 |
\n",
" 9 | setosa | 0.2 | 5.8 | 4 | 1.2 |
\n",
" 10 | setosa | 0.2 | 5.4 | 3.4 | 1.7 |
\n",
" 11 | setosa | 0.2 | 4.6 | 3.6 | 1 |
\n",
" 12 | setosa | 0.2 | 4.8 | 3.4 | 1.9 |
\n",
" 13 | setosa | 0.2 | 5 | 3 | 1.6 |
\n",
" 14 | setosa | 0.2 | 5.2 | 3.5 | 1.5 |
\n",
" ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
\n",
" 48 | virginica | 1.8 | 6.2 | 2.8 | 4.8 |
\n",
" 49 | virginica | 1.8 | 6.1 | 3 | 4.9 |
\n",
" 50 | virginica | 1.8 | 6.4 | 3.1 | 5.5 |
\n",
" 51 | virginica | 1.8 | 6 | 3 | 4.8 |
\n",
" 52 | virginica | 1.8 | 5.9 | 3 | 5.1 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DT[, .SD[.N > 10], keyby = .(species, petal_width)] \n",
"DT[:, [f[:], (dt.count()>10).alias('temp')], by('species', 'petal_width')][f.temp==1, :-1]"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-08-10T09:46:03.123002Z",
"iopub.status.busy": "2020-08-10T09:46:03.122700Z",
"iopub.status.idle": "2020-08-10T09:46:03.131610Z",
"shell.execute_reply": "2020-08-10T09:46:03.130801Z",
"shell.execute_reply.started": "2020-08-10T09:46:03.122970Z"
}
},
"source": [
"#### Get the row with the max petal_length by species."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | species | sepal_length | sepal_width | petal_length | petal_width |
\n",
" | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | setosa | 4.8 | 3.4 | 1.9 | 0.2 |
\n",
" 1 | versicolor | 6 | 2.7 | 5.1 | 1.6 |
\n",
" 2 | virginica | 7.7 | 2.6 | 6.9 | 2.3 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DT[, .SD[which.max(petal_length)], by = species] OR \n",
"# DT[, .SD[petal_length == max(petal_length)], by = species] \n",
"\n",
"# get rid of temp column\n",
"\n",
"DT[0, :, by('species'), sort(-f.petal_length)]"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-08-10T09:43:21.190855Z",
"iopub.status.busy": "2020-08-10T09:43:21.190636Z",
"iopub.status.idle": "2020-08-10T09:43:21.201057Z",
"shell.execute_reply": "2020-08-10T09:43:21.200305Z",
"shell.execute_reply.started": "2020-08-10T09:43:21.190833Z"
}
},
"source": [
"In the above code, we take advantage of the fact that sorting is done within each group; this allows us to pick the first row per group when `petal_length` is sorted in descending order."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __.SDCols__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Including columns in `.SD`"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.84333 | 3.05733 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# col_idx <- grep(\"^sepal\", names(DT)) --> filter for the specicfic columns\n",
"# DT[, lapply(.SD, mean), .SDcols = col_idx]\n",
"\n",
"# filter for the specific columns with a list comprehension\n",
"names = [name for name in DT.names\n",
" if name.startswith('sepal')]\n",
"\n",
"DT[:, f[names].mean()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Removing columns from `.SD`"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.84333 | 3.05733 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# col_idx <- grep(\"^(petal|species)\", names(DT))\n",
"# DT[, lapply(.SD, mean), .SDcols = -col_idx] --> exclusion occurs within .SDcols\n",
"\n",
"# here, exclusion occurs within the list comprehension\n",
"names = [name for name in DT.names \n",
" if not name.startswith(('petal','species'))] \n",
"\n",
"DT[:, f[names].mean()]"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-08-10T09:54:47.180961Z",
"iopub.status.busy": "2020-08-10T09:54:47.180667Z",
"iopub.status.idle": "2020-08-10T09:54:47.194089Z",
"shell.execute_reply": "2020-08-10T09:54:47.193220Z",
"shell.execute_reply.started": "2020-08-10T09:54:47.180935Z"
}
},
"source": [
"#### Column ranges"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" | sepal_length | sepal_width |
\n",
" | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
\n",
" \n",
" \n",
" 0 | 5.84333 | 3.05733 |
\n",
" \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DT[, lapply(.SD, mean), .SDcols = sepal_length:sepal_width]\n",
"\n",
"DT[:, f['sepal_length':'sepal_width'].mean()]"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-08-10T09:56:18.579374Z",
"iopub.status.busy": "2020-08-10T09:56:18.578943Z",
"iopub.status.idle": "2020-08-10T09:56:18.583241Z",
"shell.execute_reply": "2020-08-10T09:56:18.582596Z",
"shell.execute_reply.started": "2020-08-10T09:56:18.579328Z"
}
},
"source": [
"### __Summary__"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2020-08-10T09:58:00.848013Z",
"iopub.status.busy": "2020-08-10T09:58:00.847759Z",
"iopub.status.idle": "2020-08-10T09:58:00.862752Z",
"shell.execute_reply": "2020-08-10T09:58:00.860433Z",
"shell.execute_reply.started": "2020-08-10T09:58:00.847991Z"
}
},
"source": [
"We've seen how to replicate `.SD` in [datatable](https://datatable.readthedocs.io/en/latest/index.html). There are other functionalities in `.SD` that are not presently possible in Python's [datatable](https://datatable.readthedocs.io/en/latest/index.html). It is possible that in the future, `.SD` will be implemented to allow for custom aggregation functions. That would be truly awesome, as it would allow [numpy](https://numpy.org/doc/stable/index.html) functions and functions from other Python libraries into [datatable](https://datatable.readthedocs.io/en/latest/index.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comments\n",
""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
}
},
"nbformat": 4,
"nbformat_minor": 4
}