Replicating .SD in Python Datatable

Replicating .SD in Python Datatable#

.SD - Subset of Data#

I will be using Jose Morales excellent post to show how .SD’s functionality can be replicated in python’s datatable. Not all functions can be replicated; R data.table has a whole lot more functions and features that are not yet implemented in datatable.

from datatable import dt, by, sort,  f, fread

dt.__version__

'1.1.0'

DT = fread('Data_files/iris.csv')
DT.head()

	sepal_length	sepal_width	petal_length	petal_width	species
	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5	3.6	1.4	0.2	setosa
5	5.4	3.9	1.7	0.4	setosa
6	4.6	3.4	1.4	0.3	setosa
7	5	3.4	1.5	0.2	setosa
8	4.4	2.9	1.4	0.2	setosa
9	4.9	3.1	1.5	0.1	setosa

Number of unique observations per column#

# DT[, lapply(.SD, uniqueN)] --> Rdatatable

DT.nunique()

	sepal_length	sepal_width	petal_length	petal_width	species
	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	35	23	43	22	3

Mean of all columns by `species`#

# DT[, lapply(.SD, mean), by = species] --> Rdatatable

DT[:, f[:].mean(), by('species')]

	species	sepal_length	sepal_width	petal_length	petal_width
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	setosa	5.006	3.428	1.462	0.246
1	versicolor	5.936	2.77	4.26	1.326
2	virginica	6.588	2.974	5.552	2.026

Filtering#

First two observations by species#

# DT[, .SD[1:2], by = species]

DT[:2, :, by('species')]

	species	sepal_length	sepal_width	petal_length	petal_width
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	setosa	5.1	3.5	1.4	0.2
1	setosa	4.9	3	1.4	0.2
2	versicolor	7	3.2	4.7	1.4
3	versicolor	6.4	3.2	4.5	1.5
4	virginica	6.3	3.3	6	2.5
5	virginica	5.8	2.7	5.1	1.9

In datatable, rows are selected in the i section after the grouping, unlike in R’s data.table, where rows are selected in i before grouping, and rows selected in the .SD after grouping.

Last two observations by `species`#

# DT[, tail(.SD, 2), by = species] 

DT[-2:, :, by('species')]

	species	sepal_length	sepal_width	petal_length	petal_width
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	setosa	5.3	3.7	1.5	0.2
1	setosa	5	3.3	1.4	0.2
2	versicolor	5.1	2.5	3	1.1
3	versicolor	5.7	2.8	4.1	1.3
4	virginica	6.2	3.4	5.4	2.3
5	virginica	5.9	3	5.1	1.8

Again, the rows are selected after grouping by using Python’s negative index slicing.

Select the top two sorted by `sepal length` in descending order#

# DT[order(-sepal_length), head(.SD, 2), by = species] 

DT[:2, :, by('species'), sort(-f.sepal_length)]

	species	sepal_length	sepal_width	petal_length	petal_width
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	setosa	5.8	4	1.2	0.2
1	setosa	5.7	4.4	1.5	0.4
2	versicolor	7	3.2	4.7	1.4
3	versicolor	6.9	3.1	4.9	1.5
4	virginica	7.9	3.8	6.4	2
5	virginica	7.7	3.8	6.7	2.2

In datatable, the sort function replicates the order function in R’s data.table. Note the - symbol before the sepal_length f-expression; this instructs the dataframe to sort in descending order.

Select the top two sorted by the difference between the `sepal length` and `sepal width`#

# DT[order(sepal_length - sepal_width), head(.SD, 2), by = species] 

DT[:2, :, by('species'), sort(f.sepal_length - f.sepal_width)]

	species	sepal_length	sepal_width	petal_length	petal_width
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	setosa	4.6	3.6	1	0.2
1	setosa	5.2	4.1	1.5	0.1
2	versicolor	5.4	3	4.5	1.5
3	versicolor	4.9	2.4	3.3	1
4	virginica	4.9	2.5	4.5	1.7
5	virginica	6.2	3.4	5.4	2.3

Just like in R’s data.table, boolean expressions can be passed to the sort function.

Filter observations above the mean of `sepal_length` by species#

# DT[, .SD[sepal_length > mean(sepal_length)], by = species] 
DT[:, [f[:], (f.sepal_length > f.sepal_length.mean()).alias('temp')], by('species')][f.temp==1, :-1]

	species	sepal_length	sepal_width	petal_length	petal_width
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	setosa	5.1	3.5	1.4	0.2
1	setosa	5.4	3.9	1.7	0.4
2	setosa	5.4	3.7	1.5	0.2
3	setosa	5.8	4	1.2	0.2
4	setosa	5.7	4.4	1.5	0.4
5	setosa	5.4	3.9	1.3	0.4
6	setosa	5.1	3.5	1.4	0.3
7	setosa	5.7	3.8	1.7	0.3
8	setosa	5.1	3.8	1.5	0.3
9	setosa	5.4	3.4	1.7	0.2
10	setosa	5.1	3.7	1.5	0.4
11	setosa	5.1	3.3	1.7	0.5
12	setosa	5.2	3.5	1.5	0.2
13	setosa	5.2	3.4	1.4	0.2
14	setosa	5.4	3.4	1.5	0.4
⋮	⋮	⋮	⋮	⋮	⋮
63	virginica	6.7	3.1	5.6	2.4
64	virginica	6.9	3.1	5.1	2.3
65	virginica	6.8	3.2	5.9	2.3
66	virginica	6.7	3.3	5.7	2.5
67	virginica	6.7	3	5.2	2.3

Unlike in R’s data.table, boolean expressions can not be applied within the i section, in the presence of by. The next best thing is to break it down into two steps - create a temporary column to hold the boolean value, and then filter on that column.

Filter rows with group size greater than 10#

# DT[, .SD[.N > 10], keyby = .(species, petal_width)] 
DT[:, [f[:], (dt.count()>10).alias('temp')], by('species', 'petal_width')][f.temp==1, :-1]

	species	petal_width	sepal_length	sepal_width	petal_length
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	setosa	0.2	5.1	3.5	1.4
1	setosa	0.2	4.9	3	1.4
2	setosa	0.2	4.7	3.2	1.3
3	setosa	0.2	4.6	3.1	1.5
4	setosa	0.2	5	3.6	1.4
5	setosa	0.2	5	3.4	1.5
6	setosa	0.2	4.4	2.9	1.4
7	setosa	0.2	5.4	3.7	1.5
8	setosa	0.2	4.8	3.4	1.6
9	setosa	0.2	5.8	4	1.2
10	setosa	0.2	5.4	3.4	1.7
11	setosa	0.2	4.6	3.6	1
12	setosa	0.2	4.8	3.4	1.9
13	setosa	0.2	5	3	1.6
14	setosa	0.2	5.2	3.5	1.5
⋮	⋮	⋮	⋮	⋮	⋮
48	virginica	1.8	6.2	2.8	4.8
49	virginica	1.8	6.1	3	4.9
50	virginica	1.8	6.4	3.1	5.5
51	virginica	1.8	6	3	4.8
52	virginica	1.8	5.9	3	5.1

Get the row with the max petal_length by species.#

# DT[, .SD[which.max(petal_length)], by = species] OR 
# DT[, .SD[petal_length == max(petal_length)], by = species]  

# get rid of temp column

DT[0, :, by('species'), sort(-f.petal_length)]

	species	sepal_length	sepal_width	petal_length	petal_width
	▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	setosa	4.8	3.4	1.9	0.2
1	versicolor	6	2.7	5.1	1.6
2	virginica	7.7	2.6	6.9	2.3

In the above code, we take advantage of the fact that sorting is done within each group; this allows us to pick the first row per group when petal_length is sorted in descending order.

.SDCols#

Including columns in `.SD`#

# col_idx <- grep("^sepal", names(DT)) --> filter for the specicfic columns
# DT[, lapply(.SD, mean), .SDcols = col_idx]

# filter for the specific columns with a list comprehension
names = [name for name in DT.names
         if name.startswith('sepal')]

DT[:, f[names].mean()]

	sepal_length	sepal_width
	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	5.84333	3.05733

Removing columns from `.SD`#

# col_idx <- grep("^(petal|species)", names(DT))
# DT[, lapply(.SD, mean), .SDcols = -col_idx] --> exclusion occurs within .SDcols

# here, exclusion occurs within the list comprehension
names = [name for name in DT.names 
         if not name.startswith(('petal','species'))] 

DT[:, f[names].mean()]

	sepal_length	sepal_width
	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	5.84333	3.05733

Column ranges#

# DT[, lapply(.SD, mean), .SDcols = sepal_length:sepal_width]

DT[:, f['sepal_length':'sepal_width'].mean()]

	sepal_length	sepal_width
	▪▪▪▪▪▪▪▪	▪▪▪▪▪▪▪▪
0	5.84333	3.05733

Summary#

We’ve seen how to replicate .SD in datatable. There are other functionalities in .SD that are not presently possible in Python’s datatable. It is possible that in the future, .SD will be implemented to allow for custom aggregation functions. That would be truly awesome, as it would allow numpy functions and functions from other Python libraries into datatable.

Replicating .SD in Python Datatable

Contents

Replicating .SD in Python Datatable#

.SD - Subset of Data#

Number of unique observations per column#

Mean of all columns by `species`#

Filtering#

First two observations by species#

Last two observations by `species`#

Select the top two sorted by `sepal length` in descending order#

Select the top two sorted by the difference between the `sepal length` and `sepal width`#

Filter observations above the mean of `sepal_length` by species#

Filter rows with group size greater than 10#

Get the row with the max petal_length by species.#

.SDCols#

Including columns in `.SD`#

Removing columns from `.SD`#

Column ranges#

Summary#

Comments#

Replicating .SD in Python Datatable

Contents

Replicating .SD in Python Datatable#

.SD - Subset of Data#

Number of unique observations per column#

Mean of all columns by species#

Filtering#

First two observations by species#

Last two observations by species#

Select the top two sorted by sepal length in descending order#

Select the top two sorted by the difference between the sepal length and sepal width#

Filter observations above the mean of sepal_length by species#

Filter rows with group size greater than 10#

Get the row with the max petal_length by species.#

.SDCols#

Including columns in .SD#

Removing columns from .SD#

Column ranges#

Summary#

Comments#

Mean of all columns by `species`#

Last two observations by `species`#

Select the top two sorted by `sepal length` in descending order#

Select the top two sorted by the difference between the `sepal length` and `sepal width`#

Filter observations above the mean of `sepal_length` by species#

Including columns in `.SD`#

Removing columns from `.SD`#