{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Fast and Efficient Inequality Joins in Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Introduction**\n", "\n", "Pandas supports *equi-join*, where the keys involved in the join are considered to be *equal*. This is implemented via the [merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) and [join](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html#pandas.DataFrame.join) functions. There are scenarios however, where the keys involved might not be equal; some other logical condition is involved in the join. This is known as a *non-equi join* or an *inequality join*. \n", "\n", "Let's look at some examples, culled from [here](https://github.com/pandas-dev/pandas/issues/34543#issue-629846136), that illustrate equi and non-equi joins:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df1 = pd.DataFrame({'left': [1,2,3], 'col_a': [\"A\", \"B\", \"C\"]})\n", "df2 = pd.DataFrame({'right': [0, 2, 3], 'col_b': [\"Z\", \"X\", \"Y\"]})" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftcol_a
01A
12B
23C
\n", "
" ], "text/plain": [ " left col_a\n", "0 1 A\n", "1 2 B\n", "2 3 C" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rightcol_b
00Z
12X
23Y
\n", "
" ], "text/plain": [ " right col_b\n", "0 0 Z\n", "1 2 X\n", "2 3 Y" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Join based on the condition `df1.left == df2.right` - this is an equi join, because the join operator is an *equality* operator, where df1.left is *equal to* df2.right:\n", "\n", "| left | col_a | right | col_b |\n", "|----- |:------:|:-----:|-----:|\n", "| 2 | B | 2 | X |\n", "| 3 | C | 3 | Y |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Join based on the condition `df1.left != df2.right` - this is an example of a non-equi join; it should return rows where df1.left *is not equal to* df2.right:\n", "\n", "| left | col_a | right | col_b |\n", "|----- |:------:|:-----:|-----:|\n", "| 1 | A | 2 | X |\n", "| 1 | A | 3 | Y |\n", "| 2 | B | 3 | Y |\n", "| 1 | A | 0 | Z |\n", "| 2 | B | 0 | Z |\n", "| 3 | C | 0 | Z |\n", "| 3 | C | 2 | X |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Join based on the condition `df1.left > df2.right` - this is another example of a non-equi join; it should return rows where df1.left *is greater than* df2.right:\n", "\n", "| left | col_a | right | col_b |\n", "|----- |:------:|:-----:|-----:|\n", "| 1 | A | 0 | Z |\n", "| 2 | B | 0 | Z |\n", "| 3 | C | 0 | Z |\n", "| 3 | C | 2 | X |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "non-equi joins are supported in SQL, and offer elegant solutions for specific uses cases. A good example is provided [here](https://github.com/pandas-dev/pandas/issues/34543#issue-629846136) - \n", "
\n", "Consider a dataframe `df_process` that contains `process_id`, `process_start_date` and `process_end_date` columns for some business process, and a second dataframe `df_events` that contains an `event_id` and an `event_date` column for particular events. If you want to find all events that occur during some business process, you can easily obtain these by applying the conditional join statement:\n", "\n", "\n", "``df_event.event_date >= df_process.process_start_date & df_event.event_date <= df_process.process_end_date``\n", "
\n", "\n", "The above scenario is an example of a range join, where the join is *between a range of values*. Currently in Pandas, to solve non-equi joins, one option is via an IntervalIndex, if the intervals do not overlap. Another option, which is more common, involves a cartesian join, where every row from the left dataframe is joined to every row from the right dataframe, before it is filtered with the non-equi joins: \n", "\n", "```py\n", "(df_event\n", ".merge(df_process, how = 'cross')\n", ".query('process_start_date <= event_date <= process_end_date')\n", ")\n", "```\n", "\n", "This is not an efficient way to deal with inequality joins, in terms of memory and speed, especially for large data. This is where the [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join) function from [pyjanitor](https://pyjanitor-devs.github.io/pyjanitor/) comes into play. It provides an efficient way to deal with non-equi joins, and does so in a more memory efficient way than cartesian joins, while still being performant. Under the hood it uses binary search, and in some special cases, such as range joins, it uses special algorithms to improve performance." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Strictly Non-equi Joins**\n", "\n", "Let's load the [pyjanitor](https://pyjanitor-devs.github.io/pyjanitor/) library:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# pip install pyjanitor\n", "import janitor as jn\n", "import sys\n", "from janitor import col" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " pandas version : 2.2.0 \n", " janitor version : 0.26.0 \n", " python version : 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) \n", "[GCC 12.3.0]\n" ] } ], "source": [ "print(\" pandas version :\", pd.__version__,\"\\n\", \n", " \"janitor version :\", jn.__version__, \"\\n\",\n", " \"python version :\", sys.version)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's walk through some of the parameters in [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join):\n", "\n", "- df : Left DataFrame.\n", "- right: Named Series or DataFrame to join to `df`.\n", "- conditions : Variable argument of tuple(s) of the form `(left_on, right_on, op)`- \n", " `left_on` is the column from `df`, `right_on` is the column from `right`, \n", " while `op` is the join operator. For multiple conditions, the and(&) operator\n", " is used to combine the results of the individual conditions.\n", "- how: indicates the type of join to be performed. 
Can be one of `left`, `right`, `inner`, or `outer`.\n", "- use_numba: Whether to use `numba`, for possible performance improvement.\n", "- df_columns: Select columns from `df`.\n", "- right_columns: Select columns from `right`.\n", "\n", "\n", "Let's apply the [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join) function to solve the non-equi joins earlier:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftcol_arightcol_b
01A2X
11A3Y
22B3Y
31A0Z
42B0Z
53C0Z
63C2X
\n", "
" ], "text/plain": [ " left col_a right col_b\n", "0 1 A 2 X\n", "1 1 A 3 Y\n", "2 2 B 3 Y\n", "3 1 A 0 Z\n", "4 2 B 0 Z\n", "5 3 C 0 Z\n", "6 3 C 2 X" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.conditional_join(df2, ('left', 'right', '!='))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftcol_arightcol_b
01A0Z
12B0Z
23C0Z
33C2X
\n", "
" ], "text/plain": [ " left col_a right col_b\n", "0 1 A 0 Z\n", "1 2 B 0 Z\n", "2 3 C 0 Z\n", "3 3 C 2 X" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.conditional_join(df2, ('left', 'right', '>'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use the [col](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.utils.col) function:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftcol_arightcol_b
01A2X
11A3Y
22B3Y
31A0Z
42B0Z
53C0Z
63C2X
\n", "
" ], "text/plain": [ " left col_a right col_b\n", "0 1 A 2 X\n", "1 1 A 3 Y\n", "2 2 B 3 Y\n", "3 1 A 0 Z\n", "4 2 B 0 Z\n", "5 3 C 0 Z\n", "6 3 C 2 X" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.conditional_join(df2, col('left') != col('right'))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftcol_arightcol_b
01A0Z
12B0Z
23C0Z
33C2X
\n", "
" ], "text/plain": [ " left col_a right col_b\n", "0 1 A 0 Z\n", "1 2 B 0 Z\n", "2 3 C 0 Z\n", "3 3 C 2 X" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.conditional_join(df2, col('left') > col('right'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For multiple non-equi conditions, the results of the individual conditions are combined with the and(&) operator. Let's look at a range join, to see this in action - the example is adapted from [here](http://da.qcri.org/ntang/pubs/vldbj2016.pdf):" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "east = dict(id = range(100,103), \n", " dur = [140, 100, 90], \n", " rev = [9, 12, 5], \n", " cores = [2, 8, 4])\n", "\n", "west = dict(t_id = [404,498, 676, 742], \n", " time = [100, 140, 80, 90], \n", " cost = [6, 11, 10, 5], \n", " cores = [4, 2, 1, 4])\n", "\n", "east = pd.DataFrame(east)\n", "west = pd.DataFrame(west)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddurrevcores
010014092
1101100128
21029054
\n", "
" ], "text/plain": [ " id dur rev cores\n", "0 100 140 9 2\n", "1 101 100 12 8\n", "2 102 90 5 4" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "east" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
t_idtimecostcores
040410064
1498140112
267680101
37429054
\n", "
" ], "text/plain": [ " t_id time cost cores\n", "0 404 100 6 4\n", "1 498 140 11 2\n", "2 676 80 10 1\n", "3 742 90 5 4" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "west" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Task : Look for any customer from the West Coast who rented a virtual machine for more hours than any customer from the East Coast, but who paid less. Basically, get rows where `east.dur < west.time` and `east.rev > west.cost`. \n", "\n", "This is a range join, and is solved easily and efficiently with [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join):" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddurrevt_idtimecost
01011001249814011
\n", "
" ], "text/plain": [ " id dur rev t_id time cost\n", "0 101 100 12 498 140 11" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(east\n", ".conditional_join(\n", " west, \n", " ('dur', 'time', '<'), \n", " ('rev', 'cost', '>'), \n", " df_columns = ['id', 'dur', 'rev'],\n", " right_columns = ['t_id', 'time', 'cost']\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddurrevt_idtimecost
01011001249814011
\n", "
" ], "text/plain": [ " id dur rev t_id time cost\n", "0 101 100 12 498 140 11" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(east\n", ".conditional_join(\n", " west, \n", " col('dur') < col('time'),\n", " col('rev') > col('cost'),\n", " df_columns = ['id', 'dur', 'rev'],\n", " right_columns = ['t_id', 'time', 'cost']\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For range joins, and when `use_numba = False`, [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join) uses an algorithm based on this [vertica blog post](https://www.vertica.com/blog/what-is-a-range-join-and-why-is-it-so-fastba-p223413/). \n", "\n", "You might get more performance with `use_numba = True`. The implementation is based on the algorithm in this [publication](https://www.scitepress.org/papers/2018/68268/68268.pdf)\n", "\n", "Note that non-equi join is supported only for numeric and date types." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Equi and Non-equi Joins**\n", "\n", "[conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join) also supports a combination of equi and non-equi join conditions - under the hood it uses Pandas' internal merge function to get the matching pairs, before pruning the non-equi joins to get the final dataframe. 
Depending on the size of the dataframes, this might offer more performance than the classic merge and filter approach:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame({'id': [1,1,1,2,2,3], \n", " 'value_1': [2,5,7,1,3,4]})\n", " \n", "df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], \n", " 'value_2A': [0,3,7,12,0,2,3,1], \n", " 'value_2B': [1,5,9,15,1,4,6,3]})" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idvalue_1
012
115
217
321
423
534
\n", "
" ], "text/plain": [ " id value_1\n", "0 1 2\n", "1 1 5\n", "2 1 7\n", "3 2 1\n", "4 2 3\n", "5 3 4" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idvalue_2Avalue_2B
0101
1135
2179
311215
4201
5224
6236
7313
\n", "
" ], "text/plain": [ " id value_2A value_2B\n", "0 1 0 1\n", "1 1 3 5\n", "2 1 7 9\n", "3 1 12 15\n", "4 2 0 1\n", "5 2 2 4\n", "6 2 3 6\n", "7 3 1 3" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Classic Pandas' merge and filter approach:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idvalue_1value_2Avalue_2B
51535
101779
122101
162324
172336
\n", "
" ], "text/plain": [ " id value_1 value_2A value_2B\n", "5 1 5 3 5\n", "10 1 7 7 9\n", "12 2 1 0 1\n", "16 2 3 2 4\n", "17 2 3 3 6" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df1\n", ".merge(df2, on = 'id')\n", ".query('value_2A <= value_1 <= value_2B')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join):" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftright
idvalue_1idvalue_2Avalue_2B
015135
117179
221201
323224
423236
\n", "
" ], "text/plain": [ " left right \n", " id value_1 id value_2A value_2B\n", "0 1 5 1 3 5\n", "1 1 7 1 7 9\n", "2 2 1 2 0 1\n", "3 2 3 2 2 4\n", "4 2 3 2 3 6" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df1\n", ".conditional_join(\n", " df2,\n", " ('id', 'id', '=='),\n", " ('value_1', 'value_2A', '>='),\n", " ('value_1', 'value_2B', '<=')\n", "))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftright
idvalue_1idvalue_2Avalue_2B
015135
117179
221201
323224
423236
\n", "
" ], "text/plain": [ " left right \n", " id value_1 id value_2A value_2B\n", "0 1 5 1 3 5\n", "1 1 7 1 7 9\n", "2 2 1 2 0 1\n", "3 2 3 2 2 4\n", "4 2 3 2 3 6" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df1\n", ".conditional_join(\n", " df2,\n", " col('id') == col('id'),\n", " col('value_1') >= col('value_2A'),\n", " col('value_1') <= col('value_2B')\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the above returned a MultiIndex column; if the columns from the left and right dataframes have something in common a MultiIndex column is returned. Of course, you can select/rename columns with `df_columns` and `right_columns` to get a single index column, if that is preferred." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Non-equi joins with OR**\n", "\n", "For multiple non-equi joins, an and(&) operator is applied, to combine the results of the individual conditions. There are scenarios however, where the join might have an *OR* condition. In this case, the joins are executed individually, and the resulting dataframes can then be concatenated into one. 
Let's look at an example from [Nelson Tang's blog post](https://www.nelsontang.com/blog/2021-11-25-pandas-conditional-merging):" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "sales_volume_table = pd.DataFrame.from_dict([\n", " {'date':'2021-11-15', 'quantity':1, 'brand':'Outdoor'},\n", " {'date':'2021-11-20', 'quantity':2, 'brand':'Leisure'},\n", " {'date':'2021-11-25', 'quantity':3, 'brand':'Athletic'},\n", " {'date':'2021-11-26', 'quantity':2, 'brand':'Outdoor'},\n", "])\n", "\n", "promos_table = pd.DataFrame.from_dict([\n", " {'start_date':'2021-11-01', 'end_date':'2021-11-25',\n", " 'brand':'ANY', 'rebate_per_unit':3},\n", " {'start_date':'2021-11-25', 'end_date':'2021-11-26',\n", " 'brand':'Outdoor', 'rebate_per_unit':5},\n", "])\n", "\n", "sales_volume_table['date'] = pd.to_datetime(sales_volume_table['date'])\n", "promos_table['start_date'] = pd.to_datetime(promos_table['start_date'])\n", "promos_table['end_date'] = pd.to_datetime(promos_table['end_date'])\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datequantitybrand
02021-11-151Outdoor
12021-11-202Leisure
22021-11-253Athletic
32021-11-262Outdoor
\n", "
" ], "text/plain": [ " date quantity brand\n", "0 2021-11-15 1 Outdoor\n", "1 2021-11-20 2 Leisure\n", "2 2021-11-25 3 Athletic\n", "3 2021-11-26 2 Outdoor" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales_volume_table" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
start_dateend_datebrandrebate_per_unit
02021-11-012021-11-25ANY3
12021-11-252021-11-26Outdoor5
\n", "
" ], "text/plain": [ " start_date end_date brand rebate_per_unit\n", "0 2021-11-01 2021-11-25 ANY 3\n", "1 2021-11-25 2021-11-26 Outdoor 5" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "promos_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Problem statement, culled from the blog post: \n", "\n", "1. The date in the left table was between two dates (a start and end date) in the second table\n", "\n", "2. ...AND the values in two other columns matched each other, OR the column on the right table was equal to 'ANY' (aka a 'wildcard' value)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Translating this into SQL is [easy](https://www.nelsontang.com/blog/2021-11-25-pandas-conditional-merging): \n", "```SQL\n", "SELECT *\n", "FROM sales_volume_table, promos_table\n", "WHERE (sales_volume_table.brand=promos_table.brand or promos_table.brand='ANY')\n", "AND (start_date <= date AND date <= end_date)\n", "```\n", "\n", "Replicating this in Pandas is more involved but doable:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftright
datequantitybrandstart_dateend_datebrandrebate_per_unit
02021-11-262Outdoor2021-11-252021-11-26Outdoor5
02021-11-151Outdoor2021-11-012021-11-25ANY3
12021-11-202Leisure2021-11-012021-11-25ANY3
22021-11-253Athletic2021-11-012021-11-25ANY3
\n", "
" ], "text/plain": [ " left right \n", " date quantity brand start_date end_date brand rebate_per_unit\n", "0 2021-11-26 2 Outdoor 2021-11-25 2021-11-26 Outdoor 5\n", "0 2021-11-15 1 Outdoor 2021-11-01 2021-11-25 ANY 3\n", "1 2021-11-20 2 Leisure 2021-11-01 2021-11-25 ANY 3\n", "2 2021-11-25 3 Athletic 2021-11-01 2021-11-25 ANY 3" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top = (sales_volume_table\n", " .conditional_join(\n", " promos_table,\n", " ('brand', 'brand', '=='),\n", " ('date', 'start_date', '>='),\n", " ('date', 'end_date', '<=')\n", " )\n", " )\n", "\n", "bottom = (sales_volume_table\n", " .conditional_join(\n", " promos_table.query('brand == \"ANY\"'),\n", " ('date', 'start_date', '>='),\n", " ('date', 'end_date', '<=')\n", " )\n", " )\n", "\n", "pd.concat([top, bottom])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Performance**\n", "\n", "Why bother with [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join)? simple syntax, memory and speed efficiency. Let's see the memory and speed benefits, compared to a cartesian join (the data used below is adapted from [here](https://stackoverflow.com/a/25655497/7175713)):" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/memory_profiler.py:1136: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", " ipython_version = LooseVersion(IPython.__version__)\n", "/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/setuptools/_distutils/version.py:345: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", " other = LooseVersion(other)\n" ] } ], "source": [ "%load_ext memory_profiler\n", "from string import ascii_lowercase\n", "import numpy as np\n", "np.random.seed(1)\n", "n = 2_000; k = 10_000\n", "\n", "idx1 = np.random.randint(0, high = 100, size = n)\n", "idx2 = np.random.randint(0, high = 100, size = n)\n", "\n", "d1 = dict(x = np.random.choice(list(ascii_lowercase[:5]), size=n),\n", " start = np.minimum(idx1, idx2),\n", " end = np.maximum(idx1, idx2))\n", "\n", "d2 = dict(x = np.random.choice(list(ascii_lowercase[:15]), size=k),\n", " pos1 = np.random.randint(low=60, high = 151, size=k))\n", "\n", "d1 = pd.DataFrame(d1)\n", "d2 = pd.DataFrame(d2)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xstartend
0c3761
1d1216
2b7279
3c974
4d7576
\n", "
" ], "text/plain": [ " x start end\n", "0 c 37 61\n", "1 d 12 16\n", "2 b 72 79\n", "3 c 9 74\n", "4 d 75 76" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d1.head()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xpos1
0f126
1m121
2o128
3a96
4n61
\n", "
" ], "text/plain": [ " x pos1\n", "0 f 126\n", "1 m 121\n", "2 o 128\n", "3 a 96\n", "4 n 61" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d2.head()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2000, 10000)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(d1), len(d2)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2740898, 3)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = (d1\n", " .merge(d2, how = 'cross')\n", " .query(\"start <= pos1 <= end\")\n", " .filter(['start', 'end', 'pos1'])\n", " )\n", "A.shape" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
startendpos1
4376161
53376160
74376160
104376160
105376160
\n", "
" ], "text/plain": [ " start end pos1\n", "4 37 61 61\n", "53 37 61 60\n", "74 37 61 60\n", "104 37 61 60\n", "105 37 61 60" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A.head()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2740898, 3)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B = (d1\n", " .conditional_join(\n", " d2, \n", " ('start', 'pos1', '<='), \n", " ('end', 'pos1', '>='), \n", " df_columns=['start', 'end'],\n", " right_columns='pos1',\n", " use_numba=False)\n", " )\n", "B.shape" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
startendpos1
0376160
1376160
2376160
3376160
4376160
\n", "
" ], "text/plain": [ " start end pos1\n", "0 37 61 60\n", "1 37 61 60\n", "2 37 61 60\n", "3 37 61 60\n", "4 37 61 60" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B.head()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2740898, 3)" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C = (d1\n", " .conditional_join(\n", " d2, \n", " ('start', 'pos1', '<='), \n", " ('end', 'pos1', '>='), \n", " df_columns=['start', 'end'],\n", " right_columns='pos1',\n", " use_numba=True)\n", " )\n", "C.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the outputs match:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(True, True, True)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols = ['start', 'end', 'pos1']\n", "A = A.sort_values(cols, ignore_index=True)\n", "B = B.sort_values(cols, ignore_index=True)\n", "C = C.sort_values(cols, ignore_index=True)\n", "A.equals(B), A.equals(C), B.equals(C)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's compare the speed:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.36 s ± 92.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%timeit d1.merge(d2, how = 'cross').query(\"start <= pos1 <= end\")" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "57.8 ms ± 1.48 ms per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n" ] } ], "source": [ "%timeit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=False)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "74.8 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "%timeit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare memory usage:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 3396.63 MiB, increment: 2683.73 MiB\n" ] } ], "source": [ "%memit d1.merge(d2, how = 'cross').query(\"start <= pos1 <= end\")" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 713.18 MiB, increment: 0.00 MiB\n" ] } ], "source": [ "%memit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=False)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 725.80 MiB, increment: 12.62 MiB\n" ] } ], "source": [ "%memit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's run another performance test - the code below is a self join to find overlapping events in a 30,000 row DataFrame, and adapted from [DuckDB](https://github.com/duckdb/duckdb/blob/master/benchmark/micro/join/iejoin_events.benchmark):" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnameaudiencestartsponsorend
01Event 111782022-11-19 10:00:00Sponsor 22022-11-19 10:15:00
12Event 214462015-09-27 15:00:00Sponsor 112015-09-27 15:11:00
23Event 322612019-11-12 18:00:00Sponsor 102019-11-12 18:53:00
34Event 414712019-12-24 22:00:00Sponsor 62019-12-24 22:11:00
45Event 526052028-06-20 12:00:00Sponsor 82028-06-20 12:31:00
\n", "
" ], "text/plain": [ " id name audience start sponsor end\n", "0 1 Event 1 1178 2022-11-19 10:00:00 Sponsor 2 2022-11-19 10:15:00\n", "1 2 Event 2 1446 2015-09-27 15:00:00 Sponsor 11 2015-09-27 15:11:00\n", "2 3 Event 3 2261 2019-11-12 18:00:00 Sponsor 10 2019-11-12 18:53:00\n", "3 4 Event 4 1471 2019-12-24 22:00:00 Sponsor 6 2019-12-24 22:11:00\n", "4 5 Event 5 2605 2028-06-20 12:00:00 Sponsor 8 2028-06-20 12:31:00" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = 'https://raw.githubusercontent.com/samukweku/data-wrangling-blog/master/notebooks/Data_files/results.csv'\n", "events = pd.read_csv(url, parse_dates=['start', 'end']).iloc[:, 1:]\n", "events.head()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftright
idstartendidstartend
0101993-11-27 12:00:001993-11-27 12:37:0023451993-11-27 10:00:001993-11-27 12:00:00
1151993-04-04 16:00:001993-04-04 18:00:00111781993-04-04 17:00:001993-04-04 17:22:00
2172030-10-25 07:00:002030-10-25 07:27:00196052030-10-25 06:00:002030-10-25 08:00:00
3262005-10-04 17:00:002005-10-04 17:18:0082182005-10-04 17:00:002005-10-04 17:27:00
4352024-05-02 15:00:002024-05-02 15:35:0069162024-05-02 15:00:002024-05-02 15:36:00
.....................
3697299662000-08-26 11:00:002000-08-26 13:00:00293752000-08-26 13:00:002000-08-26 13:53:00
3698299712018-05-18 04:00:002018-05-18 04:18:00241732018-05-18 04:00:002018-05-18 04:36:00
3699299781992-06-07 22:00:001992-06-07 22:23:009811992-06-07 22:00:001992-06-07 22:30:00
3700299842025-06-05 03:00:002025-06-05 03:17:00190512025-06-05 01:00:002025-06-05 03:00:00
3701299952016-09-04 14:00:002016-09-04 14:32:00122962016-09-04 14:00:002016-09-04 14:50:00
\n", "

3702 rows × 6 columns

\n", "
" ], "text/plain": [ " left right \\\n", " id start end id \n", "0 10 1993-11-27 12:00:00 1993-11-27 12:37:00 2345 \n", "1 15 1993-04-04 16:00:00 1993-04-04 18:00:00 11178 \n", "2 17 2030-10-25 07:00:00 2030-10-25 07:27:00 19605 \n", "3 26 2005-10-04 17:00:00 2005-10-04 17:18:00 8218 \n", "4 35 2024-05-02 15:00:00 2024-05-02 15:35:00 6916 \n", "... ... ... ... ... \n", "3697 29966 2000-08-26 11:00:00 2000-08-26 13:00:00 29375 \n", "3698 29971 2018-05-18 04:00:00 2018-05-18 04:18:00 24173 \n", "3699 29978 1992-06-07 22:00:00 1992-06-07 22:23:00 981 \n", "3700 29984 2025-06-05 03:00:00 2025-06-05 03:17:00 19051 \n", "3701 29995 2016-09-04 14:00:00 2016-09-04 14:32:00 12296 \n", "\n", " \n", " start end \n", "0 1993-11-27 10:00:00 1993-11-27 12:00:00 \n", "1 1993-04-04 17:00:00 1993-04-04 17:22:00 \n", "2 2030-10-25 06:00:00 2030-10-25 08:00:00 \n", "3 2005-10-04 17:00:00 2005-10-04 17:27:00 \n", "4 2024-05-02 15:00:00 2024-05-02 15:36:00 \n", "... ... ... \n", "3697 2000-08-26 13:00:00 2000-08-26 13:53:00 \n", "3698 2018-05-18 04:00:00 2018-05-18 04:36:00 \n", "3699 1992-06-07 22:00:00 1992-06-07 22:30:00 \n", "3700 2025-06-05 01:00:00 2025-06-05 03:00:00 \n", "3701 2016-09-04 14:00:00 2016-09-04 14:50:00 \n", "\n", "[3702 rows x 6 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(events\n", ".conditional_join(\n", " events,\n", " ('start', 'end', '<='),\n", " ('end', 'start', '>='),\n", " ('id', 'id', '!='),\n", " use_numba = False,\n", " df_columns = ['id', 'start', 'end'],\n", " right_columns = ['id', 'start', 'end'])\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cartesian join takes a very long time - too long for us to test here. The focus for this test will just be on [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join)." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "26.2 ms ± 961 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "%%timeit\n", "(events\n", ".conditional_join(\n", " events,\n", " ('start', 'end', '<='),\n", " ('end', 'start', '>='),\n", " ('id', 'id', '!='),\n", " use_numba = False,\n", " df_columns = ['id', 'start', 'end'],\n", " right_columns = ['id', 'start', 'end'])\n", ")" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "31.3 ms ± 3.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "%%timeit\n", "(events\n", ".conditional_join(\n", " events,\n", " ('start', 'end', '<='),\n", " ('end', 'start', '>='),\n", " ('id', 'id', '!='),\n", " use_numba = True,\n", " df_columns = ['id', 'start', 'end'],\n", " right_columns = ['id', 'start', 'end'])\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at performance when an equi join is present:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "from string import ascii_lowercase\n", "np.random.seed(1)\n", "n = 20_000; k = 50_000\n", "\n", "idx1 = np.random.randint(0, high = 100, size = n)\n", "idx2 = np.random.randint(0, high = 100, size = n)\n", "\n", "d1 = dict(x = np.random.choice(list(ascii_lowercase[:5]), size=n),\n", " start = np.minimum(idx1, idx2),\n", " end = np.maximum(idx1, idx2))\n", "\n", "d2 = dict(x = np.random.choice(list(ascii_lowercase[:15]), size=k),\n", " pos1 = np.random.randint(low=60, high = 151, size=k))\n", "\n", "d1 = pd.DataFrame(d1)\n", "d2 = pd.DataFrame(d2)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8886840, 4)" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out = (d1\n", " 
.merge(d2, on = 'x')\n", "  .query('start <= pos1 <= end')\n", "  )\n", "out.shape" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8886840, 5)" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out = (d1\n", "  .conditional_join(\n", "  d2, \n", "  ('start', 'pos1', '<='), \n", "  ('end', 'pos1', '>='), \n", "  ('x', 'x', '=='), \n", "  use_numba=False)\n", "  )\n", "out.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "About 9 million rows are returned; let's see how fast each function runs:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.17 s ± 24.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "(d1\n", ".merge(d2, on = 'x')\n", ".query('start <= pos1 <= end')\n", ")" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.33 s ± 40.3 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "(d1\n", ".conditional_join(\n", " d2, \n", " ('start', 'pos1', '<='), \n", " ('end', 'pos1', '>='), \n", " ('x', 'x', '=='), \n", " use_numba=False)\n", ")" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 5829.88 MiB, increment: 4767.70 MiB\n" ] } ], "source": [ "%memit d1.merge(d2, on = 'x').query('start <= pos1 <= end')" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 4135.80 MiB, increment: 3073.66 MiB\n" ] } ], "source": [ "%memit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), ('x', 'x', '=='), use_numba=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join) is faster; this might not be the case all the time. \n", "\n", "Depending on the distribution of data, using `numba` where an equi-join is present can give significant performance speedups:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "from itertools import count\n", "mapper = dict(zip(ascii_lowercase, count()))\n", "d1['xx'] = d1.x.map(mapper)\n", "d2['xx'] = d2.x.map(mapper)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xstartendxx
0d37823
1c2122
2b46721
3b791
4c39752
\n", "
" ], "text/plain": [ " x start end xx\n", "0 d 37 82 3\n", "1 c 2 12 2\n", "2 b 46 72 1\n", "3 b 7 9 1\n", "4 c 39 75 2" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d1.head()" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xpos1xx
0c1432
1i858
2k12110
3m6812
4o9714
\n", "
" ], "text/plain": [ " x pos1 xx\n", "0 c 143 2\n", "1 i 85 8\n", "2 k 121 10\n", "3 m 68 12\n", "4 o 97 14" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d2.head()" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.97 s ± 47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "(d1\n", ".merge(d2, on = 'xx')\n", ".query('start <= pos1 <= end')\n", ")" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.45 s ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "(d1\n", ".conditional_join(\n", " d2, \n", " ('start', 'pos1', '<='), \n", " ('end', 'pos1', '>='), \n", " ('xx', 'xx', '=='), \n", " use_numba=False)\n", ")" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "324 ms ± 11.1 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "(d1\n", ".conditional_join(\n", "  d2, \n", "  ('start', 'pos1', '<='), \n", "  ('end', 'pos1', '>='), \n", "  ('xx', 'xx', '=='), \n", "  use_numba=True)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The numba implementation above relies on a few conditions for its speedups:\n", "- Only a single equi join is supported\n", "- The columns in the equi join should be numeric or datetime data types\n", "- The columns in the equi join should contain many duplicates - a small number of unique values spread over a large range of rows" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-07-30T10:57:18.265994Z", "iopub.status.busy": "2020-07-30T10:57:18.265751Z", "iopub.status.idle": "2020-07-30T10:57:18.283104Z", "shell.execute_reply": "2020-07-30T10:57:18.280684Z", "shell.execute_reply.started": "2020-07-30T10:57:18.265970Z" } }, "source": [ "## **Summary**\n", "\n", "This blog post shows how to join efficiently on non-equi conditions, using [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join). If you are comfortable with SQL, you could get more performance with [DuckDB](https://duckdb.org), which allows querying Pandas DataFrames efficiently with SQL, supports non-equi joins as well, and has a [performant implementation](https://duckdb.org/2022/05/27/iejoin.html) for range joins." ] } ], "metadata": { "jupytext": { "formats": "ipynb,md" }, "kernelspec": { "display_name": "Python 3.9.12 ('playground')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" }, "vscode": { "interpreter": { "hash": "2a407888a31a0a9019767319ea2d3065f86a5891a7d6f13880fa432bbfc1e5b5" } } }, "nbformat": 4, "nbformat_minor": 4 }