Fast and Efficient Inequality Joins in Pandas#
Introduction#
Pandas supports equi joins, where the keys involved in the join are compared for equality. This is implemented via the merge and join functions. There are scenarios, however, where the keys are not compared for equality; some other logical condition is involved in the join. This is known as a non-equi join or an inequality join.
Let’s look at some examples, culled from here, that illustrate equi and non-equi joins:
import pandas as pd
df1 = pd.DataFrame({'left': [1,2,3], 'col_a': ["A", "B", "C"]})
df2 = pd.DataFrame({'right': [0, 2, 3], 'col_b': ["Z", "X", "Y"]})
df1
left | col_a | |
---|---|---|
0 | 1 | A |
1 | 2 | B |
2 | 3 | C |
df2
right | col_b | |
---|---|---|
0 | 0 | Z |
1 | 2 | X |
2 | 3 | Y |
Join based on the condition `df1.left == df2.right` - this is an equi join, because the join operator is an equality operator, where `df1.left` is equal to `df2.right`:
left | col_a | right | col_b
---|---|---|---
2 | B | 2 | X
3 | C | 3 | Y
Join based on the condition `df1.left != df2.right` - this is an example of a non-equi join; it should return rows where `df1.left` is not equal to `df2.right`:
left | col_a | right | col_b
---|---|---|---
1 | A | 2 | X
1 | A | 3 | Y
2 | B | 3 | Y
1 | A | 0 | Z
2 | B | 0 | Z
3 | C | 0 | Z
3 | C | 2 | X
Join based on the condition `df1.left > df2.right` - this is another example of a non-equi join; it should return rows where `df1.left` is greater than `df2.right`:
left | col_a | right | col_b
---|---|---|---
1 | A | 0 | Z
2 | B | 0 | Z
3 | C | 0 | Z
3 | C | 2 | X
Non-equi joins are supported in SQL, and offer elegant solutions for specific use cases. A good example is provided here:
Consider a dataframe `df_process` that contains `process_id`, `process_start_date` and `process_end_date` columns for some business process, and a second dataframe `df_events` that contains an `event_id` and an `event_date` column for particular events. If you want to find all events that occur during some business process, you can easily obtain these by applying the conditional join statement:
df_events.event_date >= df_process.process_start_date & df_events.event_date <= df_process.process_end_date
The above scenario is an example of a range join, where the join is between a range of values. Currently in Pandas, one option for solving non-equi joins is an IntervalIndex, provided the intervals do not overlap. Another, more common option involves a cartesian join, where every row from the left dataframe is joined to every row from the right dataframe, before the result is filtered with the non-equi conditions:
(df_events
.merge(df_process, how = 'cross')
.query('process_start_date <= event_date <= process_end_date')
)
A cartesian join is not an efficient way to handle inequality joins, in terms of memory or speed, especially for large data. This is where the conditional_join function from pyjanitor comes into play. It handles non-equi joins with far less memory than a cartesian join while remaining performant. Under the hood it uses binary search, and in some special cases, such as range joins, it uses dedicated algorithms to improve performance.
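As a preview, here is a minimal sketch of what that range join could look like with conditional_join; the df_events and df_process frames below are made up purely for illustration:

# pip install pyjanitor
import pandas as pd
import janitor  # registers the conditional_join accessor on DataFrames

# Small, made-up frames that follow the scenario described above.
df_process = pd.DataFrame(
    {'process_id': [1, 2],
     'process_start_date': pd.to_datetime(['2023-01-01', '2023-02-01']),
     'process_end_date': pd.to_datetime(['2023-01-31', '2023-02-28'])})
df_events = pd.DataFrame(
    {'event_id': [10, 11, 12],
     'event_date': pd.to_datetime(['2023-01-15', '2023-02-10', '2023-03-05'])})

# All events that fall within a process window - the range join expressed
# directly as two non-equi conditions, with no cartesian intermediate.
out = df_events.conditional_join(
    df_process,
    ('event_date', 'process_start_date', '>='),
    ('event_date', 'process_end_date', '<='))
print(out)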
Strictly Non-equi Joins#
Let’s load the pyjanitor library:
# pip install pyjanitor
import janitor as jn
import sys
print("pandas version :", pd.__version__)
print("janitor version :", jn.__version__)
print("python version :", sys.version)
pandas version : 2.2.2
janitor version : 0.28.1
python version : 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
Let’s walk through some of the parameters in conditional_join:
- `df`: the left DataFrame.
- `right`: named Series or DataFrame to join to `df`.
- `conditions`: variable argument of tuple(s) of the form `(left_on, right_on, op)`, where `left_on` is the column from `df`, `right_on` is the column from `right`, and `op` is the join operator. For multiple conditions, the and (`&`) operator is used to combine the results of the individual conditions.
- `how`: indicates the type of join to be performed; can be one of `left`, `right`, `inner`, or `outer` (a quick illustration of a left join follows this list).
- `use_numba`: whether to use `numba`, for possible performance improvement.
- `df_columns`: select columns from `df`.
- `right_columns`: select columns from `right`.
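The examples in this post use the default inner join. As a quick illustration of the how parameter (a sketch based on the parameter description above, not one of the original examples), a left join keeps every row of df and fills the right-hand columns with missing values where no match exists:

# Keep all rows of df1; rows with no match on the right get NaN.
df1.conditional_join(df2, ('left', 'right', '>'), how='left')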
Let’s apply the conditional_join function to solve the non-equi joins earlier:
df1.conditional_join(df2, ('left', 'right', '!='))
left | col_a | right | col_b | |
---|---|---|---|---|
0 | 1 | A | 2 | X |
1 | 1 | A | 3 | Y |
2 | 2 | B | 3 | Y |
3 | 1 | A | 0 | Z |
4 | 2 | B | 0 | Z |
5 | 3 | C | 0 | Z |
6 | 3 | C | 2 | X |
df1.conditional_join(df2, ('left', 'right', '>'))
left | col_a | right | col_b | |
---|---|---|---|---|
0 | 1 | A | 0 | Z |
1 | 2 | B | 0 | Z |
2 | 3 | C | 0 | Z |
3 | 3 | C | 2 | X |
For multiple non-equi conditions, the results of the individual conditions are combined with the and (`&`) operator. Let's look at a range join to see this in action - the example is adapted from here:
east = dict(id = range(100,103),
dur = [140, 100, 90],
rev = [9, 12, 5],
cores = [2, 8, 4])
west = dict(t_id = [404,498, 676, 742],
time = [100, 140, 80, 90],
cost = [6, 11, 10, 5],
cores = [4, 2, 1, 4])
east = pd.DataFrame(east)
west = pd.DataFrame(west)
east
id | dur | rev | cores | |
---|---|---|---|---|
0 | 100 | 140 | 9 | 2 |
1 | 101 | 100 | 12 | 8 |
2 | 102 | 90 | 5 | 4 |
west
t_id | time | cost | cores | |
---|---|---|---|---|
0 | 404 | 100 | 6 | 4 |
1 | 498 | 140 | 11 | 2 |
2 | 676 | 80 | 10 | 1 |
3 | 742 | 90 | 5 | 4 |
Task: Look for any customer from the West Coast who rented a virtual machine for more hours than any customer from the East Coast, but who paid less. Basically, get rows where `east.dur < west.time` and `east.rev > west.cost`.
This is a range join, and is solved easily and efficiently with conditional_join:
(east
.conditional_join(
west,
('dur', 'time', '<'),
('rev', 'cost', '>'),
df_columns = ['id', 'dur', 'rev'],
right_columns = ['t_id', 'time', 'cost']
)
)
id | dur | rev | t_id | time | cost | |
---|---|---|---|---|---|---|
0 | 101 | 100 | 12 | 498 | 140 | 11 |
For range joins, when `use_numba = False`, conditional_join uses an algorithm based on this Vertica blog post. You might get more performance with `use_numba = True`; that implementation is based on the algorithm in this publication.
Note that non-equi joins are supported only for numeric and date data types.
Equi and Non-equi Joins#
conditional_join also supports a combination of equi and non-equi join conditions - under the hood it uses Pandas' internal merge function to get the matching pairs, before pruning with the non-equi conditions to get the final dataframe. Depending on the size of the dataframes, this might offer more performance than the classic merge and filter approach:
df1 = pd.DataFrame({'id': [1,1,1,2,2,3],
'value_1': [2,5,7,1,3,4]})
df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3],
'value_2A': [0,3,7,12,0,2,3,1],
'value_2B': [1,5,9,15,1,4,6,3]})
df1
id | value_1 | |
---|---|---|
0 | 1 | 2 |
1 | 1 | 5 |
2 | 1 | 7 |
3 | 2 | 1 |
4 | 2 | 3 |
5 | 3 | 4 |
df2
id | value_2A | value_2B | |
---|---|---|---|
0 | 1 | 0 | 1 |
1 | 1 | 3 | 5 |
2 | 1 | 7 | 9 |
3 | 1 | 12 | 15 |
4 | 2 | 0 | 1 |
5 | 2 | 2 | 4 |
6 | 2 | 3 | 6 |
7 | 3 | 1 | 3 |
The classic Pandas merge and filter approach:
(df1
.merge(df2, on = 'id')
.query('value_2A <= value_1 <= value_2B')
)
id | value_1 | value_2A | value_2B | |
---|---|---|---|---|
5 | 1 | 5 | 3 | 5 |
10 | 1 | 7 | 7 | 9 |
12 | 2 | 1 | 0 | 1 |
16 | 2 | 3 | 2 | 4 |
17 | 2 | 3 | 3 | 6 |
With conditional_join:
(df1
.conditional_join(
df2,
('id', 'id', '=='),
('value_1', 'value_2A', '>='),
('value_1', 'value_2B', '<=')
))
left | right | ||||
---|---|---|---|---|---|
id | value_1 | id | value_2A | value_2B | |
0 | 1 | 5 | 1 | 3 | 5 |
1 | 1 | 7 | 1 | 7 | 9 |
2 | 2 | 1 | 2 | 0 | 1 |
3 | 2 | 3 | 2 | 2 | 4 |
4 | 2 | 3 | 2 | 3 | 6 |
Note that the above returned MultiIndex columns; whenever the columns from the left and right dataframes have names in common, a MultiIndex column is returned. Of course, you can select/rename columns with `df_columns` and `right_columns` to get a single-level column index, if that is preferred.
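For instance, building on the example above, excluding the duplicated id column from the right frame keeps the output to a single-level column index (a sketch; the column selections are just one possible choice):

(df1
.conditional_join(
    df2,
    ('id', 'id', '=='),
    ('value_1', 'value_2A', '>='),
    ('value_1', 'value_2B', '<='),
    df_columns = ['id', 'value_1'],
    right_columns = ['value_2A', 'value_2B'])
)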
Non-equi joins with OR#
For multiple non-equi joins, an and (`&`) operator is applied to combine the results of the individual conditions. There are scenarios, however, where the join might have an OR condition. In this case, the joins are executed individually, and the resulting dataframes can then be concatenated into one. Let's look at an example from Nelson Tang's blog post:
sales_volume_table = pd.DataFrame.from_dict([
{'date':'2021-11-15', 'quantity':1, 'brand':'Outdoor'},
{'date':'2021-11-20', 'quantity':2, 'brand':'Leisure'},
{'date':'2021-11-25', 'quantity':3, 'brand':'Athletic'},
{'date':'2021-11-26', 'quantity':2, 'brand':'Outdoor'},
])
promos_table = pd.DataFrame.from_dict([
{'start_date':'2021-11-01', 'end_date':'2021-11-25',
'brand':'ANY', 'rebate_per_unit':3},
{'start_date':'2021-11-25', 'end_date':'2021-11-26',
'brand':'Outdoor', 'rebate_per_unit':5},
])
sales_volume_table['date'] = pd.to_datetime(sales_volume_table['date'])
promos_table['start_date'] = pd.to_datetime(promos_table['start_date'])
promos_table['end_date'] = pd.to_datetime(promos_table['end_date'])
sales_volume_table
date | quantity | brand | |
---|---|---|---|
0 | 2021-11-15 | 1 | Outdoor |
1 | 2021-11-20 | 2 | Leisure |
2 | 2021-11-25 | 3 | Athletic |
3 | 2021-11-26 | 2 | Outdoor |
promos_table
start_date | end_date | brand | rebate_per_unit | |
---|---|---|---|---|
0 | 2021-11-01 | 2021-11-25 | ANY | 3 |
1 | 2021-11-25 | 2021-11-26 | Outdoor | 5 |
Problem statement, culled from the blog post:
The date in the left table was between two dates (a start and end date) in the second table
…AND the values in two other columns matched each other, OR the column on the right table was equal to ‘ANY’ (aka a ‘wildcard’ value)
Translating this into SQL is easy:
SELECT *
FROM sales_volume_table, promos_table
WHERE (sales_volume_table.brand=promos_table.brand or promos_table.brand='ANY')
AND (start_date <= date AND date <= end_date)
Replicating this in Pandas is more involved but doable:
top = (sales_volume_table
.conditional_join(
promos_table,
('brand', 'brand', '=='),
('date', 'start_date', '>='),
('date', 'end_date', '<=')
)
)
bottom = (sales_volume_table
.conditional_join(
promos_table.query('brand == "ANY"'),
('date', 'start_date', '>='),
('date', 'end_date', '<=')
)
)
pd.concat([top, bottom])
left | right | ||||||
---|---|---|---|---|---|---|---|
date | quantity | brand | start_date | end_date | brand | rebate_per_unit | |
0 | 2021-11-26 | 2 | Outdoor | 2021-11-25 | 2021-11-26 | Outdoor | 5 |
0 | 2021-11-15 | 1 | Outdoor | 2021-11-01 | 2021-11-25 | ANY | 3 |
1 | 2021-11-20 | 2 | Leisure | 2021-11-01 | 2021-11-25 | ANY | 3 |
2 | 2021-11-25 | 3 | Athletic | 2021-11-01 | 2021-11-25 | ANY | 3 |
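One caveat with the concatenation approach, although it does not arise in this particular example: if both branches of the OR can match the same pair of rows, that pair appears twice in the concatenated result, whereas SQL's OR returns it once. Dropping duplicates after the concat restores the SQL behaviour:

# Only needed if the individual joins can produce overlapping matches.
pd.concat([top, bottom]).drop_duplicates()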
Performance#
Why bother with conditional_join? Simple syntax, plus memory and speed efficiency. Let's see the memory and speed benefits compared to a cartesian join (the data used below is adapted from here):
%load_ext memory_profiler
from string import ascii_lowercase
import numpy as np
np.random.seed(1)
n = 2_000; k = 10_000
idx1 = np.random.randint(0, high = 100, size = n)
idx2 = np.random.randint(0, high = 100, size = n)
d1 = dict(x = np.random.choice(list(ascii_lowercase[:5]), size=n),
start = np.minimum(idx1, idx2),
end = np.maximum(idx1, idx2))
d2 = dict(x = np.random.choice(list(ascii_lowercase[:15]), size=k),
pos1 = np.random.randint(low=60, high = 151, size=k))
d1 = pd.DataFrame(d1)
d2 = pd.DataFrame(d2)
d1.head()
x | start | end | |
---|---|---|---|
0 | c | 37 | 61 |
1 | d | 12 | 16 |
2 | b | 72 | 79 |
3 | c | 9 | 74 |
4 | d | 75 | 76 |
d2.head()
x | pos1 | |
---|---|---|
0 | f | 126 |
1 | m | 121 |
2 | o | 128 |
3 | a | 96 |
4 | n | 61 |
len(d1), len(d2)
(2000, 10000)
A = (d1
.merge(d2, how = 'cross')
.query("start <= pos1 <= end")
.filter(['start', 'end', 'pos1'])
)
A.shape
(2740898, 3)
A.head()
start | end | pos1 | |
---|---|---|---|
4 | 37 | 61 | 61 |
53 | 37 | 61 | 60 |
74 | 37 | 61 | 60 |
104 | 37 | 61 | 60 |
105 | 37 | 61 | 60 |
B = (d1
.conditional_join(
d2,
('start', 'pos1', '<='),
('end', 'pos1', '>='),
df_columns=['start', 'end'],
right_columns='pos1',
use_numba=False)
)
B.shape
(2740898, 3)
B.head()
start | end | pos1 | |
---|---|---|---|
0 | 37 | 61 | 60 |
1 | 37 | 61 | 60 |
2 | 37 | 61 | 60 |
3 | 37 | 61 | 60 |
4 | 37 | 61 | 60 |
C = (d1
.conditional_join(
d2,
('start', 'pos1', '<='),
('end', 'pos1', '>='),
df_columns=['start', 'end'],
right_columns='pos1',
use_numba=True)
)
C.shape
(2740898, 3)
Check that the outputs match:
cols = ['start', 'end', 'pos1']
A = A.sort_values(cols, ignore_index=True)
B = B.sort_values(cols, ignore_index=True)
C = C.sort_values(cols, ignore_index=True)
A.equals(B), A.equals(C), B.equals(C)
(True, True, True)
Let’s compare the speed:
%timeit d1.merge(d2, how = 'cross').query("start <= pos1 <= end")
1.11 s ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=False)
38.9 ms ± 401 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=True)
40.4 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Compare memory usage:
%memit d1.merge(d2, how = 'cross').query("start <= pos1 <= end")
peak memory: 3844.30 MiB, increment: 0.14 MiB
%memit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=False)
peak memory: 3844.30 MiB, increment: 0.00 MiB
%memit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=True)
peak memory: 3844.30 MiB, increment: 0.00 MiB
Let's run another performance test - the code below is a self join to find overlapping events in a 30,000-row DataFrame, adapted from DuckDB:
url = 'https://raw.githubusercontent.com/samukweku/data-wrangling-blog/master/notebooks/Data_files/results.csv'
events = pd.read_csv(url, parse_dates=['start', 'end']).iloc[:, 1:]
events.head()
id | name | audience | start | sponsor | end | |
---|---|---|---|---|---|---|
0 | 1 | Event 1 | 1178 | 2022-11-19 10:00:00 | Sponsor 2 | 2022-11-19 10:15:00 |
1 | 2 | Event 2 | 1446 | 2015-09-27 15:00:00 | Sponsor 11 | 2015-09-27 15:11:00 |
2 | 3 | Event 3 | 2261 | 2019-11-12 18:00:00 | Sponsor 10 | 2019-11-12 18:53:00 |
3 | 4 | Event 4 | 1471 | 2019-12-24 22:00:00 | Sponsor 6 | 2019-12-24 22:11:00 |
4 | 5 | Event 5 | 2605 | 2028-06-20 12:00:00 | Sponsor 8 | 2028-06-20 12:31:00 |
(events
.conditional_join(
events,
('start', 'end', '<='),
('end', 'start', '>='),
('id', 'id', '!='),
use_numba = False,
df_columns = ['id', 'start', 'end'],
right_columns = ['id', 'start', 'end'])
)
left | right | |||||
---|---|---|---|---|---|---|
id | start | end | id | start | end | |
0 | 10 | 1993-11-27 12:00:00 | 1993-11-27 12:37:00 | 2345 | 1993-11-27 10:00:00 | 1993-11-27 12:00:00 |
1 | 15 | 1993-04-04 16:00:00 | 1993-04-04 18:00:00 | 11178 | 1993-04-04 17:00:00 | 1993-04-04 17:22:00 |
2 | 17 | 2030-10-25 07:00:00 | 2030-10-25 07:27:00 | 19605 | 2030-10-25 06:00:00 | 2030-10-25 08:00:00 |
3 | 26 | 2005-10-04 17:00:00 | 2005-10-04 17:18:00 | 8218 | 2005-10-04 17:00:00 | 2005-10-04 17:27:00 |
4 | 35 | 2024-05-02 15:00:00 | 2024-05-02 15:35:00 | 6916 | 2024-05-02 15:00:00 | 2024-05-02 15:36:00 |
... | ... | ... | ... | ... | ... | ... |
3697 | 29966 | 2000-08-26 11:00:00 | 2000-08-26 13:00:00 | 29375 | 2000-08-26 13:00:00 | 2000-08-26 13:53:00 |
3698 | 29971 | 2018-05-18 04:00:00 | 2018-05-18 04:18:00 | 24173 | 2018-05-18 04:00:00 | 2018-05-18 04:36:00 |
3699 | 29978 | 1992-06-07 22:00:00 | 1992-06-07 22:23:00 | 981 | 1992-06-07 22:00:00 | 1992-06-07 22:30:00 |
3700 | 29984 | 2025-06-05 03:00:00 | 2025-06-05 03:17:00 | 19051 | 2025-06-05 01:00:00 | 2025-06-05 03:00:00 |
3701 | 29995 | 2016-09-04 14:00:00 | 2016-09-04 14:32:00 | 12296 | 2016-09-04 14:00:00 | 2016-09-04 14:50:00 |
3702 rows × 6 columns
The cartesian join takes a very long time - too long for us to test here. The focus for this test will just be on conditional_join.
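A rough calculation shows why: a cross join pairs every row with every other row before filtering, so for this 30,000-row frame the intermediate result alone would hold about 900 million rows:

# Size of the intermediate frame a merge(how='cross') would materialize.
n = len(events)        # 30,000 events
print(f"{n * n:,}")    # 900,000,000 candidate pairs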
%%timeit
(events
.conditional_join(
events,
('start', 'end', '<='),
('end', 'start', '>='),
('id', 'id', '!='),
use_numba = False,
df_columns = ['id', 'start', 'end'],
right_columns = ['id', 'start', 'end'])
)
19.6 ms ± 2.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
(events
.conditional_join(
events,
('start', 'end', '<='),
('end', 'start', '>='),
('id', 'id', '!='),
use_numba = True,
df_columns = ['id', 'start', 'end'],
right_columns = ['id', 'start', 'end'])
)
22.2 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Let’s look at performance when an equi join is present:
from string import ascii_lowercase
np.random.seed(1)
n = 20_000; k = 50_000
idx1 = np.random.randint(0, high = 100, size = n)
idx2 = np.random.randint(0, high = 100, size = n)
d1 = dict(x = np.random.choice(list(ascii_lowercase[:5]), size=n),
start = np.minimum(idx1, idx2),
end = np.maximum(idx1, idx2))
d2 = dict(x = np.random.choice(list(ascii_lowercase[:15]), size=k),
pos1 = np.random.randint(low=60, high = 151, size=k))
d1 = pd.DataFrame(d1)
d2 = pd.DataFrame(d2)
out = (d1
.merge(d2, on = 'x')
.query('start <= pos1 <= end')
)
out.shape
(8886840, 4)
out = (d1
.conditional_join(
d2,
('start', 'pos1', '<='),
('end', 'pos1', '>='),
('x', 'x', '=='),
use_numba=False)
)
out.shape
(8886840, 5)
About 8.9 million rows are returned; let's see how fast each approach runs:
%%timeit
(d1
.merge(d2, on = 'x')
.query('start <= pos1 <= end')
)
2.65 s ± 51.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(d1
.merge(d2, on = 'x')
.loc[lambda df: df.start.le(df.pos1) & df.pos1.le(df.end)]
)
2.56 s ± 50.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(d1
.conditional_join(
d2,
('start', 'pos1', '<='),
('end', 'pos1', '>='),
('x', 'x', '=='),
use_numba=False)
)
1.74 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%memit d1.merge(d2, on = 'x').query('start <= pos1 <= end')
peak memory: 7749.67 MiB, increment: 1846.95 MiB
%memit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), ('x', 'x', '=='), use_numba=False)
peak memory: 5711.58 MiB, increment: 0.00 MiB
conditional_join is slightly faster here; this might not be the case all the time. Depending on the distribution of the data, using `numba` where an equi join is present can give significant performance speedups:
from itertools import count
mapper = dict(zip(ascii_lowercase, count()))
d1['xx'] = d1.x.map(mapper)
d2['xx'] = d2.x.map(mapper)
d1.head()
x | start | end | xx | |
---|---|---|---|---|
0 | d | 37 | 82 | 3 |
1 | c | 2 | 12 | 2 |
2 | b | 46 | 72 | 1 |
3 | b | 7 | 9 | 1 |
4 | c | 39 | 75 | 2 |
d2.head()
x | pos1 | xx | |
---|---|---|---|
0 | c | 143 | 2 |
1 | i | 85 | 8 |
2 | k | 121 | 10 |
3 | m | 68 | 12 |
4 | o | 97 | 14 |
%%timeit
(d1
.merge(d2, on = 'xx')
.query('start <= pos1 <= end')
)
4.44 s ± 222 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(d1
.merge(d2, on = 'xx')
.loc[lambda df: df.start.le(df.pos1) & df.pos1.le(df.end)]
)
4.23 s ± 156 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(d1
.conditional_join(
d2,
('start', 'pos1', '<='),
('end', 'pos1', '>='),
('xx', 'xx', '=='),
use_numba=False)
)
1.76 s ± 3.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(d1
.conditional_join(
d2,
('start', 'pos1', '<='),
('end', 'pos1', '>='),
('xx', 'xx', '=='),
use_numba=True)
)
159 ms ± 638 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
The above numba implementation depends on a number of conditions:

- Only a single equi join is supported.
- The columns in the equi join should be of numeric or datetime data types.
- The columns in the equi join should contain many duplicates, i.e. a small number of unique values relative to the number of rows.
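A quick way to eyeball whether a dataset fits these conditions is to compare the number of unique values in the equi-join column to the number of rows - a small check using the d1 and d2 frames from above:

# The equi-join column should be numeric/datetime and heavily duplicated.
print(d1['xx'].dtype, d2['xx'].dtype)
print(d1['xx'].nunique(), 'unique values across', len(d1), 'rows')
print(d2['xx'].nunique(), 'unique values across', len(d2), 'rows')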
Summary#
This blog post shows how to efficiently join on non-equi conditions, using conditional_join. If you are comfortable with SQL, you could get more performance with DuckDB, which lets you query Pandas DataFrames efficiently with SQL, supports non-equi joins, and has a performant range-join implementation.
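As a sketch of that alternative (assuming the duckdb package is installed; DuckDB can scan in-scope Pandas DataFrames by name), the overlapping-events self join from earlier could be written roughly like this:

# pip install duckdb
import duckdb

# Same overlapping-events self join as above, expressed in SQL.
# "start" and "end" are quoted since "end" is a reserved word.
overlaps = duckdb.sql("""
    SELECT a.id, a."start", a."end",
           b.id AS id_right, b."start" AS start_right, b."end" AS end_right
    FROM events a, events b
    WHERE a."start" <= b."end"
      AND a."end" >= b."start"
      AND a.id <> b.id
""").df()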
Comments#