{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Fast and Efficient Inequality Joins in Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Introduction**\n", "\n", "Pandas supports *equi-join*, where the keys involved in the join are considered to be *equal*. This is implemented via the [merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) and [join](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html#pandas.DataFrame.join) functions. There are scenarios however, where the keys involved might not be equal; some other logical condition is involved in the join. This is known as a *non-equi join* or an *inequality join*. \n", "\n", "Let's look at some examples, culled from [here](https://github.com/pandas-dev/pandas/issues/34543#issue-629846136), that illustrate equi and non-equi joins:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df1 = pd.DataFrame({'left': [1,2,3], 'col_a': [\"A\", \"B\", \"C\"]})\n", "df2 = pd.DataFrame({'right': [0, 2, 3], 'col_b': [\"Z\", \"X\", \"Y\"]})" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftcol_a
01A
12B
23C
\n", "
" ], "text/plain": [ " left col_a\n", "0 1 A\n", "1 2 B\n", "2 3 C" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rightcol_b
00Z
12X
23Y
\n", "
" ], "text/plain": [ " right col_b\n", "0 0 Z\n", "1 2 X\n", "2 3 Y" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Join based on the condition `df1.left == df2.right` - this is an equi join, because the join operator is an *equality* operator, where df1.left is *equal to* df2.right:\n", "\n", "| left | col_a | right | col_b |\n", "|----- |:------:|:-----:|-----:|\n", "| 2 | B | 2 | X |\n", "| 3 | C | 3 | Y |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Join based on the condition `df1.left != df2.right` - this is an example of a non-equi join; it should return rows where df1.left *is not equal to* df2.right:\n", "\n", "| left | col_a | right | col_b |\n", "|----- |:------:|:-----:|-----:|\n", "| 1 | A | 2 | X |\n", "| 1 | A | 3 | Y |\n", "| 2 | B | 3 | Y |\n", "| 1 | A | 0 | Z |\n", "| 2 | B | 0 | Z |\n", "| 3 | C | 0 | Z |\n", "| 3 | C | 2 | X |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Join based on the condition `df1.left > df2.right` - this is another example of a non-equi join; it should return rows where df1.left *is greater than* df2.right:\n", "\n", "| left | col_a | right | col_b |\n", "|----- |:------:|:-----:|-----:|\n", "| 1 | A | 0 | Z |\n", "| 2 | B | 0 | Z |\n", "| 3 | C | 0 | Z |\n", "| 3 | C | 2 | X |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "non-equi joins are supported in SQL, and offer elegant solutions for specific uses cases. A good example is provided [here](https://github.com/pandas-dev/pandas/issues/34543#issue-629846136) - \n", "
\n", "Consider a dataframe `df_process` that contains `process_id`, `process_start_date` and `process_end_date` columns for some business process, and a second dataframe `df_events` that contains an `event_id` and an `event_date` column for particular events. If you want to find all events that occur during some business process, you can easily obtain these by applying the conditional join statement:\n", "\n", "\n", "``df_event.event_date >= df_process.process_start_date & df_event.event_date <= df_process.process_end_date``\n", "
\n", "\n", "The above scenario is an example of a range join, where the join is *between a range of values*. Currently in Pandas, to solve non-equi joins, one option is via an IntervalIndex, if the intervals do not overlap. Another option, which is more common, involves a cartesian join, where every row from the left dataframe is joined to every row from the right dataframe, before it is filtered with the non-equi joins: \n", "\n", "```py\n", "(df_event\n", ".merge(df_process, how = 'cross')\n", ".query('process_start_date <= event_date <= process_end_date')\n", ")\n", "```\n", "\n", "This is not an efficient way to deal with inequality joins, in terms of memory and speed, especially for large data. This is where the [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join) function from [pyjanitor](https://pyjanitor-devs.github.io/pyjanitor/) comes into play. It provides an efficient way to deal with non-equi joins, and does so in a more memory efficient way than cartesian joins, while still being performant. Under the hood it uses binary search, and in some special cases, such as range joins, it uses special algorithms to improve performance." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Strictly Non-equi Joins**\n", "\n", "Let's load the [pyjanitor](https://pyjanitor-devs.github.io/pyjanitor/) library:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# pip install pyjanitor\n", "import janitor as jn\n", "import sys\n", "from janitor import col" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " pandas version : 2.2.0 \n", " janitor version : 0.26.0 \n", " python version : 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) \n", "[GCC 12.3.0]\n" ] } ], "source": [ "print(\" pandas version :\", pd.__version__,\"\\n\", \n", " \"janitor version :\", jn.__version__, \"\\n\",\n", " \"python version :\", sys.version)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's walk through some of the parameters in [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join):\n", "\n", "- df : Left DataFrame.\n", "- right: Named Series or DataFrame to join to `df`.\n", "- conditions : Variable argument of tuple(s) of the form `(left_on, right_on, op)`- \n", " `left_on` is the column from `df`, `right_on` is the column from `right`, \n", " while `op` is the join operator. For multiple conditions, the and(&) operator\n", " is used to combine the results of the individual conditions.\n", "- how: indicates the type of join to be performed. 
Can be one of `left`, `right`, `inner`, or `outer`.\n", "- use_numba: Whether to use `numba`, for possible performance improvement.\n", "- df_columns: Select columns from `df`.\n", "- right_columns: Select columns from `right`.\n", "\n", "\n", "Let's apply the [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join) function to solve the non-equi joins earlier:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftcol_arightcol_b
01A2X
11A3Y
22B3Y
31A0Z
42B0Z
53C0Z
63C2X
\n", "
" ], "text/plain": [ " left col_a right col_b\n", "0 1 A 2 X\n", "1 1 A 3 Y\n", "2 2 B 3 Y\n", "3 1 A 0 Z\n", "4 2 B 0 Z\n", "5 3 C 0 Z\n", "6 3 C 2 X" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.conditional_join(df2, ('left', 'right', '!='))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftcol_arightcol_b
01A0Z
12B0Z
23C0Z
33C2X
\n", "
" ], "text/plain": [ " left col_a right col_b\n", "0 1 A 0 Z\n", "1 2 B 0 Z\n", "2 3 C 0 Z\n", "3 3 C 2 X" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.conditional_join(df2, ('left', 'right', '>'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use the [col](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.utils.col) function:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftcol_arightcol_b
01A2X
11A3Y
22B3Y
31A0Z
42B0Z
53C0Z
63C2X
\n", "
" ], "text/plain": [ " left col_a right col_b\n", "0 1 A 2 X\n", "1 1 A 3 Y\n", "2 2 B 3 Y\n", "3 1 A 0 Z\n", "4 2 B 0 Z\n", "5 3 C 0 Z\n", "6 3 C 2 X" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.conditional_join(df2, col('left') != col('right'))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftcol_arightcol_b
01A0Z
12B0Z
23C0Z
33C2X
\n", "
" ], "text/plain": [ " left col_a right col_b\n", "0 1 A 0 Z\n", "1 2 B 0 Z\n", "2 3 C 0 Z\n", "3 3 C 2 X" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.conditional_join(df2, col('left') > col('right'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For multiple non-equi conditions, the results of the individual conditions are combined with the and(&) operator. Let's look at a range join, to see this in action - the example is adapted from [here](http://da.qcri.org/ntang/pubs/vldbj2016.pdf):" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "east = dict(id = range(100,103), \n", " dur = [140, 100, 90], \n", " rev = [9, 12, 5], \n", " cores = [2, 8, 4])\n", "\n", "west = dict(t_id = [404,498, 676, 742], \n", " time = [100, 140, 80, 90], \n", " cost = [6, 11, 10, 5], \n", " cores = [4, 2, 1, 4])\n", "\n", "east = pd.DataFrame(east)\n", "west = pd.DataFrame(west)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddurrevcores
010014092
1101100128
21029054
\n", "
" ], "text/plain": [ " id dur rev cores\n", "0 100 140 9 2\n", "1 101 100 12 8\n", "2 102 90 5 4" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "east" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
t_idtimecostcores
040410064
1498140112
267680101
37429054
\n", "
" ], "text/plain": [ " t_id time cost cores\n", "0 404 100 6 4\n", "1 498 140 11 2\n", "2 676 80 10 1\n", "3 742 90 5 4" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "west" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Task : Look for any customer from the West Coast who rented a virtual machine for more hours than any customer from the East Coast, but who paid less. Basically, get rows where `east.dur < west.time` and `east.rev > west.cost`. \n", "\n", "This is a range join, and is solved easily and efficiently with [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join):" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddurrevt_idtimecost
01011001249814011
\n", "
" ], "text/plain": [ " id dur rev t_id time cost\n", "0 101 100 12 498 140 11" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(east\n", ".conditional_join(\n", " west, \n", " ('dur', 'time', '<'), \n", " ('rev', 'cost', '>'), \n", " df_columns = ['id', 'dur', 'rev'],\n", " right_columns = ['t_id', 'time', 'cost']\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddurrevt_idtimecost
01011001249814011
\n", "
" ], "text/plain": [ " id dur rev t_id time cost\n", "0 101 100 12 498 140 11" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(east\n", ".conditional_join(\n", " west, \n", " col('dur') < col('time'),\n", " col('rev') > col('cost'),\n", " df_columns = ['id', 'dur', 'rev'],\n", " right_columns = ['t_id', 'time', 'cost']\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For range joins, and when `use_numba = False`, [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join) uses an algorithm based on this [vertica blog post](https://www.vertica.com/blog/what-is-a-range-join-and-why-is-it-so-fastba-p223413/). \n", "\n", "You might get more performance with `use_numba = True`. The implementation is based on the algorithm in this [publication](https://www.scitepress.org/papers/2018/68268/68268.pdf)\n", "\n", "Note that non-equi join is supported only for numeric and date types." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Equi and Non-equi Joins**\n", "\n", "[conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join) also supports a combination of equi and non-equi join conditions - under the hood it uses Pandas' internal merge function to get the matching pairs, before pruning the non-equi joins to get the final dataframe. 
Depending on the size of the dataframes, this might offer more performance than the classic merge and filter approach:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame({'id': [1,1,1,2,2,3], \n", " 'value_1': [2,5,7,1,3,4]})\n", " \n", "df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], \n", " 'value_2A': [0,3,7,12,0,2,3,1], \n", " 'value_2B': [1,5,9,15,1,4,6,3]})" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idvalue_1
012
115
217
321
423
534
\n", "
" ], "text/plain": [ " id value_1\n", "0 1 2\n", "1 1 5\n", "2 1 7\n", "3 2 1\n", "4 2 3\n", "5 3 4" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idvalue_2Avalue_2B
0101
1135
2179
311215
4201
5224
6236
7313
\n", "
" ], "text/plain": [ " id value_2A value_2B\n", "0 1 0 1\n", "1 1 3 5\n", "2 1 7 9\n", "3 1 12 15\n", "4 2 0 1\n", "5 2 2 4\n", "6 2 3 6\n", "7 3 1 3" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Classic Pandas' merge and filter approach:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idvalue_1value_2Avalue_2B
51535
101779
122101
162324
172336
\n", "
" ], "text/plain": [ " id value_1 value_2A value_2B\n", "5 1 5 3 5\n", "10 1 7 7 9\n", "12 2 1 0 1\n", "16 2 3 2 4\n", "17 2 3 3 6" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df1\n", ".merge(df2, on = 'id')\n", ".query('value_2A <= value_1 <= value_2B')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join):" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftright
idvalue_1idvalue_2Avalue_2B
015135
117179
221201
323224
423236
\n", "
" ], "text/plain": [ " left right \n", " id value_1 id value_2A value_2B\n", "0 1 5 1 3 5\n", "1 1 7 1 7 9\n", "2 2 1 2 0 1\n", "3 2 3 2 2 4\n", "4 2 3 2 3 6" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df1\n", ".conditional_join(\n", " df2,\n", " ('id', 'id', '=='),\n", " ('value_1', 'value_2A', '>='),\n", " ('value_1', 'value_2B', '<=')\n", "))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftright
idvalue_1idvalue_2Avalue_2B
015135
117179
221201
323224
423236
\n", "
" ], "text/plain": [ " left right \n", " id value_1 id value_2A value_2B\n", "0 1 5 1 3 5\n", "1 1 7 1 7 9\n", "2 2 1 2 0 1\n", "3 2 3 2 2 4\n", "4 2 3 2 3 6" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df1\n", ".conditional_join(\n", " df2,\n", " col('id') == col('id'),\n", " col('value_1') >= col('value_2A'),\n", " col('value_1') <= col('value_2B')\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the above returned a MultiIndex column; if the columns from the left and right dataframes have something in common a MultiIndex column is returned. Of course, you can select/rename columns with `df_columns` and `right_columns` to get a single index column, if that is preferred." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Non-equi joins with OR**\n", "\n", "For multiple non-equi joins, an and(&) operator is applied, to combine the results of the individual conditions. There are scenarios however, where the join might have an *OR* condition. In this case, the joins are executed individually, and the resulting dataframes can then be concatenated into one. 
Let's look at an example from [Nelson Tang's blog post](https://www.nelsontang.com/blog/2021-11-25-pandas-conditional-merging):" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "sales_volume_table = pd.DataFrame.from_dict([\n", " {'date':'2021-11-15', 'quantity':1, 'brand':'Outdoor'},\n", " {'date':'2021-11-20', 'quantity':2, 'brand':'Leisure'},\n", " {'date':'2021-11-25', 'quantity':3, 'brand':'Athletic'},\n", " {'date':'2021-11-26', 'quantity':2, 'brand':'Outdoor'},\n", "])\n", "\n", "promos_table = pd.DataFrame.from_dict([\n", " {'start_date':'2021-11-01', 'end_date':'2021-11-25',\n", " 'brand':'ANY', 'rebate_per_unit':3},\n", " {'start_date':'2021-11-25', 'end_date':'2021-11-26',\n", " 'brand':'Outdoor', 'rebate_per_unit':5},\n", "])\n", "\n", "sales_volume_table['date'] = pd.to_datetime(sales_volume_table['date'])\n", "promos_table['start_date'] = pd.to_datetime(promos_table['start_date'])\n", "promos_table['end_date'] = pd.to_datetime(promos_table['end_date'])\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datequantitybrand
02021-11-151Outdoor
12021-11-202Leisure
22021-11-253Athletic
32021-11-262Outdoor
\n", "
" ], "text/plain": [ " date quantity brand\n", "0 2021-11-15 1 Outdoor\n", "1 2021-11-20 2 Leisure\n", "2 2021-11-25 3 Athletic\n", "3 2021-11-26 2 Outdoor" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales_volume_table" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
start_dateend_datebrandrebate_per_unit
02021-11-012021-11-25ANY3
12021-11-252021-11-26Outdoor5
\n", "
" ], "text/plain": [ " start_date end_date brand rebate_per_unit\n", "0 2021-11-01 2021-11-25 ANY 3\n", "1 2021-11-25 2021-11-26 Outdoor 5" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "promos_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Problem statement, culled from the blog post: \n", "\n", "1. The date in the left table was between two dates (a start and end date) in the second table\n", "\n", "2. ...AND the values in two other columns matched each other, OR the column on the right table was equal to 'ANY' (aka a 'wildcard' value)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Translating this into SQL is [easy](https://www.nelsontang.com/blog/2021-11-25-pandas-conditional-merging): \n", "```SQL\n", "SELECT *\n", "FROM sales_volume_table, promos_table\n", "WHERE (sales_volume_table.brand=promos_table.brand or promos_table.brand='ANY')\n", "AND (start_date <= date AND date <= end_date)\n", "```\n", "\n", "Replicating this in Pandas is more involved but doable:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftright
datequantitybrandstart_dateend_datebrandrebate_per_unit
02021-11-262Outdoor2021-11-252021-11-26Outdoor5
02021-11-151Outdoor2021-11-012021-11-25ANY3
12021-11-202Leisure2021-11-012021-11-25ANY3
22021-11-253Athletic2021-11-012021-11-25ANY3
\n", "
" ], "text/plain": [ " left right \n", " date quantity brand start_date end_date brand rebate_per_unit\n", "0 2021-11-26 2 Outdoor 2021-11-25 2021-11-26 Outdoor 5\n", "0 2021-11-15 1 Outdoor 2021-11-01 2021-11-25 ANY 3\n", "1 2021-11-20 2 Leisure 2021-11-01 2021-11-25 ANY 3\n", "2 2021-11-25 3 Athletic 2021-11-01 2021-11-25 ANY 3" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top = (sales_volume_table\n", " .conditional_join(\n", " promos_table,\n", " ('brand', 'brand', '=='),\n", " ('date', 'start_date', '>='),\n", " ('date', 'end_date', '<=')\n", " )\n", " )\n", "\n", "bottom = (sales_volume_table\n", " .conditional_join(\n", " promos_table.query('brand == \"ANY\"'),\n", " ('date', 'start_date', '>='),\n", " ('date', 'end_date', '<=')\n", " )\n", " )\n", "\n", "pd.concat([top, bottom])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Performance**\n", "\n", "Why bother with [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join)? simple syntax, memory and speed efficiency. Let's see the memory and speed benefits, compared to a cartesian join (the data used below is adapted from [here](https://stackoverflow.com/a/25655497/7175713)):" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/memory_profiler.py:1136: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", " ipython_version = LooseVersion(IPython.__version__)\n", "/home/sam/mambaforge/envs/jupyterbook/lib/python3.9/site-packages/setuptools/_distutils/version.py:345: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", " other = LooseVersion(other)\n" ] } ], "source": [ "%load_ext memory_profiler\n", "from string import ascii_lowercase\n", "import numpy as np\n", "np.random.seed(1)\n", "n = 2_000; k = 10_000\n", "\n", "idx1 = np.random.randint(0, high = 100, size = n)\n", "idx2 = np.random.randint(0, high = 100, size = n)\n", "\n", "d1 = dict(x = np.random.choice(list(ascii_lowercase[:5]), size=n),\n", " start = np.minimum(idx1, idx2),\n", " end = np.maximum(idx1, idx2))\n", "\n", "d2 = dict(x = np.random.choice(list(ascii_lowercase[:15]), size=k),\n", " pos1 = np.random.randint(low=60, high = 151, size=k))\n", "\n", "d1 = pd.DataFrame(d1)\n", "d2 = pd.DataFrame(d2)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xstartend
0c3761
1d1216
2b7279
3c974
4d7576
\n", "
" ], "text/plain": [ " x start end\n", "0 c 37 61\n", "1 d 12 16\n", "2 b 72 79\n", "3 c 9 74\n", "4 d 75 76" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d1.head()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xpos1
0f126
1m121
2o128
3a96
4n61
\n", "
" ], "text/plain": [ " x pos1\n", "0 f 126\n", "1 m 121\n", "2 o 128\n", "3 a 96\n", "4 n 61" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d2.head()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2000, 10000)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(d1), len(d2)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2740898, 3)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = (d1\n", " .merge(d2, how = 'cross')\n", " .query(\"start <= pos1 <= end\")\n", " .filter(['start', 'end', 'pos1'])\n", " )\n", "A.shape" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
startendpos1
4376161
53376160
74376160
104376160
105376160
\n", "
" ], "text/plain": [ " start end pos1\n", "4 37 61 61\n", "53 37 61 60\n", "74 37 61 60\n", "104 37 61 60\n", "105 37 61 60" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A.head()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2740898, 3)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B = (d1\n", " .conditional_join(\n", " d2, \n", " ('start', 'pos1', '<='), \n", " ('end', 'pos1', '>='), \n", " df_columns=['start', 'end'],\n", " right_columns='pos1',\n", " use_numba=False)\n", " )\n", "B.shape" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
startendpos1
0376160
1376160
2376160
3376160
4376160
\n", "
" ], "text/plain": [ " start end pos1\n", "0 37 61 60\n", "1 37 61 60\n", "2 37 61 60\n", "3 37 61 60\n", "4 37 61 60" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B.head()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2740898, 3)" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C = (d1\n", " .conditional_join(\n", " d2, \n", " ('start', 'pos1', '<='), \n", " ('end', 'pos1', '>='), \n", " df_columns=['start', 'end'],\n", " right_columns='pos1',\n", " use_numba=True)\n", " )\n", "C.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the outputs match:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(True, True, True)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols = ['start', 'end', 'pos1']\n", "A = A.sort_values(cols, ignore_index=True)\n", "B = B.sort_values(cols, ignore_index=True)\n", "C = C.sort_values(cols, ignore_index=True)\n", "A.equals(B), A.equals(C), B.equals(C)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's compare the speed:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.36 s ± 92.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%timeit d1.merge(d2, how = 'cross').query(\"start <= pos1 <= end\")" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "57.8 ms ± 1.48 ms per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n" ] } ], "source": [ "%timeit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=False)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "74.8 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "%timeit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare memory usage:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 3396.63 MiB, increment: 2683.73 MiB\n" ] } ], "source": [ "%memit d1.merge(d2, how = 'cross').query(\"start <= pos1 <= end\")" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 713.18 MiB, increment: 0.00 MiB\n" ] } ], "source": [ "%memit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=False)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 725.80 MiB, increment: 12.62 MiB\n" ] } ], "source": [ "%memit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), use_numba=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's run another performance test - the code below is a self join to find overlapping events in a 30,000 row DataFrame, and adapted from [DuckDB](https://github.com/duckdb/duckdb/blob/master/benchmark/micro/join/iejoin_events.benchmark):" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnameaudiencestartsponsorend
01Event 111782022-11-19 10:00:00Sponsor 22022-11-19 10:15:00
12Event 214462015-09-27 15:00:00Sponsor 112015-09-27 15:11:00
23Event 322612019-11-12 18:00:00Sponsor 102019-11-12 18:53:00
34Event 414712019-12-24 22:00:00Sponsor 62019-12-24 22:11:00
45Event 526052028-06-20 12:00:00Sponsor 82028-06-20 12:31:00
\n", "
" ], "text/plain": [ " id name audience start sponsor end\n", "0 1 Event 1 1178 2022-11-19 10:00:00 Sponsor 2 2022-11-19 10:15:00\n", "1 2 Event 2 1446 2015-09-27 15:00:00 Sponsor 11 2015-09-27 15:11:00\n", "2 3 Event 3 2261 2019-11-12 18:00:00 Sponsor 10 2019-11-12 18:53:00\n", "3 4 Event 4 1471 2019-12-24 22:00:00 Sponsor 6 2019-12-24 22:11:00\n", "4 5 Event 5 2605 2028-06-20 12:00:00 Sponsor 8 2028-06-20 12:31:00" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = 'https://raw.githubusercontent.com/samukweku/data-wrangling-blog/master/notebooks/Data_files/results.csv'\n", "events = pd.read_csv(url, parse_dates=['start', 'end']).iloc[:, 1:]\n", "events.head()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
leftright
idstartendidstartend
0101993-11-27 12:00:001993-11-27 12:37:0023451993-11-27 10:00:001993-11-27 12:00:00
1151993-04-04 16:00:001993-04-04 18:00:00111781993-04-04 17:00:001993-04-04 17:22:00
2172030-10-25 07:00:002030-10-25 07:27:00196052030-10-25 06:00:002030-10-25 08:00:00
3262005-10-04 17:00:002005-10-04 17:18:0082182005-10-04 17:00:002005-10-04 17:27:00
4352024-05-02 15:00:002024-05-02 15:35:0069162024-05-02 15:00:002024-05-02 15:36:00
.....................
3697299662000-08-26 11:00:002000-08-26 13:00:00293752000-08-26 13:00:002000-08-26 13:53:00
3698299712018-05-18 04:00:002018-05-18 04:18:00241732018-05-18 04:00:002018-05-18 04:36:00
3699299781992-06-07 22:00:001992-06-07 22:23:009811992-06-07 22:00:001992-06-07 22:30:00
3700299842025-06-05 03:00:002025-06-05 03:17:00190512025-06-05 01:00:002025-06-05 03:00:00
3701299952016-09-04 14:00:002016-09-04 14:32:00122962016-09-04 14:00:002016-09-04 14:50:00
\n", "

3702 rows × 6 columns

\n", "
" ], "text/plain": [ " left right \\\n", " id start end id \n", "0 10 1993-11-27 12:00:00 1993-11-27 12:37:00 2345 \n", "1 15 1993-04-04 16:00:00 1993-04-04 18:00:00 11178 \n", "2 17 2030-10-25 07:00:00 2030-10-25 07:27:00 19605 \n", "3 26 2005-10-04 17:00:00 2005-10-04 17:18:00 8218 \n", "4 35 2024-05-02 15:00:00 2024-05-02 15:35:00 6916 \n", "... ... ... ... ... \n", "3697 29966 2000-08-26 11:00:00 2000-08-26 13:00:00 29375 \n", "3698 29971 2018-05-18 04:00:00 2018-05-18 04:18:00 24173 \n", "3699 29978 1992-06-07 22:00:00 1992-06-07 22:23:00 981 \n", "3700 29984 2025-06-05 03:00:00 2025-06-05 03:17:00 19051 \n", "3701 29995 2016-09-04 14:00:00 2016-09-04 14:32:00 12296 \n", "\n", " \n", " start end \n", "0 1993-11-27 10:00:00 1993-11-27 12:00:00 \n", "1 1993-04-04 17:00:00 1993-04-04 17:22:00 \n", "2 2030-10-25 06:00:00 2030-10-25 08:00:00 \n", "3 2005-10-04 17:00:00 2005-10-04 17:27:00 \n", "4 2024-05-02 15:00:00 2024-05-02 15:36:00 \n", "... ... ... \n", "3697 2000-08-26 13:00:00 2000-08-26 13:53:00 \n", "3698 2018-05-18 04:00:00 2018-05-18 04:36:00 \n", "3699 1992-06-07 22:00:00 1992-06-07 22:30:00 \n", "3700 2025-06-05 01:00:00 2025-06-05 03:00:00 \n", "3701 2016-09-04 14:00:00 2016-09-04 14:50:00 \n", "\n", "[3702 rows x 6 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(events\n", ".conditional_join(\n", " events,\n", " ('start', 'end', '<='),\n", " ('end', 'start', '>='),\n", " ('id', 'id', '!='),\n", " use_numba = False,\n", " df_columns = ['id', 'start', 'end'],\n", " right_columns = ['id', 'start', 'end'])\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cartesian join takes a very long time - too long for us to test here. The focus for this test will just be on [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join)." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "26.2 ms ± 961 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "%%timeit\n", "(events\n", ".conditional_join(\n", " events,\n", " ('start', 'end', '<='),\n", " ('end', 'start', '>='),\n", " ('id', 'id', '!='),\n", " use_numba = False,\n", " df_columns = ['id', 'start', 'end'],\n", " right_columns = ['id', 'start', 'end'])\n", ")" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "31.3 ms ± 3.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "%%timeit\n", "(events\n", ".conditional_join(\n", " events,\n", " ('start', 'end', '<='),\n", " ('end', 'start', '>='),\n", " ('id', 'id', '!='),\n", " use_numba = True,\n", " df_columns = ['id', 'start', 'end'],\n", " right_columns = ['id', 'start', 'end'])\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at performance when an equi join is present:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "from string import ascii_lowercase\n", "np.random.seed(1)\n", "n = 20_000; k = 50_000\n", "\n", "idx1 = np.random.randint(0, high = 100, size = n)\n", "idx2 = np.random.randint(0, high = 100, size = n)\n", "\n", "d1 = dict(x = np.random.choice(list(ascii_lowercase[:5]), size=n),\n", " start = np.minimum(idx1, idx2),\n", " end = np.maximum(idx1, idx2))\n", "\n", "d2 = dict(x = np.random.choice(list(ascii_lowercase[:15]), size=k),\n", " pos1 = np.random.randint(low=60, high = 151, size=k))\n", "\n", "d1 = pd.DataFrame(d1)\n", "d2 = pd.DataFrame(d2)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8886840, 4)" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out = (d1\n", " 
.merge(d2, on = 'x')\n", "  .query('start <= pos1 <= end')\n", "  )\n", "out.shape" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8886840, 5)" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out = (d1\n", "  .conditional_join(\n", "  d2, \n", "  ('start', 'pos1', '<='), \n", "  ('end', 'pos1', '>='), \n", "  ('x', 'x', '=='), \n", "  use_numba=False)\n", "  )\n", "out.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "About 9 million rows are returned; let's see how fast each function runs:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.17 s ± 24.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "(d1\n", ".merge(d2, on = 'x')\n", ".query('start <= pos1 <= end')\n", ")" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.33 s ± 40.3 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "(d1\n", ".conditional_join(\n", " d2, \n", " ('start', 'pos1', '<='), \n", " ('end', 'pos1', '>='), \n", " ('x', 'x', '=='), \n", " use_numba=False)\n", ")" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 5829.88 MiB, increment: 4767.70 MiB\n" ] } ], "source": [ "%memit d1.merge(d2, on = 'x').query('start <= pos1 <= end')" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 4135.80 MiB, increment: 3073.66 MiB\n" ] } ], "source": [ "%memit d1.conditional_join(d2, ('start', 'pos1', '<='), ('end', 'pos1', '>='), ('x', 'x', '=='), use_numba=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join) is faster; this might not be the case all the time. \n", "\n", "Depending on the distribution of data, using `numba` where an equi-join is present can give significant performance speedups:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "from itertools import count\n", "mapper = dict(zip(ascii_lowercase, count()))\n", "d1['xx'] = d1.x.map(mapper)\n", "d2['xx'] = d2.x.map(mapper)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xstartendxx
0d37823
1c2122
2b46721
3b791
4c39752
\n", "
" ], "text/plain": [ " x start end xx\n", "0 d 37 82 3\n", "1 c 2 12 2\n", "2 b 46 72 1\n", "3 b 7 9 1\n", "4 c 39 75 2" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d1.head()" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xpos1xx
0c1432
1i858
2k12110
3m6812
4o9714
\n", "
" ], "text/plain": [ " x pos1 xx\n", "0 c 143 2\n", "1 i 85 8\n", "2 k 121 10\n", "3 m 68 12\n", "4 o 97 14" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d2.head()" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.97 s ± 47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "(d1\n", ".merge(d2, on = 'xx')\n", ".query('start <= pos1 <= end')\n", ")" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.45 s ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "(d1\n", ".conditional_join(\n", " d2, \n", " ('start', 'pos1', '<='), \n", " ('end', 'pos1', '>='), \n", " ('xx', 'xx', '=='), \n", " use_numba=False)\n", ")" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "324 ms ± 11.1 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "(d1\n", ".conditional_join(\n", "  d2, \n", "  ('start', 'pos1', '<='), \n", "  ('end', 'pos1', '>='), \n", "  ('xx', 'xx', '=='), \n", "  use_numba=True)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The numba implementation above relies on a few conditions for its speedups:\n", "- Only a single equi join is supported\n", "- The columns in the equi join should be numeric or datetime data types\n", "- The columns in the equi join should contain many duplicates - a small number of unique values spread over a large range of rows" ] }, { "cell_type": "markdown", "metadata": { "execution": { "iopub.execute_input": "2020-07-30T10:57:18.265994Z", "iopub.status.busy": "2020-07-30T10:57:18.265751Z", "iopub.status.idle": "2020-07-30T10:57:18.283104Z", "shell.execute_reply": "2020-07-30T10:57:18.280684Z", "shell.execute_reply.started": "2020-07-30T10:57:18.265970Z" } }, "source": [ "## **Summary**\n", "\n", "This blog post shows how to join efficiently on non-equi conditions, using [conditional_join](https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.conditional_join.conditional_join). If you are comfortable with SQL, you could get more performance with [DuckDB](https://duckdb.org), which allows querying Pandas DataFrames efficiently with SQL, supports non-equi joins as well, and has a [performant implementation](https://duckdb.org/2022/05/27/iejoin.html) for range joins." ] } ], "metadata": { "jupytext": { "formats": "ipynb,md" }, "kernelspec": { "display_name": "Python 3.9.12 ('playground')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" }, "vscode": { "interpreter": { "hash": "2a407888a31a0a9019767319ea2d3065f86a5891a7d6f13880fa432bbfc1e5b5" } } }, "nbformat": 4, "nbformat_minor": 4 }