{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Joins\n", "\n", "Have questions? Chat with us on Github or Slack:\n", "\n", "[![Homepage](https://img.shields.io/badge/fugue-source--code-red?logo=github)](https://github.com/fugue-project/fugue)\n", "[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](http://slack.fugue.ai)\n", "\n", "We have seen three engine-agnostic functions with `save()`, `load()` and `transform()`. Here, we'll show how to combine DataFrames through joins. The following joins are supported in Fugue: `LEFT OUTER`, `RIGHT OUTER`, `CROSS`, `LEFT SEMI`, `LEFT ANTI`, `INNER`, `FULL OUTER`. Most of these joins should be familiar, so this section will just be about providing examples on how to use them in Fugue.\n", "\n", "## Join and Renaming Columns\n", "\n", "Sometimes there will be a naming conflict with columns. In such situations, the `rename()` function needs to be used like the code snippet below. The `join()` function takes all of the join types mentioned above for the `how` argument. `on` takes a list of columns. The columns to join on can be inferred based on the columns, but explicitly specifying is better." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Hide Pandas deprecation warnings\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idval1val2
0a12
1b23
\n", "
" ], "text/plain": [ " id val1 val2\n", "0 a 1 2\n", "1 b 2 3" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import fugue.api as fa\n", "\n", "df1 = pd.DataFrame({\"id\": [\"a\",\"b\"], \"val1\": [1,2]})\n", "df2 = pd.DataFrame({\"id\": [\"a\",\"b\"], \"val1\": [2,3]})\n", "\n", "fa.join(df1, fa.rename(df2, {\"val1\":\"val2\"}), how=\"left_outer\", on=[\"id\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All functions in the `fugue.api` accept an execution engine as a parameter, we can also pass the execution engine to the `join()` function. Note that using the `\"dask\"` engine executes lazily so we have to call `.compute()`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idval1val2
0b23
0a12
\n", "
" ], "text/plain": [ " id val1 val2\n", "0 b 2 3\n", "0 a 1 2" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res = fa.join(df1, fa.rename(df2, {\"val1\":\"val2\"}), how=\"left_outer\", on=[\"id\"], engine=\"dask\")\n", "res.compute().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use the `engine_context` that we learned in the last section." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DaskDataFrame\n", "id:str|val1:long|val2:long\n", "------+---------+---------\n", "b |2 |3 \n", "a |1 |2 \n", "Total count: 2\n", "\n" ] } ], "source": [ "with fa.engine_context(\"dask\"):\n", " res = fa.join(df1, fa.rename(df2, {\"val1\":\"val2\"}), how=\"left_outer\", on=[\"id\"])\n", " fa.show(res)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SQL vs pandas Joins\n", "\n", "Joins in SQL and pandas can have different outcomes. The clearest example of this is `None` joining with `None`. In such cases, Fugue is consistent with SQL and Spark rather than pandas. Notice that column `a` has a row with None after the join below." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
0None1
1a2
\n", "
" ], "text/plain": [ " a b\n", "0 None 1\n", "1 a 2" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame({'a': [None, \"a\"], 'b': [1, 2]})\n", "df2 = pd.DataFrame({'a': [None, \"a\"], 'b': [1, 2]})\n", "df1.merge(df2, how=\"inner\", on=[\"a\", \"b\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With Fugue, the row with None will be dropped because it follows SQL convention." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
0a2
\n", "
" ], "text/plain": [ " a b\n", "0 a 2" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fa.join(df1, df2, how=\"inner\", on=[\"a\",\"b\"]) # None,1 is excluded" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiple Joins\n", "\n", "Multiple DataFrames can be joined together if there is no conflict. This will work across all engines." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcde
012345
\n", "
" ], "text/plain": [ " a b c d e\n", "0 1 2 3 4 5" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame({\"a\": [1], \"b\": [2]})\n", "df2 = pd.DataFrame({\"a\": [1], \"c\": [3]})\n", "df3 = pd.DataFrame({\"a\": [1], \"d\": [4]})\n", "df4 = pd.DataFrame({\"a\": [1], \"e\": [5]})\n", "\n", "fa.join(df1, df2, df3, df4, how=\"inner\", on=[\"a\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Union, Intersect, Subtract\n", "\n", "Fugue has support for Union, Intersect and Subtract. Union combines two DataFrames with the same columns. By default, only unique items are kept. Everything can be kept by setting `distinct=False`. Intersect gets the distinct elements of the intersection of the two DataFrames. Subtract gets the distinct elements of the left DataFrame that are not in the right DataFrame. Examples shown below. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
001
112
\n", "
" ], "text/plain": [ " a b\n", "0 0 1\n", "1 1 2" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame({\"a\": [0,1], \"b\": [1,2]})\n", "df2 = pd.DataFrame({\"a\": [0,0,1], \"b\": [1,1,2]})\n", "\n", "fa.union(df1, df2)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
001
112
201
301
412
\n", "
" ], "text/plain": [ " a b\n", "0 0 1\n", "1 1 2\n", "2 0 1\n", "3 0 1\n", "4 1 2" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fa.union(df1, df2, distinct=False)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
001
112
\n", "
" ], "text/plain": [ " a b\n", "0 0 1\n", "1 1 2" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fa.intersect(df1, df2)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [a, b]\n", "Index: []" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fa.subtract(df1, df2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that `join()`, `union()`, `subtract()` and `intersect()` can all take in an `engine` argument to use the appropriate backend. They will also work with the `engine_context()`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "This sections covers all of the base operations Fugue offers when combining two or more DataFrames. If there is logic that is not covered by this functionality, then a user can implement a custom Fugue extension. The `transformer` we covered in previous sections is the most commonly used Fugue extension. In the [extension](/beginner/extensions.ipynb) section, we'll cover the other extensions." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.13 ('fugue')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "orig_nbformat": 2, "vscode": { "interpreter": { "hash": "9fcd6e71927f6b3e5f4fa4280b4e8e6a66aa8d4365bb61cf7ef4017620fc09b9" } } }, "nbformat": 4, "nbformat_minor": 2 }