{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# transform() Function\n", "\n", "Have questions? Chat with us on GitHub or Slack:\n", "\n", "[GitHub](https://github.com/fugue-project/fugue)\n", "[Slack](http://slack.fugue.ai)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas is great for small datasets, but unfortunately it does not scale well to large datasets. The primary reason is that Pandas is single-core and does not take advantage of all available computing resources. Many operations also generate [intermediate copies](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html#scaling-to-large-datasets) of data, using more memory than necessary. To handle data effectively with Pandas, users ideally need [5 to 10 times](https://wesmckinney.com/blog/apache-arrow-pandas-internals/) as much RAM as the size of the dataset.\n", "\n", "[Spark](https://spark.apache.org/) and [Dask](https://dask.org/) allow us to split computing jobs across multiple machines. They can also handle datasets that don’t fit in memory by spilling data to disk in some cases. Ultimately, though, moving to Spark or Dask still requires significant code changes to port existing Pandas code, and beyond the code changes, using these frameworks effectively demands a lot of specialized knowledge. [Ray](https://www.ray.io/) is a newer engine seeing increased adoption. How can we avoid being locked into one framework so that we keep the flexibility to switch in the future?\n", "\n", "**Fugue is a framework designed to unify the interface between Pandas, Spark, Dask, and Ray, allowing one codebase to be used across all compute engines.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fugue `transform()`\n", "\n", "The simplest way to use Fugue to scale Pandas-based code to Spark, Dask, or Ray is the `transform()` function.
In the example below, we'll train a scikit-learn model on a Pandas DataFrame and then run the model predictions in parallel on the Spark execution engine." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.linear_model import LinearRegression\n", "\n", "X = pd.DataFrame({\"x_1\": [1, 1, 2, 2], \"x_2\": [1, 2, 2, 3]})\n", "y = np.dot(X, np.array([1, 2])) + 3\n", "reg = LinearRegression().fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After training our model, we wrap it in a `predict()` function. This function is still written in Pandas, so we can easily test it on the `input_df` that we create, and wrapping it is what will let us bring it to Spark. Type hints are a Fugue requirement; we'll discuss them in more detail in later sections." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>x_1</th>\n", "      <th>x_2</th>\n", "      <th>predicted</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>3</td>\n", "      <td>3</td>\n", "      <td>12.0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>4</td>\n", "      <td>3</td>\n", "      <td>13.0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>6</td>\n", "      <td>6</td>\n", "      <td>21.0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>6</td>\n", "      <td>6</td>\n", "      <td>21.0</td>\n", "    </tr>\n", "
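The source of the `predict()` cell is cut off in this export. Below is a minimal sketch of what a Pandas-in, Pandas-out wrapper consistent with the surrounding text could look like; the `input_df` values are taken from the output table above, and the exact original cell may differ:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Train the model as in the notebook cell above
X = pd.DataFrame({"x_1": [1, 1, 2, 2], "x_2": [1, 2, 2, 3]})
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)

# Pandas-in, Pandas-out wrapper with type hints (a Fugue requirement).
# This is a sketch; the truncated original cell may differ.
def predict(df: pd.DataFrame, model: LinearRegression) -> pd.DataFrame:
    return df.assign(predicted=model.predict(df))

# Test locally on a small Pandas DataFrame before bringing it to Spark.
# These inputs reproduce the predictions shown in the output table.
input_df = pd.DataFrame({"x_1": [3, 4, 6, 6], "x_2": [3, 3, 6, 6]})
result = predict(input_df, reg)
print(result)
```

Because the wrapper takes and returns plain Pandas DataFrames, it can be tested without any cluster; Fugue's `transform()` can then run the same function on Spark by supplying the function, an output schema, and an execution engine, as covered in the sections that follow.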