{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# transform() Function\n", "\n", "Have questions? Chat with us on GitHub or Slack:\n", "\n", "[GitHub](https://github.com/fugue-project/fugue)\n", "[Slack](http://slack.fugue.ai)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas is great for small datasets, but unfortunately it does not scale well to large datasets. The primary reason is that Pandas is single-core and does not take advantage of all available computing resources. Many operations also generate [intermediate copies](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html#scaling-to-large-datasets) of data, using more memory than necessary. To handle data effectively with Pandas, users ideally need [5 to 10 times](https://wesmckinney.com/blog/apache-arrow-pandas-internals/) as much RAM as the size of the dataset.\n", "\n", "[Spark](https://spark.apache.org/) and [Dask](https://dask.org/) allow us to split computing jobs across multiple machines. They can also handle datasets that don’t fit in memory by spilling data to disk in some cases. Ultimately, though, moving to Spark or Dask still requires significant code changes to port existing Pandas code, and beyond the code changes, using these frameworks effectively demands a lot of specialized knowledge. [Ray](https://www.ray.io/) is a newer engine seeing increased adoption. How can we avoid being locked into one framework so that we keep the flexibility to switch in the future?\n", "\n", "**Fugue is a framework designed to unify the interface between Pandas, Spark, Dask, and Ray, allowing one codebase to be used across all compute engines.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fugue `transform()`\n", "\n", "The simplest way to use Fugue to scale Pandas-based code to Spark, Dask, or Ray is the `transform()` function.
In the example below, we'll train a scikit-learn model on a Pandas DataFrame and then run the model predictions in parallel on the Spark execution engine." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.linear_model import LinearRegression\n", "\n", "X = pd.DataFrame({\"x_1\": [1, 1, 2, 2], \"x_2\": [1, 2, 2, 3]})\n", "y = np.dot(X, np.array([1, 2])) + 3\n", "reg = LinearRegression().fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After training our model, we wrap it in a `predict()` function. This function is still written in Pandas, so we can easily test it on the `input_df` that we create, and wrapping it is what will let us bring it to Spark. Type hints are a Fugue requirement; we'll discuss them in more detail in later sections." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>x_1</th>\n", "      <th>x_2</th>\n", "      <th>predicted</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>3</td>\n", "      <td>3</td>\n", "      <td>12.0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>4</td>\n", "      <td>3</td>\n", "      <td>13.0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>6</td>\n", "      <td>6</td>\n", "      <td>21.0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>6</td>\n", "      <td>6</td>\n", "      <td>21.0</td>\n", "    </tr>\n", "
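The source of the `predict()` cell is cut off in this export. Below is a minimal sketch of what a Pandas-in, Pandas-out wrapper consistent with the surrounding text could look like; the `input_df` values are taken from the output table above, and the exact original cell may differ:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Train the model as in the notebook cell above
X = pd.DataFrame({"x_1": [1, 1, 2, 2], "x_2": [1, 2, 2, 3]})
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)

# Pandas-in, Pandas-out wrapper with type hints (a Fugue requirement).
# This is a sketch; the truncated original cell may differ.
def predict(df: pd.DataFrame, model: LinearRegression) -> pd.DataFrame:
    return df.assign(predicted=model.predict(df))

# Test locally on a small Pandas DataFrame before bringing it to Spark.
# These inputs reproduce the predictions shown in the output table.
input_df = pd.DataFrame({"x_1": [3, 4, 6, 6], "x_2": [3, 3, 6, 6]})
result = predict(input_df, reg)
print(result)
```

Because the wrapper takes and returns plain Pandas DataFrames, it can be tested without any cluster; Fugue's `transform()` can then run the same function on Spark by supplying the function, an output schema, and an execution engine, as covered in the sections that follow.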