Getting Started

All questions are welcome in the Slack channel.

Fugue is an abstraction framework that lets users write code in native Python or pandas, and then port it to Spark or Dask. This section covers the motivation behind Fugue, the benefits of using an abstraction layer, and how to get started. It is not a complete reference, but it will be sufficient to start writing full workflows in Fugue.

Introduction

We’ll get started by introducing Fugue and the simplest way to use it with the transform() function. The transform() function can take in a Python or pandas function and scale it out in Spark or Dask without having to modify the function. This provides a very simple interface to parallelize Python and pandas code on distributed compute engines.
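
Below is a minimal sketch of what this looks like, assuming a recent Fugue version; the add_one function and column names are made up for illustration:

```python
import pandas as pd
from fugue import transform

def add_one(df: pd.DataFrame) -> pd.DataFrame:
    # plain pandas logic, with no Fugue or Spark imports needed
    return df.assign(b=df["a"] + 1)

df = pd.DataFrame({"a": [1, 2, 3]})

# runs on pandas by default
result = transform(df, add_one, schema="*, b:int")

# the same call scales out by supplying an engine (e.g. a SparkSession)
# result = transform(df, add_one, schema="*, b:int", engine=spark_session)
```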

Type Flexibility

After seeing an example of the transform() function, we look at the further flexibility Fugue provides by accepting functions with different input and output types. This allows users to define their logic in whatever way makes the most sense, and to bring native Python functions to Spark or Dask.
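
For example, the same transform() call can accept a function written purely with native Python types. This sketch reuses the df and the transform import from the previous example; the double function is illustrative:

```python
from typing import Any, Dict, Iterable, List

# no pandas here: input rows arrive as dicts and output rows are yielded as dicts
def double(df: List[Dict[str, Any]]) -> Iterable[Dict[str, Any]]:
    for row in df:
        row["b"] = row["a"] * 2
        yield row

# Fugue converts to and from the annotated types, on any engine
result = transform(df, double, schema="*, b:int")
```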

Partitioning

Now that we have seen how functions can be written for Fugue to bring them to Spark or Dask, we look at how the transform() function can be applied with partitioning. In pandas semantics, this is the equivalent of a groupby-apply. The difference is that partitioning is a core concept in distributed computing, as it controls both the logical and physical grouping of data.
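
As a sketch of partitioned execution (the column names and the largest function are illustrative):

```python
import pandas as pd
from fugue import transform

def largest(df: pd.DataFrame) -> pd.DataFrame:
    # called once per partition, i.e. once per group
    return df.nlargest(1, "value")

df = pd.DataFrame({"group": ["x", "x", "y"], "value": [1, 2, 3]})

# similar in spirit to df.groupby("group").apply(largest)
result = transform(df, largest, schema="*", partition={"by": "group"})
```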

Decoupling Logic and Execution

After seeing how the transform() function enables the use of Python and pandas code on Spark, we’ll see how we can apply this same principle to entire compute workflows using FugueWorkflow. We’ll show how Fugue allows users to decouple logic from execution, and introduce some of the benefits this provides. We’ll go one step further and show how to use native Python to make our code truly independent of any framework.
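
As a rough sketch (add_one is the function from the earlier example, and how the engine is supplied can vary across Fugue versions):

```python
import pandas as pd
from fugue import FugueWorkflow

# the workflow is only constructed here; it executes when the context exits,
# on pandas by default, or on Spark/Dask when an execution engine is passed
# to FugueWorkflow()
with FugueWorkflow() as dag:
    df = dag.df(pd.DataFrame({"a": [1, 2, 3]}))
    df = df.transform(add_one, schema="*, b:int")
    df.show()
```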

Fugue Interface

In this section we’ll start covering concepts like the Directed Acyclic Graph (DAG) and the need for explicit schemas in a distributed computing environment. We’ll show how to pass parameters to transformers, as well as how to load and save data. With these, users will be able to start doing basic work on data through Fugue.
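
A small sketch of these pieces together; the file paths and the multiply function are hypothetical:

```python
import pandas as pd
from fugue import FugueWorkflow

def multiply(df: pd.DataFrame, n: int) -> pd.DataFrame:
    # n comes in through params rather than being hardcoded
    return df.assign(a=df["a"] * n)

with FugueWorkflow() as dag:
    df = dag.load("/tmp/input.parquet")  # hypothetical path
    df = df.transform(multiply, schema="*", params={"n": 3})
    df.save("/tmp/output.parquet")  # hypothetical path
```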

Joining Data

Here we’ll show the different ways to join DataFrames in Fugue, along with union, intersect, and except. SQL and pandas have some inconsistencies that users need to be aware of when joining. Fugue maintains consistency with SQL (and Spark).
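
For instance (the column names are illustrative, and the join key is inferred from the common column):

```python
import pandas as pd
from fugue import FugueWorkflow

with FugueWorkflow() as dag:
    a = dag.df(pd.DataFrame({"id": [1, 2], "x": [10, 20]}))
    b = dag.df(pd.DataFrame({"id": [1, 3], "y": [30, 40]}))
    # how also accepts e.g. "left_outer", "right_outer", "full_outer", "cross"
    a.join(b, how="inner").show()
```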

Extensions

We already covered the transformer, the most commonly used Fugue extension. Extensions are Fugue operations on DataFrames that are used inside the DAG. Here we will cover the creator, processor, cotransformer, and outputter.
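
As a brief sketch of two of these (the function names are illustrative):

```python
import pandas as pd
from fugue import FugueWorkflow

# creator: produces a DataFrame from nothing (e.g. generating or loading data)
def create() -> pd.DataFrame:
    return pd.DataFrame({"a": [1, 2, 3]})

# outputter: consumes a DataFrame for a side effect and returns nothing
def print_count(df: pd.DataFrame) -> None:
    print(len(df))

with FugueWorkflow() as dag:
    df = dag.create(create, schema="a:int")
    df.output(print_count)
```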

Distributed Compute

The heart of Fugue is distributed compute. In this section we’ll show the keywords and concepts that allow Fugue to fully utilize the power of distributed compute. This includes partitions, persisting, and broadcasting.
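
A hedged sketch of persist and broadcast inside a workflow; the paths are hypothetical and the join key is assumed to be a shared column:

```python
from fugue import FugueWorkflow

with FugueWorkflow() as dag:
    big = dag.load("/tmp/big.parquet")  # hypothetical path
    small = dag.load("/tmp/small.parquet")  # hypothetical path

    # cache a result that is reused downstream instead of recomputing it
    big = big.persist()

    # mark a small DataFrame to be sent to all workers to speed up the join
    small = small.broadcast()

    big.join(small, how="inner").show()
```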

Fugue-SQL

We’ll show a bit of Fugue-SQL, the SQL interface for using Fugue. It is targeted at heavy SQL users and SQL lovers who want to use SQL on top of Spark, Dask, or even pandas. This is SQL run against DataFrames in memory, as opposed to data in databases.
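
A small taste, assuming a recent Fugue version; fsql picks up DataFrames from the surrounding scope by variable name:

```python
import pandas as pd
from fugue_sql import fsql

df = pd.DataFrame({"a": [1, 2, 3]})

# run() executes on pandas by default; passing an engine
# (e.g. a SparkSession) runs the same query on Spark
fsql("""
SELECT a, a + 1 AS b
FROM df
PRINT
""").run()
```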

With that, you should be ready to implement data workflows using Fugue.

Ibis Integration (Experimental)

As a last note, we’ll show a nice addition to Fugue. The Ibis project provides a very Pythonic way to express SQL logic, and it is also an abstraction layer that can run on different backends. The two abstractions work together seamlessly, so users can take advantage of both.

For full end-to-end examples, check out the Stock Sentiment and COVID-19 examples.

For any questions, feel free to join the Slack channel.