Deep Dive#

All questions are welcome in the Slack channel.

Slack Status

This section is not needed to create end-to-end workflows with Fugue, but it will help give a better understanding of the features available. In some cases, applying these concepts may significantly improve performance.

Since you already have experience in Spark or distributed computing in general, you may be interested in the extra values Fugue can add.

Architecture#

Fugue Configurations (MUST READ)#

These configurations can have significant impact on building and running the Fugue workflows.

Execution Engine#

The heart of Fugue. It is the layer that unifies many of the core concepts of distributed computing, and separates the underlying computing frameworks from user level logic. Normally you don’t directly interact with execution engines. But it’s good to understand some basics.

Validation#

Fugue applies input validation.

Data Type, Schema & DataFrames#

Fugue data types and schema are strictly based on Apache Arrow. Dataframe is an abstract concept with several built-in implementations to adapt to different dataframes. In this tutorial, we will go through the basic APIs and focus on the most common use cases.

Partition (MUST READ)#

This tutorial is more focused on explaining the basic ideas of data partitioning. It’s less related with Fugue. To have a good understanding of partition is the key for writing high performance code.

Callbacks From Transformers To Driver#

You can provide a callback function to any transformer, to communicate with driver while running

X-Like Objects Initialization#

You may often see -like objects in Fugue API document, here is a complete list of these objects and their ways to initialize.