Extensions#

Extensions are Python functions that are wrapped so they can be executed inside %%fsql cells. They are needed to implement custom logic in SQL workflows.

Creator#

Creators are functions that generate a DataFrame. The example below contains all syntax variations. The schema needs to be specified either in the Python code or in the SQL query. A pandas DataFrame already carries a schema, so it does not need to be passed. The default LOAD is an example of a Creator.

A common use case for a Creator is reading from a different data source, such as MongoDB Atlas or AWS S3, as sketched below.
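As a rough sketch of that pattern, the Creator below reads a CSV file from S3 with pandas. The bucket path is a hypothetical placeholder, and s3fs must be installed for pandas to read s3:// paths. Since it returns a pandas DataFrame, no schema hint is needed.

import pandas as pd

def load_from_s3(path: str = "s3://my-bucket/data.csv") -> pd.DataFrame:
    # hypothetical path; pandas reads s3:// URLs directly when s3fs is installed
    return pd.read_csv(path)

It could then be invoked in a workflow with CREATE USING load_from_s3(path="s3://my-bucket/data.csv").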

Read more about Creators

from fugue_jupyter import setup
setup()  # registers the %%fsql cell magic
from typing import List, Any
import pandas as pd

def create1(n=1) -> pd.DataFrame:
    # returns a pandas DataFrame, which carries its own schema
    return pd.DataFrame([[n]], columns=["a"])

# schema: a:int
def create2(n=1) -> List[List[Any]]:
    # the schema hint comment above provides the output schema
    return [[n]]

def create3(n=1) -> List[List[Any]]:
    # no schema hint, so SCHEMA must be supplied in the SQL query
    return [[n]]
%%fsql
CREATE [[0,"hello"],[1,"world"]] SCHEMA a:int,b:str
PRINT
CREATE USING create1 
PRINT
CREATE USING create2(n=3) 
PRINT
CREATE USING create3(n=4) SCHEMA a:int
PRINT
PRINT
   a      b
0  0  hello
1  1  world
schema: a:int,b:str

   a
0  1
schema: a:long

   a
0  3
schema: a:int

   a
0  4
schema: a:int

   a
0  4
schema: a:int

Outputter#

Outputters are functions that either write out DataFrames or display them. The default SAVE and PRINT are examples of Outputters. They do not return anything. They are invoked in SQL using the OUTPUT keyword.

PREPARTITION can be used along with Outputters to apply the logic on each partition. This is only possible if the Outputter class interface is used to define the extension.
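For reference, a class-based Outputter implements a process method that receives all input DataFrames and returns nothing. Below is a minimal sketch (PrintRowCount is a hypothetical name) that prints the row count of each input:

from fugue import DataFrames, Outputter

class PrintRowCount(Outputter):
    def process(self, dfs: DataFrames) -> None:
        # dfs is the collection of all DataFrames passed to OUTPUT
        for df in dfs.values():
            print(df.as_pandas().shape[0])

With the class interface, the partitioning information is available inside process, which is how PREPARTITION can take effect.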

Read more about Outputters

def output(df:pd.DataFrame, n=1) -> None:
    print(n)
    print(df)
%%fsql
a=CREATE [[0]] SCHEMA a:int
OUTPUT a USING output(n=2)
OUTPUT PREPARTITION BY a USING output
2
   a
0  0
1
   a
0  0

Processor#

Processors take in one or more DataFrames and output a single DataFrame. Similar to the Outputter, the SQL PROCESS keyword can be used in conjunction with PREPARTITION, but only if the Processor class interface was used to define the Processor.
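A class-based Processor is similar: its process method receives a DataFrames collection and must return a single DataFrame. A minimal sketch (ConcatProcessor is a hypothetical name) that concatenates its inputs:

import pandas as pd
from fugue import DataFrame, DataFrames, PandasDataFrame, Processor

class ConcatProcessor(Processor):
    def process(self, dfs: DataFrames) -> DataFrame:
        # convert each input to pandas, concatenate, and wrap the result
        parts = [df.as_pandas() for df in dfs.values()]
        return PandasDataFrame(pd.concat(parts).reset_index(drop=True))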

Read more about Processors

def concat(df1:pd.DataFrame, df2:pd.DataFrame) -> pd.DataFrame:
    return pd.concat([df1,df2]).reset_index(drop=True)
%%fsql
a = CREATE [[0,"1"]] SCHEMA a:int,b:str
b = CREATE [[1,"2"]] SCHEMA a:int,b:str
PROCESS a,b USING concat
PRINT
   a  b
0  0  1
1  1  2
schema: a:int,b:str

Transformer#

Transformers are the most commonly used extension. They take in one DataFrame and output one DataFrame. Transformers have appeared in the previous tutorials and can be used with PREPARTITION to apply the logic to each partition.

Read more about Transformers

data = [
    ["A", "2020-01-01", 10],
    ["A", "2020-01-02", None],
    ["A", "2020-01-03", 30],
    ["B", "2020-01-01", 20],
    ["B", "2020-01-02", None],
    ["B", "2020-01-03", 40]
]
df = pd.DataFrame(data, columns=["id", "date", "value"])

# schema: *, shift:double
def shift(df: pd.DataFrame) -> pd.DataFrame:
    # add a column holding the previous row's value
    df['shift'] = df['value'].shift()
    return df
%%fsql
a = SELECT * FROM df
TRANSFORM a PREPARTITION BY id PRESORT date DESC USING shift
PRINT
TRANSFORM a USING shift    # default partition
PRINT
  id        date  value  shift
0  A  2020-01-03   30.0    NaN
1  A  2020-01-02    NaN   30.0
2  A  2020-01-01   10.0    NaN
3  B  2020-01-03   40.0    NaN
4  B  2020-01-02    NaN   40.0
5  B  2020-01-01   20.0    NaN
schema: id:str,date:str,value:double,shift:double

  id        date  value  shift
0  A  2020-01-01   10.0    NaN
1  A  2020-01-02    NaN   10.0
2  A  2020-01-03   30.0    NaN
3  B  2020-01-01   20.0   30.0
4  B  2020-01-02    NaN   20.0
5  B  2020-01-03   40.0    NaN
schema: id:str,date:str,value:double,shift:double

Spark may give inconsistent results when using TRANSFORM without PREPARTITION because the default partitions are used. Also note that row order is not guaranteed in a distributed environment unless explicitly specified. PREPARTITION can also be used without a PRESORT, as sketched below.
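For example, the same Transformer can be applied with partitioning but no sort (a sketch; the row order within each partition is then whatever the engine happens to provide):

%%fsql
TRANSFORM df PREPARTITION BY id USING shift
PRINT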

CoTransformer#

CoTransformers are used on multiple DataFrames that are partitioned in the same way. The data is joined with an INNER JOIN by default, but the join type can be specified. In FugueSQL, TRANSFORM and ZIP are used together to apply the CoTransformer.

Read more about CoTransformers

from fugue import DataFrames

# schema: res:[str]
def to_str_with_key(dfs: DataFrames) -> List[List[Any]]:
    # return one row whose single cell lists each key with its rows
    return [[[k + " " + repr(x.as_array()) for k, x in dfs.items()]]]
%%fsql
df1 = CREATE [[0,1],[1,3]] SCHEMA a:int,b:int
df2 = CREATE [[0,4],[2,2]] SCHEMA a:int,c:int
df3 = CREATE [[0,2],[1,1],[1,5]] SCHEMA a:int,d:int

TRANSFORM (ZIP df1,df2,df3) USING to_str_with_key
PRINT

TRANSFORM (ZIP a=df1,b=df2,c=df3 LEFT OUTER BY a PRESORT b DESC) USING to_str_with_key
PRINT
res
0  [_0 [[0, 1]], _1 [[0, 4]], _2 [[0, 2]]]
schema: res:[str]

res
0  [a [[0, 1]], b [[0, 4]], c [[0, 2]]]
1  [a [[1, 3]], b [], c [[1, 1], [1, 5]]]
schema: res:[str]