Syntax#

FugueSQL syntax sits between standard SQL, JSON, and Python. The goals are:

  • To be fully compatible with the standard SQL SELECT statement

  • To minimize syntax overhead and keep code as short as possible while still easy to read

  • To allow users to fully describe their compute logic in SQL rather than Python

To achieve these goals, enhancements were made to the standard SQL syntax that will be demonstrated here.

Hello World#

First, we start with the basic syntax of FugueSQL. We import fugue_jupyter, which contains a Jupyter notebook extension. Fugue has both a Python interface and a SQL interface with equivalent functionality. To install fugue_jupyter, check this section.

The setup function in the cell below provides syntax highlighting for FugueSQL users. At the moment, syntax highlighting will not work for JupyterLab notebooks.

from fugue_jupyter import setup
setup()
%%fsql
CREATE [[0,"hello"],[1,"world"]] SCHEMA number:int,word:str
PRINT
number word
0 0 hello
1 1 world
schema: number:int,word:str

The CREATE keyword here is a FugueSQL keyword. We’ll dive into extensions later and learn more about integrating Python functions into FugueSQL.

SQL Compliant#

All standard SQL keywords are available in FugueSQL. In this example, SELECT, FROM, WHERE, and GROUP BY all behave the same as in standard SQL.

# Defining data
import pandas as pd
data = pd.DataFrame({"id": ["A","A","A","B","B","B"],
                    "date": ["2020-01-01", "2020-01-02",
                             "2020-01-03", "2020-01-01", 
                             "2020-01-02", "2020-01-03"],
                    "value": [10, None, 30, 20, None, 40]})
%%fsql
SELECT id, date, MIN(value) value
FROM data
WHERE value > 20
GROUP BY id, date
PRINT
id date value
0 A 2020-01-03 30.0
1 B 2020-01-03 40.0
schema: id:str,date:str,value:double

Note that the Pandas DataFrame data was accessed inside the SQL query. DataFrames defined in Python cells are automatically accessible to SQL cells. Other variables need to be passed in through Jinja templating. More on this will be shown when we explore how Python and FugueSQL interact.

The example above shows how Python and SQL workflows can be combined. This is useful when Python needs to connect to an external source (AWS S3, Azure Blob Storage, Google Analytics, etc.) to retrieve data needed for the compute workflow. The data can be loaded in with Python and passed to %%fsql cells.
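For example, a plain Python value (not a DataFrame) is not visible to a SQL cell on its own. Below is a minimal sketch of the templating approach, assuming the notebook extension substitutes {{ }} expressions from the local namespace; the cutoff variable is made up for illustration:

cutoff = 20
%%fsql
-- {{cutoff}} is replaced through Jinja templating before the query runs
SELECT id, date, value
FROM data
WHERE value > {{cutoff}}
PRINT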

Input and Output#

Real data work often requires loading and saving DataFrames, for which FugueSQL has two keywords: LOAD and SAVE. These allow FugueSQL users to orchestrate their ETL jobs with SQL logic. A CSV file can be loaded in, transformed, and then saved elsewhere. Full data analysis and transformation workflows can be done in FugueSQL.

Unless Pandas is used to create the DataFrame, the column data types (int, float, str, etc.), or SCHEMA, must be defined explicitly, because only Pandas is able to infer the schema.

%%fsql
CREATE [[0,"1"]] SCHEMA a:int,b:str
SAVE OVERWRITE "/tmp/f.parquet"
SAVE OVERWRITE "/tmp/f.csv" (header=true)
SAVE OVERWRITE "/tmp/f.json"
SAVE OVERWRITE PARQUET "/tmp/f"
%%fsql
LOAD "/tmp/f.parquet" PRINT
LOAD "/tmp/f.parquet" COLUMNS a PRINT
LOAD PARQUET "/tmp/f" PRINT
LOAD "/tmp/f.csv" (header=true) PRINT
LOAD "/tmp/f.csv" (header=true) COLUMNS a:int,b:str PRINT
LOAD "/tmp/f.json" PRINT
LOAD "/tmp/f.json" COLUMNS a:int,b:str PRINT
a b
0 0 1
schema: a:int,b:str
a
0 0
schema: a:int
a b
0 0 1
schema: a:int,b:str
a b
0 0 1
schema: a:str,b:str
a b
0 0 1
schema: a:int,b:str
a b
0 0 1
schema: a:long,b:long
a b
0 0 1
schema: a:int,b:str

JSON, CSV, Parquet, and Avro are the supported file formats. Notice that parameters can be passed; on the default execution engine these are forwarded to the corresponding Pandas functions (for example, read_csv and to_csv for CSV files). The file extension is used as a hint to choose the appropriate load/save function. If the extension is not present in the filename, the format has to be specified.
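Putting LOAD and SAVE together, a single cell can express a small ETL step. Here is a sketch using only the syntax shown above; the filter and the output path are made up for illustration:

%%fsql
-- load the headered CSV with an explicit schema, transform, and save as Parquet
src = LOAD "/tmp/f.csv" (header=true) COLUMNS a:int,b:str
SELECT a, b FROM src WHERE a >= 0
SAVE OVERWRITE PARQUET "/tmp/f_filtered"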

Variable Assignment#

By now it should be clear that Fugue extends SQL to make it a more complete language. One of the additional features is variable assignment. Along with this, multiple SELECT statements can be used in one cell. This is the equivalent of temp tables or Common Table Expressions (CTEs) in standard SQL.

df = pd.DataFrame({"number":[0,1],"word":["hello","world"]})
%%fsql
-- Note that df is used from the previous Python cell
SELECT * FROM df
SAVE OVERWRITE "/tmp/f.csv"(header=true)

temp = SELECT * FROM (LOAD "/tmp/f.csv" (header=true)) 
        WHERE number = 1
output = SELECT word FROM temp
SAVE OVERWRITE "/tmp/output.csv"(header=true)

new = LOAD "/tmp/output.csv"(header=true)
PRINT new
word
schema: word:str
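As with temp tables or CTEs, an assigned variable can be referenced by more than one later statement. A small sketch:

%%fsql
-- src feeds two downstream statements, much like a temp table
src = CREATE [[0,"hello"],[1,"world"]] SCHEMA number:int,word:str
hello = SELECT * FROM src WHERE number = 0
PRINT hello
PRINT src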

Execution Engine#

So far, we’ve only dealt with the default execution engine. If nothing is passed to %%fsql, the NativeExecutionEngine is used. Like Fugue programming interface, the execution engine can be easily changed by passing it to FugueSQLWorkflow. Below is an example for Spark.

Take note of the output DataFrame in the example below. It will be a SparkDataFrame.

%%fsql spark
SELECT *, 1 AS constant
FROM df
YIELD DATAFRAME AS df
print(type(df))
<class 'fugue_spark.dataframe.SparkDataFrame'>
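The yielded object is a Fugue DataFrame wrapper, so it can be converted back to Pandas with as_pandas() (the same method used later in this tutorial) when a local copy is needed:

# Bring the yielded SparkDataFrame back to a local Pandas DataFrame
local_df = df.as_pandas()
print(local_df.head())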

Anonymity#

In FugueSQL, one of the simplifications is anonymity. It’s optional, but it can often significantly simplify your code and make it more readable.

For a statement that only needs to consume the previous DataFrame, a FROM keyword is not needed. PRINT is the best example. SAVE is another example. This can be applied to other keywords. In this example we’ll use the TAKE keyword, which returns only the number of rows specified.

%%fsql
a = SELECT * FROM df
TAKE 2 ROWS PRESORT number DESC          -- a is consumed by TAKE
PRINT 
b = SELECT * FROM df
TAKE 2 ROWS FROM b PRESORT number DESC   -- equivalent explicit syntax
PRINT
number word constant
0 1 world 1
1 0 hello 1
schema: number:long,word:str,constant:int
number word constant
0 1 world 1
1 0 hello 1
schema: number:long,word:str,constant:int

Inline Statements#

The last enhancement is inline statements. One statement can be written inside another, in between parentheses ( ). Anonymity and variable assignment often make this unnecessary, but it’s good to know the option exists.

%%fsql
a = CREATE [[0,"hello"], [1,"world"]] SCHEMA number:int,word:str
SELECT *
FROM (TAKE 1 ROW FROM a)
PRINT
number word
0 0 hello
schema: number:int,word:str

Passing DataFrames through FugueSQL cells#

DataFrames in preceding FugueSQL cells cannot be used in later FugueSQL cells by default. To use them in downstream cells, the DataFrame needs to be yielded with YIELD DATAFRAME, as in the example below. This also makes it available in Python cells. For large DataFrames, YIELD FILE stores the DataFrame in a temporary file that is loaded back when used.

%%fsql
a=CREATE [[0,"hello"],[1,"world"]] SCHEMA number:int,word:str
YIELD DATAFRAME AS a
# Using the yielded DataFrame in Python
print(a.as_pandas().head())
   number   word
0       0  hello
1       1  world
%%fsql
b = CREATE [[0,"hello2"],[1,"world2"]] SCHEMA number:int,word2:str

SELECT a.number num, b.word2 
FROM a 
INNER JOIN b
ON a.number = b.number
PRINT
num word2
0 0 hello2
1 1 world2
schema: num:int,word2:str
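For larger DataFrames, the same pattern can use YIELD FILE instead of YIELD DATAFRAME so the result is persisted to a temporary file rather than kept in memory. A sketch with the same toy data:

%%fsql
-- persist the result to a temporary file instead of keeping it in memory
big = CREATE [[0,"hello"],[1,"world"]] SCHEMA number:int,word:str
YIELD FILE AS big

Downstream %%fsql cells can then refer to big just like the yielded DataFrame above.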

From notebooks to deployment#

While notebooks are good for data exploration and prototyping, some users want to include their FugueSQL code in Python scripts. For this, users can use the fsql function. Similar to %%fsql cells, the execution engine can be defined in the run method.

from fugue_sql import fsql

fsql("""
b = CREATE [[0,"hello2"],[1,"world2"]] SCHEMA number:int,word2:str

SELECT a.number num, b.word2 
FROM a 
INNER JOIN b
ON a.number = b.number
PRINT
""").run("spark")
num word2
0 1 world2
1 0 hello2
schema: num:int,word2:str
DataFrames()
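In a standalone script there is no notebook magic to pick up local DataFrames, so they can be handed to fsql explicitly as keyword arguments. A sketch assuming a Pandas DataFrame named my_df:

import pandas as pd
from fugue_sql import fsql

my_df = pd.DataFrame({"number": [0, 1], "word": ["hello", "world"]})

# local DataFrames passed as keyword arguments become tables in the query
fsql("""
SELECT *
FROM my_df
WHERE number = 1
PRINT
""", my_df=my_df).run()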

In this tutorial we have gone through how to use standard SQL operations (and more) on top of Pandas and Spark; the same code also runs on Dask. We have also seen enhancements over standard SQL such as anonymity and variable assignment.

In the following sections we’ll look at more ways of integrating Python with FugueSQL to extend the capabilities of SQL.