This section covers best practices for distributed computing in general rather than the Fugue framework itself. One of the hardest parts of moving from small data to big data is the change in mindset. Here, we go over these practices and explain how to make full use of distributed computing.
This section explains the differences between CSV files and Parquet files, and why Parquet files are better suited to big data jobs.
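One of these differences can be seen without any big data tooling. The sketch below is an illustration using only Python's standard library, not code from the Fugue tutorials, and the sample rows and column names are hypothetical. It writes typed values to CSV and reads them back to show that CSV carries no schema, so every value comes back as a string. Parquet, by contrast, stores the schema in the file itself and lays data out by column.

```python
import csv
import io

# Hypothetical sample rows; the column names are made up for illustration.
rows = [
    {"id": 1, "price": 9.5, "active": True},
    {"id": 2, "price": 12.0, "active": False},
]

# Write the rows to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "price", "active"])
writer.writeheader()
writer.writerows(rows)

# Read the rows back. CSV stores no type information,
# so every value is returned as a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))

original_types = {k: type(v).__name__ for k, v in rows[0].items()}
loaded_types = {k: type(v).__name__ for k, v in loaded[0].items()}

print(original_types)  # types before the round trip
print(loaded_types)    # types after the round trip
```

In a distributed job, this means a CSV reader has to either infer types by scanning the data or be given a schema explicitly, which is work a Parquet reader can skip.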
Why Fugue is Not Pandas-like
There are other libraries that promise to distribute Pandas just by changing the import statement. In this section, we explain why Pandas-like frameworks are not meant for distributed computing.
Fugue Spark Benchmark
We show that Fugue adds minimal overhead by including it in the Databricks benchmarks.