File Formats

Pandas practitioners often use CSV files to ingest data for processing. Although this is common practice, it is rarely the best choice. In this section, we’ll explain why Parquet files are more often used with distributed computing frameworks such as Spark, Dask, and Ray. Even when the data is still small, there are benefits to using Parquet.

CSVs Don’t Hold Schema Information

The first major downside is that CSVs do not hold schema information, so Pandas has to infer column types during loading. This is why many Pandas users have to convert types after loading the data. For example, a boolean column stored as 0 and 1 is loaded as an integer column (int64 by default), which uses eight bytes per value instead of one.

Another common case is a column, such as a date column, that is loaded as strings and needs conversion before any processing is done.

To get past this, Pandas users often write a function that changes all the types after loading the data. Sometimes, this function ends up copied across multiple files.
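The problem can be sketched with a small in-memory CSV. The column names and the `fix_types` helper are hypothetical, chosen only for illustration; the point is that the flag arrives as an integer, the date does not arrive as a datetime, and a hand-written function has to repair both.

```python
import io

import pandas as pd

# A small CSV with a 0/1 flag column and a date column (illustrative data).
csv_text = """id,active,signup
1,1,2021-01-01
2,0,2021-02-15
3,1,2021-03-30
"""

# read_csv has no schema to consult, so it infers the types from the text.
loaded = pd.read_csv(io.StringIO(csv_text))
print(loaded.dtypes)  # "active" is an integer, "signup" is not a datetime


def fix_types(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: restore the intended types after loading."""
    return df.assign(
        active=df["active"].astype(bool),
        signup=pd.to_datetime(df["signup"]),
    )


typed = fix_types(loaded)
print(typed.dtypes)

# The boolean column also takes less memory than the inferred integer column.
print(loaded["active"].memory_usage(index=False))
print(typed["active"].memory_usage(index=False))
```

Every script that reads this CSV needs its own copy of `fix_types`, or an import of it, to see the intended types.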

On the other hand, Parquet stores the schema in the file itself, which makes it easy to share data across a team. It also eliminates the need for schema inference.

Compression

Parquet also compresses significantly better. Parquet files tend to be around one-fifth the size of the equivalent CSV files, although the exact ratio depends on the data and the compression codec.
