DataComPy#

Have questions? Chat with us on Github or Slack:


DataComPy is an open-source package by Capital One that started as a way to compare two Pandas DataFrames with more functionality than pandas.DataFrame.equals(pandas.DataFrame). It lets users specify absolute and relative tolerances and prints out comparison statistics.

Fugue is now an internal dependency of DataComPy, which extends the functionality to the backends Fugue supports (Spark, Dask, Ray, Polars, DuckDB, Arrow, etc.). A common use case is comparing a Pandas DataFrame with a distributed DataFrame (Spark, Dask, or Ray).

Basic Usage#

The most scalable way to use DataComPy is the is_match method. An example can be found below: the DataFrames are joined on the acct_id column and then compared. There are other supported operations not covered here. For more details, check the DataComPy documentation.

from io import StringIO
import pandas as pd
import datacompy

data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001239,1.05,Lucille Bluth,,2017-01-01
"""

data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))

datacompy.is_match(
    df1,
    df2,
    join_columns='acct_id',  # you can also pass a list of columns
    abs_tol=0,               # optional, defaults to 0
    rel_tol=0,               # optional, defaults to 0
    df1_name='Original',     # optional, defaults to 'df1'
    df2_name='New'           # optional, defaults to 'df2'
)
False
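Before relying on the tolerance arguments, it helps to see what they mean for a single pair of values. The sketch below is a rough, pandas-only approximation of the per-value check (within_tol is a hypothetical helper, not part of the datacompy API): two numbers match when |a - b| <= abs_tol + rel_tol * |b|, and two nulls compare equal.

```python
import pandas as pd

# Hypothetical helper (not part of the datacompy API) sketching the
# tolerance check: values match when |a - b| <= abs_tol + rel_tol * |b|.
def within_tol(a, b, abs_tol=0.0, rel_tol=0.0):
    if pd.isna(a) and pd.isna(b):
        return True   # two nulls are treated as equal
    if pd.isna(a) or pd.isna(b):
        return False
    return abs(a - b) <= abs_tol + rel_tol * abs(b)

print(within_tol(123.45, 123.4))               # False: exact comparison
print(within_tol(123.45, 123.4, abs_tol=0.1))  # True: within absolute tolerance
```

With abs_tol=0.1, the dollar_amt difference in the example data (0.05) would pass, but the frames above would still not match overall because of the name and float_fld differences.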

Report Generation#

For a full report, use the report function. The report is truncated in this notebook because the output is long.

# This method prints out a human-readable report summarizing and sampling differences
print(datacompy.report(
    df1,
    df2,
    join_columns='acct_id',  # you can also pass a list of columns
    abs_tol=0,               # optional, defaults to 0
    rel_tol=0,               # optional, defaults to 0
    df1_name='Original',     # optional, defaults to 'df1'
    df2_name='New'           # optional, defaults to 'df2'
))
DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0  Original        5     5
1       New        4     5

Column Summary
--------------

Number of columns in common: 4
Number of columns in Original but not in New: 1
Number of columns in New but not in Original: 0

Row Summary
-----------

Matched on: acct_id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 4
Number of rows in Original but not in New: 1
Number of rows in New but not in Original: 1

Number of rows with some compared columns unequal: 4
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 6

Columns with Unequal Values or Types
------------------------------------

       Column Original dtype New dtype  # Unequal  Max Diff  # Null Diff
0  dollar_amt        float64   float64          1    0.0500            0
2   float_fld        float64   float64          3    0.0005            2
1        name         object    object          2    0.0000            0

Sample Rows with Unequal Values
-------------------------------

       acct_id  dollar_amt (Original)  dollar_amt (New)
0  10000001234                 123.45             123.4

       acct_id name (Original)            name (New)
0  10000001237      Bob Loblaw         Robert Loblaw
1  10000001234  George Maharis  George Michael Bluth

       acct_id  float_fld (Original)  float_fld (New)
0  10000001234            14530.1555        14530.155
1  10000001236                   NaN            1.000
2  10000001235                1.0000              NaN

Sample Rows Only in Original (First 10 Columns)
-----------------------------------------------

       acct_id  dollar_amt           name  float_fld    date_fld
0  10000001239        1.05  Lucille Bluth        NaN  2017-01-01

Sample Rows Only in New (First 10 Columns)
------------------------------------------

       acct_id  dollar_amt              name  float_fld
0  10000001238        1.05  Loose Seal Bluth      111.0
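The Row Summary counts above can be reproduced with plain pandas (a minimal sketch, not DataComPy's internals): an outer merge on the join column with indicator=True labels each row as present in both frames, only in the original, or only in the new one.

```python
import pandas as pd

# Same acct_id keys as the example frames above
df1 = pd.DataFrame({"acct_id": [10000001234, 10000001235, 10000001236,
                                10000001237, 10000001239]})
df2 = pd.DataFrame({"acct_id": [10000001234, 10000001235, 10000001236,
                                10000001237, 10000001238]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
merged = df1.merge(df2, on="acct_id", how="outer", indicator=True)
print((merged["_merge"] == "both").sum())        # rows in common: 4
print((merged["_merge"] == "left_only").sum())   # only in Original: 1
print((merged["_merge"] == "right_only").sum())  # only in New: 1
```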

Distributed Usage#

To compare DataFrames across backends, just pass in DataFrames of the backends you have: Pandas DataFrames, DuckDB relations, Polars DataFrames, Arrow tables, Spark DataFrames, Dask DataFrames, or Ray Datasets. For example, to compare a Pandas DataFrame with a Spark DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df2 = spark.createDataFrame(df2)
datacompy.is_match(
    df1,
    spark_df2,
    join_columns='acct_id',
)

Notice that in order to use a specific backend, you need to have the corresponding library installed. For example, if you want to compare Ray Datasets, you must do:

pip install datacompy[ray]

Conclusion#

Here we introduced the DataComPy library and showed how to compare DataFrames across Spark, Dask, Ray, DuckDB, PyArrow, and Polars through Fugue.