DataComPy#
DataComPy is an open-source package by Capital One that started as a way to compare two Pandas DataFrames with more functionality than Pandas.DataFrame.equals(Pandas.DataFrame) offers. It allows users to specify tolerances and reports statistics on the differences.
Fugue is now an internal dependency of DataComPy, which extends the functionality to the backends Fugue supports (Spark, Dask, Ray, Polars, DuckDB, Arrow, etc.). A common use case is comparing a Pandas DataFrame with a distributed DataFrame (Spark, Dask, or Ray).
Basic Usage#
The most scalable way to use DataComPy is the is_match
function. An example can be found below. The DataFrames are joined on the acct_id
column and then compared. There are other supported operations not covered here; for more details, check the DataComPy documentation.
from io import StringIO
import pandas as pd
import datacompy
data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001239,1.05,Lucille Bluth,,2017-01-01
"""
data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""
df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))
datacompy.is_match(
    df1,
    df2,
    join_columns='acct_id',  # You can also specify a list of columns
    abs_tol=0,               # Optional, defaults to 0
    rel_tol=0,               # Optional, defaults to 0
    df1_name='Original',     # Optional, defaults to 'df1'
    df2_name='New'           # Optional, defaults to 'df2'
)
False
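The abs_tol and rel_tol arguments control when two numeric values still count as equal. Conceptually this works like numpy.isclose: a pair of values passes when the absolute difference is within abs_tol plus rel_tol scaled by the comparison value. A small sketch (using numpy directly, with values from the sample data above) illustrates the idea:

```python
import numpy as np

orig = np.array([123.45, 14530.1555])
new = np.array([123.40, 14530.155])

# With zero tolerance, neither pair compares equal.
strict = np.isclose(orig, new, rtol=0, atol=0)

# An absolute tolerance of 0.1 lets both pairs count as equal,
# since |orig - new| <= atol + rtol * |new| for each pair.
loose = np.isclose(orig, new, rtol=0, atol=0.1)

print(strict)  # [False False]
print(loose)   # [ True  True]
```

This is why is_match returned False above: with both tolerances at 0, any numeric difference (such as 123.45 vs 123.4) counts as a mismatch.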
Report Generation#
For a full report, use the report
function. The report is truncated in this notebook because the output is long.
# This method prints out a human-readable report summarizing and sampling differences
print(datacompy.report(
    df1,
    df2,
    join_columns='acct_id',  # You can also specify a list of columns
    abs_tol=0,               # Optional, defaults to 0
    rel_tol=0,               # Optional, defaults to 0
    df1_name='Original',     # Optional, defaults to 'df1'
    df2_name='New'           # Optional, defaults to 'df2'
))
DataComPy Comparison
--------------------
DataFrame Summary
-----------------
DataFrame Columns Rows
0 Original 5 5
1 New 4 5
Column Summary
--------------
Number of columns in common: 4
Number of columns in Original but not in New: 1
Number of columns in New but not in Original: 0
Row Summary
-----------
Matched on: acct_id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 4
Number of rows in Original but not in New: 1
Number of rows in New but not in Original: 1
Number of rows with some compared columns unequal: 4
Number of rows with all compared columns equal: 0
Column Comparison
-----------------
Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 6
Columns with Unequal Values or Types
------------------------------------
Column Original dtype New dtype # Unequal Max Diff # Null Diff
0 dollar_amt float64 float64 1 0.0500 0
2 float_fld float64 float64 3 0.0005 2
1 name object object 2 0.0000 0
Sample Rows with Unequal Values
-------------------------------
acct_id dollar_amt (Original) dollar_amt (New)
0 10000001234 123.45 123.4
acct_id name (Original) name (New)
0 10000001237 Bob Loblaw Robert Loblaw
1 10000001234 George Maharis George Michael Bluth
acct_id float_fld (Original) float_fld (New)
0 10000001234 14530.1555 14530.155
1 10000001236 NaN 1.000
2 10000001235 1.0000 NaN
Sample Rows Only in Original (First 10 Columns)
-----------------------------------------------
acct_id dollar_amt name float_fld date_fld
0 10000001239 1.05 Lucille Bluth NaN 2017-01-01
Sample Rows Only in New (First 10 Columns)
------------------------------------------
acct_id dollar_amt name float_fld
0 10000001238 1.05 Loose Seal Bluth 111.0
Distributed Usage#
In order to compare DataFrames of different backends, you just need to replace df1 and df2 with DataFrames from those backends: Pandas DataFrames, DuckDB relations, Polars DataFrames, Arrow tables, Spark DataFrames, Dask DataFrames, or Ray Datasets. For example, to compare a Pandas DataFrame with a Spark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df2 = spark.createDataFrame(df2)
datacompy.is_match(
df1,
spark_df2,
join_columns='acct_id',
)
Notice that in order to use a specific backend, you need to have the corresponding library installed. For example, if you want to compare Ray Datasets, you must install the Ray extra:
pip install datacompy[ray]
Conclusion#
Here we introduced the DataComPy library and showed how to compare DataFrames across Spark, Dask, Ray, DuckDB, PyArrow, and Polars through Fugue.