I was looking for a way to compare huge datasets, database against database, and I finally found one.
What is Vaex?
Vaex is a high-performance Python library for lazy, out-of-core DataFrames (similar to pandas), built to visualize and explore big tabular datasets. According to its documentation, it can calculate basic statistics for more than a billion rows per second. It also supports several kinds of visualization, which allows interactive exploration of big data.
How do you use Vaex to compare files?
First, install the library with pip install vaex. Once that completes, you are good to go. Rather than reading through the whole Vaex documentation, I first tried a shortcut: reusing functions I had already written with pandas to compare files (assert, equal, etc.). When I ran the program, it failed with an error saying that set_index cannot be set for the dataframe. So I went through the documentation in detail and found the compare function. I wanted to try it, but I couldn't find any information about it on the internet, so I experimented on my own, and the good news is that it worked.
Once Vaex is imported, just three lines of code are enough to compare huge files.
import vaex
vxsrcdf = vaex.read_csv("ownership.csv", copy_index=False, low_memory=True, encoding="ISO-8859-1")  # source file
vxtrgdf = vaex.read_csv("ownership1.csv", copy_index=False, low_memory=True, encoding="ISO-8859-1")  # target file
vxsrcdf.compare(vxtrgdf, report_missing=True, report_difference=True, show=10, orderby=None, column_names=None)  # compare source against target
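To make it concrete what a column-by-column comparison of two files reports, here is a minimal sketch that uses only the Python standard library. It is not how Vaex implements compare, and it is much slower on big files. The function name compare_csv, its show cap, the returned dictionary, and the sample rows are all hypothetical, made up for illustration. It streams both files row by row, so neither file has to fit in memory, and it assumes the rows are already in the same order in both files.

```python
import csv
import io
from itertools import zip_longest


def compare_csv(src, trg, show=10):
    """Stream two CSV sources row by row and report how they differ.

    Hypothetical helper for illustration only; it is not part of Vaex.
    `src` and `trg` are open text files (or any iterables of lines).
    """
    src_reader = csv.DictReader(src)
    trg_reader = csv.DictReader(trg)
    src_cols = list(src_reader.fieldnames or [])
    trg_cols = list(trg_reader.fieldnames or [])

    # Columns that exist in only one of the two files.
    missing = [c for c in src_cols if c not in trg_cols]
    missing += [c for c in trg_cols if c not in src_cols]
    shared = [c for c in src_cols if c in trg_cols]

    different = {}   # column name -> up to `show` samples of (row, src value, trg value)
    extra_rows = 0   # rows present in the longer file only

    for row_number, (s, t) in enumerate(zip_longest(src_reader, trg_reader), start=1):
        if s is None or t is None:
            extra_rows += 1
            continue
        for col in shared:
            if s[col] != t[col]:
                samples = different.setdefault(col, [])
                if len(samples) < show:
                    samples.append((row_number, s[col], t[col]))

    return {"missing": missing, "different": different, "extra_rows": extra_rows}


# Tiny in-memory stand-ins for the source and target files (made-up data).
source = io.StringIO("id,owner,share\n1,Ann,50\n2,Bob,30\n3,Cy,20\n")
target = io.StringIO("id,owner,share,note\n1,Ann,50,a\n2,Bob,35,b\n3,Cyd,20,c\n")

report = compare_csv(source, target)
print(report)
```

With real files you would pass file objects instead of the StringIO stand-ins, for example open("ownership.csv", newline="", encoding="ISO-8859-1"). For truly huge files the Vaex version above is the better choice; this sketch only shows the kind of information a comparison returns: columns missing on one side and sample values that differ.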