Doc-Diff (Python Library)

Python is often the first choice for developers who need to apply data analysis in their work, and especially for data scientists and data engineers whose tasks center on deriving insight from data.

One of Python’s greatest assets is its extensive set of libraries. Recently, I was working with two very popular data mining algorithms (namely FP-Growth and a custom A-Priori), and I wanted a comprehensive analysis report on the results they generated.

To support this kind of Data Science work, I am introducing “doc-diff”, a library that generates the diff data between two files.

doc-diff supports the following features:

  • Generate the following comparison reports
    • common_in_doc1-and-doc2-%Y-%m-%d.csv
    • common_key_with_diff_values-%Y-%m-%d.csv
    • exclusive_in_doc1-%Y-%m-%d.csv
    • exclusive_in_doc2-%Y-%m-%d.csv
  • Compare two files and return the following dicts of (prodCode, recommendation) entries
    • common_in_doc1_and_doc2_list = dict()
    • common_key_with_diff_values_list = dict()
    • exclusive_in_doc1_list = dict()
    • exclusive_in_doc2_list = dict()
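
To make the four result sets concrete, here is a minimal sketch of how two mappings could be split into those buckets. It is not doc-diff's actual implementation: the function name `compare_dicts`, the tuple used for differing values, and the sample product codes are all assumptions made for illustration.

```python
def compare_dicts(doc1, doc2):
    """Split two key -> value mappings into four comparison buckets.

    Hypothetical helper for illustration only; not part of doc-diff.
    """
    common, changed, only_doc1 = {}, {}, {}
    for key, value in doc1.items():
        if key not in doc2:
            # Key appears only in the first document.
            only_doc1[key] = value
        elif doc2[key] == value:
            # Same key and same value in both documents.
            common[key] = value
        else:
            # Same key, different values: keep both for the report.
            changed[key] = (value, doc2[key])
    # Keys that appear only in the second document.
    only_doc2 = {k: v for k, v in doc2.items() if k not in doc1}
    return common, changed, only_doc1, only_doc2


# Made-up sample data: product code -> recommendation.
a_priori = {"P1": "R1", "P2": "R2", "P3": "R3"}
pfp = {"P1": "R1", "P2": "R9", "P4": "R4"}

common, changed, only_a_priori, only_pfp = compare_dicts(a_priori, pfp)
```

In this sketch `changed` keeps both values side by side, which is one possible way to represent a key whose recommendations disagree between the two files.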

Install

$ pip install doc-diff

Implementation

from doc_diff import Diff
from doc_diff import gen_comp_report

if __name__ == '__main__':
    # Data file location
    a_priori_csv_location = "./data/a-priori.csv"
    pfp_csv_location = "./data/pfp.csv"

    # Process a-priori.csv data file
    a_priori_diff = Diff(a_priori_csv_location)
    a_priori_diff.process_file()

    # Process pfp.csv data file
    pfp_diff = Diff(pfp_csv_location)
    pfp_diff.process_file()

    gen_comp_report(a_priori_diff, pfp_diff)
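
The report names listed in the feature overview end in a `%Y-%m-%d` date pattern. The sketch below shows one way such a dated CSV could be produced with the standard library. It is only an illustration under assumptions: `report_filename`, `write_report`, and the `prodCode`/`recommendation` header row are hypothetical and may not match what `gen_comp_report` does internally.

```python
import csv
import io
from datetime import date


def report_filename(prefix, day):
    """Build a name such as 'exclusive_in_doc1-2020-01-05.csv' (hypothetical helper)."""
    return f"{prefix}-{day:%Y-%m-%d}.csv"


def write_report(rows, out):
    """Write a key -> value mapping as CSV rows to a file-like object."""
    writer = csv.writer(out)
    # Assumed header; the real reports may use different column names.
    writer.writerow(["prodCode", "recommendation"])
    for prod_code, recommendation in sorted(rows.items()):
        writer.writerow([prod_code, recommendation])


name = report_filename("exclusive_in_doc1", date(2020, 1, 5))

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_report({"P3": "R3"}, buffer)
```

To write to disk instead, open the file with `open(name, "w", newline="")` and pass the file object to `write_report`.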

I’m looking forward to open-sourcing all of my supporting libraries for Data Science/Data Engineering work. Let me know what you think about ‘doc-diff’ in the comments below. If you want to suggest a new feature or report a problem, feel free to open an issue in the GitHub repository.