Key Improvements Made

Class-Based Architecture

All methods now have access to vendor_names, id_list, and owner_df

Utility functions are shared between draft and executed pipelines

Draft vs executed logic is cleanly separated but inherits from the same base

1.3 Configuration Section at Top

EXCEL_PATH: Set your vendor spreadsheet path once here

HEADER_ROW: Configure which row contains headers (0-indexed)

Column name variables (VENDOR_COLUMN, ID_COLUMN, OWNER_COLUMN) that can be changed in one place

1.6. Automatic Sheet Selection

Now automatically uses the first sheet in the Excel file

No need to specify sheet name in command line

1.9. Simplified Functionality

Removed all single-file processing code, focused entirely on batch processing, cleaner command-line interface

2 Improved CLI, to run:

# Preview mode (default - shows what would be renamed)
python document_renamer.py /path/to/directory --status executed

# Actually rename files
python document_renamer.py /path/to/directory --status executed --rename

# Process only PDF files
python document_renamer.py /path/to/directory --status executed --pattern "*.pdf"

# Process draft Word documents
python document_renamer.py /path/to/directory --status draft --pattern "*.docx"

Enhanced Fuzzy Matching

The PDF pipeline now has full access to fuzzy matching

Improved vendor guessing with better ignore patterns

Consistent ID matching across both pipelines

Better Error Handling & User Experience

Preview mode shows before/after comparison

Batch processing capabilities

Verbose/quiet modes for debugging

Clear error messages with suggestions

Modular Design

Each major function has a single responsibility

Easy to extend with new document types

Clear debugging output that can be toggled

Fixed the "Executed" Definition Issue

The distinction between drfat and executed is now explicit:

Draft: Uses file modification date, includes owner initials
Executed: Extracts dates from PDF content, no initials (assumes final document)

Needed improvments

The program relies on files being stored in a consistent directory. There are two variables defined to be for the fuzzy match. folder and base. folder is the file-path, split by the delimiters ('/', '-', '_' ect.) base is the file name itself, excluding the extension. If the files have the name of the vendor, use base in words. If the folder/directory has the name (which it usually does moreso than the filename) use folder in words. A more sophisticated method can be devised for more precision.

PyPDF parsing isn't bad but a method with OCR would be the most accurate. AWS's textract seems to be the standard, but that costs money. Another great, open source model is tesseract. I'll look into that.

Name		Name	Last commit message	Last commit date
Latest commit History 74 Commits
old		old
src		src
tests		tests
README.md		README.md
requirements.txt		requirements.txt
vendors.xlsx		vendors.xlsx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Key Improvements Made

Needed improvments

About

Uh oh!

Releases

Packages

Languages

mindphil/batched-document-parsing

Folders and files

Latest commit

History

Repository files navigation

Key Improvements Made

Needed improvments

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages