- Class-Based Architecture
All methods now have access to vendor_names, id_list, and owner_df
Utility functions are shared between draft and executed pipelines
Draft vs executed logic is cleanly separated but inherits from the same base
1.3 Configuration Section at Top
EXCEL_PATH:
Set your vendor spreadsheet path once here
HEADER_ROW:
Configure which row contains headers (0-indexed)
Column name variables (VENDOR_COLUMN
, ID_COLUMN
, OWNER_COLUMN
) that can be changed in one place
1.6. Automatic Sheet Selection
Now automatically uses the first sheet in the Excel file
No need to specify sheet name in command line
1.9. Simplified Functionality
Removed all single-file processing code, focused entirely on batch processing, cleaner command-line interface
2 Improved CLI, to run:
# Preview mode (default - shows what would be renamed)
python document_renamer.py /path/to/directory --status executed
# Actually rename files
python document_renamer.py /path/to/directory --status executed --rename
# Process only PDF files
python document_renamer.py /path/to/directory --status executed --pattern "*.pdf"
# Process draft Word documents
python document_renamer.py /path/to/directory --status draft --pattern "*.docx"
- Enhanced Fuzzy Matching
The PDF pipeline now has full access to fuzzy matching
Improved vendor guessing with better ignore patterns
Consistent ID matching across both pipelines
- Better Error Handling & User Experience
Preview mode shows before/after comparison
Batch processing capabilities
Verbose/quiet modes for debugging
Clear error messages with suggestions
- Modular Design
Each major function has a single responsibility
Easy to extend with new document types
Clear debugging output that can be toggled
- Fixed the "Executed" Definition Issue
The distinction between drfat and executed is now explicit:
-
Draft: Uses file modification date, includes owner initials
-
Executed: Extracts dates from PDF content, no initials (assumes final document)
The program relies on files being stored in a consistent directory. There are two variables defined to be for the fuzzy match. folder
and base
. folder
is the file-path, split by the delimiters ('/', '-', '_' ect.) base
is the file name itself, excluding the extension. If the files have the name of the vendor, use base
in words
. If the folder/directory has the name (which it usually does moreso than the filename) use folder
in words
. A more sophisticated method can be devised for more precision.
PyPDF parsing isn't bad but a method with OCR would be the most accurate. AWS's textract seems to be the standard, but that costs money. Another great, open source model is tesseract. I'll look into that.