Skip to content

mindphil/batched-document-parsing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

74 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Key Improvements Made

  1. Class-Based Architecture

All methods now have access to vendor_names, id_list, and owner_df

Utility functions are shared between draft and executed pipelines

Draft vs executed logic is cleanly separated but inherits from the same base

1.3 Configuration Section at Top

EXCEL_PATH: Set your vendor spreadsheet path once here

HEADER_ROW: Configure which row contains headers (0-indexed)

Column name variables (VENDOR_COLUMN, ID_COLUMN, OWNER_COLUMN) that can be changed in one place

1.6. Automatic Sheet Selection

Now automatically uses the first sheet in the Excel file

No need to specify sheet name in command line

1.9. Simplified Functionality

Removed all single-file processing code, focused entirely on batch processing, cleaner command-line interface

2 Improved CLI, to run:

# Preview mode (default - shows what would be renamed)
python document_renamer.py /path/to/directory --status executed

# Actually rename files
python document_renamer.py /path/to/directory --status executed --rename

# Process only PDF files
python document_renamer.py /path/to/directory --status executed --pattern "*.pdf"

# Process draft Word documents
python document_renamer.py /path/to/directory --status draft --pattern "*.docx"
  1. Enhanced Fuzzy Matching

The PDF pipeline now has full access to fuzzy matching

Improved vendor guessing with better ignore patterns

Consistent ID matching across both pipelines

  1. Better Error Handling & User Experience

Preview mode shows before/after comparison

Batch processing capabilities

Verbose/quiet modes for debugging

Clear error messages with suggestions

  1. Modular Design

Each major function has a single responsibility

Easy to extend with new document types

Clear debugging output that can be toggled

  1. Fixed the "Executed" Definition Issue

The distinction between drfat and executed is now explicit:

  • Draft: Uses file modification date, includes owner initials

  • Executed: Extracts dates from PDF content, no initials (assumes final document)

    Needed improvments

The program relies on files being stored in a consistent directory. There are two variables defined to be for the fuzzy match. folder and base. folder is the file-path, split by the delimiters ('/', '-', '_' ect.) base is the file name itself, excluding the extension. If the files have the name of the vendor, use base in words. If the folder/directory has the name (which it usually does moreso than the filename) use folder in words. A more sophisticated method can be devised for more precision.

PyPDF parsing isn't bad but a method with OCR would be the most accurate. AWS's textract seems to be the standard, but that costs money. Another great, open source model is tesseract. I'll look into that.

About

A Python toolkit for automated document renaming based on extracted metadata, and robust parsing from PDFs.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published