Word Frequency Analysis of BC Legislative Documents

This project analyzes word frequencies in BC Legislative documents using Stanford CoreNLP and Python. The program extracts text from PDF documents, processes it with natural language processing techniques, and writes overall and per-document word counts to a CSV file.

Requirements

Python Libraries

Install the following Python libraries using pip:

# For accessWebsite.py
pip install selenium

# For word-count.py
pip install easyocr        # For OCR text extraction from PDFs
pip install PyMuPDF        # For PDF processing
pip install pandas         # For data manipulation
pip install numpy          # For array operations
pip install pycorenlp      # For Stanford CoreNLP

Integration

Stanford CoreNLP Setup

Download Stanford CoreNLP from the official Stanford website
Extract the downloaded zip file
Navigate to the extracted folder

Start the CoreNLP server using:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
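
Before running the analysis, it can help to confirm that something is actually listening on port 9000. The following is a minimal sketch using only the Python standard library; the function name `corenlp_port_open` is hypothetical and not part of word-count.py. It only checks that the TCP port accepts connections, not that the listener is CoreNLP.

```python
import socket


def corenlp_port_open(host: str = "localhost", port: int = 9000,
                      timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if corenlp_port_open():
        print("A server is listening on port 9000.")
    else:
        print("Nothing is listening on port 9000; start the CoreNLP server first.")
```

If the server was started on a different port, pass that port explicitly, for example `corenlp_port_open(port=9001)`.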

Program Execution

Ensure the Stanford CoreNLP server is running
Place your PDF documents in a folder named 'data' in the project directory
Run the program:

python word-count.py

PDF Documents

The following PDF documents from BC Legislative Public Statutes and Regulations were analyzed:

Act1(AccessibleBritishColumbia).pdf (10 pages)
Source: Accessible British Columbia Act

Act2(EscheatAct).pdf (11 pages)
Source: Escheat Act

Act3(FamilyLawPart1).pdf (4 pages)
Source: Family Law Act: Part 1 (Interpretation)

Output

The program generates a CSV file named 'word_frequencies.csv' containing:

First column: Unique words in vocabulary (lexicographic order)
Second column: Total word counts across all documents
Additional columns: Word counts for each individual PDF document
Final row: Total counts for each column
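
The layout described above can be sketched as follows. This is an illustration only, not the code in word-count.py: it uses the standard library `csv` module instead of pandas so that it runs without extra installs, and the function name `build_rows`, the header labels `Word` and `Total`, the `TOTAL` row label, and the sample counts are all assumptions.

```python
import csv
import io
from collections import Counter


def build_rows(per_doc: dict[str, Counter]) -> list[list]:
    """Build a header row, one row per word (sorted), and a final totals row."""
    doc_names = list(per_doc)
    # Vocabulary is the union of all words seen in any document
    vocabulary = sorted(set().union(*per_doc.values()))
    rows = [["Word", "Total"] + doc_names]
    for word in vocabulary:
        counts = [per_doc[name][word] for name in doc_names]  # 0 if absent
        rows.append([word, sum(counts)] + counts)
    # Final row: column totals
    doc_totals = [sum(per_doc[name].values()) for name in doc_names]
    rows.append(["TOTAL", sum(doc_totals)] + doc_totals)
    return rows


if __name__ == "__main__":
    # Made-up counts, purely to show the shape of the output
    sample = {
        "Act1.pdf": Counter({"act": 3, "board": 1}),
        "Act2.pdf": Counter({"act": 2, "crown": 4}),
    }
    buffer = io.StringIO()
    csv.writer(buffer).writerows(build_rows(sample))
    print(buffer.getvalue())
```

With the sample counts, the row for "act" is `act,5,3,2` and the final row is `TOTAL,10,4,6`.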

Important Notes

The program uses OCR (Optical Character Recognition) to extract text from PDFs, so processing time depends on the number of pages and on your computer's specifications.
Ensure all PDFs are in the 'data' directory before running the program.
The Stanford CoreNLP server must be running on port 9000 before executing the program.
On the first run, EasyOCR downloads its models, so an active internet connection is required.

Technical Implementation Details

OCR is implemented using EasyOCR for reliable text extraction from PDF documents
Text processing includes cleaning, tokenization, and word frequency analysis
Word counting excludes numbers and special characters, focusing on alphabetic words
Results are sorted alphabetically and include overall counts, per-document counts, and column totals
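
The counting rule (alphabetic words only, numbers and special characters dropped) can be illustrated with a short sketch. This is a regex-based stand-in, not the project's implementation, which tokenizes through Stanford CoreNLP; the name `count_words` and the choice to lowercase the text are assumptions.

```python
import re
from collections import Counter

# Runs of ASCII letters only; digits and punctuation act as separators
WORD_RE = re.compile(r"[a-z]+")


def count_words(text: str) -> Counter:
    """Lowercase the text and count its purely alphabetic tokens."""
    return Counter(WORD_RE.findall(text.lower()))


if __name__ == "__main__":
    counts = count_words("The Act, 2021: the act applies to 3 persons.")
    for word in sorted(counts):  # alphabetical order, as in the CSV
        print(word, counts[word])
```

One consequence of this simple rule is that a token such as "person's" splits into "person" and "s"; a real tokenizer like CoreNLP handles such cases differently.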

Troubleshooting

If you encounter any issues:

Verify that the Stanford CoreNLP server is running (default port: 9000)
Check that all required Python libraries are installed
Ensure PDF documents are properly placed in the 'data' directory
Verify that the PDFs are readable and not corrupted
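
Two of the checks above can be automated with a small pre-flight script. This is a hedged sketch using only the standard library; the function names are hypothetical, and the module list assumes the usual import names for the packages listed under Python Libraries (PyMuPDF is imported as `fitz`). The PDF check only looks for the `%PDF-` header at the start of the file, so it catches obviously wrong files but does not prove a file is fully intact.

```python
import importlib.util
from pathlib import Path

# Import names for the packages needed by word-count.py (assumed)
REQUIRED_MODULES = ["easyocr", "fitz", "pandas", "numpy", "pycorenlp"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def unreadable_pdfs(data_dir="data"):
    """Return names of *.pdf files in data_dir that lack the %PDF- header."""
    bad = []
    for path in sorted(Path(data_dir).glob("*.pdf")):
        with open(path, "rb") as handle:
            if handle.read(5) != b"%PDF-":
                bad.append(path.name)
    return bad


if __name__ == "__main__":
    print("Missing libraries:", missing_modules() or "none")
    print("Suspicious PDFs:", unreadable_pdfs() or "none")
```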
