A toolkit for performing fine-grained analysis of Russian language benchmarks, providing detailed insights into dataset characteristics through statistical and linguistic analysis.
- Data Preprocessing: Clean and validate CSV data containing potential JSON artifacts
- Text Analysis: Comprehensive linguistic analysis, including:
  - Basic statistics (word count, sentence length, etc.)
  - Readability metrics (Flesch-Kincaid, SMOG, etc.)
  - POS tagging and named entity recognition
  - Syntactic dependency analysis
  - Text homogeneity measurement
- Data Management: Store, normalize, and retrieve analysis results
- Statistical Comparison: Compare datasets (sketched below) using:
  - Mahalanobis and Euclidean distance metrics
  - Z-score normalization
  - Feature contribution analysis
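The comparison step rests on standard distance and normalization math. The following is a minimal, self-contained sketch of how Mahalanobis distance and per-feature z-scores quantify how far a dataset sits from a reference group; the feature names and values are invented for illustration and do not reflect the toolkit's internal implementation.

```python
import numpy as np

# Illustrative only: rows = reference texts, columns = three hypothetical features.
reference = np.array([
    [12.0, 4.1, 0.62],
    [14.5, 3.8, 0.70],
    [11.2, 4.5, 0.58],
    [13.1, 4.0, 0.65],
    [15.3, 3.6, 0.72],
    [12.8, 4.2, 0.61],
])
sample = np.array([15.0, 3.2, 0.75])  # feature vector of the user dataset

mu = reference.mean(axis=0)
cov = np.cov(reference, rowvar=False)
diff = sample - mu

# Mahalanobis distance: sqrt((x - mu)^T Sigma^{-1} (x - mu));
# pinv keeps the sketch robust if the covariance matrix is near-singular.
mahalanobis = float(np.sqrt(diff @ np.linalg.pinv(cov) @ diff))

# Per-feature z-scores indicate which features contribute most to the difference.
z_scores = diff / reference.std(axis=0, ddof=1)

print(f"Mahalanobis distance: {mahalanobis:.2f}")
print("Feature z-scores:", np.round(z_scores, 2))
```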
```bash
pip install -r requirements.txt
python -m spacy download ru_core_news_sm
python -m spacy download en_core_web_sm
python -m nltk.downloader punkt stopwords
```
The system uses two JSON configuration files for data normalization:

- `normalize_by_sentences.json` – contains column names to be normalized by sentence count.
- `normalize_by_words.json` – contains column names to be normalized by word count.

Place these files in the same directory as the `data_handler` module. The `normalize_data()` method automatically uses these configurations.
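As a point of reference, here is what one of these files might contain, assuming each file holds a flat JSON array of column names; the column names below are hypothetical placeholders for columns produced by the analysis:

```json
[
  "noun_count",
  "verb_count",
  "named_entity_count"
]
```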
`master_data.csv` is the baseline dataset used for statistical comparison. It should include:

- A grouping column (e.g., `"Genre"` or `"Creation"`)
- The same feature columns used for analysis

It is passed to `StatisticalAnalyzer` as `master_df` to:

- Compare user data to reference groups
- Identify key differences and generate similarity scores
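For orientation, a schematic of one possible `master_data.csv` layout, with `Genre` as the grouping column; the feature columns and values are hypothetical:

```csv
Genre,avg_sentence_length,flesch_kincaid_grade,noun_ratio
news,14.2,8.1,0.31
fiction,18.7,6.4,0.27
academic,22.3,11.9,0.36
```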
```python
from data_processor import DataProcessor

processor = DataProcessor()
results = processor.process_csv("your_data.csv")  # clean and validate the CSV
```
```python
from text_analyzer import TextAnalyzer

analyzer = TextAnalyzer("processed data here")
results = analyzer.analyze()  # Returns comprehensive analysis
```
```python
from data_handler import TextAnalysisDataHandler

handler = TextAnalysisDataHandler()
handler.save_analysis(results, "dataset_name")  # store the analysis results under a name
normalized_data = handler.normalize_data()      # applies the JSON normalization configs
```
```python
import pandas as pd
from statistical_analyzer import StatisticalAnalyzer

master_df = pd.read_csv("master_data.csv")
user_df = pd.read_csv("user_data.csv")
features = ["feature1", "feature2", ...]  # Replace with actual feature names

analyzer = StatisticalAnalyzer(master_df, user_df, features)
report = analyzer.compare_to_groups("category_column")
StatisticalAnalyzer.print_comparison_report(report)
```