iis-research-team/bench-analysis

Russian Language Benchmark Analysis Toolkit

A toolkit for fine-grained analysis of Russian language benchmarks. It characterizes datasets through statistical and linguistic analysis.

Features

  • Data Preprocessing: Clean and validate CSV data containing potential JSON artifacts
  • Text Analysis: Comprehensive linguistic analysis including:
    • Basic statistics (word count, sentence length, etc.)
    • Readability metrics (Flesch-Kincaid, SMOG, etc.)
    • POS tagging and named entity recognition
    • Syntactic dependency analysis
    • Text homogeneity measurement
  • Data Management: Store, normalize, and retrieve analysis results
  • Statistical Comparison: Compare datasets using:
    • Mahalanobis and Euclidean distance metrics
    • Z-score normalization
    • Feature contribution analysis
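To make the comparison metrics listed above concrete, here is a minimal NumPy sketch of z-score normalization, Euclidean distance, and Mahalanobis distance. The function names are hypothetical and are not taken from the toolkit; this only illustrates the textbook definitions.

```python
import numpy as np

def zscore(reference, sample):
    """Standardize `sample` with the column means and std devs of `reference`."""
    mean = reference.mean(axis=0)
    std = reference.std(axis=0, ddof=1)  # sample standard deviation
    return (sample - mean) / std

def euclidean_distance(u, v):
    """Straight-line distance between two feature vectors."""
    return float(np.linalg.norm(np.asarray(u, dtype=float) - np.asarray(v, dtype=float)))

def mahalanobis_distance(x, reference):
    """Distance from `x` to the centre of `reference`, scaled by its covariance."""
    mean = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    diff = np.asarray(x, dtype=float) - mean
    # The pseudo-inverse tolerates a singular covariance matrix.
    return float(np.sqrt(diff @ np.linalg.pinv(cov) @ diff))

# Illustrative reference data: four rows, two features.
reference = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
print(mahalanobis_distance([3.0, 1.0], reference))
```

Unlike the Euclidean distance, the Mahalanobis distance accounts for feature scale and correlation, so features measured in different units remain comparable.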

Installation and Usage Guide

Install dependencies

pip install -r requirements.txt

Download required language models

python -m spacy download ru_core_news_sm
python -m spacy download en_core_web_sm
python -m nltk.downloader punkt stopwords

Configuration Files

The system uses two JSON configuration files for data normalization:

  • normalize_by_sentences.json – Contains column names to be normalized by sentence count.
  • normalize_by_words.json – Contains column names to be normalized by word count.

Place these files in the same directory as the data_handler module.
The normalize_data() method automatically uses these configurations.
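As an illustration of the idea behind these files, the sketch below divides the listed columns by a count column. It assumes each config file holds a flat JSON list of column names, and the column names (`noun_count`, `word_count`, and so on) are hypothetical; the files and the DataFrame layout in the repository may differ.

```python
import json
import pandas as pd

def normalize_columns(df, columns, denominator):
    """Return a copy of `df` with each listed column divided by `denominator`."""
    out = df.copy()
    for column in columns:
        if column in out.columns:  # skip names absent from this dataset
            out[column] = out[column] / out[denominator]
    return out

# Assumed file content for normalize_by_words.json: a flat list of column names.
by_words = json.loads('["noun_count", "verb_count"]')

df = pd.DataFrame({
    "noun_count": [10, 20],
    "verb_count": [5, 10],
    "word_count": [100, 50],
})
print(normalize_columns(df, by_words, "word_count"))
```

Normalizing by length keeps long and short texts comparable, since raw counts grow with document size.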

Reference Data

master_data.csv is the baseline dataset used for statistical comparison. It should include:

  • A grouping column (e.g., "Genre" or "Creation")
  • The same feature columns used for analysis

It’s passed to StatisticalAnalyzer as master_df to:

  • Compare user data to reference groups
  • Identify key differences and generate similarity scores
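A quick way to confirm that a reference file has the shape described above is to check for the grouping column and the feature columns before running a comparison. The helper and the feature names below are hypothetical and serve only as a sketch.

```python
import pandas as pd

def missing_columns(master_df, group_col, features):
    """List the required columns that are absent from the reference data."""
    required = [group_col, *features]
    return [column for column in required if column not in master_df.columns]

# Illustrative stand-in for master_data.csv.
master_df = pd.DataFrame({
    "Genre": ["news", "news", "fiction"],
    "avg_sentence_length": [18.0, 22.0, 12.0],
    "noun_ratio": [0.30, 0.34, 0.22],
})
features = ["avg_sentence_length", "noun_ratio"]

print(missing_columns(master_df, "Genre", features))
# Per-group feature means give one reference profile per group.
print(master_df.groupby("Genre")[features].mean())
```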

Usage

1. Data Preprocessing

from data_processor import DataProcessor

processor = DataProcessor()
results = processor.process_csv("your_data.csv")
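What counts as a JSON artifact is not spelled out here. One plausible case, assumed for this sketch, is a CSV cell that holds a JSON-encoded string or object instead of plain text. The hypothetical helper below decodes such cells and leaves everything else untouched; it is not the logic used by DataProcessor.

```python
import json

def unwrap_json_cell(value):
    """Decode a cell that looks like JSON; return anything else unchanged."""
    if not isinstance(value, str):
        return value
    text = value.strip()
    # Only attempt decoding when the cell starts like a JSON string, object, or array.
    if not text or text[0] not in '"{[':
        return value
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return value  # malformed fragment: keep it for manual inspection

for cell in ['"Привет"', '{"text": "a"}', "plain text", "{broken"]:
    print(repr(unwrap_json_cell(cell)))
```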

2. Text Analysis

from text_analyzer import TextAnalyzer

analyzer = TextAnalyzer("processed data here")
results = analyzer.analyze()  # Returns comprehensive analysis
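For orientation, basic statistics such as word count and average sentence length can be approximated with the standard library alone. The regex-based splitting below is a deliberately simple stand-in, not the tokenization TextAnalyzer performs, and the function name is hypothetical.

```python
import re

def basic_stats(text):
    """Count sentences and words with simple regex rules."""
    # Split after sentence-final punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?…])\s+", text.strip()) if s]
    # \w matches Cyrillic letters because Python 3 patterns are Unicode-aware.
    words = re.findall(r"\w+", text)
    average = len(words) / len(sentences) if sentences else 0.0
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "avg_sentence_length": average,
    }

print(basic_stats("Мама мыла раму. Папа читал газету вечером!"))
```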

3. Data Management

from data_handler import TextAnalysisDataHandler

handler = TextAnalysisDataHandler()
handler.save_analysis(results, "dataset_name")
normalized_data = handler.normalize_data()

4. Statistical Comparison

import pandas as pd
from statistical_analyzer import StatisticalAnalyzer

master_df = pd.read_csv("master_data.csv")
user_df = pd.read_csv("user_data.csv")
features = ["feature1", "feature2", ...]  # Replace with actual feature names

analyzer = StatisticalAnalyzer(master_df, user_df, features)
report = analyzer.compare_to_groups("category_column")
StatisticalAnalyzer.print_comparison_report(report)
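One simple way to think about feature contribution is as each feature's share of the total squared z-score distance between the user data and a reference group. The sketch below uses that definition as an assumption; StatisticalAnalyzer may compute contributions differently, and the function name is hypothetical.

```python
import pandas as pd

def feature_contributions(group_df, user_df, features):
    """Share of the squared z-score distance attributable to each feature."""
    mean = group_df[features].mean()
    std = group_df[features].std(ddof=1)
    z = (user_df[features].mean() - mean) / std
    squared = z ** 2
    # Shares sum to 1; they are undefined (NaN) when the two means coincide.
    return (squared / squared.sum()).sort_values(ascending=False)

# Illustrative data with two hypothetical features.
group_df = pd.DataFrame({"a": [0.0, 2.0], "b": [0.0, 4.0]})
user_df = pd.DataFrame({"a": [3.0, 3.0], "b": [4.0, 4.0]})
print(feature_contributions(group_df, user_df, ["a", "b"]))
```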
