
This project uses core Python data science libraries to identify the programming language of source code files, targeting high throughput and low latency.


AI Language Identification System

A system for identifying programming languages in source code files using machine learning. This implementation supports both a legacy TensorFlow-based model and a new scikit-learn-based model with improved performance and resource efficiency.

Features

  • Supports 8 programming languages: Python, Java, C++, Groovy, JavaScript, XML, JSON, and YAML
  • Resource-efficient implementation (runs on CPU with <512MB RAM)
  • Fast prediction (more than 4 files/second)
  • Handles class imbalance through balanced class weights
  • Provides confidence scores for predictions
  • Supports batch processing of multiple files
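The confidence scores and top-k behaviour can be illustrated with a small helper (a sketch only; `top_k` is a hypothetical name, not the repository's API):

```python
def top_k(scores: dict, k: int = 3):
    """Return the k highest-confidence (language, score) pairs."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Illustrative per-language probabilities for one file.
scores = {"Python": 0.82, "Java": 0.07, "YAML": 0.06, "JSON": 0.05}
print(top_k(scores, 2))  # → [('Python', 0.82), ('Java', 0.07)]
```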

Requirements

  • Python 3.6+
  • Dependencies listed in requirements.txt

Installation

  1. Clone the repository
  2. Install dependencies:
    pip install -r requirements.txt

Usage

Training the Model

python ai_predict_lang.py --train --train-dir file

Predicting Language (CLI)

Single file:

python ai_predict_lang.py --file path/to/file.txt

Directory (batch) prediction:

python ai_predict_lang.py --dir path/to/directory/

Show top-k predictions:

python ai_predict_lang.py --file path/to/file.txt --top-k 3

Other options:

  • --model-dir: Directory containing trained model (default: 'models')
  • --output: Output file for batch results
  • --validate: Validate predictions against file extensions (for batch)
  • --legacy: Use legacy TensorFlow model
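The documented flags map naturally onto `argparse`. The following is a sketch of how such a parser might look; the actual `ai_predict_lang.py` may be organised differently:

```python
import argparse

# Sketch of an argument parser matching the documented CLI flags;
# names and defaults follow the README, not the repository's source.
parser = argparse.ArgumentParser(description="Identify programming languages")
parser.add_argument("--train", action="store_true", help="train a new model")
parser.add_argument("--train-dir", help="directory of training files")
parser.add_argument("--file", help="single file to classify")
parser.add_argument("--dir", help="directory of files to classify")
parser.add_argument("--top-k", type=int, default=1, help="show top-k predictions")
parser.add_argument("--model-dir", default="models", help="trained model directory")
parser.add_argument("--output", help="output file for batch results")
parser.add_argument("--validate", action="store_true",
                    help="check predictions against file extensions")
parser.add_argument("--legacy", action="store_true", help="use TensorFlow model")

args = parser.parse_args(["--file", "example.py", "--top-k", "3"])
print(args.file, args.top_k)  # → example.py 3
```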

Implementation Details

Model Architecture

The new implementation uses:

  • HashingVectorizer with character n-grams (1-3) for feature extraction
  • SGDClassifier (logistic regression) with balanced class weights
  • Lightweight, language-specific features (keywords, syntax markers, comment styles)
  • Efficient memory usage and fast inference

Resource Constraints

The implementation is optimized for:

  • Single quad-core CPU
  • 512MB RAM
  • Local file storage
  • Processing > 4 files/second

Results

  • Test accuracy: ~93%
  • Validation accuracy: ~94%
  • Macro F1-score: ~0.78 (test set)
  • Weighted F1-score: ~0.94 (test set)
  • Resource usage: Peak memory ~139MB, training time ~75s, prediction speed >4 files/sec

Limitations

  1. May struggle with:

    • Very short files
    • Files with mixed languages
    • Files with non-standard extensions
    • Binary files or non-UTF-8 encoded files
  2. Resource constraints:

    • Limited to CPU processing
    • Memory usage must stay under 512MB
    • Processing speed target of 4 files/second
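The binary and non-UTF-8 limitation could be mitigated with a cheap pre-filter before classification; a hypothetical sketch (not code from the repository):

```python
def is_utf8_text(data: bytes) -> bool:
    """Cheap guard against binary input: reject NUL bytes and invalid UTF-8."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_utf8_text(b"print('hello')"))  # → True
print(is_utf8_text(b"\x00\x01\x02"))    # → False
```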

Future Improvements

  1. Model improvements:

    • Add file extension as a feature
    • Implement ensemble methods
    • Add confidence thresholds
    • Create language-specific preprocessing rules
  2. Performance optimizations:

    • Implement batch processing for training
    • Add caching for frequently accessed files
    • Optimize feature extraction pipeline
  3. Additional features:

    • Support for more languages
    • Better handling of mixed-language files
    • Improved error handling and logging
    • API for integration with other tools
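The proposed confidence threshold could be a thin wrapper over the existing scores; a hypothetical sketch of that improvement (`classify_with_threshold` is an invented name):

```python
def classify_with_threshold(scores: dict, threshold: float = 0.5) -> str:
    """Return the top language, or 'unknown' if confidence is too low.

    `scores` maps language -> probability.
    """
    best_lang, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_lang if best_score >= threshold else "unknown"

print(classify_with_threshold({"Python": 0.9, "Java": 0.1}))     # → Python
print(classify_with_threshold({"XML": 0.4, "JSON": 0.35}, 0.5))  # → unknown
```

Returning an explicit "unknown" is one way to avoid low-confidence misclassifications on very short or mixed-language files.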
