A system for identifying programming languages in source code files using machine learning. This implementation supports both a legacy TensorFlow-based model and a new scikit-learn-based model with improved performance and resource efficiency.
- Supports 8 programming languages: Python, Java, C++, Groovy, JavaScript, XML, JSON, and YAML
- Resource-efficient implementation (runs on CPU with <512MB RAM)
- Fast prediction (more than 4 files/second)
- Handles class imbalance through balanced class weights
- Provides confidence scores for predictions
- Supports batch processing of multiple files
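The balanced class weights mentioned above can be illustrated with the formula scikit-learn documents for `class_weight="balanced"`, namely `n_samples / (n_classes * count)`. The sketch below is a plain-Python restatement of that formula; the function name and the label counts are invented for illustration and do not reflect this project's dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: n_samples / (n_classes * count)} for each label."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical, deliberately imbalanced label distribution.
labels = ["Python"] * 6 + ["Java"] * 3 + ["Groovy"] * 1
weights = balanced_class_weights(labels)
```

Rare classes receive larger weights, so errors on them cost more during training than errors on common classes.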
- Python 3.6+
- Dependencies listed in requirements.txt
- Clone the repository
- Install dependencies:
pip install -r requirements.txt
python ai_predict_lang.py --train --train-dir file
Single file:
python ai_predict_lang.py --file path/to/file.txt
Directory (batch) prediction:
python ai_predict_lang.py --dir path/to/directory/
Show top-k predictions:
python ai_predict_lang.py --file path/to/file.txt --top-k 3
Other options:
- --model-dir: Directory containing trained model (default: 'models')
- --output: Output file for batch results
- --validate: Validate predictions against file extensions (for batch)
- --legacy: Use legacy TensorFlow model
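As an illustration of what validating predictions against file extensions could involve, here is a minimal sketch. It is not taken from ai_predict_lang.py: the extension map and the `validate_prediction` function are hypothetical, and the script's actual mapping may differ.

```python
from pathlib import Path

# Hypothetical extension-to-language map for the eight supported languages.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".java": "Java",
    ".cpp": "C++",
    ".cc": "C++",
    ".hpp": "C++",
    ".groovy": "Groovy",
    ".js": "JavaScript",
    ".xml": "XML",
    ".json": "JSON",
    ".yaml": "YAML",
    ".yml": "YAML",
}


def validate_prediction(path, predicted_language):
    """Return True or False when the extension is known, else None."""
    expected = EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
    if expected is None:
        # Unknown or missing extension: the prediction cannot be checked.
        return None
    return expected == predicted_language
```

Returning `None` for unknown extensions keeps unverifiable files out of the agreement count instead of treating them as errors.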
The new implementation uses:
- HashingVectorizer with character n-grams (1-3) for feature extraction
- SGDClassifier (logistic regression) with balanced class weights
- Lightweight, language-specific features (keywords, syntax markers, comment styles)
- Efficient memory usage and fast inference
The implementation is optimized for:
- Single quad-core CPU
- 512MB RAM
- Local file storage
- Processing > 4 files/second
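One way to check the throughput and memory targets above is a small benchmark harness. The sketch below uses only the standard library; `measure` and `dummy_predict` are hypothetical names, and `dummy_predict` merely stands in for the real predictor. Note that `tracemalloc` reports memory allocated by Python objects only, so it underestimates total process memory.

```python
import time
import tracemalloc


def measure(predict, texts):
    """Return (files_per_second, peak_mib) for calling predict on each text."""
    tracemalloc.start()
    started = time.perf_counter()
    for text in texts:
        predict(text)
    elapsed = time.perf_counter() - started
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    files_per_second = len(texts) / elapsed if elapsed > 0 else float("inf")
    return files_per_second, peak_bytes / (1024 * 1024)


def dummy_predict(text):
    """Stand-in for the real model; replace with the actual predictor."""
    return "Python" if "def " in text else "unknown"


rate, peak_mib = measure(dummy_predict, ["def f(): pass\n"] * 100)
# Compare against the stated targets: more than 4 files/second, under 512 MB.
meets_targets = rate > 4 and peak_mib < 512
```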
Reported results:
- Test accuracy: ~93%
- Validation accuracy: ~94%
- Macro F1-score: ~0.78 (test set)
- Weighted F1-score: ~0.94 (test set)
- Resource usage: Peak memory ~139MB, training time ~75s, prediction speed >4 files/sec
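The gap between the macro and weighted F1-scores above is the pattern one would expect when a few rare classes are predicted poorly: macro F1 averages classes equally, while weighted F1 averages them by support. The sketch below shows the arithmetic with invented per-class numbers; they are not this project's results.

```python
def macro_and_weighted_f1(per_class):
    """per_class maps label -> (f1, support); returns (macro, weighted)."""
    f1_scores = [f1 for f1, _ in per_class.values()]
    total_support = sum(support for _, support in per_class.values())
    macro = sum(f1_scores) / len(f1_scores)
    weighted = (
        sum(f1 * support for f1, support in per_class.values())
        / total_support
    )
    return macro, weighted


# Invented scores: two large, well-predicted classes and one rare, weak class.
per_class = {
    "Python": (0.96, 900),
    "Java": (0.95, 800),
    "Groovy": (0.40, 20),
}
macro, weighted = macro_and_weighted_f1(per_class)
```

In this made-up case the rare class drags the macro average down while barely moving the weighted average.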
- May struggle with:
  - Very short files
  - Files with mixed languages
  - Files with non-standard extensions
  - Binary files or non-UTF-8 encoded files
- Resource constraints:
  - Limited to CPU processing
  - Memory usage must stay under 512MB
  - Processing speed target of 4 files/second
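For the binary and non-UTF-8 files listed among the limitations, a caller can skip unreadable inputs before they reach the model. The following is a minimal sketch under stated assumptions: `read_source_text` is a hypothetical helper, and the NUL-byte check is a common heuristic rather than anything this project is known to use. It also rejects UTF-16 text, which contains NUL bytes.

```python
def read_source_text(path, sniff_bytes=8192):
    """Return the file's text, or None if it looks binary or is not UTF-8."""
    with open(path, "rb") as handle:
        data = handle.read()
    # Heuristic: a NUL byte near the start suggests binary content.
    if b"\x00" in data[:sniff_bytes]:
        return None
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 (for example Latin-1 with accented characters).
        return None
```

Files for which this returns `None` can be reported as skipped in batch mode instead of receiving a low-quality prediction.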
- Model improvements:
  - Add file extension as a feature
  - Implement ensemble methods
  - Add confidence thresholds
  - Create language-specific preprocessing rules
- Performance optimizations:
  - Implement batch processing for training
  - Add caching for frequently accessed files
  - Optimize feature extraction pipeline
- Additional features:
  - Support for more languages
  - Better handling of mixed-language files
  - Improved error handling and logging
  - API for integration with other tools
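Of the ideas above, confidence thresholds are simple to prototype on top of the existing confidence scores. The sketch below shows one possible shape; the function name, the 0.6 threshold, and the "unknown" fallback label are assumptions chosen for illustration and would need tuning on validation data.

```python
def apply_confidence_threshold(ranked, threshold=0.6, fallback="unknown"):
    """Return (language, probability), using fallback below the threshold.

    ranked is a list of (language, probability) pairs sorted from most
    to least probable, such as the output of a top-k prediction.
    """
    if not ranked:
        return fallback, 0.0
    language, probability = ranked[0]
    if probability < threshold:
        return fallback, probability
    return language, probability


# Hypothetical top-k outputs for two files.
confident = apply_confidence_threshold([("Python", 0.91), ("Groovy", 0.05)])
uncertain = apply_confidence_threshold([("Java", 0.41), ("Groovy", 0.38)])
```

A fallback label like this could also help with very short or mixed-language files, where the top probabilities tend to be close together.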