Code2Prompt Dataset Curation for LLM Fine-Tuning

Efficient, reproducible dataset curation for LLM fine-tuning: scripts and best practices for preparing code datasets without repository bloat.

Overview

Code2Prompt is a curated resource that helps researchers and practitioners prepare high-quality datasets for Large Language Model (LLM) fine-tuning, especially for code-related tasks. This repository demonstrates best practices in dataset extraction, cleaning, organization, and project hygiene, all in support of downstream fine-tuning work.

Purpose

  • Fine-Tuning Support: The entire workflow is designed to help you build datasets suitable for LLM fine-tuning, enabling more effective model adaptation for coding and automation tasks.
  • Reproducible Data Engineering: All scripts, instructions, and project protocols aim to make dataset curation easy to replicate, adapt, or extend for similar projects.

Key Strengths

  • Comprehensive Dataset Curation
    • Extracts and organizes code data from open datasets (e.g., 3b1b/videos), ready for use in LLM fine-tuning pipelines.
    • Modular, well-documented notebooks for scraping, cleaning, and structuring data (a minimal extraction sketch follows this list).
  • Best Practices in Data Management
    • Large raw datasets are never committed to the repository; instead, clear instructions are provided to source them externally, keeping the repo lightweight and compliant with GitHub guidelines.
    • .gitignore is configured to prevent accidental tracking of large or sensitive files.
  • Project Hygiene and Security
    • Careful attention to secret management: any sensitive files (like API keys) are scrubbed from both the working directory and git history.
    • All data-related scripts are safe for open-source use and collaboration.
  • Documentation-Driven Approach
    • Every step is clearly explained so users can reproduce the dataset creation process from scratch with minimal friction.
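
To make the extraction step concrete, here is a minimal sketch of how code files might be collected into a fine-tuning-ready JSONL file. It is illustrative only, not the repository's actual notebook code; the directory name, output file, and record fields are assumptions.

    import json
    from pathlib import Path

    SOURCE_DIR = Path("videos")          # cloned 3b1b/videos checkout (path is an assumption)
    OUTPUT_FILE = Path("dataset.jsonl")  # hypothetical output name

    with OUTPUT_FILE.open("w", encoding="utf-8") as out:
        for py_file in sorted(SOURCE_DIR.rglob("*.py")):
            try:
                code = py_file.read_text(encoding="utf-8")
            except UnicodeDecodeError:
                continue  # skip files that are not valid UTF-8
            # One JSON record per source file; downstream tooling can chunk or pair further.
            record = {"path": str(py_file.relative_to(SOURCE_DIR)), "text": code}
            out.write(json.dumps(record) + "\n")

Each line of the output is one JSON object, the standard input shape for most fine-tuning data loaders.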

Improvements Over Typical Dataset Projects

  • Reproducibility: Anyone can reconstruct the dataset by following the provided scripts and instructions, which makes the project well suited to research and collaboration.
  • Security: Demonstrates robust workflows for removing secrets and handling large files, two common pitfalls in open-source data projects (see the sketch after this list).
  • Fine-Tuning Ready: All data preparation is tailored for downstream LLM fine-tuning, even though actual model training is not included in this repo.
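
As one example of the secret-removal workflow, git-filter-repo can rewrite history so a leaked file disappears from every commit. This is a common approach, not necessarily the exact commands used for this repository; the file name .env is an example.

    # Remove a leaked secrets file from every commit (.env is an example name).
    pip install git-filter-repo
    git filter-repo --path .env --invert-paths
    # filter-repo removes remotes as a safety measure; re-add yours and force-push.
    git remote add origin <your-remote-url>
    git push --force origin main

Because history is rewritten, collaborators need to re-clone afterwards, and any leaked credentials should still be rotated.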

Getting Started

  1. Clone the Repository

    git clone https://github.com/hassanfarhan777/code2prompt-llm-fine-tuning.git

  2. Install Dependencies

    pip install -r requirements.txt
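
Optionally, install into a fresh virtual environment first so the project's pinned dependencies stay isolated (a general Python practice, not a requirement stated by the repository):

    python -m venv .venv
    source .venv/bin/activate   # on Windows: .venv\Scripts\activate
    pip install -r requirements.txt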

Obtain the 3b1b Dataset

This project relies on code data from 3b1b/videos. Please clone the dataset repository separately:

git clone https://github.com/3b1b/videos.git
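
If you clone videos inside this project directory, keep it out of version control, in line with the large-file policy above; whether the repository's .gitignore already covers this is an assumption. With the data in place, the curation notebooks can then be opened:

    # Keep the large raw dataset untracked (entry assumes an in-repo clone).
    echo "videos/" >> .gitignore
    # Launch Jupyter to run the scraping/cleaning notebooks.
    jupyter notebook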
