Commit 0a79e04

feat: fetches system-prompt + guide (#111)

Authored by jayhack and codegen-bot
Co-authored-by: codegen-bot <team+codegenbot@codegen.sh>

1 parent 7888277 · commit 0a79e04

File tree: 6 files changed (+252, −3 lines)


docs/building-with-codegen/symbol-api.mdx

Lines changed: 1 addition & 1 deletion

```diff
@@ -38,7 +38,7 @@ All symbols share common APIs for manipulation:
   - [symbol.source](/api-reference/core/Symbol#source)
   - [symbol.docstring](/api-reference/core/Symbol#docstring)
 - Edit operations
-  - [symbol.set_docstring](/api-reference/core/Symbol#add_comment)
+  - [symbol.set_docstring](/api-reference/core/Symbol#set-docstring)
   - [symbol.move_to_file](/api-reference/core/Symbol#move-to-file) (see [Moving Symbols](/building-with-codegen/moving-symbols))
 - Graph relations (See [Usages and Dependencies](/building-with-codegen/dependencies-and-usages))
   - [symbol.usages](/api-reference/core/Symbol#usages)
```

docs/mint.json

Lines changed: 1 addition & 0 deletions

```diff
@@ -75,6 +75,7 @@
       "tutorials/modularity",
       "tutorials/deleting-dead-code",
       "tutorials/increase-type-coverage",
+      "tutorials/training-data",
       "tutorials/manage-feature-flags",
       "tutorials/managing-typescript-exports",
       "tutorials/converting-default-exports",
```

docs/tutorials/training-data.mdx (new file)

Lines changed: 235 additions & 0 deletions

---
title: "Generating Training Data for LLMs"
sidebarTitle: "Training Data"
description: "Learn how to generate training data for large language models using Codegen"
icon: "network-wired"
iconType: "solid"
---

This guide demonstrates how to use Codegen to generate high-quality training data for large language models (LLMs) by extracting function implementations along with their dependencies and usages. This approach is similar to [word2vec](https://www.tensorflow.org/text/tutorials/word2vec) or [node2vec](https://snap.stanford.edu/node2vec/): given the context of a function, learn to predict the function's implementation.

<Info>View the full code in our [examples repository](https://github.com/codegen-sh/codegen-examples/blob/main/generate_training_data/run.py)</Info>

<Tip>This example works with both Python and TypeScript repositories without modification.</Tip>

## Overview

The process involves three main steps:

1. Finding all functions in the codebase
2. Extracting their implementations, dependencies, and usages
3. Generating structured training data

Let's walk through each step using Codegen.

## Step 1: Finding Functions and Their Context

First, we perform a "graph expansion" for each function: we grab the function's source, as well as the full source of all usages of the function and all of its dependencies.

<Info>See [dependencies and usages](/building-with-codegen/dependencies-and-usages) to learn more about navigating the code graph.</Info>

Let's import the types we need from Codegen:

```python
import codegen
from codegen import Codebase
from codegen.sdk.core.external_module import ExternalModule
from codegen.sdk.core.import_resolution import Import
from codegen.sdk.core.symbol import Symbol
```

Here's how we get the full context for each function:

```python
def get_function_context(function) -> dict:
    """Get the implementation, dependencies, and usages of a function."""
    context = {
        "implementation": {"source": function.source, "filepath": function.filepath},
        "dependencies": [],
        "usages": [],
    }

    # Add dependencies
    for dep in function.dependencies:
        # Hop through imports to find the root symbol source
        if isinstance(dep, Import):
            dep = hop_through_imports(dep)

        context["dependencies"].append({"source": dep.source, "filepath": dep.filepath})

    # Add usages
    for usage in function.usages:
        context["usages"].append({
            "source": usage.usage_symbol.source,
            "filepath": usage.usage_symbol.filepath,
        })

    return context
```

Notice how we use `hop_through_imports` to resolve dependencies. When working with imports, symbols can be re-exported multiple times. For example, a helper function might be imported and re-exported through several files before being used. We need to follow this chain to find the actual implementation:

```python
def hop_through_imports(imp: Import) -> Symbol | ExternalModule:
    """Finds the root symbol for an import."""
    if isinstance(imp.imported_symbol, Import):
        return hop_through_imports(imp.imported_symbol)
    return imp.imported_symbol
```
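The recursion only depends on the `imported_symbol` attribute, so its behavior can be demonstrated with stand-in classes. The `MockImport` and `MockSymbol` types below are illustrative mocks, not part of the Codegen SDK:

```python
from dataclasses import dataclass


@dataclass
class MockSymbol:
    """Stand-in for a resolved symbol (illustrative only)."""
    name: str


@dataclass
class MockImport:
    """Stand-in for an import that may point at another import."""
    imported_symbol: object


def hop_through_imports(imp):
    """Follow a chain of re-exports to the root symbol."""
    if isinstance(imp.imported_symbol, MockImport):
        return hop_through_imports(imp.imported_symbol)
    return imp.imported_symbol


# helper defined in validators.py, re-exported twice before use
root = MockSymbol("validate_input")
chain = MockImport(MockImport(root))
print(hop_through_imports(chain).name)  # -> validate_input
```

However deep the re-export chain, the recursion terminates at the first node that is no longer an import.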
This creates a structured representation of each function's context:

```json
{
  "implementation": {
    "source": "def process_data(input: str) -> dict: ...",
    "filepath": "src/data_processor.py"
  },
  "dependencies": [
    {
      "source": "def validate_input(data: str) -> bool: ...",
      "filepath": "src/validators.py"
    }
  ],
  "usages": [
    {
      "source": "result = process_data(user_input)",
      "filepath": "src/api.py"
    }
  ]
}
```
## Step 2: Processing the Codebase

Next, we process all functions in the codebase to generate our training data:

```python
def run(codebase: Codebase):
    """Generate training data using a node2vec-like approach for code embeddings."""
    # Track all function contexts
    training_data = {
        "functions": [],
        "metadata": {
            "total_functions": len(codebase.functions),
            "total_processed": 0,
            "avg_dependencies": 0,
            "avg_usages": 0,
        },
    }

    # Process each function in the codebase
    for function in codebase.functions:
        # Skip if function is too small
        if len(function.source.split("\n")) < 2:
            continue

        # Get function context
        context = get_function_context(function)

        # Only keep functions with enough context
        if len(context["dependencies"]) + len(context["usages"]) > 0:
            training_data["functions"].append(context)

    # Update metadata
    training_data["metadata"]["total_processed"] = len(training_data["functions"])
    if training_data["functions"]:
        training_data["metadata"]["avg_dependencies"] = sum(
            len(f["dependencies"]) for f in training_data["functions"]
        ) / len(training_data["functions"])
        training_data["metadata"]["avg_usages"] = sum(
            len(f["usages"]) for f in training_data["functions"]
        ) / len(training_data["functions"])

    return training_data
```
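Before training on the output, it can help to sanity-check the distribution of context sizes beyond the averages tracked in the metadata. A minimal sketch over the structure `run()` produces (the sample dict here is hypothetical):

```python
from collections import Counter

# Hypothetical output in the shape produced by run()
training_data = {
    "functions": [
        {"dependencies": [{}, {}], "usages": [{}]},
        {"dependencies": [{}], "usages": []},
        {"dependencies": [{}, {}], "usages": [{}, {}, {}]},
    ],
    "metadata": {},
}

# Histogram of dependency counts per function
dep_counts = Counter(len(f["dependencies"]) for f in training_data["functions"])
print(dict(dep_counts))  # -> {2: 2, 1: 1}
```

A heavily skewed histogram (e.g. most functions with zero or one dependency) suggests tightening the filtering thresholds.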
## Step 3: Running the Generator

Finally, we can run our training data generator on any codebase.

<Note>See [parsing codebases](/building-with-codegen/parsing-codebases) to learn more.</Note>

```python
import json

if __name__ == "__main__":
    print("Initializing codebase...")
    codebase = Codebase.from_repo("fastapi/fastapi")

    print("Generating training data...")
    training_data = run(codebase)

    print("Saving training data...")
    with open("training_data.json", "w") as f:
        json.dump(training_data, f, indent=2)
    print("Training data saved to training_data.json")
```

This will:

1. Load the target codebase
2. Process all functions
3. Save the structured training data to a JSON file

<Tip>
  You can use any Git repository as your source codebase by passing the repo URL
  to [Codebase.from_repo(...)](/api-reference/core/codebase#from-repo).
</Tip>
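Many fine-tuning pipelines expect JSONL (one example per line) rather than a single JSON document. The conversion below is a sketch, not part of the tutorial's code; the `prompt`/`completion` field names are assumptions, and the sample record merely mirrors the structure saved in `training_data.json`:

```python
import json

# Hypothetical record in the shape written to training_data.json
training_data = {
    "functions": [
        {
            "implementation": {"source": "def f(): ...", "filepath": "a.py"},
            "dependencies": [],
            "usages": [{"source": "f()", "filepath": "b.py"}],
        }
    ]
}

# One JSON object per line: context as the prompt, implementation as the target
lines = []
for fn in training_data["functions"]:
    record = {
        "prompt": json.dumps({"dependencies": fn["dependencies"], "usages": fn["usages"]}),
        "completion": fn["implementation"]["source"],
    }
    lines.append(json.dumps(record))

with open("training_data.jsonl", "w") as f:
    f.write("\n".join(lines) + "\n")
```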
## Using the Training Data

The generated data can be used to train LLMs in several ways:

1. **Masked Function Prediction**: Hide a function's implementation and predict it from dependencies and usages
2. **Code Embeddings**: Generate embeddings that capture semantic relationships between functions
3. **Dependency Prediction**: Learn to predict which functions are likely to be dependencies
4. **Usage Pattern Learning**: Train models to understand common usage patterns

For example, to create a masked prediction task:

```python
def create_training_example(function_data):
    """Create a masked prediction example from function data."""
    return {
        "context": {
            "dependencies": function_data["dependencies"],
            "usages": function_data["usages"]
        },
        "target": function_data["implementation"]
    }


# Create training examples
examples = [create_training_example(f) for f in training_data["functions"]]
```
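Once the examples are built, a deterministic train/validation split keeps evaluation honest across runs. A minimal sketch (the 90/10 ratio and the fixed seed are arbitrary choices, not from the tutorial):

```python
import random

# Hypothetical examples list, in the shape produced by create_training_example
examples = [
    {"context": {}, "target": {"source": f"def fn_{i}(): ..."}} for i in range(10)
]

rng = random.Random(42)  # fixed seed for a reproducible split
shuffled = examples[:]
rng.shuffle(shuffled)

split = int(len(shuffled) * 0.9)
train, val = shuffled[:split], shuffled[split:]
print(len(train), len(val))  # -> 9 1
```

For code data, splitting by file or repository instead of by function avoids leaking near-duplicate implementations between the two sets.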
## Best Practices

1. **Filter Small Functions**: Skip trivial functions that won't provide meaningful training data:

   ```python
   if len(function.source.split("\n")) < 2:
       continue
   ```

2. **Ensure Sufficient Context**: Only use functions with dependencies or usages:

   ```python
   if len(context["dependencies"]) + len(context["usages"]) > 0:
       training_data["functions"].append(context)
   ```

3. **Track Metadata**: Keep statistics about your training data:

   ```python
   training_data["metadata"] = {
       "total_functions": len(codebase.functions),
       "total_processed": len(training_data["functions"]),
       "avg_dependencies": average_dependencies,
       "avg_usages": average_usages
   }
   ```

4. **Handle Import Chains**: Follow import chains to find root implementations:

   ```python
   def hop_through_imports(imp: Import) -> Symbol | ExternalModule:
       if isinstance(imp.imported_symbol, Import):
           return hop_through_imports(imp.imported_symbol)
       return imp.imported_symbol
   ```

By following these guidelines, you can generate high-quality training data for your LLM projects while maintaining code quality and consistency.

src/codegen/cli/api/endpoints.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -9,3 +9,4 @@
 LOOKUP_ENDPOINT = f"https://{MODAL_PREFIX}--cli-lookup.modal.run"
 RUN_ON_PR_ENDPOINT = f"https://{MODAL_PREFIX}--cli-run-on-pull-request.modal.run"
 PR_LOOKUP_ENDPOINT = f"https://{MODAL_PREFIX}--cli-pr-lookup.modal.run"
+CODEGEN_SYSTEM_PROMPT_URL = "https://gist.githubusercontent.com/jayhack/15681a2ceaccd726f19e6fdb3a44738b/raw/17c08054e3931b3b7fdf424458269c9e607541e8/codegen-system-prompt.txt"
```

src/codegen/cli/commands/init/render.py

Lines changed: 1 addition & 2 deletions

```diff
@@ -6,5 +6,4 @@ def get_success_message(codegen_dir: Path, docs_dir: Path, examples_dir: Path) -
     return """📁 .codegen configuration folder created:
     [dim]config.toml[/dim]   Project configuration
     [dim]codemods/[/dim]   Your codemod implementations
-    [dim]jupyter/[/dim]   Notebooks for codebase exploration
-    [dim]prompts/[/dim]   AI system prompts (gitignored)"""
+    [dim]codegen-system-prompt.txt[/dim]   AI system prompt (gitignored)"""
```

src/codegen/cli/workspace/initialize_workspace.py

Lines changed: 13 additions & 0 deletions

```diff
@@ -2,6 +2,7 @@
 from contextlib import nullcontext
 from pathlib import Path

+import requests
 import rich
 import toml
 from rich.status import Status
@@ -78,6 +79,7 @@ def initialize_codegen(
     CONFIG_PATH = CODEGEN_FOLDER / "config.toml"
     JUPYTER_DIR = CODEGEN_FOLDER / "jupyter"
     CODEMODS_DIR = CODEGEN_FOLDER / "codemods"
+    SYSTEM_PROMPT_PATH = CODEGEN_FOLDER / "codegen-system-prompt.txt"

     # If status is a string, create a new spinner
     context = create_spinner(f" {status} folders...") if isinstance(status, str) else nullcontext()
@@ -91,6 +93,16 @@ def initialize_codegen(
     JUPYTER_DIR.mkdir(parents=True, exist_ok=True)
     CODEMODS_DIR.mkdir(parents=True, exist_ok=True)

+    # Download system prompt
+    try:
+        from codegen.cli.api.endpoints import CODEGEN_SYSTEM_PROMPT_URL
+
+        response = requests.get(CODEGEN_SYSTEM_PROMPT_URL)
+        response.raise_for_status()
+        SYSTEM_PROMPT_PATH.write_text(response.text)
+    except Exception as e:
+        rich.print(f"[yellow]Warning: Could not download system prompt: {e}[/yellow]")
+
     if not repo:
         rich.print("No git repository found. Please run this command in a git repository.")
     else:
@@ -152,6 +164,7 @@ def modify_gitignore(codegen_folder: Path):
         "examples/",
         "prompts/",
         "jupyter/",
+        "codegen-system-prompt.txt",  # Add system prompt to gitignore
         "",
         "# Python cache files",
         "__pycache__/",
```
