Rust: regenerate MaD files using DCA #19674
New file, the bulk-generation config for C++ (`@@ -0,0 +1,10 @@`):

```yaml
language: cpp
strategy: dca
destination: cpp/ql/lib/ext/generated
targets:
  - name: openssl
    with-sinks: false
    with-sources: false
  - name: sqlite
    with-sinks: false
    with-sources: false
```
This file was deleted.
Changes to the bulk-generation script:

```diff
@@ -1,3 +1,4 @@
 #!/usr/bin/env python3
 """
 Experimental script for bulk generation of MaD models based on a list of projects.
```
```diff
@@ -7,15 +8,23 @@
 import os.path
 import subprocess
 import sys
-from typing import NotRequired, TypedDict, List
+from typing import NotRequired, TypedDict, List, Callable, Optional
 from concurrent.futures import ThreadPoolExecutor, as_completed
 import time
 import argparse
 import json
 import requests
 import zipfile
 import tarfile
 from functools import cmp_to_key
+import shutil
+
+try:
+    import yaml
+except ImportError:
+    print(
+        "ERROR: PyYAML is not installed. Please install it with 'pip install pyyaml'."
+    )
+    sys.exit(1)
 
 import generate_mad as mad
```
```diff
@@ -103,6 +112,37 @@ def clone_project(project: Project) -> str:
     return target_dir
 
 
+def run_in_parallel[T, U](
+    func: Callable[[T], U],
+    items: List[T],
+    *,
+    on_error=lambda item, exc: None,
+    error_summary=lambda failures: None,
+    max_workers=8,
+) -> List[Optional[U]]:
+    if not items:
+        return []
+    max_workers = min(max_workers, len(items))
+    results = [None for _ in range(len(items))]
+    with ThreadPoolExecutor(max_workers=max_workers) as executor:
+        # Start cloning tasks and keep track of them
+        futures = {
+            executor.submit(func, item): index for index, item in enumerate(items)
+        }
+        # Process results as they complete
+        for future in as_completed(futures):
+            index = futures[future]
+            try:
+                results[index] = future.result()
+            except Exception as e:
+                on_error(items[index], e)
+    failed = [item for item, result in zip(items, results) if result is None]
+    if failed:
+        error_summary(failed)
+        sys.exit(1)
+    return results
+
+
 def clone_projects(projects: List[Project]) -> List[tuple[Project, str]]:
```
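The new helper follows a standard `as_completed` pattern that preserves input order. A standalone sketch of that pattern (simplified: the `error_summary`/`sys.exit` step is omitted, and `slow_double` is an illustrative worker, not from the PR):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Optional


def run_in_parallel(func, items, *, on_error=lambda item, exc: None, max_workers=8):
    # Map each future back to its input index so results keep input order.
    if not items:
        return []
    results: List[Optional[object]] = [None] * len(items)
    with ThreadPoolExecutor(max_workers=min(max_workers, len(items))) as executor:
        futures = {executor.submit(func, item): i for i, item in enumerate(items)}
        # Consume results as they finish; failures leave None in place.
        for future in as_completed(futures):
            i = futures[future]
            try:
                results[i] = future.result()
            except Exception as exc:
                on_error(items[i], exc)
    return results


def slow_double(n: int) -> int:
    if n < 0:
        raise ValueError("negative input")
    return 2 * n


print(run_in_parallel(slow_double, [1, 2, 3]))  # [2, 4, 6]
print(run_in_parallel(slow_double, [1, -1]))    # [2, None]
```

Keying the futures dict by future and storing the index is what lets the results list come back in input order even though completion order is arbitrary.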
```diff
@@ -114,40 +154,19 @@ def clone_projects(projects: List[Project]) -> List[tuple[Project, str]]:
         List of (project, project_dir) pairs in the same order as the input projects
     """
     start_time = time.time()
-    max_workers = min(8, len(projects))  # Use at most 8 threads
-    project_dirs_map = {}  # Map to store results by project name
-
-    with ThreadPoolExecutor(max_workers=max_workers) as executor:
-        # Start cloning tasks and keep track of them
-        future_to_project = {
-            executor.submit(clone_project, project): project for project in projects
-        }
-
-        # Process results as they complete
-        for future in as_completed(future_to_project):
-            project = future_to_project[future]
-            try:
-                project_dir = future.result()
-                project_dirs_map[project["name"]] = (project, project_dir)
-            except Exception as e:
-                print(f"ERROR: Failed to clone {project['name']}: {e}")
-
-    if len(project_dirs_map) != len(projects):
-        failed_projects = [
-            project["name"]
-            for project in projects
-            if project["name"] not in project_dirs_map
-        ]
-        print(
-            f"ERROR: Only {len(project_dirs_map)} out of {len(projects)} projects were cloned successfully. Failed projects: {', '.join(failed_projects)}"
-        )
-        sys.exit(1)
-
-    project_dirs = [project_dirs_map[project["name"]] for project in projects]
-
+    dirs = run_in_parallel(
+        clone_project,
+        projects,
+        on_error=lambda project, exc: print(
+            f"ERROR: Failed to clone project {project['name']}: {exc}"
+        ),
+        error_summary=lambda failures: print(
+            f"ERROR: Failed to clone {len(failures)} projects: {', '.join(p['name'] for p in failures)}"
+        ),
+    )
    clone_time = time.time() - start_time
    print(f"Cloning completed in {clone_time:.2f} seconds")
-    return project_dirs
+    return list(zip(projects, dirs))
```

Review comment (Copilot, nitpick): Exiting from within a utility function (via `sys.exit` in `on_error` handlers) can make the logic harder to test or reuse; consider returning errors and handling exit at the top level instead.
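The nitpick about `sys.exit` inside the helper could be addressed by returning failures to the caller. A hedged sketch of that alternative (the name `run_in_parallel_collect` is hypothetical, not from the PR):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_parallel_collect(func, items, *, max_workers=8):
    """Return (results, failures) instead of exiting on error.

    results keeps input order (None for failed items); failures is a list of
    (item, exception) pairs, so the caller decides whether to exit.
    """
    results = [None] * len(items)
    failures = []
    if not items:
        return results, failures
    with ThreadPoolExecutor(max_workers=min(max_workers, len(items))) as executor:
        futures = {executor.submit(func, item): i for i, item in enumerate(items)}
        for future in as_completed(futures):
            i = futures[future]
            try:
                results[i] = future.result()
            except Exception as exc:
                failures.append((items[i], exc))
    return results, failures


# The exit policy then lives at the call site, e.g.:
#   results, failures = run_in_parallel_collect(clone_project, projects)
#   if failures:
#       ... report and sys.exit(1) ...
```

This keeps the helper pure and trivially unit-testable, since no test has to guard against the process exiting.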
```diff
@@ -307,7 +326,10 @@ def pretty_name_from_artifact_name(artifact_name: str) -> str:
 
 
 def download_dca_databases(
-    experiment_name: str, pat: str, projects: List[Project]
+    language: str,
+    experiment_name: str,
+    pat: str,
+    projects: List[Project],
 ) -> List[tuple[Project, str | None]]:
     """
     Download databases from a DCA experiment.
@@ -318,14 +340,14 @@
     Returns:
         List of (project_name, database_dir) pairs, where database_dir is None if the download failed.
     """
-    database_results = {}
     print("\n=== Finding projects ===")
     response = get_json_from_github(
         f"https://raw.githubusercontent.com/github/codeql-dca-main/data/{experiment_name}/reports/downloads.json",
         pat,
     )
     targets = response["targets"]
+    project_map = {project["name"]: project for project in projects}
+    artifact_map = {}
 
     for data in targets.values():
         downloads = data["downloads"]
         analyzed_database = downloads["analyzed_database"]
```
```diff
@@ -336,6 +358,16 @@
             print(f"Skipping {pretty_name} as it is not in the list of projects")
             continue
 
+        if pretty_name in artifact_map:
+            print(
+                f"Skipping previous database {artifact_map[pretty_name]['artifact_name']} for {pretty_name}"
+            )
+
+        artifact_map[pretty_name] = analyzed_database
+
+    def download_and_extract(item: tuple[str, dict]) -> str:
+        pretty_name, analyzed_database = item
         artifact_name = analyzed_database["artifact_name"]
         repository = analyzed_database["repository"]
         run_id = analyzed_database["run_id"]
         print(f"=== Finding artifact: {artifact_name} ===")
```
```diff
@@ -356,22 +388,36 @@
         # First we open the zip file
         with zipfile.ZipFile(artifact_zip_location, "r") as zip_ref:
             artifact_unzipped_location = os.path.join(build_dir, artifact_name)
+            # clean up any remnants of previous runs
+            shutil.rmtree(artifact_unzipped_location, ignore_errors=True)
             # And then we extract it to build_dir/artifact_name
             zip_ref.extractall(artifact_unzipped_location)
-            # And then we iterate over the contents of the extracted directory
-            # and extract the tar.gz files inside it
-            for entry in os.listdir(artifact_unzipped_location):
-                artifact_tar_location = os.path.join(artifact_unzipped_location, entry)
-                with tarfile.open(artifact_tar_location, "r:gz") as tar_ref:
-                    # And we just untar it to the same directory as the zip file
-                    tar_ref.extractall(artifact_unzipped_location)
-            database_results[pretty_name] = os.path.join(
-                artifact_unzipped_location, remove_extension(entry)
-            )
+            # and extract the language tar.gz file inside it
+            artifact_tar_location = os.path.join(
+                artifact_unzipped_location, f"{language}.tar.gz"
+            )
+            with tarfile.open(artifact_tar_location, "r:gz") as tar_ref:
+                # And we just untar it to the same directory as the zip file
+                tar_ref.extractall(artifact_unzipped_location)
+        ret = os.path.join(artifact_unzipped_location, language)
+        print(f"Extraction complete: {ret}")
+        return ret
+
+    results = run_in_parallel(
+        download_and_extract,
+        list(artifact_map.items()),
+        on_error=lambda item, exc: print(
+            f"ERROR: Failed to download database for {item[0]}: {exc}"
+        ),
+        error_summary=lambda failures: print(
+            f"ERROR: Failed to download {len(failures)} databases: {', '.join(item[0] for item in failures)}"
+        ),
+    )
 
-    print(f"\n=== Extracted {len(database_results)} databases ===")
+    print(f"\n=== Extracted {len(results)} databases ===")
 
-    return [(project, database_results[project["name"]]) for project in projects]
+    return [(project_map[n], r) for n, r in zip(artifact_map, results)]
 
 
 def get_mad_destination_for_project(config, name: str) -> str:
```
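The two-stage artifact layout handled above (a zip wrapping a per-language `<language>.tar.gz`) can be exercised end to end with a synthetic archive. A sketch under stated assumptions: only the zip-then-tar layout follows the PR; the function name, temp paths, and marker file are hypothetical:

```python
import os
import shutil
import tarfile
import tempfile
import zipfile


def extract_language_archive(artifact_zip: str, dest_dir: str, language: str) -> str:
    """Unzip the artifact, then untar the <language>.tar.gz found inside it."""
    # Clean any remnants of a previous run, then unzip the outer archive.
    shutil.rmtree(dest_dir, ignore_errors=True)
    with zipfile.ZipFile(artifact_zip, "r") as zf:
        zf.extractall(dest_dir)
    # Untar the inner language tarball next to the extracted files.
    tar_path = os.path.join(dest_dir, f"{language}.tar.gz")
    with tarfile.open(tar_path, "r:gz") as tf:
        tf.extractall(dest_dir)
    return os.path.join(dest_dir, language)


# Round trip on a synthetic artifact: build rust.tar.gz, wrap it in a zip,
# then extract both layers again.
with tempfile.TemporaryDirectory() as tmp:
    db_dir = os.path.join(tmp, "rust")
    os.makedirs(db_dir)
    open(os.path.join(db_dir, "marker.txt"), "w").close()
    tar_path = os.path.join(tmp, "rust.tar.gz")
    with tarfile.open(tar_path, "w:gz") as tf:
        tf.add(db_dir, arcname="rust")
    zip_path = os.path.join(tmp, "artifact.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.write(tar_path, arcname="rust.tar.gz")
    out = extract_language_archive(zip_path, os.path.join(tmp, "out"), "rust")
    print(os.path.isdir(out))  # True
```

The upfront `rmtree` mirrors the PR's fix: without it, leftovers from a previous run could mix with a fresh extraction.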
```diff
@@ -422,7 +468,9 @@ def main(config, args) -> None:
         case "repo":
             extractor_options = config.get("extractor_options", [])
             database_results = build_databases_from_projects(
-                language, extractor_options, projects
+                language,
+                extractor_options,
+                projects,
             )
         case "dca":
             experiment_name = args.dca
```
```diff
@@ -439,7 +487,10 @@ def main(config, args) -> None:
             with open(args.pat, "r") as f:
                 pat = f.read().strip()
             database_results = download_dca_databases(
-                experiment_name, pat, projects
+                language,
+                experiment_name,
+                pat,
+                projects,
             )
 
     # Generate models for all projects
```
```diff
@@ -492,9 +543,9 @@ def main(config, args) -> None:
         sys.exit(1)
     try:
         with open(args.config, "r") as f:
-            config = json.load(f)
-    except json.JSONDecodeError as e:
-        print(f"ERROR: Failed to parse JSON file {args.config}: {e}")
+            config = yaml.safe_load(f)
+    except yaml.YAMLError as e:
+        print(f"ERROR: Failed to parse YAML file {args.config}: {e}")
         sys.exit(1)
 
     main(config, args)
```
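The switch from `json.load` to `yaml.safe_load` also keeps existing JSON configs readable, since YAML 1.2 is a superset of JSON. A minimal sketch of loading a config in this PR's shape (requires PyYAML; the inline config text is an illustrative example, not a file from the repo):

```python
import sys

try:
    import yaml  # PyYAML, as required by the script
except ImportError:
    print("ERROR: PyYAML is not installed. Please install it with 'pip install pyyaml'.")
    sys.exit(1)

# An illustrative config using the keys this PR's configs use.
CONFIG = """
strategy: dca
language: rust
destination: rust/ql/lib/ext/generated
targets:
  - name: libc
  - name: serde
    with-sinks: false
"""

config = yaml.safe_load(CONFIG)
print(config["language"])                             # rust
print([t["name"] for t in config["targets"]])         # ['libc', 'serde']
# Per-target flags default to true when absent:
print(config["targets"][1].get("with-sinks", True))   # False
```

`yaml.safe_load` (rather than `yaml.load`) is the right choice here because config files should never be able to construct arbitrary Python objects.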
New file, the DCA config for Rust (`@@ -0,0 +1,18 @@`):

```yaml
strategy: dca
language: rust
destination: rust/ql/lib/ext/generated
targets:
  - name: rust
  - name: libc
  - name: log
  - name: memchr
  - name: once_cell
  - name: rand
  - name: smallvec
  - name: serde
  - name: tokio
  - name: reqwest
  - name: rocket
  - name: actix-web
  - name: hyper
  - name: clap
```

Review discussion on the target list:

> This is a nice simple list; once everything is merged and stable I'll add a bunch more targets to it.

> One thing to keep in mind is that at the moment this list needs to be topologically ordered with respect to dependencies (so later additions should depend on earlier ones and not the other way around). Possibly worth a comment here, now that this is YAML.

> Also, just so you know, you can tweak what gets generated with any of:
>
> ```yaml
> with-sinks: false
> with-sources: false
> with-summaries: false
> ```
>
> (all are true by default).

> What are the expected use cases for those three options?

> I don't really know, but you can ask Mathias once he's back from his PTO; two of them are used for the C++ generated models.

> My guess is there are certain libraries that produce a lot of inaccurate models of one type but not the others, and this gives us some additional control. @MathiasVP? (no rush, not blocking this PR)

> The reason is simply that C++ doesn't yet autogenerate sources and sinks (for a couple of reasons, but mainly because I didn't bother to set that up properly yet). The MaD generator script (which this script invokes under the hood) already provides these hooks to configure which kinds of models are generated, so I just lifted those hooks to this script in #19627.
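The topological-ordering constraint a reviewer raises above can be checked mechanically. A sketch with a hypothetical hand-maintained dependency map (the real dependency edges live in each crate's Cargo.toml; `check_topological_order` is not part of the PR):

```python
from typing import Dict, List, Set


def check_topological_order(targets: List[str], deps: Dict[str, Set[str]]) -> List[str]:
    """Return the targets that appear before one of their in-list dependencies."""
    seen: Set[str] = set()
    violations = []
    for name in targets:
        # Dependencies that are in the target list but not yet listed.
        missing = deps.get(name, set()) - seen
        if missing & set(targets):
            violations.append(name)
        seen.add(name)
    return violations


# Hypothetical dependency edges among the listed crates:
DEPS = {"tokio": {"libc"}, "hyper": {"tokio"}, "reqwest": {"tokio", "hyper"}}

print(check_topological_order(["libc", "tokio", "hyper", "reqwest"], DEPS))  # []
print(check_topological_order(["libc", "reqwest", "tokio", "hyper"], DEPS))  # ['reqwest']
```

A check like this could run in CI against the config, turning the implicit ordering requirement into an explicit error message.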
This file was deleted.