Claude/fix prover wiring epy ir #28
Merged
hyperpolymath merged 8 commits into main on Apr 20, 2026
Conversation
.github/workflows/live-provers.yml
Replace the Tier-3 placeholder with a per-backend matrix. Live-provisioned (apt / upstream tarball): tamarin, proverif, metamath, twelf, ortools. Deferred to Containerfile with an explicit skip note: hol4, acl2, scip, imandra. Each job probes the binary and runs the single matching case from tests/live_prover_suite.rs with continue-on-error while Wave-3 wires up.
tests/live_prover_suite.rs
Add #[tokio::test] entries for all nine Tier-3 backends and extend kind_label with their display names. Tests SKIP when the binary is absent on PATH (existing `which` probe), so running locally or in PR CI costs nothing; only the weekly tier3 matrix exercises them.
src/julia/run_training.jl
Drop the hardcoded max_proof_states=50_000 cap that made the expanded corpus pointless. Default is 200k on GPU / 50k on CPU, overridable via ECHIDNA_MAX_PROOF_STATES (0 = unlimited). ECHIDNA_NUM_EPOCHS and ECHIDNA_NUM_NEGATIVES gain the same treatment. Required before retraining on the 2026-04 corpus expansion.
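The knob semantics (fall back to a hardware-dependent default when unset, treat 0 as unlimited) can be sketched as a minimal Rust transliteration; the helper name `knob` and the GPU/CPU switch here are illustrative, the real logic lives in run_training.jl:

```rust
use std::env;

// Hypothetical helper mirroring the ECHIDNA_* knob treatment described
// above: unset or unparsable -> default, "0" -> None (unlimited).
fn knob(name: &str, default: usize) -> Option<usize> {
    match env::var(name).ok().and_then(|v| v.parse::<usize>().ok()) {
        Some(0) => None,       // 0 = unlimited
        Some(n) => Some(n),    // explicit override
        None => Some(default), // fall back to the hardware default
    }
}

fn main() {
    let on_gpu = false; // assume CPU for the demo
    let default_states = if on_gpu { 200_000 } else { 50_000 };
    let cap = knob("ECHIDNA_MAX_PROOF_STATES", default_states);
    println!("max_proof_states cap: {:?}", cap);
}
```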
scripts/provision_corpora.sh
Single source of truth for every prover's upstream corpus mirror. A catalogue entry per source (52 total: 45 git clones, 7 manual/gated) binds the external_corpora/<name>/ path the extractors expect to a concrete clone URL, replacing the comments-scattered-across-54-files pattern. Idempotent, non-interactive, retries once on network failure, and continues past a dead mirror so one 404 doesn't abort a 40-source run. Supports --list, --status, --force, --all, and named-subset selection. Meant for a nightly cron or a Justfile recipe.
scripts/extract_all.sh
Composable companion. Runs every scripts/extract_*.jl (or a named subset) under the src/julia Julia project and collects failures into a non-zero exit code. Short-circuits gracefully when an extractor's upstream is absent (the extractors themselves already handle that). Together with provision_corpora.sh and run_training.jl this gives a three-step end-to-end retrain recipe: provision → extract → train.
Motivation: the existing expand_training_data.sh was interactive (read -p) and only knew about three sources, so no CI job could grow the corpus without human input. Closing that gap unblocks the 100k per-prover target.
Justfile
New recipes compose the scripts landed in 9df340c into an idiomatic end-to-end retrain flow:
just provision-corpora [NAMES...]  # scripts/provision_corpora.sh
just corpora-status                # on-disk report
just extract-corpora [NAMES...]    # scripts/extract_all.sh
just retrain                       # src/julia/run_training.jl
just corpus-refresh                # all three, in order
Each step is idempotent; re-running the pipeline only touches what has changed. The retrain recipe honours the env-var knobs added in 5611cb2 (ECHIDNA_MAX_PROOF_STATES, ECHIDNA_NUM_EPOCHS, ECHIDNA_NUM_NEGATIVES), so:
ECHIDNA_MAX_PROOF_STATES=0 just corpus-refresh
re-baselines MRR against the full expanded corpus on GPU hardware.
… ProofStateRecord
src/rust/verification/proof.rs
New module (verisim-feature-gated) materialising the four entities the
E-R audit flagged as missing from the Verisim schema. Each closes one
of the FK-discipline gaps identified in the 2026-04-19 audit:
* Proof{id, goal_hash, theorem_hash, theorem_name, prover, first_seen}
makes the SHA-256 digest already produced by
proof_encoding::proof_identity() a first-class row. theorem_hash
(new helper theorem_identity) and goal_hash let cross-prover and
cross-goal joins index on fixed-width columns without re-deriving
the triple from the primary key.
* TacticApplication{proof_id, step, tactic, goal_hash, status,
goals_after, duration_ms, applied_at} persists the proof DAG.
Success/Closed/Error TacticStatus is kept (not just success) so
MCTS/agentic search retains failed-branch training signal.
* ProofVersion{proof_id, version, snapshot_id, goals_remaining,
is_qed, actor, recorded_at} is the append-only audit log that
OctadPayload.temporal.versions currently lacks — the latter is
render-only; this is durable. `initial()` and `next()` chain
monotonically so versions never collide.
* ProofStateRecord{proof_id, step, goal_hash, prover, prover_version,
state_cbor_b64, features, relevant_premises, label_confidence,
source} is the typed corpus schema replacing the ad-hoc JSONL dict.
state_cbor_b64 round-trips the tree structure the JSONL format
loses; label_confidence separates human-verified ground truth from
weak heuristics; prover_version attributes MRR deltas to solver
builds.
The module adds 8 unit tests (hash-derivation correctness, FK wiring,
monotonic version chaining, QED marking, round-trip serde). Existing
OctadPayload posting continues unchanged — these rows are intended to
be written alongside, not instead of, the octad, via an incremental
migration path.
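The monotonic version chaining can be sketched as follows. Field names follow the summary above, but this is an illustrative reconstruction, not the actual proof.rs module:

```rust
// Sketch of the append-only ProofVersion chain: initial() starts at
// version 1, next() increments monotonically so versions never collide,
// and is_qed is derived from the remaining goal count.
#[derive(Debug, Clone)]
struct ProofVersion {
    proof_id: u64,
    version: u32,
    goals_remaining: u32,
    is_qed: bool,
}

impl ProofVersion {
    fn initial(proof_id: u64, goals_remaining: u32) -> Self {
        Self { proof_id, version: 1, goals_remaining, is_qed: goals_remaining == 0 }
    }

    fn next(&self, goals_remaining: u32) -> Self {
        Self {
            proof_id: self.proof_id,
            version: self.version + 1, // strictly increasing
            goals_remaining,
            is_qed: goals_remaining == 0, // QED once no goals remain
        }
    }
}

fn main() {
    let v1 = ProofVersion::initial(42, 3);
    let v2 = v1.next(1);
    let v3 = v2.next(0);
    assert_eq!((v1.version, v2.version, v3.version), (1, 2, 3));
    assert_eq!(v3.goals_remaining, 0);
    assert!(v3.is_qed && !v2.is_qed);
}
```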
src/rust/verification/mod.rs
Export the new types behind the existing `verisim` feature gate,
matching how proof_encoding is gated.
src/rust/proof_encoding.rs + src/rust/verisim_bridge.rs
Drive-by fix: the `verisim` feature tests referenced
ProverKind::Lean4 (which doesn't exist; the Lean 4 variant is
ProverKind::Lean, and Lean 3 is the sibling ProverKind::Lean3) and a
hardcoded "Lean4" Debug string. Fixing these restores the 27
previously-uncompilable verisim-feature tests so the new proof.rs
tests can share the harness.
Test counts: 613 default, 640 with --features verisim (0 failures).
scripts/align_premises.jl
New Julia script that reconciles the three-way id-space bug found
during today's retrain:
* per-prover extractors emit proof_state.id and matching
premise.proof_id sharing a local integer id space
* scripts/merge_corpus.jl:350-352 dedupes proof_states by
(prover, theorem) and rewrites every id to a fresh sequential
counter 1..N in proof_states_UNIFIED.jsonl
* the premise files retain their original extractor ids, so the
dataloader's proof_id join matches ~0% of records — today's
retrain saw 1 392/50 000 = 2.8% coverage
Fix: rebuild premises_COMPLETE.jsonl keyed on (canonical_prover,
theorem) — the same pair merge_corpus.jl dedupes on, which is
durable across the id rewrite. For every premise record, look up
the fresh id assigned by the merge and rewrite proof_id := UNIFIED_id.
Records whose theorem is not in UNIFIED get dropped and counted.
Expected coverage jump: 2.8% → ~60–80% of proof_states with at
least one matched premise, ~20-30x more training examples.
Canonicalisation table held in lockstep with merge_corpus.jl's
PROVER_CANONICAL. Stats written to premises_ALIGNED_stats.json
alongside the output.
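The rekeying step can be sketched in Rust (the real script is Julia); the record shapes and names here are toy stand-ins for the JSONL rows:

```rust
use std::collections::HashMap;

// Sketch of the align_premises.jl rekeying: index the UNIFIED rows by
// (canonical_prover, theorem) -- the pair that survives the id rewrite --
// then rewrite each premise's stale proof_id to the fresh UNIFIED id,
// dropping and counting premises whose theorem missed the merge.
fn align(
    unified: &[(String, String, u64)],    // (prover, theorem, fresh_id)
    premises: Vec<(String, String, u64)>, // (prover, theorem, stale proof_id)
) -> (Vec<(String, String, u64)>, usize) {
    let index: HashMap<(&str, &str), u64> = unified
        .iter()
        .map(|(p, t, id)| ((p.as_str(), t.as_str()), *id))
        .collect();
    let mut dropped = 0usize;
    let kept: Vec<_> = premises
        .into_iter()
        .filter_map(|(p, t, _stale)| {
            let fresh = index.get(&(p.as_str(), t.as_str())).copied();
            match fresh {
                Some(id) => Some((p, t, id)), // proof_id := UNIFIED id
                None => {
                    dropped += 1; // theorem absent from UNIFIED
                    None
                }
            }
        })
        .collect();
    (kept, dropped)
}

fn main() {
    let unified = vec![
        ("lean".to_string(), "thm_a".to_string(), 1),
        ("coq".to_string(), "thm_b".to_string(), 2),
    ];
    let premises = vec![
        ("lean".to_string(), "thm_a".to_string(), 907),
        ("coq".to_string(), "thm_b".to_string(), 13),
        ("coq".to_string(), "thm_gone".to_string(), 14), // not in UNIFIED
    ];
    let (kept, dropped) = align(&unified, premises);
    assert_eq!((kept.len(), dropped), (2, 1));
    assert_eq!(kept[0].2, 1); // rewritten to the fresh UNIFIED id
}
```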
Justfile
Add `merge-corpora` (runs scripts/merge_corpus.jl) and
`align-premises` (runs the new script) recipes and insert them
into the corpus-refresh chain between extract-corpora and retrain,
so the full pipeline is now: provision → extract → merge → align →
retrain. Individual recipes remain callable for targeted work.
src/rust/agent/memory.rs, src/rust/vcl_ut.rs
Drop `warn` from the default-feature `tracing` import — the macro is only called inside #[cfg(feature = "verisim")] blocks. Re-import it behind the same gate so verisim builds keep compiling.
src/rust/vcl_ut.rs
Move the `let theorem = ...` binding inside the existing #[cfg(feature = "verisim")] block in execute_find_proof — it's only used by the verisim octad-key path. This removes the unused-variable warning under the default feature without sprinkling _theorem.
src/rust/provers/mod.rs
Drop the `"lp"` arm from the Dedukti file-extension match (line 1359). `.lp` is already claimed by GLPK seventeen lines above, so the Dedukti branch was unreachable. Inline a comment explaining the .lp ambiguity (LP/MIP vs lambdapi) and pointing readers at detect_from_file_content() for content-aware disambiguation.
src/rust/verisim_bridge.rs
Drop unused `info` from the tracing import (only debug and warn are called) and unused `Tactic` from the top-level core import (only the test module references it; it is now imported there explicitly). Test counts unchanged: 613 default, 640 verisim, 0 failures.
…nfig
dispatch.rs:101 added a `diagnostics: bool` field to DispatchConfig under #[serde(default)], but routing_benchmarks.rs:305 still constructs the config with the old field set, breaking `cargo bench --no-run`. Add `diagnostics: false` to match the new field set. Production callers keep using DispatchConfig::default(), so only the bench is affected.
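A minimal reproduction of why only the bench broke, with hypothetical field names (this is not the real DispatchConfig): a struct literal must name every field, so #[serde(default)] helps deserialization but not literal construction sites, while Default-based callers pick the new field up for free:

```rust
// Illustrative stand-in for DispatchConfig; `max_retries` is invented.
#[derive(Default)]
struct DispatchConfig {
    max_retries: u32,
    diagnostics: bool, // new field; serde default only covers deserialization
}

fn main() {
    // Bench-style literal construction must now spell out the new field:
    let bench_cfg = DispatchConfig { max_retries: 3, diagnostics: false };
    // Production path is unaffected because it goes through Default:
    let prod_cfg = DispatchConfig::default();
    assert!(!bench_cfg.diagnostics && !prod_cfg.diagnostics);
    assert_eq!(prod_cfg.max_retries, 0);
}
```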
Signed-off-by: Jonathan D.A. Jewell <6759885+hyperpolymath@users.noreply.github.com>