Claude AI #15
base: main
Conversation
- Replace panic in print_dup_text with graceful error handling when files become inaccessible
- Fix unwrap in ignore file parsing to handle I/O errors without crashing
- Replace unwrap calls in file path handling with to_string_lossy for better UTF-8 support
- Improve application stability by allowing partial failures instead of complete crashes

Signed-off-by: Tony Asleson <tony.asleson@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Replace all unwrap() calls in src/main.rs with appropriate error handling:

- Mutex lock failures now print descriptive error messages and return/exit gracefully
- JSON serialization errors are handled with proper error messages
- Thread pool builder uses expect() with descriptive message
- Maintains thread safety while providing better user experience

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
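As a rough illustration of the pattern these two commits describe — this is a sketch with hypothetical names, not duplihere's actual code:

```rust
use std::path::Path;
use std::sync::Mutex;

// Hypothetical shared state standing in for whatever main.rs guards.
static RESULTS: Mutex<Vec<String>> = Mutex::new(Vec::new());

fn record(line: String) {
    // Instead of RESULTS.lock().unwrap(), report a poisoned lock and move on.
    match RESULTS.lock() {
        Ok(mut guard) => guard.push(line),
        Err(e) => eprintln!("ERROR: results mutex poisoned, skipping entry: {}", e),
    }
}

fn display_name(p: &Path) -> String {
    // to_string_lossy never panics on non-UTF-8 paths; invalid bytes become U+FFFD.
    p.to_string_lossy().into_owned()
}

fn main() {
    record("example".to_string());
    println!("{}", display_name(Path::new("/tmp/file.txt")));
}
```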
Sync version number in README.md from 0.9.0 to 0.9.3 to match current version in Cargo.toml.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Implement extensive test suite covering duplicate detection algorithms.

Core functionality tests:
- Hash calculation and consistency
- File signature generation with real files
- Rolling hash windows and pattern detection
- FileId management and memory optimization

Advanced sequence matching tests:
- Exact sequence alignment between files
- Partial matches with diverging content
- Mixed match/no-match patterns in sequences
- Complex repeating subsequence detection

Edge case coverage:
- Single-line files and boundary conditions
- End-of-file matching scenarios
- Same-file overlap prevention
- Non-matching sequence handling

Integration scenarios:
- Multi-file collision detection
- Real-world code patterns (functions, braces)
- Whitespace trimming validation
- Cross-file duplicate positioning

Added tempfile dev-dependency for realistic file testing. All 23 tests pass with comprehensive coverage of duplihere's core algorithms.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
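A minimal sketch of what a tempfile-backed test of this kind can look like — the exact parameter list of `file_signatures` is assumed here for illustration; only its name and `Result<Vec<u64>>` return come from this PR:

```rust
use std::io::Write;

#[test]
fn signatures_are_deterministic() -> Result<(), Box<dyn std::error::Error>> {
    // NamedTempFile creates a real file on disk and deletes it on drop.
    let mut f = tempfile::NamedTempFile::new()?;
    writeln!(f, "fn main() {{")?;
    writeln!(f, "    println!(\"hi\");")?;
    writeln!(f, "}}")?;

    let path = f.path().to_string_lossy().into_owned();

    let first = duplihere::file_signatures(&path)?;
    let second = duplihere::file_signatures(&path)?;
    assert_eq!(first, second); // same content must hash identically
    Ok(())
}
```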
Implement three sophisticated tests simulating real-world duplicate detection scenarios:

1. Large file with repeating 12-line sections separated by random lines:
   - Simulates 64-line file with 5 repetitions of JavaScript function pattern
   - Tests pattern detection across random separator comments
   - Validates multiple window sizes (3, 6, 12 lines) and position accuracy
   - Confirms sub-pattern matching and scaling behavior

2. Collision detection with interrupted patterns:
   - Tests core collision detection system with 6-line patterns
   - Simulates full pipeline from rolling hashes to collision frequency analysis
   - Validates detection of patterns repeated 3+ times with separators
   - Confirms multi-occurrence detection and window size impact

3. Real-world firmware hex pattern testing:
   - Simulates firmware blob scenarios with 4-line hex patterns
   - Tests 10 repetitions with intentional variations and separators
   - Validates statistical pattern frequency analysis
   - Confirms robustness with realistic data structures

These tests cover complex scenarios like configuration files, code templates, log patterns, and firmware data, validating duplihere's ability to detect duplicates across interrupted sequences and random separators. All 26 tests pass, including comprehensive real-world pattern validation.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Split duplihere into a clean library/binary architecture for better maintainability.

Library structure (src/lib.rs):
- Extract all core duplicate detection functionality to library
- Make functions public for external testing and reuse
- Implement proper error handling throughout
- Organize code into logical modules and sections

Binary structure (src/main.rs):
- Reduce to focused CLI interface (82 lines vs 1,580+ lines)
- Use library functions for all core functionality
- Clean separation of concerns between CLI and algorithms

Test organization (tests/integration_tests.rs):
- Move all 26 comprehensive unit tests to external test file
- Enable proper integration testing of library functions
- Maintain full test coverage of core algorithms
- Follow standard Rust testing conventions

Benefits:
- Better code maintainability and organization
- Improved testability with external test access
- Standard Rust project structure
- Cleaner separation between library and CLI
- Enables library reuse by other projects

All functionality preserved with significantly improved code organization.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Modernize codebase to use Rust Edition 2021 for improved language features.

Cargo.toml:
- Update edition from "2018" to "2021"

Library modernization (src/lib.rs):
- Remove unnecessary extern crate declarations
- Replace macro_use with explicit lazy_static import
- Use direct module imports following Edition 2021 conventions
- Cleaner import structure and better code organization

Binary modernization (src/main.rs):
- Remove extern crate alias for rags_rs
- Use direct module imports instead of aliasing
- Update function signature to use rags_rs::Error directly

Benefits:
- Access to latest Rust language features and improvements
- Cleaner, more modern Rust syntax
- Better alignment with current Rust conventions
- Simplified import statements and dependency management

All functionality preserved with full test coverage maintained.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
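Concretely, the macro import change this commit describes looks like the following generic example (assuming lazy_static 1.x):

```rust
// Edition 2018 style, removed by this commit:
//
//     #[macro_use]
//     extern crate lazy_static;

// Edition 2021 style: import the macro explicitly.
use lazy_static::lazy_static;

lazy_static! {
    static ref GREETING: String = format!("hello, {}", "edition 2021");
}

fn main() {
    println!("{}", *GREETING);
}
```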
Replace all unwrap() calls and most process::exit() calls with proper Result-based error handling throughout the codebase.

Key changes:
- Add custom DupliError enum with variants for different error types
- Implement From traits for automatic error conversion
- Convert core functions to return Result types:
  * file_signatures() -> Result<Vec<u64>>
  * process_file() -> Result<()>
  * print_report() -> Result<()>
  * get_ignore_hashes() -> Result<HashMap<u64, bool>>
  * files_to_process() -> Result<Vec<(u32, Arc<String>)>>
  * process_files() -> Result<DashMap<u64, Collision>>
- Update main.rs to handle Result types with proper error messages
- Modify all tests to work with new Result-based APIs
- Maintain backwards compatibility for Serialize trait implementation

Benefits:
- Better error context and composable error handling
- Elimination of unwrap() calls and most panics
- More testable and idiomatic Rust error handling
- Graceful error recovery where possible

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
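The commit doesn't show the enum itself, but the pattern it describes is roughly the following sketch (variant names beyond what the commit message implies are assumptions):

```rust
use std::fmt;

#[derive(Debug)]
pub enum DupliError {
    Io(std::io::Error),
    Msg(String),
}

impl fmt::Display for DupliError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DupliError::Io(e) => write!(f, "I/O error: {}", e),
            DupliError::Msg(m) => write!(f, "{}", m),
        }
    }
}

impl std::error::Error for DupliError {}

// A From impl lets `?` convert io::Error into DupliError automatically.
impl From<std::io::Error> for DupliError {
    fn from(e: std::io::Error) -> Self {
        DupliError::Io(e)
    }
}

pub type Result<T> = std::result::Result<T, DupliError>;

// Example: any io::Error raised here propagates as DupliError::Io.
pub fn read_all(path: &str) -> Result<String> {
    Ok(std::fs::read_to_string(path)?)
}
```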
Walkthrough

Introduces a new reusable library (src/lib.rs) for duplicate-text detection, refactors main to use the library, updates Cargo.toml to edition 2021 and adds a dev-dependency, adjusts README version text, and adds comprehensive integration tests validating hashing, rolling windows, collision detection, file mapping, and reporting paths.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant CLI as main.rs (CLI)
    participant Lib as duplihere::process_files
    participant FS as Filesystem
    participant Hash as Signatures & Rolling Hashes
    participant Col as Collision Detector
    participant Rep as Reporting
    User->>CLI: Run with args (globs, lines, print/json, ignore, threads)
    CLI->>Lib: process_files(Options)
    Lib->>FS: files_to_process(globs)
    FS-->>Lib: [(file_id, name)]
    par parallel per file
        Lib->>Hash: file_signatures(file)
        Hash-->>Lib: Vec<u64>
    end
    Lib->>Hash: rolling_hashes(signatures, min_lines)
    Lib->>Col: find_collisions(...)
    Col-->>Lib: DashMap<key, Collision>
    Lib-->>CLI: results
    CLI->>Rep: process_report(results, Options, ignore_hashes)
    alt json
        Rep->>CLI: print JSON
    else text
        Rep->>CLI: print text
    end
    CLI-->>User: Exit 0 or error
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Pre-merge checks and finishing touches

- ❌ Failed checks (1 inconclusive)
- ✅ Passed checks (2 passed)
Actionable comments posted: 6
🧹 Nitpick comments (8)
Cargo.toml (1)
19-20: Loosen dev-dep pin to receive compatible fixes automatically

Use a caret range so CI picks up non-breaking fixes (security/bug) in 3.x without manual bumps.

```diff
-[dev-dependencies]
-tempfile = "3.0"
+[dev-dependencies]
+tempfile = "3"
```

README.md (1)
30-31: Typo: “indention” → “indentation”

Minor doc polish.

```diff
-The duplicated text can be at different levels of indention,
+The duplicated text can be at different levels of indentation,
```

tests/integration_tests.rs (3)
521-526: Align with stricter collision semantics

Same rationale: prefer None over Some(num_lines=0).

```diff
-    let collision3 = maximize_collision(&file_hashes, &left3, &right3, 2);
-    // Should not match [4,5] vs [99,98] - might return Some with 0 lines
-    if let Some(collision3) = collision3 {
-        assert_eq!(collision3.num_lines, 0);
-    }
+    let collision3 = maximize_collision(&file_hashes, &left3, &right3, 2);
+    assert!(collision3.is_none());
```
557-560: Avoid accepting 0-line “matches”

Tighten to None to reflect a true non-match.

```diff
-    let collision2 = maximize_collision(&file_hashes, &left, &right2, min_lines);
-    // maximize_collision might return Some with 0 lines instead of None
-    if let Some(collision2) = collision2 {
-        assert_eq!(collision2.num_lines, 0);
-    }
+    let collision2 = maximize_collision(&file_hashes, &left, &right2, min_lines);
+    assert!(collision2.is_none());
```
398-405: Comment nit: braces

The comment says “both '}'” but the file content writes “}}”. Tests are correct; tweak the comment to reduce confusion.

```diff
-// Lines 1 and 3 should have the same signature (both "}")
+// Lines 3 and 7 should have the same signature (both "}}")
```

src/lib.rs (3)
631-668: Global FILE_LOOKUP locking scope

You hold the FILE_LOOKUP mutex across I/O (glob + canonicalize). Not an issue here (single-threaded phase), but annotate this invariant or narrow the lock if you ever parallelize discovery.
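If discovery ever does go parallel, one way to narrow the critical section looks roughly like this (a sketch only; FILE_LOOKUP's real type in duplihere is assumed here):

```rust
use std::sync::Mutex;

// Stand-in for the real FILE_LOOKUP; the actual type is assumed.
static FILE_LOOKUP: Mutex<Vec<String>> = Mutex::new(Vec::new());

fn discover(patterns: &[&str]) {
    // Do all I/O (glob expansion, canonicalization) with no lock held...
    let mut found: Vec<String> = Vec::new();
    for pat in patterns {
        if let Ok(paths) = glob::glob(pat) {
            for path in paths.flatten() {
                if let Ok(canon) = path.canonicalize() {
                    found.push(canon.to_string_lossy().into_owned());
                }
            }
        }
    }
    // ...then take the mutex only long enough to publish the results.
    if let Ok(mut lookup) = FILE_LOOKUP.lock() {
        lookup.extend(found);
    }
}
```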
703-710: Clarify Arc → &str conversion

The implicit deref from &Arc to &str works, but using as_str() is clearer.

```diff
-    if let Err(err) = process_file(
-        e.0,
-        &e.1,
+    if let Err(err) = process_file(
+        e.0,
+        e.1.as_str(),
         opts.lines as usize,
         &file_hashes,
         &collision_hashes,
     ) {
```
372-442: print_report: prefer HashSet for ignore lookups

Use HashSet for ignore lookups; drop the bool payload.

```diff
-pub fn print_report(
-    printable_results: &[Collision],
-    opts: &Options,
-    ignore_hashes: &HashMap<u64, bool>,
-) -> Result<()> {
+pub fn print_report(
+    printable_results: &[Collision],
+    opts: &Options,
+    ignore_hashes: &std::collections::HashSet<u64>,
+) -> Result<()> {
@@
-        if ignore_hashes.contains_key(&p.key) {
+        if ignore_hashes.contains(&p.key) {
```

And in get_ignore_hashes:

```diff
-pub fn get_ignore_hashes(file_name: &str) -> Result<HashMap<u64, bool>> {
-    let mut ignores: HashMap<u64, bool> = HashMap::new();
+pub fn get_ignore_hashes(file_name: &str) -> Result<std::collections::HashSet<u64>> {
+    let mut ignores: std::collections::HashSet<u64> = std::collections::HashSet::new();
@@
-    ignores.insert(hv, true);
+    ignores.insert(hv);
```

Propagate the type changes to callers (main.rs, process_report).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- Cargo.toml (2 hunks)
- README.md (1 hunk)
- src/lib.rs (1 hunk)
- src/main.rs (3 hunks)
- tests/integration_tests.rs (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/integration_tests.rs (1)
src/lib.rs (8)
- calculate_hash (89-93)
- rolling_hashes (121-140)
- new (592-598)
- id_to_name (621-623)
- number_files (626-628)
- file_signatures (97-115)
- overlap (290-297)
- maximize_collision (301-348)
src/main.rs (1)
src/lib.rs (3)
- new (592-598)
- get_ignore_hashes (543-565)
- process_files (696-720)
🔇 Additional comments (9)
Cargo.toml (2)
5-5: Edition 2021: good upgrade

Edition bump aligns with current tooling and lets you use 2021 idioms without friction.
9-18: Confirm these versions are deliberate: rayon 1.3.0 and dashmap 5.5.0 (Cargo.toml lines 9–18)
rayon 1.3.0 and dashmap 5.5.0 may be older; if not intentionally pinned, bump them under semver to pick up fixes/optimizations for parallel-sorts/raw-shards usage.
Verification here failed (no configured Rust toolchain). Run locally and paste outputs:
```sh
rustup default stable
# if you need -Z flags: rustup default nightly
cargo update -Z minimal-versions
cargo tree -i rayon
cargo tree -i dashmap
```

README.md (1)
24-24: Version string update: LGTM

Matches Cargo version 0.9.3.
src/main.rs (2)
16-58: Return-based main is fine

Result-returning main with rags_rs errors is clean and consistent.
73-81: Ignore-file load: good UX with specific exit code

Clear message and exit code 2. Consider bubbling errors instead if this binary ever becomes a library entry point.
tests/integration_tests.rs (1)
257-281: OK: exact-match maximize_collision happy path

Solid coverage for multi-file exact sequence growth.
src/lib.rs (3)
121-140: Consecutive-duplicate suppression: good

rolling_hashes avoids back-to-back duplicates; the windowing logic is sound.
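For readers who haven't opened the diff, the shape of that suppression is roughly the following (an illustration, not the library's exact code, which also tracks line positions):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash every window of `min_lines` consecutive line signatures, but skip a
// window whose hash repeats the immediately preceding one, so runs of
// identical lines don't flood the collision map with back-to-back duplicates.
fn rolling_window_hashes(signatures: &[u64], min_lines: usize) -> Vec<u64> {
    let mut out = Vec::new();
    let mut prev: Option<u64> = None;
    for window in signatures.windows(min_lines) {
        let mut h = DefaultHasher::new();
        window.hash(&mut h);
        let hv = h.finish();
        if prev != Some(hv) {
            out.push(hv);
        }
        prev = Some(hv);
    }
    out
}

fn main() {
    // Three identical windows in a row collapse to one entry.
    let sigs = vec![7, 7, 7, 7, 9];
    println!("{:?}", rolling_window_hashes(&sigs, 2));
}
```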
350-369: print_dup_text: boundary handling looks correct

Reads and prints the inclusive range as intended; no changes needed.
399-407: Off-by-one: end line in human-readable output

End should be inclusive (start + num_lines - 1). Current code prints one past the last line.

```diff
     let start_line = spec_file.line_number;
-    let end_line = start_line + p.num_lines;
+    let end_line = start_line + p.num_lines.saturating_sub(1);
     println!(
         "Between lines {} and {} in {}",
         start_line + 1,
         end_line,
         filename
     );
```

Likely an incorrect or invalid review comment. (Indeed: start_line is 0-based and printed as start_line + 1, so end_line = start_line + num_lines already equals the 1-based inclusive end; e.g., start_line 10 spanning 3 lines correctly prints "lines 11 and 13".)
```rust
    // If after walking we overlap skip too
    if overlap(l_info, r_info, offset) {
        return None;
    }

    let files: Vec<LineId> = vec![*l_info, *r_info];
    Some(Collision {
        key: s.finish(),
        num_lines: offset,
        start_lines: files,
        sig: 0,
    })
}
```
Return None for sub-threshold matches to prevent 0-line “collisions”

When the rolling-hash collides but actual line hashes differ immediately, offset stays 0. Returning Some with num_lines=0 pollutes results and downstream reporting.

```diff
     // If after walking we overlap skip too
-    if overlap(l_info, r_info, offset) {
+    if offset < min_lines {
+        return None;
+    }
+    if overlap(l_info, r_info, offset) {
         return None;
     }
```

Update related tests as suggested in tests/integration_tests.rs.
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```rust
    // If after walking we overlap skip too
    if offset < min_lines {
        return None;
    }
    if overlap(l_info, r_info, offset) {
        return None;
    }

    let files: Vec<LineId> = vec![*l_info, *r_info];
    Some(Collision {
        key: s.finish(),
        num_lines: offset,
        start_lines: files,
        sig: 0,
    })
}
```
🤖 Prompt for AI Agents

In src/lib.rs around lines 336 to 348, the function currently returns Some(Collision) even when offset is 0, producing a 0-line collision; change the logic to return None when offset == 0 so sub-threshold (immediate mismatch) matches are ignored, and update the related tests in tests/integration_tests.rs to expect no collision in that case (adjust assertions/expected outputs accordingly).
```rust
pub fn process_report(
    results_hash: DashMap<u64, Collision>,
    opts: &Options,
    ignore_hashes: &HashMap<u64, bool>,
) {
    let mut final_report: Vec<Collision> = results_hash.into_iter().map(|(_, v)| v).collect();
    final_report.par_sort_unstable_by(|a, b| a.num_lines.cmp(&b.num_lines).reverse());

    let mut printable_results: Vec<Collision> = Vec::new();

    {
        let mut chunk_processed: HashMap<u64, bool> = HashMap::new();

        final_report.par_iter_mut().for_each(|ea| ea.scrub());

        for ea in final_report {
            let cs = ea.signature();
            if let std::collections::hash_map::Entry::Vacant(e) = chunk_processed.entry(cs) {
                e.insert(true);
                printable_results.push(ea);
            }
        }
    }

    printable_results.par_sort_unstable_by(|a, b| {
        a.num_lines
            .cmp(&b.num_lines)
            .then_with(|| {
                a.start_lines[0]
                    .line_number
                    .cmp(&b.start_lines[0].line_number)
            })
            .then_with(|| a.start_lines[0].file_id.cmp(&b.start_lines[0].file_id))
    });

    if let Err(e) = print_report(&printable_results, opts, ignore_hashes) {
        eprintln!("ERROR: Failed to print report: {}", e);
        process::exit(1);
    }
}
```
🛠️ Refactor suggestion

Library should not exit the process

process_report calls process::exit(1) on error, which is surprising for a library. Return Result and let the binary decide exit codes. See the main.rs comment for the coordinated change.

```diff
 pub fn process_report(
     results_hash: DashMap<u64, Collision>,
     opts: &Options,
     ignore_hashes: &HashMap<u64, bool>,
-) {
+) -> Result<()> {
@@
-    if let Err(e) = print_report(&printable_results, opts, ignore_hashes) {
-        eprintln!("ERROR: Failed to print report: {}", e);
-        process::exit(1);
-    }
-}
+    print_report(&printable_results, opts, ignore_hashes)?;
+    Ok(())
+}
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```rust
pub fn process_report(
    results_hash: DashMap<u64, Collision>,
    opts: &Options,
    ignore_hashes: &HashMap<u64, bool>,
) -> Result<()> {
    let mut final_report: Vec<Collision> = results_hash.into_iter().map(|(_, v)| v).collect();
    final_report.par_sort_unstable_by(|a, b| a.num_lines.cmp(&b.num_lines).reverse());

    let mut printable_results: Vec<Collision> = Vec::new();

    {
        let mut chunk_processed: HashMap<u64, bool> = HashMap::new();

        final_report.par_iter_mut().for_each(|ea| ea.scrub());

        for ea in final_report {
            let cs = ea.signature();
            if let std::collections::hash_map::Entry::Vacant(e) = chunk_processed.entry(cs) {
                e.insert(true);
                printable_results.push(ea);
            }
        }
    }

    printable_results.par_sort_unstable_by(|a, b| {
        a.num_lines
            .cmp(&b.num_lines)
            .then_with(|| {
                a.start_lines[0]
                    .line_number
                    .cmp(&b.start_lines[0].line_number)
            })
            .then_with(|| a.start_lines[0].file_id.cmp(&b.start_lines[0].file_id))
    });

    print_report(&printable_results, opts, ignore_hashes)?;
    Ok(())
}
```
🤖 Prompt for AI Agents

In src/lib.rs around lines 500 to 539, process_report currently calls process::exit(1) when print_report fails, which is inappropriate for a library; change the function to return a Result<(), E> (choose the crate convention, e.g. anyhow::Result<()> or Result<(), Box<dyn std::error::Error + Send + Sync>>) instead of returning unit, remove the process::exit call, and propagate the print_report error back to the caller (use the ? operator or map_err to convert error types as needed), ensuring you return Ok(()) on success; update callers (e.g., the binary in main.rs) to handle the Result and decide exit codes.
```rust
    pub fn register_file(&mut self, file_name: Arc<String>) -> Option<u32> {
        if self.name_to_index.contains_key(&file_name) {
            return None;
        }
        let num = self.num_files;

        self.index_to_name.push(Arc::clone(&file_name));
        self.name_to_index
            .insert(Arc::clone(&file_name), self.num_files);
        if let Some(v) = self.num_files.checked_add(1) {
            self.num_files = v;
        } else {
            eprintln!("Number of files processed exceeds {}", u32::MAX);
            process::exit(2);
        }
        Some(num)
    }
```
🛠️ Refactor suggestion

Don’t exit from a library on counter overflow

Exiting the process from FileId::register_file is hostile to embedders. Return an error and let callers decide.

```diff
     pub fn register_file(&mut self, file_name: Arc<String>) -> Option<u32> {
@@
-        if let Some(v) = self.num_files.checked_add(1) {
-            self.num_files = v;
-        } else {
-            eprintln!("Number of files processed exceeds {}", u32::MAX);
-            process::exit(2);
-        }
+        self.num_files = self.num_files.checked_add(1).unwrap_or_else(|| {
+            eprintln!("Number of files processed exceeds {}", u32::MAX);
+            // Prefer: change API to Result and propagate instead of exiting.
+            panic!("FileId overflow");
+        });
```

If you’re open to a breaking change, switch the return type to Result<u32, DupliError> and propagate.
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```rust
    pub fn register_file(&mut self, file_name: Arc<String>) -> Option<u32> {
        if self.name_to_index.contains_key(&file_name) {
            return None;
        }
        let num = self.num_files;

        self.index_to_name.push(Arc::clone(&file_name));
        self.name_to_index
            .insert(Arc::clone(&file_name), self.num_files);
        self.num_files = self.num_files.checked_add(1).unwrap_or_else(|| {
            eprintln!("Number of files processed exceeds {}", u32::MAX);
            // Prefer: change API to Result and propagate instead of exiting.
            panic!("FileId overflow");
        });
        Some(num)
    }
```
🤖 Prompt for AI Agents

In src/lib.rs around lines 602 to 618, the function currently calls process::exit on num_files overflow, which aborts the entire process; change it to return an error instead of exiting: update the signature to return Result<u32, DupliError> (or, if you must avoid a breaking change, return Option<u32> and treat overflow as None), add a new DupliError variant for "FileIdOverflow" (or reuse an existing error type), replace the process::exit branch with returning Err(DupliError::FileIdOverflow) (or None), and update all callers to propagate or handle the Result appropriately (use ? where possible) so embedders can decide how to handle overflow.
```diff
     if opts.threads != 0 {
-        rayon::ThreadPoolBuilder::new()
+        ThreadPoolBuilder::new()
             .num_threads(opts.threads)
             .build_global()
-            .unwrap();
+            .expect("Failed to build global thread pool");
     }
```
Don’t panic if the global Rayon pool is already initialized

build_global() will Err if any Rayon use happened earlier; expect(...) will abort the program. Handle gracefully and continue with the existing pool.

```diff
     if opts.threads != 0 {
-        ThreadPoolBuilder::new()
-            .num_threads(opts.threads)
-            .build_global()
-            .expect("Failed to build global thread pool");
-    }
+        if let Err(e) = ThreadPoolBuilder::new()
+            .num_threads(opts.threads)
+            .build_global()
+        {
+            eprintln!("WARNING: Unable to set Rayon global thread pool (already initialized?): {e}");
+        }
+    }
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```rust
    if opts.threads != 0 {
        if let Err(e) = ThreadPoolBuilder::new()
            .num_threads(opts.threads)
            .build_global()
        {
            eprintln!("WARNING: Unable to set Rayon global thread pool (already initialized?): {e}");
        }
    }
```
🤖 Prompt for AI Agents

In src/main.rs around lines 66 to 71, the call to ThreadPoolBuilder::new().num_threads(...).build_global().expect(...) panics if the global Rayon pool is already initialized; replace the expect with error handling that treats an Err as non-fatal. Call build_global(), match or if-let on the Result, and on Ok do nothing, but on Err log a debug/warn that the global pool was already initialized (or otherwise failed to install) and continue without aborting; ensure the program does not panic when Rayon reports the pool is already set up.
```rust
    let results_hash = match process_files(&opts) {
        Ok(results) => results,
        Err(e) => {
            eprintln!("ERROR: Failed to process files: {}", e);
            std::process::exit(1);
        }
    };
    process_report(results_hash, &opts, &ignore_hash);
}
```
🛠️ Refactor suggestion

Propagate reporting errors instead of exiting inside the library

process_report currently calls process::exit(1) on failures. Prefer returning Result from the library and handling exit here.

Apply these diffs (lib + main). In src/lib.rs:

```diff
 pub fn process_report(
     results_hash: DashMap<u64, Collision>,
     opts: &Options,
     ignore_hashes: &HashMap<u64, bool>,
-) {
+) -> Result<()> {
@@
-    if let Err(e) = print_report(&printable_results, opts, ignore_hashes) {
-        eprintln!("ERROR: Failed to print report: {}", e);
-        process::exit(1);
-    }
-}
+    print_report(&printable_results, opts, ignore_hashes)?;
+    Ok(())
+}
```

In src/main.rs:

```diff
-    process_report(results_hash, &opts, &ignore_hash);
+    if let Err(e) = process_report(results_hash, &opts, &ignore_hash) {
+        eprintln!("ERROR: Failed to print report: {e}");
+        std::process::exit(1);
+    }
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let results_hash = match process_files(&opts) { | |
| Ok(results) => results, | |
| Err(e) => { | |
| eprintln!("ERROR: Failed to process files: {}", e); | |
| std::process::exit(1); | |
| } | |
| }; | |
| process_report(results_hash, &opts, &ignore_hash); | |
| } | |
| let results_hash = match process_files(&opts) { | |
| Ok(results) => results, | |
| Err(e) => { | |
| eprintln!("ERROR: Failed to process files: {}", e); | |
| std::process::exit(1); | |
| } | |
| }; | |
| if let Err(e) = process_report(results_hash, &opts, &ignore_hash) { | |
| eprintln!("ERROR: Failed to print report: {e}"); | |
| std::process::exit(1); | |
| } | |
| } |
🤖 Prompt for AI Agents

In src/main.rs around lines 83 to 91, main currently calls process_report(results_hash, &opts, &ignore_hash) while process_report may call process::exit(1) inside the library; change the library (src/lib.rs) so process_report returns a Result<(), Error> instead of exiting, then update main to match on that Result and, on Err(e), print an error (eprintln!("ERROR: Failed to generate report: {}", e)) and call std::process::exit(1); ensure process_report's signature is updated, propagate errors from any internal operations using ? or map_err, and remove any direct process::exit calls from the library so the binary handles termination.
```rust
    // Test completely different sequences
    let file_hashes = vec![vec![1, 2, 3, 4, 5], vec![6, 7, 8, 9, 10]];

    let left = LineId {
        file_id: 0,
        line_number: 0,
    };
    let right = LineId {
        file_id: 1,
        line_number: 0,
    };

    let collision = maximize_collision(&file_hashes, &left, &right, 2);
    // Since the first elements are different (1 != 6), there should be no collision
    // but maximize_collision might still return a collision with 0 lines
    if let Some(collision) = collision {
        assert_eq!(collision.num_lines, 0);
    }
}
```
Strengthen expectation: no collision when first elements differ

maximize_collision should return None when the rolling-hash digest was a false positive. This avoids 0-line “collisions”.

```diff
     let collision = maximize_collision(&file_hashes, &left, &right, 2);
-    // Since the first elements are different (1 != 6), there should be no collision
-    // but maximize_collision might still return a collision with 0 lines
-    if let Some(collision) = collision {
-        assert_eq!(collision.num_lines, 0);
-    }
+    assert!(collision.is_none());
```
+ assert!(collision.is_none());📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Test completely different sequences | |
| let file_hashes = vec![vec![1, 2, 3, 4, 5], vec![6, 7, 8, 9, 10]]; | |
| let left = LineId { | |
| file_id: 0, | |
| line_number: 0, | |
| }; | |
| let right = LineId { | |
| file_id: 1, | |
| line_number: 0, | |
| }; | |
| let collision = maximize_collision(&file_hashes, &left, &right, 2); | |
| // Since the first elements are different (1 != 6), there should be no collision | |
| // but maximize_collision might still return a collision with 0 lines | |
| if let Some(collision) = collision { | |
| assert_eq!(collision.num_lines, 0); | |
| } | |
| } | |
| // Test completely different sequences | |
| let file_hashes = vec![vec![1, 2, 3, 4, 5], vec![6, 7, 8, 9, 10]]; | |
| let left = LineId { | |
| file_id: 0, | |
| line_number: 0, | |
| }; | |
| let right = LineId { | |
| file_id: 1, | |
| line_number: 0, | |
| }; | |
| let collision = maximize_collision(&file_hashes, &left, &right, 2); | |
| assert!(collision.is_none()); | |
| } |
🤖 Prompt for AI Agents

In tests/integration_tests.rs around lines 309 to 327, the test currently allows maximize_collision to return Some with num_lines == 0 when the first elements differ; instead assert that no collision is returned. Change the assertion to require collision.is_none() (i.e., fail the test if maximize_collision returns Some), ensuring the test expects None rather than a zero-length collision.
Misc. improvements using Claude