
Conversation

@tasleson
Owner

@tasleson tasleson commented Sep 17, 2025

Misc. improvements using Claude

Summary by CodeRabbit

  • New Features
    • Added library-backed duplicate detection with CLI support: globbed file input, configurable line thresholds, ignore-list file, text/JSON output, and multithreaded processing.
  • Refactor
    • CLI now delegates to the library with centralized error handling and safer thread pool setup.
  • Documentation
    • Updated README usage version to 0.9.3.
  • Tests
    • Introduced comprehensive integration tests across hashing, rolling windows, file mapping, and collision detection.
  • Chores
    • Upgraded Rust edition to 2021 and added a development dependency for testing.

tasleson and others added 8 commits September 15, 2025 10:50
- Replace panic in print_dup_text with graceful error handling when files become inaccessible
- Fix unwrap in ignore file parsing to handle I/O errors without crashing
- Replace unwrap calls in file path handling with to_string_lossy to better handle non-UTF-8 paths
- Improve application stability by allowing partial failures instead of complete crashes
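The path-handling change follows a standard pattern; a minimal sketch (illustrative, not the exact code from this commit):

use std::path::Path;

// Before: path.to_str().unwrap() panics on non-UTF-8 paths.
// After: to_string_lossy() substitutes U+FFFD for invalid bytes and never panics.
fn display_name(path: &Path) -> String {
    path.to_string_lossy().into_owned()
}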

Signed-off-by: Tony Asleson <tony.asleson@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Replace all unwrap() calls in src/main.rs with appropriate error handling:
- Mutex lock failures now print descriptive error messages and return/exit gracefully
- JSON serialization errors are handled with proper error messages
- Thread pool builder uses expect() with descriptive message
- Maintains thread safety while providing better user experience
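The first two bullets reduce to patterns like the following sketch; the function names here are hypothetical, and serde/serde_json is assumed for the JSON path:

use std::sync::Mutex;

// Lock handling: report a poisoned mutex instead of letting unwrap() panic.
fn locked_push(values: &Mutex<Vec<u64>>, v: u64) {
    match values.lock() {
        Ok(mut guard) => guard.push(v),
        Err(e) => eprintln!("ERROR: mutex poisoned: {}", e),
    }
}

// JSON handling: surface serialization failures as messages, not panics.
fn print_json<T: serde::Serialize>(value: &T) {
    match serde_json::to_string_pretty(value) {
        Ok(s) => println!("{}", s),
        Err(e) => eprintln!("ERROR: JSON serialization failed: {}", e),
    }
}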

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Sync version number in README.md from 0.9.0 to 0.9.3 to match current version in Cargo.toml.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implement extensive test suite covering duplicate detection algorithms:

Core functionality tests:
- Hash calculation and consistency
- File signature generation with real files
- Rolling hash windows and pattern detection
- FileId management and memory optimization

Advanced sequence matching tests:
- Exact sequence alignment between files
- Partial matches with diverging content
- Mixed match/no-match patterns in sequences
- Complex repeating subsequence detection

Edge case coverage:
- Single-line files and boundary conditions
- End-of-file matching scenarios
- Same-file overlap prevention
- Non-matching sequence handling

Integration scenarios:
- Multi-file collision detection
- Real-world code patterns (functions, braces)
- Whitespace trimming validation
- Cross-file duplicate positioning

Added tempfile dev-dependency for realistic file testing.
All 23 tests pass with comprehensive coverage of duplihere's core algorithms.
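A sketch of the test shape (hedged: assumes file_signatures takes a path &str and yields one signature per line, as the API listing later in this thread suggests):

use std::io::Write;
use tempfile::NamedTempFile;

#[test]
fn file_signatures_one_per_line() {
    let mut f = NamedTempFile::new().expect("create temp file");
    writeln!(f, "fn main() {{").unwrap();
    writeln!(f, "    println!(\"hi\");").unwrap();
    writeln!(f, "}}").unwrap();

    let sigs = duplihere::file_signatures(f.path().to_str().unwrap()).expect("signatures");
    assert_eq!(sigs.len(), 3); // one signature per line
}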

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implement three sophisticated tests simulating real-world duplicate detection scenarios:

1. Large file with repeating 12-line sections separated by random lines:
   - Simulates 64-line file with 5 repetitions of JavaScript function pattern
   - Tests pattern detection across random separator comments
   - Validates multiple window sizes (3, 6, 12 lines) and position accuracy
   - Confirms sub-pattern matching and scaling behavior

2. Collision detection with interrupted patterns:
   - Tests core collision detection system with 6-line patterns
   - Simulates full pipeline from rolling hashes to collision frequency analysis
   - Validates detection of patterns repeated 3+ times with separators
   - Confirms multi-occurrence detection and window size impact

3. Real-world firmware hex pattern testing:
   - Simulates firmware blob scenarios with 4-line hex patterns
   - Tests 10 repetitions with intentional variations and separators
   - Validates statistical pattern frequency analysis
   - Confirms robustness with realistic data structures

These tests cover complex scenarios like configuration files, code templates,
log patterns, and firmware data - validating duplihere's ability to detect
duplicates across interrupted sequences and random separators.

All 26 tests pass, including comprehensive real-world pattern validation.
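Fixtures like these can be generated with a small helper; this one is hypothetical and only mirrors the structure described above:

// Build `reps` copies of `block`, each followed by a unique separator line,
// mimicking the interrupted-pattern scenarios (code blocks, hex runs) above.
fn build_repeating_content(block: &[&str], reps: usize) -> String {
    let mut out = String::new();
    for i in 0..reps {
        for line in block {
            out.push_str(line);
            out.push('\n');
        }
        out.push_str(&format!("// separator {}\n", i));
    }
    out
}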

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Split duplihere into a clean library/binary architecture for better maintainability:

Library structure (src/lib.rs):
- Extract all core duplicate detection functionality to library
- Make functions public for external testing and reuse
- Implement proper error handling throughout
- Organize code into logical modules and sections

Binary structure (src/main.rs):
- Reduce to focused CLI interface (82 lines vs 1,580+ lines)
- Use library functions for all core functionality
- Clean separation of concerns between CLI and algorithms

Test organization (tests/integration_tests.rs):
- Move all 26 comprehensive unit tests to external test file
- Enable proper integration testing of library functions
- Maintain full test coverage of core algorithms
- Follow standard Rust testing conventions

Benefits:
- Better code maintainability and organization
- Improved testability with external test access
- Standard Rust project structure
- Cleaner separation between library and CLI
- Enables library reuse by other projects

All functionality preserved with significantly improved code organization.
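Condensed from the review excerpts later in this thread, the thin CLI ends up roughly like this (the run wrapper and message wording are approximations):

fn run(opts: &duplihere::Options, ignore_file: &str) {
    // Exit code 2 for ignore-file problems, 1 for processing failures.
    let ignore_hash = match duplihere::get_ignore_hashes(ignore_file) {
        Ok(h) => h,
        Err(e) => {
            eprintln!("ERROR: Unable to read ignore file: {}", e);
            std::process::exit(2);
        }
    };
    let results_hash = match duplihere::process_files(opts) {
        Ok(results) => results,
        Err(e) => {
            eprintln!("ERROR: Failed to process files: {}", e);
            std::process::exit(1);
        }
    };
    duplihere::process_report(results_hash, opts, &ignore_hash);
}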

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Modernize codebase to use Rust Edition 2021 for improved language features:

Cargo.toml:
- Update edition from "2018" to "2021"

Library modernization (src/lib.rs):
- Remove unnecessary extern crate declarations
- Replace macro_use with explicit lazy_static import
- Use direct module imports following Edition 2021 conventions
- Cleaner import structure and better code organization

Binary modernization (src/main.rs):
- Remove extern crate alias for rags_rs
- Use direct module imports instead of aliasing
- Update function signature to use rags_rs::Error directly

Benefits:
- Access to latest Rust language features and improvements
- Cleaner, more modern Rust syntax
- Better alignment with current Rust conventions
- Simplified import statements and dependency management

All functionality preserved with full test coverage maintained.
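The lazy_static migration mentioned above, side by side (the static itself is a hypothetical example; the library's actual globals differ):

// Edition 2018:
//   #[macro_use]
//   extern crate lazy_static;

// Edition 2021: no extern crate, just an explicit macro import.
use lazy_static::lazy_static;
use std::sync::Mutex;

lazy_static! {
    static ref LINE_CACHE: Mutex<Vec<u64>> = Mutex::new(Vec::new());
}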

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Replace all unwrap() calls and most process::exit() calls with proper
Result-based error handling throughout the codebase.

Key changes:
- Add custom DupliError enum with variants for different error types
- Implement From traits for automatic error conversion
- Convert core functions to return Result types:
  * file_signatures() -> Result<Vec<u64>>
  * process_file() -> Result<()>
  * print_report() -> Result<()>
  * get_ignore_hashes() -> Result<HashMap<u64, bool>>
  * files_to_process() -> Result<Vec<(u32, Arc<String>)>>
  * process_files() -> Result<DashMap<u64, Collision>>
- Update main.rs to handle Result types with proper error messages
- Modify all tests to work with new Result-based APIs
- Maintain backwards compatibility for Serialize trait implementation

Benefits:
- Better error context and composable error handling
- Elimination of unwrap() calls and most panics
- More testable and idiomatic Rust error handling
- Graceful error recovery where possible
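A hedged sketch of what such an error type can look like; the variant set here is invented for illustration, and the real DupliError in src/lib.rs may differ:

use std::fmt;

#[derive(Debug)]
pub enum DupliError {
    Io(std::io::Error),
    Glob(String),
    Json(String),
}

impl fmt::Display for DupliError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DupliError::Io(e) => write!(f, "I/O error: {}", e),
            DupliError::Glob(m) => write!(f, "glob error: {}", m),
            DupliError::Json(m) => write!(f, "JSON error: {}", m),
        }
    }
}

impl std::error::Error for DupliError {}

// A From impl is what lets `?` convert errors automatically inside the library.
impl From<std::io::Error> for DupliError {
    fn from(e: std::io::Error) -> Self {
        DupliError::Io(e)
    }
}

pub type Result<T> = std::result::Result<T, DupliError>;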

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Sep 17, 2025

Walkthrough

Introduces a new reusable library (src/lib.rs) for duplicate-text detection, refactors main to use the library, updates Cargo.toml to edition 2021 and adds a dev-dependency, adjusts README version text, and adds comprehensive integration tests validating hashing, rolling windows, collision detection, file mapping, and reporting paths.

Changes

Cohort / File(s) — Summary

  • Build & metadata — Cargo.toml
    Update Rust edition to 2021; add [dev-dependencies] with tempfile = "3.0". No changes to the main dependencies' APIs.
  • Documentation — README.md
    Update the usage snippet version string from 0.9.0 to 0.9.3.
  • Library: duplicate detection core — src/lib.rs
    Add a new library implementing end-to-end duplicate detection: file discovery, per-line signatures, rolling hashes, collision detection/merging, reporting (text/JSON), ignore list, threading. Exposes Options, Collision, FileId, LineId, the error type DupliError, helpers, and process_files/process_report.
  • CLI refactor to library — src/main.rs
    Replace the inline implementation with calls to the new library. Switch the error type to rags_rs::Error. Configure the Rayon thread pool, load ignore hashes via the library, call process_files and process_report. Remove the local Options.
  • Tests — tests/integration_tests.rs
    Add integration tests covering hashing, rolling windows, file signatures, FileId mapping, collision logic (overlap, maximize_collision, Collision::scrub), edge cases, and multi-file scenarios using temp files.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as main.rs (CLI)
  participant Lib as duplihere::process_files
  participant FS as Filesystem
  participant Hash as Signatures & Rolling Hashes
  participant Col as Collision Detector
  participant Rep as Reporting

  User->>CLI: Run with args (globs, lines, print/json, ignore, threads)
  CLI->>Lib: process_files(Options)
  Lib->>FS: files_to_process(globs)
  FS-->>Lib: [(file_id, name)] 
  par parallel per file
    Lib->>Hash: file_signatures(file)
    Hash-->>Lib: Vec<u64>
  end
  Lib->>Hash: rolling_hashes(signatures, min_lines)
  Lib->>Col: find_collisions(...)
  Col-->>Lib: DashMap<key, Collision>
  Lib-->>CLI: results

  CLI->>Rep: process_report(results, Options, ignore_hashes)
  alt json
    Rep->>CLI: print JSON
  else text
    Rep->>CLI: print text
  end
  CLI-->>User: Exit 0 or error

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

In burrows of code, I hop and compare,
Lines in a row, duplicates laid bare.
A library blooms where main once sprawled,
Hashes roll on, collisions called.
Tests like carrots, crisp and bright—
0.9.3, we launch tonight! 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
  • Title Check — ❓ Inconclusive. The PR title "Claude AI" is vague and does not describe the primary changes: the diff introduces a new duplihere library for multi-file duplicate-text detection, refactors main to use it, updates Cargo.toml and the README, and adds comprehensive integration tests. A title that names a tooling aid rather than the functional change gives reviewers no sense of the PR's intent or scope and hinders history scanning and review triage. Resolution: use a concise, change-focused summary such as "Add duplihere library for multi-file duplicate detection and refactor main" or "Introduce duplihere crate and integration tests; refactor CLI to use library".
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, above the required 80.00% threshold.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (8)
Cargo.toml (1)

19-20: Loosen dev-dep pin to receive compatible fixes automatically

Use the shorter requirement so CI picks up non-breaking fixes (security/bug) in 3.x without manual bumps. (Note: "3.0" is already a caret requirement equivalent to ^3.0, so this change is largely cosmetic.)

-[dev-dependencies]
-tempfile = "3.0"
+[dev-dependencies]
+tempfile = "3"
README.md (1)

30-31: Typo: “indention” → “indentation”

Minor doc polish.

-The duplicated text can be at different levels of indention,
+The duplicated text can be at different levels of indentation,
tests/integration_tests.rs (3)

521-526: Align with stricter collision semantics

Same rationale: prefer None over Some(num_lines=0).

-    let collision3 = maximize_collision(&file_hashes, &left3, &right3, 2);
-    // Should not match [4,5] vs [99,98] - might return Some with 0 lines
-    if let Some(collision3) = collision3 {
-        assert_eq!(collision3.num_lines, 0);
-    }
+    let collision3 = maximize_collision(&file_hashes, &left3, &right3, 2);
+    assert!(collision3.is_none());

557-560: Avoid accepting 0-line “matches”

Tighten to None to reflect true non-match.

-    let collision2 = maximize_collision(&file_hashes, &left, &right2, min_lines);
-    // maximize_collision might return Some with 0 lines instead of None
-    if let Some(collision2) = collision2 {
-        assert_eq!(collision2.num_lines, 0);
-    }
+    let collision2 = maximize_collision(&file_hashes, &left, &right2, min_lines);
+    assert!(collision2.is_none());

398-405: Comment nit: braces

The comment says “both '}'” but the file content writes “}}”. Tests are correct; tweak the comment to reduce confusion.

-// Lines 1 and 3 should have the same signature (both "}")
+// Lines 3 and 7 should have the same signature (both "}}")
src/lib.rs (3)

631-668: Global FILE_LOOKUP locking scope

You hold the FILE_LOOKUP mutex across I/O (glob + canonicalize). Not an issue here (single-threaded phase), but annotate this invariant or narrow the lock if you ever parallelize discovery.
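One way to narrow the lock if discovery ever goes parallel; illustrative only, assuming FILE_LOOKUP is a lazy_static Mutex around the FileId registry:

fn register(path: &std::path::Path) -> std::io::Result<Option<u32>> {
    // Do the filesystem work first, outside the lock...
    let canonical = std::fs::canonicalize(path)?;
    let name = std::sync::Arc::new(canonical.to_string_lossy().into_owned());

    // ...then hold FILE_LOOKUP only for the in-memory registration.
    let mut lookup = FILE_LOOKUP.lock().expect("FILE_LOOKUP poisoned");
    Ok(lookup.register_file(name))
}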


703-710: Clarify Arc → &str conversion

The implicit deref from &Arc to &str works, but using as_str() is clearer.

-        if let Err(err) = process_file(
-            e.0,
-            &e.1,
+        if let Err(err) = process_file(
+            e.0,
+            e.1.as_str(),
             opts.lines as usize,
             &file_hashes,
             &collision_hashes,
         ) {

372-442: print_report: prefer HashSet for ignore lookups

Use HashSet for ignore lookups; drop bool payload.

-pub fn print_report(
-    printable_results: &[Collision],
-    opts: &Options,
-    ignore_hashes: &HashMap<u64, bool>,
-) -> Result<()> {
+pub fn print_report(
+    printable_results: &[Collision],
+    opts: &Options,
+    ignore_hashes: &std::collections::HashSet<u64>,
+) -> Result<()> {
@@
-        if ignore_hashes.contains_key(&p.key) {
+        if ignore_hashes.contains(&p.key) {

And in get_ignore_hashes:

-pub fn get_ignore_hashes(file_name: &str) -> Result<HashMap<u64, bool>> {
-    let mut ignores: HashMap<u64, bool> = HashMap::new();
+pub fn get_ignore_hashes(file_name: &str) -> Result<std::collections::HashSet<u64>> {
+    let mut ignores: std::collections::HashSet<u64> = std::collections::HashSet::new();
@@
-                    ignores.insert(hv, true);
+                    ignores.insert(hv);

Propagate type changes to callers (main.rs, process_report).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5586b6d and 433a968.

📒 Files selected for processing (5)
  • Cargo.toml (2 hunks)
  • README.md (1 hunks)
  • src/lib.rs (1 hunks)
  • src/main.rs (3 hunks)
  • tests/integration_tests.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/integration_tests.rs (1)
src/lib.rs (8)
  • calculate_hash (89-93)
  • rolling_hashes (121-140)
  • new (592-598)
  • id_to_name (621-623)
  • number_files (626-628)
  • file_signatures (97-115)
  • overlap (290-297)
  • maximize_collision (301-348)
src/main.rs (1)
src/lib.rs (3)
  • new (592-598)
  • get_ignore_hashes (543-565)
  • process_files (696-720)
🔇 Additional comments (9)
Cargo.toml (2)

5-5: Edition 2021: good upgrade

Edition bump aligns with current tooling and lets you use 2021 idioms without friction.


9-18: Confirm these versions are deliberate: rayon 1.3.0 and dashmap 5.5.0 (Cargo.toml lines 9–18)
rayon 1.3.0 and dashmap 5.5.0 are dated; if they are not intentionally pinned, bump them within semver to pick up fixes and optimizations for the parallel-sort/raw-shard usage.
Verification here failed (no configured Rust toolchain). Run locally and paste the outputs:

rustup default stable   # or `rustup default nightly` if you need -Z flags
cargo update -Z minimal-versions
cargo tree -i rayon
cargo tree -i dashmap

README.md (1)

24-24: Version string update: LGTM

Matches Cargo version 0.9.3.

src/main.rs (2)

16-58: Return-based main is fine

Result-returning main with rags_rs errors is clean and consistent.


73-81: Ignore-file load: good UX with specific exit code

Clear message and exit code 2. Consider bubbling errors instead if this binary ever becomes a library entry-point.

tests/integration_tests.rs (1)

257-281: OK: exact-match maximize_collision happy path

Solid coverage for multi-file exact sequence growth.

src/lib.rs (3)

121-140: Consecutive-duplicate suppression: good

rolling_hashes avoids back-to-back duplicates; windowing logic is sound.
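For readers without the source open, the idea reduces to this illustrative reimplementation (not the library's exact code):

use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Hash each window of `min_lines` consecutive line signatures; drop a window
// whose digest equals the immediately preceding one, so runs of identical
// lines don't flood the collision map with back-to-back duplicates.
fn rolling_window_hashes(line_sigs: &[u64], min_lines: usize) -> Vec<u64> {
    let mut out = Vec::new();
    if min_lines == 0 {
        return out;
    }
    let mut prev = None;
    for w in line_sigs.windows(min_lines) {
        let mut h = DefaultHasher::new();
        for sig in w {
            h.write_u64(*sig);
        }
        let digest = h.finish();
        if prev != Some(digest) {
            out.push(digest);
        }
        prev = Some(digest);
    }
    out
}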


350-369: print_dup_text: boundary handling looks correct

Reads and prints inclusive range as intended; no changes needed.


399-407: Off-by-one: end line in human-readable output

End should be inclusive (start + num_lines - 1). Current code prints one past the last line.

-                    let start_line = spec_file.line_number;
-                    let end_line = start_line + p.num_lines;
+                    let start_line = spec_file.line_number;
+                    let end_line = start_line + p.num_lines.saturating_sub(1);
                     println!(
                         "Between lines {} and {} in {}",
                         start_line + 1,
                         end_line,
                         filename
                     );

Likely an incorrect or invalid review comment.

Comment on lines +336 to +348
    // If after walking we overlap skip too
    if overlap(l_info, r_info, offset) {
        return None;
    }

    let files: Vec<LineId> = vec![*l_info, *r_info];
    Some(Collision {
        key: s.finish(),
        num_lines: offset,
        start_lines: files,
        sig: 0,
    })
}

⚠️ Potential issue

Return None for sub-threshold matches to prevent 0-line “collisions”

When the rolling-hash collides but actual line hashes differ immediately, offset stays 0. Returning Some with num_lines=0 pollutes results and downstream reporting.

     // If after walking we overlap skip too
-    if overlap(l_info, r_info, offset) {
+    if offset < min_lines {
+        return None;
+    }
+    if overlap(l_info, r_info, offset) {
         return None;
     }

Update related tests as suggested in tests/integration_tests.rs.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
(original)
    // If after walking we overlap skip too
    if overlap(l_info, r_info, offset) {
        return None;
    }

    let files: Vec<LineId> = vec![*l_info, *r_info];
    Some(Collision {
        key: s.finish(),
        num_lines: offset,
        start_lines: files,
        sig: 0,
    })
}

(suggested)
    // If after walking we overlap skip too
    if offset < min_lines {
        return None;
    }
    if overlap(l_info, r_info, offset) {
        return None;
    }

    let files: Vec<LineId> = vec![*l_info, *r_info];
    Some(Collision {
        key: s.finish(),
        num_lines: offset,
        start_lines: files,
        sig: 0,
    })
}
🤖 Prompt for AI Agents
In src/lib.rs around lines 336 to 348, the function currently returns
Some(Collision) even when offset stays below the threshold (including 0),
producing 0-line collisions; change the logic to return None when offset <
min_lines so sub-threshold (immediate-mismatch) matches are ignored, and update
the related tests in tests/integration_tests.rs to expect no collision in that
case (adjust assertions/expected outputs accordingly).

Comment on lines +500 to +539
pub fn process_report(
    results_hash: DashMap<u64, Collision>,
    opts: &Options,
    ignore_hashes: &HashMap<u64, bool>,
) {
    let mut final_report: Vec<Collision> = results_hash.into_iter().map(|(_, v)| v).collect();
    final_report.par_sort_unstable_by(|a, b| a.num_lines.cmp(&b.num_lines).reverse());

    let mut printable_results: Vec<Collision> = Vec::new();

    {
        let mut chunk_processed: HashMap<u64, bool> = HashMap::new();

        final_report.par_iter_mut().for_each(|ea| ea.scrub());

        for ea in final_report {
            let cs = ea.signature();
            if let std::collections::hash_map::Entry::Vacant(e) = chunk_processed.entry(cs) {
                e.insert(true);
                printable_results.push(ea);
            }
        }
    }

    printable_results.par_sort_unstable_by(|a, b| {
        a.num_lines
            .cmp(&b.num_lines)
            .then_with(|| {
                a.start_lines[0]
                    .line_number
                    .cmp(&b.start_lines[0].line_number)
            })
            .then_with(|| a.start_lines[0].file_id.cmp(&b.start_lines[0].file_id))
    });

    if let Err(e) = print_report(&printable_results, opts, ignore_hashes) {
        eprintln!("ERROR: Failed to print report: {}", e);
        process::exit(1);
    }
}

🛠️ Refactor suggestion

Library should not exit the process

process_report calls process::exit(1) on error, which is surprising for a library. Return Result and let the binary decide exit codes. See main.rs comment for coordinated change.

-pub fn process_report(
+pub fn process_report(
     results_hash: DashMap<u64, Collision>,
     opts: &Options,
     ignore_hashes: &HashMap<u64, bool>,
-) {
+) -> Result<()> {
@@
-    if let Err(e) = print_report(&printable_results, opts, ignore_hashes) {
-        eprintln!("ERROR: Failed to print report: {}", e);
-        process::exit(1);
-    }
-}
+    print_report(&printable_results, opts, ignore_hashes)?;
+    Ok(())
+}
📝 Committable suggestion


Suggested change
(original)
pub fn process_report(
    results_hash: DashMap<u64, Collision>,
    opts: &Options,
    ignore_hashes: &HashMap<u64, bool>,
) {
    let mut final_report: Vec<Collision> = results_hash.into_iter().map(|(_, v)| v).collect();
    final_report.par_sort_unstable_by(|a, b| a.num_lines.cmp(&b.num_lines).reverse());

    let mut printable_results: Vec<Collision> = Vec::new();

    {
        let mut chunk_processed: HashMap<u64, bool> = HashMap::new();

        final_report.par_iter_mut().for_each(|ea| ea.scrub());

        for ea in final_report {
            let cs = ea.signature();
            if let std::collections::hash_map::Entry::Vacant(e) = chunk_processed.entry(cs) {
                e.insert(true);
                printable_results.push(ea);
            }
        }
    }

    printable_results.par_sort_unstable_by(|a, b| {
        a.num_lines
            .cmp(&b.num_lines)
            .then_with(|| {
                a.start_lines[0]
                    .line_number
                    .cmp(&b.start_lines[0].line_number)
            })
            .then_with(|| a.start_lines[0].file_id.cmp(&b.start_lines[0].file_id))
    });

    if let Err(e) = print_report(&printable_results, opts, ignore_hashes) {
        eprintln!("ERROR: Failed to print report: {}", e);
        process::exit(1);
    }
}

(suggested)
pub fn process_report(
    results_hash: DashMap<u64, Collision>,
    opts: &Options,
    ignore_hashes: &HashMap<u64, bool>,
) -> Result<()> {
    let mut final_report: Vec<Collision> = results_hash.into_iter().map(|(_, v)| v).collect();
    final_report.par_sort_unstable_by(|a, b| a.num_lines.cmp(&b.num_lines).reverse());

    let mut printable_results: Vec<Collision> = Vec::new();

    {
        let mut chunk_processed: HashMap<u64, bool> = HashMap::new();

        final_report.par_iter_mut().for_each(|ea| ea.scrub());

        for ea in final_report {
            let cs = ea.signature();
            if let std::collections::hash_map::Entry::Vacant(e) = chunk_processed.entry(cs) {
                e.insert(true);
                printable_results.push(ea);
            }
        }
    }

    printable_results.par_sort_unstable_by(|a, b| {
        a.num_lines
            .cmp(&b.num_lines)
            .then_with(|| {
                a.start_lines[0]
                    .line_number
                    .cmp(&b.start_lines[0].line_number)
            })
            .then_with(|| a.start_lines[0].file_id.cmp(&b.start_lines[0].file_id))
    });

    print_report(&printable_results, opts, ignore_hashes)?;
    Ok(())
}
🤖 Prompt for AI Agents
In src/lib.rs around lines 500 to 539, process_report currently calls
process::exit(1) when print_report fails which is inappropriate for a library;
change the function to return a Result<(), E> (choose the crate convention e.g.,
anyhow::Result<()> or Result<(), Box<dyn std::error::Error + Send + Sync>>)
instead of returning unit, remove the process::exit call, and propagate the
print_report error back to the caller (use the ? operator or map_err to convert
error types as needed), ensuring you return Ok(()) on success; update callers
(e.g., the binary in main.rs) to handle the Result and decide exit codes.

Comment on lines +602 to +618
pub fn register_file(&mut self, file_name: Arc<String>) -> Option<u32> {
    if self.name_to_index.contains_key(&file_name) {
        return None;
    }
    let num = self.num_files;

    self.index_to_name.push(Arc::clone(&file_name));
    self.name_to_index
        .insert(Arc::clone(&file_name), self.num_files);
    if let Some(v) = self.num_files.checked_add(1) {
        self.num_files = v;
    } else {
        eprintln!("Number of files processed exceeds {}", u32::MAX);
        process::exit(2);
    }
    Some(num)
}

🛠️ Refactor suggestion

Don’t exit from a library on counter overflow

Exiting the process from FileId::register_file is hostile to embedders. Return an error and let callers decide.

-    pub fn register_file(&mut self, file_name: Arc<String>) -> Option<u32> {
+    pub fn register_file(&mut self, file_name: Arc<String>) -> Option<u32> {
@@
-        if let Some(v) = self.num_files.checked_add(1) {
-            self.num_files = v;
-        } else {
-            eprintln!("Number of files processed exceeds {}", u32::MAX);
-            process::exit(2);
-        }
+        self.num_files = self.num_files.checked_add(1).unwrap_or_else(|| {
+            eprintln!("Number of files processed exceeds {}", u32::MAX);
+            // Prefer: change API to Result and propagate instead of exiting.
+            panic!("FileId overflow");
+        });

If you’re open to a breaking change, switch return type to Result<u32, DupliError> and propagate.

📝 Committable suggestion


Suggested change
(original)
pub fn register_file(&mut self, file_name: Arc<String>) -> Option<u32> {
    if self.name_to_index.contains_key(&file_name) {
        return None;
    }
    let num = self.num_files;

    self.index_to_name.push(Arc::clone(&file_name));
    self.name_to_index
        .insert(Arc::clone(&file_name), self.num_files);
    if let Some(v) = self.num_files.checked_add(1) {
        self.num_files = v;
    } else {
        eprintln!("Number of files processed exceeds {}", u32::MAX);
        process::exit(2);
    }
    Some(num)
}

(suggested)
pub fn register_file(&mut self, file_name: Arc<String>) -> Option<u32> {
    if self.name_to_index.contains_key(&file_name) {
        return None;
    }
    let num = self.num_files;

    self.index_to_name.push(Arc::clone(&file_name));
    self.name_to_index
        .insert(Arc::clone(&file_name), self.num_files);
    self.num_files = self.num_files.checked_add(1).unwrap_or_else(|| {
        eprintln!("Number of files processed exceeds {}", u32::MAX);
        // Prefer: change API to Result and propagate instead of exiting.
        panic!("FileId overflow");
    });
    Some(num)
}
🤖 Prompt for AI Agents
In src/lib.rs around lines 602 to 618, the function currently calls
process::exit on num_files overflow which aborts the entire process; change it
to return an error instead of exiting: update the signature to return
Result<u32, DupliError> (or, if you must avoid a breaking change, return
Option<u32> and treat overflow as None), add a new DupliError variant for
"FileIdOverflow" (or reuse an existing error type), replace the process::exit
branch with returning Err(DupliError::FileIdOverflow) (or None), and update all
callers to propagate or handle the Result appropriately (use ? where possible)
so embedders can decide how to handle overflow.

Comment on lines 66 to 71
 if opts.threads != 0 {
-    rayon::ThreadPoolBuilder::new()
+    ThreadPoolBuilder::new()
         .num_threads(opts.threads)
         .build_global()
-        .unwrap();
+        .expect("Failed to build global thread pool");
 }

⚠️ Potential issue

Don’t panic if the global Rayon pool is already initialized

build_global() will Err if any Rayon use happened earlier; expect(...) will abort the program. Handle gracefully and continue with the existing pool.

-        if opts.threads != 0 {
-            ThreadPoolBuilder::new()
-                .num_threads(opts.threads)
-                .build_global()
-                .expect("Failed to build global thread pool");
-        }
+        if opts.threads != 0 {
+            if let Err(e) = ThreadPoolBuilder::new()
+                .num_threads(opts.threads)
+                .build_global()
+            {
+                eprintln!("WARNING: Unable to set Rayon global thread pool (already initialized?): {e}");
+            }
+        }
📝 Committable suggestion


Suggested change
(original)
if opts.threads != 0 {
    ThreadPoolBuilder::new()
        .num_threads(opts.threads)
        .build_global()
        .expect("Failed to build global thread pool");
}

(suggested)
if opts.threads != 0 {
    if let Err(e) = ThreadPoolBuilder::new()
        .num_threads(opts.threads)
        .build_global()
    {
        eprintln!("WARNING: Unable to set Rayon global thread pool (already initialized?): {e}");
    }
}
🤖 Prompt for AI Agents
In src/main.rs around lines 66 to 71, the call to
ThreadPoolBuilder::new().num_threads(...).build_global().expect(...) panics if
the global Rayon pool is already initialized; replace the expect with error
handling that treats an Err as non-fatal. Call build_global(), match or if-let
on the Result, and on Ok do nothing, but on Err log a debug/warn that the global
pool was already initialized (or otherwise failed to install) and continue
without aborting; ensure the program does not panic when Rayon reports the pool
is already set up.

Comment on lines +83 to 91
    let results_hash = match process_files(&opts) {
        Ok(results) => results,
        Err(e) => {
            eprintln!("ERROR: Failed to process files: {}", e);
            std::process::exit(1);
        }
    };
    process_report(results_hash, &opts, &ignore_hash);
}

🛠️ Refactor suggestion

Propagate reporting errors instead of exiting inside the library

process_report currently calls process::exit(1) on failures. Prefer returning Result from the library and handling exit here.

Apply these diffs (lib + main):

In src/lib.rs:

-pub fn process_report(
+pub fn process_report(
     results_hash: DashMap<u64, Collision>,
     opts: &Options,
     ignore_hashes: &HashMap<u64, bool>,
-) {
+) -> Result<()> {
@@
-    if let Err(e) = print_report(&printable_results, opts, ignore_hashes) {
-        eprintln!("ERROR: Failed to print report: {}", e);
-        process::exit(1);
-    }
-}
+    print_report(&printable_results, opts, ignore_hashes)?;
+    Ok(())
+}

In src/main.rs:

-        process_report(results_hash, &opts, &ignore_hash);
+        if let Err(e) = process_report(results_hash, &opts, &ignore_hash) {
+            eprintln!("ERROR: Failed to print report: {e}");
+            std::process::exit(1);
+        }
📝 Committable suggestion


Suggested change
(original)
    let results_hash = match process_files(&opts) {
        Ok(results) => results,
        Err(e) => {
            eprintln!("ERROR: Failed to process files: {}", e);
            std::process::exit(1);
        }
    };
    process_report(results_hash, &opts, &ignore_hash);
}

(suggested)
    let results_hash = match process_files(&opts) {
        Ok(results) => results,
        Err(e) => {
            eprintln!("ERROR: Failed to process files: {}", e);
            std::process::exit(1);
        }
    };
    if let Err(e) = process_report(results_hash, &opts, &ignore_hash) {
        eprintln!("ERROR: Failed to print report: {e}");
        std::process::exit(1);
    }
}
🤖 Prompt for AI Agents
In src/main.rs around lines 83 to 91, main currently calls
process_report(results_hash, &opts, &ignore_hash) while process_report may call
process::exit(1) inside the library; change the library (src/lib.rs) so
process_report returns a Result<(), Error> instead of exiting, then update main
to match on that Result and, on Err(e), print an error (eprintln!("ERROR: Failed
to generate report: {}", e)) and call std::process::exit(1); ensure
process_report's signature is updated, propagate errors from any internal
operations using ? or map_err, and remove any direct process::exit calls from
the library so the binary handles termination.

Comment on lines +309 to +327
    // Test completely different sequences
    let file_hashes = vec![vec![1, 2, 3, 4, 5], vec![6, 7, 8, 9, 10]];

    let left = LineId {
        file_id: 0,
        line_number: 0,
    };
    let right = LineId {
        file_id: 1,
        line_number: 0,
    };

    let collision = maximize_collision(&file_hashes, &left, &right, 2);
    // Since the first elements are different (1 != 6), there should be no collision
    // but maximize_collision might still return a collision with 0 lines
    if let Some(collision) = collision {
        assert_eq!(collision.num_lines, 0);
    }
}

⚠️ Potential issue

Strengthen expectation: no collision when first elements differ

maximize_collision should return None when rolling-hash digest was a false positive. This avoids 0-line “collisions”.

-    let collision = maximize_collision(&file_hashes, &left, &right, 2);
-    // Since the first elements are different (1 != 6), there should be no collision
-    // but maximize_collision might still return a collision with 0 lines
-    if let Some(collision) = collision {
-        assert_eq!(collision.num_lines, 0);
-    }
+    let collision = maximize_collision(&file_hashes, &left, &right, 2);
+    assert!(collision.is_none());
📝 Committable suggestion


Suggested change
(original)
    // Test completely different sequences
    let file_hashes = vec![vec![1, 2, 3, 4, 5], vec![6, 7, 8, 9, 10]];

    let left = LineId {
        file_id: 0,
        line_number: 0,
    };
    let right = LineId {
        file_id: 1,
        line_number: 0,
    };

    let collision = maximize_collision(&file_hashes, &left, &right, 2);
    // Since the first elements are different (1 != 6), there should be no collision
    // but maximize_collision might still return a collision with 0 lines
    if let Some(collision) = collision {
        assert_eq!(collision.num_lines, 0);
    }
}

(suggested)
    // Test completely different sequences
    let file_hashes = vec![vec![1, 2, 3, 4, 5], vec![6, 7, 8, 9, 10]];

    let left = LineId {
        file_id: 0,
        line_number: 0,
    };
    let right = LineId {
        file_id: 1,
        line_number: 0,
    };

    let collision = maximize_collision(&file_hashes, &left, &right, 2);
    assert!(collision.is_none());
}
🤖 Prompt for AI Agents
In tests/integration_tests.rs around lines 309 to 327, the test currently allows
maximize_collision to return Some with num_lines == 0 when first elements
differ; instead assert that no collision is returned. Change the assertion to
require collision.is_none() (i.e., fail the test if maximize_collision returns
Some), ensuring the test expects None rather than a zero-length collision.
