@praateekmahajan (Contributor) commented Nov 25, 2025

Description

This pull request refactors the workflow interfaces for the deduplication pipelines (exact, fuzzy, and semantic) to standardize their outputs and improve usability.

Core API and Interface Refactoring

  • Introduced a new WorkflowRunResult dataclass in nemo_curator/pipeline/workflow.py to encapsulate workflow outputs, pipeline task mappings, and metadata. Also added an abstract WorkflowBase class to standardize workflow interfaces.
  • Updated all deduplication workflow classes (ExactDeduplicationWorkflow, FuzzyDeduplicationWorkflow, SemanticDeduplicationWorkflow) to inherit from WorkflowBase and to return a WorkflowRunResult from their run methods, instead of returning None or a dictionary.
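
A minimal sketch of the shape described above, for readers who have not opened the diff. The field names (`workflow_name`, `pipeline_tasks`, `metadata`), the helper methods, and `ToyDedupWorkflow` are assumptions made for illustration; the actual definitions live in nemo_curator/pipeline/workflow.py and may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class WorkflowRunResult:
    """Illustrative container for what a workflow run produced (fields are assumed)."""

    workflow_name: str
    # Maps a pipeline name to the tasks that pipeline returned.
    pipeline_tasks: Dict[str, List[Any]] = field(default_factory=dict)
    # Free-form run metadata such as timings and duplicate counts.
    metadata: Dict[str, Any] = field(default_factory=dict)

    def add_pipeline_tasks(self, pipeline_name: str, tasks: Optional[List[Any]]) -> None:
        # Treat a missing task list as empty so callers need no None checks.
        self.pipeline_tasks.setdefault(pipeline_name, []).extend(tasks or [])

    def add_metadata(self, key: str, value: Any) -> None:
        self.metadata[key] = value


class WorkflowBase(ABC):
    """Illustrative base class: every workflow exposes run() and returns a result."""

    @abstractmethod
    def run(self, *args: Any, **kwargs: Any) -> WorkflowRunResult:
        ...


class ToyDedupWorkflow(WorkflowBase):
    """Hypothetical stand-in for a real workflow; deduplicates a list of strings."""

    def run(self, docs: List[str]) -> WorkflowRunResult:
        result = WorkflowRunResult(workflow_name="toy_exact_dedup")
        unique = list(dict.fromkeys(docs))  # keeps first occurrences, in order
        result.add_pipeline_tasks("identification", unique)
        result.add_metadata("num_duplicates", len(docs) - len(unique))
        return result
```

With a single return type, callers can read `result.metadata` and `result.pipeline_tasks` the same way for any workflow instead of handling `None` in one case and a dictionary in another.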

Workflow Output and Metadata Improvements

  • Refactored the run methods of all workflows to collect and record detailed timing and result metadata (such as per-stage execution times and duplicate counts) into the WorkflowRunResult object.
  • Each pipeline stage now adds its results and timing to the result object.
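
One way per-stage timing could be recorded is sketched below. The `timed_stage` helper, the `<stage>_time` key pattern, and the toy stages are assumptions for illustration only; the PR may record timings differently.

```python
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List, Tuple


@contextmanager
def timed_stage(metadata: Dict[str, Any], stage_name: str) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under '<stage_name>_time'."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Written even if the stage raises, so partial runs still report timing.
        metadata[f"{stage_name}_time"] = time.perf_counter() - start


def run_toy_workflow(docs: List[str]) -> Tuple[List[str], Dict[str, Any]]:
    """Hypothetical two-stage run: identify duplicates, then remove them."""
    metadata: Dict[str, Any] = {}

    with timed_stage(metadata, "identification"):
        seen = set()
        duplicate_indices = []
        for index, doc in enumerate(docs):
            if doc in seen:
                duplicate_indices.append(index)
            seen.add(doc)
    metadata["num_duplicates"] = len(duplicate_indices)

    with timed_stage(metadata, "removal"):
        to_drop = set(duplicate_indices)
        kept = [doc for index, doc in enumerate(docs) if index not in to_drop]

    metadata["total_time"] = metadata["identification_time"] + metadata["removal_time"]
    return kept, metadata
```

Writing the timing in a `finally` clause is a design choice in this sketch: a stage that fails still leaves its elapsed time in the metadata, which helps when diagnosing slow or failing runs.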

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <praateekm@gmail.com>
…dd-workflow-results

Signed-off-by: Praateek <praateekm@gmail.com>

copy-pr-bot bot commented Nov 25, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Praateek <praateekm@gmail.com>
@praateekmahajan (Contributor, Author)

/ok to test af0787c

Signed-off-by: Praateek <praateekm@gmail.com>

@greptile-apps greptile-apps bot left a comment

Additional Comments (6)

  1. nemo_curator/stages/deduplication/exact/workflow.py, line 257

    logic: id_generator_path is undefined when self.assign_id is False

  2. nemo_curator/stages/deduplication/fuzzy/workflow.py, line 373

    logic: id_generator_path is undefined when no duplicates are found (len(valid_lsh_tasks) == 0)

  3. tests/stages/text/deduplication/test_removal_workflow.py, line 192

    logic: Metadata key mismatch: the removal workflow sets num_duplicates_removed, not num_duplicates

  4. tests/stages/text/deduplication/test_removal_workflow.py, line 193

    logic: num_output_tasks is never set in the removal workflow metadata

  5. tests/stages/text/deduplication/test_removal_workflow.py, line 210

    logic: Metadata key mismatch: the removal workflow sets num_duplicates_removed, not num_duplicates

  6. tests/stages/text/deduplication/test_removal_workflow.py, lines 282-283

    logic: Metadata key mismatches: the removal workflow sets num_duplicates_removed, not num_duplicates, and never sets num_output_tasks
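
For comments 1 and 2, a common way to avoid a name that is only bound on one branch is to assign a default before the conditional. The sketch below shows that pattern in isolation; the function name, its parameters, and the `id_generator.json` file name are hypothetical and are not taken from the PR.

```python
import os
from typing import Any, Dict, Optional


def resolve_id_generator_path(cache_path: str, assign_id: bool) -> Optional[str]:
    """Return where the ID generator state would be saved, or None if IDs are not assigned."""
    # Bind the name before the branch so it exists on every code path.
    id_generator_path: Optional[str] = None
    if assign_id:
        # Hypothetical file name, used only for this illustration.
        id_generator_path = os.path.join(cache_path, "id_generator.json")
    return id_generator_path


def build_metadata(cache_path: str, assign_id: bool) -> Dict[str, Any]:
    """Record the path in metadata only when it was actually produced."""
    metadata: Dict[str, Any] = {}
    path = resolve_id_generator_path(cache_path, assign_id)
    if path is not None:
        metadata["id_generator_path"] = path
    return metadata
```

The same default-then-override approach would apply to the fuzzy workflow's early exit when no duplicates are found.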

13 files reviewed, 6 comments
