
Conversation

@dimitri-yatsenko (Member) commented on Dec 22, 2025

Fix issue #1243.

Implement AutoPopulate 2.0.

Design specification for issue #1243 proposing:
- Per-table jobs tables with native primary keys
- Extended status values (pending, reserved, success, error, ignore)
- Priority and scheduling support
- Referential integrity via foreign keys
- Automatic refresh on populate
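A minimal sketch of what one per-table jobs table might look like as a DataJoint table definition, assuming the features bulleted above; the key attributes and types are illustrative, not the final schema:

```python
# Hypothetical per-table jobs definition implied by the bullets above.
# The primary key mirrors the target table's FK-derived primary key.
jobs_definition = """
# job queue for one auto-populated table
subject_id : int                # FK-derived from the target's primary key (example)
session_id : int                # FK-derived (example)
---
status : enum('pending', 'reserved', 'success', 'error', 'ignore')
priority : smallint             # lower = more urgent
scheduled_time : datetime       # job becomes available at this server time
"""
```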
@github-actions bot added the enhancement and documentation labels on Dec 22, 2025
Auto-populated tables must have primary keys composed entirely of
foreign key references. This ensures 1:1 job correspondence and
enables proper referential integrity for the jobs table.
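For illustration, a computed table whose primary key is composed entirely of foreign key references could look like this (table and attribute names are invented):

```python
import datajoint as dj

schema = dj.Schema('demo')

@schema
class Recording(dj.Manual):
    definition = """
    recording_id : int
    """

@schema
class SpectralAnalysis(dj.Computed):
    definition = """
    -> Recording        # primary key comes entirely from the FK reference
    ---
    peak_freq : float
    """

    def make(self, key):
        # exactly one job per Recording entry: 1:1 correspondence
        self.insert1(dict(key, peak_freq=0.0))
```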
- Jobs tables have matching primary key structure but no FK constraints
- Stale jobs (from deleted upstream records) handled by refresh()
- Added created_time field for stale detection
- refresh() now returns {added, removed} counts
- Updated rationale sections to reflect performance-focused design
- Jobs table automatically dropped when target table is dropped/altered
- schema.jobs returns list of JobsTable objects for all auto-populated tables
- Updated dashboard examples to use schema.jobs iteration
- Updated state transition diagram to show only automatic transitions
- Added note that ignore is manually set and skipped by populate/refresh
- reset() can move ignore jobs back to pending
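A small usage sketch of the behaviors listed above; the exact signatures are assumptions based on the bullets, and the return shape follows the refresh() note:

```python
MyTable.jobs.ignore(key)          # set manually; skipped by populate()/refresh()
counts = MyTable.jobs.refresh()   # prunes stale jobs, queues new ones
print(counts)                     # e.g. {'added': 12, 'removed': 3}
```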
Major changes:
- Remove reset() method; use delete() + refresh() instead
- Jobs go from any state → (none) via delete, then → pending via refresh()
- Shorten deprecation roadmap: clean break, no legacy support
- Jobs tables created lazily on first populate(reserve_jobs=True)
- Legacy tables with extra PK attributes: jobs table uses only FK-derived keys
- Remove SELECT FOR UPDATE locking from job reservation
- Conflicts (rare) resolved by make() transaction's duplicate key error
- Second worker catches error and moves to next job
- Simpler code, better performance on high-traffic jobs table
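Taken together, resetting a failed or stuck job without a dedicated reset() method would look roughly like this (hedged sketch of the API described above):

```python
# any state -> (none): delete the job record (no confirmation prompt)
(MyTable.jobs & key).delete()
# (none) -> pending: refresh re-creates the job if key is still in key_source
MyTable.jobs.refresh()
```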
Each job is marked as 'reserved' individually before its make() call,
matching the current implementation's behavior.
- Replace ASCII diagram with Mermaid stateDiagram
- Remove separate schedule() and set_priority() methods
- refresh() now handles scheduling via scheduled_time and priority params
- Clarify complete() can delete or keep job based on settings
- ignore() can be called on keys not yet in jobs table
- Reserve is done via update1() per key, client provides pid/host/connection_id
- Removed specific SQL query from spec
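A hedged sketch of that per-key reservation; the attribute names (pid, host, connection_id) come from the bullet above, but the exact call is illustrative:

```python
import os
import platform

MyTable.jobs.update1(dict(
    key,
    status='reserved',
    pid=os.getpid(),
    host=platform.node(),
    connection_id=MyTable.connection.connection_id,
))
```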
If a success job's key is still in key_source but the target entry
was deleted, refresh() will transition it back to pending.
Replaces multiple [*] start/end states with a single explicit
"(none)" state for clarity.
- Use complete() and complete()* notation for conditional transitions
- Same for refresh() and refresh()*
- Remove clear_completed(); use (jobs & 'status="success"').delete() instead
- Note that delete() requires no confirmation (low-cost operation)
- Priority: lower = more urgent (0 = highest), default = 5
- Acyclic state diagram with dual (none) states
- delete() inherited from delete_quick(), use (jobs & cond).delete()
- Added 'ignored' property for consistency
- populate() logic: fetch pending first, only refresh if no pending found
- Updated all examples to reflect new priority semantics
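In API terms, the new priority semantics and the removal of clear_completed() translate to calls like these (illustrative):

```python
# lower value = more urgent; 0 is highest, default is 5
MyTable.jobs.refresh(priority=0)              # queue new jobs as most urgent

# clear_completed() is gone; a plain restricted delete replaces it
(MyTable.jobs & 'status="success"').delete()  # no confirmation required
```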
- Add Terminology section defining stale (pending jobs with deleted upstream)
  and orphaned (reserved jobs from crashed processes)
- Rename "Stale Reserved Job Detection" to "Orphaned Job Handling"
- Clarify that orphaned job detection is orchestration-dependent (no algorithmic method)
- Update stale job handling section for consistency
- Remove requirement that auto-populated tables have FK-only primary keys
  (this constraint is handled elsewhere, not by the jobs system)
- Clarify that jobs table PK includes only FK-derived attributes from
  the target table's primary key
- Add example showing how additional PK attributes are excluded
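The excluded-attribute case might look like this (names invented for illustration):

```python
@schema
class Segmentation(dj.Computed):
    definition = """
    -> Image                    # FK-derived primary-key attributes
    seg_method : varchar(16)    # extra PK attribute, not FK-derived
    ---
    mask : longblob
    """

# Per the spec, the jobs table's primary key contains only the FK-derived
# attributes inherited from Image; seg_method is excluded.
```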
- Add comprehensive Hazard Analysis section covering:
  - Race conditions (reservation, refresh, completion)
  - State transitions (invalid, stuck, ignored)
  - Data integrity (stale jobs, sync, transactions)
  - Performance (table size, refresh speed)
  - Operational (accidental deletion, priority)
  - Migration (legacy table, version mixing)
- Clarify that transaction-based conflict resolution applies regardless
  of reserve_jobs setting (True or False)
- Add new section "Job Reservation vs Pre-Partitioning" documenting
  the alternative workflow where orchestrators explicitly divide jobs
  before distributing to workers
- Include comparison table for when to use each approach
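A hedged sketch of the pre-partitioning alternative; submit_to_worker() is a hypothetical orchestration hook:

```python
# The orchestrator divides the key space; workers populate without reservation.
keys = MyTable.key_source.fetch('KEY')
n_workers = 4
for i in range(n_workers):
    chunk = keys[i::n_workers]   # simple round-robin partition
    submit_to_worker(lambda c=chunk: MyTable.populate(c, reserve_jobs=False))
```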
Deleting a reserved job does not terminate the running worker - it
only removes the reservation record. The worker continues its make()
call. The actual risk is duplicated work if the job is refreshed and
picked up by another worker.
Change scheduling parameter from absolute datetime to relative seconds:
- Rename scheduled_time to delay (float, seconds from now)
- Uses database server time (NOW() + INTERVAL) to avoid clock sync issues
- Update all examples to use delay parameter
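For example (illustrative call; the INTERVAL arithmetic happens on the database server):

```python
# make newly refreshed jobs available one hour from now, by server clock
MyTable.jobs.refresh(delay=3600.0)
# roughly: scheduled_time = NOW() + INTERVAL 3600 SECOND
```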
Duplicate key errors from collisions occur outside make() and are
handled silently - the job reverts to pending or (none) state. Only
genuine computation failures inside make() are logged with error status.
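A rough sketch of that behavior (structure assumed, not the actual implementation):

```python
from datajoint.errors import DuplicateError

try:
    with MyTable.connection.transaction:
        MyTable().make(key)   # genuine failures in here are logged as 'error'
except DuplicateError:
    # collision outside make(): another worker already inserted the result;
    # handle silently and let the job revert to pending or (none)
    pass
```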
This commit implements the per-table jobs system specified in the
AutoPopulate 2.0 design document.

New features:
- Per-table JobsTable class (jobs_v2.py) with FK-derived primary keys
- Status enum: pending, reserved, success, error, ignore
- Priority system (lower = more urgent, 0 = highest, default = 5)
- Scheduled processing via delay parameter
- Methods: refresh(), reserve(), complete(), error(), ignore()
- Properties: pending, reserved, errors, ignored, completed, progress()
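A hypothetical end-to-end use of this API (property and return shapes assumed from the lists above):

```python
jobs = MyTable.jobs
jobs.refresh(priority=0, delay=60)       # enqueue, most urgent, 60 s delay
print(jobs.progress())                   # e.g. counts per status
print(len(jobs.pending), len(jobs.errors))
(jobs & 'status="error"').delete()       # clear failed jobs for a retry
```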

Configuration (settings.py):
- New JobsSettings class with:
  - jobs.auto_refresh (default: True)
  - jobs.keep_completed (default: False)
  - jobs.stale_timeout (default: 3600 seconds)
  - jobs.default_priority (default: 5)
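Assuming dict-style access to these settings (key names from the list above):

```python
import datajoint as dj

dj.config['jobs.keep_completed'] = True   # keep success rows instead of deleting
dj.config['jobs.default_priority'] = 3
```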

AutoPopulate changes (autopopulate.py):
- Added jobs property to access per-table JobsTable
- Updated populate() with new parameters: priority, refresh
- Updated _populate1() to use new JobsTable API
- Collision errors (DuplicateError) handled silently per spec

Schema changes (schemas.py):
- Track auto-populated tables during decoration
- schema.jobs now returns list of JobsTable objects
- Added schema.legacy_jobs for backward compatibility
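An illustrative dashboard loop over schema.jobs; the attributes printed are assumptions, not a confirmed API:

```python
for jobs in schema.jobs:   # one JobsTable per auto-populated table
    print(jobs.table_name, len(jobs.pending), len(jobs.errors))
```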
Override drop_quick() in Imported and Computed to also drop the
associated jobs table when the main table is dropped.
Comprehensive test suite for the new per-table jobs system:
- JobsTable structure and initialization
- refresh() method with priority and delay
- reserve() method and reservation conflicts
- complete() method with keep option
- error() method and message truncation
- ignore() method
- Status filter properties (pending, reserved, errors, ignored, completed)
- progress() method
- populate() with reserve_jobs=True
- schema.jobs property
- Configuration settings
- Remove unused `job` dict and `now` variable in refresh()
- Remove unused `pk_attrs` in fetch_pending()
- Remove unused datetime import
- Apply ruff-format formatting changes
Replace schema-wide `~jobs` table with per-table JobsTable (AutoPopulate 2.0):

- Delete src/datajoint/jobs.py (old JobTable class)
- Remove legacy_jobs property from Schema class
- Delete tests/test_jobs.py (old schema-wide tests)
- Remove clean_jobs fixture and schema.jobs.delete() cleanup calls
- Update test_autopopulate.py to use new per-table jobs API

The new system provides per-table job queues with FK-derived primary keys,
rich status tracking (pending/reserved/success/error/ignore), priority
scheduling, and proper handling of job collisions.
Now that the legacy schema-wide jobs system has been removed,
rename the new per-table jobs module to its canonical name:

- src/datajoint/jobs_v2.py -> src/datajoint/jobs.py
- tests/test_jobs_v2.py -> tests/test_jobs.py
- Update imports in autopopulate.py and test_jobs.py
- Use variable assignment for pk_section instead of chr(10) in f-string
- Change error_stack type from mediumblob to <djblob>
- Use update1() in error() instead of raw SQL and deprecated _update()
- Remove config.override(enable_python_native_blobs=True) wrapper

Note: reserve() keeps raw SQL for atomic conditional update with rowcount
check - this is required for safe concurrent job reservation.
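As of this commit, that atomic pattern looks roughly like this (SQL shape illustrative; the jobs table name and key condition are placeholders):

```python
cursor = MyTable.connection.query(
    "UPDATE <jobs table> SET status='reserved' "
    "WHERE status='pending' AND <key condition>"
)
reserved = cursor.rowcount == 1   # 0 means another worker won the race
```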
- reserve() now uses update1 instead of raw SQL
- Remove status='pending' check since populate verifies this
- Change return type from bool to None
- Update autopopulate.py to not check reserve return value
- Update tests to reflect new behavior
The new implementation always populates self, so the target property is no
longer needed; all references to self.target have been replaced with self.
- Inline the logic directly in populate() and progress()
- Move restriction check to populate()
- Use (self.key_source & AndList(restrictions)).proj() directly
- Remove unused QueryExpression import
- Remove early jobs_table assignment, use self.jobs directly
- Fix comment: key_source is correct behavior, not legacy
- Use self.jobs directly in _get_pending_jobs
Method only called from one place, no need for separate function.
- Remove 'order' parameter (conflicts with priority/scheduled_time)
- Remove 'limit' parameter, keep only 'max_calls' for simplicity
- Remove unused 'random' import
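After this cleanup, a populate call would look roughly like this (parameter names taken from the commits above; exact signature illustrative):

```python
MyTable.populate(reserve_jobs=True, max_calls=100, priority=2, refresh=True)
```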
@dimitri-yatsenko dimitri-yatsenko changed the base branch from pre/v2.0 to claude/upgrade-adapted-type-1W3ap December 24, 2025 19:29
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>