@dimitri-yatsenko dimitri-yatsenko commented Jan 7, 2026

Summary

DataJoint 2.0 is a major release that modernizes the entire codebase while maintaining backward compatibility for core functionality. This release focuses on extensibility, type safety, and developer experience.

Planning: DataJoint 2.0 Plan | Milestone 2.0

Major Features

Codec System (Extensible Types)

Replaces the adapter system with a modern, composable codec architecture:

  • Base codecs: <blob>, <json>, <attach>, <filepath>, <object>, <hash>
  • Chaining: Codecs can wrap other codecs (e.g., <blob> wraps <json> for external storage)
  • Auto-registration: Custom codecs register via __init_subclass__
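  • Validation: Optional validate() method for type checking before insert

from datajoint import Codec

class MyCodec(Codec):
    python_type = MyClass  # user-defined class handled by this codec
    dj_type = "<blob>"  # Storage format

    def encode(self, value): ...  # Python object -> stored representation
    def decode(self, value): ...  # stored representation -> Python object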
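
The auto-registration bullet above can be illustrated with a minimal, self-contained sketch. It does not import datajoint: the `Codec` base class here is a stand-in, and the `registry` dictionary, the `name` attribute, and `PointCodec` are hypothetical names chosen for illustration, not the library's actual internals.

```python
class Codec:
    """Stand-in base class (not datajoint.Codec) showing __init_subclass__ registration."""

    registry = {}  # hypothetical: codec name -> codec class
    name = None    # hypothetical: subclasses set this to be registered

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once per subclass definition, so no explicit register() call is needed.
        if cls.name is not None:
            Codec.registry[cls.name] = cls

    def validate(self, value):
        """Optional hook; the default accepts every value."""

    def encode(self, value):
        raise NotImplementedError

    def decode(self, stored):
        raise NotImplementedError


class PointCodec(Codec):
    name = "point"

    def validate(self, value):
        if not (isinstance(value, tuple) and len(value) == 2):
            raise TypeError("expected an (x, y) tuple")

    def encode(self, value):
        self.validate(value)  # type check before the value is stored
        return f"{value[0]},{value[1]}"

    def decode(self, stored):
        x, y = stored.split(",")
        return float(x), float(y)


codec = Codec.registry["point"]()
print(codec.decode(codec.encode((1.5, 2.0))))
```

Defining the subclass is the only step needed to make it discoverable, which is the property the bullet describes.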

Semantic Matching

Attribute lineage tracking ensures joins only match semantically compatible attributes:

  • Attributes track their origin through foreign key inheritance
  • Joins require matching lineage (not just matching names)
  • Prevents accidental matches on generic names like id or name
  • semantic_check=False for legacy permissive behavior

# These join on subject_id because both inherit from Subject
Session * Recording  # ✓ Works - same lineage

# These fail because 'id' has different origins
TableA * TableB  # ✗ Fails - different lineage for 'id'
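
As a rough model of the rule above, the sketch below represents each table as a mapping from attribute name to a lineage string and refuses to match same-named attributes with different origins. The function name `join_attributes`, the lineage string format, and the choice of `ValueError` are assumptions made for illustration; the real check lives inside DataJoint's join operator.

```python
def join_attributes(left, right, semantic_check=True):
    """Return the attribute names two tables would be joined on.

    left, right: dict mapping attribute name -> lineage string (hypothetical format).
    """
    shared = [name for name in left if name in right]
    if semantic_check:
        for name in shared:
            if left[name] != right[name]:
                raise ValueError(
                    f"attribute {name!r} has different lineage: "
                    f"{left[name]} vs {right[name]}"
                )
    return shared


session = {"subject_id": "lab.subject.subject_id", "session_id": "lab.session.session_id"}
recording = {"subject_id": "lab.subject.subject_id", "recording_id": "lab.recording.recording_id"}
table_a = {"id": "lab.table_a.id"}
table_b = {"id": "lab.table_b.id"}

print(join_attributes(session, recording))                    # same lineage
print(join_attributes(table_a, table_b, semantic_check=False))  # legacy name-only match
```

Calling `join_attributes(table_a, table_b)` with the default check raises, mirroring the failing `TableA * TableB` case.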

Primary Key Rules

Rigorous primary key propagation through all operators:

  • Join: Result PK based on functional dependencies (A→B, B→A, both, neither)
  • Aggregation: Groups by left operand's primary key
  • Projection: Preserves PK attributes, drops secondary
  • Universal set: dj.U('attr') creates ad-hoc grouping entities
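
The four join cases can be sketched as below. The PR text names the cases but not the outcome of each, so the results chosen here (left key when A determines B or when both hold, right key when only B determines A, the union otherwise) are an assumption to be checked against the Primary Keys reference. `join_primary_key` and its arguments are hypothetical names.

```python
def join_primary_key(a_pk, a_attrs, b_pk, b_attrs):
    """Sketch of primary-key selection for A * B.

    a_pk / b_pk: primary-key attribute names of each operand.
    a_attrs / b_attrs: all attribute names (primary and secondary) of each operand.
    """
    a_determines_b = set(b_pk) <= set(a_attrs)  # A -> B: A carries B's whole key
    b_determines_a = set(a_pk) <= set(b_attrs)  # B -> A: B carries A's whole key
    if a_determines_b:  # also covers the "both" case (assumed to prefer the left key)
        return list(a_pk)
    if b_determines_a:
        return list(b_pk)
    # Neither: the result is identified by both keys together.
    return list(a_pk) + [name for name in b_pk if name not in a_pk]


# Session references Subject, so Session determines Subject.
print(join_primary_key(
    ["subject_id", "session_id"], ["subject_id", "session_id", "session_date"],
    ["subject_id"], ["subject_id", "species"],
))
```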

AutoPopulate 2.0 (Jobs System)

Per-table job management with enhanced tracking:

  • Hidden metadata: ~~_job_timestamp and ~~_job_duration columns
  • Per-table jobs: Each computed table has its own ~~table_name job table
  • Schema.jobs: List all job tables in a schema
  • Progress tracking: table.progress() returns (remaining, total)
  • Priority scheduling: Jobs ordered by priority, then timestamp
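
The ordering in the last bullet amounts to a two-level sort. The sketch below uses plain dictionaries rather than a real job table, and it assumes that a lower priority number runs first; the PR text does not say which direction the priority runs, and the field names are hypothetical.

```python
from datetime import datetime


def order_jobs(jobs):
    """Order jobs by priority first, then by timestamp (oldest first)."""
    return sorted(jobs, key=lambda job: (job["priority"], job["timestamp"]))


jobs = [
    {"key": "a", "priority": 5, "timestamp": datetime(2026, 1, 7, 9, 0)},
    {"key": "b", "priority": 1, "timestamp": datetime(2026, 1, 7, 10, 0)},
    {"key": "c", "priority": 1, "timestamp": datetime(2026, 1, 7, 8, 0)},
]

print([job["key"] for job in order_jobs(jobs)])
```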

Modern Fetch & Insert API

New fetch methods:

  • to_dicts() - List of dictionaries
  • to_pandas() - DataFrame with PK as index
  • to_arrays(*attrs) - NumPy arrays (structured or individual)
  • keys() - Primary keys only
  • fetch1() - Single row

Insert improvements:

Type Aliases

Core DataJoint types for portability:

| Alias | MySQL Type |
|---|---|
| int8, int16, int32, int64 | tinyint, smallint, int, bigint |
| uint8, uint16, uint32, uint64 | unsigned variants |
| float32, float64 | float, double |
| bool | tinyint |
| uuid | binary(16) |
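
The alias table can be read as a simple lookup. The sketch below is not DataJoint's declaration code: `TYPE_ALIASES` and `to_mysql_type` are hypothetical names, the "unsigned variants" are spelled in standard MySQL syntax (for example `smallint unsigned`) as an assumption, and unknown types are passed through unchanged.

```python
SIGNED = {"int8": "tinyint", "int16": "smallint", "int32": "int", "int64": "bigint"}

TYPE_ALIASES = {
    **SIGNED,
    # uint8 -> "tinyint unsigned", uint16 -> "smallint unsigned", and so on
    **{f"u{alias}": f"{mysql} unsigned" for alias, mysql in SIGNED.items()},
    "float32": "float",
    "float64": "double",
    "bool": "tinyint",
    "uuid": "binary(16)",
}


def to_mysql_type(type_name):
    """Translate a core type alias; anything else is returned as-is."""
    return TYPE_ALIASES.get(type_name, type_name)


for name in ("int32", "uint16", "uuid", "varchar(32)"):
    print(name, "->", to_mysql_type(name))
```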

Object Storage

Content-addressed and object storage types:

  • <hash> - Content-addressed storage with deduplication
  • <object> - Named object storage (Zarr, folders)
  • <filepath> - Reference to managed files
  • <attach> - File attachments (uploaded on insert)
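
Content addressing with deduplication, as named in the first bullet, can be sketched with an in-memory store keyed by a digest of the data. This is an illustration of the general technique only: the `HashStore` class is hypothetical, and the choice of SHA-256 is an assumption, since the PR text does not say which hash the `<hash>` type uses.

```python
import hashlib


class HashStore:
    """In-memory content-addressed store: identical content is stored once."""

    def __init__(self):
        self._objects = {}  # hex digest -> bytes

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # setdefault keeps the first copy, so repeated inserts add nothing.
        self._objects.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self._objects[digest]

    def __len__(self):
        return len(self._objects)


store = HashStore()
first = store.put(b"spike waveform")
second = store.put(b"spike waveform")
print(first == second, len(store))
```

The table row would hold only the digest, which is why two rows with identical payloads share one stored object.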

Virtual Schema Infrastructure (#1307)

New schema introspection API for exploring existing databases:

  • Schema.get_table(name) - Direct table access with auto tier prefix detection
  • Schema['TableName'] - Bracket notation access
  • for table in schema - Iterate tables in dependency order
  • 'TableName' in schema - Check table existence
  • dj.virtual_schema() - Clean entry point for accessing schemas
  • dj.VirtualModule() - Virtual modules with custom names
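
The access patterns listed above map onto standard Python container protocols. The sketch below is a stand-in, not the Schema class: `SchemaView` and `to_snake_case` are hypothetical, the lookup returns the stored table name rather than a table object, the tier prefixes (none, `#`, `_`, `__`) follow DataJoint's table naming convention as I understand it, and iteration here is in stored order rather than dependency order.

```python
import re

TIER_PREFIXES = ("", "#", "_", "__")  # manual, lookup, imported, computed (assumed)


def to_snake_case(name):
    """Convert a CamelCase class name to snake_case."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


class SchemaView:
    """Stand-in for a schema object exposing get_table, [], in, and iteration."""

    def __init__(self, table_names):
        self._tables = list(table_names)

    def get_table(self, name):
        base = to_snake_case(name)
        for prefix in TIER_PREFIXES:  # try each tier prefix in turn
            if prefix + base in self._tables:
                return prefix + base
        raise KeyError(name)

    __getitem__ = get_table  # schema["TableName"]

    def __contains__(self, name):  # "TableName" in schema
        try:
            self.get_table(name)
        except KeyError:
            return False
        return True

    def __iter__(self):  # for table in schema
        return iter(self._tables)


schema = SchemaView(["#species", "session", "__spike_sorting"])
print(schema["Species"], schema.get_table("SpikeSorting"), "Session" in schema)
```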

CLI Improvements

The dj command-line interface for interactive exploration:

  • dj -s schema:alias - Load schemas as virtual modules
  • --host, --user, --password - Connection options
  • Fixed -h conflict with --help

Settings Modernization

Pydantic-based configuration with validation:

  • Type-safe settings with automatic validation
  • dj.config.override() context manager
  • Secrets directory support (.secrets/)
  • Environment variable overrides (DJ_HOST, etc.)
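
A temporary override combined with an environment-variable default can be sketched without Pydantic. The `Settings` class below is a stand-in for `dj.config`: the field names, the defaults, and the keyword-argument form of `override()` are assumptions, and only the `DJ_HOST` variable name comes from the text above.

```python
import os
from contextlib import contextmanager


class Settings:
    """Stand-in settings object with env-var defaults and a scoped override."""

    def __init__(self, environ=None):
        environ = os.environ if environ is None else environ
        self.host = environ.get("DJ_HOST", "localhost")  # default is an assumption
        self.port = 3306                                 # default is an assumption

    @contextmanager
    def override(self, **changes):
        unknown = set(changes) - set(vars(self))
        if unknown:
            raise AttributeError(f"unknown settings: {sorted(unknown)}")
        saved = {key: getattr(self, key) for key in changes}
        try:
            for key, value in changes.items():
                setattr(self, key, value)
            yield self
        finally:
            # Restore the previous values even if the body raised.
            for key, value in saved.items():
                setattr(self, key, value)


config = Settings({"DJ_HOST": "db.example.org"})
with config.override(port=3307):
    print(config.host, config.port)
print(config.port)
```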

License Change

Changed from LGPL to Apache 2.0 license (#1235 (discussion)):

  • More permissive for commercial and academic use
  • Compatible with broader ecosystem of tools
  • Clearer patent grant provisions

Breaking Changes

Removed Support

API Changes

  • fetch() → to_dicts(), to_pandas(), to_arrays()
  • fetch(format='frame') → to_pandas()
  • fetch(as_dict=True) → to_dicts()
  • safemode=False → prompt=False
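
When auditing 1.x code, the fetch replacements listed above can be encoded as a small lookup. The helper below is purely illustrative and not part of DataJoint: `replacement_for` is a hypothetical name, and it only inspects the keyword arguments mentioned in this list.

```python
def replacement_for(fetch_kwargs):
    """Suggest the 2.0 method for a 1.x fetch() call given its keyword arguments."""
    if fetch_kwargs.get("as_dict"):
        return "to_dicts()"
    if fetch_kwargs.get("format") == "frame":
        return "to_pandas()"
    # Plain fetch() has no single successor; pick by the desired result shape.
    return "to_dicts(), to_pandas(), or to_arrays()"


print(replacement_for({"as_dict": True}))
print(replacement_for({"format": "frame"}))
print(replacement_for({}))
```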

Semantic Changes

  • Joins now require lineage compatibility by default
  • Aggregation keeps non-matching rows by default (like LEFT JOIN)

Documentation

Developer Documentation (this repo)

Comprehensive updates in docs/:

  • NumPy-style docstrings for all public APIs
  • Architecture guides for contributors
  • Auto-generated API reference via mkdocstrings

User Documentation (datajoint-docs)

Full documentation site following the Diátaxis framework:

Tutorials (learning-oriented, Jupyter notebooks):

  1. Getting Started - Installation, connection, first schema
  2. Schema Design - Table tiers, definitions, foreign keys
  3. Data Entry - Insert patterns, lookups, manual tables
  4. Queries - Restriction, projection, join, aggregation, fetch
  5. Computation - Computed tables, make(), populate patterns
  6. Object Storage - Blobs, attachments, external storage

How-To Guides (task-oriented):

  • Configure object storage, Design primary keys, Model relationships
  • Handle computation errors, Manage large datasets, Create custom codecs
  • Use the CLI, Migrate from 1.x

Reference (specifications):

  • Table Declaration, Query Algebra, Data Manipulation
  • Primary Keys, Semantic Matching, Type System, Virtual Schemas
  • Codec API, AutoPopulate, Fetch API, Job Metadata

Project Structure

Test Plan

  • 580+ integration tests pass
  • 80+ unit tests pass
  • Pre-commit hooks pass
  • Documentation builds successfully
  • Tutorials execute against test database

Closes

Milestone 2.0 Issues

Bug Fixes

Improvements

Related PRs

Migration Guide

See How to Migrate from 1.x for detailed migration instructions.


🤖 Generated with Claude Code

d-v-b and others added 30 commits August 29, 2025 10:09
update test workflow to use src layout
use pytest to manage docker container startup for tests
dimitri-yatsenko and others added 11 commits January 9, 2026 14:51
Add CSS media query for prefers-color-scheme: dark to automatically
adapt table preview styling to dark mode environments.

Dark mode colors:
- Table header: #4a4a4a background
- Odd rows: #2d2d2d background, #e0e0e0 text
- Even rows: #3d3d3d background, #e0e0e0 text
- Primary key: #bd93f9 (purple accent)
- Borders: #555555

Uses browser-native dark mode detection - no JavaScript or config needed.
Light mode styling remains unchanged for backward compatibility.

Fixes #1167

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
When a user answers "no" to "Commit deletes?", the transaction is
rolled back but delete() still returned the count of rows that would
have been deleted. This was unintuitive - if nothing was deleted,
the return value should be 0.

Now delete() returns 0 when:
- User cancels at the prompt
- Nothing to delete (already worked correctly)

Fixes #1155

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
When inspect.getmembers() encounters modules with objects that have
non-standard __bases__ attributes (like _ClassNamespace from typing
internals), it raises TypeError. This caused dj.Diagram(schema) to
fail intermittently depending on what modules were imported.

Now catches TypeError in addition to ImportError, allowing the search
to continue by skipping problematic modules.

Fixes #1072

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
)

`text` is no longer a core DataJoint type. It remains available as a
native SQL passthrough type (with portability warning).

Rationale:
- Core types should encourage structured, bounded data
- varchar(n) covers most legitimate text needs with explicit bounds
- json handles structured text better
- <object> is better for large/unbounded text (files, sequences, docs)
- text behavior varies across databases, hurting portability

Changes:
- Remove `text` from CORE_TYPES in declare.py
- Update NATIVE_TEXT pattern to match plain `text` (in addition to
  tinytext, mediumtext, longtext)
- Update archive docs to note text is native-only

Users who need unlimited text can:
- Use varchar(n) with generous limit
- Use json for structured content
- Use <object> for large text files
- Use native text types with portability warning

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
When dropping a table/schema referenced by foreign key constraints,
MySQL returns error 3730. This was passing through as a raw pymysql
OperationalError, making it difficult for users to catch and handle.

Now translates to datajoint.errors.IntegrityError, consistent with
other foreign key constraint errors (1217, 1451, 1452).

Before:
  pymysql.err.OperationalError: (3730, "Cannot drop table...")

After:
  datajoint.errors.IntegrityError: Cannot drop table '#table'
  referenced by a foreign key constraint...

Fixes #1032

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
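
The error translation described in the commit above follows a common pattern: map driver error codes to a library-level exception. The sketch below uses stand-in exception classes instead of importing pymysql or datajoint, and `translate_error` is a hypothetical name; only the four error codes come from the commit message.

```python
class OperationalError(Exception):
    """Stand-in for pymysql.err.OperationalError; args are (code, message)."""


class IntegrityError(Exception):
    """Stand-in for datajoint.errors.IntegrityError."""


FOREIGN_KEY_ERROR_CODES = {1217, 1451, 1452, 3730}


def translate_error(error):
    """Return an IntegrityError for foreign-key failures, else the original error."""
    code, message = error.args[0], error.args[1]
    if code in FOREIGN_KEY_ERROR_CODES:
        return IntegrityError(message)
    return error


raw = OperationalError(3730, "Cannot drop table referenced by a foreign key constraint")
print(type(translate_error(raw)).__name__)
```

Callers can then catch one exception type for all foreign key violations instead of inspecting driver error codes.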
Add __enter__ and __exit__ methods to Connection for use with Python's
`with` statement. This enables automatic connection cleanup, particularly
useful for serverless environments (AWS Lambda, Cloud Functions).

Usage:
    with dj.Connection(host, user, password) as conn:
        schema = dj.schema('my_schema', connection=conn)
        # perform operations
    # connection automatically closed

Closes #1081

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
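
The protocol added in the commit above can be shown with a stand-in class. This is not dj.Connection: the `is_open` flag and the no-argument constructor are hypothetical, and the sketch only demonstrates that `__exit__` closes the connection whether or not the body raised.

```python
class Connection:
    """Stand-in connection demonstrating the context-manager protocol."""

    def __init__(self):
        self.is_open = True

    def close(self):
        self.is_open = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not suppress exceptions raised inside the block


with Connection() as conn:
    print("inside:", conn.is_open)
print("after:", conn.is_open)
```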
* perf: implement lazy imports for heavy dependencies

Defer loading of heavy dependencies (networkx, matplotlib, click, pymysql)
until their associated features are accessed:

- dj.Diagram, dj.Di, dj.ERD -> loads diagram.py (networkx, matplotlib)
- dj.kill -> loads admin.py (pymysql via connection)
- dj.cli -> loads cli.py (click)

This reduces `import datajoint` time significantly, especially on macOS
where import overhead is higher. Core functionality (Schema, Table,
Connection, etc.) remains immediately available.

Closes #1220

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: cache lazy imports correctly and expose diagram module

- Cache lazy imports in globals() to override the submodule that
  importlib automatically sets on the parent module
- Add dj.diagram to lazy modules (returns module for diagram_active access)
- Add tests for cli callable and diagram module access

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
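
The lazy-import commits above rely on a module-level `__getattr__` (PEP 562) plus caching of the resolved attribute. The sketch below builds a throwaway module object to demonstrate the mechanism in one file; `make_lazy_package` and the `lazy_attrs` mapping are hypothetical, and `json.dumps` stands in for a heavy dependency. In a real package the same function body would live in `__init__.py` and cache into `globals()`.

```python
import importlib
import types


def make_lazy_package(name, lazy_attrs):
    """Create a module whose listed attributes are imported on first access.

    lazy_attrs: dict mapping attribute name -> (module name, attribute name).
    """
    package = types.ModuleType(name)

    def __getattr__(attr):
        try:
            module_name, target = lazy_attrs[attr]
        except KeyError:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}") from None
        value = getattr(importlib.import_module(module_name), target)
        package.__dict__[attr] = value  # cache: later lookups bypass __getattr__
        return value

    package.__getattr__ = __getattr__  # module-level hook, looked up in the module dict
    return package


demo = make_lazy_package("demo", {"dumps": ("json", "dumps")})
print("dumps" in vars(demo))  # not loaded yet
print(demo.dumps([1]))        # first access triggers the import
print("dumps" in vars(demo))  # now cached
```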
* fix: raise error when table declaration fails due to permissions

Previously, AccessError during table declaration was silently swallowed,
causing tables with cross-schema foreign keys to fail without any feedback
when the user lacked REFERENCES privilege.

Now:
- If table already exists: suppress error (idempotent declaration)
- If table doesn't exist: raise AccessError with helpful message about
  CREATE and REFERENCES privileges

Closes #1161

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* test: update test to expect AccessError at declaration time

The test previously expected silent failure at declaration followed by
error at insert time. Now we fail fast at declaration time (better UX).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Fix read_cell_array to handle edge cases from MATLAB:
- Empty cell arrays ({})
- Cell arrays with empty elements ({[], [], []})
- Nested/ragged arrays ({[1,2], [3,4,5]})
- Cell matrices with mixed content

The fix uses dtype='object' to avoid NumPy's array homogeneity
requirements that caused reshape failures with ragged arrays.

Closes #1056
Closes #1098

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* fix: provide helpful error when table heading is not configured

When using tables from non-activated schemas, operations that access
the heading now raise a clear DataJointError instead of confusing
"NoneType has no attribute" errors.

Example:
    schema = dj.Schema()  # Not activated
    @schema
    class MyTable(dj.Manual): ...

    MyTable().heading  # Now raises: "Table `MyTable` is not properly
                       # configured. Ensure the schema is activated..."

Closes #1039

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Allow heading introspection on base tier classes

The heading property now returns None for base tier classes (Lookup,
Manual, Imported, Computed, Part) instead of raising an error. This
allows Python's help() and inspect modules to work correctly.

User-defined table classes still get the helpful error message when
trying to access heading on a non-activated schema.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…cs (#1328)

* feat: Add consistent URL representation for all storage paths (#1326)

Implements unified URL handling for all storage backends including local files:

- Add URL_PROTOCOLS tuple including file://
- Add is_url() to check if path is a URL
- Add normalize_to_url() to convert local paths to file:// URLs
- Add parse_url() to parse any URL into protocol and path
- Add StorageBackend.get_url() to return full URLs for any backend
- Add comprehensive unit tests for URL functions

This enables consistent internal representation across all storage types,
aligning with fsspec's unified approach to filesystems.

Closes #1326

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
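
The three URL helpers named in the commit above could look roughly like the sketch below. The function names come from the commit message, but the bodies, the return shape of `parse_url`, and every entry of `URL_PROTOCOLS` other than `file://` are assumptions; the local-path handling also assumes POSIX-style paths.

```python
import os

# Only file:// is confirmed by the commit message; the rest are illustrative.
URL_PROTOCOLS = ("file://", "s3://", "gs://", "http://", "https://")


def is_url(path):
    """True if the path already starts with a known protocol."""
    return str(path).startswith(URL_PROTOCOLS)


def normalize_to_url(path):
    """Leave URLs untouched; turn a local path into a file:// URL."""
    path = str(path)
    if is_url(path):
        return path
    return "file://" + os.path.abspath(path)


def parse_url(url):
    """Split a URL into (protocol, path)."""
    protocol, separator, path = url.partition("://")
    if not separator:
        raise ValueError(f"not a URL: {url!r}")
    return protocol, path


print(parse_url("s3://bucket/key"))
print(parse_url("file:///tmp/data.bin"))
print(normalize_to_url("s3://bucket/key"))
```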

* test: Remove redundant URL tests from test_object.py

The TestRemoteURLSupport class tested is_remote_url and parse_remote_url
which were renamed to is_url and parse_url. These tests are now redundant
as comprehensive coverage exists in tests/unit/test_storage_urls.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Remove trailing whitespace from blank line in json.ipynb

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* style: Apply ruff formatting

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* chore: Remove accidentally committed local config files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* style: Apply ruff formatting to test files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Remove orphaned archive documentation

Content has been migrated to datajoint-docs repository.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
dimitri-yatsenko added the breaking and bug labels Jan 9, 2026
Ensures PR #1311 automatically receives breaking and bug labels.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
github-actions bot removed the bug and breaking labels Jan 9, 2026
dimitri-yatsenko added the bug and breaking labels Jan 9, 2026
Makes tables more compact in notebook displays.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
github-actions bot removed the bug and breaking labels Jan 10, 2026
dimitri-yatsenko and others added 9 commits January 9, 2026 18:28
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Documentation is now consolidated in datajoint-docs repository.

Changes:
- Delete docs/ folder (legacy MkDocs infrastructure)
- Create ARCHITECTURE.md with transpiler design docs
- Update README.md links to point to docs.datajoint.com

The Developer Guide remains in README.md. Internal architecture
documentation for contributors is now in ARCHITECTURE.md.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Transpilation documentation moved to datajoint-docs query-algebra spec.
Developer docs now consolidated in README.md.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename CHANGELOG.md to CHANGELOG-archive.md with redirect to GitHub Releases
- Add "Writing Release Notes" section to RELEASE_MEMO.md:
  - Categories (BREAKING, Added, Changed, Deprecated, Fixed, Security)
  - Format template with examples
  - Guidelines for good release notes
  - PR label mapping for release drafter

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Slim README.md to essentials (intro, badges, install, links)
- Create CONTRIBUTING.md with:
  - Development setup (pixi and pip)
  - Test running instructions
  - Pre-commit hooks
  - Environment variables
  - Condensed docstring style guide
- Delete DOCSTRING_STYLE.md (merged into CONTRIBUTING.md)

README: 218 → 82 lines
All detailed docs now at docs.datajoint.com

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add .github/DISCUSSION_TEMPLATE/rfc.yml for enhancement proposals
- Fix table header alignment (center instead of right)
- Fix excessive padding in table headers by removing p tag margins

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Labels

documentation, enhancement

5 participants