Migration from v0.14 to v0.15

JoinSpec renamed to InputSpec

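JoinSpec has been removed. Replace every use of it: Required for an inner join, the plain-table form for an outer join, and InputSpec itself when you need key mapping.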

Before:

from datapipe.types import JoinSpec, Required

BatchTransform(
    func,
    inputs=[JoinSpec("models", join_type="inner")],
    outputs=["results"],
)

After:

from datapipe.types import InputSpec, Required

BatchTransform(
    func,
    inputs=[Required("models")],   # inner join — use Required
    outputs=["results"],
)

Note: Required (inner join) and the plain-table form (outer join) cover the two join_type values that JoinSpec exposed. InputSpec itself is now the base type used when you need key mapping (see next section).

Key mapping with InputSpec.keys and OutputSpec

When two input tables share the same primary key column name (e.g. both have id), the transform engine previously could not distinguish them. InputSpec.keys solves this by giving each table's primary key a transform-level alias.

from datapipe.types import InputSpec, OutputSpec

BatchTransform(
    enrich_posts,
    transform_keys=["post_id", "author_id"],
    inputs=[
        # Post.id → transform key "post_id"
        # Post.author_id → transform key "author_id"
        InputSpec(Post, keys={"post_id": "id", "author_id": "author_id"}),

        # Author.id → transform key "author_id"
        InputSpec(Author, keys={"author_id": "id"}),
    ],
    outputs=[
        # PostCard.id stores the post id → map transform key "post_id" to column "id"
        OutputSpec(PostCard, keys={"post_id": "id"}),
    ],
)

InputSpec.keys is a dict {"transform_key": "table_pk_column"}. Without it, key names are assumed to match (the previous behaviour is preserved).

OutputSpec.keys maps transform keys to output table primary key columns for the purpose of incremental cleanup. Without it, all transform keys are assumed to match the output table's primary key column names.
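
To make the aliasing concrete, here is a minimal pure-Python sketch of what a {"transform_key": "table_pk_column"} mapping does to a single row. It does not use datapipe and is not its implementation; the helper to_transform_keys and the sample rows are hypothetical and exist only for illustration.

```python
# Illustrative only: a hypothetical helper, not part of datapipe.
def to_transform_keys(row: dict, keys: dict[str, str]) -> dict:
    """Project a row onto transform keys using {"transform_key": "table_pk_column"}."""
    return {transform_key: row[column] for transform_key, column in keys.items()}


# Both tables have a primary key column called "id".
post_row = {"id": 10, "author_id": 7, "title": "Hello"}
author_row = {"id": 7, "name": "Ann"}

# Same mappings as in the BatchTransform example above.
post_keys = to_transform_keys(post_row, {"post_id": "id", "author_id": "author_id"})
author_keys = to_transform_keys(author_row, {"author_id": "id"})

# After aliasing, the two "id" columns no longer collide; the rows are
# related only through the transform key they share.
shared = set(post_keys) & set(author_keys)
matches = all(post_keys[k] == author_keys[k] for k in shared)
```

Here Post.id and Author.id end up under different transform keys (post_id and author_id), so the only key the two rows share is author_id.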

Explicit step names via name=

All step types now accept an optional name: str | None parameter. When provided, that string is used as the step name exactly — no hash suffix is appended.

This is the recommended way to make step names stable and predictable, especially when using the CLI to target specific steps:

BatchTransform(
    resize_images,
    inputs=[images_tbl],
    outputs=[thumbnails_tbl],
    name="resize_images",   # datapipe step --name=resize_images run
)

UpdateExternalTable(output=images_tbl, name="sync_images")

Without an explicit name, step names are auto-generated from a hash of the step class, function name, and table names. This hash changes if any of those change, which may break --name filters in scripts.
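
The sketch below shows why a hash-derived name is unstable while an explicit name is not. It is an assumption-laden illustration: the helper step_name, the use of SHA-1, the "|" separator, and the 8-character suffix are all hypothetical choices and may differ from what datapipe does internally.

```python
import hashlib


# Hypothetical helper, not datapipe's actual naming code.
def step_name(cls_name, func_name, tables, name=None):
    """Return the explicit name if given, else func_name plus a short hash suffix."""
    if name is not None:
        # An explicit name is used exactly as provided.
        return name
    # Assumed hash inputs: step class, function name, and table names.
    payload = "|".join([cls_name, func_name, *tables])
    suffix = hashlib.sha1(payload.encode("utf-8")).hexdigest()[:8]
    return f"{func_name}_{suffix}"


auto_before = step_name("BatchTransform", "resize_images", ["images", "thumbnails"])
# Renaming an output table changes the hash input, and therefore the step name.
auto_after = step_name("BatchTransform", "resize_images", ["images", "thumbs"])
pinned = step_name("BatchTransform", "resize_images", ["images", "thumbs"], name="resize_images")
```

In this sketch auto_before and auto_after differ even though the function is unchanged, whereas pinned stays "resize_images" regardless of the table names.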

DatatableTransform and UpdateExternalTable now use hash-based names

In v0.14, these two step types used plain auto-generated names:

  • DatatableTransform → "my_func" (the function name)
  • UpdateExternalTable → "update_images" (the table name)

In v0.15, they use the same hash-based naming as BatchTransform:

  • DatatableTransform → "my_func_9a3f1c8d" (function name + hash suffix)
  • UpdateExternalTable → "update_images_4b72e091" (table name + hash suffix)

If you use datapipe step --name=my_func run or similar CLI invocations targeting these steps, those name filters will no longer match. Pin the name explicitly to restore stable names:

DatatableTransform(my_func, inputs=[...], outputs=[...], name="my_func")
UpdateExternalTable(output=images_tbl, name="update_images")

Duplicate step names now raise immediately

build_compute() now raises a ValueError if two steps in the same pipeline produce the same name. Previously this was silently accepted, which could cause one step to shadow another.

If you encounter this error, use the name= parameter on the affected steps to give each a distinct explicit name.
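
If you want to catch collisions before building the pipeline, for example in a unit test, a check along these lines may help. It is a standalone sketch: check_unique_step_names is a hypothetical helper that operates on a plain list of names, and the error message is not the one build_compute() produces.

```python
from collections import Counter


# Hypothetical pre-flight check, independent of datapipe.
def check_unique_step_names(names):
    """Raise ValueError if any step name occurs more than once."""
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates:
        raise ValueError(f"Duplicate step names: {', '.join(duplicates)}")


# Distinct names pass silently.
check_unique_step_names(["resize_images", "sync_images"])
```

Calling it with a list such as ["sync", "sync"] raises ValueError and names the offending step.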

DatatableBatchTransform.inputs now accepts PipelineInput

Required and InputSpec wrappers can now be used in DatatableBatchTransform.inputs, matching the behaviour of BatchTransform. Existing code using plain table names or ORM table references continues to work unchanged.

Python 3.9 no longer supported

v0.15.0 uses T | None union syntax and built-in list[T] / dict[K, V] generics, which require Python 3.10 or later. Upgrade to Python 3.10+.
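
A deployment script can verify the interpreter before upgrading. The following is a minimal sketch using only the standard library; the function name supports_v015_syntax is hypothetical.

```python
import sys


# Hypothetical guard; the 3.10 threshold comes from the requirement stated above.
def supports_v015_syntax(version_info=sys.version_info) -> bool:
    """Return True when the interpreter is Python 3.10 or later."""
    return tuple(version_info[:2]) >= (3, 10)


if supports_v015_syntax():
    # Evaluating a union with "|" at runtime works only on 3.10+;
    # on 3.9 this expression would raise TypeError.
    OptionalInt = int | None
```

The union expression is placed inside the guard so that the snippet itself still runs on older interpreters.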

Internal: DataTable.meta_table → DataTable.meta

If your code directly accessed DataTable.meta_table (a non-public attribute), rename it to DataTable.meta. The attribute now returns a TableMeta interface instead of the concrete SQLTableMeta class.
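
Code that must run against both versions during a gradual rollout could use a small accessor like the one below. This is a sketch under one assumption that you should verify: that a v0.14 DataTable has no attribute named meta. The helper get_table_meta is hypothetical, and the stand-in classes in the test exist only to exercise it.

```python
# Hypothetical compatibility shim, not part of datapipe.
def get_table_meta(data_table):
    """Return the table's metadata object on either v0.14 or v0.15."""
    if hasattr(data_table, "meta"):
        # v0.15 attribute name
        return data_table.meta
    # v0.14 attribute name (assumes "meta" did not exist in v0.14)
    return data_table.meta_table
```

Since the v0.15 attribute returns the TableMeta interface rather than SQLTableMeta, callers should rely only on methods that the interface exposes.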