Identity Columns in Fabric Runtime 2.0: From Workarounds to First‑Class Support
Reliable surrogate keys for modern lakehouse workloads
As data platforms evolve from traditional data warehouses to open lakehouse architectures, one requirement has remained constant: the need for stable, unique surrogate keys. Fabric Runtime 2.0, built on Apache Spark 4.x and Delta Lake 4.x, now provides native identity column support—bringing a familiar but modernized capability into large-scale, distributed analytics.
This post explains why identity columns matter, how they work in Fabric Runtime 2.0, and how customers can migrate safely from older Spark‑based patterns.
Why surrogate keys still matter
In traditional data warehouses, tables typically rely on surrogate keys—system-generated, meaningless numeric identifiers used to uniquely identify rows. Unlike business keys (also called natural keys), surrogate keys carry no domain meaning. That is precisely what makes them valuable.
Surrogate keys are critical for:
- Dimension tables and fact–dimension joins — They provide a stable join key that does not change when business attributes evolve.
- Slowly Changing Dimensions (SCDs) — SCD Type 2 patterns depend on unique row-level identifiers to track historical versions of a dimension record.
- Referential integrity across datasets — Surrogate keys decouple table relationships from volatile business identifiers.
- Stable identifiers independent of business logic — When customers merge, products are renamed, or codes are reissued, surrogate keys remain unchanged.
Beyond stability, surrogate keys are typically integers (BIGINT), which occupy far less storage than alphanumeric business keys like GUIDs or composite natural keys. Smaller key columns mean more compact indexes, faster joins, and reduced shuffle overhead—advantages that compound at lakehouse scale where tables span billions of rows.
While business keys carry meaning, they can change. Surrogate keys provide stability and performance, especially at scale. In large-scale lakehouse environments, where tables span billions of rows and multiple writers operate concurrently, surrogate key generation needs to be safe, transactional, and automatic.
What is an identity column?
An identity column is a table column whose values are automatically generated by the system at write time. Each new row is assigned a unique numeric value without requiring the writer to supply one.
In Delta Lake, identity columns:
- Are generated transactionally and safely — values are assigned during the commit protocol, not during Spark task execution.
- Guarantee uniqueness without Spark-side logic — the Delta transaction layer handles ID allocation, eliminating the risk of duplicates from retries or re-executions.
- Do not depend on Spark execution plans or partitioning — the identity mechanism operates at the Delta transaction layer, not at the Spark task level.
- Are persisted as normal columns — once written, they behave like any other column for reads, joins, filters, and aggregations.
- Only support the BIGINT data type.
Identity columns are designed to be surrogate keys, not business keys. Delta Lake supports creating identity columns in two modes:
- GENERATED ALWAYS AS IDENTITY — The system always assigns the value. Manual inserts into the column are rejected.
- GENERATED BY DEFAULT AS IDENTITY — The system assigns a value only when the writer does not supply one. This is useful during migration when you need to preserve existing IDs for some rows.
Why identity column support was not originally available in Spark
If identity columns have been a standard feature in relational databases for decades, why did it take so long for Spark and Delta Lake to support them?
The answer lies in architecture. Traditional databases execute writes through a single, centralized engine that can trivially maintain a counter or sequence. Spark, by contrast, distributes writes across hundreds or thousands of parallel tasks running on different executors. There is no central coordinator handing out sequential IDs during task execution.
Because of this, Spark workloads historically relied on workarounds:
- monotonically_increasing_id() — Generates IDs during Spark execution, but values are non-deterministic across retries, not unique across concurrent jobs, and tied to partition layout. If a task retries or a job is re-executed, different IDs are produced for the same logical data.
- row_number() with window functions — Produces ordinal values within a Spark DataFrame, but recalculates on every execution, which hurts performance and provides no cross-job uniqueness.
These approaches break under concurrency, retries, and scale — exactly the conditions production lakehouse workloads operate in.
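A quick way to see the fragility: monotonically_increasing_id() encodes the partition ID in the upper 31 bits of the value, so IDs jump by roughly 8.6 billion between partitions and change whenever the partition layout does. A minimal illustration:

from pyspark.sql.functions import monotonically_increasing_id, spark_partition_id

# Each partition starts its IDs at (partition_id << 33), so the values
# depend entirely on how the data happens to be partitioned.
df = spark.range(6).repartition(3)
df.select(spark_partition_id().alias("partition"),
          monotonically_increasing_id().alias("id")).show()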
Identity columns were introduced in Delta Lake 3.3 and matured in Delta Lake 4, where the ID allocation was moved from the Spark execution layer into the Delta transaction protocol. This means IDs are assigned during the atomic commit, not during task execution, making them safe for distributed, concurrent, and retriable workloads. This capability is fully supported in Fabric Runtime 2.0 (Spark 4.x + Delta 4.x).
How identity columns work in Fabric Runtime 2.0
Fabric Runtime 2.0 natively supports identity columns when creating Delta tables through the PySpark or Scala DeltaTable builder APIs.
Creating a table with an identity column
Creating a dimension table with an identity column is straightforward:
from delta.tables import DeltaTable, IdentityGenerator
DeltaTable.createIfNotExists(spark) \
.tableName("dbo.dim_sales") \
.addColumn("id", "BIGINT", generatedAlwaysAs=IdentityGenerator()) \
.addColumn("product_name", "STRING") \
.addColumn("category", "STRING") \
.execute()
The start and step parameters of IdentityGenerator are optional and both default to 1. You can customize them to control the starting value and the step size; the step can be positive or negative, but not 0. The following example starts at 10 and increments by 10 for each new row:
from delta.tables import DeltaTable, IdentityGenerator
DeltaTable.createIfNotExists(spark) \
.tableName("dbo.dim_sales") \
.addColumn("id", "BIGINT", generatedAlwaysAs=IdentityGenerator(start=10, step=10)) \
.addColumn("product_name", "STRING") \
.addColumn("category", "STRING") \
.execute()
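Because the earlier example already created dbo.dim_sales, this snippet targets a separate table so the custom settings take effect. With start=10 and step=10, the first three rows inserted into it would receive ids 10, 20, and 30.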
Insert rows without specifying the id column—Delta Lake assigns values automatically:
%%sql
INSERT INTO dbo.dim_sales (product_name, category)
VALUES ('Laptop', 'Electronics'), ('Chair', 'Furniture'), ('Camera', 'Electronics');
Query the table to see the generated IDs:
%%sql
SELECT * FROM dbo.dim_sales ORDER BY id;
| id | product_name | category |
|---|---|---|
| 1 | Laptop | Electronics |
| 2 | Chair | Furniture |
| 3 | Camera | Electronics |
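Because the id column was declared GENERATED ALWAYS, attempting to supply a value for it is rejected, protecting the key from accidental overwrites. For example, the following insert fails (the exact error message varies by Delta version):

%%sql
-- Fails: id is GENERATED ALWAYS AS IDENTITY and cannot be set manually
INSERT INTO dbo.dim_sales (id, product_name, category)
VALUES (99, 'Desk', 'Furniture');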
Using GENERATED BY DEFAULT for migration scenarios
When migrating from an existing table that already has surrogate keys, use BY DEFAULT mode to preserve existing IDs while letting the system generate new ones for future inserts:
from delta.tables import DeltaTable, IdentityGenerator
DeltaTable.createIfNotExists(spark) \
.tableName("dbo.dim_customer") \
.addColumn("customer_id", "BIGINT", generatedByDefaultAs=IdentityGenerator(start=3, step=1)) \
.addColumn("name", "STRING") \
.addColumn("country", "STRING") \
.execute()
You can use Spark SQL to insert the data:
%%sql
-- Migrate existing data with original IDs intact
INSERT INTO dbo.dim_customer (customer_id, name, country)
VALUES (1, 'Alice', 'US'), (2, 'Bob', 'DE');
-- New rows get system-generated IDs
INSERT INTO dbo.dim_customer (name, country)
VALUES ('Carlos', 'MX');
Here is the result. Because the table was created with IdentityGenerator(start=3, step=1) and the first two rows were inserted with explicit IDs (1 and 2), the auto-generated ID for Carlos is 3:
%%sql
SELECT * FROM dbo.dim_customer ORDER BY customer_id;
| customer_id | name | country |
|---|---|---|
| 1 | Alice | US |
| 2 | Bob | DE |
| 3 | Carlos | MX |
If you are a Scala developer, here is the equivalent Scala API for the same pattern:
%%spark
import io.delta.tables.DeltaTable
DeltaTable.create(spark)
.tableName("dbo.dim_customer_scala")
.addColumn(
DeltaTable.columnBuilder(spark, "customer_id")
.dataType("BIGINT")
.generatedByDefaultAsIdentity(3, 1)
.build()
)
.addColumn("name", "STRING")
.addColumn("country", "STRING")
.execute()
Before and after: identity columns in practice
The biggest practical improvement is how much simpler your write pipelines become. Here is a side-by-side comparison.
Before — Spark-side workaround (fragile):
from pyspark.sql.functions import monotonically_increasing_id
df = spark.read.format("csv").load("/data/products.csv")
# Manually generate IDs — non-deterministic, unsafe across retries
df = df.withColumn("id", monotonically_increasing_id())
df.write.mode("append").saveAsTable("dbo.dim_product")
After — identity column (transactionally safe):
# Table created once with: GENERATED ALWAYS AS IDENTITY on the id column
df = spark.read.format("csv").load("/data/products.csv")
# No ID generation needed — Delta Lake handles it at commit time
df.write.mode("append").saveAsTable("dbo.dim_product")
The writer no longer needs to know about ID generation at all. The ID column is invisible to the write path and guaranteed unique by the Delta transaction layer.
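A typical follow-on step is resolving surrogate keys when loading facts: join staged rows to the dimension on the business key and let the generated ID flow into the fact table. A sketch with hypothetical staging_sales and dbo.fact_sales tables and a product_name business key:

%%sql
INSERT INTO dbo.fact_sales (product_sk, quantity)
SELECT d.id, s.quantity
FROM staging_sales s
JOIN dbo.dim_product d
  ON d.product_name = s.product_name;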
Migration guidance: moving from existing Spark patterns
If your pipelines already use Spark-side ID generation, here is how to transition to identity columns.
Migrating from monotonically_increasing_id()
This is the most common pattern to replace. The typical existing code looks like:
from pyspark.sql.functions import monotonically_increasing_id
df = df.withColumn("id", monotonically_increasing_id())
df.write.mode("append").saveAsTable("dbo.dim_orders")
Migration steps:
- Recreate the target table with an identity column (GENERATED ALWAYS AS IDENTITY).
- Remove the withColumn("id", monotonically_increasing_id()) call from your pipeline code.
- Write the DataFrame without the ID column—Delta Lake handles it.
- Validate that downstream joins and lookups work correctly with the new ID values (a quick check is shown below).
The result is simpler code, safer IDs, and no risk of duplicates across retries or concurrent writes.
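For the validation step, a duplicate check on the new key is a good start (a sketch against the example table):

%%sql
-- Should return zero rows if the generated IDs are unique
SELECT id, COUNT(*) AS cnt
FROM dbo.dim_orders
GROUP BY id
HAVING COUNT(*) > 1;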
Migrating from row_number()
If your pipeline uses window functions to assign sequential IDs:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
window = Window.orderBy("created_at")
df = df.withColumn("id", row_number().over(window))
Migration steps:
- If you only need uniqueness (not strict ordering), use GENERATED ALWAYS AS IDENTITY and remove the window function entirely.
- If you need to preserve existing IDs during a transition period, use GENERATED BY DEFAULT AS IDENTITY and insert historical data with explicit ID values.
- Do not rely on identity columns for deterministic ordering — they guarantee uniqueness, not sequence continuity or sort order (see the note below).
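If downstream consumers depended on the row_number ordering, keep the ordering column and sort on it explicitly; the identity value only guarantees uniqueness. A sketch, assuming created_at is persisted to the table:

%%sql
SELECT * FROM dbo.dim_orders ORDER BY created_at, id;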
General migration checklist
- Identify all tables using Spark-side ID generation
- Determine whether existing ID values need to be preserved
- Choose ALWAYS (clean break) or BY DEFAULT (gradual migration)
- Update write logic to exclude the ID column from DataFrames
- Validate downstream consumers (joins, lookups, BI reports)
- Remove deprecated ID generation code from notebooks and pipelines
What identity columns are — and what they are not
| Identity columns are for | Identity columns are not for |
|---|---|
| Surrogate keys | Business-meaningful identifiers |
| Dimension and fact table joins | Deterministic row ordering |
| SCD Type 2 row versioning | Gap-free sequences |
| Stable cross-table references | Sort keys or partition keys |
Identity columns guarantee uniqueness, not contiguity. Gaps can occur due to transaction rollbacks, failed writes, or concurrent inserts. This is the same behavior as identity columns in SQL Server, PostgreSQL, and other relational databases.
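Gaps are easy to observe and harmless. If the key span exceeds the row count, some values were simply skipped (a quick sanity check, not something pipelines need to enforce):

%%sql
SELECT COUNT(*) AS row_count,
       MAX(id) - MIN(id) + 1 AS key_span
FROM dbo.dim_sales;
-- key_span > row_count just means the sequence has gaps; it is not an error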
Practical example: SCD Type 2 with identity columns
Slowly Changing Dimension Type 2 is one of the most common patterns that benefits from identity columns. Each historical version of a dimension record needs its own unique surrogate key, while the business key stays the same across versions.
from delta.tables import DeltaTable, IdentityGenerator
# Create the SCD2 dimension table
DeltaTable.createIfNotExists(spark) \
.tableName("dbo.dim_customer_scd2") \
.addColumn("sk", "BIGINT", generatedAlwaysAs=IdentityGenerator()) \
.addColumn("customer_id", "STRING") \
.addColumn("name", "STRING") \
.addColumn("city", "STRING") \
.addColumn("valid_from", "DATE") \
.addColumn("valid_to", "DATE") \
.addColumn("is_current", "BOOLEAN") \
.execute()
When a customer’s city changes, you expire the old row and insert a new one — both get unique surrogate keys without any manual ID management:
from pyspark.sql.functions import lit, current_date, col
from datetime import date
# Insert the new version — sk is assigned automatically
new_version = spark.createDataFrame([
("C100", "Alice", "Seattle", date.today(), True)
], ["customer_id", "name", "city", "valid_from", "is_current"]) \
.withColumn("valid_to", lit(None).cast("date"))
new_version.write.mode("append").saveAsTable("dbo.dim_customer_scd2")
You can then use Spark SQL to make changes to this record, including expiring it when the next version arrives:
%%sql
UPDATE dbo.dim_customer_scd2
SET is_current = false, valid_to = current_date()
WHERE customer_id = 'C100' AND is_current = true
Without identity columns, you would need to query the max existing ID, increment it, and handle race conditions manually. With identity columns, the surrogate key is handled entirely by the Delta transaction layer.
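For contrast, the manual pattern this replaces would look roughly like the sketch below, on a hypothetical table without an identity column (dbo.dim_customer_scd2_manual); it is shown only to illustrate the race condition:

from pyspark.sql.functions import lit

# Fragile: read the current max, increment, and hope no other writer is
# doing the same thing at the same moment (classic race condition).
max_sk = spark.sql("SELECT COALESCE(MAX(sk), 0) AS m FROM dbo.dim_customer_scd2_manual").first()["m"]
new_version = new_version.withColumn("sk", lit(max_sk + 1))
new_version.write.mode("append").saveAsTable("dbo.dim_customer_scd2_manual")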
Key limitations of identity columns in Delta Lake
- Concurrent transactions are not supported — Declaring an identity column on a Delta table disables concurrent transactions. Only use identity columns on tables where concurrent writes are not required.
- Cannot partition by an identity column — Identity columns cannot be used as partition keys.
- Cannot alter identity columns — You cannot use ALTER TABLE to add, replace, or change an identity column after table creation.
- Cannot update identity values in place — To change an identity value, you must delete the row and insert it as a new record.
- CTAS is not supported — You cannot define identity columns using CREATE TABLE ... AS SELECT. Create the table first with the identity column definition, then use INSERT to populate it (see the sketch after this list).
- BIGINT only — Identity columns only support the BIGINT data type.
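For the CTAS limitation, the pattern is create-then-insert: define the table with its identity column via the builder API, then populate it with a regular INSERT ... SELECT. A sketch reusing the dim_sales table from earlier, with a hypothetical staging_products source:

%%sql
-- Not supported: CREATE TABLE ... AS SELECT with an identity column.
-- Instead, after creating dbo.dim_sales with the builder API:
INSERT INTO dbo.dim_sales (product_name, category)
SELECT product_name, category FROM staging_products;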
Why this matters in Fabric Runtime 2.0
Before identity column support, Fabric Spark users had to choose between fragile Spark-side workarounds and complex custom logic to generate surrogate keys safely. Fabric Runtime 2.0 removes that tradeoff.
With identity columns integrated at the Delta Lake transaction layer:
- Pipeline code is simpler — No more monotonically_increasing_id() or window function workarounds.
- ID generation is transactionally safe — Values are assigned at the commit layer, eliminating duplicates from retries or re-executions.
- Retries are safe — Failed and retried tasks do not produce duplicate IDs.
- Migration from warehouse patterns is easier — Teams moving from SQL Server, Synapse, or other warehouse platforms find a familiar surrogate key pattern in their lakehouse.
Note that tables with identity columns do not support concurrent transactions. For tables that require concurrent writes from multiple jobs or pipelines, consider alternative surrogate key strategies.
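One such alternative is a deterministic hash of the business key: no coordination is required between writers, at the cost of non-sequential values and a small theoretical collision risk. A minimal sketch, applied to a DataFrame df holding the incoming rows:

from pyspark.sql.functions import xxhash64

# Deterministic 64-bit key derived from the business key; safe under
# concurrency because no writer needs to see the others' allocations.
df = df.withColumn("customer_sk", xxhash64("customer_id"))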
This is one of several capabilities in Runtime 2.0 that bring production-grade reliability to patterns that previously required custom engineering.
OSS Delta Lake 4.x does not yet support SQL DDL (CREATE TABLE ... GENERATED ALWAYS AS IDENTITY) for creating tables with identity columns. Since Fabric Runtime 2.0 is built on Delta Lake 4.x, this applies to Fabric as well.
To create tables with identity columns, you must use the programmatic DeltaTable builder API — IdentityGenerator in Python or generatedByDefaultAsIdentity/generatedAlwaysAsIdentity in Scala — as shown in the examples throughout this post.
Once the table is created, standard SQL DML statements (INSERT, SELECT, UPDATE, DELETE) work normally against tables with identity columns.
Key takeaways
- Identity columns generate unique surrogate keys at the Delta transaction layer — no Spark-side workarounds needed.
- Use GENERATED ALWAYS AS IDENTITY for new tables; use GENERATED BY DEFAULT when migrating existing IDs.
- They replace monotonically_increasing_id() and row_number() patterns with a safer, simpler alternative.
- Tables with identity columns do not support concurrent transactions — plan accordingly.
- Delta Lake 4.x (including Fabric Runtime 2.0) requires the DeltaTable builder API to create identity columns — SQL DDL is not supported.
Final thoughts
Identity columns may seem like a small feature on the surface, but they address a long-standing gap in lakehouse data engineering. With Fabric Runtime 2.0, surrogate keys belong where they should — in the table layer, not the execution engine.
For teams building dimension models, SCD pipelines, or any workload that needs stable unique identifiers at scale, identity columns in Fabric Spark provide a production-safe, transactionally sound solution with simpler code, out of the box.
Note: This blog post was written with assistance from AI