Apache Iceberg

Learn how to use the Jikkou extension provider for Apache Iceberg — manage Iceberg namespaces, tables, and views as code.

This section explains how to configure and use the Apache Iceberg extension.

1 - Configuration

Learn how to configure the extensions for Apache Iceberg.

Here, you will find the configuration properties for the Apache Iceberg extension.

Configuration

The Apache Iceberg extension connects to an Iceberg catalog. You configure it through the Jikkou client configuration property jikkou.provider.iceberg.

Example (JDBC catalog with PostgreSQL):

jikkou {
  provider.iceberg {
    enabled = true
    type = io.streamthoughts.jikkou.iceberg.IcebergExtensionProvider
    config = {
      # Required — type of Iceberg catalog.
      # Accepted values: rest, hive, jdbc, glue, nessie, hadoop
      catalogType = "jdbc"

      # The catalog name used to identify this catalog instance.
      catalogName = "default"

      # The URI of the catalog endpoint (REST URL, Hive Metastore URI, JDBC URL, etc.)
      catalogUri = "jdbc:postgresql://localhost:5432/iceberg"

      # The warehouse root location (e.g., local path, S3 bucket path, HDFS path)
      warehouse = "/tmp/iceberg-warehouse"

      # Extra catalog-specific properties passed directly to CatalogUtil.buildIcebergCatalog()
      catalogProperties {
        jdbc.user     = "iceberg"
        jdbc.password = "iceberg"
      }

      # Enable verbose debug logging for catalog operations (default: false)
      debugLoggingEnabled = false
    }
  }
}

Configuration Properties

| Property | Type | Required | Default | Description |
|---|---|---|---|---|
| catalogType | String | yes | | Iceberg catalog type: rest, hive, jdbc, glue, nessie, hadoop |
| catalogName | String | no | default | The catalog instance name |
| catalogUri | String | no | | Catalog endpoint URI (REST API URL, Hive Metastore thrift URI, JDBC URL, Nessie server URL) |
| warehouse | String | no | | Warehouse root location (e.g., s3://bucket/warehouse, /tmp/iceberg) |
| catalogProperties | Map | no | | Additional catalog properties forwarded verbatim to the Iceberg CatalogUtil |
| debugLoggingEnabled | Boolean | no | false | Enable debug-level logging for catalog operations |

Catalog Types

JDBC Catalog (PostgreSQL)

Stores catalog metadata (namespaces, table specs) in a relational database. The PostgreSQL JDBC driver is bundled in the Jikkou CLI distribution.

config = {
  catalogType = "jdbc"
  catalogUri  = "jdbc:postgresql://localhost:5432/iceberg"
  warehouse   = "/tmp/iceberg-warehouse"
  catalogProperties {
    jdbc.user     = "iceberg"
    jdbc.password = "iceberg"
  }
}

REST Catalog

Connects to any Iceberg REST Catalog API (e.g., Polaris, Gravitino, Unity Catalog):

config = {
  catalogType = "rest"
  catalogUri  = "https://polaris.example.com/api/catalog"
  warehouse   = "s3://my-bucket/warehouse"
  catalogProperties {
    rest.signing-name   = "execute-api"
    rest.signing-region = "us-east-1"
  }
}

Hive Metastore

Connects to an Apache Hive Metastore (requires iceberg-hive-metastore on the classpath):

config = {
  catalogType = "hive"
  catalogUri  = "thrift://hive-metastore:9083"
  warehouse   = "hdfs://namenode:8020/user/hive/warehouse"
}

AWS Glue

Connects to AWS Glue Data Catalog (requires iceberg-aws on the classpath):

config = {
  catalogType = "glue"
  warehouse   = "s3://my-bucket/warehouse"
  catalogProperties {
    glue.region = "us-east-1"
  }
}

Nessie

Nessie exposes a standard Iceberg REST catalog endpoint at /iceberg. Using catalogType = "rest" is recommended because it relies only on iceberg-core (always bundled in the Jikkou CLI). The catalogType = "nessie" variant requires the optional iceberg-nessie JAR on the classpath.

# Recommended: use Nessie's built-in Iceberg REST endpoint
config = {
  catalogType = "rest"
  catalogUri  = "http://nessie:19120/iceberg"
  warehouse   = "s3://my-bucket/warehouse"
  catalogProperties {
    prefix = "main"   # Nessie branch
  }
}

Controller Settings

The table and view controllers expose additional options to control reconciliation behaviour. These are set inside the provider config block.

Table Controller

| Property | Type | Default | Description |
|---|---|---|---|
| delete-orphans | Boolean | false | Drop tables that exist in the catalog but are not defined in any resource |
| delete-orphan-columns | Boolean | false | Drop columns present in the live table but absent from the spec |
| delete-purge | Boolean | false | Purge underlying data files when dropping a table (irreversible) |
| tables.deletion.exclude | List<Pattern> | [] | Regex patterns; matching table names are never deleted |

View Controller

| Property | Type | Default | Description |
|---|---|---|---|
| delete-orphans | Boolean | false | Drop views that exist in the catalog but are not defined in any resource |
| views.deletion.exclude | List<Pattern> | [] | Regex patterns; matching view names are never deleted |

Example:

jikkou {
  provider.iceberg {
    enabled = true
    type = io.streamthoughts.jikkou.iceberg.IcebergExtensionProvider
    config = {
      catalogType = "rest"
      catalogUri  = "http://localhost:8181"
      warehouse   = "s3://my-bucket/warehouse"

      # Table reconciliation safety settings
      delete-orphans        = false
      delete-orphan-columns = false
      delete-purge          = false

      # Never delete tables whose name starts with "audit_"
      tables.deletion.exclude = ["^audit_.*"]

      # Never delete views whose name starts with "v_core_"
      views.deletion.exclude = ["^v_core_.*"]
    }
  }
}

2 - Resources

Learn how to use the built-in resources provided by the extensions for Apache Iceberg.

Here, you will find the list of resources supported by the extensions for Apache Iceberg.

Iceberg Resources

The Apache Iceberg extension provides the following resource types:

| Resource | Description |
|---|---|
| IcebergNamespace | Manage namespaces (databases) in an Iceberg catalog |
| IcebergTable | Manage tables with schema evolution, partitioning, and sort order |
| IcebergView | Manage SQL views backed by one or more dialect-specific queries |

2.1 - Iceberg Namespace

Learn how to manage Apache Iceberg Namespaces.

IcebergNamespace resources are used to define the namespaces (databases) you want to manage in your Iceberg catalog. A namespace groups tables and carries metadata as key/value properties.

IcebergNamespace

Specification

Here is the resource definition file for defining an IcebergNamespace.

apiVersion: "iceberg.jikkou.io/v1beta1"  # The api version (required)
kind: "IcebergNamespace"                  # The resource kind (required)
metadata:
  name: <namespace name>                  # Dot-separated namespace path (required)
  labels: { }
  annotations: { }
spec:
  properties:                             # Namespace-level metadata (optional)
    <key>: <value>

The metadata.name property is mandatory. Nested namespaces are expressed using dot notation — for example, analytics.events creates a namespace events inside analytics.

Example

file: iceberg-namespaces.yaml

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergNamespace"
metadata:
  name: "analytics"
spec:
  properties:
    owner: "data-team"
    environment: "production"
---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergNamespace"
metadata:
  name: "analytics.events"
spec:
  properties:
    owner: "data-team"
    team: "platform"

IcebergNamespaceList

If you need to define multiple namespaces (e.g., using a template), it may be easier to use an IcebergNamespaceList resource.

Specification

apiVersion: "iceberg.jikkou.io/v1beta1"  # The api version (required)
kind: "IcebergNamespaceList"              # The resource kind (required)
metadata: { }
items: [ ]                                # An array of IcebergNamespace

Example

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergNamespaceList"
items:
  - metadata:
      name: "analytics"
    spec:
      properties:
        owner: "data-team"
  - metadata:
      name: "analytics.events"
    spec:
      properties:
        owner: "platform-team"

2.2 - Iceberg Table

Learn how to manage Apache Iceberg Tables, including schema evolution.

IcebergTable resources are used to define the tables you want to manage in your Iceberg catalog. An IcebergTable resource defines the schema, partition layout, sort order, and table-level properties. Jikkou performs safe schema evolution (adding, renaming, updating, and dropping columns): these are metadata-only changes in Iceberg, so existing data files are not rewritten.

IcebergTable

Specification

Here is the resource definition file for defining an IcebergTable.

apiVersion: "iceberg.jikkou.io/v1beta1"  # The api version (required)
kind: "IcebergTable"                      # The resource kind (required)
metadata:
  name: <namespace>.<table>              # Fully qualified table name (required)
  labels: { }
  annotations: { }
spec:
  location: <storage path>               # Override the default table location (optional)
  schema:
    identifierFields:                    # Primary-key columns for MERGE/UPSERT semantics (optional)
      - <column name>
    columns:                             # Ordered list of columns (required)
      - name: <column name>              # Column name (required)
        type: <type>                     # Column type (required) — see type reference below
        required: <true|false>           # Whether the column is non-nullable (default: false)
        doc: <description>              # Documentation string (optional)
        default: <value>                 # Initial default value, immutable after creation (optional)
        writeDefault: <value>            # Write-default for absent values, updatable (optional)
        previousName: <old name>         # Triggers a safe rename instead of drop+add (optional)
  partitionFields:                       # Partition layout (optional)
    - sourceColumn: <column>             # Source column to partition on (required)
      transform: <transform>             # Partition transform (required) — see transforms below
      name: <partition field name>       # Custom partition field name (optional)
  sortFields:                            # Default write sort order (optional)
    - column: <column>                   # Column name (mutually exclusive with term)
      term: <expression>                 # Sort expression e.g. bucket[16](user_id) (mutually exclusive with column)
      direction: <asc|desc>              # Sort direction (default: asc)
      nullOrder: <first|last>            # Null placement (default: last)
  properties:                            # Table-level metadata properties (optional)
    <key>: <value>

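As a minimal sketch of the less common fields in the specification above, the fragment below combines a column default, a write-default, and a sort field expressed with term instead of column. The column names and default values are hypothetical, and whether defaults are accepted may depend on the catalog and the table format version.

```yaml
spec:
  schema:
    columns:
      - name: "user_id"              # hypothetical column
        type: "long"
        required: true
      - name: "status"               # hypothetical column
        type: "string"
        default: "unknown"           # initial default, immutable after creation
        writeDefault: "unknown"      # write-default, can be updated later
  sortFields:
    - term: "bucket[16](user_id)"    # sort expression, used instead of column
      direction: "asc"
      nullOrder: "first"
```

Because column and term are mutually exclusive, each sort field sets exactly one of them.
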
Column Types

Jikkou maps a set of type strings to the underlying Iceberg types:

| Type string | Iceberg type | Notes |
|---|---|---|
| boolean | BooleanType | |
| int / integer | IntegerType | |
| long | LongType | |
| float | FloatType | |
| double | DoubleType | |
| date | DateType | |
| time | TimeType | |
| timestamp | TimestampType (without tz) | |
| timestamptz | TimestampType (with tz) | |
| string | StringType | |
| uuid | UUIDType | |
| binary | BinaryType | |
| fixed[N] | FixedType(N) | e.g. fixed[16] |
| decimal(P,S) | DecimalType(P,S) | e.g. decimal(18,2) |

Complex types (struct, list, map) can be expressed as nested objects using the following format:

Struct:

type:
  type: "struct"
  fields:
    - name: "field_name"
      type: "string"       # Any type (primitive or nested)
      required: true        # default: false
      doc: "description"    # optional

List:

type:
  type: "list"
  elementType: "string"     # Any type (primitive or nested)
  elementRequired: false    # default: false

Map:

type:
  type: "map"
  keyType: "string"         # Any type (primitive or nested)
  valueType: "long"         # Any type (primitive or nested)
  valueRequired: false      # default: false
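
As a sketch of how the nested formats above fit inside a column definition, the hypothetical table below declares one struct column, one list column, and one map column. The table name and all column names are invented for illustration.

```yaml
---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergTable"
metadata:
  name: "analytics.events.user_profiles"   # hypothetical table name
spec:
  schema:
    columns:
      - name: "user_id"
        type: "long"
        required: true
      - name: "address"                     # struct column
        type:
          type: "struct"
          fields:
            - name: "city"
              type: "string"
              required: true
            - name: "postal_code"
              type: "string"
      - name: "tags"                        # list column
        type:
          type: "list"
          elementType: "string"
          elementRequired: false
      - name: "counters"                    # map column
        type:
          type: "map"
          keyType: "string"
          valueType: "long"
          valueRequired: false
```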

Partition Transforms

| Transform | Example | Description |
|---|---|---|
| identity | identity | Partition by the exact column value |
| year | year | Extract year from a date/timestamp column |
| month | month | Extract year-month |
| day | day | Extract calendar date |
| hour | hour | Extract date-hour |
| bucket[N] | bucket[16] | Hash-bucket into N buckets |
| truncate[W] | truncate[8] | Truncate string or integer to width W |
| void | void | Always-null partition (marks dropped fields); not yet supported |
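
As a short sketch, the fragment below uses two transforms from the table above together with a custom partition field name. The column names and the partition field name are hypothetical.

```yaml
partitionFields:
  - sourceColumn: "event_time"     # hypothetical timestamp column
    transform: "hour"              # one partition per date-hour
  - sourceColumn: "page_url"       # hypothetical string column
    transform: "truncate[8]"       # truncate the value to width 8
    name: "page_url_prefix"        # optional custom partition field name
```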

Schema Evolution

Jikkou applies schema changes in a safe, deterministic order:

  1. Incompatible change check — verify annotation before proceeding
  2. Renames — processed first to preserve Iceberg field IDs
  3. Column additions — new columns appended
  4. Column updates — type promotion, documentation, nullability, write-default changes
  5. Column deletions — processed after updates to avoid conflicts
  6. Identifier field changes
  7. Schema commit — all schema changes are committed atomically
  8. Partition spec replacement
  9. Sort order replacement
  10. Properties update
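
As an illustration of the ordering above, the sketch below shows a desired-state column list that would exercise several steps in one reconciliation. It assumes a hypothetical live table whose columns are user_id (long, required), clicks (int), and legacy_flag (boolean); none of these names come from Jikkou itself.

```yaml
spec:
  schema:
    columns:
      - name: "user_identifier"
        previousName: "user_id"    # step 2: rename, the Iceberg field ID is preserved
        type: "long"
        required: true
      - name: "clicks"
        type: "long"               # step 4: int to long is a safe type promotion
      - name: "country"            # step 3: new column, appended to the schema
        type: "string"
      # legacy_flag is omitted: it is dropped in step 5 only when
      # delete-orphan-columns is enabled in the table controller settings
```

The column changes are then committed together in step 7.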

Safe Column Rename

To rename a column without losing its Iceberg field ID (which would break existing readers), set the previousName field to the old column name:

columns:
  - name: "user_identifier"   # new name
    previousName: "user_id"   # old name — triggers a rename, not drop+add
    type: "long"
    required: true

Incompatible Changes

By default, Jikkou rejects type changes that are not safe promotions (e.g., int → long is safe; string → int is not). To allow incompatible changes on a specific resource, set the annotation iceberg.jikkou.io/allow-incompatible-changes: "true".


Examples

Simple table with day partitioning

file: iceberg-page-views.yaml

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergTable"
metadata:
  name: "analytics.events.page_views"
spec:
  schema:
    columns:
      - name: "event_id"
        type: "uuid"
        required: true
        doc: "Unique event identifier"
      - name: "user_id"
        type: "long"
        required: true
        doc: "The user who triggered the event"
      - name: "page_url"
        type: "string"
        required: true
        doc: "URL of the viewed page"
      - name: "event_time"
        type: "timestamptz"
        required: true
        doc: "Timestamp when the event occurred (UTC)"
      - name: "duration_ms"
        type: "long"
        doc: "Time spent on the page in milliseconds"
  partitionFields:
    - sourceColumn: "event_time"
      transform: "day"
  sortFields:
    - column: "event_time"
      direction: "asc"
      nullOrder: "last"
  properties:
    write.format.default: "parquet"
    write.parquet.compression-codec: "zstd"

Table with bucket partitioning and identifier fields

file: iceberg-orders.yaml

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergTable"
metadata:
  name: "analytics.events.orders"
  annotations:
    iceberg.jikkou.io/allow-incompatible-changes: "false"
spec:
  schema:
    columns:
      - name: "order_id"
        type: "long"
        required: true
        doc: "Unique order identifier"
      - name: "customer_id"
        type: "long"
        required: true
        doc: "Customer who placed the order"
      - name: "order_date"
        type: "date"
        required: true
        doc: "Date the order was placed"
      - name: "status"
        type: "string"
        required: true
        doc: "Order status"
      - name: "total_amount"
        type: "decimal(18,2)"
        required: true
        doc: "Total order amount"
    identifierFields:
      - "order_id"
  partitionFields:
    - sourceColumn: "order_date"
      transform: "month"
    - sourceColumn: "customer_id"
      transform: "bucket[16]"
  sortFields:
    - column: "order_date"
      direction: "desc"
      nullOrder: "last"
    - column: "customer_id"
      direction: "asc"
      nullOrder: "last"
  properties:
    write.format.default: "parquet"

IcebergTableList

If you need to define multiple tables (e.g., using a template), it may be easier to use an IcebergTableList resource.

Specification

apiVersion: "iceberg.jikkou.io/v1beta1"  # The api version (required)
kind: "IcebergTableList"                  # The resource kind (required)
metadata: { }
items: [ ]                                # An array of IcebergTable

Example

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergTableList"
items:
  - metadata:
      name: "analytics.raw.clicks"
    spec:
      schema:
        columns:
          - name: "click_id"
            type: "uuid"
            required: true
          - name: "ts"
            type: "timestamptz"
            required: true
      partitionFields:
        - sourceColumn: "ts"
          transform: "day"

  - metadata:
      name: "analytics.raw.impressions"
    spec:
      schema:
        columns:
          - name: "impression_id"
            type: "uuid"
            required: true
          - name: "ts"
            type: "timestamptz"
            required: true
      partitionFields:
        - sourceColumn: "ts"
          transform: "day"

2.3 - Iceberg View

Learn how to manage Apache Iceberg Views.

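IcebergView resources are used to define SQL views in your Iceberg catalog. A view is a logical definition backed by one or more SQL queries, each written for a specific dialect. The output schema is inferred by the engine and is populated when existing views are collected from the catalog, so it does not need to be specified in resource files.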

IcebergView

Specification

Here is the resource definition file for defining an IcebergView.

apiVersion: "iceberg.jikkou.io/v1beta1"  # The api version (required)
kind: "IcebergView"                      # The resource kind (required)
metadata:
  name: <namespace>.<view>              # Fully qualified view name (required)
  labels: { }
  annotations: { }
spec:
  schema:                                # Output schema (read-only, inferred by the engine)
    columns:
      - name: <column name>
        type: <type>
        required: <true|false>
        doc: <description>
  queries:                               # SQL query definitions (at least one required)
    - sql: <SQL SELECT statement>        # The SQL defining the view (required)
      dialect: <dialect>                 # SQL dialect e.g. 'spark', 'trino', 'presto', 'hive' (required)
  defaultNamespace: <namespace>          # Default namespace for unqualified table references (optional)
  defaultCatalog: <catalog>              # Default catalog for unqualified table references (optional)
  properties:                            # View-level metadata properties (optional)
    <key>: <value>

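As a minimal sketch of defaultNamespace and defaultCatalog, the hypothetical view below references page_views without qualification and relies on the defaults to locate it. The view name and the catalog name are assumptions, and how unqualified names are resolved ultimately depends on the query engine reading the view.

```yaml
---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergView"
metadata:
  name: "analytics.events.recent_page_views"   # hypothetical view name
spec:
  queries:
    - sql: >-
        SELECT user_id, page_url, event_time
        FROM page_views
      dialect: "trino"
  defaultCatalog: "default"                     # assumed catalog name
  defaultNamespace: "analytics.events"          # namespace used for page_views
```
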
Examples

Simple view with daily aggregation

file: iceberg-daily-page-stats.yaml

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergView"
metadata:
  name: "analytics.events.daily_page_stats"
spec:
  queries:
    - sql: >-
        SELECT
            CAST(event_time AS DATE) AS view_date,
            page_url,
            COUNT(*) AS view_count,
            COUNT(DISTINCT user_id) AS unique_users
        FROM analytics.events.page_views
        GROUP BY CAST(event_time AS DATE), page_url
      dialect: "spark"
  defaultNamespace: "analytics.events"
  properties:
    comment: "Daily page view statistics aggregated from raw events"

View with multiple SQL dialects

file: iceberg-multi-dialect-view.yaml

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergView"
metadata:
  name: "analytics.events.active_users"
spec:
  queries:
    - sql: >-
        SELECT user_id, COUNT(*) AS event_count
        FROM analytics.events.page_views
        WHERE event_time >= current_date() - INTERVAL 30 DAYS
        GROUP BY user_id
      dialect: "spark"
    - sql: >-
        SELECT user_id, COUNT(*) AS event_count
        FROM analytics.events.page_views
        WHERE event_time >= current_date - INTERVAL '30' DAY
        GROUP BY user_id
      dialect: "trino"
  defaultNamespace: "analytics.events"
  properties:
    comment: "Users active in the last 30 days"

IcebergViewList

If you need to define multiple views (e.g., using a template), it may be easier to use an IcebergViewList resource.

Specification

apiVersion: "iceberg.jikkou.io/v1beta1"  # The api version (required)
kind: "IcebergViewList"                  # The resource kind (required)
metadata: { }
items: [ ]                               # An array of IcebergView

Example

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergViewList"
items:
  - metadata:
      name: "analytics.events.daily_page_stats"
    spec:
      queries:
        - sql: >-
            SELECT CAST(event_time AS DATE) AS view_date, page_url, COUNT(*) AS view_count
            FROM analytics.events.page_views
            GROUP BY CAST(event_time AS DATE), page_url
          dialect: "spark"
      defaultNamespace: "analytics.events"

  - metadata:
      name: "analytics.events.active_users"
    spec:
      queries:
        - sql: >-
            SELECT user_id, COUNT(*) AS event_count
            FROM analytics.events.page_views
            WHERE event_time >= current_date() - INTERVAL 30 DAYS
            GROUP BY user_id
          dialect: "spark"
      defaultNamespace: "analytics.events"

3 - Annotations

Learn how to use the metadata annotations provided by the extensions for Apache Iceberg.

Here, you will find information about the annotations provided by the Apache Iceberg extension for Jikkou.

List of built-in annotations

iceberg.jikkou.io/allow-incompatible-changes

Controls whether incompatible schema changes are allowed when reconciling an IcebergTable.

By default, Jikkou rejects type changes that cannot be safely promoted (for example, changing a column from string to int). Setting this annotation to "true" on a specific table resource lifts that restriction for that resource only.

metadata:
  annotations:
    iceberg.jikkou.io/allow-incompatible-changes: "true"

iceberg.jikkou.io/namespace-location

Read-only annotation populated by the namespace collector. Contains the storage location of a namespace as reported by the catalog. This annotation is set automatically when collecting existing namespaces and does not need to be specified in resource files.
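
For illustration, a collected namespace might look like the sketch below. The location value is hypothetical; the actual value depends on the catalog and warehouse configuration.

```yaml
---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergNamespace"
metadata:
  name: "analytics"
  annotations:
    # Set by the collector, not by the user (hypothetical value)
    iceberg.jikkou.io/namespace-location: "s3://my-bucket/warehouse/analytics"
spec:
  properties:
    owner: "data-team"
```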

iceberg.jikkou.io/table-location

Reserved annotation for the storage location of a table. Currently, the table location is stored in the spec.location field rather than as an annotation. This annotation key is defined for future use.

iceberg.jikkou.io/view-location

Read-only annotation populated by the view collector. Contains the storage location of a view as reported by the catalog. This annotation is set automatically when collecting existing views and does not need to be specified in resource files.