Here, you will find information on using the Apache Iceberg extension.
Here, you will find the configuration properties for the Apache Iceberg extension.
The Apache Iceberg extension connects to an Iceberg catalog. You configure it through the Jikkou
client configuration property jikkou.provider.iceberg.
Example (JDBC catalog with PostgreSQL):
jikkou {
  provider.iceberg {
    enabled = true
    type = io.streamthoughts.jikkou.iceberg.IcebergExtensionProvider
    config = {
      # Required: type of Iceberg catalog.
      # Accepted values: rest, hive, jdbc, glue, nessie, hadoop
      catalogType = "jdbc"
      # The catalog name used to identify this catalog instance.
      catalogName = "default"
      # The URI of the catalog endpoint (REST URL, Hive Metastore URI, JDBC URL, etc.)
      catalogUri = "jdbc:postgresql://localhost:5432/iceberg"
      # The warehouse root location (e.g., local path, S3 bucket path, HDFS path)
      warehouse = "/tmp/iceberg-warehouse"
      # Extra catalog-specific properties passed directly to CatalogUtil.buildIcebergCatalog()
      catalogProperties {
        jdbc.user = "iceberg"
        jdbc.password = "iceberg"
      }
      # Enable verbose debug logging for catalog operations (default: false)
      debugLoggingEnabled = false
    }
  }
}

| Property | Type | Required | Default | Description |
|---|---|---|---|---|
| catalogType | String | yes | - | Iceberg catalog type: rest, hive, jdbc, glue, nessie, hadoop |
| catalogName | String | no | default | The catalog instance name |
| catalogUri | String | no | - | Catalog endpoint URI (REST API URL, Hive Metastore thrift URI, JDBC URL, Nessie server URL) |
| warehouse | String | no | - | Warehouse root location (e.g., s3://bucket/warehouse, /tmp/iceberg) |
| catalogProperties | Map | no | - | Additional catalog properties forwarded verbatim to the Iceberg CatalogUtil |
| debugLoggingEnabled | Boolean | no | false | Enable debug-level logging for catalog operations |

The JDBC catalog stores catalog metadata (namespaces, table specs) in a relational database. The PostgreSQL JDBC driver is bundled in the Jikkou CLI distribution.

config = {
  catalogType = "jdbc"
  catalogUri = "jdbc:postgresql://localhost:5432/iceberg"
  warehouse = "/tmp/iceberg-warehouse"
  catalogProperties {
    jdbc.user = "iceberg"
    jdbc.password = "iceberg"
  }
}

The REST catalog connects to any Iceberg REST Catalog API (e.g., Polaris, Gravitino, Unity Catalog):

config = {
  catalogType = "rest"
  catalogUri = "https://polaris.example.com/api/catalog"
  warehouse = "s3://my-bucket/warehouse"
  catalogProperties {
    rest.signing-name = "execute-api"
    rest.signing-region = "us-east-1"
  }
}

The Hive catalog connects to an Apache Hive Metastore (requires iceberg-hive-metastore on the classpath):

config = {
  catalogType = "hive"
  catalogUri = "thrift://hive-metastore:9083"
  warehouse = "hdfs://namenode:8020/user/hive/warehouse"
}

The Glue catalog connects to the AWS Glue Data Catalog (requires iceberg-aws on the classpath):

config = {
  catalogType = "glue"
  warehouse = "s3://my-bucket/warehouse"
  catalogProperties {
    glue.region = "us-east-1"
  }
}

Nessie exposes a standard Iceberg REST catalog endpoint at /iceberg. Using
catalogType = "rest" is recommended because it relies only on iceberg-core
(always bundled in the Jikkou CLI). The catalogType = "nessie" variant requires
the optional iceberg-nessie JAR on the classpath.

# Recommended: use Nessie's built-in Iceberg REST endpoint
config = {
  catalogType = "rest"
  catalogUri = "http://nessie:19120/iceberg"
  warehouse = "s3://my-bucket/warehouse"
  catalogProperties {
    prefix = "main"  # Nessie branch
  }
}

The table and view controllers expose additional options to control reconciliation behaviour.
These are set inside the provider config block.

Table controller options:

| Property | Type | Default | Description |
|---|---|---|---|
| delete-orphans | Boolean | false | Drop tables that exist in the catalog but are not defined in any resource |
| delete-orphan-columns | Boolean | false | Drop columns present in the live table but absent from the spec |
| delete-purge | Boolean | false | Purge underlying data files when dropping a table (irreversible) |
| tables.deletion.exclude | List<Pattern> | [] | Regex patterns; matching table names are never deleted |

View controller options:

| Property | Type | Default | Description |
|---|---|---|---|
| delete-orphans | Boolean | false | Drop views that exist in the catalog but are not defined in any resource |
| views.deletion.exclude | List<Pattern> | [] | Regex patterns; matching view names are never deleted |

The delete-orphans property applies independently to tables and views.

The IcebergNamespaceController always uses delete-orphans = false to prevent accidental
deletion of namespaces that may still contain tables. To delete a namespace, use the DELETE
reconciliation mode explicitly.

Example:

jikkou {
  provider.iceberg {
    enabled = true
    type = io.streamthoughts.jikkou.iceberg.IcebergExtensionProvider
    config = {
      catalogType = "rest"
      catalogUri = "http://localhost:8181"
      warehouse = "s3://my-bucket/warehouse"
      # Table reconciliation safety settings
      delete-orphans = false
      delete-orphan-columns = false
      delete-purge = false
      # Never delete tables whose name starts with "audit_"
      tables.deletion.exclude = ["^audit_.*"]
      # Never delete views whose name starts with "v_core_"
      views.deletion.exclude = ["^v_core_.*"]
    }
  }
}

Here, you will find the list of resources supported by the extensions for Apache Iceberg.
The Apache Iceberg extension provides the following resource types:

| Resource | Description |
|---|---|
| IcebergNamespace | Manage namespaces (databases) in an Iceberg catalog |
| IcebergTable | Manage tables with schema evolution, partitioning, and sort order |
| IcebergView | Manage SQL views backed by one or more dialect-specific queries |

IcebergNamespace resources are used to define the namespaces (databases) you want to manage in your
Iceberg catalog. A namespace groups tables and carries metadata as key/value properties.

IcebergNamespace

Here is the resource definition file for defining an IcebergNamespace.

apiVersion: "iceberg.jikkou.io/v1beta1"  # The api version (required)
kind: "IcebergNamespace"                 # The resource kind (required)
metadata:
  name: <namespace name>                 # Dot-separated namespace path (required)
  labels: { }
  annotations: { }
spec:
  properties:                            # Namespace-level metadata (optional)
    <key>: <value>

The metadata.name property is mandatory. Nested namespaces are expressed using dot notation —
for example, analytics.events creates a namespace events inside analytics.

file: iceberg-namespaces.yaml

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergNamespace"
metadata:
  name: "analytics"
spec:
  properties:
    owner: "data-team"
    environment: "production"
---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergNamespace"
metadata:
  name: "analytics.events"
spec:
  properties:
    owner: "data-team"
    team: "platform"

IcebergNamespaceList

If you need to define multiple namespaces (e.g., using a template), it may be easier to use
an IcebergNamespaceList resource.

apiVersion: "iceberg.jikkou.io/v1beta1"  # The api version (required)
kind: "IcebergNamespaceList"             # The resource kind (required)
metadata: { }
items: [ ]                               # An array of IcebergNamespace

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergNamespaceList"
items:
  - metadata:
      name: "analytics"
    spec:
      properties:
        owner: "data-team"
  - metadata:
      name: "analytics.events"
    spec:
      properties:
        owner: "platform-team"

IcebergTable resources are used to define the tables you want to manage in your Iceberg catalog.
An IcebergTable resource defines the schema, partition layout, sort order, and table-level
properties. Jikkou performs safe schema evolution — adding, renaming, updating, and dropping
columns — without data loss.

IcebergTable

Here is the resource definition file for defining an IcebergTable.

apiVersion: "iceberg.jikkou.io/v1beta1"  # The api version (required)
kind: "IcebergTable"                     # The resource kind (required)
metadata:
  name: <namespace>.<table>              # Fully qualified table name (required)
  labels: { }
  annotations: { }
spec:
  location: <storage path>               # Override the default table location (optional)
  schema:
    identifierFields:                    # Primary-key columns for MERGE/UPSERT semantics (optional)
      - <column name>
    columns:                             # Ordered list of columns (required)
      - name: <column name>              # Column name (required)
        type: <type>                     # Column type (required); see type reference below
        required: <true|false>           # Whether the column is non-nullable (default: false)
        doc: <description>               # Documentation string (optional)
        default: <value>                 # Initial default value, immutable after creation (optional)
        writeDefault: <value>            # Write-default for absent values, updatable (optional)
        previousName: <old name>         # Triggers a safe rename instead of drop+add (optional)
  partitionFields:                       # Partition layout (optional)
    - sourceColumn: <column>             # Source column to partition on (required)
      transform: <transform>             # Partition transform (required); see transforms below
      name: <partition field name>       # Custom partition field name (optional)
  sortFields:                            # Default write sort order (optional)
    - column: <column>                   # Column name (mutually exclusive with term)
      term: <expression>                 # Sort expression e.g. bucket[16](user_id) (mutually exclusive with column)
      direction: <asc|desc>              # Sort direction (default: asc)
      nullOrder: <first|last>            # Null placement (default: last)
  properties:                            # Table-level metadata properties (optional)
    <key>: <value>

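
As a minimal sketch of how the column defaults and the expression form of a sort field from the definition above might be combined, consider the fragment below. The column names (user_id, country) and the "unknown" default value are hypothetical and only illustrate the layout.

```yaml
spec:
  schema:
    columns:
      - name: "user_id"             # hypothetical column
        type: "long"
        required: true
      - name: "country"             # hypothetical column
        type: "string"
        default: "unknown"          # initial default, immutable after creation
        writeDefault: "unknown"     # write-default, can be updated later
  sortFields:
    - term: "bucket[16](user_id)"   # expression form, so column is left unset
      direction: "asc"
      nullOrder: "last"
```

Because column and term are mutually exclusive, the sort field sets only term.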
Jikkou maps a set of type strings to the underlying Iceberg types:

| Type string | Iceberg type | Notes |
|---|---|---|
| boolean | BooleanType | |
| int / integer | IntegerType | |
| long | LongType | |
| float | FloatType | |
| double | DoubleType | |
| date | DateType | |
| time | TimeType | |
| timestamp | TimestampType (without tz) | |
| timestamptz | TimestampType (with tz) | |
| string | StringType | |
| uuid | UUIDType | |
| binary | BinaryType | |
| fixed[N] | FixedType(N) | e.g. fixed[16] |
| decimal(P,S) | DecimalType(P,S) | e.g. decimal(18,2) |

Complex types (struct, list, map) can be expressed as nested objects using the following format:

Struct:

type:
  type: "struct"
  fields:
    - name: "field_name"
      type: "string"        # Any type (primitive or nested)
      required: true        # default: false
      doc: "description"    # optional

List:

type:
  type: "list"
  elementType: "string"     # Any type (primitive or nested)
  elementRequired: false    # default: false

Map:

type:
  type: "map"
  keyType: "string"         # Any type (primitive or nested)
  valueType: "long"         # Any type (primitive or nested)
  valueRequired: false      # default: false

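
To show how these nested type objects might sit inside a column list, here is a short sketch. It assumes that a nested type object can be used wherever a type string is accepted, as the "primitive or nested" notes suggest. All column and field names are hypothetical.

```yaml
columns:
  - name: "tags"                  # hypothetical: list of strings
    type:
      type: "list"
      elementType: "string"
      elementRequired: true
  - name: "attributes"            # hypothetical: map from string to long
    type:
      type: "map"
      keyType: "string"
      valueType: "long"
  - name: "devices"               # hypothetical: list whose element is a struct
    type:
      type: "list"
      elementType:
        type: "struct"
        fields:
          - name: "os"
            type: "string"
            required: true
          - name: "screen_width"
            type: "int"
```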

The following partition transforms can be used in partitionFields:

| Transform | Example | Description |
|---|---|---|
| identity | identity | Partition by the exact column value |
| year | year | Extract year from a date/timestamp column |
| month | month | Extract year-month |
| day | day | Extract calendar date |
| hour | hour | Extract date-hour |
| bucket[N] | bucket[16] | Hash-bucket into N buckets |
| truncate[W] | truncate[8] | Truncate string or integer to width W |
| void | void | Always-null partition (marks dropped fields); not yet supported |

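
As an illustration of how several of these transforms could be combined in one partition layout, the sketch below reuses the column names from the page_views example in this section. The custom partition field name event_hour is hypothetical.

```yaml
partitionFields:
  - sourceColumn: "event_time"
    transform: "hour"
    name: "event_hour"            # hypothetical custom partition field name
  - sourceColumn: "page_url"
    transform: "truncate[8]"      # truncate the string to width 8
  - sourceColumn: "user_id"
    transform: "bucket[16]"       # hash into 16 buckets
```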
Jikkou applies schema changes in a safe, deterministic order.
To rename a column without losing its Iceberg field ID (which would break existing readers),
set the previousName field to the old column name:

columns:
  - name: "user_identifier"    # new name
    previousName: "user_id"    # old name; triggers a rename, not drop+add
    type: "long"
    required: true

By default, Jikkou rejects type changes that are not safe promotions (e.g., int → long is
safe; string → int is not). To allow incompatible changes on a specific resource, set the
annotation iceberg.jikkou.io/allow-incompatible-changes: "true".
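
As a small sketch of a change that falls on the safe side of this rule, the fragment below widens a column from int to long. The "previous" definition appears only in comments and is hypothetical. Since int to long is a safe promotion, the annotation should not be needed here.

```yaml
# Previously applied definition (hypothetical):
#   - name: "duration_ms"
#     type: "int"
#
# Updated definition: int -> long is a safe promotion.
columns:
  - name: "duration_ms"
    type: "long"
```

A change in the other direction, or from string to int, would be rejected unless the annotation is set to "true" on that resource.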

file: iceberg-page-views.yaml

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergTable"
metadata:
  name: "analytics.events.page_views"
spec:
  schema:
    columns:
      - name: "event_id"
        type: "uuid"
        required: true
        doc: "Unique event identifier"
      - name: "user_id"
        type: "long"
        required: true
        doc: "The user who triggered the event"
      - name: "page_url"
        type: "string"
        required: true
        doc: "URL of the viewed page"
      - name: "event_time"
        type: "timestamptz"
        required: true
        doc: "Timestamp when the event occurred (UTC)"
      - name: "duration_ms"
        type: "long"
        doc: "Time spent on the page in milliseconds"
  partitionFields:
    - sourceColumn: "event_time"
      transform: "day"
  sortFields:
    - column: "event_time"
      direction: "asc"
      nullOrder: "last"
  properties:
    write.format.default: "parquet"
    write.parquet.compression-codec: "zstd"

file: iceberg-orders.yaml

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergTable"
metadata:
  name: "analytics.events.orders"
  annotations:
    iceberg.jikkou.io/allow-incompatible-changes: "false"
spec:
  schema:
    columns:
      - name: "order_id"
        type: "long"
        required: true
        doc: "Unique order identifier"
      - name: "customer_id"
        type: "long"
        required: true
        doc: "Customer who placed the order"
      - name: "order_date"
        type: "date"
        required: true
        doc: "Date the order was placed"
      - name: "status"
        type: "string"
        required: true
        doc: "Order status"
      - name: "total_amount"
        type: "decimal(18,2)"
        required: true
        doc: "Total order amount"
    identifierFields:
      - "order_id"
  partitionFields:
    - sourceColumn: "order_date"
      transform: "month"
    - sourceColumn: "customer_id"
      transform: "bucket[16]"
  sortFields:
    - column: "order_date"
      direction: "desc"
      nullOrder: "last"
    - column: "customer_id"
      direction: "asc"
      nullOrder: "last"
  properties:
    write.format.default: "parquet"

IcebergTableList

If you need to define multiple tables (e.g., using a template), it may be easier to use an
IcebergTableList resource.

apiVersion: "iceberg.jikkou.io/v1beta1"  # The api version (required)
kind: "IcebergTableList"                 # The resource kind (required)
metadata: { }
items: [ ]                               # An array of IcebergTable

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergTableList"
items:
  - metadata:
      name: "analytics.raw.clicks"
    spec:
      schema:
        columns:
          - name: "click_id"
            type: "uuid"
            required: true
          - name: "ts"
            type: "timestamptz"
            required: true
      partitionFields:
        - sourceColumn: "ts"
          transform: "day"
  - metadata:
      name: "analytics.raw.impressions"
    spec:
      schema:
        columns:
          - name: "impression_id"
            type: "uuid"
            required: true
          - name: "ts"
            type: "timestamptz"
            required: true
      partitionFields:
        - sourceColumn: "ts"
          transform: "day"

IcebergView resources are used to define SQL views in your Iceberg catalog.
A view is a logical definition backed by one or more SQL queries — the output schema
is inferred by the engine and populated on collect.

IcebergView

Here is the resource definition file for defining an IcebergView.

apiVersion: "iceberg.jikkou.io/v1beta1"  # The api version (required)
kind: "IcebergView"                      # The resource kind (required)
metadata:
  name: <namespace>.<view>               # Fully qualified view name (required)
  labels: { }
  annotations: { }
spec:
  schema:                                # Output schema (read-only, inferred by the engine)
    columns:
      - name: <column name>
        type: <type>
        required: <true|false>
        doc: <description>
  queries:                               # SQL query definitions (at least one required)
    - sql: <SQL SELECT statement>        # The SQL defining the view (required)
      dialect: <dialect>                 # SQL dialect e.g. 'spark', 'trino', 'presto', 'hive' (required)
  defaultNamespace: <namespace>          # Default namespace for unqualified table references (optional)
  defaultCatalog: <catalog>              # Default catalog for unqualified table references (optional)
  properties:                            # View-level metadata properties (optional)
    <key>: <value>

file: iceberg-daily-page-stats.yaml

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergView"
metadata:
  name: "analytics.events.daily_page_stats"
spec:
  queries:
    - sql: >-
        SELECT
          CAST(event_time AS DATE) AS view_date,
          page_url,
          COUNT(*) AS view_count,
          COUNT(DISTINCT user_id) AS unique_users
        FROM analytics.events.page_views
        GROUP BY CAST(event_time AS DATE), page_url
      dialect: "spark"
  defaultNamespace: "analytics.events"
  properties:
    comment: "Daily page view statistics aggregated from raw events"

file: iceberg-multi-dialect-view.yaml

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergView"
metadata:
  name: "analytics.events.active_users"
spec:
  queries:
    - sql: >-
        SELECT user_id, COUNT(*) AS event_count
        FROM analytics.events.page_views
        WHERE event_time >= current_date() - INTERVAL 30 DAYS
        GROUP BY user_id
      dialect: "spark"
    - sql: >-
        SELECT user_id, COUNT(*) AS event_count
        FROM analytics.events.page_views
        WHERE event_time >= current_date - INTERVAL '30' DAY
        GROUP BY user_id
      dialect: "trino"
  defaultNamespace: "analytics.events"
  properties:
    comment: "Users active in the last 30 days"

IcebergViewList

If you need to define multiple views (e.g., using a template), it may be easier to use an
IcebergViewList resource.

apiVersion: "iceberg.jikkou.io/v1beta1"  # The api version (required)
kind: "IcebergViewList"                  # The resource kind (required)
metadata: { }
items: [ ]                               # An array of IcebergView

---
apiVersion: "iceberg.jikkou.io/v1beta1"
kind: "IcebergViewList"
items:
  - metadata:
      name: "analytics.events.daily_page_stats"
    spec:
      queries:
        - sql: >-
            SELECT CAST(event_time AS DATE) AS view_date, page_url, COUNT(*) AS view_count
            FROM analytics.events.page_views
            GROUP BY CAST(event_time AS DATE), page_url
          dialect: "spark"
      defaultNamespace: "analytics.events"
  - metadata:
      name: "analytics.events.active_users"
    spec:
      queries:
        - sql: >-
            SELECT user_id, COUNT(*) AS event_count
            FROM analytics.events.page_views
            WHERE event_time >= current_date() - INTERVAL 30 DAYS
            GROUP BY user_id
          dialect: "spark"
      defaultNamespace: "analytics.events"

Here, you will find information about the annotations provided by the Apache Iceberg extension for Jikkou.

iceberg.jikkou.io/allow-incompatible-changes

Controls whether incompatible schema changes are allowed when reconciling an IcebergTable.
By default, Jikkou rejects type changes that cannot be safely promoted (for example, changing
a column from string to int). Setting this annotation to "true" on a specific table
resource lifts that restriction for that resource only.

metadata:
  annotations:
    iceberg.jikkou.io/allow-incompatible-changes: "true"

iceberg.jikkou.io/namespace-location

Read-only annotation populated by the namespace collector. Contains the storage location of a namespace as reported by the catalog. This annotation is set automatically when collecting existing namespaces and does not need to be specified in resource files.

iceberg.jikkou.io/table-location

Reserved annotation for the storage location of a table. Currently, the table location
is stored in the spec.location field rather than as an annotation. This annotation key
is defined for future use.

iceberg.jikkou.io/view-location

Read-only annotation populated by the view collector. Contains the storage location of a view as reported by the catalog. This annotation is set automatically when collecting existing views and does not need to be specified in resource files.