Data Modelling: Cardinality, Relationships and Compression

02 Feb 2026

Relationship Cardinality

Cardinality defines how rows in one table relate to rows in another.

Notation	Meaning
`1:1`	Exactly one to one
`1:N`	One to many
`N:M`	Many to many
`0..N`	Optional, zero or many

Example: E-Commerce Model

A customer places orders. Each order has line items referencing products in a category.

erDiagram
    CUSTOMER ||--o{ ORDER : "1:N"
    ORDER ||--|{ ORDER_LINE : "1:N"
    ORDER_LINE }o--|| PRODUCT : "N:1"
    PRODUCT }o--|| CATEGORY : "N:1"
    CUSTOMER {
        int customer_id PK
        string name
        string country
    }
    ORDER {
        int order_id PK
        int customer_id FK
        date order_date
        decimal total_amount
    }
    ORDER_LINE {
        int line_id PK
        int order_id FK
        int product_id FK
        int quantity
    }
    PRODUCT {
        int product_id PK
        int category_id FK
        string name
        decimal price
    }
    CATEGORY {
        int category_id PK
        string name
    }

Reading the Relationships

Relationship	Type	Meaning
Customer → Order	`1:N`	One customer, many orders
Order → Order Line	`1:N`	One order, many line items
Order Line → Product	`N:1`	Many lines reference one product
Product → Category	`N:1`	Many products in one category

Why this matters: a 1:N JOIN fans out rows. If you SUM(total_amount) after joining orders to order lines, you get inflated totals. Always aggregate before joining, or join on the 1 side.

Column Cardinality and Parquet Compression

Column cardinality = the number of distinct values in a column. This directly impacts how well Parquet compresses your data.

quadrantChart
    title Column Cardinality vs Compression
    x-axis "Low Cardinality" --> "High Cardinality"
    y-axis "Poor Compression" --> "Great Compression"
    country: [0.2, 0.85]
    status: [0.15, 0.9]
    category_id: [0.3, 0.75]
    order_date: [0.55, 0.5]
    product_id: [0.7, 0.35]
    email: [0.85, 0.15]
    event_id: [0.95, 0.1]

How Parquet Compression Works

Parquet stores data column-by-column, not row-by-row. Columns with few distinct values compress extremely well because repeated values are encoded efficiently.

block-beta
    columns 3
    block:row["Row-Oriented (CSV/JSON)"]:3
        r1["Alice, IE, checkout, evt_1"]
        r2["Bob, DE, login, evt_2"]
        r3["Alice, IE, checkout, evt_3"]
    end
    space:3
    block:col["Column-Oriented (Parquet)"]:3
        c1["name: Alice, Bob, Alice"]
        c2["country: IE, DE, IE"]
        c3["event_id: evt_1, evt_2, evt_3"]
    end

Encoding by Cardinality

Column	Distinct Values	Cardinality	Parquet Encoding	Compression
`country`	~200	Very low	Dictionary	Excellent
`status`	3-5	Very low	RLE + Dictionary	Excellent
`order_date`	~365/year	Low	Dictionary + Delta	Good
`product_id`	~10K	Medium	Dictionary	Moderate
`email`	~1 per row	High	Plain	Poor
`event_id`	1 per row	Unique	Plain	None

Dictionary encoding: Parquet builds a lookup table of distinct values and stores integer indices instead of full values. A country column with 1 billion rows but only 200 distinct values stores just 200 strings + 1B tiny integers.

RLE (Run-Length Encoding): If data is sorted, consecutive identical values are stored as (value, count). Sorting by country before writing turns IE, IE, IE, IE into (IE, 4).

Practical Advice

Sort before write — sorting by low-cardinality columns maximises RLE compression:

-- Iceberg: sorted by low-cardinality columns first
ALTER TABLE orders WRITE ORDERED BY (country, status, order_date);

Partition by low cardinality — Iceberg/Delta partition keys should be low-cardinality columns (country, event_date), never high-cardinality (user_id, event_id).

Column pruning — Parquet reads only requested columns. High-cardinality columns like event_id don’t hurt queries that don’t select them.

Action	Impact
Sort by low-cardinality columns	2-5x smaller files
Partition by date/region	Faster query pruning
Avoid `SELECT *`	Skip high-cardinality columns
Use Zstandard compression	Better ratio than Snappy