Data Modelling: Cardinality, Relationships and Compression
Relationship Cardinality
Cardinality defines how rows in one table relate to rows in another.
| Notation | Meaning |
|---|---|
1:1 |
Exactly one to one |
1:N |
One to many |
N:M |
Many to many |
0..N |
Optional, zero or many |
Example: E-Commerce Model
A customer places orders. Each order has line items referencing products in a category.
erDiagram
CUSTOMER ||--o{ ORDER : "1:N"
ORDER ||--|{ ORDER_LINE : "1:N"
ORDER_LINE }o--|| PRODUCT : "N:1"
PRODUCT }o--|| CATEGORY : "N:1"
CUSTOMER {
int customer_id PK
string name
string country
}
ORDER {
int order_id PK
int customer_id FK
date order_date
decimal total_amount
}
ORDER_LINE {
int line_id PK
int order_id FK
int product_id FK
int quantity
}
PRODUCT {
int product_id PK
int category_id FK
string name
decimal price
}
CATEGORY {
int category_id PK
string name
}
Reading the Relationships
| Relationship | Type | Meaning |
|---|---|---|
| Customer → Order | 1:N |
One customer, many orders |
| Order → Order Line | 1:N |
One order, many line items |
| Order Line → Product | N:1 |
Many lines reference one product |
| Product → Category | N:1 |
Many products in one category |
Why this matters: a 1:N JOIN fans out rows. If you SUM(total_amount) after joining orders to order lines, you get inflated totals. Always aggregate before joining, or join on the 1 side.
Column Cardinality and Parquet Compression
Column cardinality = the number of distinct values in a column. This directly impacts how well Parquet compresses your data.
quadrantChart
title Column Cardinality vs Compression
x-axis "Low Cardinality" --> "High Cardinality"
y-axis "Poor Compression" --> "Great Compression"
country: [0.2, 0.85]
status: [0.15, 0.9]
category_id: [0.3, 0.75]
order_date: [0.55, 0.5]
product_id: [0.7, 0.35]
email: [0.85, 0.15]
event_id: [0.95, 0.1]
How Parquet Compression Works
Parquet stores data column-by-column, not row-by-row. Columns with few distinct values compress extremely well because repeated values are encoded efficiently.
block-beta
columns 3
block:row["Row-Oriented (CSV/JSON)"]:3
r1["Alice, IE, checkout, evt_1"]
r2["Bob, DE, login, evt_2"]
r3["Alice, IE, checkout, evt_3"]
end
space:3
block:col["Column-Oriented (Parquet)"]:3
c1["name: Alice, Bob, Alice"]
c2["country: IE, DE, IE"]
c3["event_id: evt_1, evt_2, evt_3"]
end
Encoding by Cardinality
| Column | Distinct Values | Cardinality | Parquet Encoding | Compression |
|---|---|---|---|---|
country |
~200 | Very low | Dictionary | Excellent |
status |
3-5 | Very low | RLE + Dictionary | Excellent |
order_date |
~365/year | Low | Dictionary + Delta | Good |
product_id |
~10K | Medium | Dictionary | Moderate |
email |
~1 per row | High | Plain | Poor |
event_id |
1 per row | Unique | Plain | None |
Dictionary encoding: Parquet builds a lookup table of distinct values and stores integer indices instead of full values. A country column with 1 billion rows but only 200 distinct values stores just 200 strings + 1B tiny integers.
RLE (Run-Length Encoding): If data is sorted, consecutive identical values are stored as (value, count). Sorting by country before writing turns IE, IE, IE, IE into (IE, 4).
Practical Advice
Sort before write — sorting by low-cardinality columns maximises RLE compression:
-- Iceberg: sorted by low-cardinality columns first
ALTER TABLE orders WRITE ORDERED BY (country, status, order_date);
Partition by low cardinality — Iceberg/Delta partition keys should be low-cardinality columns (country, event_date), never high-cardinality (user_id, event_id).
Column pruning — Parquet reads only requested columns. High-cardinality columns like event_id don’t hurt queries that don’t select them.
| Action | Impact |
|---|---|
| Sort by low-cardinality columns | 2-5x smaller files |
| Partition by date/region | Faster query pruning |
Avoid SELECT * |
Skip high-cardinality columns |
| Use Zstandard compression | Better ratio than Snappy |