ClickHouse MergeTree Engine

After starting this series ClickHouse on Kubernetes, you can now configure your first single-node ClickHouse server. Let's dive into creating your first table and understanding the basic concepts behind the ClickHouse engine, its data storage, and some cool features

  1. Creating a Basic Table
  2. MergeTree Engine
  3. ORDER BY
  4. Insert data
  5. Insert by FORMAT
  6. UNDROP 🀯
  7. DETACH/ATTACH
  8. TTL
  9. LowCardinality(T) column data type
  10. ClickHouse Series

Creating a Basic Table

Here's a basic example of a table using the MergeTree engine:

CREATE TABLE events
(
    `event_time` DateTime,
    `event_date` Date DEFAULT toDate(event_time),
    `user_id` UInt32,
    `event_type` String,
    `value` String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (user_id, event_time)

ClickHouse column data types include (full list)

MergeTree Engine

The MergeTree engine and other engines of this family (*MergeTree) are the most commonly used and most robust ClickHouse table engines. The data is quickly written to the table part by part, and merging the parts in the background.

PARTITION BY toYYYYMM(event_date)

Each partition is stored separately folder

  • 202401/
  • 202402/
  • ...

ClickHouse also automatically cuts off the partition data where the partitioning key is specified in the query. The partition key cannot be modified. In most cases, you don't need a partition key. Don't partition your data by client identifiers or names.

ORDER BY

ORDER BY (user_id, event_date)

The data will be merged and sorted by (user_id, event_time). This data is stored as separate parts (chunks) sorted by the primary key. Within 10-15 minutes after insertion, the parts of the same partition are merged into a complete part. Note that a part is different from a partition; a part is a smaller unit within a partition. Use the ORDER BY tuple() syntax, if you do not need sorting.

Insert data

Inserts data into a table.

INSERT INTO events (event_time, user_id, event_type, value)
VALUES
	(now(), 111, 'click', '/home'),
	(now(), 222, 'click', '/blog');
SELECT * FROM events;

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€event_time─┬─event_date─┬─user_id─┬─event_type─┬─value─┐
β”‚ 2024-05-31 08:13:27 β”‚ 2024-05-31 β”‚     111 β”‚ click      β”‚ /home β”‚
β”‚ 2024-05-31 08:13:27 β”‚ 2024-05-31 β”‚     222 β”‚ click      β”‚ /blog β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜

event_date is automatic assign by the DEFAULT toDate(event_time). It is also possible to use DEFAULT keyword to insert default values:

INSERT INTO events VALUES (now(), DEFAULT, 333, 'click', '/insights')
SELECT * FROM events

Query id: f7d18374-4439-4bfb-aa2f-478a7269f45d

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€event_time─┬─event_date─┬─user_id─┬─event_type─┬─value─────┐
β”‚ 2024-05-31 08:16:36 β”‚ 2024-05-31 β”‚     333 β”‚ click      β”‚ /insights β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€event_time─┬─event_date─┬─user_id─┬─event_type─┬─value─┐
β”‚ 2024-05-31 08:13:27 β”‚ 2024-05-31 β”‚     111 β”‚ click      β”‚ /home β”‚
β”‚ 2024-05-31 08:13:27 β”‚ 2024-05-31 β”‚     222 β”‚ click      β”‚ /blog β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜

When using the clickhouse client binary in the terminal, you can see that the returned data is separated into two blocks because the data has not been merged yet. Check the number of parts:

SELECT table, partition, name, rows, path
FROM system.parts
WHERE database = 'default' AND table = 'events'

β”Œβ”€table──┬─partition─┬─name─────────┬─rows─┬─path─────────────────────────────────────────────────────────────────────────────┐
β”‚ events β”‚ 202405    β”‚ 202405_1_1_0 β”‚    2 β”‚ /var/lib/clickhouse/store/410/410ccafd-c5ac-48b7-93e6-42b6a82f4ece/202405_1_1_0/ β”‚
β”‚ events β”‚ 202405    β”‚ 202405_2_2_0 β”‚    1 β”‚ /var/lib/clickhouse/store/410/410ccafd-c5ac-48b7-93e6-42b6a82f4ece/202405_2_2_0/ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Data is not merged yet, there is many to make it happens like OPTIMIZE table:

OPTIMIZE TABLE default.events FINAL

ClickHouse will merge into new part 202405_1_2_1 mark it as active part, and inactive will be clean up later.

Insert by FORMAT

Data can be passed to the INSERT in any format supported by ClickHouse. The format must be specified explicitly in the query, for example:

INSERT INTO events FORMAT JSONEachRow
{"event_time": "2024-06-01 00:00:00", "user_id": "111", "event_type": "scroll", "value": "/blog"}

Ok.

Bonus: You can change the data format returns from SELECT query:

SELECT * FROM events LIMIT 3 Format CSV;

"2024-05-31 09:40:00","2024-05-31",16089454,"click","/insights"
"2024-05-31 09:40:00","2024-05-31",16089454,"click","/insights"
"2024-05-31 09:40:00","2024-05-31",16089454,"click","/insights"

Some common formats for Input and Output data

UNDROP 🀯

Like every other database engine, you can DROP one or more tables, but you can even UNDROP them within up to 8 minutes (by default, can be adjusted using the database_atomic_delay_before_drop_table_sec setting).

DROP TABLE events;
UNDROP TABLE events;

DETACH/ATTACH

Detaching a table makes the server "forget" about the existence of the table. This action does not delete the data or metadata of the table. I usually DETACH it when encountering some issues that need to be fixed under the file system, and then ATTACH it to scan and load it back.

TTL

TTL (time-to-live) refers to the capability of moving, deleting, or rolling up rows or columns after a certain interval of time has passed. This actually happens when data is being merged. TTL can be applied to each column or the whole row level. More detail in document.

Now modify the table above:

ALTER TABLE events MODIFY TTL event_time + INTERVAL 1 HOUR;

Insert old data:

INSERT INTO events VALUES (now() - interval 10 hour, DEFAULT, 333, 'click', '/insights');

-- Quickly select the data
SELECT * FROM events;

After for a while or to force TTL cleanup by OPTIMIZE:

OPTIMIZE TABLE events FINAL;
SELECT * FROM events;

Second row was deleted from table.

LowCardinality(T) column data type

If you have a column like this

SELECT event_type FROM events;

β”Œβ”€event_type─┐
β”‚ click      β”‚
β”‚ pageview   β”‚
β”‚ pageview   β”‚
β”‚ pageview   β”‚
β”‚ pageview   β”‚
β”‚ pageview   β”‚
β”‚ click      β”‚
β”‚ click      β”‚
β”‚ click      β”‚
β”‚ click      β”‚
β”‚ pageview   β”‚
β”‚ pageview   β”‚
β”‚ pageview   β”‚
β”‚ pageview   β”‚


SELECT event_type, COUNT() FROM events GROUP BY 1;

β”Œβ”€event_type─┬──count()─┐
β”‚ click      β”‚ 17563648 β”‚
β”‚ scroll     β”‚   262144 β”‚
β”‚ pageview   β”‚ 15466496 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

3 rows in set. Elapsed: 28.517 sec. Processed 33.29 million rows, 512.75 MB (1.17 million rows/s., 17.98 MB/s.)

The value is repeated. Consider using Enum when you have a handful of unique values or LowCardinality when you have up to 10,000 unique values (e.g. click, pageview,...) of a column.

Let’s look at how event_type column are stored

SELECT
    column,
    any(type),
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    round(sum(column_data_uncompressed_bytes) / sum(column_data_compressed_bytes), 2) AS compr_ratio,
    sum(rows)
FROM system.parts_columns
WHERE (`table` = 'events') AND active AND (column = 'event_type')
GROUP BY column
ORDER BY column ASC

β”Œβ”€column─────┬─any(type)─┬─compressed─┬─uncompressed─┬─compr_ratio─┬─sum(rows)─┐
β”‚ event_type β”‚ String    β”‚ 1.04 MiB   β”‚ 233.25 MiB   β”‚      225.34 β”‚  33292288 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

I will change the type of the event_type column to LowCardinality. This can be done with a simple spell that looks like an ALTER TABLE statement.

ALTER TABLE events
  MODIFY COLUMN `event_type` LowCardinality(String);

0 rows in set. Elapsed: 99.556 sec.

This took 99 seconds in my tiny server. Now checking the column size again:

SELECT
    column,
    any(type),
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    round(sum(column_data_uncompressed_bytes) / sum(column_data_compressed_bytes), 2) AS compr_ratio,
    sum(rows)
FROM system.parts_columns
WHERE (`table` = 'events') AND active AND (column = 'event_type')
GROUP BY column
ORDER BY column ASC

β”Œβ”€column─────┬─any(type)──────────────┬─compressed─┬─uncompressed─┬─compr_ratio─┬─sum(rows)─┐
β”‚ event_type β”‚ LowCardinality(String) β”‚ 151.16 KiB β”‚ 31.56 MiB    β”‚      213.81 β”‚  33292288 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The storage size has been reduced x7 times, and compressed is just now only 151 KiB. In some case you even get better compr_ratio. This also improved query performance:

SELECT event_type, COUNT() FROM events GROUP BY 1;

- 3 rows in set. Elapsed: 28.517 sec. Processed 33.29 million rows, 512.75 MB (1.17 million rows/s., 17.98 MB/s.)
+ 3 rows in set. Elapsed: 5.272 sec. Processed 33.29 million rows, 33.29 MB (6.31 million rows/s., 6.31 MB/s.)

The query now runs 5.41 times faster. LowCardinality changes the internal representation of other data types to be dictionary-encoded.

β€’Dataβ€’DataClickHouseClickHouse on Kubernetesβ€’

Series: ClickHouse on Kubernetes

1
ClickHouse on Kubernetes

ClickHouse has been both exciting and incredibly challenging based on my experience migrating and scaling from Iceberg to ClickHouse, zero to a large cluster of trillions of rows. I have had to deal with many of use cases and resolve issues. I have been trying to take notes every day for myself, although it takes time to publish them as a series of blog posts. I hope I can do so on this ClickHouse on Kubernetes series.

2
ClickHouse SELECT Advances

Dynamic column selection (also known as a `COLUMNS` expression) allows you to match some columns in a result with aΒ re2Β regular expression.

3
Monitoring ClickHouse on Kubernetes

Now that you have your first ClickHouse instance on Kubernetes and are starting to use it, you need to monitoring and observing what happens on it is an important task to achieve stability.

4
ClickHouse MergeTree Engine

After starting this series ClickHouse on Kubernetes, you can now configure your first single-node ClickHouse server. Let's dive into creating your first table and understanding the basic concepts behind the ClickHouse engine, its data storage, and some cool features

5
ClickHouse ReplacingMergeTree Engine

My favorite ClickHouse table engine is `ReplacingMergeTree`. The main reason is that it is similar to `MergeTree` but can automatically deduplicate based on columns in the `ORDER BY` clause, which is very useful.