How to create dictionary:
CREATE DICTIONARY dict_name( ... - attributes ) PRIMARY KEY ... - complex or single key configuration SOURCE(...) - source configuration LAYOUT(...) - memory layout configuration LIFETIME(...) - lifetime of dictionary in memory
Example xml configuration:
<dictionary> <name>dictionary_name</name> <structure> <key> key attributes </key> attributes </structure> <source> dictionary source configuration </source> <layout><dictionary_layout /></layout> <lifetime>dictionary_lifetime</lifetime> </dictionary>
Example configuration:
CREATE DICTIONARY dict_name ( id UInt64, value String, ... attributes ) PRIMARY KEY id ...
MySQL PostgreSQL Mongo Redis Cassandra ClickHouse File HTTP Executable program/script Any database throught ODBC driver
MySQL source configuration:
SOURCE(MYSQL( port 3306 user 'clickhouse' password 'qwerty' db 'db_name' table 'table_name'))
Flat Hashed Cache SSDCache Direct IPTrie Polygon
Cache layout configuration:
LAYOUT(CACHE(SIZE_IN_CELLS 100000))
LAYOUT(SSD_CACHE( BLOCK_SIZE 4096 FILE_SIZE 16777216 READ_BUFFER_SIZE 1048576 PATH /path/))
ClickHouse periodically updates the dictionaries. Dictionary updates (other than loading for first use) do not block queries.
LIFETIME(MIN 300 MAX 360)
Get dictionary attribute for specific keys
For primary key from single attribute
dictGet(dictionary_name, attribute_name, keys)
For primary key from multiple attributes
dictGet(dictionary_name, attribute_name, (key1, key2, ..., keyN))
For primary key from single attribute
dictHas(dictionary_name, keys)
For primary key from multiple attributes
dictHas(dictionary_name, (key1, key2, ..., keyN))
CREATE TABLE dictionaries.dictionary_example (id Int, value TEXT); INSERT INTO dictionaries.dictionary_example VALUES (1, 'First Value'), (2, 'Second Value'), (3, 'Third Value');
CREATE DICTIONARY mysql_dictionary_example ( id UInt64, value String DEFAULT 'Default Value' ) PRIMARY KEY id SOURCE(MYSQL( port 3306 user 'test' password 'test' db 'dictionaries' table 'dictionary_example' )) LIFETIME(MIN 300 MAX 300) LAYOUT(HASHED());
SELECT number, dictGet('mysql_dictionary_example', 'value', number) as values FROM system.numbers LIMIT 5
┌─number─┬─values────────┐ │ 0 │ Default Value │ │ 1 │ First Value │ │ 2 │ Second Value │ │ 3 │ Third Value │ │ 4 │ Default Value │ └────────┴───────────────┘
SELECT number, dictHas('mysql_dictionary_example', number) as values FROM system.numbers LIMIT 5
┌─number─┬─values─┐ │ 0 │ 0 │ │ 1 │ 1 │ │ 2 │ 1 │ │ 3 │ 1 │ │ 4 │ 0 │ └────────┴────────┘
File data.tsv in TabSeparated format
1 First Value 2 Second Value 3 Third Value
<name>executable_tsv</name> <structure> <id><name>x</name></id> <attribute> <name>value</name> <type>String</type> <null_value>No such value</null_value> </attribute> </structure> <source> <executable> <command>cat data.tsv</command> <format>TabSeparated</format> </executable> </source> <layout><direct /></layout> <lifetime>300</lifetime>
SELECT number, dictGet('executable_tsv_simple', 'value', number) as values FROM system.numbers LIMIT 5
┌─number─┬─values────────┐ │ 0 │ No such value │ │ 1 │ First Value │ │ 2 │ Second Value │ │ 3 │ Third Value │ │ 4 │ No such value │ └────────┴───────────────┘
SELECT number, dictHas('executable_tsv_simple', number) as values FROM system.numbers LIMIT 5
┌─number─┬─values─┐ │ 0 │ 0 │ │ 1 │ 1 │ │ 2 │ 1 │ │ 3 │ 1 │ │ 4 │ 0 │ └────────┴────────┘
Available since 21.3.
File data.tsv in TabSeparated format
First Value Second Value Third Value
<source> <executable> <command>cat data.tsv</command> <format>TabSeparated</format> <implicit_key>1</implicit_key> </executable> </source>
ClickHouse will vfork + exec to create new process for each block to process during dictGet, or dictHas
1. User program has long initialization
2. User program need to save state
Available since 21.3.
Create processes on demand, put them in pool with some size, and then reuse them during dictGet, or dictHas
<source> <executable_pool> <command>executable-program</command> <format>TabSeparated</format> <size>10</size> </executable_pool> </source>
int main(int argc, char **argv) { int result; int identifier; std::this_thread::sleep_for(std::chrono::milliseconds(50)); while ((result = scanf("%i", &identifier) != EOF)) std::cout << identifier << "\t" << (identifier + 1) << "\n" << std::flush; return 0; }
Executable pool:
clickhouse-benchmark --query="SELECT dictGet('executable_pool', 'value', number) FROM system.numbers_mt LIMIT 100000 FORMAT Null" --concurrency=10
localhost:9000, queries 189, QPS: 187.377, MiB/s: 2996.612
Executable:
clickhouse-benchmark --query="SELECT dictGet('executable', 'value', number) FROM system.numbers_mt LIMIT 100000 FORMAT Null" --concurrency=10
localhost:9000, queries 268, QPS: 92.657, MiB/s: 1481.805
<dictionary> <name>user_custom_function</name> <structure> <key> <attribute> <name>first_word</name> <type>String</type><null_value></null_value> </attribute> <attribute> <name>second_word</name> <type>String</type><null_value></null_value> </attribute> </key> <attribute> <name>distance</name> <type>Float64</type> <null_value>0.0</null_value> </attribute> </structure> <source> <executable_pool> <format>TabSeparated</format> <command>word2vec-function</command> <implicit_key>1</implicit_key> <size>5</size> </executable_pool> </source> <layout><complex_key_direct /></layout> <lifetime>300</lifetime> </dictionary>
SELECT dictGet('user_custom_function', 'distance', ('FirstWord', 'SecondWord'));
┌─dictGet('user_custom_function', 'distance', tuple('FirstWord', 'SecondWord'))─┐ │ 0.46 │ └───────────────────────────────────────────────────────────────────────────────┘
Available since 21.3.
1. Current implementation use fixed hash table with max collision resolution chain = 10
2. Cache attributes allocated memory for all elements on initialization
Issues:
https://github.com/ClickHouse/ClickHouse/issues/20194
https://github.com/ClickHouse/ClickHouse/issues/20252
https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU)
clickhouse-benchmark --query="SELECT dictGet('clickhouse_simple_cache_dictionary', 'value1', number) FROM system.numbers LIMIT 80000 FORMAT Null"
localhost:9000, queries 270, QPS: 87.105, MiB/s: 87.064
clickhouse-benchmark --query="SELECT dictGet('clickhouse_simple_cache_dictionary', 'value1', number) FROM system.numbers LIMIT 80000 FORMAT Null"
localhost:9000, queries 60, QPS: 20.105, MiB/s: 22.024
Will be available since 21.4. or 21.5
Available since 21.3.
If all keys are found in cache, but some of them are expired, return result to client
Implemented for both Cache, SSDCache dictionary layouts
Available since 21.3.
Added nullable type support for all dictionary layouts, except Polygon and IPTrie
Available since 21.3.
Problem. If your table initialization depends on dictGet result type, and dictionary cannot be loaded because of some error server startup can fails.
Developer — Vitaliy Baranov
Will be available in 21.4. or 21.5
1. Support for single complex key hierarchy attribute
2. Add dictGetChildren, dictGetDescendants methods
3. Improve performance
Issues:
Will be available in 21.4. or 21.5
1. Array dictionary attribute type support
2. Decimal256 dictionary attribute type support
3. Suport Enum type
4. Table function dictionary for dictionaries created using XML
5. Synchronous initialization of dictionaries created using DDL
6. Use system.dictionaries information without loading dictionary
7. Improve documentation with examples
8. Simplify simple, complex key dictionary configuration created with DDL, XML
Links