Skip to main content

N-Gram BloomFilter Index

Indexing Principles​

The NGram BloomFilter index, similar to the BloomFilter index, is a skip index based on BloomFilter.

Unlike the BloomFilter index, the NGram BloomFilter index is used to accelerate text LIKE queries. Instead of storing the original text values, it tokenizes the text using NGram and stores each token in the BloomFilter. For LIKE queries, the pattern in LIKE '%pattern%' is also tokenized using NGram. Each token is checked against the BloomFilter, and if any token is not found, the corresponding data block does not meet the LIKE condition and can be skipped, reducing IO and accelerating the query.

Use Cases​

The NGram BloomFilter index can only accelerate string LIKE queries, and the number of consecutive characters in the LIKE pattern must be greater than or equal to the N defined in the NGram index.

tip
  • NGram BloomFilter only supports string columns and can only accelerate LIKE queries.
  • NGram BloomFilter index and BloomFilter index are mutually exclusive, meaning a column can only have one or the other.
  • The performance analysis of the NGram BloomFilter index is similar to that of the BloomFilter index.

Managing Indexes​

Creating an NGram BloomFilter Index​

The index definition follows the COLUMN definition in the CREATE TABLE statement:

INDEX `idx_column_name` (`column_name`) USING NGRAM_BF PROPERTIES("gram_size"="3", "bf_size"="1024") COMMENT 'username ngram_bf index'

Explanation of the syntax:

  1. idx_column_name(column_name) is mandatory. column_name is the column to be indexed and must appear in the column definitions above. idx_column_name is the index name, which must be unique at the table level. It is recommended to name it with a prefix idx_ followed by the column name.

  2. USING NGRAM_BF is mandatory and specifies that the index type is an NGram BloomFilter index.

  3. PROPERTIES is optional and is used to specify additional properties for the NGram BloomFilter index. The supported properties are:

    • gram_size: The N in NGram, specifying the number of consecutive characters to form a token. For example, 'an ngram example' with N = 3 would be tokenized into 'an ', 'n n', ' ng', 'ngr', 'gra', 'ram' (6 tokens).
    • bf_size: The size of the BloomFilter in bits. bf_size determines the size of the index corresponding to each data block. The larger this value, the more storage space it occupies, but the lower the probability of hash collisions.

    It is recommended to set gram_size to the minimum length of the string in LIKE queries but not less than 2. Generally, "gram_size"="3", "bf_size"="1024" is recommended, then adjust based on the Query Profile.

  4. COMMENT is optional and specifies an index comment.

Viewing NGram BloomFilter Index​

-- Syntax 1: The INDEX section in the table schema with USING NGRAM_BF indicates an inverted index

SHOW CREATE TABLE table_name;

-- Syntax 2: IndexType as NGRAM_BF indicates an inverted index

SHOW INDEX FROM idx_name;

Deleting an NGram BloomFilter Index​

ALTER TABLE table_ngrambf DROP INDEX idx_ngrambf;

Modifying an NGram BloomFilter Index​

CREATE INDEX idx_column_name2(column_name2) ON table_ngrambf USING NGRAM_BF PROPERTIES("gram_size"="3", "bf_size"="1024") COMMENT 'username ngram_bf index';

ALTER TABLE table_ngrambf ADD INDEX idx_column_name2(column_name2) USING NGRAM_BF PROPERTIES("gram_size"="3", "bf_size"="1024") COMMENT 'username ngram_bf index';

Using Indexes​

NGram BloomFilter index is used to accelerate LIKE queries, for example: SELECT count() FROM table1 WHERE message LIKE '%error%';

The acceleration effect of the BloomFilter index (including NGram) can be analyzed using the following metrics in the Query Profile:

  • RowsBloomFilterFiltered: The number of rows filtered by the BloomFilter index, which can be compared with other Rows values to analyze the filtering effect of the index.
  • BlockConditionsFilteredBloomFilterTime: The time consumed by the BloomFilter inverted index.

Usage Example​

This section demonstrates the usage and effectiveness of the NGram BloomFilter index using a dataset of Amazon product reviews, amazon_reviews.

Table Creation​

CREATE TABLE `amazon_reviews` (  
`review_date` int(11) NULL,
`marketplace` varchar(20) NULL,
`customer_id` bigint(20) NULL,
`review_id` varchar(40) NULL,
`product_id` varchar(10) NULL,
`product_parent` bigint(20) NULL,
`product_title` varchar(500) NULL,
`product_category` varchar(50) NULL,
`star_rating` smallint(6) NULL,
`helpful_votes` int(11) NULL,
`total_votes` int(11) NULL,
`vine` boolean NULL,
`verified_purchase` boolean NULL,
`review_headline` varchar(500) NULL,
`review_body` string NULL
) ENGINE=OLAP
DUPLICATE KEY(`review_date`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`review_date`) BUCKETS 16
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"compression" = "ZSTD"
);

Data Import​

Download the dataset using wget or other tools from the following URLs:

https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2010.snappy.parquet
https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2011.snappy.parquet
https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2012.snappy.parquet
https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2013.snappy.parquet
https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2014.snappy.parquet
https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet

Import the data using stream load:

curl --location-trusted -u root: -T amazon_reviews_2010.snappy.parquet -H "format:parquet" http://127.0.0.1:8030/api/${DB}/amazon_reviews/_stream_load
curl --location-trusted -u root: -T amazon_reviews_2011.snappy.parquet -H "format:parquet" http://127.0.0.1:8030/api/${DB}/amazon_reviews/_stream_load
curl --location-trusted -u root: -T amazon_reviews_2012.snappy.parquet -H "format:parquet" http://127.0.0.1:8030/api/${DB}/amazon_reviews/_stream_load
curl --location-trusted -u root: -T amazon_reviews_2013.snappy.parquet -H "format:parquet" http://127.0.0.1:8030/api/${DB}/amazon_reviews/_stream_load
curl --location-trusted -u root: -T amazon_reviews_2014.snappy.parquet -H "format:parquet" http://127.0.0.1:8030/api/${DB}/amazon_reviews/_stream_load
curl --location-trusted -u root: -T amazon_reviews_2015.snappy.parquet -H "format:parquet" http://127.0.0.1:8030/api/${DB}/amazon_reviews/_stream_load
info

The data file may exceed 10 GB, and you may need to adjust the streaming_road_max_mb in be.conf to prevent exceeding the upload size limit of the stream load. You can dynamically adjust it by following the steps below:

curl -X POST http://{be_ip}:{be_http_port}/api/update_config?streaming_load_max_mb=32768

Every BE needs to execute the above command.

Run a count query to confirm successful data import:

mysql> SELECT COUNT(*) FROM amazon_reviews;
+-----------+
| count(*) |
+-----------+
| 135589433 |
+-----------+

Querying​

First, run the query without any index. The WHERE clause contains a LIKE condition, and the query takes 7.60 seconds:

SELECT
product_id,
any(product_title),
AVG(star_rating) AS rating,
COUNT(*) AS count
FROM
amazon_reviews
WHERE
review_body LIKE '%is super awesome%'
GROUP BY
product_id
ORDER BY
count DESC,
rating DESC,
product_id
LIMIT 5;

Results:

+------------+------------------------------------------+--------------------+-------+
| product_id | any_value(product_title) | rating | count |
+------------+------------------------------------------+--------------------+-------+
| B00992CF6W | Minecraft | 4.8235294117647056 | 17 |
| B009UX2YAC | Subway Surfers | 4.7777777777777777 | 9 |
| B00DJFIMW6 | Minion Rush: Despicable Me Official Game | 4.875 | 8 |
| B0086700CM | Temple Run | 5 | 6 |
| B00KWVZ750 | Angry Birds Epic RPG | 5 | 6 |
+------------+------------------------------------------+--------------------+-------+
5 rows in set (7.60 sec)

Next, add an NGram BloomFilter index and run the same query again. The query takes 0.93 seconds, an 8x performance improvement:

ALTER TABLE amazon_reviews ADD INDEX review_body_ngram_idx(review_body) USING NGRAM_BF PROPERTIES("gram_size"="10", "bf_size"="10240");
+------------+------------------------------------------+--------------------+-------+
| product_id | any_value(product_title) | rating | count |
+------------+------------------------------------------+--------------------+-------+
| B00992CF6W | Minecraft | 4.8235294117647056 | 17 |
| B009UX2YAC | Subway Surfers | 4.7777777777777777 | 9 |
| B00DJFIMW6 | Minion Rush: Despicable Me Official Game | 4.875 | 8 |
| B0086700CM | Temple Run | 5 | 6 |
| B00KWVZ750 | Angry Birds Epic RPG | 5 | 6 |
+------------+------------------------------------------+--------------------+-------+
5 rows in set (0.93 sec)