# File Path Pattern

## Description

When accessing files on remote storage systems (S3, HDFS, and other S3-compatible object storage), Doris supports flexible file path patterns, including wildcards and range expressions. This document describes the supported path formats and the pattern-matching syntax.
These path patterns are supported by:
- S3 TVF
- HDFS TVF
- Broker Load
- INSERT INTO SELECT from TVF
## Supported URI Formats

### S3-Style URIs
| Style | Format | Example |
|---|---|---|
| AWS Client Style (Hadoop S3) | s3://bucket/path/to/file | s3://my-bucket/data/file.csv |
| S3A Style | s3a://bucket/path/to/file | s3a://my-bucket/data/file.csv |
| S3N Style | s3n://bucket/path/to/file | s3n://my-bucket/data/file.csv |
| Virtual Host Style | https://bucket.endpoint/path/to/file | https://my-bucket.s3.us-west-1.amazonaws.com/data/file.csv |
| Path Style | https://endpoint/bucket/path/to/file | https://s3.us-west-1.amazonaws.com/my-bucket/data/file.csv |
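As a quick illustration of how two of these styles are used in practice, the sketch below reads the same object with the AWS client style and with a path-style URI. The bucket, region, and credentials are placeholders, and the `use_path_style` property is assumed to be available in your Doris version for path-style endpoints:

```sql
-- AWS client style: the bucket is part of the URI authority
SELECT * FROM S3(
    "uri" = "s3://my-bucket/data/file.csv",
    "s3.endpoint" = "s3.us-west-1.amazonaws.com",
    "s3.region" = "us-west-1",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "format" = "csv"
);

-- Path style: the bucket appears as the first path segment of the endpoint URL
SELECT * FROM S3(
    "uri" = "https://s3.us-west-1.amazonaws.com/my-bucket/data/file.csv",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "use_path_style" = "true",
    "format" = "csv"
);
```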
### Other Cloud Storage URIs
| Provider | Scheme | Example |
|---|---|---|
| Alibaba Cloud OSS | oss:// | oss://my-bucket/data/file.csv |
| Tencent Cloud COS | cos://, cosn:// | cos://my-bucket/data/file.csv |
| Baidu Cloud BOS | bos:// | bos://my-bucket/data/file.csv |
| Huawei Cloud OBS | obs:// | obs://my-bucket/data/file.csv |
| Google Cloud Storage | gs:// | gs://my-bucket/data/file.csv |
| Azure Blob Storage | azure:// | azure://container/data/file.csv |
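These provider-specific schemes plug into the same S3 TVF. Below is a hedged sketch for Alibaba Cloud OSS; the bucket, endpoint, region, and credentials are placeholders that must match your own OSS account:

```sql
SELECT * FROM S3(
    "uri" = "oss://my-bucket/data/file.csv",
    "s3.endpoint" = "oss-cn-beijing.aliyuncs.com",
    "s3.region" = "cn-beijing",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "format" = "csv"
);
```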
### HDFS URIs
| Style | Format | Example |
|---|---|---|
| Standard | hdfs://namenode:port/path/to/file | hdfs://namenode:8020/user/data/file.csv |
| HA Mode | hdfs://nameservice/path/to/file | hdfs://my-ha-cluster/user/data/file.csv |
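For HA mode, the Hadoop client needs to know how to resolve the nameservice. The sketch below passes the standard HDFS HA client properties to the HDFS TVF; the nameservice name, NameNode hosts, and username are placeholders that must match your cluster configuration:

```sql
SELECT * FROM HDFS(
    "uri" = "hdfs://my-ha-cluster/user/data/file.csv",
    "fs.defaultFS" = "hdfs://my-ha-cluster",
    "hadoop.username" = "user",
    "dfs.nameservices" = "my-ha-cluster",
    "dfs.ha.namenodes.my-ha-cluster" = "nn1,nn2",
    "dfs.namenode.rpc-address.my-ha-cluster.nn1" = "host1:8020",
    "dfs.namenode.rpc-address.my-ha-cluster.nn2" = "host2:8020",
    "dfs.client.failover.proxy.provider.my-ha-cluster" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    "format" = "csv"
);
```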
## Wildcard Patterns

Doris uses glob-style pattern matching for file paths. The following wildcards are supported:

### Basic Wildcards
| Pattern | Description | Example | Matches |
|---|---|---|---|
| `*` | Matches zero or more characters within a path segment | `*.csv` | `file.csv`, `data.csv`, `a.csv` |
| `?` | Matches exactly one character | `file?.csv` | `file1.csv`, `fileA.csv`, but not `file10.csv` |
| `[abc]` | Matches any single character listed in the brackets | `file[123].csv` | `file1.csv`, `file2.csv`, `file3.csv` |
| `[a-z]` | Matches any single character in the range | `file[a-c].csv` | `filea.csv`, `fileb.csv`, `filec.csv` |
| `[!abc]` | Matches any single character NOT listed in the brackets | `file[!0-9].csv` | `filea.csv`, `fileb.csv`, but not `file1.csv` |
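These wildcards compose within a single path segment. As a sketch, the pattern below matches files named `log_` plus one lowercase letter plus one arbitrary character (the bucket and credentials are placeholders):

```sql
-- Matches log_a1.csv and log_b7.csv, but not log_10.csv (first char is a digit)
-- and not log_abc.csv (three characters where the pattern expects two)
SELECT * FROM S3(
    "uri" = "s3://my-bucket/logs/log_[a-z]?.csv",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "csv"
);
```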
### Range Expansion (Brace Patterns)

Doris supports numeric range expansion using brace patterns of the form `{start..end}`:
| Pattern | Expansion | Matches |
|---|---|---|
| `{1..3}` | `{1,2,3}` | 1, 2, 3 |
| `{01..05}` | `{1,2,3,4,5}` | 1, 2, 3, 4, 5 (leading zeros are NOT preserved) |
| `{3..1}` | `{1,2,3}` | 1, 2, 3 (reverse ranges are supported) |
| `{a,b,c}` | `{a,b,c}` | a, b, c (enumeration) |
| `{1..3,5,7..9}` | `{1,2,3,5,7,8,9}` | Mixed ranges and values |
- Doris tries to match as many files as possible: invalid parts in brace expressions are silently skipped, and valid parts are still expanded. For example, `file_{a..b,-1..3,4..5}` will match `file_4` and `file_5` (the invalid `a..b` and the negative range `-1..3` are skipped, but `4..5` is expanded normally), as illustrated in the sketch after this list.
- If the entire range is negative (e.g., `{-1..2}`), the range is skipped. If it is mixed with valid ranges (e.g., `{-1..2,1..3}`), only the valid range `1..3` is expanded.
- When using comma-separated values with ranges, only numbers are allowed. For example, in `{1..4,a}` the non-numeric `a` is ignored, resulting in `{1,2,3,4}`.
- Pure enumeration patterns like `{a,b,c}` (without `..` ranges) are passed directly to glob matching and work as expected.
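To make the skip behavior concrete, the sketch below uses the mixed pattern from the first rule (the bucket and credentials are placeholders):

```sql
-- file_{a..b,-1..3,4..5}.csv is processed part by part:
--   a..b  -> skipped (non-numeric range)
--   -1..3 -> skipped (negative range)
--   4..5  -> expanded to {4,5}
-- so only file_4.csv and file_5.csv are read
SELECT * FROM S3(
    "uri" = "s3://my-bucket/data/file_{a..b,-1..3,4..5}.csv",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "csv"
);
```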
### Combining Patterns

Multiple patterns can be combined in a single path:

`s3://bucket/data_{1..3}/file_*.csv`

This matches:

- `s3://bucket/data_1/file_a.csv`
- `s3://bucket/data_1/file_b.csv`
- `s3://bucket/data_2/file_a.csv`
- ... and so on
## Examples

### S3 TVF Examples
Match all CSV files in a directory:
```sql
SELECT * FROM S3(
    "uri" = "s3://my-bucket/data/*.csv",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "csv"
);
```
Match files with numeric range:
```sql
SELECT * FROM S3(
    "uri" = "s3://my-bucket/logs/data_{1..10}.csv",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "csv"
);
```
Match files in date-partitioned directories:
```sql
SELECT * FROM S3(
    "uri" = "s3://my-bucket/logs/year=2024/month=*/day=*/data.parquet",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "parquet"
);
```
For zero-padded directory names such as `month=01`, `month=02`, use wildcards (`*`) instead of range patterns: the pattern `{01..12}` expands to `{1,2,...,12}`, which won't match `month=01`.
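Because `?` matches exactly one character, `month=??` is a slightly tighter alternative to `month=*` when every partition directory uses two-digit zero-padded values. A sketch with placeholder bucket and credentials:

```sql
SELECT * FROM S3(
    "uri" = "s3://my-bucket/logs/year=2024/month=??/day=??/data.parquet",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "parquet"
);
```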
Match numbered file splits (e.g., zero-padded Spark output). Since brace ranges drop leading zeros, `part-{00000..00099}` would expand to `part-{0,1,...,99}` and never match `part-00000`; use `?` wildcards instead:

```sql
-- part-000??.csv matches part-00000 through part-00099
SELECT * FROM S3(
    "uri" = "s3://my-bucket/output/part-000??.csv",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "csv"
);
```
### Broker Load Examples
Load all CSV files matching a pattern:
```sql
LOAD LABEL db.label_wildcard
(
    DATA INFILE("s3://my-bucket/data/file_*.csv")
    INTO TABLE my_table
    COLUMNS TERMINATED BY ","
    FORMAT AS "CSV"
    (col1, col2, col3)
)
WITH S3 (
    "provider" = "S3",
    "AWS_ENDPOINT" = "s3.us-west-2.amazonaws.com",
    "AWS_ACCESS_KEY" = "xxx",
    "AWS_SECRET_KEY" = "xxx",
    "AWS_REGION" = "us-west-2"
);
```
Load files using numeric range expansion:
```sql
LOAD LABEL db.label_range
(
    DATA INFILE("s3://my-bucket/exports/data_{1..5}.csv")
    INTO TABLE my_table
    COLUMNS TERMINATED BY ","
    FORMAT AS "CSV"
    (col1, col2, col3)
)
WITH S3 (
    "provider" = "S3",
    "AWS_ENDPOINT" = "s3.us-west-2.amazonaws.com",
    "AWS_ACCESS_KEY" = "xxx",
    "AWS_SECRET_KEY" = "xxx",
    "AWS_REGION" = "us-west-2"
);
```
Load from HDFS with wildcards:
```sql
LOAD LABEL db.label_hdfs_wildcard
(
    DATA INFILE("hdfs://namenode:8020/user/data/2024-*/*.csv")
    INTO TABLE my_table
    COLUMNS TERMINATED BY ","
    FORMAT AS "CSV"
    (col1, col2, col3)
)
WITH HDFS (
    "fs.defaultFS" = "hdfs://namenode:8020",
    "hadoop.username" = "user"
);
```
Load from HDFS with numeric range:
```sql
LOAD LABEL db.label_hdfs_range
(
    DATA INFILE("hdfs://namenode:8020/data/file_{1..3,5,7..9}.csv")
    INTO TABLE my_table
    COLUMNS TERMINATED BY ","
    FORMAT AS "CSV"
    (col1, col2, col3)
)
WITH HDFS (
    "fs.defaultFS" = "hdfs://namenode:8020",
    "hadoop.username" = "user"
);
```
### INSERT INTO SELECT Examples
Insert from S3 with wildcards:
```sql
INSERT INTO my_table (col1, col2, col3)
SELECT * FROM S3(
    "uri" = "s3://my-bucket/data/part-*.parquet",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "us-east-1",
    "format" = "parquet"
);
```
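The same pattern machinery applies when the source is the HDFS TVF. A sketch inserting from a brace-expanded set of Parquet files (the NameNode address, username, and target table are placeholders):

```sql
INSERT INTO my_table (col1, col2, col3)
SELECT * FROM HDFS(
    "uri" = "hdfs://namenode:8020/exports/data_{1..5}.parquet",
    "fs.defaultFS" = "hdfs://namenode:8020",
    "hadoop.username" = "user",
    "format" = "parquet"
);
```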
## Performance Considerations

### Use Specific Prefixes
Doris extracts the longest non-wildcard prefix from your path pattern to optimize S3/HDFS listing operations. More specific prefixes result in faster file discovery.
```sql
-- Good: a specific prefix reduces the listing scope
"uri" = "s3://bucket/data/2024/01/15/*.csv"

-- Less optimal: a broad wildcard at an early path segment
"uri" = "s3://bucket/data/**/file.csv"
```
### Prefer Range Patterns for Known Sequences
When you know the exact file numbering, use range patterns instead of wildcards:
```sql
-- Better: an explicit range lists exactly the expected files
-- (leading zeros are dropped, so this targets part-1.csv .. part-100.csv)
"uri" = "s3://bucket/data/part-{1..100}.csv"

-- Less optimal: the wildcard matches an unknown set of files
"uri" = "s3://bucket/data/part-*.csv"
```
### Avoid Deep Recursive Wildcards

Deep recursive patterns such as `**` can cause slow file listing on large buckets:
```sql
-- Avoid when possible
"uri" = "s3://bucket/**/*.csv"

-- Prefer an explicit path structure
"uri" = "s3://bucket/data/year=*/month=*/day=*/*.csv"
```
## Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| No files found | Pattern doesn't match any files | Verify the path and pattern syntax; test with a single file first |
| Slow file listing | Wildcard too broad or too many files | Use more specific prefix; limit wildcard scope |
| Invalid URI error | Malformed path syntax | Check URI scheme and bucket name format |
| Access denied | Credentials or permissions issue | Verify S3/HDFS credentials and bucket policies |
### Testing Path Patterns

Before running a large load job, test your pattern with a limited query:
```sql
-- Test whether files exist and match the pattern
SELECT * FROM S3(
    "uri" = "s3://bucket/your/pattern/*.csv",
    ...
) LIMIT 1;
```
Use `DESC FUNCTION` to verify the schema of the matched files:

```sql
DESC FUNCTION S3(
    "uri" = "s3://bucket/your/pattern/*.csv",
    ...
);
```