File Path Pattern

Description

When accessing files on remote storage systems (S3, HDFS, and S3-compatible object storage), Doris supports flexible file path patterns, including wildcards and range expressions. This document describes the supported path formats and the pattern matching syntax.

These path patterns are supported by:

  • the S3 table-valued function (TVF)
  • Broker Load (S3 and HDFS sources)
  • INSERT INTO ... SELECT reading from a table-valued function

Supported URI Formats

S3-Style URIs

| Style | Format | Example |
|-------|--------|---------|
| AWS Client Style (Hadoop S3) | s3://bucket/path/to/file | s3://my-bucket/data/file.csv |
| S3A Style | s3a://bucket/path/to/file | s3a://my-bucket/data/file.csv |
| S3N Style | s3n://bucket/path/to/file | s3n://my-bucket/data/file.csv |
| Virtual Host Style | https://bucket.endpoint/path/to/file | https://my-bucket.s3.us-west-1.amazonaws.com/data/file.csv |
| Path Style | https://endpoint/bucket/path/to/file | https://s3.us-west-1.amazonaws.com/my-bucket/data/file.csv |

Other Cloud Storage URIs

| Provider | Scheme | Example |
|----------|--------|---------|
| Alibaba Cloud OSS | oss:// | oss://my-bucket/data/file.csv |
| Tencent Cloud COS | cos://, cosn:// | cos://my-bucket/data/file.csv |
| Baidu Cloud BOS | bos:// | bos://my-bucket/data/file.csv |
| Huawei Cloud OBS | obs:// | obs://my-bucket/data/file.csv |
| Google Cloud Storage | gs:// | gs://my-bucket/data/file.csv |
| Azure Blob Storage | azure:// | azure://container/data/file.csv |

HDFS URIs

| Style | Format | Example |
|-------|--------|---------|
| Standard | hdfs://namenode:port/path/to/file | hdfs://namenode:8020/user/data/file.csv |
| HA Mode | hdfs://nameservice/path/to/file | hdfs://my-ha-cluster/user/data/file.csv |
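
In HA mode the nameservice is not a real host, so the standard HDFS HA client properties must accompany the path. A minimal Broker Load sketch, assuming a nameservice named my-ha-cluster with two namenodes (the nameservice, namenode IDs, and host names are placeholder assumptions):

WITH HDFS (
"fs.defaultFS" = "hdfs://my-ha-cluster",
"hadoop.username" = "user",
-- Standard Hadoop HA client settings: declare the nameservice,
-- its namenode IDs, their RPC addresses, and the failover provider
"dfs.nameservices" = "my-ha-cluster",
"dfs.ha.namenodes.my-ha-cluster" = "nn1,nn2",
"dfs.namenode.rpc-address.my-ha-cluster.nn1" = "namenode1:8020",
"dfs.namenode.rpc-address.my-ha-cluster.nn2" = "namenode2:8020",
"dfs.client.failover.proxy.provider.my-ha-cluster" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);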

Wildcard Patterns

Doris uses glob-style pattern matching for file paths. The following wildcards are supported:

Basic Wildcards

| Pattern | Description | Example | Matches |
|---------|-------------|---------|---------|
| * | Matches zero or more characters within a path segment | *.csv | file.csv, data.csv, a.csv |
| ? | Matches exactly one character | file?.csv | file1.csv, fileA.csv, but not file10.csv |
| [abc] | Matches any single character in brackets | file[123].csv | file1.csv, file2.csv, file3.csv |
| [a-z] | Matches any single character in the range | file[a-c].csv | filea.csv, fileb.csv, filec.csv |
| [!abc] | Matches any single character NOT in brackets | file[!0-9].csv | filea.csv, fileb.csv, but not file1.csv |
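
For instance, a character class can pin matching to an explicit subset of files. A minimal S3 TVF sketch (bucket name and credentials are placeholders):

SELECT * FROM S3(
"uri" = "s3://my-bucket/data/file[0-9].csv",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "csv"
);

This matches file0.csv through file9.csv but not file10.csv, since [0-9] matches exactly one character.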

Range Expansion (Brace Patterns)

Doris supports numeric range expansion using brace patterns {start..end}:

| Pattern | Expansion | Matches |
|---------|-----------|---------|
| {1..3} | {1,2,3} | 1, 2, 3 |
| {01..05} | {1,2,3,4,5} | 1, 2, 3, 4, 5 (leading zeros are NOT preserved) |
| {3..1} | {1,2,3} | 1, 2, 3 (reverse ranges supported) |
| {a,b,c} | {a,b,c} | a, b, c (enumeration) |
| {1..3,5,7..9} | {1,2,3,5,7,8,9} | Mixed ranges and values |
Note
  • Doris tries to match as many files as possible. Invalid parts in brace expressions are silently skipped, and valid parts are still expanded. For example, file_{a..b,-1..3,4..5} will match file_4 and file_5 (the invalid a..b and negative range -1..3 are skipped, but 4..5 is expanded normally).
  • If the entire range is negative (e.g., {-1..2}), the range is skipped. If mixed with valid ranges (e.g., {-1..2,1..3}), only the valid range 1..3 is expanded.
  • When using comma-separated values with ranges, only numbers are allowed. For example, in {1..4,a}, the non-numeric a will be ignored, resulting in {1,2,3,4}.
  • Pure enumeration patterns like {a,b,c} (without .. ranges) are passed directly to glob matching and work as expected.

Combining Patterns

Multiple patterns can be combined in a single path:

s3://bucket/data_{1..3}/file_*.csv

This matches:

  • s3://bucket/data_1/file_a.csv
  • s3://bucket/data_1/file_b.csv
  • s3://bucket/data_2/file_a.csv
  • ... and so on
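
As a complete S3 TVF call, the combined pattern looks like this (a sketch; the bucket name and credentials are placeholders):

SELECT * FROM S3(
"uri" = "s3://bucket/data_{1..3}/file_*.csv",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "csv"
);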

Examples

S3 TVF Examples

Match all CSV files in a directory:

SELECT * FROM S3(
"uri" = "s3://my-bucket/data/*.csv",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "csv"
);

Match files with numeric range:

SELECT * FROM S3(
"uri" = "s3://my-bucket/logs/data_{1..10}.csv",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "csv"
);

Match files in date-partitioned directories:

SELECT * FROM S3(
"uri" = "s3://my-bucket/logs/year=2024/month=*/day=*/data.parquet",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "parquet"
);

Zero-Padded Directories

For zero-padded directory names like month=01, month=02, use wildcards (*) instead of range patterns. The pattern {01..12} expands to {1,2,...,12} which won't match month=01.
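
Alternatively, since ? matches exactly one character, a pattern like the following matches exactly the two-character values month=01 through month=12 (note it would also match any other two-character value):

"uri" = "s3://my-bucket/logs/year=2024/month=??/day=??/data.parquet"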

Match numbered file splits (e.g., Spark output). Because brace ranges drop leading zeros (see the note above), part-{00000..00099} expands to part-{0,1,...,99} and would NOT match zero-padded names like part-00000; use a wildcard instead:

SELECT * FROM S3(
"uri" = "s3://my-bucket/output/part-*.csv",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "csv"
);

Broker Load Examples

Load all CSV files matching a pattern:

LOAD LABEL db.label_wildcard
(
DATA INFILE("s3://my-bucket/data/file_*.csv")
INTO TABLE my_table
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(col1, col2, col3)
)
WITH S3 (
"provider" = "S3",
"AWS_ENDPOINT" = "s3.us-west-2.amazonaws.com",
"AWS_ACCESS_KEY" = "xxx",
"AWS_SECRET_KEY" = "xxx",
"AWS_REGION" = "us-west-2"
);

Load files using numeric range expansion:

LOAD LABEL db.label_range
(
DATA INFILE("s3://my-bucket/exports/data_{1..5}.csv")
INTO TABLE my_table
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(col1, col2, col3)
)
WITH S3 (
"provider" = "S3",
"AWS_ENDPOINT" = "s3.us-west-2.amazonaws.com",
"AWS_ACCESS_KEY" = "xxx",
"AWS_SECRET_KEY" = "xxx",
"AWS_REGION" = "us-west-2"
);

Load from HDFS with wildcards:

LOAD LABEL db.label_hdfs_wildcard
(
DATA INFILE("hdfs://namenode:8020/user/data/2024-*/*.csv")
INTO TABLE my_table
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(col1, col2, col3)
)
WITH HDFS (
"fs.defaultFS" = "hdfs://namenode:8020",
"hadoop.username" = "user"
);

Load from HDFS with numeric range:

LOAD LABEL db.label_hdfs_range
(
DATA INFILE("hdfs://namenode:8020/data/file_{1..3,5,7..9}.csv")
INTO TABLE my_table
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(col1, col2, col3)
)
WITH HDFS (
"fs.defaultFS" = "hdfs://namenode:8020",
"hadoop.username" = "user"
);

INSERT INTO SELECT Examples

Insert from S3 with wildcards:

INSERT INTO my_table (col1, col2, col3)
SELECT * FROM S3(
"uri" = "s3://my-bucket/data/part-*.parquet",
"s3.access_key" = "xxx",
"s3.secret_key" = "xxx",
"s3.region" = "us-east-1",
"format" = "parquet"
);

Performance Considerations

Use Specific Prefixes

Doris extracts the longest non-wildcard prefix from your path pattern to optimize S3/HDFS listing operations. More specific prefixes result in faster file discovery.

-- Good: specific prefix reduces listing scope
"uri" = "s3://bucket/data/2024/01/15/*.csv"

-- Less optimal: broad wildcard at early path segment
"uri" = "s3://bucket/data/**/file.csv"

Prefer Range Patterns for Known Sequences

When you know the exact file numbering, use range patterns instead of wildcards. Because brace ranges drop leading zeros, this suits non-padded names:

-- Better: explicit range matches exactly part-1.csv through part-100.csv
"uri" = "s3://bucket/data/part-{1..100}.csv"

-- Less optimal: wildcard matches unknown files
"uri" = "s3://bucket/data/part-*.csv"

Avoid Deep Recursive Wildcards

Deep recursive patterns like ** can cause slow file listing on large buckets:

-- Avoid when possible
"uri" = "s3://bucket/**/*.csv"

-- Prefer explicit path structure
"uri" = "s3://bucket/data/year=*/month=*/day=*/*.csv"

Troubleshooting

| Issue | Cause | Solution |
|-------|-------|----------|
| No files found | Pattern doesn't match any files | Verify the path and pattern syntax; test with a single file first |
| Slow file listing | Wildcard too broad or too many files | Use a more specific prefix; limit wildcard scope |
| Invalid URI error | Malformed path syntax | Check the URI scheme and bucket name format |
| Access denied | Credentials or permissions issue | Verify S3/HDFS credentials and bucket policies |

Testing Path Patterns

Before running a large load job, test your pattern with a limited query:

-- Test if files exist and match pattern
SELECT * FROM S3(
"uri" = "s3://bucket/your/pattern/*.csv",
...
) LIMIT 1;

Use DESC FUNCTION to verify the schema of matched files:

DESC FUNCTION S3(
"uri" = "s3://bucket/your/pattern/*.csv",
...
);