Skip to main content

Huawei Cloud OBS

Apache Doris supports importing data files directly from Huawei Cloud Object Storage Service (OBS). This is commonly used for offline data migration, historical data backfill, and ingesting data lake files into the warehouse. This document uses CSV files as an example to walk through the complete steps for both import methods.

Choosing an approach

Doris provides two ways to import files from Huawei Cloud OBS. Choose based on data size and timeliness requirements:

Import methodTypeApplicable scenarioReference
S3 LoadAsynchronous importLarge-batch data import; jobs run in the background and support retry on failureBroker Load Manual
TVF (Table Value Function)Synchronous importSmall to medium data volumes, ad hoc queries, used together with INSERT INTO ... SELECTThis document, Import using TVF

Note: Both methods access OBS through the S3 protocol and require the Endpoint, Region, Access Key, and Secret Key provided by OBS.

Prerequisites

Before starting the import, make sure the following are ready:

  • An accessible Huawei Cloud OBS bucket with the data files to be imported already uploaded.
  • The OBS Endpoint (such as obs.cn-north-1.myhuaweicloud.com) and the corresponding Region (such as cn-north-1).
  • An Access Key (AK) and Secret Key (SK) with read permission on the bucket.
  • A deployed Apache Doris cluster, and the current user has the LOAD_PRIV privilege on the target database and table.

Import using S3 Load

S3 Load is an asynchronous import method. Once a job is submitted, Doris schedules and executes it in the background, which is suitable for large data volumes. For detailed parameters and status queries, see the Broker Load Manual.

Step 1: Prepare the data

Upload the CSV file s3load_example.csv to Huawei Cloud OBS. The file content is as follows:

1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64

Step 2: Create a table in Doris

Create a target table in Doris that matches the file structure:

CREATE TABLE test_s3load(
user_id BIGINT NOT NULL COMMENT "user id",
name VARCHAR(20) COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

Step 3: Import data with S3 Load

Submit an S3 Load job to import the CSV file from OBS into the target table:

LOAD LABEL s3_load_2022_04_01
(
DATA INFILE("s3://your_bucket_name/s3load_example.csv")
INTO TABLE test_s3load
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(user_id, name, age)
)
WITH S3
(
"provider" = "OBS",
"s3.endpoint" = "obs.cn-north-1.myhuaweicloud.com",
"s3.region" = "cn-north-1",
"s3.access_key" = "<your-ak>",
"s3.secret_key" = "<your-sk>"
)
PROPERTIES
(
"timeout" = "3600"
);

Key parameters:

ParameterDescription
DATA INFILEThe OBS file path, using the s3:// protocol prefix
providerMust be fixed as OBS, indicating the object storage type is Huawei Cloud
s3.endpointThe Endpoint of the OBS service, set according to the region of the bucket
s3.regionThe Region where the OBS bucket is located
s3.access_key / s3.secret_keyThe AK/SK used for authentication
timeoutThe job timeout, in seconds

Step 4: Verify the imported data

After the import completes, query the target table to verify that the data is written correctly:

SELECT * FROM test_s3load;

Expected output:

mysql> select * from test_s3load;
+---------+-----------+------+
| user_id | name | age |
+---------+-----------+------+
| 5 | Ava | 17 |
| 10 | Liam | 64 |
| 7 | Sophia | 32 |
| 9 | Emma | 37 |
| 1 | Emily | 25 |
| 4 | Alexander | 60 |
| 2 | Benjamin | 35 |
| 3 | Olivia | 28 |
| 6 | William | 69 |
| 8 | James | 64 |
+---------+-----------+------+
10 rows in set (0.04 sec)

Import using TVF

The TVF (Table Value Function) exposes a file on OBS as a virtual table returned by a table function, so you can complete a synchronous import directly with INSERT INTO ... SELECT. This makes it convenient to combine with SQL expressions for lightweight ETL.

Step 1: Prepare the data

Upload the same CSV file s3load_example.csv used in S3 Load to OBS:

1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64

Step 2: Create a table in Doris

CREATE TABLE test_s3load(
user_id BIGINT NOT NULL COMMENT "user id",
name VARCHAR(20) COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

Step 3: Import data with TVF

Read the OBS file through the S3 table function and write it into the target table with INSERT INTO ... SELECT:

INSERT INTO test_s3load
SELECT * FROM S3
(
"uri" = "s3://your_bucket_name/s3load_example.csv",
"format" = "csv",
"provider" = "OBS",
"s3.endpoint" = "obs.cn-north-1.myhuaweicloud.com",
"s3.region" = "cn-north-1",
"s3.access_key" = "<your-ak>",
"s3.secret_key" = "<your-sk>",
"column_separator" = ",",
"csv_schema" = "user_id:int;name:string;age:int"
);

Key parameters:

ParameterDescription
uriThe OBS file URI, using the s3:// protocol prefix
formatThe file format; use csv for CSV files
providerMust be fixed as OBS
s3.endpoint / s3.regionThe OBS Endpoint and Region
s3.access_key / s3.secret_keyThe AK/SK required for OBS authentication
column_separatorThe field delimiter; the default for CSV is ,
csv_schemaThe CSV column type definition, in the format column_name:type;column_name:type;...

Step 4: Verify the imported data

SELECT * FROM test_s3load;

Expected output:

mysql> select * from test_s3load;
+---------+-----------+------+
| user_id | name | age |
+---------+-----------+------+
| 5 | Ava | 17 |
| 10 | Liam | 64 |
| 7 | Sophia | 32 |
| 9 | Emma | 37 |
| 1 | Emily | 25 |
| 4 | Alexander | 60 |
| 2 | Benjamin | 35 |
| 3 | Olivia | 28 |
| 6 | William | 69 |
| 8 | James | 64 |
+---------+-----------+------+
10 rows in set (0.04 sec)

FAQ

Q1: How should you choose between S3 Load and TVF?

  • Large data volume, asynchronous execution, and retry on failure are needed: choose S3 Load.
  • Small data volume, immediate results are needed, or lightweight transformation with SQL expressions during import is required: choose TVF.

Q2: Why must the provider parameter be set to OBS?

The provider parameter identifies the vendor type of the underlying object storage. Although Huawei Cloud OBS is compatible with the S3 protocol, it differs in details such as signing. It must be explicitly declared as OBS so that Doris can use the corresponding access adapter.

Q3: How do you find the OBS Endpoint and Region?

You can find them on the bucket overview page in the Huawei Cloud OBS console. A typical Endpoint looks like obs.<region>.myhuaweicloud.com. For example, the cn-north-1 region corresponds to obs.cn-north-1.myhuaweicloud.com.

Q4: What file formats are supported in addition to CSV?

Both S3 Load and TVF support common formats such as csv, parquet, orc, and json. Specify the format with the FORMAT AS or format parameter.

Troubleshooting

SymptomPossible causeSuggested action
Error Access Denied or 403Incorrect AK/SK, or the account does not have read permission on the bucketVerify the AK/SK in the Huawei Cloud console and grant the account read permission on the OBS bucket
Error endpoint is invalidEndpoint is misspelled or does not match the regionConfirm the region of the bucket and use the Endpoint for that region
Import job stays in PENDING stateThe cluster is busy or import concurrency is limitedUse SHOW LOAD to check the job status, and adjust concurrency or wait for resources to be released if necessary
TVF reports a column count or type mismatchcsv_schema does not match the actual column structure of the fileVerify the column order, column count, and data types of the file, and update csv_schema accordingly
Import times outThe file is large and the default timeout is insufficientIncrease timeout in PROPERTIES (in seconds)