跳到主要内容

S3

描述

S3 表函数(table-valued-function, tvf)可以让用户像访问关系表格式数据一样,读取并访问 S3 兼容的对象存储上的文件内容。目前支持 csv/csv_with_names/csv_with_names_and_types/json/parquet/orc 文件格式。

语法

S3(
"uri" = "<uri>",
"s3.access_key" = "<s3_access_key>",
"s3.secret_key" = "<s3_secret_key>",
"s3.region" = "<s3_region>",
"s3.endpoint" = "<s3_endpoint>",
"format" = "<format>"
[, "<optional_property_key>" = "<optional_property_value>" [, ...] ]
)

必选参数

参数描述
uri用于访问 S3 的 URI,该函数会根据 use_path_style 参数来决定使用路径样式(Path Style)还是虚拟托管样式(Virtual-hosted Style)进行访问
s3.access_key访问 S3 的访问密钥
s3.secret_key访问 S3 的秘密密钥
s3.regionS3 存储所在的区域
s3.endpointS3 存储的端点地址
format文件格式,支持 csv/csv_with_names/csv_with_names_and_types/json/parquet/orc

可选参数

参数默认值描述备注
s3.session_token-S3 session token
use_path_stylefalseS3 SDK 默认使用 Virtual-hosted Style 方式。但某些对象存储系统可能没开启或不支持 Virtual-hosted Style 方式的访问,此时可以添加 use_path_style 参数来强制使用 Path Style 方式。例如 MinIO 默认情况下只允许 Path Style 访问方式,所以在访问 MinIO 时需要设置 use_path_style=true
force_parsing_by_standard_urifalse用于强制将非标准 URI 解析为标准 URI
column_separator\t列分割符
line_delimiter\n行分割符
compress_typeUNKNOWN压缩类型。目前支持 UNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE/SNAPPYBLOCK,将根据 uri 的后缀自动推断类型
read_json_by_line"true"适用于 JSON 格式的导入JSON Load
strip_outer_array"false"适用于 JSON 格式的导入JSON Load
json_root适用于 JSON 格式的导入JSON Load
jsonpaths适用于 JSON 格式的导入JSON Load
num_as_stringfalse适用于 JSON 格式的导入JSON Load
fuzzy_parsefalse适用于 JSON 格式的导入
trim_double_quotesfalse用于 CSV 格式的导入,去除每个字段最外层的双引号
skip_lines0跳过 CSV 文件的前几行。在使用 csv_with_namescsv_with_names_and_types 格式时该参数无效
path_partition_keys-指定文件路径中携带的分区列名。例如路径为 /path/to/city=beijing/date="2023-07-09",则填写 path_partition_keys="city,date",将自动从路径中读取相应的列名和列值进行导入
resource-指定 Resource 名称。S3 TVF 可以利用已有的 S3 Resource 来直接访问 S3。创建 S3 Resource 的方法可参照 CREATE-RESOURCE。该功能自 2.1.4 版本开始支持。
enable_mapping_varbinaryfalse在读取 PARQUET/ORC 时将 BYTE_ARRAY 类型映射为 STRING,开启后则映射为 VARBINARY 类型。4.0.3 版本开始支持

注意事项

  1. 对于 AWS S3,标准 URI 样式有以下几种:

    • AWS Client Style(Hadoop S3 Style)

      s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88
    • Virtual Host Style

      https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
    • Path Style

      https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88

    除了支持以上三个标准常见的 URI 样式,还支持其他一些 URI 样式(可能不常见):

    • Virtual Host AWS Client (Hadoop S3) Mixed Style

      s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
    • Path AWS Client (Hadoop S3) Mixed Style

      s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88

    详细使用案例可参考下方的示例。

  2. 直接查询 TVF 或基于该 TVF 创建 View,需要拥有该 Resource 的 USAGE 权限;查询基于 TVF 创建的 View,只需要该 View 的 SELECT 权限。

示例

  • 读取并访问 S3 兼容的对象存储上的 CSV 格式文件

    select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style" = "true") order by c1;
  • 可以配合 desc function 使用

    desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style" = "true");
  • 不同 URL Schema 的写法

    http://https:// 使用示例:

    -- 注意 URI Bucket 写法以及 use_path_style 参数设置,HTTP 同理。
    -- 由于设置了 "use_path_style"="true",所以将采用 Path Style 方式访问 S3。
    select * from s3(
    "uri" = "https://endpoint/bucket/file/student.csv",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "endpoint",
    "s3.region" = "region",
    "format" = "csv",
    "use_path_style"="true");

    -- 注意 URI Bucket 写法以及 use_path_style 参数设置,HTTP 同理。
    -- 由于设置了 "use_path_style"="false",所以将采用 Virtual-hosted Style 方式访问 S3。
    select * from s3(
    "uri" = "https://bucket.endpoint/bucket/file/student.csv",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "endpoint",
    "s3.region" = "region",
    "format" = "csv",
    "use_path_style"="false");

    -- 阿里云 OSS 和腾讯云 COS 采用 Virtual-hosted Style 方式访问 S3。
    -- OSS
    select * from s3(
    "uri" = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "oss-cn-beijing.aliyuncs.com",
    "s3.region" = "oss-cn-beijing",
    "format" = "parquet",
    "use_path_style" = "false");
    -- COS
    select * from s3(
    "uri" = "https://example-bucket.cos.ap-hongkong.myqcloud.com/your-folder/file.parquet",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "cos.ap-hongkong.myqcloud.com",
    "s3.region" = "ap-hongkong",
    "format" = "parquet",
    "use_path_style" = "false");

    -- MinIO
    select * from s3(
    "uri" = "s3://bucket/file.csv",
    "s3.endpoint" = "http://172.21.0.101:9000",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.region" = "us-east-1",
    "format" = "csv"
    );

    -- 百度云 BOS 采用兼容 S3 协议的 Virtual-hosted Style 方式访问 S3。
    -- BOS
    select * from s3(
    "uri" = "https://example-bucket.s3.bj.bcebos.com/your-folder/file.parquet",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.region" = "bj",
    "s3.endpoint" = "http://bj.bcebos.com",
    "format" = "parquet",
    "use_path_style" = "false");
  • s3:// 使用示例:

    -- 注意 URI Bucket 写法,无需设置 use_path_style 参数。
    -- 将采用 Virtual-hosted Style 方式访问 S3。
    select * from s3(
    "uri" = "s3://bucket/file/student.csv",
    "s3.endpoint"= "endpont",
    "s3.region"= "region",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "csv");
  • 其他支持的 URI 风格示例:

    -- Virtual Host AWS Client (Hadoop S3) Mixed Style。通过设置 use_path_style=false 以及 force_parsing_by_standard_uri=true 来使用。
    select * from s3(
    "URI" = "s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint"= "endpont",
    "s3.region"= "region",
    "format" = "csv",
    "use_path_style"="false",
    "force_parsing_by_standard_uri"="true");

    -- Path AWS Client (Hadoop S3) Mixed Style。通过设置 use_path_style=true 以及 force_parsing_by_standard_uri=true 来使用。
    select * from s3(
    "URI" = "s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint"= "endpont",
    "s3.region"= "region",
    "format" = "csv",
    "use_path_style"="true",
    "force_parsing_by_standard_uri"="true");
  • CSV 格式

    由于 S3 table-valued-function 事先并不知道 Table Schema,所以会先读取一遍文件来解析 Table Schema。

    csv 格式:S3 table-valued-function 读取 S3 上的文件并作为 CSV 文件来处理,读取文件中的第一行用于解析 Table Schema。文件第一行的列个数 n 将作为 Table Schema 的列个数,Table Schema 的列名则自动命名为 c1, c2, ..., cn,列类型都设置为 String。举例:

    student1.csv 文件内容为:

    1,ftw,12
    2,zs,18
    3,ww,20

    使用 S3 TVF:

    select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style" = "true") order by c1;

    +------+------+------+
    | c1 | c2 | c3 |
    +------+------+------+
    | 1 | ftw | 12 |
    | 2 | zs | 18 |
    | 3 | ww | 20 |
    +------+------+------+

    可以配合 desc function S3() 来查看 Table Schema:

    Desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style" = "true");

    +-------+------+------+-------+---------+-------+
    | Field | Type | Null | Key | Default | Extra |
    +-------+------+------+-------+---------+-------+
    | c1 | TEXT | Yes | false | NULL | NONE |
    | c2 | TEXT | Yes | false | NULL | NONE |
    | c3 | TEXT | Yes | false | NULL | NONE |
    +-------+------+------+-------+---------+-------+
  • csv_with_names 格式

    csv_with_names 格式:解析文件的第一行作为 Table Schema 的列个数和列名,列类型则都设置为 String。举例:

    student_with_names.csv 文件内容为:

    id,name,age
    1,ftw,12
    2,zs,18
    3,ww,20

    使用 S3 TVF:

    select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv_with_names",
    "use_path_style" = "true") order by id;

    +------+------+------+
    | id | name | age |
    +------+------+------+
    | 1 | ftw | 12 |
    | 2 | zs | 18 |
    | 3 | ww | 20 |
    +------+------+------+
  • 同样配合 desc function S3() 可查看 Table Schema

    desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv_with_names",
    "use_path_style" = "true");

    +-------+------+------+-------+---------+-------+
    | Field | Type | Null | Key | Default | Extra |
    +-------+------+------+-------+---------+-------+
    | id | TEXT | Yes | false | NULL | NONE |
    | name | TEXT | Yes | false | NULL | NONE |
    | age | TEXT | Yes | false | NULL | NONE |
    +-------+------+------+-------+---------+-------+
  • csv_with_names_and_types 格式

    csv_with_names_and_types 格式:目前暂不支持从 CSV 文件中解析出列类型。使用该格式时,S3 TVF 会解析文件的第一行作为 Table Schema 的列个数和列名,列类型则都设置为 String,同时将忽略该文件的第二行。

    student_with_names_and_types.csv 文件内容为:

    id,name,age
    INT,STRING,INT
    1,ftw,12
    2,zs,18
    3,ww,20

    使用 S3 TVF:

    select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv_with_names_and_types",
    "use_path_style" = "true") order by id;

    +------+------+------+
    | id | name | age |
    +------+------+------+
    | 1 | ftw | 12 |
    | 2 | zs | 18 |
    | 3 | ww | 20 |
    +------+------+------+

    同样配合 desc function S3() 可查看 Table Schema:

    Desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv_with_names_and_types",
    "use_path_style" = "true");

    +-------+------+------+-------+---------+-------+
    | Field | Type | Null | Key | Default | Extra |
    +-------+------+------+-------+---------+-------+
    | id | TEXT | Yes | false | NULL | NONE |
    | name | TEXT | Yes | false | NULL | NONE |
    | age | TEXT | Yes | false | NULL | NONE |
    +-------+------+------+-------+---------+-------+
  • JSON 格式

    json 格式:JSON 格式涉及较多的可选参数,各个参数的意义可参考:Json Load。S3 TVF 查询 JSON 格式文件时根据 json_rootjsonpaths 参数定位到一个 JSON 对象,将该对象中的 key 作为 Table Schema 的列名,列类型都设置为 String。举例:

    data.json 文件内容:

    [{"id":1, "name":"ftw", "age":18}]
    [{"id":2, "name":"xxx", "age":17}]
    [{"id":3, "name":"yyy", "age":19}]

    使用 S3 TVF 查询:

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/data.json",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "json",
    "strip_outer_array" = "true",
    "read_json_by_line" = "true",
    "use_path_style"="true");
    +------+------+------+
    | id | name | age |
    +------+------+------+
    | 1 | ftw | 18 |
    | 2 | xxx | 17 |
    | 3 | yyy | 19 |
    +------+------+------+

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/data.json",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "json",
    "strip_outer_array" = "true",
    "jsonpaths" = "[\"$.id\", \"$.age\"]",
    "use_path_style"="true");

    +------+------+
    | id | age |
    +------+------+
    | 1 | 18 |
    | 2 | 17 |
    | 3 | 19 |
    +------+------+
  • Parquet 格式

    parquet 格式:S3 TVF 支持从 Parquet 文件中解析出 Table Schema 的列名和列类型。举例:

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "parquet",
    "use_path_style"="true") limit 5;

    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
    | p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
    | 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
    | 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
    | 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
    | 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
    | 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
    MySQL [(none)]> desc function s3(
    "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "parquet",
    "use_path_style"="true");

    +---------------+--------------+------+-------+---------+-------+
    | Field | Type | Null | Key | Default | Extra |
    +---------------+--------------+------+-------+---------+-------+
    | p_partkey | INT | Yes | false | NULL | NONE |
    | p_name | TEXT | Yes | false | NULL | NONE |
    | p_mfgr | TEXT | Yes | false | NULL | NONE |
    | p_brand | TEXT | Yes | false | NULL | NONE |
    | p_type | TEXT | Yes | false | NULL | NONE |
    | p_size | INT | Yes | false | NULL | NONE |
    | p_container | TEXT | Yes | false | NULL | NONE |
    | p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE |
    | p_comment | TEXT | Yes | false | NULL | NONE |
    +---------------+--------------+------+-------+---------+-------+
  • ORC 格式

    orc 格式:使用方法与 parquet 格式一致,将 format 参数设置为 orc 即可。

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/test.snappy.orc",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "orc",
    "use_path_style"="true") limit 5;

    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
    | p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
    | 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
    | 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
    | 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
    | 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
    | 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
  • URI 包含通配符

    URI 可以使用通配符来读取多个文件。注意:如果使用通配符,需要保证各个文件的格式一致(尤其是 csvcsv_with_namescsv_with_names_and_types 属于不同的格式),S3 TVF 会使用第一个文件来解析 Table Schema。

    如下两个 CSV 文件:

    // file1.csv
    1,aaa,18
    2,qqq,20
    3,qwe,19

    // file2.csv
    5,cyx,19
    6,ftw,21

    可以在 URI 上使用通配符来查询:

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/file*.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style"="true");

    +------+------+------+
    | c1 | c2 | c3 |
    +------+------+------+
    | 1 | aaa | 18 |
    | 2 | qqq | 20 |
    | 3 | qwe | 19 |
    | 5 | cyx | 19 |
    | 6 | ftw | 21 |
    +------+------+------+
  • 配合 insert intocast 使用 S3 TVF

    -- 创建 Doris 内部表
    CREATE TABLE IF NOT EXISTS ${testTable}
    (
    id int,
    name varchar(50),
    age int
    )
    COMMENT "my first table"
    DISTRIBUTED BY HASH(id) BUCKETS 32
    PROPERTIES("replication_num" = "1");

    -- 使用 S3 插入数据
    insert into ${testTable} (id,name,age)
    select cast (id as INT) as id, name, cast (age as INT) as age
    from s3(
    "uri" = "${uri}",
    "s3.access_key"= "${ak}",
    "s3.secret_key" = "${sk}",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "${format}",
    "strip_outer_array" = "true",
    "read_json_by_line" = "true",
    "use_path_style" = "true");