
HDFS

Description

The HDFS table-valued function (TVF) lets users read and access file contents on HDFS storage, just like accessing a relational table. It currently supports the csv/csv_with_names/csv_with_names_and_types/json/parquet/orc/avro file formats.

Syntax

HDFS(
  "uri" = "<uri>",
  "fs.defaultFS" = "<fs_defaultFS>",
  "hadoop.username" = "<hadoop_username>",
  "format" = "<format>"
  [, "<optional_property_key>" = "<optional_property_value>" [, ...] ]
);

Required Parameters

| Parameter | Description |
|---|---|
| uri | The URI for accessing HDFS. If the URI path does not exist or the file is empty, the HDFS TVF returns an empty result set. |
| fs.defaultFS | The default file system URI for HDFS. |
| hadoop.username | The HDFS username. Required; can be any non-empty string. |
| format | The file format. Currently supports csv/csv_with_names/csv_with_names_and_types/json/parquet/orc/avro. |
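
A minimal sketch using only the required parameters; the host, port, path, and username below are placeholders rather than values from this document:

    select * from hdfs(
      "uri" = "hdfs://namenode_host:8020/path/to/data.parquet",
      "fs.defaultFS" = "hdfs://namenode_host:8020",
      "hadoop.username" = "hadoop_user",
      "format" = "parquet");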

Optional Parameters

optional_property_key in the syntax above can be any of the parameters listed below, and optional_property_value is the corresponding value.

| Parameter | Description | Remarks |
|---|---|---|
| hadoop.security.authentication | HDFS security authentication type | |
| hadoop.username | Alternative HDFS username | |
| hadoop.kerberos.principal | Kerberos principal | |
| hadoop.kerberos.keytab | Kerberos keytab | |
| dfs.client.read.shortcircuit | Enable short-circuit read | |
| dfs.domain.socket.path | Domain socket path | |
| dfs.nameservices | The nameservice for HA mode | |
| dfs.ha.namenodes.your-nameservices | NameNode configuration in HA mode | |
| dfs.namenode.rpc-address.your-nameservices.your-namenode | The RPC address of the NameNode | |
| dfs.client.failover.proxy.provider.your-nameservices | The proxy provider for failover | |
| column_separator | Column separator. Default is \t | |
| line_delimiter | Line delimiter. Default is \n | |
| compress_type | Supported types: UNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE/SNAPPYBLOCK. Default is UNKNOWN; the type is inferred automatically from the URI suffix. | |
| read_json_by_line | For JSON format imports. Default is true | Reference: JSON Load |
| strip_outer_array | For JSON format imports. Default is false | Reference: JSON Load |
| json_root | For JSON format imports. Default is empty | Reference: JSON Load |
| json_paths | For JSON format imports. Default is empty | Reference: JSON Load |
| num_as_string | For JSON format imports. Default is false | Reference: JSON Load |
| fuzzy_parse | For JSON format imports. Default is false | Reference: JSON Load |
| trim_double_quotes | For CSV format imports, boolean type. Default is false. If true, removes the outermost double quotes from each field. | |
| skip_lines | For CSV format imports, integer type. Default is 0. Skips the first N lines of the CSV file. Ignored when format is csv_with_names or csv_with_names_and_types. | |
| path_partition_keys | Specifies the partition column names encoded in the file path. For example, for /path/to/city=beijing/date="2023-07-09", set path_partition_keys="city,date"; the corresponding column names and values are then read automatically from the path during import. | |
| resource | Specifies the resource name. The HDFS TVF can access HDFS directly through an existing HDFS resource. See CREATE-RESOURCE for creating an HDFS resource. | Supported in version 2.1.4 and later. |
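
To illustrate how the optional parameters combine, the hedged sketch below reads a semicolon-separated CSV file, skips the first two lines, strips outer double quotes, and extracts the city partition column from the path; every host, path, and username is hypothetical:

    select * from hdfs(
      "uri" = "hdfs://namenode_host:8020/path/to/city=beijing/data.csv",
      "fs.defaultFS" = "hdfs://namenode_host:8020",
      "hadoop.username" = "hadoop_user",
      "format" = "csv",
      "column_separator" = ";",
      "skip_lines" = "2",
      "trim_double_quotes" = "true",
      "path_partition_keys" = "city");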

Access Control Requirements

| Privilege | Object | Notes |
|---|---|---|
| USAGE_PRIV | table | |
| SELECT_PRIV | table | |
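
For instance, granting SELECT_PRIV might look like the sketch below; the database, table, and user names are placeholders, not names from this document:

    -- hypothetical names; adjust to your own database, table, and user
    GRANT SELECT_PRIV ON example_db.example_table TO 'example_user'@'%';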

Examples

  • Read and access csv format files on HDFS storage:

    select * from hdfs(
      "uri" = "hdfs://127.0.0.1:8424/user/doris/csv_format_test/student.csv",
      "fs.defaultFS" = "hdfs://127.0.0.1:8424",
      "hadoop.username" = "doris",
      "format" = "csv");

    +------+---------+------+
    | c1   | c2      | c3   |
    +------+---------+------+
    | 1    | alice   | 18   |
    | 2    | bob     | 20   |
    | 3    | jack    | 24   |
    | 4    | jackson | 19   |
    | 5    | liming  | 18   |
    +------+---------+------+
  • Read and access csv format files on HDFS storage in HA mode:

    select * from hdfs(
      "uri" = "hdfs://127.0.0.1:8424/user/doris/csv_format_test/student.csv",
      "fs.defaultFS" = "hdfs://127.0.0.1:8424",
      "hadoop.username" = "doris",
      "format" = "csv",
      "dfs.nameservices" = "my_hdfs",
      "dfs.ha.namenodes.my_hdfs" = "nn1,nn2",
      "dfs.namenode.rpc-address.my_hdfs.nn1" = "namenode01:8020",
      "dfs.namenode.rpc-address.my_hdfs.nn2" = "namenode02:8020",
      "dfs.client.failover.proxy.provider.my_hdfs" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

    +------+---------+------+
    | c1   | c2      | c3   |
    +------+---------+------+
    | 1    | alice   | 18   |
    | 2    | bob     | 20   |
    | 3    | jack    | 24   |
    | 4    | jackson | 19   |
    | 5    | liming  | 18   |
    +------+---------+------+
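  • Read files on a Kerberos-secured cluster. This sketch is not from the original examples: it combines the Kerberos parameters documented above, and the principal and keytab path are placeholders:

    select * from hdfs(
      "uri" = "hdfs://127.0.0.1:8424/user/doris/csv_format_test/student.csv",
      "fs.defaultFS" = "hdfs://127.0.0.1:8424",
      "hadoop.username" = "doris",
      "format" = "csv",
      "hadoop.security.authentication" = "kerberos",
      -- hypothetical principal and keytab path; substitute your own
      "hadoop.kerberos.principal" = "doris@EXAMPLE.COM",
      "hadoop.kerberos.keytab" = "/etc/security/keytabs/doris.keytab");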
  • Can be used with the desc function to inspect the file schema:

    desc function hdfs(
      "uri" = "hdfs://127.0.0.1:8424/user/doris/csv_format_test/student_with_names.csv",
      "fs.defaultFS" = "hdfs://127.0.0.1:8424",
      "hadoop.username" = "doris",
      "format" = "csv_with_names");