Select Into Outfile
This document introduces how to use the SELECT INTO OUTFILE
command to export query results.
For a detailed introduction to the SELECT INTO OUTFILE
command, refer to: SELECT INTO OUTFILE.
Overview
The SELECT INTO OUTFILE
command exports the result data of the SELECT
statement to a target storage system, such as object storage, HDFS, or the local file system, in a specified file format.
SELECT INTO OUTFILE
is a synchronous command, meaning it completes when the command returns. If successful, it returns information about the number, size, and paths of the exported files. If it fails, it returns error information.
For guidance on choosing between SELECT INTO OUTFILE
and EXPORT
, see the Export Overview.
Supported Export Formats
SELECT INTO OUTFILE
currently supports the following export formats:
- Parquet
- ORC
- CSV
- CSV with column names (
csv_with_names
) - CSV with column names and types (
csv_with_names_and_types
)
Compressed formats are not supported.
Example
mysql> SELECT * FROM tbl1 LIMIT 10 INTO OUTFILE "file:///home/work/path/result_";
+------------+-----------+----------+--------------------------------------------------------------------+
| FileNumber | TotalRows | FileSize | URL |
+------------+-----------+----------+--------------------------------------------------------------------+
| 1 | 2 | 8 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
+------------+-----------+----------+--------------------------------------------------------------------+
Explanation of the returned results:
- FileNumber: The number of generated files.
- TotalRows: The number of rows in the result set.
- FileSize: The total size of the exported files in bytes.
- URL: The prefix of the exported file paths. Multiple files will be numbered sequentially with suffixes
_0
,_1
, etc.
Export File Column Type Mapping
SELECT INTO OUTFILE
supports exporting to Parquet and ORC file formats. Parquet and ORC have their own data types, and Doris can automatically map its data types to corresponding Parquet and ORC data types. Refer to the "Export File Column Type Mapping" section in the Export Overview document for the specific mapping relationships.
Examples
Export to HDFS
Export query results to the hdfs://path/to/
directory, specifying the export format as PARQUET:
SELECT c1, c2, c3 FROM tbl
INTO OUTFILE "hdfs://${host}:${fileSystem_port}/path/to/result_"
FORMAT AS PARQUET
PROPERTIES
(
"fs.defaultFS" = "hdfs://ip:port",
"hadoop.username" = "hadoop"
);
If HDFS is configured for high availability, provide HA information, such as:
SELECT c1, c2, c3 FROM tbl
INTO OUTFILE "hdfs://HDFS8000871/path/to/result_"
FORMAT AS PARQUET
PROPERTIES
(
"fs.defaultFS" = "hdfs://HDFS8000871",
"hadoop.username" = "hadoop",
"dfs.nameservices" = "your-nameservices",
"dfs.ha.namenodes.your-nameservices" = "nn1,nn2",
"dfs.namenode.rpc-address.HDFS8000871.nn1" = "ip:port",
"dfs.namenode.rpc-address.HDFS8000871.nn2" = "ip:port",
"dfs.client.failover.proxy.provider.HDFS8000871" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);
If the Hadoop cluster is configured for high availability and Kerberos authentication is enabled, you can refer to the following SQL statement:
SELECT * FROM tbl
INTO OUTFILE "hdfs://path/to/result_"
FORMAT AS PARQUET
PROPERTIES
(
"fs.defaultFS"="hdfs://hacluster/",
"hadoop.username" = "hadoop",
"dfs.nameservices"="hacluster",
"dfs.ha.namenodes.hacluster"="n1,n2",
"dfs.namenode.rpc-address.hacluster.n1"="192.168.0.1:8020",
"dfs.namenode.rpc-address.hacluster.n2"="192.168.0.2:8020",
"dfs.client.failover.proxy.provider.hacluster"="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
"dfs.namenode.kerberos.principal"="hadoop/_HOST@REALM.COM",
"hadoop.security.authentication"="kerberos",
"hadoop.kerberos.principal"="doris_test@REALM.COM",
"hadoop.kerberos.keytab"="/path/to/doris_test.keytab"
);
Export to S3
Export query results to the S3 storage at s3://path/to/
directory, specifying the export format as ORC. Provide sk
, ak
, and other necessary information:
SELECT * FROM tbl
INTO OUTFILE "s3://path/to/result_"
FORMAT AS ORC
PROPERTIES(
"s3.endpoint" = "https://xxx",
"s3.region" = "ap-beijing",
"s3.access_key"= "your-ak",
"s3.secret_key" = "your-sk"
);
Export to Local File System
To export to the local file system, add
enable_outfile_to_local=true
infe.conf
and restart FE.
Export query results to the BE's file:///path/to/
directory, specifying the export format as CSV, with a comma as the column separator:
SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1
INTO OUTFILE "file:///path/to/result_"
FORMAT AS CSV
PROPERTIES(
"column_separator" = ","
);
Note: Exporting to local files is not suitable for public cloud users and is intended for private deployment users only. By default, users have full control over cluster nodes. Doris does not check the validity of the export path provided by the user. If the Doris process user does not have write permissions for the path, or the path does not exist, an error will be reported. Additionally, for security reasons, if a file with the same name already exists at the path, the export will fail. Doris does not manage exported local files or check disk space. Users need to manage these files themselves, including cleanup and other tasks.
Best Practices
Generate Export Success Indicator File
The SELECT INTO OUTFILE
command is synchronous, meaning that the task connection could be interrupted during SQL execution, leaving uncertainty about whether the export completed successfully or whether the data is complete. You can use the success_file_name
parameter to generate an indicator file upon successful export.
Similar to Hive, users can determine whether the export completed successfully and whether the files in the export directory are complete by checking for the presence of the file specified by the success_file_name
parameter.
For example, exporting the results of a SELECT
statement to Tencent Cloud COS s3://${bucket_name}/path/my_file_
, specifying the export format as CSV, and setting the success indicator file name to SUCCESS
:
SELECT k1, k2, v1 FROM tbl1 LIMIT 100000
INTO OUTFILE "s3://my_bucket/path/my_file_"
FORMAT AS CSV
PROPERTIES
(
"s3.endpoint" = "${endpoint}",
"s3.region" = "ap-beijing",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"column_separator" = ",",
"line_delimiter" = "\n",
"success_file_name" = "SUCCESS"
)
Upon completion, an additional file named SUCCESS
will be generated.
Concurrent Export
By default, the query results in the SELECT
section are aggregated to a single BE node, which exports data single-threadedly. However, in some cases (e.g., queries without an ORDER BY
clause), concurrent export can be enabled to have multiple BE nodes export data simultaneously, improving export performance.
Here’s an example demonstrating how to enable concurrent export:
- Enable the concurrent export session variable:
mysql> SET enable_parallel_outfile = true;
- Execute the export command:
mysql> SELECT * FROM demo.tbl
-> INTO OUTFILE "file:///path/to/ftw/export/exp_"
-> FORMAT AS PARQUET;
+------------+-----------+----------+-------------------------------------------------------------------------------+
| FileNumber | TotalRows | FileSize | URL |
+------------+-----------+----------+-------------------------------------------------------------------------------+
| 1 | 104494 | 7998308 | file:///127.0.0.1/path/to/exp_1f850179e684476b-9bf001a6bf96d7d3_ |
| 1 | 104984 | 8052491 | file:///127.0.0.1/path/to/exp_1f850179e684476b-9bf001a6bf96d7d5_ |
| 1 | 104345 | 7981406 | file:///127.0.0.1/path/to/exp_1f850179e684476b-9bf001a6bf96d7d1_ |
| 1 | 104034 | 7977301 | file:///127.0.0.1/path/to/exp_1f850179e684476b-9bf001a6bf96d7d4_ |
| 1 | 104238 | 7979757 | file:///127.0.0.1/path/to/exp_1f850179e684476b-9bf001a6bf96d7d2_ |
| 1 | 159450 | 11870222 | file:///127.0.0.1/path/to/exp_1f850179e684476b-9bf001a6bf96d7d0_ |
| 1 | 209691 | 16082100 | file:///127.0.0.1/path/to/exp_1f850179e684476b-9bf001a6bf96d7ce_ |
| 1 | 208769 | 16004096 | file:///127.0.0.1/path/to/exp_1f850179e684476b-9bf001a6bf96d7cf_ |
+------------+-----------+----------+-------------------------------------------------------------------------------+
With concurrent export successfully enabled, the result may consist of multiple rows, indicating that multiple threads exported data concurrently.
Adding an ORDER BY
clause to the query prevents concurrent export, as the top-level sorting node necessitates single-threaded export:
mysql> SELECT * FROM demo.tbl ORDER BY id
-> INTO OUTFILE "file:///path/to/ftw/export/exp_"
-> FORMAT AS PARQUET;
+------------+-----------+----------+-------------------------------------------------------------------------------+
| FileNumber | TotalRows | FileSize | URL |
+------------+-----------+----------+-------------------------------------------------------------------------------+
| 1 | 1100005 | 80664607 | file:///127.0.0.1/mnt/disk2/ftw/export/exp_20c5461055774128-826256c0cfb3d8fc_ |
+------------+-----------+----------+-------------------------------------------------------------------------------+
Here, the result is a single row, indicating no concurrent export was triggered.
Refer to the appendix for more details on concurrent export principles.
Clear Export Directory Before Exporting
SELECT * FROM tbl1
INTO OUTFILE "s3://my_bucket/export/my_file_"
FORMAT AS CSV
PROPERTIES
(
"s3.endpoint" = "${endpoint}",
"s3.region" = "region",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"column_separator" = ",",
"line_delimiter" = "\n",
"delete_existing_files" = "true"
)
If "delete_existing_files" = "true"
is set, the export job will first delete all files and directories under s3://my_bucket/export/
, then export data to that directory.
Note: To use the
delete_existing_files
parameter, addenable_delete_existing_files = true
tofe.conf
and restart FE. This parameter is potentially dangerous and should only be used in a testing environment.
Set Export File Size
SELECT * FROM tbl
INTO OUTFILE "s3://path/to/result_"
FORMAT AS ORC
PROPERTIES(
"s3.endpoint" = "https://xxx",
"s3.region" = "ap-beijing",
"s3.access_key"= "your-ak",
"s3.secret_key" = "your-sk",
"max_file_size" = "2048MB"
);
Specifying "max_file_size" = "2048MB"
ensures that the final file size does not exceed 2GB. If the total size exceeds 2GB, multiple files will be generated.
Considerations
- Export Data Volume and Efficiency
The
SELECT INTO OUTFILE
function executes a SQL query. Without concurrent export, a single BE node and thread export the query results. The total export time includes both the query execution time and the result set write-out time. Enabling concurrent export can reduce the export time. - Export Timeout
The export command shares the same timeout as the query. If the data volume is large and causes the export to timeout, you can extend the query timeout by setting the session variable
query_timeout
. - Export File Management
Doris does not manage exported files, whether successfully exported or remaining from failed exports. Users must handle these files themselves.
Additionally,
SELECT INTO OUTFILE
does not check for the existence of files or file paths. WhetherSELECT INTO OUTFILE
automatically creates paths or overwrites existing files depends entirely on the semantics of the remote storage system. - Empty Result Sets Exporting an empty result set still generates an empty file.
- File Splitting
File splitting ensures that a single row of data is stored completely in one file. Thus, the file size may not exactly equal
max_file_size
. - Non-visible Character Functions
For functions outputting non-visible characters (e.g., BITMAP, HLL types), CSV output is
\N
, and Parquet/ORC output is NULL. Currently, some geographic functions likeST_Point
output VARCHAR but with encoded binary characters, causing garbled output. UseST_AsText
for geographic functions.
Appendix
Concurrent Export Principles
Principle Overview
Doris is a high-performance, real-time analytical database based on the MPP (Massively Parallel Processing) architecture. MPP divides large datasets into small chunks and processes them in parallel across multiple nodes. Concurrent export in
SELECT INTO OUTFILE
leverages this parallel processing capability, allowing multiple BE nodes to export parts of the result set simultaneously.How to Determine Concurrent Export Eligibility
Ensure Session Variable is Enabled:
set enable_parallel_outfile = true;
Check Execution Plan with
EXPLAIN
:mysql> EXPLAIN SELECT ... INTO OUTFILE "s3://xxx" ...;
+-----------------------------------------------------------------------------+
| Explain String |
+-----------------------------------------------------------------------------+
| PLAN FRAGMENT 0 |
| OUTPUT EXPRS:<slot 2> | <slot 3> | <slot 4> | <slot 5> |
| PARTITION: UNPARTITIONED |
| |
| RESULT SINK |
| |
| 1:EXCHANGE |
| |
| PLAN FRAGMENT 1 |
| OUTPUT EXPRS:`k1`
k2
| | PARTITION: HASHPARTITIONED:default_cluster:test
.multi_tablet
.k1
| | | | RESULT FILE SINK | | FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951 | | STORAGE TYPE: S3 | | | | 0:OlapScanNode | | TABLE: multi_tablet | +-----------------------------------------------------------------------------+
The `EXPLAIN` command returns the query plan. If `RESULT FILE SINK` appears in `PLAN FRAGMENT 1`, the query can be exported concurrently. If it appears in `PLAN FRAGMENT 0`, concurrent export is not possible.
Export Concurrency
When concurrent export conditions are met, the export task's concurrency is determined by:
BE nodes * parallel_fragment_exec_instance_num
.