Skip to main content

CREATE FUNCTION

Description

This statement is used to create a custom function.

Syntax

CREATE [ GLOBAL ] 
[{AGGREGATE | TABLES | ALIAS }] FUNCTION <function_name>
(<arg_type> [, ...])
[ RETURNS <ret_type> ]
[ INTERMEDIATE <inter_type> ]
[ WITH PARAMETER(<param> [,...]) AS <origin_function> ]
[ PROPERTIES ("<key>" = "<value>" [, ...]) ]

Required Parameters

1. <function_name>

If function_name includes a database name, such as db1.my_func, the custom function will be created in the corresponding database. Otherwise, the function will be created in the database of the current session. The name and parameters of the new function must not be identical to an existing function in the current namespace; otherwise, the creation will fail.

2. <arg_type>

The input parameter type of the function. For variable-length parameters, use , ... to indicate them. If it is a variable-length type, the type of the variable-length parameters must be consistent with the type of the last non-variable-length parameter.

3. <ret_type>

The return parameter type of the function. This is a required parameter for creating a new function. If creating an alias for an existing function, this parameter is not necessary.

Optional Parameters

1. GLOBAL

If specified, the created function is effective globally.

2. AGGREGATE

If specified, the created function is an aggregate function.

3. TABLES

If specified, the created function is a table function.

4. ALIAS

If specified, the created function is an alias function.

If none of the above parameters representing the function type is selected, it indicates that the created function is a scalar function.

5. <inter_type>

Used to indicate the data type during the intermediate stage of an aggregate function.

6. <param>

Used to indicate the parameters of an alias function, with at least one parameter required.

7. <origin_function>

Used to indicate the original function corresponding to the alias function.

8. <properties>

  • file: Indicates the JAR package containing the user-defined function (UDF). In a multi-machine environment, it can also be downloaded via HTTP. This parameter is mandatory.
  • symbol: Indicates the class name containing the UDF class. This parameter is mandatory.
  • type: Indicates the UDF call type. The default is Native. Use JAVA_UDF when using a Java UDF.
  • always_nullable: Indicates whether the UDF result may contain NULL values. This is an optional parameter with a default value of true.
  • volatility: Indicates how stable the function result is. This is an optional parameter. The default value is volatile for scalar UDFs, and immutable for UDAF/UDTF. Valid values are:
    • immutable: The same input always produces the same output. Most deterministic UDFs should be marked as immutable so that the optimizer can apply more plan optimizations.
    • stable: The result is stable within a single SQL statement but may change between statements, similar to now(). SQL cache and materialized view rewrite are disabled for this type of function.
    • volatile: The result may change on each call, similar to random(). SQL cache, materialized view rewrite, and many optimizer rewrite rules are disabled for this type of function.

Access Control Requirements

To execute this command, the user must have ADMIN_PRIV privileges.

Example

  1. Create a custom UDF function. For more details, refer to JAVA-UDF.

    CREATE FUNCTION java_udf_add_one(int) RETURNS int PROPERTIES (
    "file"="file:///path/to/java-udf-demo-jar-with-dependencies.jar",
    "symbol"="org.apache.doris.udf.AddOne",
    "always_nullable"="true",
    "type"="JAVA_UDF",
    "volatility"="immutable"
    );
  2. Create a custom UDAF function.

    CREATE AGGREGATE FUNCTION simple_sum(INT) RETURNS INT PROPERTIES (
    "file"="file:///pathTo/java-udaf.jar",
    "symbol"="org.apache.doris.udf.demo.SimpleDemo",
    "always_nullable"="true",
    "type"="JAVA_UDF"
    );
  3. Create a custom UDTF function.

    CREATE TABLES FUNCTION java_udtf(string, string) RETURNS array<string> PROPERTIES (
    "file"="file:///pathTo/java-udaf.jar",
    "symbol"="org.apache.doris.udf.demo.UDTFStringTest",
    "always_nullable"="true",
    "type"="JAVA_UDF"
    );
  4. Create a custom alias function. For more information, refer to Alias Function.

    CREATE ALIAS FUNCTION id_masking(INT) WITH PARAMETER(id) AS CONCAT(LEFT(id, 3), '****', RIGHT(id, 4));
  5. Create a global custom alias function.

    CREATE GLOBAL ALIAS FUNCTION id_masking(INT) WITH PARAMETER(id) AS CONCAT(LEFT(id, 3), '****', RIGHT(id, 4));
  6. Create a volatile Python UDF. Functions such as uuid.uuid4() that depend on randomness should keep the default volatility = volatile and must not be incorrectly marked as immutable.

    CREATE TABLE cte_uuid_seed (id INT) ENGINE=OLAP DUPLICATE KEY(id)
    DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ("replication_num" = "1");
    INSERT INTO cte_uuid_seed VALUES (1),(2),(3);

    DROP FUNCTION IF EXISTS py_uuid_token(INT);
    CREATE FUNCTION py_uuid_token(INT)
    RETURNS STRING
    PROPERTIES (
    "type" = "PYTHON_UDF",
    "symbol" = "py_uuid_token_impl",
    "always_nullable" = "false",
    "runtime_version" = "3.12.11",
    "volatility" = "volatile"
    )
    AS $$

import uuid def py_uuid_token_impl(x): return f"{x}-{uuid.uuid4()}"

SET enable_cte_materialize = true; SET inline_cte_referenced_threshold = 10; WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed) SELECT id, COUNT(DISTINCT token) AS distinct_tokens FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u GROUP BY id ORDER BY id; ``` Correct result: ```text +------+-----------------+ | id | distinct_tokens | +------+-----------------+ | 1 | 1 | | 2 | 1 | | 3 | 1 | +------+-----------------+ ``` For this function, the following definition is incorrect: ```sql DROP FUNCTION IF EXISTS py_uuid_token(INT); CREATE FUNCTION py_uuid_token(INT) RETURNS STRING PROPERTIES ( "type" = "PYTHON_UDF", "symbol" = "py_uuid_token_impl", "always_nullable" = "false", "runtime_version" = "3.12.11", "volatility" = "immutable" ) AS $$ import uuid def py_uuid_token_impl(x): return f"{x}-{uuid.uuid4()}" $$; ``` Run the same query again: ```sql WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed) SELECT id, COUNT(DISTINCT token) AS distinct_tokens FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u GROUP BY id ORDER BY id; ``` Incorrect result: ```text +------+-----------------+ | id | distinct_tokens | +------+-----------------+ | 1 | 2 | | 2 | 2 | | 3 | 2 | +------+-----------------+ ``` Why this is wrong: Because `py_uuid_token` is volatile, each call to `uuid.uuid4()` generates a new value. If the function is incorrectly marked as `volatility = immutable`, the optimizer may treat repeated references as safe to rewrite and may choose a plan that evaluates the UDF separately on both sides of `UNION ALL`. As a result, the same `id` can produce two different `token` values, and `COUNT(DISTINCT token)` changes from `1` to `2`.