SELECT

Applies to

Description

Spark supports a SELECT statement and conforms to the ANSI SQL standard. Queries are used to retrieve result sets from one or more tables. (Spark SQL Language Reference SELECT)

Grammar Syntax

[ WITH with_query [ , ... ] ]
select_statement [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select_statement, ... ]
    [ ORDER BY { expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] } ]
    [ SORT BY { expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] } ]
    [ CLUSTER BY { expression [ , ... ] } ]
    [ DISTRIBUTE BY { expression [, ... ] } ]
    [ WINDOW { named_window [ , WINDOW named_window, ... ] } ]
    [ LIMIT { ALL | expression } ]

select_statement :=
SELECT [ hints , ... ] [ ALL | DISTINCT ] { [ [ named_expression | regex_column_names ] [ , ... ] | TRANSFORM (...) ] }
    FROM { from_item [ , ... ] }
    [ PIVOT clause ]
    [ UNPIVOT clause ]
    [ LATERAL VIEW clause ] [ ... ] 
    [ WHERE boolean_expression ]
    [ GROUP BY expression [ , ... ] ]
    [ HAVING boolean_expression ]
    
with_query :=
expression_name [ ( column_name [ , ... ] ) ] [ AS ] ( query )

from_item :=
table_relation |
join_relation |
table_value_function |
inline_table |
LATERAL(subquery) |
file_format.`file_path` 

Sample Source Patterns

GROUP BY

The WITH { CUBE | ROLLUP } syntax is transformed to its CUBE(expr1, ...) or ROLLUP(expr1, ...) equivalent

Input Code:

IN -> Databricks_01.sql
-- Basic case of GROUP BY
SELECT id, sum(quantity) FROM dealer GROUP BY 1;

-- Grouping by GROUPING SETS
SELECT city, car_model, sum(quantity) AS sum FROM dealer
    GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ());

-- Grouping by ROLLUP
SELECT city, car_model, sum(quantity) AS sum FROM dealer
    GROUP BY ROLLUP(city, car_model);

SELECT city, car_model, sum(quantity) AS sum FROM dealer
    GROUP BY city, car_model WITH ROLLUP;

-- Grouping by CUBE
SELECT city, car_model, sum(quantity) AS sum FROM dealer
    GROUP BY CUBE(city, car_model);

SELECT city, car_model, sum(quantity) AS sum FROM dealer
    GROUP BY city, car_model WITH CUBE;

Output Code:

OUT -> Databricks_01.sql
-- Basic case of GROUP BY
SELECT id, sum(quantity) FROM
    dealer
GROUP BY 1;

-- Grouping by GROUPING SETS
SELECT city, car_model, sum(quantity) AS sum FROM
    dealer
    GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ());

-- Grouping by ROLLUP
SELECT city, car_model, sum(quantity) AS sum FROM
    dealer
    GROUP BY ROLLUP(city, car_model);

SELECT city, car_model, sum(quantity) AS sum FROM
    dealer
GROUP BY
    ROLLUP(city, car_model);

-- Grouping by CUBE
SELECT city, car_model, sum(quantity) AS sum FROM
    dealer
    GROUP BY CUBE(city, car_model);

SELECT city, car_model, sum(quantity) AS sum FROM
    dealer
GROUP BY
    CUBE(city, car_model);

Hints

Snowflake performs automatic optimization of JOINs and partitioning, meaning that hints are unnecessary, they are preserved as comments in the output code.

Input Code:

IN -> Databricks_02.sql
SELECT
/*+ REBALANCE */ /*+ COALESCE(2) */
*
FROM my_table;

Output Code:

OUT -> Databricks_02.sql
SELECT
/*+ REBALANCE */ /*+ COALESCE(2) */
*
FROM
my_table;

CTE

The AS keyword is optional in Spark/Databricks, however in Snowflake is required so it is added.

Input Code:

IN -> Databricks_03.sql
WITH my_cte (
   SELECT id, name FROM my_table
)
SELECT *
FROM my_cte
WHERE id = 1;

Output Code:

OUT -> Databricks_03.sql
WITH my_cte AS (
     SELECT id, name FROM
        my_table
  )
SELECT *
FROM
     my_cte
WHERE id = 1;

LIMIT

LIMIT ALL is removed as it is not needed in Snowflake, LIMIT with a literal value is preserved as-is.

Input Code:

IN -> Databricks_03.sql
SELECT * FROM my_table LIMIT ALL;

SELECT * FROM my_table LIMIT 5;

Output Code:

OUT -> Databricks_03.sql

SELECT * FROM
my_table;

SELECT * FROM
my_table
LIMIT 5;

ORDER BY

WHERE

HAVING

FROM table_relation

FROM inline_table

UNION [ALL | DISTINCT]

INTERSECT (no keywords)

EXCEPT (no keywords)

Last updated