SAMPLE clause

Important Notice: Migration of Documentation Website

Please be advised that our documentation website is currently undergoing a migration to a new platform. To ensure you have access to the most up-to-date information, we kindly request that you visit our new documentation website located at:

Official Snowflake Snowconvert Documentation

For any immediate assistance or if you encounter any issues, please contact our support team at [email protected].

Thank you for your understanding.

Description

The SAMPLE clause in Teradata reduces the number of rows to be processed and it returns one or more samples of rows as a list of fractions or as a list of numbers of rows. The clause is used in the SELECT query. Please review the following Teradata documentation for more information.

Teradata syntax

SAMPLE
  [ WITH REPLACEMENT ]
  [ RANDOMIZED LOCALIZATION ]
  { { fraction_description | count_description } [,...] |
    when_clause ]
  }

Snowflake syntax

Review the following Snowflake documentation for more information. SAMPLE and TABLESAMPLE are synonymous.

SELECT ...
FROM ...
  { SAMPLE | TABLESAMPLE } [ samplingMethod ]
[ ... ]

Where:

samplingMethod ::= { 
{ BERNOULLI | ROW } ( { <probability> | <num> ROWS } ) |
{ SYSTEM | BLOCK } ( <probability> ) [ { REPEATABLE | SEED } ( <seed> ) ] }

In Snowflake, the following keywords can be used interchangeably:
- SAMPLE | TABLESAMPLE
- BERNOULLI | ROW
- SYSTEM | BLOCK
- REPEATABLE | SEED

Review the following table to check on key differences.

SAMPLE behavior

Teradata

Snowflake

Sample by probability

Also known as fraction description. It must be a fractional number between 0,1 and 1.

Decimal number between 0 and 100.

Fixed number of rows

Also known as count description. It is a positive integer that determines the number of rows to be sampled.

It specifies the number of rows (up to 1,000,000) to sample from the table. Can be any integer between 0 (no rows selected) and 1000000 inclusive.

Repeated rows

It is known as WITH REPLACEMENT. This is used to query more samples than there are rows in the table.

It is known as REPEATABLE or SEED. This is used to make the query deterministic. It means that the same set of rows will be the same for each query run.

Sampling methods

Proportional and RANDOMIZED ALLOCATION.

BERNOULLI or SYSTEM.

Sample Source Patterns

Sample data

Teradata

CREATE TABLE Employee (
    EmpNo INT,
    Name VARCHAR(100),
    DeptNo INT
);

INSERT INTO Employee (EmpNo, Name, DeptNo)
VALUES (1, 'Alice', 100);

INSERT INTO Employee (EmpNo, Name, DeptNo)
VALUES (2, 'Bob', 300);

INSERT INTO Employee (EmpNo, Name, DeptNo)
VALUES (3, 'Charlie', 500);

INSERT INTO Employee (EmpNo, Name, DeptNo)
VALUES (4, 'David', 200);

INSERT INTO Employee (EmpNo, Name, DeptNo)
VALUES (5, 'Eve', 100);

Snowflake

CREATE OR REPLACE TABLE Employee (
    EmpNo INT,
    Name VARCHAR(100),
    DeptNo INT
)
COMMENT = '{ "origin": "sf_sc", "name": "snowconvert", "version": {  "major": 0,  "minor": 0,  "patch": "0" }, "attributes": {  "component": "teradata",  "convertedOn": "01/14/2025",  "domain": "test" }}'
;

INSERT INTO Employee (EmpNo, Name, DeptNo)
VALUES (1, 'Alice', 100);

INSERT INTO Employee (EmpNo, Name, DeptNo)
VALUES (2, 'Bob', 300);

INSERT INTO Employee (EmpNo, Name, DeptNo)
VALUES (3, 'Charlie', 500);

INSERT INTO Employee (EmpNo, Name, DeptNo)
VALUES (4, 'David', 200);

INSERT INTO Employee (EmpNo, Name, DeptNo)
VALUES (5, 'Eve', 100);

SAMPLE clause

Fixed number of rows

Notice that for this example, the number of rows are a fixed number but not necessarily are the same result for each run.

Teradata

SELECT * FROM Employee SAMPLE 2;

Snowflake

SELECT * FROM Employee SAMPLE (2 ROWS);

Rows number based on probability

This option will return a variety of rows depending on the probability set.

Teradata

SELECT * FROM Employee SAMPLE 0.25;

Snowflake

SELECT * FROM Employee SAMPLE (25);

Known Issues

Fixed number of rows with replacement

This option will return a fixed number of rows and will allows the repetition of the rows. In Snowflake, it is not possible to request more samples than rows in a table.

Teradata sample

SELECT * FROM Employee SAMPLE WITH REPLACEMENT 8;

In Teradata, it is possible to assign a unique ID to each sample that is specified. It helps to identify which belongs to which sample. This is not ANSI grammar, instead it is an extension of Teradata.

Teradata sample

SELECT name, SAMPLEID FROM employee SAMPLE 0.5, 0.25, 0.25;

In Snowflake, there is not a SAMPLEID function. A possible workaround may be the following, but it has to be adaptaed to each single case:

Snowflake possible workaround

WITH sampled_data AS (
    -- Sample 100% of the rows from the Employee table
    SELECT *, 
           ROW_NUMBER() OVER (ORDER BY EmpNo) AS row_num,
           COUNT(*) OVER () AS total_rows  -- Get the total row count to calculate sample size
    FROM Employee
)
SELECT Name,
       CASE
           -- First 50% of the rows
           WHEN row_num <= total_rows * 0.5 THEN 1
           -- Next 25% of the rows
           WHEN row_num <= total_rows * 0.75 THEN 2
           -- Remaining 25% of the rows
           ELSE 3
       END AS sample_id
FROM sampled_data
ORDER BY sample_id, row_num;  -- Order by sample_id and row_num for consistency

Conditional sampling

In Snowflake there is not conditional sampling. This can be achieve by using CTE's.

Teradata sample

SELECT * FROM employee
SAMPLE WHEN DeptNo > 100 then 0.9 
ELSE 0.1 END;

SSC-EWI-0021: Syntax not supported in Snowflake.

PreviousReset When NextSet Operators

Last updated 12 days ago

Description

Sample Source Patterns

Sample data

Teradata

Snowflake

SAMPLE clause

Fixed number of rows

Rows number based on probability

Known Issues

Fixed number of rows with replacement

SAMPLEID related functionality

Conditional sampling

Related EWIs