Spark Reference Categories

Categories of references to the Spark API

SnowConvert for Spark divides Spark API elements into several categories based on how each element maps from Spark to Snowpark. Below is a summary of the categories that SnowConvert uses to describe the translation of each Spark reference, along with a description, an example, whether the tool can automatically convert the reference (Tool Supported), and whether an equivalent exists in Snowpark (Snowpark Supported).

The following sections detail what each category means, with examples.

Direct

Direct translation: the same function exists in both PySpark and Snowpark, and no change is needed.

  • Snowpark Supported: TRUE

  • Tool Supported: TRUE

  • Spark Example:

col("col1")
  • Snowpark Example:

col("col1")

Rename

The function from PySpark exists in Snowpark, but the reference must be renamed.

  • Snowpark Supported: TRUE

  • Tool Supported: TRUE

  • Spark Example:

orderBy("date")
  • Snowpark Example:

sort("date")

Helper

Note: The Python extensions library has been deprecated as of Spark Conversion Core V2.40.0. From that version forward, no Spark elements from Python will be placed in this category. Spark Scala continues to be supported by the helper classes in the Snowpark extensions library.

The function from Spark has a small difference in Snowpark that can be addressed by creating a function with an equivalent signature in an extensions file. In other words, a "helper" function is created in an extension library and called in each file where it is needed.

You can find more information about the Snowpark extensions library in the extensions Git repository: https://github.com/Snowflake-Labs/snowpark-extensions.

Examples include "fixed" additional parameters, a changed parameter order, and similar adjustments.

  • Snowpark Supported: TRUE

  • Tool Supported: TRUE

  • Spark Example:

instr(str, substr)
  • Snowpark Example:

# Create a helper function named instr with a signature
# equivalent to the PySpark function, for example:
from snowflake.snowpark import Column
from snowflake.snowpark.functions import charindex, lit

def instr(source: Column, substr: str) -> Column:
    return charindex(lit(substr), source)
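
Converted files then import the helper so each call site keeps the original PySpark shape. A minimal usage sketch; the module name snowpark_extensions_helpers is illustrative, not the actual layout of the snowpark-extensions library:

# Hypothetical import; in practice the helper comes from the
# snowpark-extensions library referenced above.
from snowpark_extensions_helpers import instr
from snowflake.snowpark.functions import col

# The call site is unchanged from the PySpark original.
position = df.select(instr(col("comment"), "error"))  # df: an existing DataFrame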

Transformation

The function is completely recreated as a functionally equivalent construct in Snowpark, but the result does not resemble the original function. The transformation can involve calling several functions or adding multiple lines of code.

  • Snowpark Supported: TRUE

  • Tool Supported: TRUE

  • Spark Example:

col1 = col("col1")
col2 = col("col2")
col1.contains(col2)
  • Snowpark Example:

col1 = col("col1")
col2 = col("col2")
from snowflake.snowpark.functions as f
f.contains(col, col2)
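
In context, the transformation replaces the PySpark method call with the standalone Snowpark function. A minimal sketch, assuming an existing Snowpark DataFrame df with string columns col1 and col2:

import snowflake.snowpark.functions as f

# PySpark: df.filter(col("col1").contains(col("col2")))
# Snowpark: Column.contains() becomes the standalone functions.contains().
matches = df.filter(f.contains(f.col("col1"), f.col("col2")))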

WorkAround

This category is used when the tool cannot convert the PySpark element, but a known manual workaround exists to fix the conversion; the workaround is published in the tool documentation.

  • Snowpark Supported: TRUE

  • Tool Supported: FALSE

  • Spark Example:

instr(str, substr)
  • Snowpark Example:

# EWI: SPRKPY#### => PySpark function has a workaround; see the documentation for more info
charindex(substr, str)
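
Applying the documented workaround by hand means swapping the argument order and switching to charindex. A minimal sketch, assuming an existing Snowpark DataFrame df with a string column named comment (the search string "error" is illustrative):

from snowflake.snowpark.functions import charindex, col, lit

# PySpark original: instr(col("comment"), "error")
# Manual workaround: charindex takes the search target first.
position = df.select(charindex(lit("error"), col("comment")))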

NotSupported

This category is employed when the tool cannot convert the PySpark element because there's no applicable equivalent in Snowflake.

  • Snowpark Supported: FALSE

  • Tool Supported: FALSE

  • Spark Example:

df: DataFrame = spark.createDataFrame(rowData, columns)
df.alias("d")
  • Snowpark Example:

df: DataFrame = spark.createDataFrame(rowData, columns)
# EWI: SPRKPY11XX => DataFrame.alias is not supported
# df.alias("d")

NotDefined

This category is used when the tool detects the usage of a PySpark element but cannot convert it, because the element is not present in the tool's conversion database.

  • Snowpark Supported: FALSE

  • Tool Supported: FALSE

  • Spark Example: N/A

  • Snowpark Example: N/A

The assessment output assigns one of these categories to every identified reference to the Spark API.
