How the Conversion Works

What is converted? How?

The Snowpark Migration Accelerator (SMA) generates a comprehensive assessment, but it can also convert some elements of a source codebase into elements that are compatible with the target codebase. The SMA does this in exactly the same way that it builds the initial assessment, with just one added step.

Conversion in the SMA

In both assessment and conversion mode, the SMA:

  • Scans all files in a directory

  • Identifies the code files

  • Parses the code files (based on the language of the source code)

  • Builds an Abstract Syntax Tree (AST)

  • Populates a Symbol Table

  • Categorizes and Reports on Errors

  • Generates the output reports

All of this happens again when the SMA is run in conversion mode, even if it was already run in assessment mode. There is one final step that is performed only in conversion mode:

  • Pretty print the output code from the AST

The AST is a semantic model that represents the functionality of the source codebase. As a result, where that functionality exists in both the source and the target, the SMA can print the functional equivalent of the source code in the output code. This last step is performed only during a conversion execution of the SMA.
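
To make this pipeline concrete, here is a minimal sketch of the parse, transform, and pretty-print steps using Python's built-in ast module. This is not the SMA's implementation; the RENAMES table and the SparkToSnowpark transformer are illustrative assumptions that handle only a single import pattern.

import ast

# Hypothetical mapping of a few Spark API names to Snowpark equivalents.
RENAMES = {
    "pyspark.sql": "snowflake.snowpark",
    "SparkSession": "Session",
}

class SparkToSnowpark(ast.NodeTransformer):
    # Rewrite "from pyspark.sql import SparkSession"-style imports in the AST.
    def visit_ImportFrom(self, node):
        if node.module in RENAMES:
            node.module = RENAMES[node.module]
        for alias in node.names:
            if alias.name in RENAMES:
                alias.name = RENAMES[alias.name]
        return node

source = "from pyspark.sql import SparkSession"
tree = ast.parse(source)                 # parse the code file into an AST
tree = SparkToSnowpark().visit(tree)     # transform references in the AST
print(ast.unparse(tree))                 # pretty print the output code (Python 3.9+)
# prints: from snowflake.snowpark import Session

The SMA performs this kind of transformation across the entire codebase, converting a reference only where a functional equivalent exists in the target.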

Types of Conversion in the SMA

The SMA currently performs only the following conversions:

  • References from the Spark API to the Snowpark API in Python or Scala code files

  • SQL Elements from Spark SQL or HiveQL to Snowflake SQL

Let's look at an example of the first one in both Scala and Python.

Examples of Converting Spark API References to the Snowpark API

Example of Spark Scala to Snowpark

When you select Scala as the source language, the SMA converts references to the Spark API in Scala code to references to the Snowpark API. Here is an example of the conversion of a simple Spark application that reads and joins two datasets, filters by department, calculates an average salary, and shows the results.

Apache Spark Scala Code

import org.apache.spark.sql._ 
import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.SparkSession 

object SimpleApp { 
  def avgJobSalary(session: SparkSession, dept: String) { 
    val employees = session.read.csv("path/data/employees.csv") 
    val jobs = session.read.csv("path/data/jobs.csv") 

    val jobsAvgSalary = employees.
                          filter($"Department" === dept).
                          join(jobs).
                          groupBy("JobName").
                          avg("Salary") 

    // Salaries in department 
    jobsAvgSalary.select(collect_list("Salary")).show() 

    // avg Salary 
    jobsAvgSalary.show() 
  } 
} 

The Converted Snowflake Code:

import com.snowflake.snowpark._ 
import com.snowflake.snowpark.functions._ 
import com.snowflake.snowpark.Session 

object SimpleApp { 
  def avgJobSalary(session: Session, dept: String) { 
    val employees = session.read.csv("path/data/employees.csv") 
    val jobs = session.read.csv("path/data/jobs.csv") 

    val jobsAvgSalary = employees.
                          filter($"Department" === dept).
                          join(jobs).
                          groupBy("JobName").
                          avg("Salary") 

    // Salaries in department 
    jobsAvgSalary.select(array_agg("Salary")).show() 
   
    // avg Salary 
    jobsAvgSalary.show() 
  } 
} 

In this example, most of the structure of the Scala code is unchanged, but the references to the Spark API have been replaced with references to the Snowpark API: the imports now point to com.snowflake.snowpark, SparkSession becomes Session, and collect_list becomes array_agg.

Example of PySpark to Snowpark

When you select Python as the source language, the SMA converts references to the Spark API in Python code to references to the Snowpark API. Here is a script that uses several PySpark functions:

from datetime import date, datetime
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Row

spark_session = SparkSession.builder.getOrCreate()

df = spark_session.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])

# cube()
df.cube("name", df.age).count().orderBy("name", "age").show()

# take()
df_new1.take(2)

# describe()
df.describe(['age']).show()

# explain()
df.explain() 
df.explain("simple") # Physical plan
df.explain(True) 

# intersect()
df1 = spark_session.createDataFrame([("a", 1), ("a", 1), ("b", 3), ("c", 4)], ["C1", "C2"])
df2 = spark_session.createDataFrame([("a", 1), ("a", 1), ("b", 3)], ["C1", "C2"])

# where()
df_new1.where(F.col('Id2')>30).show()

The Converted Snowflake Code:

from datetime import date, datetime
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from snowflake.snowpark import Row

spark_session = Session.builder.create()

df = spark_session.create_dataframe([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])

# cube()
df.cube("name", df.age).count().sort("name", "age").show()

# take()
df_new1.take(2)

# describe()
df.describe(['age']).show()

# explain()
df.explain()
df.explain("simple") # Physical plan
df.explain(True)

# intersect()
df1 = spark_session.create_dataframe([("a", 1), ("a", 1), ("b", 3), ("c", 4)], ["C1", "C2"])
df2 = spark_session.create_dataframe([("a", 1), ("a", 1), ("b", 3)], ["C1", "C2"])

# where()
df_new1.where(F.col('Id2')>30).show()

In this example, most of the structure of the Python code is unchanged, but the references to the Spark API have been replaced with references to the Snowpark API: the imports now point to snowflake.snowpark, SparkSession.builder.getOrCreate() becomes Session.builder.create(), createDataFrame becomes create_dataframe, and orderBy becomes sort.
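
Note that the converted Python code still needs a connection to Snowflake before it can run. As a minimal sketch (the connection parameters below are placeholders, not something the SMA generates), a Snowpark session can be created explicitly with configs():

from snowflake.snowpark import Session

# Placeholder connection parameters; substitute your own account details.
connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}

# Session.builder accepts connection settings via configs() before create().
session = Session.builder.configs(connection_parameters).create()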


This is what you can expect from conversion with the SMA.
