How the Conversion Works

What is converted? How?

The Snowpark Migration Accelerator (SMA) generates a comprehensive assessment, but it can also convert some elements of a source codebase into equivalents that are compatible with the target platform. The SMA does this in exactly the same way that it builds the initial assessment; conversion is just one added step.

Conversion in the SMA

In both assessment and conversion mode, the SMA:

  • Scans all files in a directory

  • Identifies the code files

  • Parses the code files (based on the language of the source code)

  • Builds an Abstract Syntax Tree (AST)

  • Populates a Symbol Table

  • Categorizes and Reports on Errors

  • Generates the output reports

All of this happens again when the SMA runs in conversion mode, even if it has already been run in assessment mode. Conversion mode then performs one final step:

  • Pretty print the output code from the AST

The AST is a semantic model representing the functionality of the source codebase. As a result, where the same functionality exists in both the source and the target, the SMA can emit the functional equivalent of the source code in the output. This last step is performed only during conversion executions of the SMA.
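
To make this pipeline concrete, here is a minimal, illustrative Python sketch of an AST-based rewrite pass using the standard-library ast module. The RENAMES mapping and the snippet it transforms are hypothetical examples; the SMA's actual engine also builds symbol tables and supports multiple source languages, so this only conveys the general parse, transform, and pretty-print idea rather than how the SMA is implemented.

# Illustrative only: not the SMA's implementation.
import ast

# Hypothetical mapping of a few PySpark names to Snowpark equivalents.
RENAMES = {"createDataFrame": "create_dataframe", "orderBy": "sort"}

class SparkToSnowpark(ast.NodeTransformer):
    def visit_Attribute(self, node: ast.Attribute) -> ast.Attribute:
        # Walk child nodes first, then rename the attribute if it is mapped.
        self.generic_visit(node)
        if node.attr in RENAMES:
            node.attr = RENAMES[node.attr]
        return node

source = "df = spark.createDataFrame(rows)\ndf.orderBy('age').show()"
tree = SparkToSnowpark().visit(ast.parse(source))
# "Pretty print" the transformed AST back to source code (Python 3.9+).
print(ast.unparse(tree))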

Types of Conversion in the SMA

The SMA currently performs only one type of conversion:

  • References from the Spark API to the Snowpark API in Python or Scala code files

The SMA does not perform any SQL conversion. For SQL files or SQL-only assessments, the tool provides assessment only, without any automated conversion.

Let's look at an example of this conversion in both Scala and Python.

Examples of Conversion of References to the Spark API to the Snowpark API

Example of Spark Scala to Snowpark

When you select Scala as the source language, the SMA converts references to the Spark API in Scala code to references to the Snowpark API. Here's an example of the conversion of a simple Spark application that reads two datasets, filters, joins, calculates an average, and shows the results.

Apache Spark Scala Code

import org.apache.spark.sql._ 
import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.SparkSession 

object SimpleApp { 
  def avgJobSalary(session: SparkSession, dept: String) { 
    val employees = session.read.csv("path/data/employees.csv") 
    val jobs = session.read.csv("path/data/jobs.csv") 

    val jobsAvgSalary = employees.
                          filter($"Department" === dept).
                          join(jobs).
                          groupBy("JobName").
                          avg("Salary") 

    // Salaries in department 
    jobsAvgSalary.select(collect_list("Salary")).show() 

    // avg Salary 
    jobsAvgSalary.show() 
  } 
} 

The Converted Snowflake Code:

import com.snowflake.snowpark._ 
import com.snowflake.snowpark.functions._ 
import com.snowflake.snowpark.Session 

object SimpleApp { 
  def avgJobSalary(session: Session, dept: String) { 
    val employees = session.read.csv("path/data/employees.csv") 
    val jobs = session.read.csv("path/data/jobs.csv") 

    val jobsAvgSalary = employees.
                          filter($"Department" === dept).
                          join(jobs).
                          groupBy("JobName").
                          avg("Salary") 

    // Salaries in department 
    jobsAvgSalary.select(array_agg("Salary")).show() 
   
    // avg Salary 
    jobsAvgSalary.show() 
  } 
} 

In this example, most of the structure of the Scala code is unchanged, but the references to the Spark API have been replaced with references to the Snowpark API: the imports move from org.apache.spark.sql to com.snowflake.snowpark, SparkSession becomes Session, and collect_list becomes array_agg.

Example of PySpark to Snowpark

When you select Python as the source language, the SMA converts references to the Spark API in Python code to references to the Snowpark API. Here is a script that uses several PySpark functions (some fragments reference DataFrames, such as df_new1, that are defined elsewhere):

from datetime import date, datetime
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Row

spark_session = SparkSession.builder.getOrCreate()

df = spark_session.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])

# cube()
df.cube("name", df.age).count().orderBy("name", "age").show()

# take()
df_new1.take(2)

# describe()
df.describe(['age']).show()

# explain()
df.explain() 
df.explain("simple") # Physical plan
df.explain(True) 

# intersect()
df1 = spark_session.createDataFrame([("a", 1), ("a", 1), ("b", 3), ("c", 4)], ["C1", "C2"])
df2 = spark_session.createDataFrame([("a", 1), ("a", 1), ("b", 3)], ["C1", "C2"])

# where()
df_new1.where(F.col('Id2')>30).show()

The Converted Snowflake Code:

from datetime import date, datetime
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from snowflake.snowpark import Row

spark_session = Session.builder.create()

df = spark_session.create_dataframe([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])

# cube()
df.cube("name", df.age).count().sort("name", "age").show()

# take()
df_new1.take(2)

# describe()
df.describe(['age']).show()

# explain()
df.explain()
df.explain("simple") # Physical plan
df.explain(True)

# intersect()
df1 = spark_session.create_dataframe([("a", 1), ("a", 1), ("b", 3), ("c", 4)], ["C1", "C2"])
df2 = spark_session.create_dataframe([("a", 1), ("a", 1), ("b", 3)], ["C1", "C2"])

# where()
df_new1.where(F.col('Id2')>30).show()

In this example, most of the structure of the Python code is unchanged, but the references to the Spark API have been replaced with references to the Snowpark API: SparkSession.builder.getOrCreate() becomes Session.builder.create(), createDataFrame becomes create_dataframe, and orderBy becomes sort.


This is what you can expect from conversion with the SMA.
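
If you want to run converted Snowpark Python code like the example above, here is a hedged sketch of a typical session setup. The connection parameters are placeholders you must replace with your own account values; the DataFrame at the end is a small sample, not part of the converted output.

# A sketch of running converted Snowpark Python code.
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account_identifier>",  # placeholder
    "user": "<user>",                   # placeholder
    "password": "<password>",           # placeholder
    "warehouse": "<warehouse>",         # placeholder
}

session = Session.builder.configs(connection_parameters).create()

# Sample data to confirm the session works with converted API calls.
df = session.create_dataframe([(1, "a"), (2, "b")], schema=["id", "label"])
df.sort("id").show()
session.close()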
