Notes on Code Preparation

Get ready to run the SMA

Before running the Snowpark Migration Accelerator (SMA), the source code must be accessible from the same machine where the SMA has been installed. The SMA does not connect to a source database or Spark environment; it only examines the source code.

The source code must be in a format the SMA can read, since the tool relies solely on the code you provide.

Extraction

In this walkthrough, the recommended source codebases are in GitHub repositories. These should be exported to zip files and then unzipped locally. Depending on how you use Spark, you may have Python scripts or Scala projects spread across a series of directories or repositories, and there may be notebooks that run inside Databricks or locally for an end user. Regardless of where the code lives, it's essential to gather it into a single folder (with as many subfolders as necessary) before running the SMA. If the files are already arranged in an existing directory structure, preserve that structure here.

The more complete the codebase that is scanned initially, the more complete the reporting generated by the SMA will be. This does not mean you should run every piece of code you can find through the tool (see the Size section under Considerations below), but rather that you should try to identify all the code necessary for your particular use case.
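If your scripts are scattered across several repositories, a small script can help consolidate them into a single input folder while preserving the relative directory structure. This is a minimal sketch, not part of the SMA: the paths and the list of extensions are hypothetical placeholders you would adjust to your environment.

```python
import shutil
from pathlib import Path

# Hypothetical locations: adjust to your environment.
SOURCE_ROOTS = [Path.home() / "repos" / "etl-jobs", Path.home() / "repos" / "notebooks"]
SMA_INPUT = Path.home() / "sma_input"

# Extensions to carry over; see the Supported Filetypes page for the full list.
CODE_EXTENSIONS = {".py", ".ipynb", ".scala", ".sql", ".dbc"}

for root in SOURCE_ROOTS:
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in CODE_EXTENSIONS:
            # Preserve the relative structure under a subfolder named after each repo.
            destination = SMA_INPUT / root.name / path.relative_to(root)
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, destination)
```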

Considerations

Let's take a look at the file types the SMA can analyze, as well as a few other factors to consider when evaluating and preparing the source code that will be analyzed by the SMA.

Filetypes

The SMA will scan every file in the source directory, regardless of file type. However, only files with certain extensions will be scanned for references to the Spark API and analyzed as code files. This includes both code files and notebook files.

You can see which code and notebook filetypes are supported in the Supported Filetypes section of this documentation.
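To see what the SMA will actually treat as code, it can help to inventory the file extensions present in your input folder and compare them against the Supported Filetypes page. A minimal sketch, assuming a hypothetical sma_input folder:

```python
from collections import Counter
from pathlib import Path

SMA_INPUT = Path.home() / "sma_input"  # hypothetical input folder

# Count files by extension across the whole input tree.
counts = Counter(p.suffix.lower() or "(no extension)"
                 for p in SMA_INPUT.rglob("*") if p.is_file())

for extension, count in counts.most_common():
    print(f"{extension:>15}: {count} file(s)")
```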

Exported Files

If your code is not in files but in a source platform, you can export it to a format the SMA can read. Here are some suggestions for exporting files:

  • If you are using Databricks, you can export Databricks notebook files to the .dbc format, which can be read by the SMA. For more details on exporting Databricks notebook files, refer to the Databricks documentation on exporting notebooks.

  • For additional guidance on exporting files, Snowflake Professional Services maintains an export scripts page in the Snowflake Labs Github repo. There is guidance published for Databricks, Hive, and other sources.

  • If you are using another source platform, there may be more information on the Code Extraction page in this documentation. If not, please reach out to sma-support@snowflake.com for help getting your code into a format that the SMA can understand.

Size

The SMA scans text (code), not data. If there are a substantial number of extra code files, or their total size (in bytes) is very large, the tool could run out of resources on your machine while processing the code. It's important to run only the code that is actually under consideration. Some users export all the code from dependent libraries into a set of files and use those as input for the SMA; this causes the tool to take much longer to analyze the codebase, yet it will still only find the Spark references in the code that was written to use Spark.

We recommend gathering all of the code files that...

  • are being run regularly as part of a process.

  • were used to create that process (if separate).

  • are custom libraries created by the source code's organization and referenced by either of the two cases above.

The code that implements an established third-party library, such as Pandas or scikit-learn, does not need to be included. The tool will catalog those references without the code that defines them.
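Before running the tool, it can be worth checking how large the input actually is; a very large byte count often means exported library code has crept into the folder. A minimal sketch, again assuming a hypothetical sma_input folder:

```python
from pathlib import Path

SMA_INPUT = Path.home() / "sma_input"  # hypothetical input folder

files = [p for p in SMA_INPUT.rglob("*") if p.is_file()]
total_bytes = sum(p.stat().st_size for p in files)
print(f"{len(files)} files, {total_bytes / 1_048_576:.1f} MB total")

# The largest files are usually the first place to look for exported
# library code that does not need to be scanned.
for p in sorted(files, key=lambda f: f.stat().st_size, reverse=True)[:10]:
    print(f"{p.stat().st_size / 1_048_576:6.2f} MB  {p}")
```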

Does it run?

The SMA fully parses the source codebase. This means that snippets of code that cannot be executed by themselves in a supported source platform will not be read by the SMA. If you run the tool and see a lot of parsing errors, this could indicate that the source code itself cannot be parsed. The input directory must contain syntactically correct code that runs in the source platform for the SMA to analyze it correctly.
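If you want to catch unparsable files before the SMA reports parsing errors, you can run a quick syntax check over the Python files in the input folder. This is a sketch, not part of the SMA itself, and it assumes the same hypothetical sma_input folder:

```python
import ast
from pathlib import Path

SMA_INPUT = Path.home() / "sma_input"  # hypothetical input folder

for path in SMA_INPUT.rglob("*.py"):
    try:
        ast.parse(path.read_text(encoding="utf-8", errors="replace"), filename=str(path))
    except SyntaxError as error:
        # Files reported here would likely surface as parsing errors in the SMA output.
        print(f"Syntax error in {path} at line {error.lineno}: {error.msg}")
```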

Use Case

Understanding the use case for each codebase scanned is highly valuable when evaluating the output. It can also help you identify workloads that may not be ready for Snowpark, such as a notebook that doesn't reference Spark but uses an unsupported SQL dialect and a connector to pull information from an existing database. While such a workload may be valuable, the SMA will not report anything beyond the third-party libraries imported in that particular notebook. That information is still useful, but don't expect a Spark API Readiness Score from such a file. Understanding the use case will better inform your reading of the readiness scores and the compatibility of the workload with Snowflake.

Exports from Databricks Notebooks

Generally, Databricks notebooks can contain code from different languages (SQL, Scala, and PySpark) in the same notebook. When exported, each notebook is written to a file with an extension based on the notebook's language (such as .ipynb or .py for Python and .sql for SQL). Code that does not match that language is commented out, such as SQL code in a cell of a Python notebook. This is illustrated in the example below, where SQL code written and executed in a SQL cell of a Databricks Python notebook is commented out when exported:

[Image: Commented Code when Exported]
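As a rough illustration (the table name is hypothetical, the exact formatting can vary between Databricks versions, and `spark` is the session Databricks provides inside notebooks), a Python notebook with one Python cell and one SQL cell might export to a .py source file that looks something like this, with the SQL cell commented out:

```python
# Databricks notebook source
df = spark.table("sales")
df.groupBy("region").count().show()

# COMMAND ----------

# MAGIC %sql
# MAGIC -- This SQL cell is exported as comments, so the SMA will not analyze it
# MAGIC SELECT region, COUNT(*) FROM sales GROUP BY region
```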

Code in comments is treated as a comment (as it would be if run in Python) and is not analyzed by the SMA. To ensure it is analyzed, some preprocessing is required to move the code into a file with an extension the tool can identify. Even if the code is uncommented, any code written in a language other than the one implied by the file extension (such as Python in a .sql file) will not be analyzed. To ensure this code is analyzed, it must be placed in a file whose extension matches the source language.

Walkthrough Codebase

[Note that if the codebase has a folder structure, it would be best to preserve it as is. The output code (when performing a conversion) and some of the analysis done in the assessment are file-based.]

For this walkthrough, each codebase is less than 1MB and works in the source platform (you can run them in Spark, though doing so is out of scope for this walkthrough), and various use cases are represented by the names of the files. Note that these use cases are not necessarily representative of implemented code, but are meant to show different functions that can be assessed.

There are some Jupyter notebooks (.ipynb) and Python script files (.py). Some additional files in the source directory are not actual code files, just text files. Recall that the SMA will scan all files in a codebase regardless of extension, but it will only search specific file types for references to the Spark API.

For additional considerations before running the tool, review the Pre-Processing Considerations section available in this documentation.

Once the suggested sample codebases for this lab have been extracted and placed into separate directories, you can choose either of these two directories as the input for the SMA.
