Notes on Code Preparation

Get ready to run the SMA

Before running the Snowpark Migration Accelerator (SMA), the source code must be accessible from the same machine where the SMA has been installed. The SMA does not connect to a source database or Spark environment; it only examines the source code.

The source code must be in a format that the SMA can read, since the tool relies solely on the code you provide.

Extraction

In this walkthrough, the recommended source codebases are in GitHub repositories. These should be exported to zip files, and then unzipped locally. Depending on how you use Spark, you may have Python scripts or Scala projects in a series of directories or repositories. There may be notebooks run inside Databricks or locally by an end user. Regardless of where the code is, it's essential to get it together in a single folder (with as many subfolders as necessary) when running the SMA. If there is already an existing structure to how the files are arranged in a directory, that should be preserved here.
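As a rough sketch of this step, the snippet below unzips each downloaded repository archive into its own subfolder of a single input directory, preserving each repository's internal structure. The paths are hypothetical; adjust them to your environment.

```python
import zipfile
from pathlib import Path

# Hypothetical paths: adjust to wherever you downloaded the repository zips
# and where you want the combined SMA input folder to live.
downloaded_zips = Path("~/Downloads/repos").expanduser()
sma_input_dir = Path("~/sma_input").expanduser()

sma_input_dir.mkdir(parents=True, exist_ok=True)

# Unzip each exported repository into its own subfolder so the original
# directory structure of every codebase is preserved.
for archive in downloaded_zips.glob("*.zip"):
    target = sma_input_dir / archive.stem
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    print(f"Extracted {archive.name} -> {target}")
```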

The more complete the codebase that can be scanned initially, the more complete the reporting generated by the SMA will be. This does not mean you should run all the code you can find through the tool (see Size below in Considerations), but rather try to identify all the necessary code for your particular use case.

Considerations

Let's take a look at the file types that the SMA can analyze, as well as a few other factors to consider when evaluating and preparing the source code that will be analyzed.

Filetypes

The SMA will scan every file in the source directory, regardless of file type. However, only certain file extensions will be scanned for references to the Spark API and analyzed as code files. This includes both code files and notebook files.

You can see which code and notebook filetypes are supported in the Supported Filetypes section of this documentation.

Exported Files

If you do not have your code in files, but in a source platform, you can export that code to a format that can be read by the SMA. Suggestions for exporting from specific platforms (such as Databricks notebooks) appear later on this page.

Size

The SMA scans text (code), not data. If there are a substantial number of extra code files, or the total size (in bytes) of those files is large, the tool can run out of resources on your machine while processing the code. Include only the code that is actually under consideration for migration. Some users export all of the code from dependent libraries into a set of files and use those as input for the SMA; this makes the tool take much longer to analyze the codebase, yet it will still only find the Spark references in the code that was written to use Spark.

We recommend gathering all of the code files that...

  • are being run regularly as part of a process.

  • were used to create that process (if separate).

  • are custom libraries created by your organization and referenced by either of the two cases above.

You do not need to include the source code of established third-party libraries such as Pandas or scikit-learn. The tool will catalog references to those libraries without the code that defines them.
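If you want a quick sense of how large the gathered codebase is before running the SMA, a small script like the following can summarize file counts and sizes by extension. The input path and the set of extensions are only illustrative; see the Supported Filetypes section for the authoritative list.

```python
from collections import Counter
from pathlib import Path

# Hypothetical input folder; the extensions below are only examples.
# Check the Supported Filetypes section for the authoritative list.
sma_input_dir = Path("~/sma_input").expanduser()
code_extensions = {".py", ".ipynb", ".scala", ".sql"}

sizes = Counter()
counts = Counter()
for path in sma_input_dir.rglob("*"):
    if path.is_file():
        ext = path.suffix.lower() or "<no extension>"
        sizes[ext] += path.stat().st_size
        counts[ext] += 1

# Report the largest extensions first and flag those not scanned as code.
for ext, total in sorted(sizes.items(), key=lambda kv: -kv[1]):
    flag = "" if ext in code_extensions else "  (not scanned as code)"
    print(f"{ext:>15}: {counts[ext]:5d} files, {total / 1024:10.1f} KB{flag}")
```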

Does it run?

The SMA fully parses the source codebase. This means that snippets of code that cannot be executed on their own in a supported source platform will not be read by the SMA. If you run the tool and see a large number of parsing errors, it likely means that the source code itself cannot be parsed. The input directory must contain syntactically correct code that runs in the source platform to be correctly analyzed by the SMA.
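As an optional pre-check (not part of the SMA itself), you can confirm that your Python files at least parse before running the tool; anything that fails this check will also produce a parsing error in the SMA. The path below is hypothetical.

```python
import ast
from pathlib import Path

# Hypothetical pre-check: confirm each Python script parses before handing
# the folder to the SMA.
sma_input_dir = Path("~/sma_input").expanduser()

for script in sma_input_dir.rglob("*.py"):
    try:
        ast.parse(script.read_text(encoding="utf-8"), filename=str(script))
    except SyntaxError as err:
        print(f"Will not parse: {script} (line {err.lineno}: {err.msg})")
```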

Use Case

Understanding the use case for each codebase you scan is highly valuable when evaluating the output. It can also help you identify workloads that may not be ready for Snowpark, such as a notebook that doesn't reference Spark but uses an unsupported SQL dialect and a connector to pull information from an existing database. For a notebook like that, the SMA will not report anything besides the third-party libraries it imports. That information is still valuable, but don't expect a Spark API Readiness Score from such a file. Understanding the use case will better inform your interpretation of the readiness scores and of the workload's compatibility with Snowflake.

Exports from Databricks Notebooks

Generally, Databricks notebooks can support code from different languages (SQL, Scala, and PySpark) in the same notebook. When exported, each notebook becomes a file with an extension based on the notebook's primary language (such as .ipynb or .py for Python and .sql for SQL). Code that does not match that language is commented out, such as SQL code in a cell of a Python notebook. This is illustrated in the example below, where SQL code written and executed in a SQL cell of a Databricks Python notebook is commented out when exported:
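A hypothetical exported .py source file might look roughly like this (the table name and query are invented, and the exact markers can vary by Databricks version):

```python
# Databricks notebook source
# A regular Python cell is exported as plain Python and will be analyzed.
df = spark.table("sales")
df.show()

# COMMAND ----------

# MAGIC %sql
# MAGIC -- The SQL cell is exported as comments, so the SMA treats it as a comment.
# MAGIC SELECT region, SUM(amount) FROM sales GROUP BY region
```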

Code in comments is treated as a comment (as it would be if run in Python) and is not analyzed by the SMA. To ensure it is analyzed, some preprocessing is required to expose the code in a file with an extension that the tool can identify. Even if the code is uncommented, code written in a language other than the one associated with the file extension (such as Python in a .sql file) will not be analyzed. To ensure this code is analyzed, it must live in a file whose extension matches the source language being analyzed.
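The following is a rough sketch of the kind of preprocessing described above, assuming the `# MAGIC %sql` export format shown earlier: it pulls the commented-out SQL lines from an exported .py file into a separate .sql file so the SMA can analyze them. The paths are hypothetical, and real notebooks may need more careful handling.

```python
from pathlib import Path

# Hypothetical exported Databricks notebook and output file.
exported_notebook = Path("~/sma_input/notebooks/report.py").expanduser()
sql_output = exported_notebook.with_suffix(".extracted.sql")

sql_lines = []
in_sql_cell = False
for line in exported_notebook.read_text(encoding="utf-8").splitlines():
    if line.startswith("# MAGIC %sql"):
        # A SQL cell starts here; subsequent "# MAGIC" lines belong to it.
        in_sql_cell = True
        continue
    if in_sql_cell and line.startswith("# MAGIC"):
        sql_lines.append(line.removeprefix("# MAGIC").lstrip())
    else:
        in_sql_cell = False

if sql_lines:
    sql_output.write_text("\n".join(sql_lines) + "\n", encoding="utf-8")
```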

For additional considerations before running the tool, review the section on Pre-Processing Considerations available in this documentation.

Walkthrough Codebase

Once the suggested sample codebases for this walkthrough have been extracted and placed into separate directories, you can choose one of these two directories as the input for the SMA.

[Note that if the codebase has a folder structure, it would be best to preserve it as is. The output code (when performing a conversion) and some of the analysis done in the assessment are file-based.]

For this walkthrough, each codebase is less than 1 MB and works in the source platform (you can run them in Spark, though doing so is out of scope here), and various use cases are represented by the file names. Note that these use cases are not necessarily representative of production code, but are meant to show different functionality that can be assessed.

There are some Jupyter (.ipynb) notebooks and Python script files (.py). Some additional files in the source directory are not actual code files, just text files. Recall that the SMA will scan all files in a codebase regardless of extension, but it will only search specific file types for references to the Spark API.
