Pre-Processing Considerations

What do you need to know to get the most out of the tool?

The Snowpark Migration Accelerator (SMA) can only process the code in the input directory. As a result, there are a few things to keep in mind when preparing the source code to be scanned by the SMA.

Size

The SMA scans text (or code), not data. If there is a large number of extraneous code files, or the total size (in bytes) of the files being scanned is very large, the tool can exhaust the resources available on your machine to process the code. For example, some users export all of the code from dependent libraries into a set of files and use those as input for the SMA. This makes the tool take much longer to analyze the codebase, yet it will still only report the Spark references found in the code that was written to run on Spark.

We recommend gathering all of the code files that:

  • Are being run regularly as part of a process.

  • Were used to create that process (if separate).

  • Are custom libraries created by the organization that owns the source code and referenced by either of the two above.

The source code of established third-party libraries, such as Pandas or Scikit-Learn, does not need to be included. The tool will catalog references to those libraries without the code that defines them.
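As an illustrative sketch only (the directory names, extensions, and exclusion list below are assumptions, not SMA requirements), one way to assemble such an input folder is to copy just the relevant source files while skipping third-party package directories:

```python
# Hypothetical helper: copy only the code files you actually run or maintain
# into a single folder for the SMA, skipping third-party package sources.
import shutil
from pathlib import Path

SOURCE = Path("my_project")      # assumed project root
TARGET = Path("sma_input")       # folder you will point the SMA at
EXTENSIONS = {".py", ".scala", ".ipynb", ".sql"}
SKIP_DIRS = {"venv", ".venv", "site-packages", "build", "target"}

for path in SOURCE.rglob("*"):
    # Keep files with a relevant extension that are not inside a skipped directory.
    if path.suffix.lower() in EXTENSIONS and not (SKIP_DIRS & set(path.parts)):
        dest = TARGET / path.relative_to(SOURCE)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, dest)
```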

The Code Should Run

The Snowpark Migration Accelerator (SMA) fully parses the source codebase. This means that snippets of code that cannot be executed on their own in Scala or Python will not be parsed correctly by the SMA. If you run the tool and see a relatively large number of parsing errors, it may indicate that the source code itself cannot be parsed. For the SMA to analyze it correctly, the input directory must contain syntactically correct code that runs on the source platform.
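As a quick, optional pre-check (this script is a sketch and not part of the SMA; the input directory name is an assumption), you can confirm that the Python files you plan to scan at least parse before running the tool:

```python
# Flag Python files that will not parse, so parsing errors in the SMA
# report point to real issues rather than broken snippets.
import ast
from pathlib import Path

input_dir = Path("sma_input")    # assumed SMA input directory
for py_file in input_dir.rglob("*.py"):
    try:
        ast.parse(py_file.read_text(), filename=str(py_file))
    except SyntaxError as err:
        print(f"Does not parse: {py_file} (line {err.lineno}: {err.msg})")
```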

Use Case

This consideration is less about the SMA itself and more about understanding the output it generates. Knowing the use case for the codebase you've scanned can help you identify workloads that may not be ready for Snowpark. For example, consider a notebook that doesn't reference Spark but uses SQL and a connector to pull information from an existing database. While this would be valuable to know, the SMA will not report anything other than the third-party libraries imported in that particular notebook. That information is still useful, but don't expect a readiness score for such a file.

Code from Databricks Notebooks

Generally, Databricks notebooks can contain code in different languages (SQL, Scala, and PySpark) within the same notebook. When exported, each notebook is written to a file with an extension based on the notebook's primary language (such as .ipynb or .py for Python and .sql for SQL). Code that does not match that language is commented out, such as SQL code in a cell of a Python notebook. This is illustrated in the example below, where SQL code written and executed in a SQL cell of a Databricks Python notebook appears commented out after export.
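The listing below is a sketch of such an export, assuming the notebook was saved as a Databricks .py source file; the table name and queries are made up for illustration:

```python
# Databricks notebook source
# Python cell: parsed and analyzed by the SMA
# (`spark` is the session Databricks provides in a notebook)
df = spark.read.table("sales")
df.groupBy("region").count().show()

# COMMAND ----------

# MAGIC %sql
# MAGIC -- SQL cell: exported as comments, so the SMA treats it as a comment
# MAGIC SELECT region, COUNT(*) FROM sales GROUP BY region
```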

Code inside comments is treated as a comment (just as it would be if run in the source language) and is not analyzed by the SMA. To have this code analyzed, some preprocessing is required to expose it in a file with an extension the tool can identify.

In notebook files, uncommented code in a language other than the one implied by the extension (such as SQL in an .ipynb file) will be recognized and analyzed by the SMA.

For non-notebook files, the code must live in a file whose extension matches the source language being analyzed (such as Python code in a .py file) to be picked up by the SMA.
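One possible preprocessing approach, sketched below under the assumption that the notebooks were exported as Databricks .py source files (the paths and output naming are hypothetical), is to copy the commented-out %sql cells into standalone .sql files that the SMA can then pick up:

```python
# Hypothetical sketch: pull commented-out SQL cells (# MAGIC %sql blocks)
# out of Databricks .py exports into .sql files for the SMA.
from pathlib import Path

MAGIC = "# MAGIC "

def extract_sql_cells(py_file: Path, out_dir: Path) -> None:
    sql_lines, in_sql = [], False
    for line in py_file.read_text().splitlines():
        if line.startswith("# MAGIC %sql"):
            in_sql = True                         # start of a SQL cell
        elif in_sql and line.startswith(MAGIC):
            sql_lines.append(line[len(MAGIC):])   # strip the MAGIC prefix
        else:
            in_sql = False                        # any other line ends the cell
    if sql_lines:
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / f"{py_file.stem}_cells.sql").write_text("\n".join(sql_lines) + "\n")

# Example: scan every exported notebook under ./exported and write the
# recovered SQL into a folder you can add to the SMA input directory.
for path in Path("exported").rglob("*.py"):
    extract_sql_cells(path, Path("exported/sql_for_sma"))
```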
