Extraction, Preprocessing, and Code Preparation

Getting ready to run

Before running the Snowpark Migration Accelerator (SMA), you must obtain the source code. The SMA does not connect to a source database or Spark environment; it examines only the source code itself.

Because the SMA relies solely on the provided source code, that code should be in a format the tool can read.

Ideally, the source code is in GitHub repositories that can be exported as zip files. However, depending on how you use Spark, you may have Python scripts or Scala projects spread across a series of directories or repositories, or notebooks run inside Databricks or locally by an end user. Regardless of where the code lives, it is essential to gather it into a single folder (with as many subfolders as necessary) before running the SMA. If the files already follow an existing directory structure, preserve that structure here.
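
As an illustration, here is a minimal Python sketch for consolidating extracted repositories into a single input folder while preserving each one's directory structure. The source and destination paths are hypothetical placeholders, and the ignored directory names are just common examples of content the SMA does not need:

```python
# Minimal sketch: consolidate extracted repositories into one SMA input
# folder while preserving each repository's directory structure.
# All paths below are hypothetical; adjust them to your environment.
import shutil
from pathlib import Path

sources = [
    Path("~/extracted/etl-repo").expanduser(),
    Path("~/extracted/notebooks-export").expanduser(),
]
sma_input = Path("./sma-input")
sma_input.mkdir(exist_ok=True)

for repo in sources:
    # Copy each repository into its own subfolder, skipping dependency
    # and version-control directories the SMA does not need to scan.
    shutil.copytree(
        repo,
        sma_input / repo.name,
        ignore=shutil.ignore_patterns(".git", "venv", "site-packages", "__pycache__"),
        dirs_exist_ok=True,  # requires Python 3.8+
    )
```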

The more complete the codebase scanned initially, the more complete the reporting generated by the Snowpark Migration Accelerator will be. This does not mean you should run all the code you can find through the tool (see the note on size under Additional Considerations below), but rather that you should identify all the code necessary for your particular use case. Let's take a look at the file types that the Snowpark Migration Accelerator can analyze, as well as a few factors to consider when evaluating the source code it will examine.

Filetypes

The Snowpark Migration Accelerator will scan every file in the source directory, regardless of file type. However, only certain file extensions are scanned for references to the Spark API and analyzed as code files. These include both code files and notebook files.

You can see which code and notebook filetypes are supported in the Supported Filetypes section of this documentation.
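
If you want a quick picture of what an input directory contains before a scan, a short script like the following can tally file extensions. The extension set here is an illustrative subset only; the Supported Filetypes section remains the authoritative list:

```python
# Minimal sketch: inventory the extensions in an input directory to see
# which files the SMA will analyze as code or notebooks. The extension
# set below is illustrative, not the official supported list.
from collections import Counter
from pathlib import Path

CODE_EXTENSIONS = {".py", ".scala", ".ipynb"}  # illustrative subset

counts = Counter(
    p.suffix.lower() for p in Path("./sma-input").rglob("*") if p.is_file()
)
for ext, n in counts.most_common():
    marker = "code/notebook" if ext in CODE_EXTENSIONS else "scanned only"
    print(f"{ext or '(no extension)'}: {n} file(s) - {marker}")
```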

Exported Files

If you do not have your code in files, but in a source platform, you can export that code to a format that can be read by the SMA.

Additional Considerations

The Snowpark Migration Accelerator can only process the code that is available in the input directory. As a result, you should keep a few things in mind when preparing the source code to be scanned in the SMA.

The Snowpark Migration Accelerator scans text (code), not data. If the input contains a substantial number of extra code files, or the total size (in bytes) of those files is large, the tool could exhaust resources on your machine while processing the code. For example, some users export all the code from dependent libraries into a set of files and include those as input, which makes the analysis take much longer; the tool will still only find Spark references in the code that was actually written to use Spark.
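
To catch an oversized input before a scan, a rough size report like the following sketch can flag folders (such as exported dependency code) that are unusually large. The 10 MB threshold is an arbitrary illustration, not an SMA limit:

```python
# Minimal sketch: report the total size of each top-level folder in the
# SMA input directory so oversized dependency exports stand out before
# a scan. The threshold is an arbitrary illustration, not an SMA limit.
from pathlib import Path

sma_input = Path("./sma-input")
THRESHOLD_BYTES = 10 * 1024 * 1024  # flag folders larger than 10 MB

for folder in sorted(p for p in sma_input.iterdir() if p.is_dir()):
    total = sum(f.stat().st_size for f in folder.rglob("*") if f.is_file())
    flag = "  <-- unusually large; exported dependencies?" if total > THRESHOLD_BYTES else ""
    print(f"{folder.name}: {total / 1024:.1f} KiB{flag}")
```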

We recommend gathering all of the code files that:

  • Are being run regularly as part of a process.

  • Were used to create that process (if separate).

  • Are custom libraries created by your organization and referenced by either of the two cases above.

You do not need to include the source code of established third-party libraries such as pandas or scikit-learn; the tool will catalog references to those libraries without the code that defines them.

The Snowpark Migration Accelerator fully parses the source codebase. This means that snippets of code that are not valid standalone Python or Scala cannot be read by the tool. If you run the tool and see a large number of parsing errors, the source code itself likely cannot be parsed. For the SMA to analyze it correctly, the input directory must contain syntactically correct code that runs in the source platform.
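
Since syntactically invalid files produce parsing errors, it can save time to pre-check the Python files in the input directory with the standard library's ast module, as in this sketch. (A comparable check for Scala would require a Scala parser and is not shown.)

```python
# Minimal sketch: verify that the Python files in the input directory
# parse cleanly, since the SMA can only analyze syntactically valid code.
import ast
from pathlib import Path

errors = []
for path in Path("./sma-input").rglob("*.py"):
    try:
        ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
    except SyntaxError as exc:
        errors.append(f"{path}: line {exc.lineno}: {exc.msg}")

if errors:
    print("Files that will produce parsing errors in the SMA:")
    print("\n".join(errors))
else:
    print("All Python files parsed cleanly.")
```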

For additional considerations before running the tool, review the section on Pre-Processing Considerations available in this documentation.

Walkthrough Codebase

Once the suggested sample codebases for this lab have been extracted and placed into separate directories, you can choose either directory as the input for the SMA.

Note: if the codebase has a folder structure, it is best to preserve it as is. The output code (when performing a conversion) and some of the analysis done in the assessment are file-based.

For this walkthrough, each codebase is less than 1 MB and works in the source platform (the test codebase can run in Spark, though running it is out of scope here), and the file names indicate a variety of use cases. Note that these use cases are not necessarily representative of real-world implementations; they are meant to show different functions that can be assessed.

There are some Jupyter notebooks (.ipynb) and Python script files (.py). The source directory also contains a few additional files that are not code files, just text files. Recall that the SMA scans every file in a codebase regardless of extension, but searches only specific filetypes for references to the Spark API.
