Interpreting the Assessment Output

What to do with all of this information generated by the SMA?

Once the tool has finished running, the analysis is complete and the output is ready to review. Let’s look at it.

Assessment Artifacts

There are multiple places where information from the assessment will be populated. This section will focus on reviewing the output in the application and the detailed report, but each location is summarized below:

  • Assessment Summary Page in the Application - When you choose “VIEW RESULTS” in the UI, it shows key summary information on that page.

  • Locally in the output directory - In the output directory specified in the Project Creation page, the user will get a series of inventories and reports based on the readiness score and the email entered in the project creation menu. Those will be detailed below.

  • Emailed to the user - An email is sent with a subset of the information on the assessment summary page in the application. It contains the execution ID for that run of the tool and the result (the workload is a good candidate for migration, or the workload needs further analysis), and nothing more.

Let’s look at these outputs, followed by a close analysis of the detailed report.

Assessment Summary Page in the Application

When you select “VIEW RESULTS” following the successful execution of the tool, you will see an output screen with some fundamental information about that execution of the tool, along with one of two “results”: the workload is a good candidate for migration (Readiness Score above 60%), or the workload needs further analysis to determine whether it’s a good candidate (Readiness Score below 60%). (More on the Readiness Score later.)

The screen will look like this:

You will see that no Readiness Score is displayed here, nor are the elements that are ready for conversion listed. This is meant only as a high-level summary of the analysis done by the Snowpark Migration Accelerator.

Reports output locally in the output directory

What is generated locally depends on two things: the user’s email and the Readiness Score. A complete set of output files and reports will be created for Snowflake users and for all users with a Readiness Score above 90%. Outside of those conditions, only an inventory of items will be available to the user.

When you click on “VIEW REPORTS” in the application (shown above), Snowpark Migration Accelerator will take you to the “Reports” directory created by the tool. It will look like one of the two options below:

All of these have detailed information about what is present in the source codebase, but only the output for Snowflake employees will have the reports and the full Spark Reference Inventory.

You can view each of these files by opening them in Excel, which can help you understand what is present in the codebase. However, all of this information is summarized in the detailed report, and this walkthrough will spend more time there. You can find a complete inventory of what is present in each of these files in the Snowpark Migration Accelerator Documentation.

Emailed to the user

The email address entered in the Project Creation menu will receive a brief email from snowconvert-notifications@snowflake.com with a couple of key artifacts:

  • The “result”, indicating that the workload is either a good candidate for migration (Readiness Score above 60%) or needs further analysis to determine whether it’s a good candidate (Readiness Score below 60%).

  • The Tool Execution ID. This is a unique identifier that a Snowflake user can use to locate this execution in the assessment database. As a note, this is also reported in the local output in the tool_execution.csv file.

Here is what the email will look like:

The execution ID is called “Session ID” in the email. Note that the “Result” is the same one that appears on the final screen.


The Detailed Report

This is the key artifact you will want to review to better understand how ready a codebase is for migration with the Snowpark Migration Accelerator. This section will walk through each part of the report, how it affects the Readiness Score, and how to interpret the output.

The Readiness Score

Before we walk through the report, let’s define what the Readiness Score is and how it is calculated. The Readiness Score is the primary indicator of readiness that the Snowpark Migration Accelerator produces. However, it is based only on references to the Spark API, not on any other factors or third-party libraries that may be present in the codebase. As a result, the score can be misleading on its own, and there are other factors you will want to take into account, including the presence of third-party libraries. Treat the Readiness Score as a starting point.

The Readiness Score is simply the number of references to the Spark API that can be converted to the Snowpark API divided by the total number of references to the Spark API. Both of these values are present in the Spark API Summary section (3413 / 3748 in this example). The higher this value, the better prepared the workload is to work with the Snowpark API. Recall, however, that this does not take into account any third-party libraries. The Readiness Score appears on the first page of the detailed report. Speaking of that, let’s jump into each section of the Detailed Report.
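To make the arithmetic concrete, here is a minimal sketch of that calculation in Python, using the example numbers above (the function is illustrative, not part of the SMA):

```python
def readiness_score(convertible_refs: int, total_refs: int) -> float:
    """Readiness Score = convertible Spark API references / total Spark API references."""
    return convertible_refs / total_refs * 100

# Example numbers from the Spark API Summary section of this report:
print(f"{readiness_score(3413, 3748):.1f}%")  # -> 91.1%
```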

Summary Page

The summary page has the readiness score and some information about your execution.

What to look out for? The Readiness Score! Use the Readiness Score to determine how ready your codebase is to convert references to the Spark API into references to the Snowpark API. If it is high, the Spark in this codebase is a good candidate for migration to Snowpark.

File Summary

The file summary contains basic information about the file extensions present in your codebase, including how many lines of code were in each file, what cells were present in any notebooks (if present in this run), and how many of the files had embedded SQL.

What to look out for? The size. If there are a huge number of files but few with references to the Spark API, that could indicate that this code does a great deal without referencing the Spark API. This could be because the user is not using Spark for much (possibly just for extracting and loading), or because the source code for referenced libraries was included in the run. Regardless, it is a clear indication that you need to better understand the use case. (A quick local spot-check is sketched below.)
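If you want to spot-check the file-to-Spark ratio yourself, a minimal sketch like the following can tally Python files and flag which ones reference the Spark API. The `pyspark`-import heuristic is an assumption for illustration, not how the SMA classifies files:

```python
from collections import Counter
from pathlib import Path

codebase = Path("path/to/source")  # root of the assessed codebase (adjust)
files_by_ext = Counter()
spark_files = 0

for f in codebase.rglob("*.py"):
    files_by_ext[f.suffix] += 1
    text = f.read_text(errors="ignore")
    # Crude heuristic: treat a file as "using Spark" if it imports pyspark.
    if "import pyspark" in text or "from pyspark" in text:
        spark_files += 1

total = sum(files_by_ext.values())
print(f"{spark_files}/{total} Python files reference the Spark API")
```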

Spark Usage Summary

The Spark Usage summary describes how many references to the Spark API were found, and how many can be converted to the Snowpark API. These usages are broken out into different categories such as DataFrame, column, SparkSession, and others.

Each reference is also classified into one of seven support statuses. These statuses identify both whether and how a reference can be supported in Snowpark. Each status is defined in the appendices at the end of the report, and they are described below (a small example contrasting the first two follows the list):

  • Direct: Direct translation. The same function exists in PySpark and Snowpark with no change needed.

  • Rename: The function from PySpark exists in Snowpark, but there is a rename that is needed.

  • Helper: This function has a small difference in Snowpark that can be addressed by creating a functionally equivalent function to resolve the difference.

  • Transformation: The function is completely recreated as a functionally equivalent function in Snowpark, but doesn't resemble the original function. This can involve calling several functions or adding multiple lines of code.

  • Workaround: This category is employed when the tool cannot convert the PySpark element but there’s a known manual workaround to fix the conversion (the workaround is published in the tool’s documentation).

  • NotSupported: NotSupported refers to any function that the tool cannot currently convert from PySpark because there's no applicable equivalent in Snowflake. An error message will be added to the output code.

  • NotDefined: Any detected usage of a PySpark element that is not yet in the tool's conversion database. These elements will be marked for inclusion in a future version of the tool.
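To make the first two statuses concrete, here is a small hedged example. It assumes the `toPandas()` → `to_pandas()` mapping as a Rename case; consult the SMA documentation for the authoritative list of mappings:

```python
# PySpark source (df is an existing DataFrame in each environment):
from pyspark.sql.functions import col
high_value = df.filter(col("amount") > 100)  # Direct: same call exists in Snowpark
pdf = df.toPandas()                          # Rename: Snowpark spells this differently

# Snowpark equivalent:
from snowflake.snowpark.functions import col
high_value = df.filter(col("amount") > 100)  # unchanged
pdf = df.to_pandas()                         # toPandas() -> to_pandas()
```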

What to look out for?

The Readiness Score is posted here as well, but you can also check how many of the references will require a workaround, or whether the majority are direct translations. The more workarounds, helpers, and transformations that are present, the more this workload will need an accelerator such as the Snowpark Migration Accelerator to assist in migrating the codebase.

Import Calls

Every time the source code imports a package or library, the SMA counts it as an import call. All recognized or common import calls will appear in the import summary on this page of the detailed report. (Note that the tool records all import calls in the local output inventories folder and in the assessment database.) These import calls are not yet categorized as supported or not supported in Snowflake.
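As an illustration of what such an inventory captures, here is a minimal sketch that counts import calls in a single Python file using the standard library. This is not the SMA's implementation, and the file name is hypothetical:

```python
import ast
from collections import Counter

def count_imports(path: str) -> Counter:
    """Tally the top-level package names imported by a Python source file."""
    with open(path) as f:
        tree = ast.parse(f.read())
    imports = Counter()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imports[alias.name.split(".")[0]] += 1
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports[node.module.split(".")[0]] += 1
    return imports

# "etl_job.py" is a hypothetical file name:
print(count_imports("etl_job.py"))  # e.g. Counter({'pyspark': 4, 'smtplib': 1})
```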

What to look out for?

Third-party libraries that are not supported in Snowflake. These libraries will tell the story where the Readiness Score cannot. Suppose you discover imports of libraries such as mllib or streaming, or of libraries such as graphs, subprocess, or smtplib. In that case, that is an indication that there will be challenges in the migration. The presence of these libraries does not mean that a codebase or use case cannot be migrated, but it does mean that the use case needs to be understood further. This would be a great time to bring in the WLS team to better understand the use case.

Snowpark Migration Accelerator Issue Summary

This is a summary of the issues and errors that will be present when this workload is migrated. You can see more information here on specific elements that cannot yet be converted, but this section becomes more important once the conversion begins.

What to look out for?

You can look for specific elements listed in this section that were not converted or that have a published workaround. These will also be listed in the Spark reference inventory output locally in the inventories folder. To compare these to the existing mappings, you may need to query the database.
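If you want to triage these locally first, a sketch like the following filters the Spark reference inventory for statuses that need attention. The file path and column names here are assumptions for illustration; check the actual headers of the inventory files in your own output directory:

```python
import pandas as pd

# Path and column names below are assumptions; inspect your inventory's headers.
refs = pd.read_csv("Output/Reports/SparkReferenceInventory.csv")
attention = refs[refs["SupportStatus"].isin(["Workaround", "NotSupported", "NotDefined"])]
print(attention[["Element", "SupportStatus", "FileName"]].head())
```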


Summary:

  • The Readiness Score is the first indicator of whether a codebase is ready for Snowpark. A score above 80% means the codebase is mostly ready to go; a score below 60% will require some additional effort.

For this workload, the score is over 90%. This is a good indicator.

  • The next indicator is size. If the workload has a large amount of code but a small number of references to the Spark API, that could indicate that this use case depends on several third-party libraries. Conversely, a workload with a low Readiness Score but only 100 lines of code or 5 references to the Spark API can be converted manually in short order, regardless of automation.

For this workload, the size is very manageable. There are over 100 files, but fewer than 5,000 references to the Spark API and fewer than 10,000 lines of code. And ~98% of the files have references to the Spark API, so there is not a lot of non-Spark-related Python code.

  • This brings us to the third indicator: imported libraries. The import calls inventory will identify references to imported packages. Some of these packages will indicate that this workload needs more analysis. If there are many third-party references, bring in the WLS team to better understand the use case.

In this case, we have some referenced third-party libraries, but nothing related to ML, Streaming, or other libraries that can be difficult to replicate in Snowpark.

Since this workload appears to be a good candidate for Snowpark, continue to the next step in the Spark Attack process.

Note that the readiness score alone does not make this a good candidate for migration.
