Assessment Output - Reports Folder
What to do with all of this assessment information?
A complete output set of files and reports will be created for all users of the Snowpark Migration Accelerator (SMA). You can see exactly what is output in the Output Reports section of this documentation.
Most of the reports are .csv files that you can view in any spreadsheet software. A summary of what is in these files is presented in the detailed report, which is where we will start when evaluating this output. We will also walk through a few of the .csv files to better understand what is necessary to migrate this codebase, but this walkthrough will not review them all. To see every inventory file generated by the SMA, review the SMA Inventories section of this documentation.
To access the reports, select VIEW REPORTS at the bottom of the screen. Your file explorer will open to the directory where the reports can be found:
Let's start with what we can learn from the Detailed Report.
Note that the version of the detailed report and other inventories reviewed on this page may look different from the version that you see when you run the SMA. The report shown is from the version of the tool available when this walkthrough was built. If you find something that looks significantly different or something that appears off in your results, reach out to the SMA team by emailing sma-support@snowflake.com or report an issue in the tool. (And yes, you can report an issue with the documentation in the SMA itself!)
The Detailed Report is a .docx file that will summarize some of the relevant information that is present in the rest of the inventory files. This is the key artifact you will want to review to better understand how ready a codebase is for Snowpark. There is a full description of everything present in the report elsewhere in this documentation. This walkthrough will only highlight what you should pay attention to in the report, how it affects the readiness score(s), and how to interpret the output.
Let's start with the Readiness Score(s). You should review all of the readiness scores that are available in your report.
Before walking through the report, let us once again define what the Spark API Readiness Score is and how it is calculated. This score is the primary indicator of readiness that the SMA produces. However, it is based only on references to the Spark API, not on any other factors or third-party libraries that may be present in the codebase (the Third Party Libraries Readiness Score is covered next). As a result, this score can often be misleading on its own, but it is still the right starting point.
This score is the number of references to the Spark API that can be converted to the Snowpark API divided by the total number of references to the Spark API. Both values are shown in this section (3541 / 3746 in this execution). The higher the value, the better prepared the workload is to work with the Snowpark API. Anything not converted automatically can still be made to work with Snowpark by modifying the output code, but for assessment purposes this is a strong value and a good indicator.
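The arithmetic behind the score is straightforward. A minimal sketch using the figures reported for this execution:

```python
# Spark API Readiness Score: convertible references / total references.
# The figures below are the ones reported for this example execution.
convertible_refs = 3541   # references the SMA can convert to Snowpark
total_refs = 3746         # all references to the Spark API that were found

readiness = convertible_refs / total_refs
print(f"Spark API readiness: {readiness:.1%}")  # -> Spark API readiness: 94.5%
```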
The Third Party Libraries Readiness Score is available in this version of the report. This score is designed to give you an indication of which third-party APIs are present in the codebase.
The summary page has the readiness score and some information about your execution.
What to look out for? The readiness score! Use it to determine how ready your codebase is to convert references from the Spark API to the Snowpark API. If it is high, the Spark code in this codebase is a good candidate for migration to Snowpark.
The file summary contains basic information about the file extensions present in your codebase, including how many lines of code were in each file, what cells were present in any notebooks (if present in this run), and how many of the files had embedded SQL.
What to look out for? The size. If there are a huge number of files but few with references to the Spark API, that could indicate that this code does a lot that has nothing to do with the Spark API. This could be because the user is not using Spark for much (possibly just for extracting and loading), or because the source code of referenced libraries was included in the run. Either way, it is a clear indication that you need to better understand the use case.
The Spark Usage summary describes how many references to the Spark API were found, and how many can be converted to the Snowpark API. These usages are broken out into different categories such as DataFrame, column, SparkSession, and others.
Each reference is also classified into one of seven support statuses. These statuses identify whether, and how, a reference can be supported in Snowpark. Each status is defined in the appendices at the end of the report and summarized here:
Direct: Direct translation. The same function exists in PySpark and Snowpark; no change is needed.
Rename: The function from PySpark exists in Snowpark, but it needs to be renamed.
Helper: The function has a small difference in Snowpark that can be addressed by creating a functionally equivalent helper function.
Transformation: The function is completely recreated as functionally equivalent code in Snowpark that does not resemble the original function. This can include calling several functions or adding multiple lines of code.
Workaround: The tool cannot convert the PySpark element, but there is a known manual workaround to complete the conversion (the workaround is published in the tool's documentation).
NotSupported: Any function that the tool cannot currently convert from PySpark because there is no applicable equivalent in Snowflake. An error message will be added to the output code.
NotDefined: Any detected usage of a PySpark element that is not yet in the tool's conversion database. These elements will be marked for inclusion in a future version of the tool.
What to look out for?
The readiness score is posted here as well, but you can also check how many of the references require a workaround and whether the majority are direct translations. The more workarounds, helpers, and transformations present, the more this workload will benefit from an accelerator such as the Snowpark Migration Accelerator to assist in migrating the codebase.
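Because each reference carries one of the statuses above, the breakdown is just a tally over the Spark reference inventory. A minimal sketch with made-up status values (the real values come from one of the .csv inventories in the Reports folder):

```python
from collections import Counter

# Hypothetical support statuses as they might appear in a
# Spark reference inventory; these ten values are illustrative only.
statuses = ["Direct", "Direct", "Rename", "Helper", "Transformation",
            "Workaround", "Direct", "NotSupported", "NotDefined", "Direct"]

# Count how many references fall into each support status.
tally = Counter(statuses)
for status, count in sorted(tally.items(), key=lambda kv: -kv[1]):
    print(f"{status:15s} {count}")
```

A tally dominated by Direct and Rename suggests a mostly mechanical conversion; a tally heavy on Workaround or NotSupported suggests more manual effort.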
Every time the SMA detects an imported package or library, it counts it as an import call. All recognized or common import calls appear in the import summary on this page of the detailed report. (Note that the tool records all import calls in the local output inventories folder and in the assessment database.) These import calls are not yet categorized as supported or not supported in Snowflake.
What to look out for?
Third-party libraries that are not supported in Snowflake. These libraries tell the story where the readiness score cannot. If you discover imports of libraries such as mllib or streaming, or of libraries such as graphs, subprocess, or smtplib, that is an indication that there will be challenges in the migration. The presence of these libraries does not mean that a codebase or use case cannot be migrated, but it does mean that we need to understand the use case further. This would be a great time to bring in the WLS team to understand this use case better.
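Independent of the SMA's own import inventory, you can spot-check which modules a Python file imports with a short script. This is a sketch using the standard library `ast` module, not the SMA's implementation:

```python
import ast
from pathlib import Path

def imports_in_file(path):
    """Return the set of top-level module names imported by a Python file."""
    tree = ast.parse(Path(path).read_text())
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import pyspark.sql" counts as an import of "pyspark"
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # "from pyspark.sql import functions" also counts as "pyspark"
            found.add(node.module.split(".")[0])
    return found
```

Running this over a codebase and diffing the result against the libraries you know are available in Snowflake gives a quick first read on the third-party story.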
This is a summary of the issues and errors that will need to be addressed when this workload is migrated. You can see more information here on specific elements that cannot yet be converted, but this section becomes most important when starting the conversion.
What to look out for?
Look for specific elements listed in this section that are not converted or that have a published workaround. These are also listed in the Spark reference inventory written to the local inventories folder. To compare these against the existing mappings, you may need to query the database.
The readiness score is the first indicator of whether a codebase is ready for Snowpark. A codebase scoring above 80% is mostly ready to go; one scoring below 60% will require additional effort.
For this workload, the score is over 90%. This is a good indicator.
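The triage rule of thumb above can be expressed as a tiny helper. This is an illustrative sketch, not part of the SMA; the label for the 60-80% band is our own assumption, since the text only defines the two ends:

```python
def readiness_band(score):
    """Map a readiness score (0.0-1.0) to a rough triage band."""
    if score > 0.80:
        return "mostly ready for Snowpark"
    if score >= 0.60:
        return "review before migrating"   # assumed label for the middle band
    return "additional effort required"

print(readiness_band(3541 / 3746))  # -> mostly ready for Snowpark
```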
The next indicator is size. If the workload has a large amount of code but a small number of references to the Spark API, the use case may depend on several third-party libraries. Conversely, a workload with a low readiness score but only 100 lines of code or 5 references to the Spark API can be converted manually quite quickly, regardless of automation.
For this workload, the size is very manageable: over 100 files, but fewer than 5000 references to the Spark API and fewer than 10000 lines of code. About 98% of the files have references to the Spark API, so there is not a lot of non-Spark-related Python code.
This brings us to the third indicator: imported libraries. The import calls inventory will identify references to imported packages. Some of these packages will indicate that this workload needs more analysis. If there are many third-party references, bring in the WLS team to better understand the use case.
In this case, we have some referenced third-party libraries, but nothing related to ML, Streaming, or other libraries that can be difficult to replicate in Snowpark.
Since this workload appears to be a good candidate for Snowpark, continue to the next step in the Spark Attack process.