SMA Inventories

Data for Decision Making

The Snowpark Migration Accelerator (SMA) generates a large amount of data when it is run on a codebase. That data is used to create the summary reporting present in both the assessment summary and the curated reports output by the tool. Each time the tool is run, the raw data itself is also made available in the Reports folder as a series of inventories (spreadsheets).

Each inventory can be overwhelming, but understanding this information can unlock additional insight into the condition of both the original workload and the converted workload. The sections below name each output file and describe every column in it.

Some of these inventories are also shared via telemetry. More information can be found in the telemetry section of this documentation.

Assessment Report Details

The AssessmentReport.json file contains the information shown in both the Detailed Report and the Assessment Summary in the application. This information exists specifically to populate those reports, so it likely overlaps with information present in other spreadsheets.

Files Inventory

The files.csv file is an inventory of every file present in that execution of the tool. It reports the file type and size of each file.

  • Path: the file path for each file, relative to the root directory. (For example, if a file is in the root folder, only the filename is recorded.)

  • Technology: source language scanner (Python or Scala)

  • FileKind: whether the file is a file with source code or another kind of file (like a text file or log)

  • BinaryKind: whether the file is readable or if it’s a binary file

  • Bytes: size of the file in bytes.

  • SupportedStatus: files are neither supported nor not-supported, so this file only reports "DoesNotApply"
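
As a rough illustration of how this inventory can be used, the sketch below sums the Bytes column per Technology using Python's standard csv module. The sample rows, including the FileKind and BinaryKind values, are hypothetical placeholders, and the header row is assumed to match the column names listed above.

```python
import csv
import io
from collections import defaultdict


def bytes_by_technology(csv_file):
    """Sum the Bytes column for each Technology value."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row["Technology"]] += int(row["Bytes"])
    return dict(totals)


# Hypothetical sample rows; a real files.csv will differ.
SAMPLE = """Path,Technology,FileKind,BinaryKind,Bytes,SupportedStatus
etl/load.py,Python,CodeFile,Text,2048,DoesNotApply
etl/transform.py,Python,CodeFile,Text,1024,DoesNotApply
jobs/Main.scala,Scala,CodeFile,Text,4096,DoesNotApply
"""

totals = bytes_by_technology(io.StringIO(SAMPLE))
print(totals)
```

To run this against a real report, pass an open file handle for files.csv in place of the in-memory sample.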

Import Usages Inventory

The ImportUsagesInventory.csv lists all the referenced import calls in the codebase. An import is any external library that is imported at any point in a file.

  • Element: the unique name of the Spark reference.

  • ProjectId: name of the project (root directory the tool was run on)

  • FileId: file where the spark reference was found and the relative path to that file.

  • Count: the number of times that element shows up in a single line.

  • Alias: the alias of the element (if any).

  • Kind: null/empty value because all elements are imports.

  • Line: the line number in the source files where the element was found.

  • PackageName: the name of the package where the element was found.

  • Supported: Whether this reference is “supported” or not. Values: True/False.

  • Automated: null/empty. This column is deprecated.

  • Status: value Invalid. This column is deprecated.

  • Statement: the code where the element was used. [NOTE: This column is not sent via telemetry.]

  • SessionId: Unique identifier for each run of the tool.

  • SnowConvertCoreVersion: the version number for the core code process of the tool

  • SnowparkVersion: the version of the Snowpark API available for the specified technology and run of the tool.

  • ElementPackage: the package name where the imported element is declared (when available).

  • CellId: if this element was found in a notebook file, the numbered location of the cell where this element was in the file.

  • ExecutionId: the unique identifier for this execution of the SMA.

  • Origin: category of the import reference. Possible values are BuiltIn, ThirdPartyLib, or blank.
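
One possible use of these columns is to list the imports marked as not supported, weighted by how often they occur. The sketch below assumes the Supported column holds the strings True/False as described above; the sample uses only a subset of the columns, and its rows and Supported flags are made up for illustration rather than taken from real SMA output.

```python
import csv
import io
from collections import Counter


def unsupported_imports(csv_file):
    """Total the Count column per Element for rows where Supported is False."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if row["Supported"].strip().lower() == "false":
            counts[row["Element"]] += int(row["Count"])
    return counts


# Hypothetical rows; the Supported values here are illustrative only.
SAMPLE = """Element,FileId,Count,Line,Supported,Origin
libA,etl/load.py,1,1,True,ThirdPartyLib
libB,etl/load.py,1,2,False,ThirdPartyLib
libB,etl/transform.py,1,3,False,ThirdPartyLib
os,etl/transform.py,1,1,True,BuiltIn
"""

unsupported = unsupported_imports(io.StringIO(SAMPLE))
print(unsupported.most_common())
```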

Input Files Inventory

Similar to the files inventory, the InputFilesInventory.csv has a list of every file by filetype and size.

  • Element: filename (same as FileId)

  • ProjectId: name of the project (root directory the tool was run on)

  • FileId: file where the spark reference was found and the relative path to that file.

  • Count: count of files with that filename

  • SessionId: Unique identifier for each session of the tool.

  • Extension: the file’s extension

  • Technology: the source file’s technology based on extension

  • Bytes: size of the file in bytes

  • CharacterLength: count of characters in the file

  • LinesOfCode: lines of code in the file

  • ParsingResult: “Successful” if the file was fully parsed, “Error” if it was not
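
To gauge how much of a codebase the tool could not parse, the LinesOfCode and ParsingResult columns can be combined. The following is a minimal sketch under the assumption that ParsingResult contains exactly "Successful" or "Error"; the sample rows are hypothetical and show only some of the columns.

```python
import csv
import io


def unparsed_share(csv_file):
    """Return the files that failed to parse and the share of LinesOfCode they hold."""
    failed, failed_loc, total_loc = [], 0, 0
    for row in csv.DictReader(csv_file):
        loc = int(row["LinesOfCode"])
        total_loc += loc
        if row["ParsingResult"] != "Successful":
            failed.append(row["FileId"])
            failed_loc += loc
    share = failed_loc / total_loc if total_loc else 0.0
    return failed, share


# Hypothetical sample rows.
SAMPLE = """Element,FileId,Extension,Technology,Bytes,LinesOfCode,ParsingResult
a.py,a.py,.py,Python,900,30,Successful
b.py,b.py,.py,Python,2100,70,Error
"""

failed, share = unparsed_share(io.StringIO(SAMPLE))
print(failed, share)
```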

Input and Output Files Inventory

The IOFilesInventory.csv lists all external elements that are being read from or written to in the codebase.

  • Element: the file, variable, or other element being read or written

  • ProjectId: name of the project (root directory the tool was run on)

  • FileId: file where the spark reference was found and the relative path to that file.

  • Count: count of files with that filename

  • isLiteral: if the read/write location was in a literal

  • Format: if the SMA can determine the format of the element (such as csv, json, etc.)

  • FormatType: if the format above is specific

  • Mode: value will be Read or Write depending on whether there is a reader or writer

  • Supported: Whether this operation is supported in Snowpark

  • Line: the line in the file where the read or write occurs

  • SessionId: Unique identifier for each session of the tool

  • OptionalSettings: if a parameter is defined in the element, it will be listed here

  • CellId: cell id where that element was in that FileId (if in a notebook, null otherwise)

  • ExecutionId: Unique identifier for each run of the tool
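
A quick way to see which read and write operations may need attention is to tally rows by Mode, Format, and Supported. This sketch treats the Supported value as an opaque string, because its exact representation in this file is not specified above; the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def io_matrix(csv_file):
    """Count rows for each (Mode, Format, Supported) combination."""
    matrix = Counter()
    for row in csv.DictReader(csv_file):
        matrix[(row["Mode"], row["Format"], row["Supported"])] += 1
    return matrix


# Hypothetical sample rows with a subset of the columns.
SAMPLE = """Element,FileId,Format,Mode,Supported,Line
data/in_a.csv,etl/load.py,csv,Read,True,10
data/in_b.csv,etl/load.py,csv,Read,True,11
data/out.parquet,etl/load.py,parquet,Write,False,40
"""

matrix = io_matrix(io.StringIO(SAMPLE))
for key, n in sorted(matrix.items()):
    print(key, n)
```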

Issue Inventory

The Issues.csv lists every conversion issue found in that codebase. A description, the exact location of the issue in the file, and a code associated with that issue will be reported in this document. You can find out more about each issue in the issue analysis section of this documentation.

  • Code: the unique code for each issue reported by the tool.

  • Description: the text describing the issue and, when applicable, the name of the Spark reference.

  • Category: the classification of each issue. The options are Warning, Conversion Error, Parser Error, Helper, Transformation, WorkAround, NotSupported, and NotDefined.

  • NodeType: the name associated with the syntax node where the issue was found.

  • FileId: file where the spark reference was found and the relative path to that file.

  • ProjectId: name of the project (root directory the tool was run on)

  • Line: the line number in the source file where the issue was found.

  • Column: the column position in the source file where the issue was found.
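
Because every row carries a Code, a simple frequency count shows which issues dominate a codebase. The sketch below is one way to do that; the issue codes in the sample (X0001 and X0002) are placeholders, not real SMA issue codes, and the descriptions are invented.

```python
import csv
import io
from collections import Counter


def issues_by_code(csv_file, top=3):
    """Return the most frequent issue codes as (code, occurrences) pairs."""
    counts = Counter(row["Code"] for row in csv.DictReader(csv_file))
    return counts.most_common(top)


# Hypothetical rows; X0001 and X0002 are placeholder codes.
SAMPLE = """Code,Description,Category,NodeType,FileId,ProjectId,Line,Column
X0001,placeholder description,Warning,Call,etl/load.py,demo,12,4
X0001,placeholder description,Warning,Call,etl/transform.py,demo,30,8
X0002,placeholder description,Parser Error,Statement,etl/load.py,demo,55,0
"""

ranking = issues_by_code(io.StringIO(SAMPLE))
print(ranking)
```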

Joins Inventory

The JoinsInventory.csv has an inventory of all dataframe joins done in that codebase.

  • Element: line number where the join begins (and ends, if not on a single line)

  • ProjectId: name of the project (root directory the tool was run on)

  • FileId: file where the spark reference was found and the relative path to that file.

  • Count: count of files with that filename

  • isSelfJoin: TRUE if the join is a self join, FALSE if not

  • HasLeftAlias: TRUE if the join has a left alias, FALSE if not

  • HasRightAlias: TRUE if the join has a right alias, FALSE if not

  • Line: line number where the join begins

  • SessionId: Unique identifier for each session of the tool

  • CellId: cell id where that element was in that FileId (if in a notebook, null otherwise)

  • ExecutionId: Unique identifier for each run of the tool
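
As an example of combining the boolean columns, the sketch below lists self joins where at least one side has no alias. Whether such joins need manual review is an assumption made for illustration, not a statement from the tool. The sample rows are hypothetical.

```python
import csv
import io


def unaliased_self_joins(csv_file):
    """Return (FileId, Line) for self joins missing a left or right alias."""
    hits = []
    for row in csv.DictReader(csv_file):
        flags = {
            name: row[name].strip().upper() == "TRUE"
            for name in ("isSelfJoin", "HasLeftAlias", "HasRightAlias")
        }
        both_aliased = flags["HasLeftAlias"] and flags["HasRightAlias"]
        if flags["isSelfJoin"] and not both_aliased:
            hits.append((row["FileId"], int(row["Line"])))
    return hits


# Hypothetical sample rows with a subset of the columns.
SAMPLE = """Element,FileId,isSelfJoin,HasLeftAlias,HasRightAlias,Line
10,etl/a.py,TRUE,TRUE,TRUE,10
22,etl/a.py,TRUE,FALSE,TRUE,22
5,etl/b.py,FALSE,FALSE,FALSE,5
"""

review = unaliased_self_joins(io.StringIO(SAMPLE))
print(review)
```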

Notebook Cells Inventory

The NotebookCellsInventory.csv gives an inventory of all cells in a notebook based on the source code for each cell and the lines of code in that cell.

  • Element: source language (Python, Scala, or SQL)

  • ProjectId: name of the project (root directory the tool was run on)

  • FileId: file where the spark reference was found and the relative path to that file.

  • Count: count of files with that filename

  • CellId: cell id where that element was in that FileId (if in a notebook, null otherwise)

  • Arguments: null (this field will be empty)

  • LOC: lines of code in that cell

  • Size: count of characters in that cell

  • SupportedStatus: TRUE, unless there are any unsupported elements in that cell (FALSE)

  • ParsingResult: “Successful” if the cell was fully parsed, “Error” if it was not

Notebook Size Inventory

The NotebookSizeInventory.csv lists the size in lines of code of different source languages present in notebook files.

  • Element: filename (for this spreadsheet, it is the same as the FileId)

  • ProjectId: name of the project (root directory the tool was run on)

  • FileId: file where the spark reference was found and the relative path to that file.

  • Count: count of files with that filename

  • PythonLOC: Python lines of code present in notebook cells (will be 0 for non-notebook files)

  • ScalaLOC: Scala lines of code present in notebook cells (will be 0 for non-notebook files)

  • SqlLOC: SQL lines of code present in notebook cells (will be 0 for non-notebook files)

  • Line: null (this field will be empty)

  • SessionId: Unique identifier for each session of the tool.

  • ExecutionId: Unique identifier for each run of the tool.
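
The three LOC columns make it straightforward to compute the language mix of each notebook. A minimal sketch follows, assuming the LOC columns hold integers; rows whose LOC values are all zero (non-notebook files) are skipped. The file names and numbers are hypothetical.

```python
import csv
import io


def language_mix(csv_file):
    """Return, per FileId, the fraction of notebook lines written in each language."""
    mix = {}
    for row in csv.DictReader(csv_file):
        loc = {lang: int(row[lang + "LOC"]) for lang in ("Python", "Scala", "Sql")}
        total = sum(loc.values())
        if total:  # skip rows with no notebook code at all
            mix[row["FileId"]] = {lang: n / total for lang, n in loc.items()}
    return mix


# Hypothetical sample rows with a subset of the columns.
SAMPLE = """Element,FileId,PythonLOC,ScalaLOC,SqlLOC
nb/etl.ipynb,nb/etl.ipynb,60,0,40
src/util.py,src/util.py,0,0,0
"""

mix = language_mix(io.StringIO(SAMPLE))
print(mix)
```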

Pandas Usages Inventory

[Python Only] The PandasUsagesInventory.csv lists every reference to the Pandas API present in the scanned codebase.

  • Element: the unique name of the pandas reference.

  • ProjectId: name of the project (root directory the tool was run on)

  • FileId: file where the spark reference was found and the relative path to that file.

  • Count: the number of times that element shows up in a single line.

  • Alias: the alias of the element (applies just for import elements).

  • Kind: a category for each element. These could include Class, Variable, Function, Import and others.

  • Line: the line number in the source files where the element was found.

  • PackageName: the name of the package where the element was found.

  • Supported: Whether this reference is “supported” or not. Values: True/False.

  • Automated: Whether or not the tool can automatically convert it. Values: True/False.

  • Status: the categorization of each element. The options are Rename, Direct, Helper, Transformation, WorkAround, NotSupported, NotDefined.

  • Statement: how that element was used. [NOTE: This column is not sent via telemetry.]

  • SessionId: Unique identifier for each run of the tool.

  • SnowConvertCoreVersion: the version number for the core code process of the tool

  • SnowparkVersion: the version of Snowpark API available for the specified technology and run of the tool.

  • PandasVersion: version number of the pandas API that was used to identify elements in this codebase

  • CellId: cell id where that element was in that FileId (if in a notebook, null otherwise)

  • ExecutionId: Unique identifier for each run of the tool.

Spark Usages Inventory

The SparkUsagesInventory.csv shows the exact location and usage for each reference to the Spark API. This information is used to build the Readiness Score.

  • Element: the unique name of the Spark reference.

  • ProjectId: name of the project (root directory the tool was run on)

  • FileId: file where the spark reference was found and the relative path to that file.

  • Count: the number of times that element shows up in a single line.

  • Alias: the alias of the element (applies just for import elements).

  • Kind: a category for each element. These could include Class, Variable, Function, Import and others.

  • Line: the line number in the source files where the element was found.

  • PackageName: the name of the package where the element was found.

  • Supported: Whether this reference is “supported” or not. Values: True/False.

  • Automated: Whether or not the tool can automatically convert it. Values: True/False.

  • Status: the categorization of each element. The options are Rename, Direct, Helper, Transformation, WorkAround, NotSupported, NotDefined.

  • Statement: the code where the element was used. [NOTE: This column is not sent via telemetry.]

  • SessionId: Unique identifier for each run of the tool.

  • SnowConvertCoreVersion: the version number for the core code process of the tool

  • SnowparkVersion: the version of Snowpark API available for the specified technology and run of the tool.

  • CellId: if this element was found in a notebook file, the numbered location of the cell where this element was in the file.

  • ExecutionId: the unique identifier for this execution of the SMA.
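
To get a rough feel for how much of the Spark API usage is marked as supported, the Supported flag can be weighted by Count. Note that this is only an illustrative ratio: the exact formula the SMA uses for the Readiness Score is not given here, so the function below should not be read as that formula. The element names and Supported flags in the sample are made up.

```python
import csv
import io


def supported_share(csv_file):
    """Return the Count-weighted fraction of references with Supported == True."""
    supported = total = 0
    for row in csv.DictReader(csv_file):
        n = int(row["Count"])
        total += n
        if row["Supported"].strip().lower() == "true":
            supported += n
    return supported / total if total else None


# Hypothetical rows; the flags do not reflect real SMA classifications.
SAMPLE = """Element,FileId,Count,Kind,Line,Supported,Status
example.api.one,etl/load.py,2,Function,10,True,Direct
example.api.two,etl/load.py,1,Function,14,True,Rename
example.api.three,etl/load.py,1,Function,20,False,NotSupported
"""

ratio = supported_share(io.StringIO(SAMPLE))
print(ratio)
```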

SQL Statements Inventory

The SqlStatementsInventory.csv has a count of the SQL keywords present in Spark SQL elements.

  • Element: name for the code element where the SQL was found

  • ProjectId: name of the project (root directory the tool was run on)

  • FileId: file where the spark reference was found and the relative path to that file.

  • Count: the number of times that element shows up in a single line.

  • InterpolationCount: count of other elements inserted into the element

  • Keywords: a dictionary of the keywords and count of each

  • Size: character count for each SQL statement

  • LiteralCount: count of strings in this element

  • NonLiteralCount: SQL components of the element not in a literal

  • Line: the line number where that element occurs

  • SessionId: Unique identifier for each session of the tool.

  • CellId: cell id where that element was in that FileId (if in a notebook, null otherwise)

  • ExecutionId: Unique identifier for each run of the tool.
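
Statements with a nonzero InterpolationCount have other elements inserted into them, which suggests the SQL text is assembled at runtime. The sketch below lists such statements, largest first. Treating them as candidates for closer review is an assumption made for illustration; the sample rows are hypothetical and omit the Keywords column, whose serialized format is not specified above.

```python
import csv
import io


def interpolated_sql(csv_file):
    """Return (FileId, Line, Size) for statements with InterpolationCount > 0, largest first."""
    rows = [r for r in csv.DictReader(csv_file) if int(r["InterpolationCount"]) > 0]
    rows.sort(key=lambda r: int(r["Size"]), reverse=True)
    return [(r["FileId"], int(r["Line"]), int(r["Size"])) for r in rows]


# Hypothetical sample rows with a subset of the columns.
SAMPLE = """Element,FileId,Count,InterpolationCount,Size,Line
sql_call,etl/a.py,1,0,120,8
sql_call,etl/a.py,1,2,80,15
sql_call,etl/b.py,1,1,300,42
"""

candidates = interpolated_sql(io.StringIO(SAMPLE))
print(candidates)
```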

Third Party Usages Inventory

The ThirdPartyUsagesInventory.csv lists the third-party references found in the codebase.

  • Element: the unique name of the third-party reference.

  • ProjectId: name of the project (root directory the tool was run on)

  • FileId: file where the spark reference was found and the relative path to that file.

  • Count: the number of times that element shows up in a single line.

  • Alias: the alias of the element (if any).

  • Kind: categorization of the element such as variable, type, function, or class.

  • Line: the line number in the source files where the element was found.

  • PackageName: package name for the element (concatenation of ProjectId and FileId in Python).

  • Statement: the code where the element was used. [NOTE: This column is not sent via telemetry.]

  • SessionId: Unique identifier for each session of the tool.

  • CellId: cell id where that element was in that FileId (if in a notebook, null otherwise)

  • ExecutionId: Unique identifier for each execution of the tool.

Tool Execution Summary

The tool_execution.csv has some basic information about this run of the SMA tool.

  • ExecutionId: Unique identifier for each run of the tool.

  • ToolName: the name of the tool. Values: PythonSnowConvert or SparkSnowConvert (the Scala tool).

  • Tool_Version: the version number of the tool.

  • AssemblyName: the name of the code processor (essentially, a longer version of the ToolName)

  • LogFile: whether a log file was sent on an exception/failure

  • FinalResult: where the tool stopped if there was an exception/failure

  • ExceptionReport: if an exception report was sent on an exception/failure

  • StartTime: The timestamp for when the tool started executing.

  • EndTime: The timestamp for when the tool stopped executing.

  • SystemName: The serial number of the machine where the tool was executing (this is only used for troubleshooting and license validation purposes).
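
The StartTime and EndTime columns can be used to compute how long a run took. The sketch below assumes both timestamps are written in ISO 8601 form, which is not confirmed above; if the real file uses another format, datetime.strptime with a matching pattern would be needed instead. The execution ID, version, and timestamps in the sample are hypothetical.

```python
import csv
import io
from datetime import datetime


def run_durations(csv_file):
    """Return the run time in seconds for each ExecutionId."""
    durations = {}
    for row in csv.DictReader(csv_file):
        # Assumption: timestamps are ISO 8601 strings.
        start = datetime.fromisoformat(row["StartTime"])
        end = datetime.fromisoformat(row["EndTime"])
        durations[row["ExecutionId"]] = (end - start).total_seconds()
    return durations


# Hypothetical sample row with a subset of the columns.
SAMPLE = """ExecutionId,ToolName,Tool_Version,StartTime,EndTime
run-1,PythonSnowConvert,0.0.0,2024-01-01T10:00:00,2024-01-01T10:02:30
"""

durations = run_durations(io.StringIO(SAMPLE))
print(durations)
```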
