SMA Inventories
Data for Decision Making
Last updated
The Snowpark Migration Accelerator (SMA) generates a large amount of data when it is run on a codebase. That data is used to create the summary reporting output by the tool. The raw data itself is also made available as a series of inventories (spreadsheets) in the Reports folder when the tool is run.
Each inventory can be overwhelming at first, but understanding this information can unlock additional insight into the condition of both the original workload and the converted workload. The name of each output file and a description of every column it contains are given below.
Some of these inventories are also shared via telemetry. More information can be found in the telemetry section of this documentation.
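Since every inventory is a plain CSV file, it can be loaded with standard tooling. The sketch below shows the general pattern, using a hypothetical two-row excerpt (the column names and values are illustrative, not taken from a real run):

```python
import csv
import io

# Hypothetical excerpt of an SMA inventory; in a real run these CSVs are
# written to the Reports folder of the output directory.
sample = """Element,Supported,Line
pyspark.sql.functions.col,True,10
pyspark.rdd.RDD.getNumPartitions,False,42
"""

def load_inventory(text: str) -> list[dict]:
    """Parse inventory CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_inventory(sample)
print(rows[0]["Element"])  # -> pyspark.sql.functions.col
```

In practice you would pass the contents of a file from the Reports folder instead of the inline string.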
The AssessmentReport.json file contains information that is shown in both the Detailed Report and the Assessment Summary in the application. This information is specifically to populate those reports and likely includes information that is also present in other spreadsheets.
The DbxElementsInventory.csv lists the DBX elements found inside notebooks.
Element: The DBX element name.
ProjectId: Name of the project (root directory the tool was run on)
FileId: File where the element was found and the relative path to that file.
Count: The number of times that element shows up in a single line.
Category: The element category.
Alias: The alias of the element (applies just for import elements).
Kind: A category for each element. These could include Function or Magic.
Line: The line number in the source files where the element was found.
PackageName: The name of the package where the element was found.
Supported: Whether this reference is “supported” or not. Values: True/False.
Automated: Whether or not the tool can automatically convert it. Values: True/False.
Status: The categorization of each element. The options are Rename, Direct, Helper, Transformation, WorkAround, NotSupported, NotDefined.
Statement: The code where the element was used. [NOTE: This column is not sent via telemetry.]
SessionId: Unique identifier for each run of the tool.
SnowConvertCoreVersion: The version number for the core code process of the tool.
SnowparkVersion: The version of Snowpark API available for the specified technology and run of the tool.
CellId: If this element was found in a notebook file, the numbered location of the cell where this element was in the file.
ExecutionId: The unique identifier for this execution of the SMA.
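The Status column above is often the most useful starting point for triage. A minimal sketch, using hypothetical rows in the shape of DbxElementsInventory.csv, of tallying elements per conversion status:

```python
from collections import Counter

# Hypothetical rows in the shape of DbxElementsInventory.csv.
elements = [
    {"Element": "dbutils.fs.ls", "Status": "NotSupported"},
    {"Element": "dbutils.widgets.get", "Status": "Transformation"},
    {"Element": "display", "Status": "NotSupported"},
]

# Tally elements per conversion status to gauge migration effort.
by_status = Counter(row["Status"] for row in elements)
print(by_status["NotSupported"])  # -> 2
```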
The ExecutionFlowInventory.csv lists the relations between the different workload scopes, based on the function calls found. This inventory's main purpose is to serve as the basis for identifying entry points.
Caller: The full name of the scope where the call was found.
CallerType: The type of the scope where the call was found. This can be: Function, Class, or Module.
Invoked: The full name of the element that was called.
InvokedType: The type of the element. This can be: Function or Class.
FileId: The relative path of the file. (Starting from the input folder the user chose in the SMA tool)
CellId: The cell number where the call was found inside a notebook file, if applies.
Line: The line number where the call was found.
Column: The column number where the call was found.
ExecutionId: The execution id.
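The entry-point idea behind this inventory can be sketched with set arithmetic: a scope that calls others but is never invoked itself is a candidate entry point. The rows and scope names below are hypothetical:

```python
# Hypothetical rows in the shape of ExecutionFlowInventory.csv.
calls = [
    {"Caller": "etl.main", "Invoked": "etl.load_data"},
    {"Caller": "etl.load_data", "Invoked": "etl.clean"},
]

callers = {c["Caller"] for c in calls}
invoked = {c["Invoked"] for c in calls}
# A scope that calls others but is never called itself is a candidate entry point.
entry_points = callers - invoked
print(entry_points)  # -> {'etl.main'}
```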
The Checkpoints.csv lists the checkpoints generated for the user workload. These checkpoints are ready to be used with the Checkpoints feature of the Snowflake extension.
Name: The checkpoint name (using the format described before).
FileId: the relative path of the file (starting from the input folder the user chose in the SMA tool).
CellId: the number of cell where the DataFrame operation was found inside a notebook file.
Line: line number where the DataFrame operation was found.
Column: the column number where the DataFrame operation was found.
Type: the use case of the checkpoints (Collection or Validation).
DataFrameName: The name of the DataFrame.
Location: The assignment number of the DataFrame name.
Enabled: Indicates whether the checkpoint is enabled (True or False).
Mode: The mode number of the collection (Schema [1] or DataFrame [2]).
Sample: The sample of the DataFrame.
EntryPoint: The entry point that guides the flow to execute the checkpoint.
ExecutionId: the execution id.
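A quick way to see which checkpoints will actually run is to filter on the Enabled flag and group by Type. The rows and checkpoint names below are hypothetical:

```python
from collections import defaultdict

# Hypothetical rows in the shape of Checkpoints.csv.
checkpoints = [
    {"Name": "df_sales_1", "Type": "Collection", "Enabled": "True"},
    {"Name": "df_sales_2", "Type": "Validation", "Enabled": "False"},
    {"Name": "df_orders_1", "Type": "Collection", "Enabled": "True"},
]

# Keep only enabled checkpoints, grouped by their use case.
enabled_by_type = defaultdict(list)
for cp in checkpoints:
    if cp["Enabled"] == "True":
        enabled_by_type[cp["Type"]].append(cp["Name"])
```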
The DataFramesInventory.csv lists the DataFrame assignments found, which are used to generate checkpoints for the user workload.
FullName: The full name of the DataFrame.
Name: The simple name of the variable of the DataFrame.
FileId: The relative path of the file (starting from the input folder the user chose in the SMA tool).
CellId: The cell number where the DataFrame operation was found inside a notebook file.
Line: The line number where the DataFrame operation was found.
Column: The column number where the DataFrame operation was found.
AssignmentNumber: The number of assignments for this particular identifier (not symbol) in the file.
RelevantFunction: The function that caused this DataFrame to be collected.
RelatedDataFrames: The fully qualified name(s) of the DataFrame(s) involved in the operation (separated by semicolons).
EntryPoints: Empty in this phase; it is filled in a later phase.
ExecutionId: the execution id.
The ArtifactDependencyInventory.csv lists the artifact dependencies of each file analyzed by the SMA. This inventory allows the user to determine which artifacts are needed for the file to work properly in Snowflake.
The following are considered artifacts: a third-party library, SQL entity, source of a read or write operation, and another source code file in the workload.
ExecutionId: the identifier of the execution.
FileId: the identifier of the source code file.
Dependency: the artifact dependency that the current file has.
Type: the type of the artifact dependency.
UserCodeFile: source code or notebook.
IOSources: a resource required for an input or output operation.
ThirdPartyLibraries: a third-party library.
UnknownLibraries: a library whose origin was not determined by SMA.
SQLObjects: an SQL entity: table or view, for example.
Success: If the artifact needs any intervention, it shows FALSE; otherwise, it shows TRUE.
Status_Detail: the status of the artifact dependency, based on the type.
UserCodeFile:
Parsed: the file was parsed successfully.
NotParsed: the file parsing failed.
IOSources:
Exists: the resource of the operation is in the workload.
DoesNotExists: the resource of the operation is not present in the input.
ThirdPartyLibraries:
Supported: the library is supported by Snowpark Anaconda.
NotSupported: the library is not supported by Snowpark Anaconda.
UnknownLibraries:
NotSupported: the library is always reported as not supported because its origin was not determined by the SMA.
SQLObjects:
DoesNotExists: the embedded statement that creates the entity is not in the input source code.
Exists: the embedded statement that creates the entity is in the input source code.
Arguments: an extra data of the artifact dependency, based on the type.
Location: the collection of cell ID and line number where the artifact dependency is being used in the source code file.
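To find the artifacts that still need manual work before a file can run in Snowflake, filter on Success and tally by Type. The rows below are hypothetical:

```python
from collections import Counter

# Hypothetical rows in the shape of ArtifactDependencyInventory.csv.
deps = [
    {"FileId": "jobs/etl.py", "Dependency": "requests",
     "Type": "ThirdPartyLibraries", "Success": "TRUE"},
    {"FileId": "jobs/etl.py", "Dependency": "s3://bucket/raw.csv",
     "Type": "IOSources", "Success": "FALSE"},
    {"FileId": "jobs/report.py", "Dependency": "sales_view",
     "Type": "SQLObjects", "Success": "FALSE"},
]

# Dependencies with Success == FALSE need intervention; tally them by type.
pending = Counter(d["Type"] for d in deps if d["Success"] == "FALSE")
```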
The files.csv has an inventory of each file present in that execution of the tool. The filetype and size are reported in this inventory.
Path: the filepath for each file. (Note: this is only within the root directory. For example, if a file is in the root folder only the filename will be recorded.)
Technology: source language scanner (Python or Scala)
FileKind: whether the file is a file with source code or another kind of file (like a text file or log)
BinaryKind: whether the file is readable or if it’s a binary file
Bytes: size of the file in bytes.
SupportedStatus: files are neither supported nor not-supported, so this file only reports "DoesNotApply"
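Because files.csv carries both Technology and Bytes, a simple aggregation gives a size profile of the workload. The rows below are hypothetical:

```python
from collections import defaultdict

# Hypothetical rows in the shape of files.csv.
files = [
    {"Path": "etl.py", "Technology": "Python", "Bytes": "2048"},
    {"Path": "utils.py", "Technology": "Python", "Bytes": "1024"},
    {"Path": "notes.txt", "Technology": "Other", "Bytes": "100"},
]

# Sum file sizes per source technology.
bytes_by_tech = defaultdict(int)
for f in files:
    bytes_by_tech[f["Technology"]] += int(f["Bytes"])
```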
The ImportUsagesInventory.csv has all the referenced import calls in the codebase. An import is classified as an external library that gets imported at any point in the file.
Element: is the unique name for the actual spark reference.
ProjectId: name of the project (root directory the tool was run on)
FileId: file where the spark reference was found and the relative path to that file.
Count: the number of times that element shows up in a single line.
Alias: the alias of the element (if any).
Kind: null/empty value because all elements are imports.
Line: the line number in the source files where the element was found.
PackageName: the name of the package where the element was found.
Supported: Whether this reference is “supported” or not. Values: True/False.
Automated: null/empty. This column is deprecated.
Status: value Invalid. This column is deprecated.
Statement: the code where the element was used. [NOTE: This column is not sent via telemetry.]
SessionId: Unique identifier for each run of the tool.
SnowConvertCoreVersion: the version number for the core code process of the tool
SnowparkVersion: the version of snowpark API available for the specified technology and run of the tool.
ElementPackage: the package name where the imported element is declared (when available).
CellId: if this element was found in a notebook file, the numbered location of the cell where this element was in the file.
ExecutionId: the unique identifier for this execution of the SMA.
Origin: category of the import reference. Possible values are BuiltIn, ThirdPartyLib, or blank.
FullName: the correct full path of the current element.
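The Origin column makes it easy to see how the workload's imports break down. A minimal sketch, treating a blank Origin as unknown, over hypothetical rows:

```python
from collections import Counter

# Hypothetical rows in the shape of ImportUsagesInventory.csv; Origin may
# be BuiltIn, ThirdPartyLib, or blank.
imports = [
    {"Element": "os", "Origin": "BuiltIn"},
    {"Element": "pandas", "Origin": "ThirdPartyLib"},
    {"Element": "my_helpers", "Origin": ""},
]

# Count imports per origin, mapping the blank value to "Unknown".
origin_counts = Counter(i["Origin"] or "Unknown" for i in imports)
```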
Similar to the files inventory, the InputFilesInventory.csv has a list of every file by filetype and size.
Element: filename (same as FileId)
ProjectId: name of the project (root directory the tool was run on)
FileId: file where the spark reference was found and the relative path to that file.
Count: count of files with that filename
SessionId: Unique identifier for each session of the tool.
Extension: the file’s extension
Technology: the source file’s technology based on extension
Bytes: size of the file in bytes
CharacterLength: count of characters in the file
LinesOfCode: lines of code in the file
ParsingResult: “Successful” if the file was fully parsed, “Error” if it was not.
The IOFilesInventory.csv lists all external elements that are being read from or written to in the codebase.
Element: the file, variable, or other element being read or written
ProjectId: name of the project (root directory the tool was run on)
FileId: file where the spark reference was found and the relative path to that file.
Count: count of files with that filename
isLiteral: if the read/write location was in a literal
Format: if the SMA can determine the format of the element (such as csv, json, etc.)
FormatType: if the format above is specific
Mode: value will be Read or Write depending on whether there is a reader or writer
Supported: Whether this operation is supported in Snowpark
Line: the line in the file where the read or write occurs
SessionId: Unique identifier for each session of the tool
OptionalSettings: if a parameter is defined in the element, it will be listed here
CellId: cell id where that element was in that FileId (if in a notebook, null otherwise)
ExecutionId: Unique identifier for each run of the tool
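Combining the Mode and Supported columns quickly surfaces the I/O operations that will need attention. The rows below are hypothetical:

```python
# Hypothetical rows in the shape of IOFilesInventory.csv.
io_rows = [
    {"Element": "s3://bucket/in.csv", "Format": "csv",
     "Mode": "Read", "Supported": "True"},
    {"Element": "s3://bucket/out.parquet", "Format": "parquet",
     "Mode": "Write", "Supported": "True"},
    {"Element": "local.log", "Format": "text",
     "Mode": "Write", "Supported": "False"},
]

# Collect write targets that Snowpark does not support.
unsupported_writes = [r["Element"] for r in io_rows
                      if r["Mode"] == "Write" and r["Supported"] == "False"]
```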
Code: the unique code for each issue reported by the tool.
Description: the text describing the issue, including the name of the spark reference when applicable.
Category: the classification of each issue. The options are Warning, Conversion Error, Parser Error, Helper, Transformation, WorkAround, NotSupported, and NotDefined.
NodeType: the name associated with the syntax node where the issue was found.
FileId: file where the spark reference was found and the relative path to that file.
ProjectId: name of the project (root directory the tool was run on)
Line: the line number in the source file where the issue was found.
Column: the column position in the source file where the issue was found.
The JoinsInventory.csv has an inventory of all DataFrame joins performed in the codebase.
Element: line number where the join begins (and ends, if not on a single line)
ProjectId: name of the project (root directory the tool was run on)
FileId: file where the spark reference was found and the relative path to that file.
Count: count of files with that filename
isSelfJoin: TRUE if the join is a self join, FALSE if not
HasLeftAlias: TRUE if the join has a left alias, FALSE if not
HasRightAlias: TRUE if the join has a right alias, FALSE if not
Line: line number where the join begins
SessionId: Unique identifier for each session of the tool
CellId: cell id where that element was in that FileId (if in a notebook, null otherwise)
ExecutionId: Unique identifier for each run of the tool
The NotebookCellsInventory.csv gives an inventory of all cells in a notebook based on the source code for each cell and the lines of code in that cell.
Element: source language (Python, Scala, or SQL)
ProjectId: name of the project (root directory the tool was run on)
FileId: file where the spark reference was found and the relative path to that file.
Count: count of files with that filename
CellId: cell id where that element was in that FileId (if in a notebook, null otherwise)
Arguments: null (this field will be empty)
LOC: lines of code in that cell
Size: count of characters in that cell
SupportedStatus: TRUE, unless there are any unsupported elements in that cell (FALSE)
ParsingResult: “Successful” if the cell was fully parsed, “Error” if it was not parsed.
The NotebookSizeInventory.csv lists the size in lines of code of different source languages present in notebook files.
Element: filename (for this spreadsheet, it is the same as the FileId)
ProjectId: name of the project (root directory the tool was run on)
FileId: file where the spark reference was found and the relative path to that file.
Count: count of files with that filename
PythonLOC: Python lines of code present in notebook cells (will be 0 for non-notebook files)
ScalaLOC: Scala lines of code present in notebook cells (will be 0 for non-notebook files)
SqlLOC: SQL lines of code present in notebook cells (will be 0 for non-notebook files)
Line: null (this field will be empty)
SessionId: Unique identifier for each session of the tool.
ExecutionId: Unique identifier for each run of the tool.
[Python Only] The PandasUsagesInventory.csv lists every reference to the Pandas API present in the scanned codebase.
Element: is the unique name for the actual pandas reference.
ProjectId: name of the project (root directory the tool was run on)
FileId: file where the spark reference was found and the relative path to that file.
Count: the number of times that element shows up in a single line.
Alias: the alias of the element (applies just for import elements).
Kind: a category for each element. These could include Class, Variable, Function, Import and others.
Line: the line number in the source files where the element was found.
PackageName: the name of the package where the element was found.
Supported: Whether this reference is “supported” or not. Values: True/False.
Automated: Whether or not the tool can automatically convert it. Values: True/False.
Status: the categorization of each element. The options are Rename, Direct, Helper, Transformation, WorkAround, NotSupported, NotDefined.
Statement: how that element was used. [NOTE: This column is not sent via telemetry.]
SessionId: Unique identifier for each run of the tool.
SnowConvertCoreVersion: the version number for the core code process of the tool
SnowparkVersion: the version of Snowpark API available for the specified technology and run of the tool.
PandasVersion: version number of the pandas API that was used to identify elements in this codebase
CellId: cell id where that element was in that FileId (if in a notebook, null otherwise)
ExecutionId: Unique identifier for each run of the tool.
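One common use of this inventory is estimating what fraction of Pandas usages are supported, weighting each element by its Count. The rows and the ratio shown are hypothetical:

```python
# Hypothetical rows in the shape of PandasUsagesInventory.csv.
usages = [
    {"Element": "pandas.DataFrame.merge", "Supported": "True", "Count": "3"},
    {"Element": "pandas.DataFrame.pivot", "Supported": "True", "Count": "1"},
    {"Element": "pandas.io.json.build_table_schema", "Supported": "False", "Count": "1"},
]

# Weight each element by how many times it appears, then take the
# supported fraction as a rough readiness indicator.
total = sum(int(u["Count"]) for u in usages)
supported = sum(int(u["Count"]) for u in usages if u["Supported"] == "True")
readiness = supported / total
print(round(readiness, 2))  # -> 0.8
```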
Element: is the unique name for the actual spark reference.
ProjectId: name of the project (root directory the tool was run on)
FileId: file where the spark reference was found and the relative path to that file.
Count: the number of times that element shows up in a single line.
Alias: the alias of the element (applies just for import elements).
Kind: a category for each element. These could include Class, Variable, Function, Import and others.
Line: the line number in the source files where the element was found.
PackageName: the name of the package where the element was found.
Supported: Whether this reference is “supported” or not. Values: True/False.
Automated: Whether or not the tool can automatically convert it. Values: True/False.
Status: the categorization of each element. The options are Rename, Direct, Helper, Transformation, WorkAround, NotSupported, NotDefined.
Statement: the code where the element was used. [NOTE: This column is not sent via telemetry.]
SessionId: Unique identifier for each run of the tool.
SnowConvertCoreVersion: the version number for the core code process of the tool
SnowparkVersion: the version of Snowpark API available for the specified technology and run of the tool.
CellId: if this element was found in a notebook file, the numbered location of the cell where this element was in the file.
ExecutionId: the unique identifier for this execution of the SMA.
The SqlStatementsInventory.csv has a count of SQL keywords present in SQL Spark elements.
Element: name for the code element where the SQL was found
ProjectId: name of the project (root directory the tool was run on)
FileId: file where the spark reference was found and the relative path to that file.
Count: the number of times that element shows up in a single line.
InterpolationCount: count of other elements inserted into the element
Keywords: a dictionary of the keywords and count of each
Size: character count for each sql statement
LiteralCount: count of strings in this element
NonLiteralCount: sql components of the element not in a literal
Line: the line number where that element occurs
SessionId: Unique identifier for each session of the tool.
CellId: cell id where that element was in that FileId (if in a notebook, null otherwise)
ExecutionId: Unique identifier for each run of the tool.
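The Keywords column can be aggregated across statements to profile the SQL in the workload. This sketch assumes the column serializes as a Python-style dict literal, which is an assumption made for illustration; the rows are hypothetical:

```python
import ast
from collections import Counter

# Hypothetical rows in the shape of SqlStatementsInventory.csv; the
# Keywords format shown here (a dict literal) is an assumption.
stmts = [
    {"Element": "run_query", "Keywords": "{'SELECT': 1, 'JOIN': 2}"},
    {"Element": "load_table", "Keywords": "{'SELECT': 1, 'WHERE': 1}"},
]

# Merge the per-statement keyword counts into workload-wide totals.
totals = Counter()
for s in stmts:
    totals.update(ast.literal_eval(s["Keywords"]))
```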
The SQLElementsInventory.csv has a count of the SQL syntax elements present in SQL Spark elements.
Element: Name for the code element where the SQL was found (e.g., SqlFromClause, SqlSelect, SqlSelectBody, SqlSignedNumericLiteral).
ProjectId: Name of the project (root directory the tool was run on).
FileId: File where the SQL reference was found and the relative path to that file.
Count: The number of times that element shows up in a single line.
NotebookCellId: The notebook cell ID.
Line: The line number where that element occurs.
Column: The column number where that element occurs.
SessionId: Unique identifier for each session of the tool.
ExecutionId: Unique identifier for each run of the tool.
SqlFlavor: The SQL flavor being used (e.g., Spark SQL, Hive SQL).
RootFullName: The fully qualified name of the root element in the code.
RootLine: The line number where the root element is located.
RootColumn: The column number where the root element is located.
TopLevelFullName: The fully qualified name of the top-level SQL statement or code block.
TopLevelLine: The line number where the top-level statement is located.
TopLevelColumn: The column number where the top-level statement is located.
ConversionStatus: The status of the SQL conversion (e.g., Success, Failed).
Category: The category of the SQL element (e.g., DDL, DML, DQL, DCL, TCL).
EWI: The EWI (Error Warning Information) code associated with the SQL element.
ObjectReference: The reference name of the object involved in the SQL (e.g., table, view).
The SqlEmbeddedUsageInventory.csv lists the embedded SQL usages found in the codebase.
Element: Name for the code element where the SQL was found (e.g., SqlFromClause, SqlSelect, SqlSelectBody, SqlSignedNumericLiteral).
ProjectId: Name of the project (root directory the tool was run on).
FileId: File where the SQL reference was found and the relative path to that file.
Count: The number of times that element shows up in a single line.
ExecutionId: Unique identifier for each run of the tool.
LibraryName: Name of the library being used.
HasLiteral: Indicates whether the element contains literals.
HasVariable: Indicates whether the element contains variables.
HasFunction: Indicates whether the element contains functions.
ParsingStatus: Indicates the parsing status (e.g., Success, Failed, Partial).
HasInterpolation: Indicates whether the element contains interpolations.
CellId: The notebook cell ID.
Line: The line number where that element occurs.
Column: The column number where that element occurs.
The ThirdPartyUsagesInventory.csv lists the third-party API references found in the codebase.
Element: is the unique name for the third party reference.
ProjectId: name of the project (root directory the tool was run on)
FileId: file where the spark reference was found and the relative path to that file.
Count: the number of times that element shows up in a single line.
Alias: the alias of the element (if any).
Kind: categorization of the element such as variable, type, function, or class.
Line: the line number in the source files where the element was found.
PackageName: package name for the element (concatenation of ProjectId and FileId in Python).
Statement: the code where the element was used. [NOTE: This column is not sent via telemetry.]
SessionId: Unique identifier for each session of the tool.
CellId: cell id where that element was in that FileId (if in a notebook, null otherwise)
ExecutionId: Unique identifier for each execution of the tool.
The packagesInventory.csv lists each package found in the codebase.
Element: is the name of the package.
ProjectId: name of the project (root directory the tool was run on)
FileId: file where package was found and the relative path to that file.
Count: the number of times that element shows up in a single line.
The tool_execution.csv has some basic information about this run of the SMA tool.
ExecutionId: Unique identifier for each run of the tool.
ToolName: the name of the tool. Values: PythonSnowConvert (Python tool) or SparkSnowConvert (Scala tool).
Tool_Version: the version number of the tool.
AssemblyName: the name of the code processor (essentially, a longer version of the ToolName)
LogFile: whether a log file was sent on an exception/failure
FinalResult: where the tool stopped if there was an exception/failure
ExceptionReport: if an exception report was sent on an exception/failure
StartTime: The timestamp for when the tool started executing.
EndTime: The timestamp for when the tool stopped executing.
SystemName: The serial number of the machine where the tool was executing (this is only used for troubleshooting and license validation purposes).
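The StartTime and EndTime columns together give the run duration. A minimal sketch over a hypothetical row; the timestamp format used here is an assumption for illustration:

```python
from datetime import datetime

# Hypothetical row in the shape of tool_execution.csv; the timestamp
# format below is an assumption, not the documented SMA format.
run = {"StartTime": "2024-05-01 10:00:00", "EndTime": "2024-05-01 10:03:30"}

fmt = "%Y-%m-%d %H:%M:%S"
duration = (datetime.strptime(run["EndTime"], fmt)
            - datetime.strptime(run["StartTime"], fmt))
print(duration.total_seconds())  # -> 210.0
```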
The Issues.csv lists every conversion issue found in the codebase. A description, the exact location of the issue in the file, and a code associated with that issue are reported in this document. You can find out more about each issue in the section of this documentation.
The SparkUsagesInventory.csv shows the exact location and usage for each reference to the Spark API. This information is used to build the .