SMA Execution Guide
The SMA-Checkpoints feature involves an extensive workflow, so this section provides a step-by-step walkthrough of how to use it.
The SMA-Checkpoints feature requires a PySpark workload as its entry point, since it depends on detecting the use of PySpark DataFrames. This walkthrough will guide you through the feature using a single Python script, providing a straightforward example of how checkpoints are generated and utilized within a typical PySpark workflow.
Input workload
sample.py file content
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SparkFunctionsExample2").getOrCreate()

# Two small DataFrames to be combined.
df1 = spark.createDataFrame([("Alice", "NY"), ("Bob", "LA")], ["name", "city"])
df2 = spark.createDataFrame([(10,), (20,)], ["number"])

# Add a synthetic index column to each DataFrame so they can be joined row by row.
df1_with_index = df1.withColumn("index", F.monotonically_increasing_id())
df2_with_index = df2.withColumn("index", F.monotonically_increasing_id())

# Join on the synthetic index and drop it from the result.
df3 = df1_with_index.join(df2_with_index, on="index").drop("index")
df3.show()
If the SMA-Checkpoints feature is enabled, a checkpoints.json file will be generated. If the feature is disabled, this file will not be created in either the input or output folders. Regardless of whether the feature is enabled, the following inventory files are always generated: DataFramesInventory.csv and CheckpointsInventory.csv. These files provide metadata essential for analysis and debugging.
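Because checkpoints.json is plain JSON, you can inspect it directly, for example to confirm which checkpoints were emitted and whether they are enabled. Below is a minimal sketch using only the Python standard library; the file path is an assumption, so point it at wherever SMA wrote the file:

import json
from pathlib import Path

# Assumption: adjust this path to the input or output folder SMA populated.
checkpoints_path = Path("output/checkpoints.json")

data = json.loads(checkpoints_path.read_text())
print("type:", data["type"])  # e.g. "Collection" or "Validation"

# List every checkpoint and whether it is enabled.
for pipeline in data["pipelines"]:
    for checkpoint in pipeline["checkpoints"]:
        print(pipeline["entryPoint"], checkpoint["name"], "enabled:", checkpoint["enabled"])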
To convert your own project, follow the SMA User Guide.
As part of the conversion process you can customize your conversion settings; see the SMA-Checkpoints feature settings.
Note: This walkthrough uses the default conversion settings.
Once the migration process is complete, the SMA-Checkpoints feature should have created two new inventory files and added a checkpoints.json file to both the input and output folders.
checkpoints.json file content (input folder)
{
  "createdBy": "Snowpark Migration Accelerator",
  "comment": "This file was automatically generated by the SMA tool as checkpoints collection was enabled in the tool settings. This file may also be modified or deleted during SMA execution.",
  "type": "Collection",
  "pipelines": [
    {
      "entryPoint": "sample.py",
      "checkpoints": [
        {
          "name": "sample$BBVOC7$df1$1",
          "file": "sample.py",
          "df": "df1",
          "location": 1,
          "enabled": true,
          "mode": 1,
          "sample": "1.0"
        },
        {
          "name": "sample$BBVOC7$df2$1",
          "file": "sample.py",
          "df": "df2",
          "location": 1,
          "enabled": true,
          "mode": 1,
          "sample": "1.0"
        },
        {
          "name": "sample$BBVOC7$df3$1",
          "file": "sample.py",
          "df": "df3",
          "location": 1,
          "enabled": true,
          "mode": 1,
          "sample": "1.0"
        }
      ]
    }
  ]
}
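This "Collection" file configures checkpoint collection on the original PySpark code: each entry appears to map a DataFrame (df) at a given occurrence (location) to a named checkpoint, with enabled, mode, and sample controlling whether, how, and with what sampling fraction data is collected. As a rough illustration, the sketch below shows what a hand-written collection call for the df1 checkpoint could look like, assuming the collect_dataframe_checkpoint helper from the snowpark-checkpoints-collectors package; the exact import path and signature may differ in your installed version, and in practice SMA wires these calls up for you.

from pyspark.sql import SparkSession
# Assumption: import path as documented for the snowpark-checkpoints-collectors package.
from snowflake.snowpark_checkpoints_collector import collect_dataframe_checkpoint

spark = SparkSession.builder.appName("SparkFunctionsExample2").getOrCreate()
df1 = spark.createDataFrame([("Alice", "NY"), ("Bob", "LA")], ["name", "city"])

# Collect a checkpoint for df1 under the name generated in checkpoints.json.
collect_dataframe_checkpoint(df1, "sample$BBVOC7$df1$1")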
checkpoints.json file content (output folder)
{
  "createdBy": "Snowpark Migration Accelerator",
  "comment": "This file was automatically generated by the SMA tool as checkpoints collection was enabled in the tool settings. This file may also be modified or deleted during SMA execution.",
  "type": "Validation",
  "pipelines": [
    {
      "entryPoint": "sample.py",
      "checkpoints": [
        {
          "name": "sample$BBVOC7$df1$1",
          "file": "sample.py",
          "df": "df1",
          "location": 1,
          "enabled": true,
          "mode": 1,
          "sample": "1.0"
        },
        {
          "name": "sample$BBVOC7$df2$1",
          "file": "sample.py",
          "df": "df2",
          "location": 1,
          "enabled": true,
          "mode": 1,
          "sample": "1.0"
        },
        {
          "name": "sample$BBVOC7$df3$1",
          "file": "sample.py",
          "df": "df3",
          "location": 1,
          "enabled": true,
          "mode": 1,
          "sample": "1.0"
        }
      ]
    }
  ]
}
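The "Validation" copy of the file is consumed on the Snowpark side, where each checkpoint collected from PySpark is replayed against the converted code. Below is a minimal sketch of what that validation step could look like, assuming the validate_dataframe_checkpoint helper and SnowparkJobContext from the snowpark-checkpoints package; treat the import paths and the SnowparkJobContext arguments as assumptions to verify against the version you have installed.

from pyspark.sql import SparkSession
from snowflake.snowpark import Session
# Assumption: import paths as documented for the snowpark-checkpoints package.
from snowflake.snowpark_checkpoints.job_context import SnowparkJobContext
from snowflake.snowpark_checkpoints.checkpoint import validate_dataframe_checkpoint

session = Session.builder.getOrCreate()     # Snowpark session
spark = SparkSession.builder.getOrCreate()  # PySpark session used for result comparison

# The job context ties the validation run to both sessions under a job name.
job_context = SnowparkJobContext(session, spark, "SparkFunctionsExample2", True)

# Converted Snowpark DataFrame corresponding to df1 in the original script.
df1 = session.create_dataframe([("Alice", "NY"), ("Bob", "LA")], schema=["name", "city"])

# Validate the converted DataFrame against the checkpoint collected from PySpark.
validate_dataframe_checkpoint(df1, "sample$BBVOC7$df1$1", job_context=job_context)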
Once the SMA execution flow is complete and both the input and output folders contain their respective checkpoints.json files, you are ready to begin the Snowpark-Checkpoints execution process. To dig deeper into the generated metadata, review the related inventories.