Introduction
Welcome to Snowflake SnowConvert for PySpark (Apache Spark Python). Let us be your guide on the road to a successful migration.
SnowConvert is not a find-and-replace or regex-matching tool. SnowConvert is software that understands your source code (Python) by parsing it and building a semantic model of your code's behavior. For Spark, SnowConvert identifies the usages of the Spark API, inventories them, and finally converts them to their functional equivalents in Snowpark.
Here are a few terms/definitions, so you know what we mean when we start dropping them all over the documentation:
- SnowConvert Qualification Tool: The version of SnowConvert for Spark that runs in assessment mode. Ultimately, this is software that identifies, precisely and automatically, all Apache Spark Python usages in a codebase.
- File Inventory: An inventory of all the files present in the input directory of the tool. This includes every file type, not just code files. You will get a breakdown by file type that includes the source technology, code lines, comment lines, and size of the source files.
- Keyword Counts: A count of all present keywords broken out by technology. For example, if a .py file contains PySpark statements, each keyword in that file is tracked. You will get a count of how many of each keyword you have, by file type.
- Spark Reference Inventory: An inventory of every reference to the Spark API present in the Python code.
- Readiness Score: The Spark references form the basis for assessing the level of conversion that can be applied to a given codebase.
- Conversion Score: This score is calculated by dividing the number of Spark references that were converted automatically by the total number of Spark references found (see the short example below).
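As a quick illustration of that ratio (the counts below are hypothetical, not output from a real assessment):

```python
# Hypothetical assessment counts, for illustration only.
total_spark_references = 120    # all Spark API references found
converted_automatically = 108   # references SnowConvert converted on its own

# Conversion Score: converted references divided by total references found.
conversion_score = converted_automatically / total_spark_references
print(f"Conversion Score: {conversion_score:.0%}")  # Conversion Score: 90%
```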
This documentation will walk through both the identification and conversion capabilities of SnowConvert for PySpark. If you're ready to start, visit the Getting Started page in this documentation.
SnowConvert for PySpark converts references to the Spark API in Python code into references to the Snowpark API (version 1.3.0). Let's walk through an example to see how this works.
Here's a script that uses several PySpark functions:
```python
from datetime import date, datetime
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Row

spark_session = SparkSession.builder.getOrCreate()

df = spark_session.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])

# cube()
df.cube("c", df.a).count().orderBy("c", "a").show()

# take()
df.take(2)

# describe()
df.describe(['a']).show()

# explain()
df.explain()
df.explain("simple")  # Physical plan
df.explain(True)

# intersect()
df1 = spark_session.createDataFrame([("a", 1), ("a", 1), ("b", 3), ("c", 4)], ["C1", "C2"])
df2 = spark_session.createDataFrame([("a", 1), ("a", 1), ("b", 3)], ["C1", "C2"])
df1.intersect(df2).show()

# where()
df1.where(F.col('C2') > 1).show()
```
The Converted Snowflake Code:
```python
from datetime import date, datetime
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from snowflake.snowpark import Row

spark_session = Session.builder.create()

df = spark_session.create_dataframe([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])

# cube()
df.cube("c", df.a).count().sort("c", "a").show()

# take()
df.take(2)

# describe()
df.describe(['a']).show()

# explain()
df.explain()
df.explain("simple")  # Physical plan
df.explain(True)

# intersect()
df1 = spark_session.create_dataframe([("a", 1), ("a", 1), ("b", 3), ("c", 4)], ["C1", "C2"])
df2 = spark_session.create_dataframe([("a", 1), ("a", 1), ("b", 3)], ["C1", "C2"])
df1.intersect(df2).show()

# where()
df1.where(F.col('C2') > 1).show()
```
In this example, most of the structure of the Python code stays the same, but the references to the Spark API have been changed to their Snowpark equivalents: SparkSession.builder.getOrCreate() becomes Session.builder.create(), createDataFrame becomes create_dataframe, and orderBy becomes sort. To view the complete translation reference, please reach out to [email protected].
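One practical note before running the converted code: Session.builder.create() needs Snowflake connection parameters. Here is a minimal sketch of one way to supply them, using Session.builder.configs from the Snowpark API; the placeholder values are illustrative and are not part of SnowConvert's output:

```python
from snowflake.snowpark import Session

# Placeholder connection parameters -- replace with your account's values.
connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}

# Supplying the parameters lets the converted code's
# Session.builder.create() call open a real Snowflake session.
spark_session = Session.builder.configs(connection_parameters).create()
```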