SPRKPY1029
pyspark.sql.readwriter.DataFrameReader.parquet
Message: This issue appears when the tool detects the usage of pyspark.sql.readwriter.DataFrameReader.parquet. This function is supported, but some of the differences between Snowpark and the Spark API might require making some manual changes.
Category: Warning
This issue appears when the SMA detects a use of the pyspark.sql.readwriter.DataFrameReader.parquet function. This function is supported by Snowpark; however, there are some differences that require manual changes.
Input
Below is an example of a use of the pyspark.sql.readwriter.DataFrameReader.parquet function that generates this EWI.
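For instance, a PySpark snippet along these lines would trigger the EWI (here spark is a SparkSession, and the paths and option values are placeholders):

```python
file_paths = [
    "path/to/your/file1.parquet",
    "path/to/your/file2.parquet",
    "path/to/your/file3.parquet",
]

# Several Spark-specific options are passed directly as parameters of parquet().
df = spark.read.parquet(
    *file_paths,
    mergeSchema="true",
    pathGlobFilter="*file*",
    recursiveFileLookup="true",
    modifiedBefore="2024-12-31T00:00:00",
    modifiedAfter="2023-12-31T00:00:00",
)
```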
Output
The SMA adds the EWI SPRKPY1029 to the output code to let you know that this function is supported by Snowpark, but it requires some manual adjustments. Please note that the options supported by Snowpark are transformed into option function calls, and those that are not supported are removed. This is explained in more detail in the next sections.
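As a sketch (the exact EWI comment text and generated code can vary by SMA version), the converted output might look like this, with pathGlobFilter mapped to the Snowpark PATTERN option and the unsupported options dropped; session is the Snowpark session:

```python
file_paths = [
    "path/to/your/file1.parquet",
    "path/to/your/file2.parquet",
    "path/to/your/file3.parquet",
]

# EWI: SPRKPY1029 => pyspark.sql.readwriter.DataFrameReader.parquet has a workaround, see documentation for more info
# Note: the paths still point to local files; moving them to a stage is covered
# in the recommended fix below.
df = session.read.option("PATTERN", "*file*").parquet(
    *file_paths
)
```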
Recommended fix
In this section, we explain how to configure the paths and options parameters to make them work in Snowpark.
1. paths parameter

In Spark, this parameter can be a local or cloud location. Snowpark only accepts cloud locations through a Snowflake stage, so you can create a temporary stage and add each file to it using the file:// prefix.
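For example (a sketch; the stage name and local path are placeholders):

```python
# Create a temporary stage and upload the local file to it using the file:// prefix.
stage = "my_temp_stage"
session.sql(f"CREATE TEMPORARY STAGE IF NOT EXISTS {stage}").collect()
session.file.put("file:///path/to/your/file1.parquet", f"@{stage}", auto_compress=False)

# The stage location is then used as the paths argument.
df = session.read.parquet(f"@{stage}/")
```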
2. options parameter

Snowpark does not allow defining the different options as parameters of the parquet function. As a workaround, you can use the option or options functions to specify those parameters as extra options of the DataFrameReader.
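For instance (a sketch with a placeholder stage), the same setting can be passed either through option or options:

```python
# Chain reader options instead of passing them to parquet().
df = session.read.option("PATTERN", ".*file.*").parquet("@my_stage/")

# Or provide several options at once as a dictionary.
df = session.read.options({"PATTERN": ".*file.*"}).parquet("@my_stage/")
```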
Please note that the Snowpark options are not exactly the same as the PySpark options, so some manual changes might be needed. Below is a more detailed explanation of how to configure the most common PySpark options in Snowpark.
2.1 mergeSchema option

Parquet supports schema evolution, allowing users to start with a simple schema and gradually add more columns as needed. This can result in multiple parquet files with different but compatible schemas. In Snowflake, thanks to the automatic schema inference capabilities, you don't need to do that, and therefore the mergeSchema option can simply be removed.
2.2 pathGlobFilter option
If you want to load only a subset of files from the stage, you can use the pattern option to specify a regular expression that matches the files you want to load. The SMA already automates this, as you can see in the output of this scenario.
2.3 recursiveFileLookup option
This option is not supported by Snowpark. The best recommendation is to use a regular expression, as with the pathGlobFilter option, to achieve something similar.
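For example (a sketch with a placeholder stage), a PATTERN regular expression already matches files in nested folders, because it is applied to the whole relative path of each staged file:

```python
# Matches any .parquet file at any depth under the stage.
df = session.read.option("PATTERN", ".*\\.parquet").parquet("@my_stage/")
```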
2.4 modifiedBefore / modifiedAfter option
You can achieve the same result in Snowflake by using the metadata columns of the staged files, as shown in the full example below.
Below is the full example of how the input code should be transformed in order to make it work in Snowpark:
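Below, a sketch of that transformation, assuming the Snowpark metadata columns and DataFrameReader.with_metadata are available in your Snowpark version; the stage name, paths, and timestamps are placeholders:

```python
from snowflake.snowpark.column import METADATA_FILENAME, METADATA_FILE_LAST_MODIFIED

# Create a temporary stage and upload the local parquet files to it.
stage = "my_temp_stage"
session.sql(f"CREATE TEMPORARY STAGE IF NOT EXISTS {stage}").collect()
for local_path in [
    "path/to/your/file1.parquet",
    "path/to/your/file2.parquet",
    "path/to/your/file3.parquet",
]:
    session.file.put(f"file://{local_path}", f"@{stage}", auto_compress=False)

# pathGlobFilter -> PATTERN option; modifiedBefore/modifiedAfter -> filters on
# the file-last-modified metadata column exposed through with_metadata.
df = (
    session.read
    .option("PATTERN", ".*file.*")
    .with_metadata(METADATA_FILENAME, METADATA_FILE_LAST_MODIFIED)
    .parquet(f"@{stage}")
    .where(METADATA_FILE_LAST_MODIFIED < "2024-12-31T00:00:00")
    .where(METADATA_FILE_LAST_MODIFIED > "2023-12-31T00:00:00")
)
```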
When doing a migration, it is a good practice to leverage the SMA reports to build an inventory of files and to determine to which stages or tables the data will be mapped after modernization.

In Snowflake, you can leverage other approaches for parquet data ingestion, such as:
- Leveraging COPY INTO for bulk parquet ingestion. Consider also Snowpipe for automated, continuous loading.
- Parquet external tables, which can be pointed directly to cloud file locations.
- Using Iceberg tables.
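As a minimal sketch of the first approach (table and stage names are placeholders):

```python
# Bulk-load parquet files from a stage into an existing table.
# MATCH_BY_COLUMN_NAME maps parquet columns to table columns by name.
session.sql("""
    COPY INTO my_table
    FROM @my_stage
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""").collect()
```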
For more support, you can email us at sma-support@snowflake.com or post an issue in the SMA.