SPRKPY1063

pyspark.sql.pandas.functions.pandas_udf has workaround.

Description

This issue appears when the tool detects the usage of pyspark.sql.pandas.functions.pandas_udf which has a workaround.

Input code:

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def modify_df(pdf):
    return pd.DataFrame({'result': pdf['col1'] + pdf['col2'] + 1})
df = spark.createDataFrame([(1, 2), (3, 4), (1, 1)], ["col1", "col2"])
new_df = df.groupby().apply(modify_df)

Output code:

#EWI: SPRKPY1062 => pyspark.sql.pandas.functions.pandas_udf has a workaround, see documentation for more info
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def modify_df(pdf):
    return pd.DataFrame({'result': pdf['col1'] + pdf['col2'] + 1})
df = spark.createDataFrame([(1, 2), (3, 4), (1, 1)], ["col1", "col2"])
new_df = df.groupby().apply(modify_df)

Scenario

@pandas_udf(schema : str, PandasUDFType.GROUPED_MAP : int)

Action: Specify explicitly the parameters types as a new parameter 'input_types', and remove 'functionType' parameter if applies. Created function must be called inside a select statement.

Manual Adjustment:

@pandas_udf(
    return_type = schema, 
    input_types = [PandasDataFrameType([IntegerType(), IntegerType()])]
)
def modify_df(pdf):
    return pd.DataFrame({'result': pdf['col1'] + pdf['col2'] + 1})
df = spark.createDataFrame([(1, 2), (3, 4), (1, 1)], ["col1", "col2"])
new_df = df.groupby().apply(modify_df) # You must modify function call to be a select and not an apply

Recommendation

  • For more support, you can email us at snowconvert-info@snowflake.com. If you have a contract for support with Snowflake, reach out to your sales engineer and they can direct your support needs.

Last updated

#332: [SIT-1562] SQL Readiness

Change request updated