SPRKSCL1145

org.apache.spark.sql.functions.sumDistinct

Message: org.apache.spark.sql.functions.sumDistinct has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the org.apache.spark.sql.functions.sumDistinct function, which has a workaround.

Scenario

Input

Below is an example of the org.apache.spark.sql.functions.sumDistinct function, first used with a column name as an argument and then with a column object.

val df = Seq(
  ("Alice", 10),
  ("Bob", 15),
  ("Alice", 10),
  ("Alice", 20),
  ("Bob", 15)
).toDF("name", "value")

val result1 = df.groupBy("name").agg(sumDistinct("value"))
val result2 = df.groupBy("name").agg(sumDistinct(col("value")))

Output

The SMA adds the EWI SPRKSCL1145 to the output code to let you know that this function is not fully supported by Snowpark, but it has a workaround.

val df = Seq(
  ("Alice", 10),
  ("Bob", 15),
  ("Alice", 10),
  ("Alice", 20),
  ("Bob", 15)
).toDF("name", "value")

/*EWI: SPRKSCL1145 => org.apache.spark.sql.functions.sumDistinct has a workaround, see documentation for more info*/
val result1 = df.groupBy("name").agg(sumDistinct("value"))
/*EWI: SPRKSCL1145 => org.apache.spark.sql.functions.sumDistinct has a workaround, see documentation for more info*/
val result2 = df.groupBy("name").agg(sumDistinct(col("value")))

Recommended fix

As a workaround, you can use the sum_distinct function. For the Spark overload that receives a string argument, you additionally have to convert the string into a column object using the com.snowflake.snowpark.functions.col function.

val df = Seq(
  ("Alice", 10),
  ("Bob", 15),
  ("Alice", 10),
  ("Alice", 20),
  ("Bob", 15)
).toDF("name", "value")

val result1 = df.groupBy("name").agg(sum_distinct(col("value")))
val result2 = df.groupBy("name").agg(sum_distinct(col("value")))

Additional recommendations

Last updated