Dataset

org.apache.spark.sql.Dataset[T] => com.snowflake.snowpark.DataFrame

This section describes the mappings from org.apache.spark.sql.Dataset[T] to com.snowflake.snowpark.DataFrame. These methods are mapped to the DataFrame class because Snowpark does not have a separate Dataset class.
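As a quick illustration of the mapping, a typical Dataset transformation chain carries over to a Snowpark DataFrame with the same method names. The snippet below is a sketch, not a runnable test: the table name (`orders`), column names (`price`, `qty`), and the connection profile file are hypothetical, and executing either half requires a live Spark session or Snowflake connection.

```scala
// Spark side: org.apache.spark.sql.Dataset[Row]
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col => sparkCol}

val spark = SparkSession.builder().getOrCreate()
val ds = spark.table("orders")                                // hypothetical table
  .filter(sparkCol("price") > 10)                             // Dataset.filter(condition: Column)
  .withColumn("total", sparkCol("price") * sparkCol("qty"))   // Dataset.withColumn(colName, col)

// Snowpark side: com.snowflake.snowpark.DataFrame — same chain, same method names
import com.snowflake.snowpark.Session
import com.snowflake.snowpark.functions.{col => snowCol}

val session = Session.builder.configFile("profile.properties").create  // hypothetical profile
val df = session.table("orders")
  .filter(snowCol("price") > 10)                              // DataFrame.filter(condition: Column)
  .withColumn("total", snowCol("price") * snowCol("qty"))     // DataFrame.withColumn(colName, col)
```

Note that no `repartition` step appears on the Snowpark side: Snowflake manages data distribution itself, which is why the repartition rows below are marked N/A.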

| Spark | Snowpark | Notes |
| --- | --- | --- |
| cache() | | Cache is an alias for persist. |
| dropDuplicates() | | |
| dropDuplicates(colNames: Seq[String]) | | |
| dropDuplicates(colNames: Array[String]) | | |
| dropDuplicates(col1: String, cols: Seq[String]) | | |
| dropDuplicates(col1: String, cols: String*) | | |
| filter(condition: Column) | | Mapped to method in com.snowflake.snowpark.DataFrame |
| orderBy(sortCol: String, sortCols: Seq[String]) | | * |
| orderBy(sortCol: String, sortCols: String*) | | * |
| persist(newLevel: StorageLevel) | | |
| repartition(partitionExprs: Column*) | N/A | Repartition is a Spark concept that is not needed in Snowpark. |
| repartition(numPartitions: Int, partitionExprs: Column*) | N/A | Repartition is a Spark concept that is not needed in Snowpark. |
| repartition(numPartitions: Int) | N/A | Repartition is a Spark concept that is not needed in Snowpark. |
| repartitionByRange(partitionExprs: Column*): DataFrame | N/A | Repartition by range is a Spark concept that is not needed in Snowpark. |
| repartitionByRange(numPartitions: Int, partitionExprs: Column*): DataFrame | N/A | Repartition by range is a Spark concept that is not needed in Snowpark. |
| transform(t: Dataset[T] => Dataset[U]) | | * |
| unionByName(other: Dataset[T]) | unionByName(other: DataFrame): DataFrame | Pending: functional comparison |
| unionByName(other: Dataset[T], allowMissingColumns: Boolean) | | ** |
| withColumn(colName: String, col: Column) | | |
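For the rows above that have direct Snowpark counterparts, the call sites read the same as in Spark. The following sketch assumes hypothetical tables (`customers_2023`, `customers_2024`), a hypothetical `id` column, and a hypothetical connection profile file; it requires a live Snowflake session to run.

```scala
import com.snowflake.snowpark.{DataFrame, Session}
import com.snowflake.snowpark.functions.lit

val session = Session.builder.configFile("profile.properties").create  // hypothetical profile

val a: DataFrame = session.table("customers_2023")
val b: DataFrame = session.table("customers_2024")

// dropDuplicates(colNames: String*): keep one row per id
val deduped = a.dropDuplicates("id")

// unionByName(other: DataFrame): matches columns by name, not by position
val merged = deduped.unionByName(b.dropDuplicates("id"))

// withColumn(colName: String, col: Column): add a constant column
val tagged = merged.withColumn("source", lit("merged"))
```

Because `unionByName` matches columns by name, the two inputs may list their columns in different orders, but both must contain the same set of column names.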
