Dataset

org.apache.spark.sql.Dataset[T] => com.snowflake.snowpark.DataFrame

This section describes the mappings from org.apache.spark.sql.Dataset[T] to com.snowflake.snowpark.DataFrame. Because Snowpark does not provide a Dataset class, these Dataset methods are mapped to the DataFrame class instead.
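Because Snowpark exposes these operations directly on DataFrame, a typical Dataset pipeline usually ports over with the same method names. A minimal sketch of the pattern, assuming a live SparkSession `spark` on the Spark side and a live com.snowflake.snowpark.Session `session` on the Snowpark side (the `employees` table and column names are hypothetical):

```scala
// Spark: a Dataset[Row] pipeline (requires a live SparkSession `spark`)
import org.apache.spark.sql.functions.col

val sparkResult = spark.read.table("employees")   // hypothetical table
  .filter(col("salary") > 50000)
  .withColumn("bonus", col("salary") * 0.1)
  .dropDuplicates("employee_id")

// Snowpark: the same chain, now on com.snowflake.snowpark.DataFrame
// (requires a live com.snowflake.snowpark.Session `session`)
import com.snowflake.snowpark.functions.{col, lit}

val snowparkResult = session.table("employees")   // hypothetical table
  .filter(col("salary") > lit(50000))
  .withColumn("bonus", col("salary") * lit(0.1))
  .dropDuplicates("employee_id")
```

The chains are structurally identical; the migration work is largely confined to swapping the imports and the session entry point.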

| Spark | Snowpark | Notes |
| --- | --- | --- |
| cache() |  | Cache is an alias for persist. |
| dropDuplicates() |  |  |
| dropDuplicates(Seq colNames) |  |  |
| dropDuplicates(String[] colNames) |  |  |
| dropDuplicates(String col1, Seq cols) |  |  |
| dropDuplicates(String col1, String... cols) |  |  |
| filter(Column condition) |  | Mapped to method in com.snowflake.snowpark.DataFrame |
| orderBy(Column... sortExprs) |  |  |
| orderBy(Seq[Column] sortExprs) |  |  |
| orderBy(String sortCol, Seq[String] sortCols) | * |  |
| orderBy(String sortCol, String... sortCols) | * |  |
| persist() |  |  |
| persist(newLevel: StorageLevel) |  |  |
| repartition(partitionExprs: Column*) | N/A | Repartition is a Spark concept that is not needed in Snowpark |
| repartition(numPartitions: Int, partitionExprs: Column*) | N/A | Repartition is a Spark concept that is not needed in Snowpark |
| repartition(numPartitions: Int) | N/A | Repartition is a Spark concept that is not needed in Snowpark |
| repartitionByRange(cols: Column*): DataFrame | N/A | Repartition by range is a Spark concept that is not needed in Snowpark |
| repartitionByRange(numPartitions: Int, cols: Column*): DataFrame | N/A | Repartition by range is a Spark concept that is not needed in Snowpark |
| transform(scala.Function1&lt;Dataset,Dataset&gt; t) | * |  |
| unionByName(Dataset other) | unionByName(other: DataFrame): DataFrame | Pending: Functional comparison |
| unionByName(Dataset other, boolean allowMissingColumns) | ** |  |
| withColumn(String colName, Column col) |  |  |
| withColumnRenamed |  |  |
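The repartition and repartitionByRange rows above map to nothing on the Snowpark side: Snowflake's engine manages data distribution itself, so during migration those calls are simply deleted and the rest of the chain is kept. A hedged sketch of this removal, assuming an existing Spark Dataset `sparkDs` and Snowpark DataFrame `snowparkDf` (the variable and column names are hypothetical):

```scala
// Spark: explicit repartitioning before a wide operation
val dedupedSpark = sparkDs
  .repartition(200, col("customer_id"))
  .dropDuplicates("customer_id")

// Snowpark: no repartition equivalent exists or is needed; the call is
// dropped and the surrounding chain is left unchanged
val dedupedSnowpark = snowparkDf
  .dropDuplicates("customer_id")
```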
