SPRKPY1068

pyspark.sql.DataFrame.toPandas

Message: toPandas contains columns of type ArrayType that is not supported and has a workaround.

Category: Warning

Description

pyspark.sql.DataFrame.toPandas doesn't work properly if the DataFrame contains columns of type ArrayType. The workaround for these cases is to convert those columns into Python objects by using the json.loads method (a JSON array is parsed into a Python list).

Scenario

Input

toPandas returns the data of the original DataFrame as a pandas DataFrame.

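from datetime import date, datetime
from pyspark.sql import Row

# Assumes an existing SparkSession named spark
sparkDF = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0))
])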

pandasDF = sparkDF.toPandas()

Output

The tool adds this EWI to let you know that toPandas is not supported if there are columns of type ArrayType, but that a workaround exists.

sparkDF = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0))
])
#EWI: SPRKPY1068 => toPandas doesn't work properly If there are columns of type ArrayType. The workaround for these cases is converting those columns into a Python Dictionary by using json.loads method. example: df[colName] = json.loads(df[colName]).
pandasDF = sparkDF.toPandas()

Recommended fix
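
The workaround named in the EWI message can be sketched as follows. This is a minimal illustration, not the tool's own output: it assumes that ArrayType columns arrive in the pandas DataFrame as JSON-encoded strings, and the helper name parse_array_cell and the column name "tags" are hypothetical. Because json.loads takes a single string, the sketch applies it cell by cell rather than to the whole column at once.

```python
import json


def parse_array_cell(value):
    """Parse one cell that may hold a JSON-encoded array (hypothetical helper)."""
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            # Not valid JSON: leave the original string untouched
            return value
    # None, NaN, or a value that is already a Python object
    return value


# Hypothetical usage on the result of toPandas(); "tags" is an assumed column name:
#   pandasDF["tags"] = pandasDF["tags"].apply(parse_array_cell)

# Stand-alone check with values shaped like a JSON-encoded array column
raw_cells = ['[1, 2, 3]', '["a", "b"]', None]
parsed_cells = [parse_array_cell(cell) for cell in raw_cells]
```

Apply the helper only to the columns that were ArrayType in the original DataFrame, since json.loads would also convert plain numeric strings such as "123" into numbers.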

Additional recommendations
