SPRKPY1068

pyspark.sql.DataFrame.toPandas

Message: toPandas contains columns of type ArrayType that is not supported and has a workaround.

Category: Warning

Description

pyspark.sql.DataFrame.toPandas doesn't work properly if the DataFrame has columns of type ArrayType. The workaround is to convert those columns into native Python objects with the json.loads method after calling toPandas (a JSON array becomes a Python list).
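
As a minimal sketch of what the workaround does, assume the array values arrive as JSON-formatted strings and nulls arrive as None. The sample values below are made up for illustration and are not taken from a real query result.

```python
import json

# Hypothetical values for one ArrayType column after toPandas:
# each array is a JSON-formatted string, nulls are None.
raw_values = ['[1, 2, 3]', '["a", "b"]', None]

# json.loads parses one string at a time, so apply it value by value
# and leave nulls untouched.
parsed_values = [json.loads(v) if v is not None else v for v in raw_values]

print(parsed_values)
```

Because json.loads accepts a single string, it has to be applied to each value of a column rather than to the column as a whole.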

Scenario

Input

toPandas returns the data of the original DataFrame as a pandas DataFrame.

from datetime import date, datetime
from pyspark.sql import Row

sparkDF = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0))
])

pandasDF = sparkDF.toPandas()

Output

The tool adds this EWI to let you know that toPandas is not supported if there are columns of type ArrayType, but that a workaround exists.

sparkDF = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0))
])
#EWI: SPRKPY1068 => toPandas doesn't work properly If there are columns of type ArrayType. The workaround for these cases is converting those columns into a Python Dictionary by using json.loads method. example: df[colName] = json.loads(df[colName]).
pandasDF = sparkDF.toPandas()

Recommended fix

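import json
from snowflake.snowpark.types import ArrayType

pandas_df = sparkDF.toPandas()

# Check every field of the original DataFrame's schema (the pandas DataFrame
# returned by toPandas has no schema attribute). Each column of type ArrayType
# is reassigned after parsing its values with json.loads; null values are
# left untouched.
# Assumption: sparkDF is the migrated Snowpark DataFrame, so ArrayType is
# imported from snowflake.snowpark.types.
for field in sparkDF.schema.fields:
    if isinstance(field.datatype, ArrayType):
        pandas_df[field.name] = pandas_df[field.name].apply(lambda x: json.loads(x) if x is not None else x)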
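
The same conversion can be wrapped in a small helper when the names of the array columns are already known. The following is only a sketch: parse_json_columns is a hypothetical name, not part of pandas or of the migration tool, and the sample DataFrame stands in for the result of toPandas.

```python
import json

import pandas as pd


def parse_json_columns(pandas_df, column_names):
    """Return a copy of pandas_df with the named columns parsed from JSON strings.

    Hypothetical helper for illustration only.
    """
    result = pandas_df.copy()
    for name in column_names:
        # Parse only string values; nulls and already-parsed values pass through.
        result[name] = result[name].apply(
            lambda x: json.loads(x) if isinstance(x, str) else x
        )
    return result


# Stand-in for the result of toPandas(): "tags" holds JSON-formatted strings.
sample = pd.DataFrame({"id": [1, 2, 3], "tags": ['["x", "y"]', "[]", None]})
parsed = parse_json_columns(sample, ["tags"])
print(parsed["tags"].tolist())
```

Checking isinstance(x, str) instead of only testing for None makes the helper safe to call twice on the same column, since values that are already lists are returned unchanged.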

Additional recommendations