SPRKPY1075

The expected result might be different if the schema doesn't match.

Description

The parse_json function does not apply schema validation. If you need to filter or validate based on a schema, you might need to introduce additional logic.
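One way to introduce that logic is a small predicate that checks each parsed document against the expected field types and then drives a filter. The sketch below is plain Python using only the standard library. `EXPECTED_TYPES` and `matches_schema` are hypothetical names chosen for illustration; they are not part of Snowpark or of the migration tool. Wiring something like this into a UDF is left out and would need to be adapted to your environment.

```python
import json

# Hypothetical expected schema: field name -> Python type (an assumption for this sketch).
EXPECTED_TYPES = {"name": str, "age": int, "city": str}

def matches_schema(json_str, expected=EXPECTED_TYPES):
    """Return True only if json_str is a JSON object in which every
    expected field is present and has the expected type."""
    try:
        doc = json.loads(json_str)
    except (TypeError, ValueError):
        return False  # not a string, or not valid JSON
    if not isinstance(doc, dict):
        return False
    for field, expected_type in expected.items():
        value = doc.get(field)
        # bool is a subclass of int in Python, so reject it explicitly
        # unless the schema really asks for a bool.
        if isinstance(value, bool) and expected_type is not bool:
            return False
        if not isinstance(value, expected_type):
            return False
    return True

rows = [
    '{"name": "John", "age": 30, "city": "New York"}',
    '{"name": "Jane", "age": "25", "city": "San Francisco"}',
]

# Keep only the rows that satisfy the schema.
valid_rows = [r for r in rows if matches_schema(r)]
```

Here the second row is filtered out because its age is a string rather than an integer. Whether to drop such rows, null out the field, or raise an error is a design choice that depends on how the original from_json result was used.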

Example

Input

df.select(from_json(df.value, Schema))
df.select(from_json(schema=Schema, col=df.value))
df.select(from_json(df.value, Schema, option))

Output

#EWI: SPRKPY1075 => The parse_json does not apply schema validation, if you need to filter/validate based on schema you might need to introduce some logic.
df.select(parse_json(df.value))
#EWI: SPRKPY1075 => The parse_json does not apply schema validation, if you need to filter/validate based on schema you might need to introduce some logic.
df.select(parse_json(df.value))
#EWI: SPRKPY1075 => The parse_json does not apply schema validation, if you need to filter/validate based on schema you might need to introduce some logic.
df.select(parse_json(df.value))

In from_json, the schema is not passed for inference; it is used for validation. See these examples:

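from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

data = [
    ('{"name": "John", "age": 30, "city": "New York"}',),
    ('{"name": "Jane", "age": "25", "city": "San Francisco"}',)
]

# Assumes an existing SparkSession named spark
df = spark.createDataFrame(data, ["json_str"])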

Example 1: Enforce Data Types and Change Column Names:

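# Schema assumed from the output below: age is declared as an integer
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])

# Parse JSON column with schema
parsed_df = df.withColumn("parsed_json", from_json(col("json_str"), schema))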

parsed_df.show(truncate=False)

# +------------------------------------------------------+---------------------------+
# |json_str                                              |parsed_json                |
# +------------------------------------------------------+---------------------------+
# |{"name": "John", "age": 30, "city": "New York"}       |{John, 30, New York}       |
# |{"name": "Jane", "age": "25", "city": "San Francisco"}|{Jane, null, San Francisco}|
# +------------------------------------------------------+---------------------------+
# Notice that values outside of the schema are dropped and fields that do not match the schema are returned as null

Example 2: Select Specific Columns:

# Define a schema with only the columns we want to use
partial_schema = StructType([
    StructField("name", StringType(), True),
    StructField("city", StringType(), True)
])

# Parse JSON column with partial schema
partial_df = df.withColumn("parsed_json", from_json(col("json_str"), partial_schema))

partial_df.show(truncate=False)

# +------------------------------------------------------+---------------------+
# |json_str                                              |parsed_json          |
# +------------------------------------------------------+---------------------+
# |{"name": "John", "age": 30, "city": "New York"}       |{John, New York}     |
# |{"name": "Jane", "age": "25", "city": "San Francisco"}|{Jane, San Francisco}|
# +------------------------------------------------------+---------------------+
# Fields that are not in the schema (here, age) are also filtered out automatically
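The two examples show what the schema does in from_json: fields missing from the schema are dropped, and a value whose type does not match is returned as null. Because parse_json keeps every field with its original type, similar logic may need to be added after migration. The following is a minimal plain-Python sketch of that projection, assuming a simplified schema expressed as a dict of field name to Python type. `apply_schema`, `full_fields`, and `partial_fields` are hypothetical names for illustration, not Snowpark or PySpark functions, and the sketch does not try to reproduce every from_json rule.

```python
import json

def apply_schema(json_str, schema):
    """Keep only the fields named in schema; use None where a value
    is missing or does not have the expected type."""
    try:
        doc = json.loads(json_str)
    except (TypeError, ValueError):
        return None  # input could not be parsed as JSON
    if not isinstance(doc, dict):
        return None
    result = {}
    for field, expected_type in schema.items():
        value = doc.get(field)
        # bool is a subclass of int, so do not accept it for non-bool fields.
        ok = isinstance(value, expected_type) and not (
            isinstance(value, bool) and expected_type is not bool
        )
        result[field] = value if ok else None
    return result

# Simplified stand-ins for the two schemas used in the examples (assumed shapes).
full_fields = {"name": str, "age": int, "city": str}
partial_fields = {"name": str, "city": str}

jane = '{"name": "Jane", "age": "25", "city": "San Francisco"}'

print(apply_schema(jane, full_fields))
# {'name': 'Jane', 'age': None, 'city': 'San Francisco'}
print(apply_schema(jane, partial_fields))
# {'name': 'Jane', 'city': 'San Francisco'}
```

The first call mirrors Example 1 (the mistyped age becomes null) and the second mirrors Example 2 (age is left out entirely).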

Recommendations

For more support, you can email us at [email protected]. If you have a contract for support with Snowflake, reach out to your sales engineer and they can direct your support needs.
Useful tools: PEP-8 and Reindent.

