JSON in Databricks and PySpark
<p>In the simple case, JSON is easy to handle within Databricks. You can read a file of JSON objects directly into a DataFrame or table, and Databricks knows how to parse the JSON into individual fields. But, as with most things software-related, there are wrinkles and variations. This article shows how to handle the most common situations and includes detailed coding examples.</p>
<p>My use case was HL7 healthcare data that had been translated to JSON, but the methods here apply to any JSON data. The three formats considered are:</p>
<ul>
<li>A text file containing complete JSON objects, one per line. This is typical when you are loading JSON files into Databricks tables.</li>
<li>A text file containing various fields (columns) of data, one of which is a JSON object. This is often seen in computer logs, where some plain-text metadata is followed by more detail in a JSON string.</li>
<li>A variation of the above where the JSON field is an array of objects.</li>
</ul>
<p>Getting each of these input formats into Databricks requires a different technique.</p>
<p><a href="https://towardsdatascience.com/json-in-databricks-and-pyspark-26437352f0e9"><strong>Read More</strong></a></p>