top of page
Search
Writer's pictureYuchen Liu

JSON - The Best Pal of Python

Updated: May 15, 2020


It's been a while since the last time I visited my favorite restaurants in New York. Hmm...I really miss them. I'm always wondering how many restaurants are there in New York, but I didn't find a good source. I did find a dataset that contains partial information about the restaurants in Baltimore in JSON (JavaScript Object Notation). Since Baltimore is one of my favorite cities and JSON is my favorite format, let's create a topic of using Python to manipulate JSON!


Not every dataset comes ready for analysis, the purpose of this blog is to give you an idea of how to deal with those missing information and use the analytical goal as a guide to manipulate the dataset.


Dataset:

Restaurant.json (This data set was updated in April, 2015 which may contain incomplete and incorrect information.)


At the first glance of the dataset, I found there's no zip code associate with the address, and the geographic information is not provided. If we want to visualize the restaurant in Baltimore and find the most popular place and opportunity for a new business, we need several steps to complete the information.


Let's discuss a new format - JSON.


What is JSON?


JavaScript Object Notation (JSON) is an open standard file format for sending data by HTTP request between browsers and other applications. I like to visualize the structure of a format to get a better idea of how to deal with it in the future. Here is an example:

{
 "name": "Yuchen",
 "location" : ["US", "Earth"],
 "pet" : "Shiro",
 "sibling": null,
 "food" : [{"fruits":["cherry","banana","apple"]},
         {"vegetables": ["Celery","cucumber"]},
         {"meat" : ["beef","duck","pork"]}],
 "mood" : ["happy","calm"]
}

JSON is more like a combination of the data structures of Python, which makes it flexible and easy to extract information and transform to python objects.


🤪Check out my blog for the Python Data Structures Cheatsheet!


Let's go back to our dataset. This JSON file contains two parts - 'metadata' and 'actual data'. The metadata is configuring the fields and categories for abstraction. The file configures the field categories and lists available, as well as what type of values are permitted for each field. Read more about metadata here.


For the actual data which is a collection of restaurant info, I recommend to use a 'beautifier' to transform the original codes to a readable structure.


Now let‘s get into to the topic - how we use Python to extract information from JSON.


Step 1 - Make Some Necessary Imports and Check Your Current Working Directory

import json
import os
import pandas as pd

cwd = os.getcwd()
files = os.listdir(cwd)
cwd

Step 2 - Load JSON

db = json.load(open('Desktop/test.json'))
len(db)

*Step 3 - Give Yourself Some Example of How the Data Looks Like

If you are an expert, you can skip this step, but I like to give myself some examples before coding. Namely, I tried to slice the first row if the data to see the types and content first. In this example, I tried to get the complete address of a restaurant, so I applied the 13th element of the first chunk of data in the JSON file.

# Explore the data type
db_address = db[0][13][0]
print(db_address)
print (type(db_address))

# Transform string to dictionary
import ast 
db_address_res = ast.literal_eval(db_address)
print(db_address_res)
print (type(db_address_res))

In the output, I noticed the type of 'db_address' is 'string', so I added one more step to transform the type to a dictionary by using 'ast.literal_eval' function. Now we have an idea of how the data look like and we've done dealing with the type.


Step 5 - Append All the Address in JSON to a Data Frame

In this step, we're using a loop to get the address in each data chunk in JSON and append the result to a new Dataframe. In this example, I used the first element (db[0], db[1], db[2]...) of each restaurant as the index.

rec = 0
db_address_res_1 = []
restaurant_address_df = pd.DataFrame()

for rec in range(len(db)):
    db_address_res_1= ast.literal_eval(db[rec][13][0])
    
    restaurant_address_df_1 = pd.DataFrame(db_address_res_1,index = [db[rec][0]])
    
    restaurant_address_df = restaurant_address_df.append(restaurant_address_df_1)

Step 6 - Combine the Columns 'address', 'city' and 'state' to a New Column 'completeAddress'

Here, I added a new column 'completeAddress' to store the address of the restaurant.

restaurant_address_df['completeAddress'] = restaurant_address_df[restaurant_address_df.columns[0:3]].apply(lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)
restaurant_address_df['completeAddress'][1]

*Step 7 - Example of the Geo Location

You may skip this step, but I like to give myself an example of the output of a new function. In the output, we can find the detailed address with zip code and the geographic coordinates (latitude, longitude).


🤪Yuchen's tip: you can use your email address as 'user_agent' in the 'Nominatim' function.

from geopy.geocoders import Nominatim
locator = Nominatim(user_agent= 'your email address')

for i in range(16):
    location = locator.geocode(restaurant_address_df['completeAddress'][i])
location

Out:

Location(Biddle Street, Maryland Manor, Rosedale, Baltimore County, Maryland, 21237, United States of America, (39.3097142, -76.5252203, 0.0))


Step 8 - Find the Geographical Information for all Restaurants

Given the example, we can apply the 'Nominatim' function to the entire DataFrame by a loop function. The reason that I used one more step to identify 'Na' is that I noticed not all the addresses can be found via GeoPy. I haven't figured out this problem, so I just filled them with 'Na' temporarily.

from geopy.geocoders import Nominatim
locator = Nominatim(user_agent= 'your email address')

temp_geo = pd.DataFrame() 
geo_2 = pd.DataFrame()

for i in range(len(db)):
    
    location = locator.geocode(restaurant_address_df['completeAddress'][i])
    if not location:
        Latitude = 'Na'
        Longitude = 'Na'
        data = {'Latitude':['Na'],'Langitute':['Na'] }
        temp_geo = pd.DataFrame(data,index =[db[i][0]])
        geo_2 = geo_2.append(temp_geo)
    else:
        Latitude = location.latitude
        Longitude = location.longitude
        data = {'Latitude':[Latitude],'Langitute':[Longitude] }
        temp_geo = pd.DataFrame(data,index =[db[i][0]])
        geo_2 = geo_2.append(temp_geo)

Step 9 - See the Final Result

new_restaurant_df = restaurant_address_df.join(geo_2)
new_restaurant_df

This is not the end of exploration, our goal is to utilize the prepared data to find insights. In this case, we may visualize the locations of the restaurants on a map to see the distribution, which can be done via Python or Tableau. The business case can be looking for the best locations to open new restaurants. We can also investigate why some restaurants cannot be found via GeoPy. There's much more to do with the data. I'll keep updating this blog to show more valuable insights.


View the codes at my GitHub Repository.


😊 Happy analyzing!

51 views0 comments

Recent Posts

See All

SQL vs. Python, Which Is Your Favorite?

Last week, I had a dinner break with my colleagues at Le District, New York. I asked one of them, "do you like coding?" His answer was,...

Stay Home and Detect Fake News With Python

Stay home is not a bad thing, watching the news every day reminds me of a project during my graduate study at NYU, which was an...

Comments


bottom of page