Web Scraping Using an API

Setup
In this blog post, we'll walk through the process of building a data scraper for Zomato, one of the leading food delivery platforms. We'll break the process into three main steps: fetching paginated order data using Zomato's API, extracting detailed order information for each item, and transforming the results into CSV files.
We will use Python along with its requests package, which can automate web requests for scraping, testing, and web automation. The post assumes basic knowledge of Python and of the concept of a web API.
Prerequisites: How to copy headers for a web request.
Every request made by the web browser is captured in its network console. Simply right-click a request and choose the copy-headers option to get the headers used in the code below. Note: you will need to log in to the Zomato platform and copy the headers from an authenticated request so that the scripts below have valid auth details.
Check out this link for more information on how to get headers for requests.
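If you would rather not type the headers dictionary out by hand, a small helper like the sketch below can turn the raw header text copied from the network console into a Python dict. This is only an illustrative helper (the function name headers_from_text and the sample values are made up for this post, not part of the original scripts); it assumes the copied text has one "Name: value" pair per line.
# Code snippet (illustrative) for turning copied header text into a dict
def headers_from_text(raw):
    headers = {}
    for line in raw.strip().splitlines():
        if ':' not in line:
            continue  # skip the request line and blank lines
        name, _, value = line.partition(':')
        headers[name.strip()] = value.strip()
    return headers

# Example usage: paste the copied request headers between the triple quotes
raw_headers = """
accept: */*
user-agent: Mozilla/5.0
cookie: <your session cookie here>
"""
headers = headers_from_text(raw_headers)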
Python libraries used: requests, pandas, json, dateutil, re, pprint, time
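Of these, json, re, pprint and time ship with the Python standard library; the external packages can be installed with pip install requests pandas python-dateutil (python-dateutil is the PyPI name for dateutil).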
Step 1: Fetching Paginated Order Data
The first step in our data scraping journey involves fetching paginated order data from Zomato's API. We authenticate our requests with the headers (including the session cookie) copied from the browser, as described in the prerequisites. Then we iterate through the paginated order history, fetching data page by page until we reach the last page, and store the fetched data in a JSON file for further processing.
# Code snippet for fetching paginated order data
import requests
import json
if __name__ == "__main__":
    # headers copied from the browser's network console (fill in the values,
    # including the cookie of a logged-in session)
    headers = {'accept': '', 'accept-language': '', 'sec-ch-ua': '',
               'sec-ch-ua-mobile': '', 'sec-ch-ua-platform': '', 'sec-fetch-dest': '',
               'sec-fetch-mode': '', 'sec-fetch-site': '', 'x-zomato-csrft': '',
               'Referer': '', 'Referrer-Policy': '', 'User-Agent': '',
               'cookie': ''}
    dataset = []
    total_pages = None
    current_page = 1
    while True:
        # stop once every page has been fetched
        if total_pages and current_page > total_pages:
            break
        r = requests.get(url='https://www.zomato.com/webroutes/user/orders?page={0}'.format(current_page),
                         headers=headers)
        current_page += 1
        data = r.json()
        print(data)
        # the total number of pages is reported with every response
        total_pages = data['sections']['SECTION_USER_ORDER_HISTORY']['totalPages']
        dataset.append(data)
    filename = r'C:\Users\user1\data.json'
    with open(filename, 'w') as f:
        json.dump(dataset, f)
Step 2: Extracting Detailed Order Information
- Reading Paginated Data: The script reads the JSON file (data.json) containing the paginated order data obtained from Step 1. It loads the JSON content into memory for further processing.
- Iterating Through Orders: It iterates through each page of order data and extracts the unique identifiers (hash IDs) for each order item. These hash IDs are essential for fetching detailed information about each item.
- Fetching Detailed Information: For each hash ID, the script sends a GET request to another Zomato API endpoint (https://www.zomato.com/webroutes/order/details) to fetch detailed information about the order item.
- Data Storage: The fetched detailed information, along with the corresponding hash ID, is stored in a list called dataset. This list contains dictionaries, with each dictionary representing an order item and its details.
- Rate Limiting: To avoid overloading the server, a time delay of 0.5 seconds (time.sleep(0.5)) is introduced between each request.
- Final Data Storage: Once all detailed information is fetched, the entire dataset is stored in another JSON file (data_per_item.json) for further analysis.
# Code snippet for extracting detailed order information
import json
import requests
import time
if __name__ == "__main__":
    # headers copied from the browser's network console (fill in the values,
    # including the cookie of a logged-in session)
    headers = {'accept': '', 'accept-language': '', 'sec-ch-ua': '',
               'sec-ch-ua-mobile': '', 'sec-ch-ua-platform': '', 'sec-fetch-dest': '',
               'sec-fetch-mode': '', 'sec-fetch-site': '', 'x-zomato-csrft': '',
               'Referer': '', 'Referrer-Policy': '', 'User-Agent': '',
               'cookie': ''}
    # collect the hashId of every order from the paginated data of step 1
    hashlist = []
    order_file = r'C:\Users\user1\data.json'
    with open(order_file, 'r') as f:
        order_json = json.load(f)
    for _page in order_json:
        for k, v in _page['entities']['ORDER'].items():
            hashlist.append(v['hashId'])
    # fetch the detail payload for each order
    dataset = []
    for _item in hashlist:
        r = requests.get(url='https://www.zomato.com/webroutes/order/details?hashId={0}'.format(_item),
                         headers=headers)
        dataset.append({"hashid": _item, "data": r.json()})
        time.sleep(0.5)  # be gentle with the server
    filename = r'C:\Users\user1\data_per_item.json'
    with open(filename, 'w') as f:
        json.dump(dataset, f)
Step 3: Transforming JSON Files into CSV Files
The code contains two functions: one to parse the order data at a higher level and another to parse the items within each order.
The steps below explain the code for generating order-level data from the orders file.
Function process_order_data(datafile, outfile, per_item_file)
- Loading Per-Item Data: This function takes three arguments: datafile (path to the JSON file containing order data), outfile (path for the resulting CSV file), and per_item_file (path to the JSON file containing detailed order information per item). It loads the detailed item data from per_item_file into a dictionary called itemdict, where keys are hashid and values are dictionaries containing various cost-related information for each order.
- Parsing Order Data: It iterates through each page of order data in datafile. For each order, it extracts relevant information such as order_date, cost, payment_status, delivery_address, dish_string, restaurant_name, hashid, and cost-related information (taxes, delivery charge, packing charge, discounts) from itemdict.
- Constructing Dataset: For each order, it constructs a row containing the extracted information. These rows are appended to the dataset list.
- Creating DataFrame and Saving as CSV: After parsing all orders, it creates a pandas DataFrame from the dataset, specifying column names (datacols). Finally, it saves the DataFrame as a CSV file using df.to_csv().
# Code snippet for processing order-level data
import json
import pprint
import pandas
import re
from dateutil import parser
def process_order_data(datafile, outfile, per_item_file):
    datacols = ['order_date', 'cost', 'payment_status', 'delivery_address', 'dish_string',
                'restaurant_name', 'hashid', 'taxes', 'delivery_charge', 'packing_charge', 'discounts']
    dataset = []
    # build a lookup of cost-related details per order, keyed by hashid
    item_json = json.load(open(per_item_file, 'r'))
    itemdict = {}
    for _rec in item_json:
        charge_dict = _rec['data']['details']['order']['items']
        taxes = sum([x['totalCost'] for x in charge_dict.get('tax', [])])
        delivery_charge = list(filter(lambda x: x['itemName'] == 'Delivery Charge', charge_dict.get('charge', [])))
        delivery_charge = delivery_charge[0]['totalCost'] if delivery_charge else 0
        packing_charge = list(filter(lambda x: x['itemName'] == 'Restaurant Packaging Charges', charge_dict.get('charge', [])))
        packing_charge = packing_charge[0]['totalCost'] if packing_charge else 0
        if charge_dict.get('voucher_discount'):
            discounts = sum([x['totalCost'] for x in charge_dict.get('voucher_discount')])
        else:
            discounts = 0
        costdict = {
            'taxes': taxes,
            'delivery_charge': delivery_charge,
            'packing_charge': packing_charge,
            'discounts': discounts,
        }
        itemdict[_rec['hashid']] = costdict
    # strip currency symbols, keeping digits, commas and decimal points
    currency_clean = re.compile(r'[^\d.,]+')
    jsondata = json.load(open(datafile, 'r'))
    for _page in jsondata:
        for k, v in _page['entities']['ORDER'].items():
            hashid = v['hashId']
            row = [parser.parse(v['orderDate']), currency_clean.sub('', v['totalCost']), v['paymentStatus'],
                   v['deliveryDetails']['deliveryAddress'],
                   v['dishString'], v['resInfo']['name'], hashid, itemdict[hashid]['taxes'],
                   itemdict[hashid]['delivery_charge'], itemdict[hashid]['packing_charge'],
                   itemdict[hashid]['discounts']]
            dataset.append(row)
    df = pandas.DataFrame(data=dataset, columns=datacols)
    df.to_csv(outfile, index=False)
Let's look at how to generate item-level information for each order.
Function parse_item_data(itemfile, itemoutfile)
- Opening JSON File: This function takes two arguments: itemfile, the path to the JSON file containing detailed order information per item, and itemoutfile, the path where the resulting CSV file will be saved. It opens the JSON file using open() and loads its contents into memory using json.load().
- Parsing JSON Data: It iterates through each record in the JSON data. For each record, it extracts the hashid, which is a unique identifier for the order, and the item_data, which contains detailed information about each item ordered.
- Constructing Dataset: For each item in the item_data, it constructs a row containing the hashid, itemname, unitcost, totalcost, and quantity of the item. These rows are appended to the dataset list.
- Creating DataFrame and Saving as CSV: After parsing all records, it creates a pandas DataFrame from the dataset list, specifying column names (datacols). Finally, it saves the DataFrame as a CSV file using df.to_csv().
# Code snippet for parsing item-level order data
import json
import pprint
import pandas
import re
from dateutil import parser
def parse_item_data(itemfile, itemoutfile):
    with open(itemfile, 'r') as f:
        datafile = json.load(f)
    datacols = ['hashid', 'itemname', 'unitcost', 'totalcost', 'quantity']
    dataset = []
    for _item in datafile:
        hashid = _item['hashid']
        # 'dish' holds the individual dishes of the order; skip orders without it
        item_data = _item['data']['details']['order']['items'].get('dish')
        if not item_data:
            continue
        for _dish in item_data:
            row = [hashid, _dish['itemName'], _dish['unitCost'], _dish['totalCost'], _dish['quantity']]
            dataset.append(row)
    df = pandas.DataFrame(data=dataset, columns=datacols)
    df.to_csv(itemoutfile, index=False)
Finally, both functions are called to generate the final CSV files.
# Code snippet for calling transformation functions.
if __name__ == "__main__":
    itemdatafile = r'C:\Users\user1\data_per_item.json'
    itemoutfile = r'C:\Users\user1\data_per_item.csv'
    parse_item_data(itemdatafile, itemoutfile)
    datafile1 = r'C:\Users\user1\data.json'
    outfile1 = r'C:\Users\user1\data.csv'
    process_order_data(datafile1, outfile1, itemdatafile)
Conclusion
This post explored how a web API can be used to extract data from websites built on such technology. We also looked at some basic transformations to create a dataset that can be easily analysed with tools like Tableau, Power BI, or Excel. To see further analysis on this dataset, click here to check out my project.
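As a quick taste of that analysis, the sketch below (an illustrative example, not part of the original scripts) loads the data.csv produced in Step 3 with pandas and computes total spend per restaurant and the number of orders per month. It assumes the cost column becomes numeric once thousands separators are removed.
# Code snippet (illustrative) for a first look at the generated CSV
import pandas

df = pandas.read_csv(r'C:\Users\user1\data.csv', parse_dates=['order_date'])
# cost was scraped as text; drop thousands separators before converting to float
df['cost'] = df['cost'].astype(str).str.replace(',', '', regex=False).astype(float)

# total spend per restaurant, highest first
print(df.groupby('restaurant_name')['cost'].sum().sort_values(ascending=False).head(10))

# number of orders per month
print(df.groupby(df['order_date'].dt.to_period('M')).size())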