Things to Take Note: Web Scraping

Isabella
4 min read · Feb 6, 2021


Not all sites provide a developer API for easy access to their data. Even if they do, chances are it may not contain all the information you need. Web scraping knowledge allows data analysts to supplement their datasets with data extracted from websites, further enriching the analysis. Data scientists can also use the additional data for feature engineering to improve the performance of their models.

This post will not cover in detail how to inspect the HTML to extract the information you need from a site. Rather, it aims to be a guide that helps you avoid certain pitfalls you may encounter when you start scraping at a larger scale.

Most Commonly Used Packages in Python:

  1. BeautifulSoup
    A parsing library that enables web scraping from HTML and XML documents. It provides functions that make extraction easy.
    Good enough for static scraping (see the first sketch after this list).
  2. Selenium
    Oftentimes, static scraping is insufficient because more information on a page only appears after clicking JavaScript links, such as “see more” in a long paragraph. Selenium is an additional package used on top of BeautifulSoup to automate web browser interaction via Python (read: automate button clicking); see the second sketch after this list.
  3. Scrapy
    I’ve not used this so far — you may read more here.
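For static pages, a simple requests call plus BeautifulSoup is usually all you need. Here is a minimal sketch; the URL and the tag/class names are placeholders for illustration, not from any particular site:

import requests
from bs4 import BeautifulSoup

# Hypothetical page; replace with the site you are scraping
res = requests.get('https://example.com/products')

# Parse the HTML returned by the request
soup = BeautifulSoup(res.text, 'lxml')

# Grab the text of every element with a (hypothetical) class of "price"
prices = [tag.get_text(strip=True) for tag in soup.find_all('span', class_='price')]
print(prices)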
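When the data only appears after clicking a JavaScript link, Selenium can drive the browser first and then hand the rendered HTML to BeautifulSoup. A minimal sketch, assuming you have a Chrome driver set up and the page has a hypothetical “see more” link:

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()  # requires a Chrome driver installed on your machine
driver.get('https://example.com/article')  # placeholder URL

# Click the (hypothetical) "see more" link to reveal the full paragraph
driver.find_element(By.LINK_TEXT, 'see more').click()
time.sleep(1)  # give the page a moment to render

# Hand the rendered page over to BeautifulSoup as usual
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()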

In general, a web scraping exercise can be divided into 4 components:

  1. Inspect the HTML (right-click on the data you want on the site and click “Inspect”)
  2. Use BeautifulSoup to extract the data on a single site (just to test it out first — this will take some time for beginners)
  3. Run a for loop over Step 2
  4. Save your data every batch of iterations (just in case)

My learnings on web scraping so far:

  1. Always run in batches and print your progress
    When I first began learning about web scraping, the need to print scraping progress was not apparent to me. After testing my script on a single site, I assumed that adding a for loop would complete the job. That is far from the truth; multiple issues can happen here:
    - Page errors
    - Some sites may block web scrapers once they detect them (to circumvent this, see point 3)
    - The code is simply not generalisable
    It’s important to save your results consistently so that even if the code breaks midway, you still have a copy of the results you have scraped so far.
  2. Put a timer so you know how long it takes to scrape X pages you need
    Given that you know how much data you need to scrape, you can estimate how much time is needed to complete the job. A simple function to use is time.time().
  3. Leave some time in between batches to prevent sites from blocking you
    The time.sleep() function allows you to pause in between batches so that the site does not block you.
  4. Use of tqdm
    I only recently found out about this amazing library, which displays the progress of a for loop with a progress bar. With it, we no longer need to manually code a count printer, and it gives an estimate of the time required to complete the task (a minimal skeleton follows this list).
An example of a progress bar from tqdm
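Points 1 to 4 boil down to a small loop skeleton like the one below. This is only a generic illustration (the pages list is a placeholder and the scraping step is left as a comment); the full Kiva example further down follows the same pattern.

import time
from tqdm import tqdm

pages = list(range(250))  # placeholder: your list of URLs or IDs goes here
results = []
start_time = time.time()

for i, page in enumerate(tqdm(pages), start=1):
    # ... scrape one page here (Step 2) and append the result ...
    results.append(page)
    time.sleep(0.01)  # pause between requests; use around 1 second for real sites
    if i % 100 == 0:  # every batch of 100: report progress (and save results so far to CSV)
        print(f'Scraped {i} pages in {time.time() - start_time:.0f}s')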

In the sample code below, the goal is to scrape Kiva for more details on the borrowers’ countries, such as the average annual income and the funds lent in each country. This information gives lenders more context about the borrowers and could increase their intention to fund a project.

Source: Kiva.org

These data are not provided in Kiva’s developer API, which is why scraping is necessary to obtain them. The data provided by Kiva can be used for NLP projects, given the descriptions of borrowers’ stories, and even for recommender systems that match borrowers with lenders to improve the success rate of funds matching.

Sample Code

import requests
from bs4 import BeautifulSoup
import re
import time
import pandas as pd
from tqdm import tqdm

# loans_sample is a DataFrame of Kiva loans with a LOAN_ID column
loans_list = list(loans_sample['LOAN_ID'].unique())
scraped_df = []
printcounter = 1
start_time = time.time()
accum = 0

for i in tqdm(range(len(loans_list))):
    loans_info = {}
    loan_id = loans_list[i]

    # Concatenate to get the new page URL
    page_url = 'https://www.kiva.org/lend/' + str(loan_id) + '?minimal=false'

    # Obtain request
    res = requests.get(page_url)

    # Turn into soup
    soup = BeautifulSoup(res.text, 'lxml')

    # Pull the fields out of the embedded script tag with regex
    raw_text = soup.find_all('script')
    text = str(raw_text[7])
    text_1 = text.split('var kv = ')
    loans_info['loan_id'] = loan_id
    loans_info['avgAnnualIncome'] = re.search('avgAnnualIncome":"(.+?)"', text_1[0]).group(1)
    loans_info['countryName'] = re.search('countryName":"(.+?)"', text_1[0]).group(1)
    loans_info['fundsLentInCountry'] = re.search('fundsLentInCountry":"(.+?)"', text_1[0]).group(1)
    loans_info['currencyFullName'] = re.search('currencyFullName":"(.+?)"', text_1[0]).group(1)
    loans_info['currencyCode'] = re.search('currencyCode":"(.+?)"', text_1[0]).group(1)
    loans_info['exchangeRate'] = re.search('exchangeRate":"(.+?)"', text_1[0]).group(1)
    loans_info['numLoansFundraising'] = re.search('numLoansFundraising":"(.+?)"', text_1[0]).group(1)
    scraped_df.append(loans_info)

    # Set counter on number of runs
    accum += 1

    if printcounter == 100:
        print(f'Iterations: {accum}, Duration: {time.time() - start_time}')

        # Save results so far to CSV every batch of 100
        df = pd.DataFrame.from_dict(scraped_df)
        df.to_csv('scraped_data_' + str(accum) + '.csv', index=False)

        printcounter = 1
        time.sleep(1)
        start_time = time.time()
    else:
        printcounter += 1
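The loop above assumes every page loads and every field is present. In practice (see point 1), some requests fail and some pages have a different structure, so a hedged variant guards the request and the regex lookups. The helper names below (get_page, safe_search) are my own additions for illustration, not part of the original script.

import re
import requests

def get_page(url):
    # Return the page HTML, or None if the request fails
    try:
        res = requests.get(url, timeout=10)
        res.raise_for_status()
        return res.text
    except requests.RequestException as err:
        print(f'Skipping {url}: {err}')
        return None

def safe_search(pattern, text):
    # Return the first captured group, or None if the field is missing
    match = re.search(pattern, text)
    return match.group(1) if match else None

With these, a failed request or a missing field becomes a None value (or a skipped loan) instead of an exception that kills the whole run.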
