Knowledge is the Only Good

Effective Python: requests, BeautifulSoup, and web scraping

programming
Python
effective python
Author

Stephen J. Mildenhall

Published

2022-02-20

Requests

The requests library does for web interaction what pathlib.Path does for files.

import requests

r = requests.get('https://www.predictit.org/api/marketdata/all/')
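In practice it is worth checking that a request succeeded before using the response. The following is a minimal sketch, assuming the endpoint returns JSON. The name fetch_json, its session parameter, and the 10-second timeout are illustrative choices and not part of the original example.

```python
import requests


def fetch_json(url, session=None, timeout=10):
    """GET url and return the decoded JSON body, raising on HTTP errors.

    session is a hypothetical hook: pass a requests.Session to reuse
    connections, or leave it as None to use the module-level requests.get.
    """
    http = session or requests
    r = http.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx responses instead of
    # silently carrying on with an error page.
    r.raise_for_status()
    return r.json()
```

Calling fetch_json('https://www.predictit.org/api/marketdata/all/') would then return the parsed payload, provided the service responds with JSON.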

BeautifulSoup

Install:

pip install bs4
pip install lxml

The manual is excellent and is just one longish page; it is well worth reading. BeautifulSoup can be confusing because there is more than one way to do most things. But it is consistent, and once you get the hang of its syntax you’ll find it easy to use.

from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.mynl.com/blog')
# features selects the parser; it is an argument to BeautifulSoup, not to requests.get
soup = BeautifulSoup(r.content, features='lxml')

soup('a')
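  • soup.prettify() returns the document as an indented string
  • soup.find_all('a') and the older spelling soup.findAll('a') are both the same as soup('a')
  • soup.find('a') is the first element in soup('a')
  • soup(['li', 'h1']) returns all li or h1 tags
  • soup('p', class_='mb-2') returns p tags that have mb-2 among their classes
  • soup('li', class_='nav-item nav-masthead') returns li tags whose class attribute is exactly that string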
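The idioms above can be tried without a network connection. The sketch below parses a small made-up HTML snippet. The markup is an assumption for illustration, and it uses Python's built-in html.parser in place of lxml so that it runs without extra installs.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on a blog navigation bar.
html = """
<html><body>
<h1>Blog</h1>
<ul>
  <li class="nav-item nav-masthead"><a href="/about">About</a></li>
  <li class="nav-item"><a href="/posts">Posts</a></li>
</ul>
<p class="mb-2 lead">First paragraph</p>
<p>Second paragraph</p>
</body></html>
"""
soup = BeautifulSoup(html, features="html.parser")

links = soup("a")                # same as soup.find_all("a")
first = soup.find("a")           # first match only
mixed = soup(["li", "h1"])       # any li or h1 tag, in document order
mb2 = soup("p", class_="mb-2")   # matches one class among several
exact = soup("li", class_="nav-item nav-masthead")  # exact class string

print(len(links), first["href"], [t.name for t in mixed])
print(mb2[0].text, len(exact))
```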

All links

for i in soup('a'):
    # .get returns None for anchors without an href instead of raising KeyError
    print(i.text, i.get('href'))
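Links on a page are often relative, so it can help to resolve them against the page address. This is a sketch under assumptions: the HTML snippet and the base address are made up for illustration, and html.parser stands in for lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base = "https://www.mynl.com/blog"  # address the page was fetched from
# Hypothetical markup: one relative link, one anchor with no href,
# and one absolute link.
html = (
    '<a href="/about">About</a>'
    '<a name="top">Top</a>'
    '<a href="https://example.com/x">Ext</a>'
)
soup = BeautifulSoup(html, features="html.parser")

links = []
# href=True keeps only anchors that actually have an href attribute.
for a in soup("a", href=True):
    # urljoin leaves absolute URLs alone and resolves relative ones.
    links.append((a.text, urljoin(base, a["href"])))

for text, target in links:
    print(text, target)
```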

In Pandas

import pandas as pd

df = pd.DataFrame([[i.text, i.get('href')] for i in soup('a')], columns=['text', 'target'])
df.head()

Straining Soup

print(len(soup('meta')))
# 9
for i in soup(['style', 'meta', 'script', 'link']):
    # see what you are deleting
    print(i.prettify()[:75])
    # remove from the soup
    i.decompose()
print(len(soup('meta')))
# 0

Run it twice! Nothing happens the second time, because decompose() removes the tags from the soup in place.
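The same behavior can be checked on a small made-up document. In this sketch the helper name strain and the HTML are illustrative assumptions, and html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical page with some non-content tags in the head.
html = """<html><head><meta charset="utf-8"><style>p {color: red}</style>
<script>console.log("hi")</script></head>
<body><p>Visible text</p></body></html>"""
soup = BeautifulSoup(html, features="html.parser")


def strain(soup, names=("style", "meta", "script", "link")):
    """Remove every tag named in names, in place; return how many were removed."""
    doomed = soup(list(names))
    for tag in doomed:
        tag.decompose()
    return len(doomed)


first = strain(soup)    # removes the meta, style and script tags
second = strain(soup)   # nothing left to remove
text = soup.get_text(strip=True)
print(first, second, text)
```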

Getting sneaky

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
# the second positional argument of requests.get is params, so 'lxml' does not belong here
r = requests.get('https://www.mynl.com/blog', headers=headers)
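Some sites respond differently to the default requests User-Agent, which is why a browser-like header is sent here. If you fetch several pages, one option is to set the header once on a requests.Session. This is a sketch rather than the approach used in the post, and no request is sent until session.get is called.

```python
import requests

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}

# A Session keeps its headers (and connections) across requests.
session = requests.Session()
session.headers.update(headers)

# Every request made through the session now carries the header, e.g.
# r = session.get('https://www.mynl.com/blog', timeout=10)
print(session.headers["User-Agent"])
```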

Legal Note

It is up to you to ensure your use of third-party web services comports with their terms of service. Not all sites want to be scraped!

When you scrape, it is a good idea to:

  • Avoid scraping thoughtlessly.
  • Cache content, so you only download it once.
  • Pause! Use time.sleep() or similar to space out your requests.
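Caching and pausing can be combined in one small helper. The sketch below is one possible arrangement, not a definitive implementation: the name cached_get, the scrape_cache directory, the one-second delay, and the fetch parameter (a hook for swapping in another downloader) are all assumptions made for illustration.

```python
import hashlib
import time
from pathlib import Path

import requests


def cached_get(url, cache_dir="scrape_cache", delay=1.0, fetch=None):
    """Return the body of url as bytes, downloading it at most once."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # One file per URL, named by a hash so any URL is a safe filename.
    path = cache / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if path.exists():
        return path.read_bytes()  # cache hit: no request, no pause

    if fetch is None:
        def fetch(u):
            r = requests.get(u, timeout=10)
            r.raise_for_status()
            return r.content

    content = fetch(url)
    path.write_bytes(content)
    time.sleep(delay)  # space out real downloads
    return content
```

A second call with the same URL reads the saved file, so only new pages cost a request and a pause.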

Stephen J. Mildenhall. License: CC BY-SA 2.0.

 
