Effective Python: requests, BeautifulSoup, and web scraping
Requests
The requests library does for web interaction what pathlib.Path does for files.
import requests

r = requests.get('https://www.predictit.org/api/marketdata/all/')
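In practice you usually want to check the response before trusting it. Here is a small defensive sketch, assuming the endpoint returns JSON; the `fetch_json` helper name is my own, not part of requests. `raise_for_status()` turns 4xx/5xx responses into exceptions, and `timeout` stops a hung request.

```python
import requests

def fetch_json(url, timeout=10):
    """Fetch url and return parsed JSON, or None on any request failure."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()          # 4xx/5xx -> HTTPError
        return r.json()
    except requests.RequestException:  # base class for all requests errors
        return None
```

Returning `None` on failure keeps the caller simple; a production scraper might log the error or retry instead.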
BeautifulSoup
Install:
pip install bs4
pip install lxml
The manual is excellent and just one longish page. Well worth reading. BeautifulSoup can be confusing because there is more than one way to do most things. But it is consistent, and once you get the hang of its syntax, you'll find it easy to use.
from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.mynl.com/blog')
soup = BeautifulSoup(r.content, features='lxml')
soup('a')
soup.prettify()
- `soup.findAll('a')` or `soup.find_all('a')` are both the same as `soup('a')`
- `soup.find('a')` is the first element in `soup('a')`
- `soup(['li', 'h1'])` finds all `li` or `h1` tags
- `soup('p', class_='mb-2')` and `soup('li', class_='nav-item nav-masthead')` filter by class
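These equivalences can be checked on a tiny hand-written page. A sketch using an invented HTML snippet and the stdlib-backed `'html.parser'`, so it runs without lxml:

```python
from bs4 import BeautifulSoup

html = """
<h1>Title</h1>
<p class="mb-2">First</p>
<ul><li class="nav-item">One</li><li>Two</li></ul>
<a href="/blog">Blog</a>
"""
soup = BeautifulSoup(html, 'html.parser')

same = soup('a') == soup.find_all('a')        # calling the soup is find_all
first = soup.find('li') is soup('li')[0]      # find returns the first match
names = [t.name for t in soup(['li', 'h1'])]  # mixed tag query, document order
by_class = soup('p', class_='mb-2')[0].text
print(same, first, names, by_class)
```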
All links
for i in soup('a'):
    print(i.text, i['href'])
In Pandas:

import pandas as pd

df = pd.DataFrame([[i.text, i['href']] for i in soup('a')], columns=['text', 'target'])
df.head()
Straining Soup
print(len(soup('meta')))
# 9

for i in soup(['style', 'meta', 'script', 'link']):
    # see what you are deleting
    print(i.prettify()[:75])
    # remove from the soup
    i.decompose()

print(len(soup('meta')))
# 0
Run it twice! Nothing happens the second time.
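That is because `decompose()` removes the tag from the tree entirely, so a second pass finds nothing left to delete. A minimal runnable check on an invented snippet, using the built-in `'html.parser'`:

```python
from bs4 import BeautifulSoup

html = "<head><meta charset='utf-8'><style>p {}</style></head><body><p>Hi</p></body>"
soup = BeautifulSoup(html, 'html.parser')

before = len(soup(['meta', 'style']))
for tag in soup(['meta', 'style']):
    tag.decompose()                  # removed from the tree for good
after = len(soup(['meta', 'style']))
print(before, after)  # 2 0
```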
Getting sneaky
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
           "(KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
r = requests.get('https://www.mynl.com/blog', headers=headers)
Legal Note
It is up to you to ensure your use of third-party web services comports with their terms of service. Not all sites want to be scraped!
When you scrape, it is a good idea to:
- Not scrape thoughtlessly
- Cache content, so you only download it once
- Pause! Use `time.sleep()` or similar to space out your requests.
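The last two points can be combined into one helper. This is a sketch under my own assumptions: the `cache` directory and `polite_get` name are invented, and a real scraper might also honor robots.txt and cache-control headers.

```python
import hashlib
import time
from pathlib import Path

import requests

CACHE = Path('cache')  # hypothetical on-disk cache directory

def polite_get(url, pause=1.0):
    """Return the body of url, downloading it at most once."""
    CACHE.mkdir(exist_ok=True)
    fn = CACHE / (hashlib.md5(url.encode()).hexdigest() + '.html')
    if fn.exists():
        return fn.read_text(encoding='utf-8')  # cached: no request, no pause
    r = requests.get(url, timeout=10)
    time.sleep(pause)                          # space out real requests
    fn.write_text(r.text, encoding='utf-8')
    return r.text
```

Hashing the URL gives a safe, deterministic filename; the pause only happens when a real request goes out.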