Python web scraping with pyppeteer and BeautifulSoup

Samson Zhang
Jun 6, 2022 · 4 min read

The Pulitzer Prize publishes a list of its board members on its website, going all the way back to 1916-17. Each board member is displayed with their title, organization, and a bio.



Can we turn this into structured data -- a CSV or JSON file, or a spreadsheet -- to use for analysis and to display in our own way?

With some really great Chromium-based Python tools, yes we can!

First, we'll import some packages:

from pyppeteer import launch
from bs4 import BeautifulSoup
import pandas as pd
import asyncio

Now let's write an async function (pyppeteer's API is asynchronous, so we need async/await) and use pyppeteer (a Python port of the JavaScript browser-automation library Puppeteer) to get the HTML content of the page:

async def get_html():
    browser = await launch()  # launch a headless Chromium instance
    page = await browser.newPage()
    await page.goto("https://www.pulitzer.org/board/2022")
    # wait for the client-side-rendered board members to actually appear
    await page.waitForSelector("div.board-member", visible=True)
    html = await page.content()
    await browser.close()
    return html

We use `waitForSelector` because the Pulitzer website uses client-side AngularJS to render the content. In other words, on first load the raw HTML of the page doesn't contain the board member information: it only loads in after a few moments. This is also why we need a full browser API instead of a simple GET request.
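You can see this for yourself with a plain GET request (a quick sanity check, assuming you have the requests library installed):

import requests
from bs4 import BeautifulSoup

# the server returns the pre-rendered shell, but the AngularJS content
# hasn't run yet, so the board-member divs aren't in the HTML
html = requests.get("https://www.pulitzer.org/board/2022").text
soup = BeautifulSoup(html, "html.parser")
print(len(soup.select("div.board-member")))  # expect 0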

Once we have the HTML stored as a string, we can use BeautifulSoup to pull data out of it and put it into a pandas DataFrame:

html = asyncio.get_event_loop().run_until_complete(get_html())
soup = BeautifulSoup(html, "html.parser")

member_divs = soup.select("div.board-member")
# define columns up front so we can append rows with .loc on an empty DataFrame
members = pd.DataFrame(columns=["name", "title", "organization", "description", "join_year"])

for member_div in member_divs:
    name = member_div.select_one("span.board-title").text
    title = member_div.select_one("span[ng-bind-html='::member.field_job_title.und[0].safe_value']").text
    organization = member_div.select_one("span[ng-bind-html='::member.field_employer.und[0].safe_value']").text
    description = member_div.select_one("div[ng-bind-html='::member.body.und[0].safe_value | to_trusted']").text

    # using some string logic to extract one more piece of structured data
    search_string = "joined the Pulitzer Prize Board in "
    join_year = None
    if search_string in description:
        search_string_index = description.index(search_string)
        year_index = search_string_index + len(search_string)
        join_year = description[year_index:year_index + 4]

    members.loc[len(members.index)] = {
        "name": name,
        "title": title,
        "organization": organization,
        "description": description,
        "join_year": join_year
    }

members.to_csv("2020.csv")
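(As an aside, the same join-year extraction can also be done with a regular expression; here's an equivalent sketch:)

import re

match = re.search(r"joined the Pulitzer Prize Board in (\d{4})", description)
join_year = match.group(1) if match else None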

The key technique is BeautifulSoup's `select` and `select_one` functions, which take versatile CSS selectors and return Tag objects (a list for `select`, a single element or None for `select_one`) whose text and attributes you can read, or which you can run further selectors on.
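For example, on a toy snippet (not the real Pulitzer markup):

snippet = BeautifulSoup(
    '<div class="board-member"><span class="board-title">Jane Doe</span></div>',
    "html.parser"
)
div = snippet.select_one("div.board-member")  # a single Tag (or None if nothing matches)
spans = div.select("span.board-title")        # a list of Tags
print(spans[0].text)  # Jane Doe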

Knowing what selectors to use is a matter of going into the Chrome (or whatever browser you use) inspector and seeing what class names, IDs, etc. are available for you to latch on to.



And just like that, we have a CSV file that we can open up in Excel, or import back into code to use for our own purposes.
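For instance, to pull it back into pandas and run a quick tally (a sketch using the columns defined above):

import pandas as pd

members = pd.read_csv("2020.csv")
# e.g. count how many current board members joined in each year
print(members["join_year"].value_counts().sort_index())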



More complicated scraping behavior can easily be written. For example, the Livingston Awards don't display the bios of all judges on one page, but rather have a separate page for each judge that can be navigated to.



Here the scraping happens in two steps: first getting all the judge links from the main page, then looping through the links and loading each one with pyppeteer to get the needed information.

async def get_livingston():
    browser = await launch()
    page = await browser.newPage()

    members = pd.DataFrame(columns=["name", "title", "bio", "board"])

    await page.goto("https://wallacehouse.umich.edu/livingston-awards/judges/", waitUntil="load")
    html = await page.content()

    soup = BeautifulSoup(html, "html.parser")

    judges = soup.select("div.row.judges div.row.director")

    links = []

    for judge in judges:
        link = judge.select_one("p.name a.link")["href"]
        links.append(link)

    print(links)

    for link in links:
        await page.goto(link)
        html = await page.content()
        soup = BeautifulSoup(html, "html.parser")

        name = soup.select_one("h2.name").text

        # p.title holds the job title and board name separated by a line break;
        # normalize <br> vs <br/> so the split catches both forms
        title_raw = str(soup.select_one("p.title"))
        title_split = title_raw.replace("<br>", "<br/>").split("<br/>")
        title = BeautifulSoup(title_split[0], "html.parser").text
        board = BeautifulSoup(title_split[1], "html.parser").text

        # strip the elements we don't want, then take the remaining text as the bio
        bio_raw = soup.select_one("div.content")
        for el in bio_raw.select("p.title, p.bio"):
            el.decompose()
        bio = bio_raw.text

        members.loc[len(members.index)] = {
            "name": name,
            "title": title,
            "board": board,
            "bio": bio,
        }

    await browser.close()

    return members
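Running it works the same way as before (the output filename here is just illustrative):

members = asyncio.get_event_loop().run_until_complete(get_livingston())
members.to_csv("livingston_judges.csv")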

The real code I used has all sorts of other custom formatting and data structure tidbits, but the basic web scraping code is as presented here, and can be easily adapted to other scraping projects. If you want to see the full award-scraping code as I used it, check out the GitHub repo.

