Python Web Scraping – requests & BeautifulSoup Tutorial

Setup

Bash

pip install requests beautifulsoup4 lxml

Basic Page Fetch

Python

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")

print(soup.title.string)
for link in soup.find_all("a", href=True):
    print(link["href"], link.get_text(strip=True))

CSS Selectors

Python

headings = soup.select("h2.article-title")
first_para = soup.select_one("div.content > p")

images = soup.find_all("img", attrs={"class": "thumbnail"})
for img in images:
    print(img["src"], img.get("alt", ""))

Handling Pagination

Python

import time

def scrape_all_pages(base_url, max_pages=10):
    results = []
    for page in range(1, max_pages + 1):
        resp = requests.get(base_url, params={"page": page}, timeout=10)
        if resp.status_code != 200:
            break
        soup = BeautifulSoup(resp.text, "lxml")
        items = soup.select(".item-card")
        if not items:
            break
        for item in items:
            results.append({
                "title": item.select_one("h3").get_text(strip=True),
                "price": item.select_one(".price").get_text(strip=True),
            })
        time.sleep(1)
    return results
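
The function above increments a page parameter. Some sites expose a "next" link instead, and the loop can follow that link until it disappears. The following is a minimal sketch under stated assumptions: the `li.next a` selector, the `scrape_following_next` name, and the in-memory `PAGES` are all hypothetical, and `fetch` is passed in as a parameter so the logic can be tried without touching the network.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_following_next(start_url, fetch, max_pages=50):
    """Collect item titles page by page, following the "next" link.

    fetch is any callable that takes a URL and returns HTML text.
    """
    titles = []
    seen = set()  # guards against a "next" link that loops back
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        for heading in soup.select(".item-card h3"):
            titles.append(heading.get_text(strip=True))
        next_link = soup.select_one("li.next a[href]")  # assumed selector
        # Resolve a relative href against the current page URL.
        url = urljoin(url, next_link["href"]) if next_link is not None else None
    return titles


# Two hypothetical pages held in memory so the sketch runs offline.
PAGES = {
    "https://example.com/list": (
        '<div class="item-card"><h3>A</h3></div>'
        '<li class="next"><a href="/list?page=2">Next</a></li>'
    ),
    "https://example.com/list?page=2": '<div class="item-card"><h3>B</h3></div>',
}

print(scrape_following_next("https://example.com/list", PAGES.__getitem__))
```

In real use, `fetch` could wrap `requests.get(url, timeout=10)`, call `raise_for_status()`, sleep for a second, and return `resp.text`.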

Saving Results

Python

import csv
import json

data = [{"name": "Alice", "score": 95}]

with open("results.json", "w") as f:
    json.dump(data, f, indent=2)

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "score"])
    writer.writeheader()
    writer.writerows(data)
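
Scraped records are rarely as tidy as the sample data above: a field may be missing from one record, or an unexpected key may appear in another. One possible way to keep the CSV writer from failing is to use the `restval` and `extrasaction` options of `csv.DictWriter`. The records and the `write_records` helper below are hypothetical, and the output goes to an in-memory buffer so the sketch leaves no file behind.

```python
import csv
import io

# Hypothetical scraped records: the second lacks a price, the third has an extra key.
records = [
    {"title": "Widget", "price": "9.99"},
    {"title": "Gadget"},
    {"title": "Gizmo", "price": "4.50", "badge": "new"},
]


def write_records(f, rows, fieldnames=("title", "price")):
    """Write rows as CSV, filling missing fields and dropping unknown ones."""
    writer = csv.DictWriter(
        f,
        fieldnames=fieldnames,
        restval="",             # value written when a row lacks a field
        extrasaction="ignore",  # skip keys that are not in fieldnames
    )
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()  # in-memory stand-in for a real file
write_records(buffer, records)
print(buffer.getvalue())
```

With a real file, open it with `newline=""` as in the snippet above, and consider passing `encoding="utf-8"` so non-ASCII titles are written consistently.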

Note: Always check a site's robots.txt and Terms of Service before scraping. Add delays between requests to avoid overloading servers.

Web Scraping: Extracting Data From HTML — Responsibly

Scraping means fetching a web page and pulling structured data out of its HTML. The typical stack is requests to download the page and BeautifulSoup to parse it. Scraping also carries legal and ethical duties that most tutorials skip.

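Python

import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder; use the page you want to scrape
resp = requests.get(url, headers={"User-Agent": "my-bot"}, timeout=10)
resp.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(resp.text, "html.parser")

heading = soup.find("h1")  # first <h1>, or None if the page has none
print(heading.get_text(strip=True) if heading else "no <h1> found")
print([a["href"] for a in soup.select("a.product[href]")])  # CSS selector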

Method	Finds
`.find(tag)`	first matching element (None if there is none)
`.find_all(tag)`	all matching elements, as a list
`.select("css")`	all elements matching a CSS selector

Scrape responsibly. Before you start, check the site's robots.txt and Terms of Service, since some prohibit scraping. Identify your bot with a User-Agent header and rate-limit your requests (for example, a time.sleep() between them) so you don't hammer the server; aggressive scraping can get your IP banned or cause real harm. Prefer an official API if one exists: it is more stable and sanctioned by the site. Scrape only public data, respect copyright, and never collect personal data without a lawful basis.

Technical caveat: requests and BeautifulSoup only see the initial HTML. Pages that build their content with JavaScript need a browser-automation tool such as Selenium or Playwright.
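
The robots.txt check can be automated with the standard library's `urllib.robotparser`. The sketch below is illustrative only: the robots.txt content is hypothetical and inlined so the example runs offline, and the `allowed` helper is a name introduced here, not part of any library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "my-bot"  # same illustrative name as in the snippet above

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """Return True if robots.txt permits USER_AGENT to fetch url."""
    return parser.can_fetch(USER_AGENT, url)


print(allowed("https://example.com/products"))
print(allowed("https://example.com/private/data"))
print(parser.crawl_delay(USER_AGENT))  # suggested delay in seconds, if declared
```

Against a live site, you would instead call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` before asking `can_fetch`. Note that robots.txt covers crawling rules only; the Terms of Service still need a separate read.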

🏋️ Practical Exercise

Scrape a web page responsibly:

Fetch a page with requests and check the status code.
Parse the HTML with BeautifulSoup and extract all the links.
Use a CSS selector to grab specific elements (e.g. article titles).
Save the extracted data to a CSV file.

🔥 Challenge Exercise

Build a scraper for a paginated listing site (e.g. quotes or books from a scraping-practice site): fetch each page, extract structured records with CSS selectors, follow the “next” link until there are none, and write all results to a CSV. Add a polite delay between requests and a realistic User-Agent header. Bonus: handle missing fields gracefully and check robots.txt before scraping.

📋 Summary

Web scraping extracts data from web pages programmatically.
requests fetches HTML and BeautifulSoup parses it; CSS selectors locate elements.
For JavaScript-rendered pages, use a browser-automation tool like Selenium or Playwright.
Handle pagination by following “next” links or incrementing page parameters.
Scrape ethically: respect robots.txt, throttle requests, and set a sensible User-Agent.
Save results to CSV/JSON and handle missing or changed fields defensively.

Interview Questions on Web Scraping

What is web scraping and what are common use cases?
What libraries are commonly used for scraping in Python?
What is the difference between requests + BeautifulSoup and a tool like Selenium?
How do you handle pagination when scraping?
What are the legal and ethical considerations (robots.txt, rate limiting)?
How do you scrape pages that load content with JavaScript?
How do you make a scraper robust against page changes?

FAQ

Is web scraping legal?

It depends on the site, its terms of service, and your jurisdiction. Always check robots.txt and the site’s terms, avoid scraping personal or copyrighted data you have no right to, throttle your requests, and prefer an official API when one exists.

When do I need Selenium instead of requests + BeautifulSoup?

Use requests + BeautifulSoup for static HTML. When content is rendered by JavaScript after the page loads, a regular request won’t see it — a browser-automation tool like Selenium or Playwright runs the JS and gives you the final DOM.

How do I avoid getting blocked?

Be polite: add delays between requests, set a realistic User-Agent, respect rate limits and robots.txt, and avoid hammering the server. Aggressive scraping can get your IP blocked and may violate the site’s terms.
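
A fixed `time.sleep()` after every request works, but it also waits when parsing has already taken longer than the delay. One possible refinement is a small throttle that sleeps only for the time still owed. The `Throttle` class below is a hypothetical helper written for this illustration; the `clock` and `sleep` parameters exist only so it can be exercised without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to keep requests min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=1.0)
throttle.wait()  # the first call returns immediately
```

In a scraper, call `throttle.wait()` right before each `requests.get(...)`, and send the same descriptive User-Agent header on every request.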

How do I make my scraper resilient to site changes?

Select elements with stable selectors, guard against missing fields (don’t assume every record is complete), log unexpected structures, and isolate parsing logic so you only have to update one place when the page layout changes.
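
These ideas can be sketched in a few lines. The example below reuses the `.item-card`, `h3`, and `.price` selectors from the pagination section; the `text_or_none` and `parse_items` helpers and the sample HTML are hypothetical, and treating a card without a title as unusable is an assumption made for illustration.

```python
import logging

from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def parse_items(html):
    """Keep all selector knowledge here, so a layout change touches one function."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.select(".item-card"):
        record = {
            "title": text_or_none(card, "h3"),
            "price": text_or_none(card, ".price"),
        }
        if record["title"] is None:
            # Log the unexpected structure instead of crashing mid-run.
            logger.warning("item card without a title: %r", str(card)[:80])
            continue
        records.append(record)
    return records


sample = """
<div class="item-card"><h3>Widget</h3><span class="price">9.99</span></div>
<div class="item-card"><h3>Gadget</h3></div>
<div class="item-card"><span class="price">1.00</span></div>
"""
print(parse_items(sample))
```

The second card is kept with a price of `None`, while the third is skipped and logged because it has no title.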
