What is Web Scraping?
Web scraping is a powerful technique for extracting data from websites. In this article, we will explore how to scrape book details such as titles, prices, and ratings from the website "Books to Scrape" using Python. We will also enhance the script to improve error handling and data storage.
Extracting Books Data Example
To demonstrate web scraping, we will extract book details from the "Books to Scrape" website. This example covers fetching book titles, prices, and ratings and saving the extracted data in a structured format.
Prerequisites
To follow along, ensure you have Python installed along with the necessary libraries:
pip install requests beautifulsoup4
Note that csv and sqlite3 are part of Python's standard library and do not need to be installed separately.
Web Scraping Process
The script follows these steps:
- Send an HTTP request to fetch the webpage content.
- Parse the HTML using BeautifulSoup.
- Extract relevant data (book title, price, and rating).
- Save the data to a CSV file for further analysis.
Step 1. Import Required Libraries
import requests
from bs4 import BeautifulSoup
import csv
import sqlite3
These libraries handle sending HTTP requests, parsing HTML, and writing to CSV files. The sqlite3 import is included for the database storage enhancement discussed at the end of this article.
Step 2. Set Up HTTP Headers
To avoid being blocked by the server, we set up a User-Agent header to mimic a browser request:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
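If you plan to request several pages (see the pagination enhancement at the end of this article), reusing a single session is a common pattern. This optional sketch assumes the headers dictionary defined above:

# Optional: a session reuses one connection pool and one set of headers
session = requests.Session()
session.headers.update(headers)
# subsequent calls: session.get(url, timeout=10)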
Step 3. Initialize CSV File
We create a CSV file and define column headers:
csv_file = open('books.csv', 'w', newline='', encoding='utf-8')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Title', 'Price', 'Rating'])
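Opened this way, the file must be closed explicitly once scraping finishes (we do this at the end of Step 6). An alternative sketch uses a with block, which closes the file automatically even if an error occurs:

# Alternative: a context manager closes the file automatically
with open('books.csv', 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Title', 'Price', 'Rating'])
    # ... scraping and row writing would then happen inside this block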
Step 4. Define the Rating Conversion Function
Since ratings are represented as class names in the HTML, we map them to numerical values:
def get_rating(rating_class):
    """Convert a star-rating class name to a numerical value."""
    rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
    return rating_map.get(rating_class, 0)
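For example, a book whose rating element carries the classes star-rating Three converts like this:

print(get_rating('Three'))    # 3
print(get_rating('Unknown'))  # 0 (unmapped values default to 0)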
Step 5. Scrape the Web Page
We send an HTTP GET request to the website and parse the response using BeautifulSoup:
url = 'http://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
books = soup.find_all('article', class_='product_pod')
print(f"Processing first page... Found {len(books)} books")
Step 6. Extract and Store Data
We loop through the books and extract relevant information:
for book in books:
    # Initialize so the except block can report them even on early failure
    price_text = ''
    price_clean = ''
    try:
        title = book.h3.a['title']

        # Extract and clean the price
        price_element = book.find('p', class_='price_color')
        price_text = price_element.text.strip() if price_element else '0'
        price_clean = ''.join(c for c in price_text if c.isdigit() or c == '.')

        # Handle multiple decimal points and empty values
        if not price_clean:
            price = 0.0
        else:
            parts = price_clean.split('.')
            if len(parts) > 1:
                price_clean = f"{parts[0]}.{''.join(parts[1:])}"
            price = float(price_clean)

        # Extract the rating from the second class name (e.g. 'star-rating Three')
        rating_element = book.find('p', class_='star-rating')
        rating_class = rating_element['class'][1] if rating_element else ''
        rating = get_rating(rating_class)

        # Write the row to the CSV file
        csv_writer.writerow([title, price, rating])
        print(f"Processed book: {title}")
    except Exception as e:
        print(f"Error processing book: {e}")
        print(f"Raw price text: {repr(price_text)}")
        print(f"Cleaned price: {repr(price_clean)}")
        continue

# Close the CSV file once all books have been processed
csv_file.close()
Output
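Running the script prints one line per processed book and writes the rows to books.csv. Against the first catalogue page, the console output looks roughly like this (exact titles depend on the site's current catalogue):

Processing first page... Found 20 books
Processed book: A Light in the Attic
Processed book: Tipping the Velvet
...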
Enhancements and Future Improvements
- Pagination Handling: Extend the script to scrape multiple pages automatically (see the sketch after this list).
- Database Storage: Save scraped data to an SQLite database for better data management (a minimal sketch also follows below).
- Error Handling: Improve exception handling for robustness.
- GUI Integration: Develop a simple interface to display results interactively.
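A minimal pagination sketch, assuming the page-N.html URL pattern used in Step 5; the upper bound of 50 matches the site's catalogue at the time of writing, and the loop also stops early if a page returns 404:

base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

for page in range(1, 51):  # adjust the upper bound for other sites
    response = requests.get(base_url.format(page), headers=headers, timeout=10)
    if response.status_code == 404:
        break  # no more pages
    soup = BeautifulSoup(response.text, 'html.parser')
    books = soup.find_all('article', class_='product_pod')
    print(f"Page {page}: found {len(books)} books")
    # ... extract and write each book exactly as in Step 6

For database storage, here is a minimal sketch using the sqlite3 module imported in Step 1. The database file name, table name, and schema are illustrative choices, not part of the original script:

conn = sqlite3.connect('books.db')
cursor = conn.cursor()
cursor.execute('''CREATE TABLE IF NOT EXISTS books
                  (title TEXT, price REAL, rating INTEGER)''')

# Inside the scraping loop, in place of (or alongside) the CSV write:
cursor.execute('INSERT INTO books VALUES (?, ?, ?)', (title, price, rating))

conn.commit()
conn.close()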
With these improvements, you can transform this script into a fully functional web scraping tool.