What is Web Scraping?
Web scraping is a powerful technique for extracting data from websites. In this article, we will explore how to scrape book details such as titles, prices, and ratings from the website "Books to Scrape" using Python. We will also enhance the script to improve error handling and data storage.
Extracting Books Data Example
To demonstrate web scraping, we will extract book details from the "Books to Scrape" website. This example covers fetching book titles, prices, and ratings and saving the extracted data in a structured format.
Prerequisites
To follow along, ensure you have Python installed along with the necessary libraries:
pip install requests beautifulsoup4
Note that csv and sqlite3 are part of Python's standard library and do not need to be installed separately.
Web Scraping Process
The script follows these steps:
- Send an HTTP request to fetch the webpage content.
- Parse the HTML using BeautifulSoup.
- Extract relevant data (book title, price, and rating).
- Save the data to a CSV file for further analysis.
Step 1. Import Required Libraries
import requests
from bs4 import BeautifulSoup
import csv
import sqlite3
These libraries handle sending HTTP requests, parsing HTML, and writing to CSV files. The sqlite3 import is included for the database storage enhancement discussed at the end of this article.
Step 2. Set Up HTTP Headers
To avoid being blocked by the server, we set up a User-Agent header to mimic a browser request:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
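If you plan to request several pages (see the pagination enhancement at the end of this article), reusing a single session is a common pattern. This optional sketch assumes the headers dictionary defined above:

# Optional: a session reuses one connection pool and one set of headers
session = requests.Session()
session.headers.update(headers)
# subsequent calls: session.get(url, timeout=10)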
Step 3. Initialize CSV File
We create a CSV file and define column headers:
csv_file = open('books.csv', 'w', newline='', encoding='utf-8')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Title', 'Price', 'Rating'])
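Opened this way, the file must be closed explicitly once scraping finishes (we do this at the end of Step 6). An alternative sketch uses a with block, which closes the file automatically even if an error occurs:

# Alternative: a context manager closes the file automatically
with open('books.csv', 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Title', 'Price', 'Rating'])
    # ... scraping and row writing would then happen inside this block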
Step 4. Define the Rating Conversion Function
Since ratings are represented as class names in the HTML, we map them to numerical values:
def get_rating(rating_class):
    """Convert a star-rating class name to a numerical value."""
    rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
    return rating_map.get(rating_class, 0)
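For example, a book whose rating element carries the classes star-rating Three converts like this:

print(get_rating('Three'))    # 3
print(get_rating('Unknown'))  # 0 (unmapped values default to 0)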
Step 5. Scrape the Web Page
We send an HTTP GET request to the website and parse the response using BeautifulSoup:
url = 'http://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
books = soup.find_all('article', class_='product_pod')
print(f"Processing first page... Found {len(books)} books")
Step 6. Extract and Store Data
We loop through the books and extract relevant information:
for book in books:
    # Initialize so the except block can report them even on early failure
    price_text = ''
    price_clean = ''
    try:
        title = book.h3.a['title']

        # Extract and clean the price
        price_element = book.find('p', class_='price_color')
        price_text = price_element.text.strip() if price_element else '0'
        price_clean = ''.join(c for c in price_text if c.isdigit() or c == '.')

        # Handle multiple decimal points and empty values
        if not price_clean:
            price = 0.0
        else:
            parts = price_clean.split('.')
            if len(parts) > 1:
                price_clean = f"{parts[0]}.{''.join(parts[1:])}"
            price = float(price_clean)

        # Extract the rating from the second class name (e.g. 'star-rating Three')
        rating_element = book.find('p', class_='star-rating')
        rating_class = rating_element['class'][1] if rating_element else ''
        rating = get_rating(rating_class)

        # Write the row to the CSV file
        csv_writer.writerow([title, price, rating])
        print(f"Processed book: {title}")
    except Exception as e:
        print(f"Error processing book: {e}")
        print(f"Raw price text: {repr(price_text)}")
        print(f"Cleaned price: {repr(price_clean)}")
        continue

# Close the CSV file once all books have been processed
csv_file.close()
Output
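Running the script prints one line per processed book and writes the rows to books.csv. Against the first catalogue page, the console output looks roughly like this (exact titles depend on the site's current catalogue):

Processing first page... Found 20 books
Processed book: A Light in the Attic
Processed book: Tipping the Velvet
...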
Enhancements and Future Improvements
- Pagination Handling: Extend the script to scrape multiple pages automatically (see the sketch after this list).
- Database Storage: Save scraped data to an SQLite database for better data management (a minimal sketch also follows below).
- Error Handling: Improve exception handling for robustness.
- GUI Integration: Develop a simple interface to display results interactively.
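A minimal pagination sketch, assuming the page-N.html URL pattern used in Step 5; the upper bound of 50 matches the site's catalogue at the time of writing, and the loop also stops early if a page returns 404:

base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

for page in range(1, 51):  # adjust the upper bound for other sites
    response = requests.get(base_url.format(page), headers=headers, timeout=10)
    if response.status_code == 404:
        break  # no more pages
    soup = BeautifulSoup(response.text, 'html.parser')
    books = soup.find_all('article', class_='product_pod')
    print(f"Page {page}: found {len(books)} books")
    # ... extract and write each book exactly as in Step 6

For database storage, here is a minimal sketch using the sqlite3 module imported in Step 1. The database file name, table name, and schema are illustrative choices, not part of the original script:

conn = sqlite3.connect('books.db')
cursor = conn.cursor()
cursor.execute('''CREATE TABLE IF NOT EXISTS books
                  (title TEXT, price REAL, rating INTEGER)''')

# Inside the scraping loop, in place of (or alongside) the CSV write:
cursor.execute('INSERT INTO books VALUES (?, ?, ?)', (title, price, rating))

conn.commit()
conn.close()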
With these improvements, you can transform this script into a fully functional web scraping tool.