Web Scraping with Python

Data, data, everywhere – in this digital age, data permeates every aspect of the digital world. It is the fuel that drives information in the modern world. But how does one collect it to inform different processes or systems? Not manually, at least not for large amounts of it: manual collection is not only time-consuming but also prone to errors. To address these pitfalls, the process has to be automated. And this is where web scraping comes into focus.

Web scraping is the process of automating the extraction of large volumes of data from a website. It involves using scripts, written in a programming language such as Python, to identify and extract information systematically and produce output in a structured format. In this post, I will take you through a web scraping exercise that automated the extraction of course details from a university website. The end goal of this exercise is to use the information to make data-driven decisions in the education sector.

Project Background

Most educational institutions publish an online catalog of their course offerings. These catalogs, however, often rely on dynamically loaded content, which presents a hurdle when extracting the information manually or with simple tools. This is where web scraping with Python became the go-to solution.

The objective of this project was to:

  1. Automate the process of extracting course data.
  2. Organize the information into structured datasets.
  3. Enable seamless integration with the project database.

This solution was specifically designed for the University of Massachusetts Lowell’s GPS Course Catalog as an ongoing web scraping exercise. It is not a one-size-fits-all tool, but it is adaptable to other, similar platforms.

What the Scraper Does

The scraper uses Selenium, a powerful web automation framework, to dynamically interact with the website and extract the following details:

  • Course Codes
  • Course Names
  • Semester Availability
  • Schedules (Days and Times)
  • Tuition Fees
  • Detailed Descriptions (including prerequisites and online learning options)

The scraper then stores this data in JSON format for further processing.
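To illustrate the output shape, each catalog row becomes one flat JSON object. The record below is a hypothetical example (the field names match the scraper's keys, but the values are invented):

```python
import json

# A hypothetical record in the shape the scraper writes to courses.json.
# Field names match the scraper's keys; the values here are invented.
course = {
    "course_code": "EXAMPLE.1010",
    "course_name": "Introduction to Example Studies",
    "course_link": "https://gps.uml.edu/catalog/example",
    "course_sis": "12345",
    "course_day": "MoWe",
    "course_semester": "Summer 2024",
    "course_tuition": "$1,000",
}

# The scraper appends records like this to a list and dumps the list to disk.
print(json.dumps([course], indent=4))
```

Because every record shares the same keys, the resulting file loads cleanly into tools like pandas or a relational database later on.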

Code Snippet

    import json
    import traceback
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import NoSuchElementException, TimeoutException
    from webdriver_manager.chrome import ChromeDriverManager

    # Setup for Chrome WebDriver
    options = Options()
    options.add_experimental_option("detach", False)
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

    # Open the web page
    driver.get("https://gps.uml.edu/catalog/search/2024/summer/")
    driver.maximize_window()

    # Wait for the page to load
    WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".showing")))

    # Extract general course information from the catalog table
    rows = driver.find_elements(By.XPATH, "/html/body/div[3]/div[5]/div/div/div/div/div/table/tbody/tr")
    course_list = []
    for row in rows:
        try:
            course_code = row.find_element(By.CSS_SELECTOR, "td.course-number").text.strip()
            course_name = row.find_element(By.CSS_SELECTOR, ".course-name a").text.strip()
            course_link = row.find_element(By.CSS_SELECTOR, ".course-name a").get_attribute("href").strip()
            course_sis = row.find_element(By.CSS_SELECTOR, "td.isis-number").text.strip()
            course_day = row.find_element(By.CSS_SELECTOR, "td.course-day").text.strip()
            course_semester = row.find_element(By.CSS_SELECTOR, "td.course-date").text.strip()
            course_tuition = row.find_element(By.CSS_SELECTOR, "td.course-tuition").text.strip()
            course_list.append({
                "course_code": course_code,
                "course_name": course_name,
                "course_link": course_link,
                "course_sis": course_sis,
                "course_day": course_day,
                "course_semester": course_semester,
                "course_tuition": course_tuition,
            })
        except NoSuchElementException:
            print("Details missing for some rows. Skipping.")

    # Save to JSON
    with open("courses.json", "w") as json_file:
        json.dump(course_list, json_file, indent=4)

    # Extract detailed course information from each course page
    processed_courses = []
    for course in course_list:
        try:
            course_link = course["course_link"]
            driver.get(course_link)

            # Extract details
            WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1")))
            course_title = driver.find_element(By.CSS_SELECTOR, "h1").text.strip()
            course_period = driver.find_element(By.CSS_SELECTOR, ".col-lg-8 p a:nth-child(1)").text.strip()
            course_category_element = driver.find_element(By.CSS_SELECTOR, ".col-lg-8 a:nth-child(2)")
            course_category = course_category_element.text.strip()
            course_category_link = course_category_element.get_attribute("href").strip()
            course_status = driver.find_element(By.CSS_SELECTOR, ".col-lg-8 p+ p").text.strip()
            course_desc = driver.find_element(By.CSS_SELECTOR, ".setflush+ p").text.strip()

            # Extract <ul> elements (prerequisites/credits and online course info)
            ul_elements = driver.find_elements(By.CSS_SELECTOR, "ul.nolist")
            list_items_1, list_items_2 = [], []
            if len(ul_elements) > 0:
                li_elements = ul_elements[0].find_elements(By.TAG_NAME, "li")
                for li in li_elements:
                    item_text = li.text.strip()
                    link_element = li.find_element(By.TAG_NAME, "a") if li.find_elements(By.TAG_NAME, "a") else None
                    item_link = link_element.get_attribute("href").strip() if link_element else None
                    list_items_1.append({"text": item_text, "link": item_link})
            if len(ul_elements) > 1:
                li_elements = ul_elements[1].find_elements(By.TAG_NAME, "li")
                for li in li_elements:
                    item_text = li.text.strip()
                    link_element = li.find_element(By.TAG_NAME, "a") if li.find_elements(By.TAG_NAME, "a") else None
                    item_link = link_element.get_attribute("href").strip() if link_element else None
                    list_items_2.append({"text": item_text, "link": item_link})

            # Update course with details
            course.update({
                "details": {
                    "title": course_title,
                    "period": course_period,
                    "category": course_category,
                    "category_link": course_category_link,
                    "course_status": course_status,
                    "description": course_desc,
                    "prerequisites_and_credits": list_items_1,
                    "online_course_info": list_items_2,
                }
            })
            processed_courses.append(course)
        except Exception as e:
            print(f"Error processing {course['course_link']}: {e}")
            traceback.print_exc()

    # Save processed courses to JSON
    with open("courses_with_details.json", "w") as json_file:
        json.dump(processed_courses, json_file, indent=4)

    # Merge the two files on course_link and save
    with open("courses.json", "r") as general_file, open("courses_with_details.json", "r") as detailed_file:
        general_courses = json.load(general_file)
        detailed_courses = json.load(detailed_file)
    merged_courses = [{**gc, **dc} for gc in general_courses for dc in detailed_courses if gc["course_link"] == dc["course_link"]]
    with open("merged_courses.json", "w") as merged_file:
        json.dump(merged_courses, merged_file, indent=4)

    # Close the browser
    driver.quit()
    print("Merged course details saved.")

How It Works

  1. Dynamic Interaction with Web Pages – using Selenium, the scraper navigates through dynamically loaded web pages, interacts with elements (like links and buttons), and ensures all data is fully loaded before extraction.
  2. Error Handling for Robust Performance – websites are often inconsistent in how data is presented. This scraper is built to handle missing elements gracefully without disrupting the entire workflow.
  3. Detailed Data Parsing:
    • General course information is extracted from the main catalog page.
    • Detailed course descriptions, prerequisites, and category data are scraped from individual course pages.
  4. Data Structuring – extracted data is saved in a hierarchical JSON format, making it easy to analyze, visualize, or import into a database.
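As a sketch of the database-import step mentioned above, the structured JSON can be loaded into a relational table with Python's built-in sqlite3 module. The schema, table name, and sample record below are all invented for illustration; only the field names follow the scraper's output:

```python
import sqlite3

# Invented sample record in the scraper's output shape.
courses = [
    {
        "course_code": "EXAMPLE.1010",
        "course_name": "Introduction to Example Studies",
        "course_link": "https://gps.uml.edu/catalog/example",
        "course_tuition": "$1,000",
        "details": {"description": "An invented course description."},
    }
]

conn = sqlite3.connect(":memory:")  # swap in a file path for a persistent database
conn.execute(
    """CREATE TABLE courses (
           code TEXT PRIMARY KEY,
           name TEXT,
           link TEXT,
           tuition TEXT,
           description TEXT
       )"""
)
conn.executemany(
    "INSERT INTO courses VALUES (?, ?, ?, ?, ?)",
    [
        (
            c["course_code"],
            c["course_name"],
            c["course_link"],
            c["course_tuition"],
            c.get("details", {}).get("description", ""),
        )
        for c in courses
    ],
)
conn.commit()

row = conn.execute("SELECT name FROM courses WHERE code = 'EXAMPLE.1010'").fetchone()
print(row[0])
```

In a real pipeline, the `courses` list would come from `json.load()` on the merged output file rather than being hard-coded.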

Technologies Used

  • Programming Language: Python
  • Automation Framework: Selenium
  • Browser Driver Management: WebDriver Manager
  • Data Handling: JSON

This project demonstrates my ability to use cutting-edge tools and frameworks to tackle real-world challenges in data automation.

Conclusion

Automating data collection through web scraping with Python offers significant benefits for educational institutions and EdTech platforms. It enhances efficiency by saving countless hours of manual work, allowing staff to focus on more critical tasks. By reducing errors in course information, automation improves accuracy, ensuring better communication with students and other stakeholders. Additionally, the scalability of this solution means it can be easily adapted for use across multiple institutions or course catalogs, making it a versatile tool for addressing a wide range of educational needs.

I’m excited about the potential of web scraping and data automation in education. Whether it’s for creating up-to-date course catalogs, analyzing trends in course enrollment, or streamlining administrative tasks, these technologies can empower institutions to focus on what matters most—delivering quality education.

Let’s harness the power of data to transform education, one project at a time!
