Web Scraping with Python

Web Scraping with Python

Data, data, everywhere – in this digital age, data can be identified in every aspect of the digital world. It is the fuel that drives information in the modern world. But how does one collect it to inform different processes or systems? Not the manual way, at least, not for large amounts of it as this is manual labor that is not only time-consuming but also prone to errors. To address these pitfalls, one has to automate the process. And here is where web scraping comes into focus. Web scraping is the process of automating the extraction of large volumes of data from a website. It involves using scripts to identify and extract information in a systematic manner using a programming language, such as Python, resulting in a structured output format. In this post, I will take you through a web scraping exercise that automated the extraction of course details from a university website. The end goal of this exercise is to use the information to make data-driven decisions in the education sector.

Project Background

Most educational instructions are identified by an online catalog that shows their course offerings. These catalogs are, however, dependent on dynamic content, which may present a hurdle when extracting the information manually or using simple tools. This is where web scraping with Python became the go-to solution. The objective of this project was to:

Automate the process of extracting course data.
Organize the information into structured datasets.
Enable seamless integration with the project database.

This solution was specifically designed for the University of Massachusetts Lowell’s GPS Course Catalog as an ongoing web scraping exercise. As this is not a glove fit-all, it is adaptable to other similar platforms.

What the Scraper Does

The scraper uses Selenium, a powerful web automation framework, to dynamically interact with the website and extract the following details:

Course Codes
Course Names
Semester Availability
Schedules (Days and Times)
Tuition Fees
Detailed Descriptions (including prerequisites and online learning options)

The scraper then stores this data in JSON format for further processing.

Code Snippet

import json
import sys
import time
import os
import traceback

from openpyxl.styles import Alignment, Font, PatternFill
from openpyxl.utils import get_column_letter
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
from collections import defaultdict

# Setup for Chrome WebDriver
options = Options()
options.add_experimental_option("detach", False)
# options.add_argument("--headless")
# options.add_argument("--disable-gpu")
# options.add_argument("--no-sandbox")
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()), options=options
)


# Open the web page
driver.get("https://gps.uml.edu/catalog/search/2024/fall/")

# Maximize the browser window
driver.maximize_window()

# Wait for the page to load then print total records found
WebDriverWait(driver, 60).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".showing"))
)
total_record = driver.find_element(By.CSS_SELECTOR, ".showing").text.strip()
print(total_record)

# List to store extracted course details
course_list = []

# Get course rows
rows = driver.find_elements(
    By.XPATH, "/html/body/div[3]/div[5]/div/div/div/div/div/table/tbody/tr"
)

# Loop through each course row
for row in rows:
    try:
        # Extract course details
        course_code = row.find_element(By.CSS_SELECTOR, "td.course-number").text.strip()
        course_name = row.find_element(By.CSS_SELECTOR, ".course-name a").text.strip()
        course_link = (
            row.find_element(By.CSS_SELECTOR, ".course-name a")
            .get_attribute("href")
            .strip()
        )
        course_info = f"{course_name}: {course_link}"
        course_sis = row.find_element(By.CSS_SELECTOR, "td.isis-number").text.strip()
        course_day = row.find_element(By.CSS_SELECTOR, "td.course-day").text.strip()
        course_semester = row.find_element(
            By.CSS_SELECTOR, "td.course-date"
        ).text.strip()
        course_tuition = row.find_element(
            By.CSS_SELECTOR, "td.course-tuition"
        ).text.strip()

        # Append to course_list
        course_list.append(
            {
                "course_code": course_code,
                "course_name": course_name,
                "course_link": course_link,
                "course_sis": course_sis,
                "course_day": course_day,
                "course_semester": course_semester,
                "course_tuition": course_tuition,
            }
        )

        # print(course_info)

    except NoSuchElementException:
        print(f"Some details are missing for {course_code}. Skipping.")

# Save course_list to a JSON file
with open("courses.json", "w") as json_file:
    json.dump(course_list, json_file, indent=4)

print("Course details have been saved to courses.json.")

# Paths for checkpoint and output files
CHECKPOINT_FILE = "checkpoint.log"
OUTPUT_FILE = "merged_courses.json"

# Function to load checkpoint
def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "r") as f:
            return f.read().strip()
    return None

# Function to save checkpoint
def save_checkpoint(url):
    with open(CHECKPOINT_FILE, "w") as f:
        f.write(url)

# Function to save progress
def save_progress(data):
    with open(OUTPUT_FILE, "w") as f:
        json.dump(data, f, indent=4)

# Load previous progress
processed_courses = []
if os.path.exists(OUTPUT_FILE):
    with open(OUTPUT_FILE, "r") as f:
        processed_courses = json.load(f)

# Load checkpoint
last_processed_url = load_checkpoint()

# Load the general course data
with open("courses.json", "r") as general_file:
    general_courses = json.load(general_file)

# Start processing
try:
    for course in general_courses:
        course_link = course.get("course_link")

        # Skip if already processed
        if course_link in [c["course_link"] for c in processed_courses]:
            continue

        # Skip until the checkpoint is reached
        if last_processed_url and course_link != last_processed_url:
            continue
        last_processed_url = None  # Reset checkpoint after resuming

        try:
            # Open the course link
            driver.get(course_link)

            # Extract details from the page
            WebDriverWait(driver, 60).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "h1"))
            )
            course_title = driver.find_element(By.CSS_SELECTOR, "h1").text.strip()
            course_period = driver.find_element(By.CSS_SELECTOR, ".col-lg-8 p a:nth-child(1)").text.strip()
            course_category_element = driver.find_element(By.CSS_SELECTOR, ".col-lg-8 a:nth-child(2)")
            course_category = course_category_element.text.strip()
            course_category_link = course_category_element.get_attribute("href").strip()
            course_category = driver.find_element(By.CSS_SELECTOR, ".col-lg-8 a:nth-child(2)").text.strip()
            course_category_link = course_category.get_attribute("href").strip()
            course_status = driver.find_element(By.CSS_SELECTOR, ".col-lg-8 p+ p").text.strip()
            course_desc = driver.find_element(By.CSS_SELECTOR, ".setflush+ p").text.strip()

            # Extract first <ul class="nolist"> (Prerequisites, Core Codes, Credits, Instructor)
            ul_elements = driver.find_elements(By.CSS_SELECTOR, "ul.nolist")

            # Initialize two separate lists for the different <ul> elements
            list_items_1 = []  # For Prerequisites, Core Codes, Credits, Instructor
            list_items_2 = []  # For Online Course, Course Level, Tuition, Notes

            if len(ul_elements) > 0:
                # Extract items from the first <ul> (Prerequisites, Core Codes, Credits, Instructor)
                li_elements = ul_elements[0].find_elements(By.TAG_NAME, "li")
                for li in li_elements:
                    item_text = li.text.strip()
                    link_element = (
                        li.find_element(By.TAG_NAME, "a")
                        if li.find_elements(By.TAG_NAME, "a")
                        else None
                    )
                    item_link = link_element.get_attribute("href").strip() if link_element else None
                    list_items_1.append({"text": item_text, "link": item_link})

            if len(ul_elements) > 1:
                # Extract items from the second <ul> (Online Course, Course Level, Tuition, Notes)
                li_elements = ul_elements[1].find_elements(By.TAG_NAME, "li")
                for li in li_elements:
                    item_text = li.text.strip()
                    link_element = (
                        li.find_element(By.TAG_NAME, "a")
                        if li.find_elements(By.TAG_NAME, "a")
                        else None
                    )
                    item_link = link_element.get_attribute("href").strip() if link_element else None
                    list_items_2.append({"text": item_text, "link": item_link})

            # Add details to the course
            course.update({
                "details": {
                    "title": course_title,
                    "period": course_period,
                    "category": course_category,
                    "category_link": course_category_link,
                    "course_status": course_status,
                    "description": course_desc,
                    "prerequisites_and_credits": list_items_1,
                    "online_course_info": list_items_2,
                }
            })

            # Append to processed courses
            processed_courses.append(course)

            # Save progress and checkpoint
            save_progress(processed_courses)
            save_checkpoint(course_link)

        except Exception as e:
            print(f"Error processing {course_link}: {e}")
            print(traceback.format_exc())
            save_checkpoint(course_link)  # Save the checkpoint before exiting
            break

except KeyboardInterrupt:
    print("Script interrupted. Saving progress...")
    save_progress(processed_courses)

finally:
    driver.quit()
    print("Script completed or interrupted. Progress saved.")

# Load the general course data
with open("courses.json", "r") as general_file:
    general_courses = json.load(general_file)

# Load the detailed course data if it exists
detailed_courses = []
if os.path.exists("courses_with_details.json"):
    with open("courses_with_details.json", "r") as detailed_file:
        detailed_courses = json.load(detailed_file)

# Create a dictionary from detailed_courses for quick lookup using course_link
detailed_courses_dict = {course["course_link"]: course for course in detailed_courses}

# Merge the data
merged_courses = []
for course in general_courses:
    course_link = course.get("course_link")
    detailed_info = detailed_courses_dict.get(course_link, {})

    # Merge general course data with detailed info, ensuring that missing details are handled
    merged_course = {**course, **detailed_info}
    merged_courses.append(merged_course)

# Save the merged data to a new JSON file
with open("merged_courses.json", "w") as merged_file:
    json.dump(merged_courses, merged_file, indent=4)

print("Merged course details have been saved to merged_courses.json.")

# Exit Script
driver.quit()

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

257

258

259

260

261

262

import json

import sys

import time

import os

import traceback

from openpyxl.styles import Alignment, Font, PatternFill

from openpyxl.utils import get_column_letter

from selenium import webdriver

from selenium.common.exceptions import NoSuchElementException, TimeoutException

from selenium.webdriver.chrome.options import Options

from selenium.webdriver.chrome.service import Service

from selenium.webdriver.common.action_chains import ActionChains

from selenium.webdriver.common.by import By

from selenium.webdriver.common.keys import Keys

from selenium.webdriver.support import expected_conditions as EC

from selenium.webdriver.support.ui import WebDriverWait

from webdriver_manager.chrome import ChromeDriverManager

from collections import defaultdict

# Setup for Chrome WebDriver

options = Options()

options.add_experimental_option("detach", False)

# options.add_argument("--headless")

# options.add_argument("--disable-gpu")

# options.add_argument("--no-sandbox")

driver = webdriver.Chrome(

service=Service(ChromeDriverManager().install()), options=options

)

# Open the web page

driver.get("https://gps.uml.edu/catalog/search/2024/fall/")

# Maximize the browser window

driver.maximize_window()

# Wait for the page to load then print total records found

WebDriverWait(driver, 60).until(

EC.presence_of_element_located((By.CSS_SELECTOR, ".showing"))

)

total_record = driver.find_element(By.CSS_SELECTOR, ".showing").text.strip()

print(total_record)

# List to store extracted course details

course_list = []

# Get course rows

rows = driver.find_elements(

By.XPATH, "/html/body/div[3]/div[5]/div/div/div/div/div/table/tbody/tr"

)

# Loop through each course row

for row in rows:

try:

# Extract course details

course_code = row.find_element(By.CSS_SELECTOR, "td.course-number").text.strip()

course_name = row.find_element(By.CSS_SELECTOR, ".course-name a").text.strip()

course_link = (

row.find_element(By.CSS_SELECTOR, ".course-name a")

.get_attribute("href")

.strip()

)

course_info = f"{course_name}: {course_link}"

course_sis = row.find_element(By.CSS_SELECTOR, "td.isis-number").text.strip()

course_day = row.find_element(By.CSS_SELECTOR, "td.course-day").text.strip()

course_semester = row.find_element(

By.CSS_SELECTOR, "td.course-date"

).text.strip()

course_tuition = row.find_element(

By.CSS_SELECTOR, "td.course-tuition"

).text.strip()

# Append to course_list

course_list.append(

{

"course_code": course_code,

"course_name": course_name,

"course_link": course_link,

"course_sis": course_sis,

"course_day": course_day,

"course_semester": course_semester,

"course_tuition": course_tuition,

}

)

# print(course_info)

except NoSuchElementException:

print(f"Some details are missing for {course_code}. Skipping.")

# Save course_list to a JSON file

with open("courses.json", "w") as json_file:

json.dump(course_list, json_file, indent=4)

print("Course details have been saved to courses.json.")

# Paths for checkpoint and output files

CHECKPOINT_FILE = "checkpoint.log"

OUTPUT_FILE = "merged_courses.json"

# Function to load checkpoint

def load_checkpoint():

if os.path.exists(CHECKPOINT_FILE):

with open(CHECKPOINT_FILE, "r") as f:

return f.read().strip()

return None

# Function to save checkpoint

def save_checkpoint(url):

with open(CHECKPOINT_FILE, "w") as f:

f.write(url)

# Function to save progress

def save_progress(data):

with open(OUTPUT_FILE, "w") as f:

json.dump(data, f, indent=4)

# Load previous progress

processed_courses = []

if os.path.exists(OUTPUT_FILE):

with open(OUTPUT_FILE, "r") as f:

processed_courses = json.load(f)

# Load checkpoint

last_processed_url = load_checkpoint()

# Load the general course data

with open("courses.json", "r") as general_file:

general_courses = json.load(general_file)

# Start processing

try:

for course in general_courses:

course_link = course.get("course_link")

# Skip if already processed

if course_link in [c["course_link"] for c in processed_courses]:

continue

# Skip until the checkpoint is reached

if last_processed_url and course_link != last_processed_url:

continue

last_processed_url = None # Reset checkpoint after resuming

try:

# Open the course link

driver.get(course_link)

# Extract details from the page

WebDriverWait(driver, 60).until(

EC.presence_of_element_located((By.CSS_SELECTOR, "h1"))

)

course_title = driver.find_element(By.CSS_SELECTOR, "h1").text.strip()

course_period = driver.find_element(By.CSS_SELECTOR, ".col-lg-8 p a:nth-child(1)").text.strip()

course_category_element = driver.find_element(By.CSS_SELECTOR, ".col-lg-8 a:nth-child(2)")

course_category = course_category_element.text.strip()

course_category_link = course_category_element.get_attribute("href").strip()

course_category = driver.find_element(By.CSS_SELECTOR, ".col-lg-8 a:nth-child(2)").text.strip()

course_category_link = course_category.get_attribute("href").strip()

course_status = driver.find_element(By.CSS_SELECTOR, ".col-lg-8 p+ p").text.strip()

course_desc = driver.find_element(By.CSS_SELECTOR, ".setflush+ p").text.strip()

# Extract first <ul class="nolist"> (Prerequisites, Core Codes, Credits, Instructor)

ul_elements = driver.find_elements(By.CSS_SELECTOR, "ul.nolist")

# Initialize two separate lists for the different <ul> elements

list_items_1 = [] # For Prerequisites, Core Codes, Credits, Instructor

list_items_2 = [] # For Online Course, Course Level, Tuition, Notes

if len(ul_elements) > 0:

# Extract items from the first <ul> (Prerequisites, Core Codes, Credits, Instructor)

li_elements = ul_elements[0].find_elements(By.TAG_NAME, "li")

for li in li_elements:

item_text = li.text.strip()

link_element = (

li.find_element(By.TAG_NAME, "a")

if li.find_elements(By.TAG_NAME, "a")

else None

)

item_link = link_element.get_attribute("href").strip() if link_element else None

list_items_1.append({"text": item_text, "link": item_link})

if len(ul_elements) > 1:

# Extract items from the second <ul> (Online Course, Course Level, Tuition, Notes)

li_elements = ul_elements[1].find_elements(By.TAG_NAME, "li")

for li in li_elements:

item_text = li.text.strip()

link_element = (

li.find_element(By.TAG_NAME, "a")

if li.find_elements(By.TAG_NAME, "a")

else None

)

item_link = link_element.get_attribute("href").strip() if link_element else None

list_items_2.append({"text": item_text, "link": item_link})

# Add details to the course

course.update({

"details": {

"title": course_title,

"period": course_period,

"category": course_category,

"category_link": course_category_link,

"course_status": course_status,

"description": course_desc,

"prerequisites_and_credits": list_items_1,

"online_course_info": list_items_2,

}

})

# Append to processed courses

processed_courses.append(course)

# Save progress and checkpoint

save_progress(processed_courses)

save_checkpoint(course_link)

except Exception as e:

print(f"Error processing {course_link}: {e}")

print(traceback.format_exc())

save_checkpoint(course_link) # Save the checkpoint before exiting

break

except KeyboardInterrupt:

print("Script interrupted. Saving progress...")

save_progress(processed_courses)

finally:

driver.quit()

print("Script completed or interrupted. Progress saved.")

# Load the general course data

with open("courses.json", "r") as general_file:

general_courses = json.load(general_file)

# Load the detailed course data if it exists

detailed_courses = []

if os.path.exists("courses_with_details.json"):

with open("courses_with_details.json", "r") as detailed_file:

detailed_courses = json.load(detailed_file)

# Create a dictionary from detailed_courses for quick lookup using course_link

detailed_courses_dict = {course["course_link"]: course for course in detailed_courses}

# Merge the data

merged_courses = []

for course in general_courses:

course_link = course.get("course_link")

detailed_info = detailed_courses_dict.get(course_link, {})

# Merge general course data with detailed info, ensuring that missing details are handled

merged_course = {**course, **detailed_info}

merged_courses.append(merged_course)

# Save the merged data to a new JSON file

with open("merged_courses.json", "w") as merged_file:

json.dump(merged_courses, merged_file, indent=4)

print("Merged course details have been saved to merged_courses.json.")

# Exit Script

driver.quit()

How It Works

Dynamic Interaction with Web Pages – using Selenium, the scraper navigates through dynamically loaded web pages, interacts with elements (like links and buttons), and ensures all data is fully loaded before extraction.
Error Handling for Robust Performance – websites are often inconsistent in how data is presented. This scraper is built to handle missing elements gracefully without disrupting the entire workflow.
Detailed Data Parsing:
- General course information is extracted from the main catalog page.
- Detailed course descriptions, prerequisites, and category data are scraped from individual course pages.
Data Structuring – extracted data is saved in a hierarchical JSON format, making it easy to analyze, visualize, or import into a database.

Technologies Used

Programming Language: Python
Automation Framework: Selenium
Browser Driver Management: WebDriver Manager
Data Handling: JSON

This project demonstrates my ability to use cutting-edge tools and frameworks to tackle real-world challenges in data automation.

Conclusion

Automating data collection through web scraping with Python offers significant benefits for educational institutions and EdTech platforms. It enhances efficiency by saving countless hours of manual work, allowing staff to focus on more critical tasks. By reducing errors in course information, automation improves accuracy, ensuring better communication with students and other stakeholders. Additionally, the scalability of this solution means it can be easily adapted for use across multiple institutions or course catalogs, making it a versatile tool for addressing a wide range of educational needs. I’m excited about the potential of web scraping and data automation in education. Whether it’s for creating up-to-date course catalogs, analyzing trends in course enrollment, or streamlining administrative tasks, these technologies can empower institutions to focus on what matters most—delivering quality education. Let’s harness the power of data to transform education, one project at a time!

Download Files

File 1: merged_courses.zip

Let’s Work Together

Contact Me