Data, data, everywhere – in this digital age, data is woven into every corner of the digital world, and it is the fuel that drives modern decision-making. But how does one collect it to inform different processes or systems? Not manually, at least not in large volumes: manual collection is both time-consuming and prone to errors. To address these pitfalls, the process has to be automated, and this is where web scraping comes into focus.
Web scraping is the process of automating the extraction of large volumes of data from a website: a script written in a programming language such as Python identifies and extracts the information systematically and emits it in a structured format. In this post, I will walk you through a web scraping exercise that automated the extraction of course details from a university website, with the end goal of using that information to support data-driven decisions in the education sector.
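To make the idea concrete, here is a minimal sketch of that fetch-parse-store loop for a simple static page. It is not the scraper used in this project (which, as explained below, needs a real browser to handle dynamic content), and the URL and CSS selectors are placeholders.

# Illustrative only: fetch a static page with requests, pick out elements with
# BeautifulSoup CSS selectors, and save the results as structured JSON.
import json

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/courses", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

courses = [
    {"name": row.select_one(".course-name").get_text(strip=True)}
    for row in soup.select("table tbody tr")
    if row.select_one(".course-name")
]

with open("example_courses.json", "w") as f:
    json.dump(courses, f, indent=4)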
Project Background
Most educational institutions publish an online catalog of their course offerings. These catalogs, however, rely on dynamic content, which is a hurdle when trying to extract the information manually or with simple tools. This is where web scraping with Python became the go-to solution.
The objective of this project was to automate the extraction of course details from a dynamic online catalog and deliver them in a structured format that can support data-driven decision-making.
This solution was built specifically for the University of Massachusetts Lowell’s GPS Course Catalog as an ongoing web scraping exercise. While it is not a one-size-fits-all tool, it is adaptable to other, similar platforms.
What the Scraper Does
The scraper uses Selenium, a powerful web automation framework, to interact with the website dynamically and extract the following details: the course code, course name and link, SIS number, meeting day, semester dates, and tuition from the catalog listing, plus, from each individual course page, the title, period, category, status, description, prerequisites, credits, instructor, and online-course information.
The scraper then stores this data in JSON format for further processing.
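Each record written to courses.json carries the listing fields named above. For illustration only (the values below are made up, not actual catalog data), a single record looks roughly like this:

{
    "course_code": "EXMPL.1010",
    "course_name": "Example Course",
    "course_link": "https://gps.uml.edu/catalog/...",
    "course_sis": "12345",
    "course_day": "Online",
    "course_semester": "Sep 4 - Dec 13",
    "course_tuition": "$1,140"
}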
Code Snippet
import json
import sys
import time
import os
import traceback
from collections import defaultdict

from openpyxl.styles import Alignment, Font, PatternFill
from openpyxl.utils import get_column_letter
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

# Setup for Chrome WebDriver
options = Options()
options.add_experimental_option("detach", False)
# options.add_argument("--headless")
# options.add_argument("--disable-gpu")
# options.add_argument("--no-sandbox")
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()), options=options
)

# Open the catalog search page
driver.get("https://gps.uml.edu/catalog/search/2024/fall/")

# Maximize the browser window
driver.maximize_window()

# Wait for the page to load, then print the total records found
WebDriverWait(driver, 60).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".showing"))
)
total_record = driver.find_element(By.CSS_SELECTOR, ".showing").text.strip()
print(total_record)

# List to store extracted course details
course_list = []

# Get course rows
rows = driver.find_elements(
    By.XPATH, "/html/body/div[3]/div[5]/div/div/div/div/div/table/tbody/tr"
)

# Loop through each course row
for row in rows:
    try:
        # Extract course details
        course_code = row.find_element(By.CSS_SELECTOR, "td.course-number").text.strip()
        course_name = row.find_element(By.CSS_SELECTOR, ".course-name a").text.strip()
        course_link = (
            row.find_element(By.CSS_SELECTOR, ".course-name a")
            .get_attribute("href")
            .strip()
        )
        course_info = f"{course_name}: {course_link}"
        course_sis = row.find_element(By.CSS_SELECTOR, "td.isis-number").text.strip()
        course_day = row.find_element(By.CSS_SELECTOR, "td.course-day").text.strip()
        course_semester = row.find_element(By.CSS_SELECTOR, "td.course-date").text.strip()
        course_tuition = row.find_element(By.CSS_SELECTOR, "td.course-tuition").text.strip()

        # Append to course_list
        course_list.append(
            {
                "course_code": course_code,
                "course_name": course_name,
                "course_link": course_link,
                "course_sis": course_sis,
                "course_day": course_day,
                "course_semester": course_semester,
                "course_tuition": course_tuition,
            }
        )
        # print(course_info)
    except NoSuchElementException:
        print(f"Some details are missing for {course_code}. Skipping.")
Skipping.") # Save course_list to a JSON file with open("courses.json", "w") as json_file: json.dump(course_list, json_file, indent=4) print("Course details have been saved to courses.json.") # Paths for checkpoint and output files CHECKPOINT_FILE = "checkpoint.log" OUTPUT_FILE = "merged_courses.json" # Function to load checkpoint def load_checkpoint(): if os.path.exists(CHECKPOINT_FILE): with open(CHECKPOINT_FILE, "r") as f: return f.read().strip() return None # Function to save checkpoint def save_checkpoint(url): with open(CHECKPOINT_FILE, "w") as f: f.write(url) # Function to save progress def save_progress(data): with open(OUTPUT_FILE, "w") as f: json.dump(data, f, indent=4) # Load previous progress processed_courses = [] if os.path.exists(OUTPUT_FILE): with open(OUTPUT_FILE, "r") as f: processed_courses = json.load(f) # Load checkpoint last_processed_url = load_checkpoint() # Load the general course data with open("courses.json", "r") as general_file: general_courses = json.load(general_file) # Start processing try: for course in general_courses: course_link = course.get("course_link") # Skip if already processed if course_link in [c["course_link"] for c in processed_courses]: continue # Skip until the checkpoint is reached if last_processed_url and course_link != last_processed_url: continue last_processed_url = None # Reset checkpoint after resuming try: # Open the course link driver.get(course_link) # Extract details from the page WebDriverWait(driver, 60).until( EC.presence_of_element_located((By.CSS_SELECTOR, "h1")) ) course_title = driver.find_element(By.CSS_SELECTOR, "h1").text.strip() course_period = driver.find_element(By.CSS_SELECTOR, ".col-lg-8 p a:nth-child(1)").text.strip() course_category_element = driver.find_element(By.CSS_SELECTOR, ".col-lg-8 a:nth-child(2)") course_category = course_category_element.text.strip() course_category_link = course_category_element.get_attribute("href").strip() course_category = driver.find_element(By.CSS_SELECTOR, ".col-lg-8 a:nth-child(2)").text.strip() course_category_link = course_category.get_attribute("href").strip() course_status = driver.find_element(By.CSS_SELECTOR, ".col-lg-8 p+ p").text.strip() course_desc = driver.find_element(By.CSS_SELECTOR, ".setflush+ p").text.strip() # Extract first <ul class="nolist"> (Prerequisites, Core Codes, Credits, Instructor) ul_elements = driver.find_elements(By.CSS_SELECTOR, "ul.nolist") # Initialize two separate lists for the different <ul> elements list_items_1 = [] # For Prerequisites, Core Codes, Credits, Instructor list_items_2 = [] # For Online Course, Course Level, Tuition, Notes if len(ul_elements) > 0: # Extract items from the first <ul> (Prerequisites, Core Codes, Credits, Instructor) li_elements = ul_elements[0].find_elements(By.TAG_NAME, "li") for li in li_elements: item_text = li.text.strip() link_element = ( li.find_element(By.TAG_NAME, "a") if li.find_elements(By.TAG_NAME, "a") else None ) item_link = link_element.get_attribute("href").strip() if link_element else None list_items_1.append({"text": item_text, "link": item_link}) if len(ul_elements) > 1: # Extract items from the second <ul> (Online Course, Course Level, Tuition, Notes) li_elements = ul_elements[1].find_elements(By.TAG_NAME, "li") for li in li_elements: item_text = li.text.strip() link_element = ( li.find_element(By.TAG_NAME, "a") if li.find_elements(By.TAG_NAME, "a") else None ) item_link = link_element.get_attribute("href").strip() if link_element else None list_items_2.append({"text": item_text, "link": item_link}) 
            # Add details to the course
            course.update(
                {
                    "details": {
                        "title": course_title,
                        "period": course_period,
                        "category": course_category,
                        "category_link": course_category_link,
                        "course_status": course_status,
                        "description": course_desc,
                        "prerequisites_and_credits": list_items_1,
                        "online_course_info": list_items_2,
                    }
                }
            )

            # Append to processed courses
            processed_courses.append(course)

            # Save progress and checkpoint
            save_progress(processed_courses)
            save_checkpoint(course_link)

        except Exception as e:
            print(f"Error processing {course_link}: {e}")
            print(traceback.format_exc())
            save_checkpoint(course_link)  # Save the checkpoint before exiting
            break

except KeyboardInterrupt:
    print("Script interrupted. Saving progress...")
    save_progress(processed_courses)

finally:
    driver.quit()
    print("Script completed or interrupted. Progress saved.")

# Load the general course data
with open("courses.json", "r") as general_file:
    general_courses = json.load(general_file)

# Load the detailed course data if it exists
detailed_courses = []
if os.path.exists("courses_with_details.json"):
    with open("courses_with_details.json", "r") as detailed_file:
        detailed_courses = json.load(detailed_file)

# Create a dictionary from detailed_courses for quick lookup using course_link
detailed_courses_dict = {course["course_link"]: course for course in detailed_courses}

# Merge the data
merged_courses = []
for course in general_courses:
    course_link = course.get("course_link")
    detailed_info = detailed_courses_dict.get(course_link, {})

    # Merge general course data with detailed info, handling missing details gracefully
    merged_course = {**course, **detailed_info}
    merged_courses.append(merged_course)

# Save the merged data to a new JSON file
with open("merged_courses.json", "w") as merged_file:
    json.dump(merged_courses, merged_file, indent=4)

print("Merged course details have been saved to merged_courses.json.")
How It Works
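At a high level, the script runs in three stages. First, it opens the catalog search page, waits for the results table to render, and reads the summary fields from every row into courses.json. Second, it visits each course link in turn, waits for the detail page to load, extracts the title, period, category, status, description, and the prerequisite/credit and online-course lists, and saves the enriched record as it goes; a small checkpoint file records the last URL handled so that an interrupted run can resume where it left off instead of starting over. Finally, it merges the summary and detail records by course link and writes the combined data to merged_courses.json.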
Technologies Used
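The project is built in Python and relies on Selenium with the Chrome WebDriver for browser automation, webdriver-manager to install and manage a matching ChromeDriver, explicit waits (WebDriverWait with expected conditions) to handle dynamic content, and the standard json module for structured storage. openpyxl is also imported for spreadsheet styling, although it is not exercised in the snippet above.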
This project demonstrates my ability to use cutting-edge tools and frameworks to tackle real-world challenges in data automation.
Conclusion
Automating data collection through web scraping with Python offers significant benefits for educational institutions and EdTech platforms. It enhances efficiency by saving countless hours of manual work, allowing staff to focus on more critical tasks. By reducing errors in course information, automation improves accuracy, ensuring better communication with students and other stakeholders. Additionally, the scalability of this solution means it can be easily adapted for use across multiple institutions or course catalogs, making it a versatile tool for addressing a wide range of educational needs.
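To make that adaptability concrete, here is a rough sketch (not part of the project code) of how the same logic could be driven by a per-catalog configuration, so supporting another institution mostly means adding a new entry rather than rewriting the scraper. The catalog names and selectors below are assumptions for illustration only.

# Hypothetical configuration-driven variant: swap catalogs by editing CATALOGS,
# not the scraping logic. Selector values here are illustrative assumptions.
from selenium.webdriver.common.by import By

CATALOGS = {
    "uml_gps_fall_2024": {
        "url": "https://gps.uml.edu/catalog/search/2024/fall/",
        "row_selector": "table tbody tr",
        "fields": {
            "course_code": "td.course-number",
            "course_name": ".course-name a",
            "course_tuition": "td.course-tuition",
        },
    },
    # Additional institutions or catalogs would be added here.
}

def scrape_catalog(driver, config):
    """Extract the configured fields from every row of one catalog page."""
    driver.get(config["url"])
    records = []
    for row in driver.find_elements(By.CSS_SELECTOR, config["row_selector"]):
        record = {}
        for field, selector in config["fields"].items():
            cells = row.find_elements(By.CSS_SELECTOR, selector)
            record[field] = cells[0].text.strip() if cells else None
        records.append(record)
    return records

Called with a Selenium driver like the one created earlier, for example scrape_catalog(driver, CATALOGS["uml_gps_fall_2024"]), the function returns a list of dictionaries ready to dump to JSON.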
I’m excited about the potential of web scraping and data automation in education. Whether it’s for creating up-to-date course catalogs, analyzing trends in course enrollment, or streamlining administrative tasks, these technologies can empower institutions to focus on what matters most—delivering quality education.
Let’s harness the power of data to transform education, one project at a time!