← Back to OrbitUMD

For the Nerds · Technical Deep Dive

OrbitUMD: Under the Hood

If you landed here because the main page made you think "okay, but how does it actually work?", welcome. This is where I get to talk about the parts I found genuinely interesting to build. The live app is running at orbitumd.com.

  • React 18
  • TypeScript
  • Vite
  • React Router
  • Zustand
  • Tailwind
  • Radix UI
  • Supabase
  • Postgres
  • Node.js
  • Cheerio
  • Vitest

Overview

At its core, OrbitUMD is three systems bolted together cleanly. The catalog scraper converts UMD's dense, inconsistent HTML course catalog into a structured tree of degree requirements. The requirement evaluator walks that tree against a student's transcript and planned schedule to produce a live audit. And the data sync layer keeps course, section, and meeting data fresh by pulling from three external APIs incrementally.

The frontend is React 18 + TypeScript + Vite, with Zustand for course schedule builder state, React Context for app-wide toggles, and a UI built on Tailwind + Radix. The backend is Supabase: Postgres with Row Level Security, auth via Google OAuth, and migration-first schema management. All data engineering runs in Node.js.

Tech Stack

  • Frontend — React 18, TypeScript, Vite, React Router, Zustand, Tailwind CSS, Radix UI, React DnD, Recharts
  • Backend / DB — Supabase, Postgres (migration-first, orbit schema), Row Level Security
  • Auth — Supabase Auth (Google OAuth)
  • Data Engineering — Node.js, Cheerio (scraper), Python + Jupyter (early prototyping), pg driver
  • Testing — Vitest, Testing Library, regression corpus pipeline with quality scoring

The Catalog Scraper

UMD's course catalog is structured HTML, not an API. Every major and minor has a requirements page that uses headings, tables, paragraphs, and footnote superscripts in a way that looks automatable from a distance and breaks every naive assumption up close. The scraper runs five sequential stages, each consuming the output of the last.

1. Fetch + Parse

The catalog pages have no reliable CSS class names to select against, so the traversal uses Cheerio's DOM API to walk sibling elements inside the requirements region and classify each one by heuristic. A row containing an <a> whose text matches a department-code pattern (e.g. CMSC 131) is a course row. A row whose normalized text starts with "or" is an or-alternative. A bold or heading element with no course link is a section-header. A row mentioning "total" plus a credit count is a total row.

Each classified row becomes a TableContext object: rowType, courseId, courseName, credits (a number or a range), rawText, footnoteRefs (superscript symbols found in the row), and sectionContext (the nearest heading this row falls under). Whitespace normalization runs before classification: Unicode non-breaking spaces (\u00A0) and en-dash credit separators are replaced so downstream patterns have a consistent surface.
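The classification pass described above can be sketched roughly as follows. This is a simplified sketch, not the actual implementation: the function name `classifyRow`, the flag arguments, and the exact regex are my assumptions.

```typescript
// Hypothetical sketch of the per-row classification heuristic.
type RowType = "course" | "or-alternative" | "section-header" | "total";

interface ClassifiedRow {
  rowType: RowType;
  courseId?: string;
  rawText: string;
}

// Department-code pattern, e.g. "CMSC 131" or "CMSC131H" (assumed shape).
const COURSE_CODE = /\b[A-Z]{4}\s?\d{3}[A-Z]?\b/;

function normalize(text: string): string {
  // Replace non-breaking spaces and en-dash credit separators before matching,
  // so downstream patterns see a consistent surface.
  return text.replace(/\u00A0/g, " ").replace(/\u2013/g, "-").trim();
}

function classifyRow(rawText: string, hasCourseLink: boolean): ClassifiedRow {
  const text = normalize(rawText);
  const lower = text.toLowerCase();
  if (lower.startsWith("or ")) {
    return { rowType: "or-alternative", courseId: text.match(COURSE_CODE)?.[0], rawText: text };
  }
  if (/total/.test(lower) && /\d+/.test(lower)) {
    return { rowType: "total", rawText: text };
  }
  if (hasCourseLink && COURSE_CODE.test(text)) {
    return { rowType: "course", courseId: text.match(COURSE_CODE)![0], rawText: text };
  }
  // No course link and no other signal: treat as a section header.
  return { rowType: "section-header", rawText: text };
}
```

Order matters here: the "or" check has to run before the course-link check, because an or-alternative row usually also contains a course link.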

2. Semantic Pass

A stateful accumulator iterates over TableContext[] and produces RequirementSection[]. The accumulator tracks the current section label, a credit accumulator for the section, a running area index, and a buffer of pending or-alternative rows.

When the accumulator sees an or-alternative row after one or more course rows, it retroactively converts the last course row plus all buffered alternatives into a single ChoiceGroup. This matches the catalog's layout, where alternatives appear as subsequent rows beneath the primary option rather than being grouped by any HTML structure.
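The retroactive conversion can be sketched like this. The names (`foldRows`, the `Requirement` union) are mine, not the codebase's; the point is the buffering behavior: an or-alternative row mutates the previously emitted node rather than appending a new one.

```typescript
// Sketch of folding a flat row stream into requirements with ChoiceGroups.
interface CourseRow {
  rowType: "course" | "or-alternative";
  courseId: string;
}

type Requirement =
  | { kind: "course"; courseId: string }
  | { kind: "choice_group"; options: string[] };

function foldRows(rows: CourseRow[]): Requirement[] {
  const out: Requirement[] = [];
  for (const row of rows) {
    if (row.rowType === "or-alternative" && out.length > 0) {
      const prev = out[out.length - 1];
      if (prev.kind === "course") {
        // Retroactively convert the preceding course row plus this
        // alternative into a single ChoiceGroup.
        out[out.length - 1] = { kind: "choice_group", options: [prev.courseId, row.courseId] };
      } else {
        // A run of "or" rows keeps extending the same group.
        prev.options.push(row.courseId);
      }
    } else {
      out.push({ kind: "course", courseId: row.courseId });
    }
  }
  return out;
}
```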

Section type inference: a header matching "Area N:" or "Group N:" produces an area_selection. A flat list with a "select N" constraint becomes a choice_group. A list with no selectability constraint is fixed_courses. A section with only a credit count and level requirement is free_electives.

Constraint prose is parsed with a set of independent pattern matchers rather than a chain of else-ifs, so multiple constraints can be extracted from the same header string. A header like "Select five 400-level courses from at least three different areas" runs through all matchers in parallel: one extracts minCourses: 5, one extracts levelRange: "400-499", one extracts minAreas: 3. Each result is merged into the SectionConstraints for that block.
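The independent-matcher design might look something like this. The regexes and the word-number table are illustrative assumptions; the field names mirror SectionConstraints.

```typescript
// Each matcher is independent: it contributes fields or returns {}.
// Running all of them and merging lets one header yield multiple constraints.
interface ParsedConstraints {
  minCourses?: number;
  levelRange?: string;
  minAreas?: number;
}

const WORD_NUMS: Record<string, number> = { one: 1, two: 2, three: 3, four: 4, five: 5, six: 6 };
const num = (s: string) => WORD_NUMS[s.toLowerCase()] ?? parseInt(s, 10);

const matchers: Array<(header: string) => ParsedConstraints> = [
  (h) => {
    const m = h.match(/select (\w+)/i);
    return m ? { minCourses: num(m[1]) } : {};
  },
  (h) => {
    const m = h.match(/(\d)00-level/i);
    return m ? { levelRange: `${m[1]}00-${m[1]}99` } : {};
  },
  (h) => {
    const m = h.match(/at least (\w+) different areas/i);
    return m ? { minAreas: num(m[1]) } : {};
  },
];

function parseConstraints(header: string): ParsedConstraints {
  return matchers.reduce((acc, m) => ({ ...acc, ...m(header) }), {});
}
```

Adding a new constraint type is then one new matcher, with no risk of reordering an else-if chain.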

3. Footnotes

Two-pass design. During the main parse, every TableContext records the superscript symbols found in that row's footnoteRefs[]. After the main pipeline, a separate pass scans the footer region of the HTML for footnote definitions, which typically appear as a <p> or <ul> below the last requirements table. Each definition is stored in a FootnoteMap keyed by symbol, then back-linked onto the corresponding RequirementItem.footnotes[] arrays.

Some footnotes carry conditional logic that can't be expressed in SectionConstraints, things like "this course may be waived if you placed out of MATH140" or "not available to students who completed CMSC216 before Fall 2022." These are stored as prose_rule blocks and surfaced to the student as informational text alongside the relevant requirement.
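The back-linking step of the second pass is mechanically simple; a minimal sketch, assuming the field names from the description above:

```typescript
// Second pass: link footer definitions back onto requirement items by symbol.
interface Item {
  courseId: string;
  footnoteRefs: string[]; // superscript symbols recorded during the main parse
  footnotes: string[];    // resolved definition text, filled in here
}

function linkFootnotes(items: Item[], footerDefs: Array<[string, string]>): void {
  const footnoteMap = new Map(footerDefs); // symbol -> definition text
  for (const item of items) {
    for (const sym of item.footnoteRefs) {
      const def = footnoteMap.get(sym);
      if (def) item.footnotes.push(def);
      // A dangling ref (no matching definition) is left alone here;
      // the validation stage reports it as a warning.
    }
  }
}
```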

4. Specializations

Some programs define named tracks where a subset of requirements differs from the base definition. CS has Machine Learning, Data Science, and Cybersecurity tracks; each overrides a different slice of the upper-level elective requirements. The pipeline detects them by scanning section headers for keywords: Specialization, Track, Option, Concentration.

Each detected specialization gets its own independent sections[] subtree, parsed from the DOM zone between that specialization's heading and the next. The Specialization object carries its own sections, footnotes, and totalCredits, fully independent of the base program. When a student selects a track, the evaluator loads that specialization's tree alongside (or in place of) the relevant base-program blocks.

5. Validate + Ingest

Each parsed program runs through validateProgram(), which produces a ValidationReport: an errors[] array of hard blockers and a warnings[] array bucketed by severity (high, medium, low). Hard errors — a program with no parsed sections at all — stop the ingest. Warnings cover structural gaps: credit sum significantly below the catalog's stated total, dangling footnote references with no matching definition, empty areas inside an area_selection section, or a specialization heading that produced no requirement sections. --fail-on-blockers maps high-severity findings to a non-zero exit code for CI.

The regression baseline is a structural JSON snapshot of each program (block count by section type, section hierarchy, credit totals per section) rather than a text diff of the raw HTML. This avoids false positives from UMD's cosmetic catalog updates while catching meaningful structural changes. Each CI run diffs the fresh parse against the baseline and reports any divergence before a single database write happens.

ingestAllPrograms.ts discovers program URLs by parsing UMD's catalog sitemap XML. It deduplicates by program code (some programs appear under multiple catalog URLs) and writes to three tables: programs, requirement_blocks, and requirement_items. Runtime flags include --concurrency 8 for scrape-and-parse parallelism, --fail-on-blockers for CI exit codes, --dry-run to validate without writing, and term-year targeting for section data.

Requirement Data Model

The scraper's output is a Program object with a sections[] array of RequirementSection nodes, each classified by section type. This is the scraper-side representation, separate from the DB and evaluator schema. A separate ingest step maps scraper output into the database. The scraper section types are:

Section Types

  • fixed_courses: all listed courses required, no choices
  • choice_group: select N courses from a defined list
  • area_selection: pick across labeled sub-areas (e.g. CS upper-level electives)
  • concentration: outside-department block
  • free_electives: any courses satisfying credit/level rules
  • prose_rule: constraint expressible only as text

SectionConstraints Object

Each section carries a SectionConstraints object capturing:

  • levelRange: e.g. "300-400"
  • departmentConstraint: e.g. "outside CMSC"
  • minAreas / maxCoursesPerArea
  • minCredits / maxCredits
  • gpaRequirement
  • additionalRules[]

These constraints come directly from parsing catalog prose, which means the evaluator doesn't need to hard-code any program-specific logic. A new program just needs to scrape cleanly and the rest follows.

TypeScript Type System

The type system acts as the contract between the three main subsystems. Each pipeline stage has its own output type, so a type error at a boundary catches a misunderstanding about data shape before it becomes a runtime bug.

Pipeline Interfaces

  • TableContext — raw parse output per row: rowType, courseId, credits, footnoteRefs[], sectionContext
  • RequirementSection — semantic intermediate: a named block with a typed section kind and a resolved SectionConstraints object, before Postgres
  • RequirementBlock — DB-ready node: adds id, parentRequirementId, programId, and sortOrder for storage and tree reconstruction
  • RequirementItem — leaf node: courseId, credits, isOrAlternative, footnotes[]

Scraper SectionType vs. DB NodeType

These are two separate type systems. The scraper's SectionType is a semantic classification of catalog intent: fixed_courses, choice_group, area_selection, concentration, free_electives, prose_rule. The DB and evaluator use a structural RequirementNodeType enum: AND_GROUP, OR_GROUP, COURSE, GEN_ED, WILDCARD. The ingest step maps between them. Keeping the schemas separate means the scraper's classification logic can evolve without touching the evaluator.
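One plausible shape for that ingest-time mapping, sketched below. The actual correspondence between the two enums is my assumption; the article only states that the mapping exists.

```typescript
// Assumed mapping from the scraper's semantic SectionType to the DB's
// structural RequirementNodeType. The real table may differ.
type SectionType =
  | "fixed_courses" | "choice_group" | "area_selection"
  | "concentration" | "free_electives" | "prose_rule";

type RequirementNodeType = "AND_GROUP" | "OR_GROUP" | "COURSE" | "GEN_ED" | "WILDCARD";

function toNodeType(section: SectionType): RequirementNodeType {
  switch (section) {
    case "fixed_courses": return "AND_GROUP";  // every child required
    case "choice_group": return "OR_GROUP";    // select N of the children
    case "area_selection": return "OR_GROUP";  // choice across labeled areas
    case "concentration": return "AND_GROUP";  // outside-department block
    case "free_electives": return "WILDCARD";  // any course meeting credit/level rules
    case "prose_rule": return "WILDCARD";      // no structural matching possible
  }
}
```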

SectionConstraints

An all-optional bag type. Every constraint the semantic pass can extract from catalog prose maps to a field here. The evaluator checks only the fields that are set.

  • minCourses / maxCourses — selectability constraint
  • minCredits / maxCredits — credit range
  • levelRange — e.g. "300-499"
  • departmentConstraint — e.g. "outside CMSC"
  • minAreas / maxCoursesPerArea — area spread rules
  • gpaRequirement — minimum GPA for the section
  • additionalRules[] — prose that was parsed but doesn't fit a structured field
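As a type, the bag looks roughly like this, together with a checker illustrating the "evaluate only what's set" rule. The `SectionProgress` shape and `meetsConstraints` name are illustrative, not from the codebase.

```typescript
// All-optional bag type: every field the semantic pass can extract.
interface SectionConstraints {
  minCourses?: number;
  maxCourses?: number;
  minCredits?: number;
  maxCredits?: number;
  levelRange?: string;           // e.g. "300-499"
  departmentConstraint?: string; // e.g. "outside CMSC"
  minAreas?: number;
  maxCoursesPerArea?: number;
  gpaRequirement?: number;
  additionalRules?: string[];
}

interface SectionProgress { courses: number; credits: number; areas: number }

function meetsConstraints(c: SectionConstraints, p: SectionProgress): boolean {
  // Unset fields impose no requirement; only set fields are checked.
  if (c.minCourses !== undefined && p.courses < c.minCourses) return false;
  if (c.minCredits !== undefined && p.credits < c.minCredits) return false;
  if (c.minAreas !== undefined && p.areas < c.minAreas) return false;
  return true;
}
```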

Evaluator Result Types

  • BlockEvaluationResultV2 — satisfied, usedCourses[], remainingCourses, remainingCredits, messages[], children[], overrideApplied
  • Returned as BlockEvaluationResultV2[] — one per root block. The tree structure mirrors the block tree so the UI can walk it to render nested progress.
  • Companion types: RequirementBlockV2, RequirementItemV2, StudentCourseV2 — the V2 suffix marks the current schema generation throughout the evaluator.

Frontend Architecture

State Management

Zustand manages the course schedule builder: coursePlannerStore owns search input, active filters, section selections, drag state, calendar layout, and schedule persistence. Because this feature has the most co-located, high-frequency state, a single store with fine-grained selectors keeps re-renders contained. Auth state comes directly from Supabase's session listener. Degree audit and four-year plan data loads per-page from the repository layer.

coursePlannerStore (Zustand)

Search input, normalized query, filter state, section selections, hovered card, calendar meetings, conflict indexes, and schedule save/load. Also holds the active term and year so filter and search changes stay synchronized without prop drilling through the schedule builder's component tree.

DemoModeContext (React Context)

App-wide demo toggle exposed via useDemoMode(). The Context holds the UI state (isDemo and a toggle callback). The actual data interception happens separately via a standalone isDemoMode() module function that repositories and page code call directly.

Demo Mode

Demo mode is split across two layers. DemoModeContext is the React UI layer: it exposes useDemoMode() for components that need to show a demo indicator or toggle. The data layer is a standalone isDemoMode() module function that repositories and page code call directly to decide whether to return fixture data or make a live Supabase call.

Toggling demo mode triggers a window.location.assign() full-page reload. This is intentional: it ensures the flag is read fresh at every call site and no stale live data leaks into a demo session, or vice versa. The fixtures are typed to the same interfaces as live data responses, so they can't drift without a type error.

Drag-and-Drop

There are two drag surfaces. The four-year plan uses React's native DragEvent API: each planned-term column is a drop target and each course card is draggable. React DnD (the library) is used in the course schedule builder, where its more managed drag lifecycle fits the builder's conflict-tracking and calendar-layout requirements better.

Using different implementations for the two surfaces was a deliberate trade-off. The four-year plan's interactions are relatively simple (move a course between terms), and the native API is enough. The schedule builder's interactions are more complex, and React DnD's declarative model reduces the amount of drag-state bookkeeping that would otherwise live in the component.

The Requirement Evaluator

v2Evaluator.ts reconstructs the block tree in memory using three Maps: blocksById, childrenByParentId (sorted by sortOrder), and itemsByBlockId. It then recursively walks the tree. A block is satisfied iff its leaf course items are satisfied and all recursive children are also satisfied.

Course matching uses a coursePartsAreEquivalent function that handles cross-listed codes and honors equivalencies, so CMSC131H correctly satisfies a slot that lists CMSC131, and a course double-counted across two programs is tracked per-block so credit totals don't inflate.

This evaluation runs live on every plan edit. Change a course in the Four-Year Plan, and the evaluator re-walks every block tree for every active program, updating completion percentages, action items, and the Suggestions ranking simultaneously, all in the browser.

Tree Reconstruction

Reconstruction is two passes over the flat database rows. The first pass populates the three Maps in O(n) time. The second pass identifies root blocks (those with no parentRequirementId) and begins the recursive walk from each. This keeps reconstruction O(n) even for programs with deep specialization subtrees, and the same three-Map structure works whether the program has 20 blocks or 200.
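The index-building pass can be sketched like this (field names follow the article; the function name and return shape are mine):

```typescript
// First pass of tree reconstruction: build the lookup Maps in O(n),
// then sort each sibling list by sortOrder for a stable walk.
interface BlockRow {
  id: string;
  parentRequirementId: string | null;
  sortOrder: number;
}

function buildIndexes(rows: BlockRow[]) {
  const blocksById = new Map<string, BlockRow>(rows.map((r) => [r.id, r]));
  const childrenByParentId = new Map<string, BlockRow[]>();
  for (const row of rows) {
    if (row.parentRequirementId !== null) {
      const siblings = childrenByParentId.get(row.parentRequirementId) ?? [];
      siblings.push(row);
      childrenByParentId.set(row.parentRequirementId, siblings);
    }
  }
  for (const siblings of childrenByParentId.values()) {
    siblings.sort((a, b) => a.sortOrder - b.sortOrder);
  }
  // Root blocks have no parent; the recursive walk starts from each of these.
  const roots = rows.filter((r) => r.parentRequirementId === null);
  return { blocksById, childrenByParentId, roots };
}
```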

Course Matching

Leaf evaluation calls coursePartsAreEquivalent and normalizeSubjectCode from a dedicated equivalency module. This handles honors suffixes (CMSC131H satisfies a CMSC131 slot), cross-listed codes, and subject-code normalization across sources. Matching logic is isolated so it can be updated or tested independently of the tree walk.
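A minimal version of the suffix and whitespace handling might look like the sketch below. The real coursePartsAreEquivalent also handles cross-listings and equivalency tables; this covers only the normalization idea, and the function names here are simplified stand-ins.

```typescript
// Normalize a course code so variants compare equal:
// uppercase, strip whitespace, drop a trailing honors "H" suffix.
function normalizeCourseId(id: string): string {
  return id.toUpperCase().replace(/\s+/g, "").replace(/H$/, "");
}

function coursesAreEquivalent(a: string, b: string): boolean {
  return normalizeCourseId(a) === normalizeCourseId(b);
}
```

Isolating this in its own module, as the article describes, means a new equivalency rule is a change to one function with its own tests.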

Pure Function Design

evaluateProgramRequirementsV2 and evaluateBlock are pure: given the same block tree and student courses, they always return the same result. No side effects, no shared mutable state. This makes the evaluator straightforward to unit test with fixture transcripts, and means re-evaluation on a plan edit is just calling the function again.

Performance

All evaluation runs in memory on the client. No server round-trip means zero network latency between a plan edit and an updated audit. For the largest plans, a full tree walk runs in single-digit milliseconds in the browser. The rendering bottleneck is the DOM update from the result, not the evaluation itself.

Data Sync

Course and section data comes from three external APIs: umd.io, JupiterP, and PlanetTerp. A Node.js sync worker SHA-256-fingerprints each course to detect changes and drive incremental updates, so a full re-scrape isn't needed just because one section's seat count changed. It supports --dry-run, --incremental, --force-full, and term/year targeting.
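The fingerprinting idea, sketched with Node's built-in crypto module (the canonicalization strategy here is an assumption; the sync worker's actual serialization may differ):

```typescript
import { createHash } from "node:crypto";

// Fingerprint a course record for change detection. Keys are sorted via the
// JSON.stringify replacer array so semantically identical records always
// serialize (and therefore hash) identically, regardless of key order.
function fingerprintCourse(course: Record<string, unknown>): string {
  const canonical = JSON.stringify(course, Object.keys(course).sort());
  return createHash("sha256").update(canonical).digest("hex");
}

// A record is re-written only when its fingerprint differs from the stored one.
function needsUpdate(stored: string | undefined, course: Record<string, unknown>): boolean {
  return stored !== fingerprintCourse(course);
}
```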

The Postgres schema (orbit schema) includes a course_search_index view with both full-text tsvector and pg_trgm trigram indexes for fast fuzzy course lookup, the kind of "I sort of remember the course name" search that students actually need.

Testing Methodology

The three main subsystems have different testing strategies because they have different failure modes. The evaluator is pure computation and tests straightforwardly. The scraper operates against an external source that changes under it. The frontend holds user state that has to survive multi-step interaction sequences.

Evaluator Unit Tests (Vitest)

Because v2Evaluator.ts is a set of pure functions (block tree in, result tree out), unit tests are straightforward. Each test constructs a minimal block tree and a StudentCourseV2[] fixture and asserts on the BlockEvaluationResultV2 shape. Edge cases covered include partial completion, honors equivalencies, cross-listed courses, and specialization subtree loading. No database or browser required.

Component Tests (Testing Library)

The audit display, plan kanban, and course search components are tested with React Testing Library. Tests drive the component through interaction sequences (adding a course, dragging to a different semester, opening the audit panel) and assert on visible output. The demo fixture layer doubles as the test fixture layer, so component tests run without a live database.

Scraper Regression Corpus

100+ program parse results are committed to the repo as JSON snapshots. Each snapshot records the structural signature of the parsed program: block count by section type, section hierarchy, and credit totals per section. Not raw HTML, so cosmetic catalog updates don't produce false positives, but any structural change that would affect student audits shows up as a diff and triggers review before it reaches production.

CI Pipeline (GitHub Actions)

A scheduled workflow runs the full scraper against all program URLs, diffs results against the corpus, and fails if any program introduces a blocker or drops below the quality score threshold. A separate workflow runs type checking and unit tests on every pull request. When the scheduled scraper flags a program in the warning band, it opens a GitHub issue automatically so nothing silently rots between deployments.

Biggest Challenges

Catalog Inconsistency

The UMD catalog is consistent enough to look automatable and inconsistent enough to break every early version of the scraper. The solution was a multi-stage pipeline with explicit semantic inference, plus a regression corpus that diffs each scrape against a quality-scored snapshot baseline. Anything below threshold gets flagged; nothing silently breaks. The scraper currently handles 100+ programs.

Cross-Program Double-Counting

When a course satisfies requirements in two different programs, the evaluator needs to record both satisfactions without inflating the shared credit totals. The block tree model handles this cleanly. Each block tracks its own credit count independently, so double-counting is visible and correct at every level.

Demo Mode

OrbitUMD needs to be demonstrable to advisors and faculty who don't have UMD credentials. Demo mode pre-loads a sample student profile, a synthetic transcript, and a set of planned courses, the full planning experience, with no live database calls. Implemented as an app-wide flag (DemoModeContext on the UI side, the standalone isDemoMode() check in the data layer) that swaps every data source to a local fixture layer.

Live Audit Performance

Re-evaluating multiple full program trees on every plan edit could easily become slow. The evaluator mitigates this by caching intermediate block results where the subtree hasn't changed, and by doing all computation in-memory on the client rather than round-tripping to the server on every interaction.

What I'd Do Next

  • Generalize the scraper: the pipeline architecture is already clean enough to adapt to other university catalog formats. UC system next, then Big Ten.
  • Collaborative planning: share a plan with an advisor, co-edit with a friend who's on the same degree track.
  • Prerequisite chain visualization: show a directed graph of what unlocks what, so a student can see why MATH141 is a bottleneck for half their plan.
  • Testudo / iCalendar export: take a finalized semester plan and push it directly into registration, or export it as a calendar so class times are visible alongside deadlines.