For the Nerds · Technical Deep Dive
OrbitUMD: Under the Hood
If you landed here because the main page made you think "okay, but how does it actually work," welcome. This is where I get to talk about the parts I found genuinely interesting to build. The live app runs at orbitumd.com.
Overview
At its core, OrbitUMD is three systems bolted together cleanly. The catalog scraper converts UMD's dense, inconsistent HTML course catalog into a structured tree of degree requirements. The requirement evaluator walks that tree against a student's transcript and planned schedule to produce a live audit. And the data sync layer keeps course, section, and meeting data fresh by pulling from three external APIs incrementally.
The frontend is React 18 + TypeScript + Vite, with Zustand for course schedule builder state, React Context for app-wide toggles, and a UI built on Tailwind + Radix. The backend is Supabase: Postgres with Row Level Security, auth via Google OAuth, and migration-first schema management. All data engineering runs in Node.js.
Tech Stack
| Layer | Technology |
|---|---|
| Frontend | React 18, TypeScript, Vite, React Router, Zustand, Tailwind CSS, Radix UI, React DnD, Recharts |
| Backend / DB | Supabase, Postgres (migration-first, orbit schema), Row Level Security |
| Auth | Supabase Auth (Google OAuth) |
| Data Engineering | Node.js, Cheerio (scraper), Python + Jupyter (early prototyping), pg driver |
| Testing | Vitest, Testing Library, regression corpus pipeline with quality scoring |
The Catalog Scraper
UMD's course catalog is structured HTML, not an API. Every major and minor has a requirements page that uses headings, tables, paragraphs, and footnote superscripts in a way that looks automatable from a distance and breaks every naive assumption up close. The scraper runs five sequential stages, each consuming the output of the last.
Fetch + Parse
The catalog pages have no reliable CSS class names to select against, so the traversal uses
Cheerio's DOM API to walk sibling elements inside the requirements region and classify each
one by heuristic. A row containing an <a> whose text matches a
department-code pattern (e.g. CMSC 131) is a course row. A row whose
normalized text starts with "or" is an or-alternative. A bold or heading element
with no course link is a section-header. A row mentioning "total" plus a credit
count is a total row.
Each classified row becomes a TableContext object: rowType, courseId, courseName, credits (a number or a range), rawText, footnoteRefs (superscript symbols found in the row), and sectionContext (the nearest heading this row falls under). Whitespace
normalization runs before classification: Unicode non-breaking spaces (\u00A0)
and en-dash credit separators are replaced so downstream patterns have a consistent surface.
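The classification heuristic is small enough to sketch. This is an illustrative reconstruction, not the shipped code: the row types and helper names (classifyRow, normalizeRowText) are mine, and the real traversal reads signals like "contains a course link" off the Cheerio node rather than taking them as an options object.

```typescript
type RowType = "course" | "or-alternative" | "section-header" | "total" | "unknown";

// Normalize before classifying: non-breaking spaces and en-dash credit
// separators are replaced so one set of patterns matches every page.
function normalizeRowText(raw: string): string {
  return raw.replace(/\u00A0/g, " ").replace(/\u2013/g, "-").trim();
}

// Department-code pattern, e.g. "CMSC 131" or "CMSC131H".
const COURSE_CODE = /\b[A-Z]{4}\s?\d{3}[A-Z]?\b/;

function classifyRow(
  rawText: string,
  opts: { hasCourseLink?: boolean; isHeading?: boolean } = {},
): RowType {
  const text = normalizeRowText(rawText);
  if (/^or\b/i.test(text)) return "or-alternative";       // row starts with "or"
  if (/\btotal\b/i.test(text) && /\d+/.test(text)) return "total"; // "total" + credit count
  if (opts.isHeading && !opts.hasCourseLink) return "section-header";
  if (COURSE_CODE.test(text)) return "course";
  return "unknown";
}
```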
Semantic Pass
A stateful accumulator iterates over TableContext[] and produces RequirementSection[]. The accumulator tracks the current section label, a credit
accumulator for the section, a running area index, and a buffer of pending or-alternative rows.
When the accumulator sees an or-alternative row after one or more course rows, it retroactively converts the last course row plus all buffered
alternatives into a single ChoiceGroup. This matches the catalog's layout,
where alternatives appear as subsequent rows beneath the primary option rather than being
grouped by any HTML structure.
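The retroactive conversion can be sketched with simplified row and node shapes. The real accumulator works on TableContext and RequirementSection; these stand-ins are mine.

```typescript
interface Row { rowType: "course" | "or-alternative"; courseId: string; }
type Node =
  | { kind: "course"; courseId: string }
  | { kind: "choice_group"; options: string[] };

function groupAlternatives(rows: Row[]): Node[] {
  const out: Node[] = [];
  for (const row of rows) {
    if (row.rowType === "or-alternative" && out.length > 0) {
      const prev = out[out.length - 1];
      if (prev.kind === "course") {
        // Retroactively convert the preceding course row plus this
        // alternative into a single ChoiceGroup.
        out[out.length - 1] = { kind: "choice_group", options: [prev.courseId, row.courseId] };
      } else {
        // Already a ChoiceGroup: append another alternative.
        prev.options.push(row.courseId);
      }
    } else {
      out.push({ kind: "course", courseId: row.courseId });
    }
  }
  return out;
}
```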
Section type inference: a header matching "Area N:" or "Group N:" produces
an area_selection. A flat list with a "select N" constraint becomes a choice_group. A list with no selectability constraint is fixed_courses.
A section with only a credit count and level requirement is free_electives.
Constraint prose is parsed with a set of independent pattern matchers rather than a chain of
else-ifs, so multiple constraints can be extracted from the same header string. A header like "Select five 400-level courses from at least three different areas" runs through all
matchers in parallel: one extracts minCourses: 5, one extracts levelRange: "400-499", one extracts minAreas: 3. Each result is
merged into the SectionConstraints for that block.
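A minimal version of the matcher set, with invented helper names and only three of the matchers shown:

```typescript
interface SectionConstraints { minCourses?: number; levelRange?: string; minAreas?: number; }

const WORD_NUMBERS: Record<string, number> = { one: 1, two: 2, three: 3, four: 4, five: 5, six: 6 };

function parseCount(token: string): number | undefined {
  return WORD_NUMBERS[token.toLowerCase()] ?? (/^\d+$/.test(token) ? Number(token) : undefined);
}

type Matcher = (text: string) => Partial<SectionConstraints>;

// Each matcher is independent: it either contributes its one field or
// contributes nothing. No else-if chain, so one header can yield several.
const matchers: Matcher[] = [
  (t) => {
    const m = /select\s+(\w+)/i.exec(t);
    const n = m && parseCount(m[1]);
    return n ? { minCourses: n } : {};
  },
  (t) => {
    const m = /(\d)00-level/i.exec(t);
    return m ? { levelRange: `${m[1]}00-${m[1]}99` } : {};
  },
  (t) => {
    const m = /at least\s+(\w+)\s+different areas/i.exec(t);
    const n = m && parseCount(m[1]);
    return n ? { minAreas: n } : {};
  },
];

function extractConstraints(header: string): SectionConstraints {
  // Run every matcher and merge whatever each one found.
  return matchers.reduce((acc, m) => ({ ...acc, ...m(header) }), {});
}
```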
Footnotes
Two-pass design. During the main parse, every TableContext records the
superscript symbols found in that row's footnoteRefs[]. After the main pipeline,
a separate pass scans the footer region of the HTML for footnote definitions, which typically
appear as a <p> or <ul> below the last requirements
table. Each definition is stored in a FootnoteMap keyed by symbol, then
back-linked onto the corresponding RequirementItem.footnotes[] arrays.
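The back-linking pass, sketched with simplified shapes. Returning the dangling references is my framing of how the "dangling footnote reference" validation warning could be fed; the real items are RequirementItem nodes.

```typescript
interface Item { courseId: string; footnoteRefs: string[]; footnotes: string[]; }
type FootnoteMap = Map<string, string>; // superscript symbol -> definition text

function linkFootnotes(items: Item[], definitions: FootnoteMap): string[] {
  const dangling: string[] = [];
  for (const item of items) {
    for (const ref of item.footnoteRefs) {
      const def = definitions.get(ref);
      if (def) {
        item.footnotes.push(def); // back-link the definition onto the item
      } else {
        dangling.push(`${item.courseId}: ${ref}`); // no matching definition found
      }
    }
  }
  return dangling;
}
```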
Some footnotes carry conditional logic that can't be expressed in SectionConstraints,
things like "this course may be waived if you placed out of MATH140" or "not available to
students who completed CMSC216 before Fall 2022." These are stored as prose_rule blocks and surfaced to the student as informational text alongside the relevant requirement.
Specializations
Some programs define named tracks where a subset of requirements differs from the base definition. CS has Machine Learning, Data Science, and Cybersecurity tracks; each overrides a different slice of the upper-level elective requirements. The pipeline detects them by scanning section headers for keywords: Specialization, Track, Option, Concentration.
Each detected specialization gets its own independent sections[] subtree, parsed
from the DOM zone between that specialization's heading and the next. The Specialization object carries its own sections, footnotes, and totalCredits, fully independent of the base program.
When a student selects a track, the evaluator loads that specialization's tree alongside (or
in place of) the relevant base-program blocks.
Validate + Ingest
Each parsed program runs through validateProgram(), which produces a ValidationReport: an errors[] array of hard blockers and a warnings[] array bucketed by severity (high, medium, low). Hard errors — a program with no parsed sections at all — stop the ingest.
Warnings cover structural gaps: credit sum significantly below the catalog's stated total,
dangling footnote references with no matching definition, empty areas inside an area_selection section, or a specialization heading that produced no requirement
sections. --fail-on-blockers maps high-severity findings to a non-zero exit code
for CI.
The regression baseline is a structural JSON snapshot of each program (block count by section type, section hierarchy, credit totals per section) rather than a text diff of the raw HTML. This avoids false positives from UMD's cosmetic catalog updates while catching meaningful structural changes. Each CI run diffs the fresh parse against the baseline and reports any divergence before a single database write happens.
ingestAllPrograms.ts discovers program URLs by parsing UMD's catalog sitemap XML.
It deduplicates by program code (some programs appear under multiple catalog URLs) and writes
to three tables: programs, requirement_blocks, and requirement_items. Runtime flags include --concurrency 8 for
scrape-and-parse parallelism, --fail-on-blockers for CI exit codes, --dry-run to validate without writing, and term-year targeting for section data.
Requirement Data Model
The scraper's output is a Program object with a sections[] array of RequirementSection nodes, each classified by section type. This is the scraper-side
representation, separate from the DB and evaluator schema. A separate ingest step maps scraper
output into the database. The scraper section types are:
Section Types
- fixed_courses: all listed courses required, no choices
- choice_group: select N courses from a defined list
- area_selection: pick across labeled sub-areas (e.g. CS upper-level electives)
- concentration: outside-department block
- free_electives: any courses satisfying credit/level rules
- prose_rule: constraint expressible only as text
SectionConstraints Object
Each section carries a SectionConstraints object capturing:
- levelRange: e.g. "300-400"
- departmentConstraint: e.g. "outside CMSC"
- minAreas / maxCoursesPerArea
- minCredits / maxCredits
- gpaRequirement
- additionalRules[]
These constraints come directly from parsing catalog prose, which means the evaluator doesn't need to hard-code any program-specific logic. A new program just needs to scrape cleanly and the rest follows.
TypeScript Type System
The type system acts as the contract between the three main subsystems. Each pipeline stage has its own output type, so a type error at a boundary catches a misunderstanding about data shape before it becomes a runtime bug.
Pipeline Interfaces
- TableContext — raw parse output per row: rowType, courseId, credits, footnoteRefs[], sectionContext
- RequirementSection — semantic intermediate: a named block with a typed section kind and a resolved SectionConstraints object, before Postgres
- RequirementBlock — DB-ready node: adds id, parentRequirementId, programId, and sortOrder for storage and tree reconstruction
- RequirementItem — leaf node: courseId, credits, isOrAlternative, footnotes[]
Scraper SectionType vs. DB NodeType
These are two separate type systems. The scraper's SectionType is a semantic
classification of catalog intent: fixed_courses, choice_group, area_selection, concentration, free_electives, prose_rule. The DB and evaluator use a structural RequirementNodeType enum: AND_GROUP, OR_GROUP, COURSE, GEN_ED, WILDCARD. The ingest step maps between them. Keeping the schemas separate means
the scraper's classification logic can evolve without touching the evaluator.
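One plausible top-level correspondence between the two enums. The writeup doesn't spell out the actual ingest mapping, so treat the case arms below as guesses; leaf courses presumably become COURSE nodes at the item level rather than through this function.

```typescript
type SectionType =
  | "fixed_courses" | "choice_group" | "area_selection"
  | "concentration" | "free_electives" | "prose_rule";
type RequirementNodeType = "AND_GROUP" | "OR_GROUP" | "COURSE" | "GEN_ED" | "WILDCARD";

// Hypothetical mapping: semantic catalog intent -> structural evaluator node.
function toNodeType(section: SectionType): RequirementNodeType {
  switch (section) {
    case "fixed_courses":
      return "AND_GROUP"; // every listed course required
    case "choice_group":
    case "area_selection":
      return "OR_GROUP"; // pick N from a list / across areas
    case "concentration":
    case "free_electives":
    case "prose_rule":
      return "WILDCARD"; // satisfied by credit/level/prose rules, not fixed IDs
  }
}
```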
SectionConstraints
An all-optional bag type. Every constraint the semantic pass can extract from catalog prose maps to a field here. The evaluator checks only the fields that are set.
- minCourses / maxCourses — selectability constraint
- minCredits / maxCredits — credit range
- levelRange — e.g. "300-499"
- departmentConstraint — e.g. "outside CMSC"
- minAreas / maxCoursesPerArea — area spread rules
- gpaRequirement — minimum GPA for the section
- additionalRules[] — prose that was parsed but doesn't fit a structured field
Evaluator Result Types
- BlockEvaluationResultV2 — satisfied, usedCourses[], remainingCourses, remainingCredits, messages[], children[], overrideApplied
- Returned as BlockEvaluationResultV2[] — one per root block. The tree structure mirrors the block tree so the UI can walk it to render nested progress.
- Companion types: RequirementBlockV2, RequirementItemV2, StudentCourseV2 — the V2 suffix marks the current schema generation throughout the evaluator.
Frontend Architecture
State Management
Zustand manages the course schedule builder: coursePlannerStore owns search input,
active filters, section selections, drag state, calendar layout, and schedule persistence.
Because this feature has the most co-located, high-frequency state, a single store with
fine-grained selectors keeps re-renders contained. Auth state comes directly from Supabase's
session listener. Degree audit and four-year plan data loads per-page from the repository layer.
coursePlannerStore (Zustand)
Search input, normalized query, filter state, section selections, hovered card, calendar meetings, conflict indexes, and schedule save/load. Also holds the active term and year so filter and search changes stay synchronized without prop drilling through the schedule builder's component tree.
DemoModeContext (React Context)
App-wide demo toggle exposed via useDemoMode(). The Context holds the UI state
(isDemo and a toggle callback). The actual data interception
happens separately via a standalone isDemoMode() module function that
repositories and page code call directly.
Demo Mode
Demo mode is split across two layers. DemoModeContext is the React UI layer:
it exposes useDemoMode() for components that need to show a demo indicator or
toggle. The data layer is a standalone isDemoMode() module function that
repositories and page code call directly to decide whether to return fixture data or make a
live Supabase call.
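The two-layer split, sketched. The storage key, fixture shape, and fetchPlan wrapper are invented for illustration; the point is only that the data-layer check lives outside React.

```typescript
const DEMO_KEY = "orbit-demo-mode"; // hypothetical localStorage key

// Module-level check: callable from repositories, with no hook rules attached.
function isDemoMode(): boolean {
  return typeof localStorage !== "undefined" && localStorage.getItem(DEMO_KEY) === "1";
}

interface PlanRow { id: string; term: string; courseId: string; }

// Fixtures are typed to the same interface as live responses, so they
// can't drift from the live shape without a type error.
const demoPlan: PlanRow[] = [{ id: "demo-1", term: "Fall 2025", courseId: "CMSC131" }];

async function fetchPlan(fetchLive: () => Promise<PlanRow[]>): Promise<PlanRow[]> {
  // Every repository call branches here: fixture data or live Supabase.
  return isDemoMode() ? demoPlan : fetchLive();
}
```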
Toggling demo mode triggers a window.location.assign() full-page reload. This is
intentional: it ensures the flag is read fresh at every call site and no stale live data leaks
into a demo session, or vice versa. The fixtures are typed to the same interfaces as live data
responses, so they can't drift without a type error.
Drag-and-Drop
There are two drag surfaces. The four-year plan uses React's native DragEvent API:
each planned-term column is a drop target and each course card is draggable. React DnD (the
library) is used in the course schedule builder, where its more managed drag lifecycle fits the
builder's conflict-tracking and calendar-layout requirements better.
Using different implementations for the two surfaces was a deliberate trade-off. The four-year plan's interactions are relatively simple (move a course between terms), and the native API is enough. The schedule builder's interactions are more complex, and React DnD's declarative model reduces the amount of drag-state bookkeeping that would otherwise live in the component.
The Requirement Evaluator
v2Evaluator.ts reconstructs the block tree in memory using three Maps: blocksById, childrenByParentId (sorted by sortOrder), and itemsByBlockId. It then recursively walks the tree. A block is satisfied iff its
leaf course items are satisfied and all recursive children are also satisfied.
Course matching uses a coursePartsAreEquivalent function that handles cross-listed
codes and honors equivalencies, so CMSC131H correctly satisfies a slot that lists CMSC131, and a course double-counted across two programs is tracked per-block so
credit totals don't inflate.
This evaluation runs live on every plan edit. Change a course in the Four-Year Plan, and the evaluator re-walks every block tree for every active program, updating completion percentages, action items, and the Suggestions ranking simultaneously, all in the browser.
Tree Reconstruction
Reconstruction is two passes over the flat database rows. The first pass populates the three
Maps in O(n) time. The second pass identifies root blocks (those with no parentRequirementId) and begins the recursive walk from each. This keeps
reconstruction O(n) even for programs with deep specialization subtrees, and the same
three-Map structure works whether the program has 20 blocks or 200.
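In code, the two passes look roughly like this, with the block shape trimmed to the fields the walk needs (illustrative, not the actual v2Evaluator.ts internals):

```typescript
interface Block { id: string; parentRequirementId: string | null; sortOrder: number; }

function indexBlocks(rows: Block[]) {
  const blocksById = new Map<string, Block>();
  const childrenByParentId = new Map<string, Block[]>();
  const roots: Block[] = [];

  // Pass 1: O(n) population of the lookup maps; roots identified as we go.
  for (const block of rows) {
    blocksById.set(block.id, block);
    if (block.parentRequirementId === null) {
      roots.push(block);
    } else {
      const siblings = childrenByParentId.get(block.parentRequirementId) ?? [];
      siblings.push(block);
      childrenByParentId.set(block.parentRequirementId, siblings);
    }
  }

  // Pass 2: order siblings by sortOrder once, so the recursive walk from
  // each root emits a stable tree without re-sorting at every level.
  for (const siblings of childrenByParentId.values()) {
    siblings.sort((a, b) => a.sortOrder - b.sortOrder);
  }
  roots.sort((a, b) => a.sortOrder - b.sortOrder);

  return { blocksById, childrenByParentId, roots };
}
```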
Course Matching
Leaf evaluation calls coursePartsAreEquivalent and normalizeSubjectCode from a dedicated equivalency module. This handles honors
suffixes (CMSC131H satisfies a CMSC131 slot), cross-listed codes,
and subject-code normalization across sources. Matching logic is isolated so it can be
updated or tested independently of the tree walk.
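A hedged sketch of the equivalence check: honors-suffix stripping plus a small cross-listing table. The table entry is invented, and the real module covers more normalization cases than this.

```typescript
// Illustrative cross-listing table; real entries live in the equivalency module.
const CROSS_LISTINGS: Record<string, string> = { AMSC460: "CMSC460" };

function canonicalize(courseId: string): string {
  const id = courseId.toUpperCase().replace(/\s+/g, ""); // "CMSC 131" -> "CMSC131"
  const base = id.replace(/H$/, "");                     // "CMSC131H" -> "CMSC131"
  return CROSS_LISTINGS[base] ?? base;                   // fold cross-listed codes
}

function coursePartsAreEquivalent(a: string, b: string): boolean {
  return canonicalize(a) === canonicalize(b);
}
```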
Pure Function Design
evaluateProgramRequirementsV2 and evaluateBlock are pure: given
the same block tree and student courses, they always return the same result. No side effects,
no shared mutable state. This makes the evaluator straightforward to unit test with fixture
transcripts, and means re-evaluation on a plan edit is just calling the function again.
Performance
All evaluation runs in memory on the client. No server round-trip means zero network latency between a plan edit and an updated audit. For the largest plans, a full tree walk runs in single-digit milliseconds in the browser. The rendering bottleneck is the DOM update from the result, not the evaluation itself.
Data Sync
Course and section data comes from three external APIs: umd.io, JupiterP, and
PlanetTerp. A Node.js sync worker SHA-256-fingerprints each course to detect changes and drive
incremental updates, so a full re-scrape isn't needed just because one section's seat count
changed. It supports --dry-run, --incremental, --force-full,
and term/year targeting.
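The fingerprinting idea, sketched. Which fields feed the hash is my guess; the principle is that anything that should trigger a re-ingest goes in, and volatile per-section data (like seat counts) stays out.

```typescript
import { createHash } from "node:crypto";

interface Course { courseId: string; title: string; credits: number; description: string; }

function fingerprint(course: Course): string {
  // Serialize a fixed field order so the hash is stable across runs.
  const canonical = JSON.stringify([course.courseId, course.title, course.credits, course.description]);
  return createHash("sha256").update(canonical).digest("hex");
}

function hasChanged(course: Course, storedFingerprint: string | undefined): boolean {
  // An undefined stored fingerprint means the course is new: always write it.
  return fingerprint(course) !== storedFingerprint;
}
```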
The Postgres schema (orbit schema) includes a course_search_index view with
both full-text tsvector and pg_trgm trigram indexes for fast fuzzy course
lookup, the kind of "I sort of remember the course name" search that students actually need.
Testing Methodology
The three main subsystems have different testing strategies because they have different failure modes. The evaluator is pure computation and tests straightforwardly. The scraper operates against an external source that changes under it. The frontend holds user state that has to survive multi-step interaction sequences.
Evaluator Unit Tests (Vitest)
Because v2Evaluator.ts is a set of pure functions (block tree in, result tree out),
unit tests are straightforward. Each test constructs a minimal block tree and a StudentCourseV2[] fixture and asserts on the BlockEvaluationResultV2 shape. Edge cases covered include partial completion, honors equivalencies, cross-listed
courses, and specialization subtree loading. No database or browser required.
Component Tests (Testing Library)
The audit display, plan kanban, and course search components are tested with React Testing Library. Tests drive the component through interaction sequences (adding a course, dragging to a different semester, opening the audit panel) and assert on visible output. The demo fixture layer doubles as the test fixture layer, so component tests run without a live database.
Scraper Regression Corpus
100+ program parse results are committed to the repo as JSON snapshots. Each snapshot records the structural signature of the parsed program: block count by section type, section hierarchy, and credit totals per section. Not raw HTML, so cosmetic catalog updates don't produce false positives, but any structural change that would affect student audits shows up as a diff and triggers review before it reaches production.
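A signature along these lines, with illustrative field names, captures the idea: counts and credit totals move only when structure moves, so cosmetic HTML edits leave the baseline untouched.

```typescript
interface Section { sectionType: string; totalCredits: number; }
interface Signature {
  blockCountByType: Record<string, number>;
  creditsByType: Record<string, number>;
}

function structuralSignature(sections: Section[]): Signature {
  const blockCountByType: Record<string, number> = {};
  const creditsByType: Record<string, number> = {};
  for (const s of sections) {
    blockCountByType[s.sectionType] = (blockCountByType[s.sectionType] ?? 0) + 1;
    creditsByType[s.sectionType] = (creditsByType[s.sectionType] ?? 0) + s.totalCredits;
  }
  return { blockCountByType, creditsByType };
}

// The regression check is then a deep comparison against the committed snapshot.
function matchesBaseline(fresh: Signature, baseline: Signature): boolean {
  return JSON.stringify(fresh) === JSON.stringify(baseline);
}
```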
CI Pipeline (GitHub Actions)
A scheduled workflow runs the full scraper against all program URLs, diffs results against the corpus, and fails if any program introduces a blocker or drops below the quality score threshold. A separate workflow runs type checking and unit tests on every pull request. When the scheduled scraper flags a program in the warning band, it opens a GitHub issue automatically so nothing silently rots between deployments.
Biggest Challenges
Catalog Inconsistency
The UMD catalog is consistent enough to look automatable and inconsistent enough to break every early version of the scraper. The solution was a multi-stage pipeline with explicit semantic inference, plus a regression corpus that diffs each scrape against a quality-scored snapshot baseline. Anything below threshold gets flagged; nothing silently breaks. The scraper currently handles 100+ programs.
Cross-Program Double-Counting
When a course satisfies requirements in two different programs, the evaluator needs to record both satisfactions without inflating the shared credit totals. The block tree model handles this cleanly. Each block tracks its own credit count independently, so double-counting is visible and correct at every level.
Demo Mode
OrbitUMD needs to be demonstrable to advisors and faculty who don't have UMD credentials. Demo mode pre-loads a sample student profile, a synthetic transcript, and a set of planned courses, the full planning experience, with no live database calls. Implemented as the two-layer flag described above, a Context-exposed toggle plus a standalone isDemoMode() check, that swaps every data source to a local fixture layer.
Live Audit Performance
Re-evaluating multiple full program trees on every plan edit could easily become slow. The evaluator mitigates this by caching intermediate block results where the subtree hasn't changed, and by doing all computation in-memory on the client rather than round-tripping to the server on every interaction.
What I'd Do Next
- Generalize the scraper: the pipeline architecture is already clean enough to adapt to other university catalog formats. UC system next, then Big Ten.
- Collaborative planning: share a plan with an advisor, co-edit with a friend who's on the same degree track.
- Prerequisite chain visualization: show a directed graph of what unlocks what, so a student can see why MATH141 is a bottleneck for half their plan.
- Testudo / iCalendar export: take a finalized semester plan and push it directly into registration, or export it as a calendar so class times are visible alongside deadlines.