Linguistic Taxonomies of STEM Programs: Parsing Course Descriptions
This post is a data component of the work going into Intentional Curriculum Design: A Data-Centric Approach for STEM Curricula. It contributes to the section defining The Curricular Landscape, where I characterize undergraduate STEM programs at VCU and across the Commonwealth of Virginia, using a quantitative linguistic approach to measure contemporary and historical relationships among academic programs.
Data Acquisition
The first step in developing a workflow for parsing linguistic constituents of academic programs is getting the data into R. I contacted SCHEV (the State Council of Higher Education for Virginia), but they could not provide access to course descriptions and program requirements. OK.
So, I thought I'd start at the university where I work, Virginia Commonwealth University (hereafter VCU). My experience is that the institutional data group at VCU primarily functions to provide data for administrators well above the unit level. Moreover, a general workflow that starts with publicly available data, such as published course bulletins, is preferable because there is no way that all the institutions in Virginia will be open to giving me their internal data to analyze. So, a general approach it is.
I started by getting the most recent bulletin from here. This is a 905-page PDF document that (thinking of us data processors who, at some point in the future, may want to access some of its information in an automated format) is beautifully typeset as a dual-column layout... As it turns out, the bit with the list of all the courses on the books (a superset of all the classes taught) is constrained to the last 234 pages. Rather than trying to parse this as a PDF, I did it old school: I opened the PDF in Preview.app (the standard macOS PDF viewer) and did the following:
 Delete all the pages I'm not interested in.
 Copy and paste into a text editor that is smart enough to understand what to do with the data on the clipboard (I'd recommend BBEdit or TextMate if you are a data nerd).
Extracting this 2-column layout to a single text column resulted in 25,862 lines of text in a text file. I have saved it in the raw data folder as text so I can develop the code below without worrying about it. I suspect that the parsing of course bulletins for all programs in the Commonwealth will be on an institution-by-institution basis.
rm( list = ls() )
library(tidyverse)
vcu <- readLines( "./data/raw_bulletin_text.txt")
length( vcu )
[1] 25862
And this is what it looks like.
vcu[1:20]
[1] "VCU Undergraduate Bulletin 2023-24 663"
[2] "UNDERGRADUATE COURSES"
[3] "College of Engineering"
[4] "Biomedical Engineering (EGRB)"
[5] "EGRB 101. Biomedical Engineering Practicum. 2 Hours."
[6] "Semester course; 2 lecture hours. 2 credits. Enrollment is restricted"
[7] "to students in the biomedical engineering department and requires"
[8] "permission of course coordinator. This course involves the introduction of"
[9] "clinical procedures and biomedical devices and technology to biomedical"
[10] "engineering freshmen. Students will tour medical facilities, clinics"
[11] "and hospitals and will participate in medical seminars, workshops"
[12] "and medical rounds. Students will rotate among various programs"
[13] "and facilities including orthopaedics, cardiology, neurology, surgery,"
[14] "otolaryngology, emergency medicine, pharmacy, dentistry, nursing,"
[15] "oncology, physical medicine, ophthalmology, pediatrics and internal"
[16] "medicine."
[17] "EGRB 102. Introduction to Biomedical Engineering. 3 Hours."
[18] "Semester course; 3 lecture hours. 3 credits. Prerequisite: MATH 151,"
[19] "MATH 200, MATH 201 or a satisfactory score on the math placement"
[20] "exam. Biomedical engineering is a multidisciplinary STEM field that"
Parsing Raw Text
So, let's start by going through the file and blanking out lines that contain stuff like:
 VCU Undergraduate Bulletin 2023-24 ... The 663 part is the page number, and that changes, so we'll have to identify the line by a partial match using the function grepl().
 UNDERGRADUATE COURSES.
For both sets of lines, we'll assign an empty string.
vcu[ vcu == "UNDERGRADUATE COURSES" ] <- ""
vcu[ grepl("^VCU Undergraduate Bulletin", vcu )] <- ""
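The difference between the exact comparison and the partial match can be tried on a toy vector first. This is only a sketch: the lines in `toy` are simplified stand-ins for the bulletin text (the second header line and its page number are made up), not real data.

```r
# Toy stand-in for the bulletin text; the header lines are simplified and
# the second page number is invented to show that it varies.
toy <- c(
  "VCU Undergraduate Bulletin 663",
  "UNDERGRADUATE COURSES",
  "EGRB 101. Biomedical Engineering Practicum. 2 Hours.",
  "VCU Undergraduate Bulletin 664"
)

# An exact comparison is enough for the line whose text never changes.
toy[ toy == "UNDERGRADUATE COURSES" ] <- ""

# grepl() returns TRUE for every element containing the pattern;
# the ^ anchors the match to the start of the line.
is_header <- grepl( "^VCU Undergraduate Bulletin", toy )
toy[ is_header ] <- ""

print( toy )
```

Only the course title line should survive; the other three elements become empty strings, which keeps the line numbering of the original file intact.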
OK, let's start by finding all the entries whose line matches the start of a course. I'm looking for lines in the format shown below.
vcu[5]
[1] "EGRB 101. Biomedical Engineering Practicum. 2 Hours."
These have:
 A 4-letter upper case program code, a space, a three-digit class number, and a period.
 The name of the class terminated by a period.
 The number of hours the class has.
This textual pattern can be written as a regular expression. It just happens to be (after some futzing around; I never remember all my regex rules, so I have to look them up as well):
^[A-Z]{4} \d{3}\. .+?\. \d+ Hour(s)?\.$
Where:

 ^[A-Z]{4} : Matches exactly four uppercase letters at the start of the line.
 " " : Matches a space.
 \d{3}\. : Matches exactly three digits followed by a period.
 .+?\. : Matches any character (the course title) lazily until the next period. This is kind of the magic part of the whole thing.
 \d+ : Matches one or more digits (the number of hours).
 Hour(s)? : Matches "Hour" or "Hours" as text. These are case-sensitive, so if you have some entries that are hours and some that are Hours, it will not match the first one.
 \.$ : Ensures the line ends with a period.
In R, this RegEx needs to be escaped (i.e., each backslash must itself be escaped with a second backslash inside a string) as:
class_line <- "^[A-Z]{4} \\d{3}\\. .+?\\. \\d+ Hour(s)?\\.$"
classLines <- which( grepl( class_line, vcu ) )
head( classLines )
[1] 5 17 27 43 52 63
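To see what each piece of the pattern accepts and rejects, it can be run against a few hand-made lines. This is a sketch, not part of the original workflow: the lower-case "hours" line is a hypothetical variant, and `perl = TRUE` is my own choice to make the lazy quantifier behave the same way everywhere.

```r
class_pattern <- "^[A-Z]{4} \\d{3}\\. .+?\\. \\d+ Hour(s)?\\.$"

candidates <- c(
  "EGRB 101. Biomedical Engineering Practicum. 2 Hours.",                   # plural
  "EGRB 104. Introduction to Biomedical Engineering Laboratory. 1 Hour.",   # singular
  "EGRB 101. Biomedical Engineering Practicum. 2 hours.",                   # lower case (hypothetical)
  "Semester course; 2 lecture hours. 2 credits. Enrollment is restricted"   # description line
)

# Case-sensitive match, as used in the post.
matches <- grepl( class_pattern, candidates, perl = TRUE )
print( matches )
# [1]  TRUE  TRUE FALSE FALSE

# ignore.case = TRUE relaxes the whole pattern, including [A-Z].
relaxed <- grepl( class_pattern, candidates, perl = TRUE, ignore.case = TRUE )
print( relaxed )
# [1]  TRUE  TRUE  TRUE FALSE
```

The description line fails either way because it does not start with four letters followed by a space.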
And we can test them as follows:
vcu[ classLines ][1:20]
[1] "EGRB 101. Biomedical Engineering Practicum. 2 Hours."
[2] "EGRB 102. Introduction to Biomedical Engineering. 3 Hours."
[3] "EGRB 104. Introduction to Biomedical Engineering Laboratory. 1 Hour."
[4] "EGRB 105. Successes and Failures in Biomedical Technologies. 3 Hours."
[5] "EGRB 111. Introduction to Biological Systems in Engineering. 3 Hours."
[6] "EGRB 203. Statics and Mechanics of Materials. 3 Hours."
[7] "EGRB 209. Applied Physiology for Biomedical Engineers. 4 Hours."
[8] "EGRB 215. Computational Methods in Biomedical Engineering. 3 Hours."
[9] "EGRB 301. Biomedical Engineering Design Practicum. 3 Hours."
[10] "EGRB 303. Biotransport Processes. 3 Hours."
[11] "EGRB 307. Biomedical Instrumentation. 4 Hours."
[12] "EGRB 308. Biomedical Signal Processing. 4 Hours."
[13] "EGRB 310. Biomechanics. 4 Hours."
[14] "EGRB 315. Device Design Methods. 3 Hours."
[15] "EGRB 401. Biomedical Engineering Senior Design Studio. 3 Hours."
[16] "EGRB 402. Biomedical Engineering Senior Design Studio. 3 Hours."
[17] "EGRB 403. Tissue Engineering. 3 Hours."
[18] "EGRB 405. Finite Element Analysis in Solid Mechanics. 3 Hours."
[19] "EGRB 406. Artificial Organs. 3 Hours."
[20] "EGRB 407. Physical Principles of Medical Imaging. 3 Hours."
Looks pretty good, at least as a first pass through the data. Now, let's go in and create a data.frame with these entries. Looking at the output, the split on periods was not handling titles that contain the term "U.S." (its periods were treated as field separators), so I replaced those with "US" before splitting up the line.
data.frame( Index = classLines, TitleRow = vcu[ classLines ] ) |>
  # fixed() matches the literal text "U.S." rather than treating each . as a wildcard
  mutate( TitleRow = str_replace_all( TitleRow, fixed("U.S."), "US" ) ) |>
  mutate( class = str_split( TitleRow, "\\.", simplify=TRUE)[,1] ) |>
  mutate( title = str_split( TitleRow, "\\.", simplify=TRUE)[,2] ) |>
  mutate( hours = str_split( TitleRow, "\\.", simplify=TRUE)[,3] ) |>
  select( -TitleRow ) -> df
So, this looks pretty good:
head( df )
Index class title hours
1 5 EGRB 101 Biomedical Engineering Practicum 2 Hours
2 17 EGRB 102 Introduction to Biomedical Engineering 3 Hours
3 27 EGRB 104 Introduction to Biomedical Engineering Laboratory 1 Hour
4 43 EGRB 105 Successes and Failures in Biomedical Technologies 3 Hours
5 52 EGRB 111 Introduction to Biological Systems in Engineering 3 Hours
6 63 EGRB 203 Statics and Mechanics of Materials 3 Hours
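As an aside, the three fields could also be pulled out with capture groups instead of splitting on periods, which would leave a title such as "U.S." untouched. This is a minimal base R sketch of that alternative, not the method used in this post; the `HIST 103` row is a hypothetical title invented for illustration.

```r
title_rows <- c(
  "EGRB 101. Biomedical Engineering Practicum. 2 Hours.",
  "HIST 103. Survey of U.S. History. 3 Hours."        # hypothetical course
)

# Three capture groups: class code, title, and hours.
# The title group is greedy, so it backtracks only as far as the final
# ". <digits> Hour(s)." and keeps any periods inside the title.
field_pattern <- "^([A-Z]{4} \\d{3})\\. (.+)\\. (\\d+ Hours?)\\.$"

m <- regmatches( title_rows, regexec( field_pattern, title_rows, perl = TRUE ) )

# Each element of m holds the full match followed by the three groups.
fields <- do.call( rbind, lapply( m, function(x) x[2:4] ) )

parsed <- data.frame( class = fields[,1],
                      title = fields[,2],
                      hours = fields[,3],
                      stringsAsFactors = FALSE )
print( parsed )
```

One difference worth noting: splitting on "\\." leaves a leading space on the title and hours fields (which `trimws()` would remove), whereas the capture groups exclude it.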
So now, I would like to go through it and concatenate each course description into a single item in this data.frame. For this, I will need to loop through the data.frame rows, read each index, find the lines between this index and the next one, and then concatenate them.
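df$description <- NA
for( i in seq_len( nrow(df) ) ) {
  # The description starts on the line after this course's title row
  sIndex <- df$Index[i] + 1
  # and ends on the line before the next title row (or at the end of the file).
  eIndex <- ifelse( i == nrow(df), length( vcu ), ( df$Index[i+1] - 1 ) )
  df$description[i] <- paste( vcu[ sIndex:eIndex ], collapse = " " )
}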
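The start/end bookkeeping in that loop is easy to get off by one, so here is the same idea on a six-line toy vector. The course lines are invented, and the vectorized `mapply()` form is just one possible way to write it, not a replacement for the loop above.

```r
# Invented miniature bulletin: two title rows, each followed by two lines.
toy <- c(
  "AAAA 101. First Course. 3 Hours.",
  "Semester course; 3 lecture hours.",
  "3 credits. First description.",
  "AAAA 102. Second Course. 1 Hour.",
  "Semester course; 1 credit.",
  "Second description."
)

starts <- c( 1, 4 )                            # positions of the title rows
first  <- starts + 1                           # first line of each description
last   <- c( starts[-1] - 1, length( toy ) )   # line before the next title, or end of file

descriptions <- mapply(
  function( s, e ) paste( toy[ s:e ], collapse = " " ),
  first, last
)
print( descriptions )
```

Two caveats follow from this logic: a title row immediately followed by another title row would make `s:e` count backwards, and any section heading sitting between two courses ends up attached to the preceding description.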
Now, let's take a look at that.
head( df$description )
[1] "Semester course; 2 lecture hours. 2 credits. Enrollment is restricted to students in the biomedical engineering department and requires permission of course coordinator. This course involves the introduction of clinical procedures and biomedical devices and technology to biomedical engineering freshmen. Students will tour medical facilities, clinics and hospitals and will participate in medical seminars, workshops and medical rounds. Students will rotate among various programs and facilities including orthopaedics, cardiology, neurology, surgery, otolaryngology, emergency medicine, pharmacy, dentistry, nursing, oncology, physical medicine, ophthalmology, pediatrics and internal medicine."
[2] "Semester course; 3 lecture hours. 3 credits. Prerequisite: MATH 151, MATH 200, MATH 201 or a satisfactory score on the math placement exam. Biomedical engineering is a multidisciplinary STEM field that combines biology and engineering, applying engineering principles and materials to medicine and health care. This course provides students with an introduction to biomedical engineering, beginning with a framework of core engineering principles, expanding to specializations within the field of biomedical engineering and connecting the concepts to realworld examples in medicine and health care."
[3] "Semester course; 3 laboratory hours. 1 credit. Enrollment is restricted to biomedical engineering majors. This laboratory course introduces students to practical laboratory skills required for biomedical engineering. Following successful completion of this course, students will be able to construct and design simple mechanicalelectric prototypes; solder electrical components to a breadboard; construct a bridge measurement circuit in order to measure a physiological signal; use a digital multimeter to analyze a circuit. This course is also a writing intensive course and will provide students with the skills necessary to analyze and write up the results of their experiments. Nontechnical skills that will be introduced in this course include how to set up and maintain a laboratory notebook; record and analyze data in Excel, including how to use Excel formulas, create pivot tables and generate graphs; how to plan and execute an experiment; how to read and write a laboratory report in IMRD format; how to write a design concept paper; oral presentation."
[4] "Semester course; 3 lecture hours. 3 credits. This course will look at successes and failures in biomedical engineering and technologies through case studies, as well as consider the ethical implementations and framework for developing evidencebased reasoning. Origins and recent advances in biomedical engineering and technologies will be explored, including applications of biomechanics, bio and nanotechnologies, medical imaging, rehabilitation engineering and biomaterials."
[5] "Semester course; 3 lecture hours. 3 credits. Prerequisites: MATH 151, MATH 200, MATH 201 or a satisfactory score on the math placement exam; and CHEM 100 with a minimum grade of B, CHEM 101, CHEM 102 or a satisfactory score on the chemistry placement exam. The cell is the principle unit of the human body. In this course, students will explore how the cell works from an engineering perspective. Students will learn the essential functions of cells, the components of cells and terminology related to cell biology. The course will also introduce key concepts in engineering, and students will learn how to apply these concepts to mammalian cells."
[6] "Semester course; 3 lecture hours. 3 credits. Prerequisites: MATH 201 and PHYS 207, both with a minimum grade of C. Enrollment is restricted to biomedical engineering majors. The theory and application of engineering mechanics applied to the design and analysis of rigid and deformable structures. The study of forces and their effects, including equilibrium of two and threedimensional bodies, stress, strain and constitutive relations, bending, torsion, shearing, deflection, and failure of materials."
That looks pretty good. To check my work, I printed out the first 16 characters of every description; if I did things correctly, they should all read Semester course; and they did.
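A check like this can be scripted rather than eyeballed. The sketch below uses three short invented descriptions (the "Continuous course;" entry is hypothetical) in place of `df$description`.

```r
# Stand-ins for df$description; the third entry is a made-up exception.
descriptions <- c(
  "Semester course; 2 lecture hours. 2 credits.",
  "Semester course; 3 laboratory hours. 1 credit.",
  "Continuous course; 3 lecture hours. 3 credits."
)

# "Semester course;" is exactly 16 characters long.
prefixes <- substr( descriptions, 1, 16 )
print( table( prefixes ) )

# Row numbers of any description that does not start as expected.
odd <- which( !startsWith( descriptions, "Semester course;" ) )
print( odd )
# [1] 3
```

Tabulating the prefixes has the side benefit of showing what the exceptions look like, not just that they exist.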
Now, I take the Semester course; part out (by replacing it with an empty string).
df |>
  mutate( description = str_replace( description,
                                     "Semester course; ", "" ) ) -> df
head( df$description )
[1] "2 lecture hours. 2 credits. Enrollment is restricted to students in the biomedical engineering department and requires permission of course coordinator. This course involves the introduction of clinical procedures and biomedical devices and technology to biomedical engineering freshmen. Students will tour medical facilities, clinics and hospitals and will participate in medical seminars, workshops and medical rounds. Students will rotate among various programs and facilities including orthopaedics, cardiology, neurology, surgery, otolaryngology, emergency medicine, pharmacy, dentistry, nursing, oncology, physical medicine, ophthalmology, pediatrics and internal medicine."
[2] "3 lecture hours. 3 credits. Prerequisite: MATH 151, MATH 200, MATH 201 or a satisfactory score on the math placement exam. Biomedical engineering is a multidisciplinary STEM field that combines biology and engineering, applying engineering principles and materials to medicine and health care. This course provides students with an introduction to biomedical engineering, beginning with a framework of core engineering principles, expanding to specializations within the field of biomedical engineering and connecting the concepts to realworld examples in medicine and health care."
[3] "3 laboratory hours. 1 credit. Enrollment is restricted to biomedical engineering majors. This laboratory course introduces students to practical laboratory skills required for biomedical engineering. Following successful completion of this course, students will be able to construct and design simple mechanicalelectric prototypes; solder electrical components to a breadboard; construct a bridge measurement circuit in order to measure a physiological signal; use a digital multimeter to analyze a circuit. This course is also a writing intensive course and will provide students with the skills necessary to analyze and write up the results of their experiments. Nontechnical skills that will be introduced in this course include how to set up and maintain a laboratory notebook; record and analyze data in Excel, including how to use Excel formulas, create pivot tables and generate graphs; how to plan and execute an experiment; how to read and write a laboratory report in IMRD format; how to write a design concept paper; oral presentation."
[4] "3 lecture hours. 3 credits. This course will look at successes and failures in biomedical engineering and technologies through case studies, as well as consider the ethical implementations and framework for developing evidencebased reasoning. Origins and recent advances in biomedical engineering and technologies will be explored, including applications of biomechanics, bio and nanotechnologies, medical imaging, rehabilitation engineering and biomaterials."
[5] "3 lecture hours. 3 credits. Prerequisites: MATH 151, MATH 200, MATH 201 or a satisfactory score on the math placement exam; and CHEM 100 with a minimum grade of B, CHEM 101, CHEM 102 or a satisfactory score on the chemistry placement exam. The cell is the principle unit of the human body. In this course, students will explore how the cell works from an engineering perspective. Students will learn the essential functions of cells, the components of cells and terminology related to cell biology. The course will also introduce key concepts in engineering, and students will learn how to apply these concepts to mammalian cells."
[6] "3 lecture hours. 3 credits. Prerequisites: MATH 201 and PHYS 207, both with a minimum grade of C. Enrollment is restricted to biomedical engineering majors. The theory and application of engineering mechanics applied to the design and analysis of rigid and deformable structures. The study of forces and their effects, including equilibrium of two and threedimensional bodies, stress, strain and constitutive relations, bending, torsion, shearing, deflection, and failure of materials."
OK, that looks pretty good. We should now separate the next set of information into its own columns. We have the following potential parts.
 2 lecture hours.
 2 credits.
 The potential for a Prerequisites: sentence.
 The rest of the description.
So, let's go through this again, line by line, and pull these parts out, saving the components we need and discarding the rest. Since each course description will be used to describe academic programs, we'll need to retain both the contact hours and the credits as potential weights for each course, to be used as a relative fraction of the total credit load for an undergraduate degree (for example). Don't let the brevity of current you throw away data that is potentially useful to future you.
df$contact <- NA
df$credits <- NA
df$prereqs <- NA
df$bulletin <- NA
for( i in seq_len( nrow(df) ) ) {
  raw <- df$description[i]
  parts <- str_split( raw, "\\. ", simplify=TRUE )
  # Check to see if the first entry ends with ' hour' or ' hours'
  if( grepl( pattern = " hours?$", parts[1] ) ) {
    df$contact[i] <- parts[1]
    parts[1] <- ""
  }
  # Check to see if the second entry ends with ' credit' or ' credits'
  if( grepl( pattern = " credits?$", parts[2] ) ) {
    df$credits[i] <- parts[2]
    parts[2] <- ""
  }
  # Check to see if there is a prerequisite(s) sentence
  idx <- which( grepl( "Prerequisites?: ", parts ) )
  if( length( idx ) == 1 ) {
    df$prereqs[i] <- parts[idx]
    parts[idx] <- ""
  }
  # Whatever has not been blanked out is the bulletin text
  df$bulletin[i] <- paste( parts[ 1, nchar(parts[1,]) > 0 ],
                           collapse = ". " )
}
# Remove the raw description to clean it up.
df$description <- NULL
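The per-course logic inside that loop can be traced on a single description. This sketch uses base R `strsplit()` in place of `str_split()` and a shortened version of the EGRB 203 text; it assumes all three leading parts are present, which the loop above does not.

```r
raw <- paste( "3 lecture hours. 3 credits.",
              "Prerequisites: MATH 201 and PHYS 207, both with a minimum grade of C.",
              "Enrollment is restricted to biomedical engineering majors." )

# Split on a literal period followed by a space.
parts <- strsplit( raw, ". ", fixed = TRUE )[[1]]

contact <- parts[ grepl( " hours?$", parts[1] ) ][1]            # "3 lecture hours"
credits <- parts[2][ grepl( " credits?$", parts[2] ) ]          # "3 credits"
idx     <- grep( "^Prerequisites?: ", parts )                   # position of the prerequisite sentence
prereqs <- parts[ idx ]

# Everything not claimed above is the bulletin text.
keep     <- setdiff( seq_along( parts ), c( 1, 2, idx ) )
bulletin <- paste( parts[ keep ], collapse = ". " )
print( bulletin )
```

Because the split is on ". ", any abbreviation followed by a space inside a sentence would also be cut there; re-joining the leftovers with ". " puts those pieces back together in the bulletin text.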
Now that looks like this:
head( df$bulletin )
[1] "Enrollment is restricted to students in the biomedical engineering department and requires permission of course coordinator. This course involves the introduction of clinical procedures and biomedical devices and technology to biomedical engineering freshmen. Students will tour medical facilities, clinics and hospitals and will participate in medical seminars, workshops and medical rounds. Students will rotate among various programs and facilities including orthopaedics, cardiology, neurology, surgery, otolaryngology, emergency medicine, pharmacy, dentistry, nursing, oncology, physical medicine, ophthalmology, pediatrics and internal medicine."
[2] "Biomedical engineering is a multidisciplinary STEM field that combines biology and engineering, applying engineering principles and materials to medicine and health care. This course provides students with an introduction to biomedical engineering, beginning with a framework of core engineering principles, expanding to specializations within the field of biomedical engineering and connecting the concepts to realworld examples in medicine and health care."
[3] "Enrollment is restricted to biomedical engineering majors. This laboratory course introduces students to practical laboratory skills required for biomedical engineering. Following successful completion of this course, students will be able to construct and design simple mechanicalelectric prototypes; solder electrical components to a breadboard; construct a bridge measurement circuit in order to measure a physiological signal; use a digital multimeter to analyze a circuit. This course is also a writing intensive course and will provide students with the skills necessary to analyze and write up the results of their experiments. Nontechnical skills that will be introduced in this course include how to set up and maintain a laboratory notebook; record and analyze data in Excel, including how to use Excel formulas, create pivot tables and generate graphs; how to plan and execute an experiment; how to read and write a laboratory report in IMRD format; how to write a design concept paper; oral presentation."
[4] "This course will look at successes and failures in biomedical engineering and technologies through case studies, as well as consider the ethical implementations and framework for developing evidencebased reasoning. Origins and recent advances in biomedical engineering and technologies will be explored, including applications of biomechanics, bio and nanotechnologies, medical imaging, rehabilitation engineering and biomaterials."
[5] "The cell is the principle unit of the human body. In this course, students will explore how the cell works from an engineering perspective. Students will learn the essential functions of cells, the components of cells and terminology related to cell biology. The course will also introduce key concepts in engineering, and students will learn how to apply these concepts to mammalian cells."
[6] "Enrollment is restricted to biomedical engineering majors. The theory and application of engineering mechanics applied to the design and analysis of rigid and deformable structures. The study of forces and their effects, including equilibrium of two and threedimensional bodies, stress, strain and constitutive relations, bending, torsion, shearing, deflection, and failure of materials."
Excellent. Now, let's save the output to a file for subsequent analysis.
vcu_courses <- df
save( vcu_courses, file="data/vcu_courses.rda")
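Since the credits were retained as potential weights, here is a minimal sketch of how the saved text could later become numbers. The `degree_total` of 120 credits is a hypothetical figure chosen only for illustration, and the input strings are stand-ins for the `credits` column.

```r
# Stand-ins for values in the credits column.
credits_text <- c( "2 credits", "3 credits", "1 credit" )

# Strip the trailing word and convert what is left to a number.
credits <- as.numeric( sub( " credits?$", "", credits_text ) )

degree_total <- 120                 # hypothetical total credits for a degree
weights <- credits / degree_total   # each course as a fraction of the degree

print( data.frame( credits_text, credits, weights ) )
```

Any entry that is not a single number followed by "credit(s)", such as a range of credits, would come back as `NA` here and would need its own rule.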
Next, we will configure the linguistic mapping of the descriptions onto a numerical space for statistical classification.