My name is James Rowan, and I am a rising junior studying math at MIT. I got to spend my summer working as an intern on the analytics team at edX, focusing on the data pipeline. The analytics team provides data and metrics for use by course teams, researchers, and internal teams at edX. My work this summer centered on improving our data warehouse, an HP Vertica database used to generate internal reports for the marketing and executive teams.

Extending the Data Warehouse

My first project was to add information about course subject areas to the data warehouse. While the initial warehouse schema allowed analysts to query for enrollment and activity metrics on a course-by-course basis, there was no way to aggregate and compare courses by subject area. Having course subject area information in the warehouse would allow the marketing team to see which subject areas have the highest verified certificate take rates, for example, while other analysts would be able to ask about engagement patterns by subject area (do math courses tend to see many learners attempting problems, while humanities courses feature more use of discussion forums?).

To perform this task, I first needed to learn about Luigi, the data processing framework that the Open edX pipeline uses. Luigi is an open-source Python data processing framework developed by Spotify that breaks data processing workflows up into a series of tasks, each with its own output, so that dependency handling is simplified and workflows can be restarted after intermediate steps fail. This property of Luigi leads us to aim for workflows that are as modular as possible.
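To make that concrete, here is a minimal sketch of what a pair of Luigi tasks looks like. This is illustrative code, not the actual edX pipeline; the task names and file paths are made up. Each task declares its own output, and Luigi only runs a task whose output is missing, which is what makes restarting a failed workflow cheap.

```python
import luigi


class FetchRawData(luigi.Task):
    """Illustrative task: every Luigi task declares its own output target."""

    def output(self):
        return luigi.LocalTarget("raw_data.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("example record\n")


class TransformData(luigi.Task):
    """Depends on FetchRawData via requires(); Luigi builds the dependency
    graph and skips any task whose output already exists, so a failed run
    can be restarted from the last completed step."""

    def requires(self):
        return FetchRawData()

    def output(self):
        return luigi.LocalTarget("transformed_data.txt")

    def run(self):
        with self.input().open("r") as fin, self.output().open("w") as fout:
            for line in fin:
                fout.write(line.upper())


if __name__ == "__main__":
    # Run the whole (tiny) workflow with the in-process scheduler.
    luigi.build([TransformData()], local_scheduler=True)
```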

The course subject areas workflow has three parts:

  1. Call the course catalog API, which returns a list of all courses marketed on the site, along with information like their subject(s), instructors, and lengths.
  2. Parse the API's output into a tab-separated values file, with each row containing a course ID, a subject area for the course, and some additional information.
  3. Load this subject area data into the Vertica database.

The three distinct phases identified above translated naturally into Luigi tasks; the first two were fairly straightforward to write, but the third was more interesting. While Luigi has pre-built tasks for loading into MySQL databases, there was no pre-built Vertica loading task, so I first had to build a general Vertica bulk-loading Luigi task. This general bulk loader saw use by other team members over the course of the summer, and I enjoyed having ownership of this piece of the pipeline, which will provide the infrastructure for future workflows to further expand our internal analytics warehouse.
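As a rough illustration of the shape of such a loader (not the actual task in edx-analytics-pipeline, which also handles credentials, schema management, and completion markers), a Vertica bulk-load task can stream a TSV into the database with a single COPY statement through the vertica_python client. The connection settings and names below are placeholders.

```python
import luigi
import vertica_python

# Placeholder connection settings; a real pipeline would read these from config.
CONN_INFO = {
    "host": "warehouse.example.com",
    "port": 5433,
    "user": "loader",
    "password": "********",
    "database": "analytics",
}


class CopyTsvToVertica(luigi.Task):
    """Simplified sketch of a generic Vertica bulk loader: stream a
    tab-separated file into an existing table with COPY ... FROM STDIN.
    A production task would also create the target schema and record a
    completion marker so Luigi knows the load has already happened."""

    table = luigi.Parameter()
    tsv_path = luigi.Parameter()

    def run(self):
        with open(self.tsv_path, "rb") as tsv, \
                vertica_python.connect(**CONN_INFO) as conn:
            cur = conn.cursor()
            # COPY FROM STDIN pushes the local file through the client
            # connection in one bulk operation.
            cur.copy(
                "COPY {} FROM STDIN DELIMITER E'\\t' DIRECT".format(self.table),
                tsv,
            )
            conn.commit()
```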

The following graph shows an example of a question analysts can now answer thanks to this project. It shows what percentage of the learners active in courses on edx.org were active in courses of particular subject areas. In the fall of 2014, around ten percent of active learners were active in courses in each of computer science, business, data analysis, and humanities. In 2015, however, a much larger fraction of the active learners in courses on edx.org had activity in computer science courses. In June of 2015, almost half of all active learners on edx.org were active in a computer science course! The other three subject areas shown also saw increased popularity among active users, rising to twenty percent each.

Fraction of active learners active in each subject area over time, showing the increase in popularity of Computer Science, Business & Management, Data Analysis & Statistics, and Humanities courses.

After the subject areas project, I worked on an experimental feature for product analytics.  All user interactions with the platform, from playing videos to attempting problems to editing user profiles, are logged for analysis.  If a new feature is added and starts emitting events, the product team may want to know how frequently it is used, or if it sees disproportionately high use on desktop browsers (suggesting a problem with mobile compatibility).  Instead of having to wait for the analytics team to build out a pipeline workflow for aggregating tracking log information about the new feature, product and other teams should be able to do preliminary analysis on the event logs themselves.

The difficulty with directly querying event logs is that the event logs are semi-structured, being stored in JSON format.  Since different types of events have different attributes, it would be unwieldy to try to make a single table schema for events; instead, I took advantage of Vertica’s flex tables and chose a table schema that has materialized columns (first-class SQL table columns) for the fields common to all or almost all events (things like username, timestamp, and user device type) while still allowing users to query for event-specific fields (things like what timestamp within a video someone is navigating to).  While we are not sure whether this will be our framework for product analytics going forward, it did help us explore the capabilities of our warehouse.
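For a sense of how this works in practice, the sketch below creates a flex table whose real columns hold fields shared by all events, loads JSON events with Vertica's fjsonparser, and then pulls an event-specific field out of the flex map at query time with a cast. The table, column, key, and file names are illustrative only, not our production schema.

```python
import vertica_python

# Placeholder connection settings and names, for illustration only.
CONN_INFO = {
    "host": "warehouse.example.com",
    "port": 5433,
    "user": "analyst",
    "password": "********",
    "database": "analytics",
}

DDL = """
CREATE FLEX TABLE event_records (
    username   VARCHAR,    -- materialized columns for fields shared by all events
    event_time TIMESTAMP,
    agent_type VARCHAR
)
"""

# Event-specific fields stay in the flex table's raw map; they can still be
# queried by key (with a cast) without any schema change.
QUERY = """
SELECT username, "event.code"::VARCHAR AS video_id
FROM event_records
WHERE agent_type = 'desktop'
"""

with vertica_python.connect(**CONN_INFO) as conn:
    cur = conn.cursor()
    cur.execute(DDL)
    # Bulk-load newline-delimited JSON events using Vertica's JSON parser.
    with open("/tmp/tracking.log", "rb") as events:
        cur.copy("COPY event_records FROM STDIN PARSER fjsonparser()", events)
    conn.commit()
    cur.execute(QUERY)
    for username, video_id in cur.fetchall():
        print(username, video_id)
```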

Integrating BI Data Processing into the Pipeline

When I started working at edX this summer, we had basically two separate pipelines: one for the instructor-facing data used in Insights (Open edX’s analytics dashboard) and another for our own internal business intelligence (BI) reporting.  The pipeline for external-facing data is written in Luigi and features batch Hadoop jobs to populate a result store for the data API, while the pipeline for internal reporting uses Pentaho Data Integration to extract data from the LMS and result store, transform it, and load it into the warehouse.  The BI pipeline depended on some results from the main pipeline but could not communicate these dependencies to it, forcing the team to carefully time the execution of the tasks on our Jenkins build server and defeating the purpose of using Luigi for dependency handling in the first place.

The old analytics pipeline architecture, with two disjoint paths of data flow: the internal BI path is completely separate from the external Insights path (diagram from the edx/edx-analytics-pipeline repo).

To reduce the technical debt inherent in maintaining two systems, I was tasked with beginning the transition of the Pentaho tasks into Luigi tasks.  In addition to the clear benefits of transitioning to a single data processing framework, this undertaking also paves the way for further data warehouse improvements.  Since, in this new architecture, all writing to the warehouse is done through the pipeline, new data processing tasks (like my course subject areas task above) will be able to write to the warehouse without having to be manually scheduled around the Pentaho data loading steps.  A unified data processing workflow will also make data validation easier; data validation tasks could be embedded within the processing steps by simply adding Luigi tasks into the unified pipeline.
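As a sketch of what such an embedded check might look like (hypothetical table and task names; the real pipeline may structure this differently), a validation step can be just another Luigi task that runs after a load and fails the workflow if the loaded table looks wrong:

```python
import luigi
import vertica_python

# Placeholder connection settings for illustration.
CONN_INFO = {
    "host": "warehouse.example.com",
    "port": 5433,
    "user": "analyst",
    "password": "********",
    "database": "analytics",
}


class ValidateRowCount(luigi.Task):
    """Hypothetical validation task: fail the run if a freshly loaded
    warehouse table is empty. In the unified pipeline this task would
    simply requires() the corresponding load task, so downstream report
    tasks never run against bad data."""

    table = luigi.Parameter()

    def run(self):
        with vertica_python.connect(**CONN_INFO) as conn:
            cur = conn.cursor()
            cur.execute("SELECT COUNT(*) FROM {}".format(self.table))
            (row_count,) = cur.fetchone()
        if row_count == 0:
            # Raising here marks the task as failed; anything that depends
            # on it will not run.
            raise ValueError("Table {} loaded zero rows".format(self.table))
```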

The internal analytics (BI) architecture after the Pentaho retirement is complete, with the internal reporting system hooked into the same pipeline that serves external analytics.


This project allowed me to work with the pipeline’s complete technology stack.  I got the chance to write tasks interacting with the MySQL LMS databases, the event tracking logs, the Apache Hive intermediate results warehouse, and the finished Vertica data products warehouse, and it was a nice culminating use of the skills I had been acquiring over the course of the summer.

Conclusion

I enjoyed getting the opportunity to work at edX this summer.  This internship taught me a number of technical skills (the Luigi framework, the MapReduce paradigm, SQL, git), and being on the analytics team also enabled me to get firsthand experience with the design of data processing architectures, the maintenance of large codebases, and the development cycle of software and data products.  Learning to work in the context of an agile team is also a broadly applicable skill, and it has led me to think more consciously about group dynamics and project management in non-work environments.

I’d like to thank John Baker, Gabe Mulley, and Brian Wilson for helping me out with technical pipeline questions, the rest of the analytics team for providing a great working environment, and everyone else at edX for making this summer fun.