Big Data in Education (Part 1 of 2)

The following is Part 1 of 2 on data in education.  The first article introduces the concepts of how data is used in education.  The second article looks at recent movements by the Department of Education in data mining, modeling and learning systems.

Learning to Learn

The education industry is transforming into a 21st-century, data-driven enterprise. Metrics-based assessment has been a powerful force that has swept the national education community in response to widespread policy reform. Passed in 2001, the No Child Left Behind Act pushed the idea of standards-based education, whereby schoolteachers and administrators are held accountable for the performance of their students. The law elevated standardized tests and dropout rates as the primary ways officials measure student outcomes and achievement. Underperforming schools can be placed on probation, and if no improvement is seen after 3-4 years, the entire staff of the school can be replaced.

The political ramifications of the law inspire much debate amongst policy analysts.  However, from a data perspective, it is more informative to understand how advances in technology can help educators both meet the policy’s guidelines and work to create better student outcomes.


The emphasis on measurable outcomes has shifted the priorities of schools toward capturing data that links student performance with positive outcomes, from primary through higher education. Positive “outcomes” translate to higher student attendance, improved test scores, and more students matriculating into college.

Everything is being measured: suspensions from school, end-of-term testing (also known as “high-stakes” testing), the academic degree history of teachers, minutes of recess, and almost anything else that can be assigned a number.

Predictably, this has also led to an explosion of data: the education sector has accumulated 269 petabytes of information (and growing). Further, institutions keep the data for at least 10 years, creating problems for storage and analysis.

In the past, all of these measurements went toward targeted statistical analyses to determine the correlative or causal effect different stimuli have on positive outcomes. Studies have looked at topics ranging from SAT psychometric techniques to the performance outcomes of school uniforms (which, interestingly, have no positive effect on students’ test scores).

A significant problem with this approach is the sheer number of variables that need to be accounted for in attempts to create an accurate reflection of the learning environment. Not only must all those measurements be collected (which presents its own set of significant challenges), but they must also be replicated and compared across schools all over the country. However, the data sets are simply too immense, pushing reviewers to take only tiny fractions of the data to perform their analysis.

Enter Big Data Analytics

There is an incredible opportunity to begin harvesting that information for the benefit of students everywhere. The National Center for Education Statistics stores the equivalent of several libraries of information that researchers can use for their analysis. A big data platform would allow researchers to look beyond the tiny slivers of data gathered from individual schools and begin to work toward harnessing the power of the entire repository.

Startups and major companies are now turning their attention toward big data in the education sphere.

Civitas Learning is a young startup focused on using predictive analytics, machine learning, and recommendation engines to improve student outcomes. The company built the largest cross-institutional learning data network in higher education, which allows it to see major trends in grades, dropout and retention rates, access to online materials, and other metrics.

With a data set of over one million student records and over seven million course records, the company’s software lets it detect known warning signs that lead to dropouts and poor performance. Additionally, it will allow the company to compare specific courses and degree paths that lead to attrition and reveal which resources and interventions are most successful.
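
To make the idea of "known warning signs" concrete, here is a minimal sketch of a rule-based check over student records. It is not Civitas Learning’s method, which the article does not describe. The field names (attendance_rate, gpa, logins_per_week), the thresholds, and the sample records are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class StudentRecord:
    # All fields are hypothetical; a real data set would define its own.
    student_id: str
    attendance_rate: float   # fraction of sessions attended, 0.0 to 1.0
    gpa: float               # grade point average on a 4.0 scale
    logins_per_week: float   # visits to online course materials


# Hypothetical warning-sign rules; thresholds are assumptions, not findings.
WARNING_RULES = {
    "low_attendance": lambda s: s.attendance_rate < 0.80,
    "low_gpa": lambda s: s.gpa < 2.0,
    "low_online_activity": lambda s: s.logins_per_week < 1.0,
}


def warning_signs(student):
    """Return the names of every rule this student's record triggers."""
    return [name for name, rule in WARNING_RULES.items() if rule(student)]


def flag_at_risk(students, min_signs=2):
    """Map student_id to triggered signs for students with at least min_signs."""
    flagged = {}
    for student in students:
        signs = warning_signs(student)
        if len(signs) >= min_signs:
            flagged[student.student_id] = signs
    return flagged


# Made-up sample records for demonstration only.
students = [
    StudentRecord("A", attendance_rate=0.95, gpa=3.4, logins_per_week=5.0),
    StudentRecord("B", attendance_rate=0.70, gpa=1.8, logins_per_week=0.5),
    StudentRecord("C", attendance_rate=0.75, gpa=2.9, logins_per_week=4.0),
]

print(flag_at_risk(students))
```

A production system would more likely learn such signals from historical records with a predictive model than hard-code them, but fixed rules show the basic shape of the computation: many measurements per student reduced to a short list of students who may need attention.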

Traditional Analytics

IBM has been at the forefront of using large data sets in the education sphere. Having one of the world’s greatest problem solvers turn its attention toward large problems in education is a powerful statement about the social good of technology. While its research has not explicitly used Apache Hadoop, its work in data analytics can provide lessons for future tech forays into education.

IBM’s work with Mobile County Public Schools shows the impact information can have on schools in need. When IBM entered the picture, the county was facing yet another increase in a dropout rate that was already at 48%. The school system was in such dire straits that it was under threat of probation stemming from the No Child Left Behind law, which penalizes schools with overall poor student performance. To combat this, the county had instituted a dropout indicator tool based on data gathered about students and used it to inform decision-making at the county level. However, this approach hit a few road bumps. As the IBM case study reads:

Having an early warning system to spot at-risk patterns among students is necessary, but not sufficient for dropout mitigation. School systems must also have consistent tools for intervention and the means to carry them out effectively.

With lessons learned, the county then sought to turn the dropout indicator tool into an actionable early warning system for possible conflict in a student’s household, sending officers and social workers home with students to help mitigate family stressors. In doing this, the county reversed years of stagnant or increasing dropout rates, ultimately lowering the rate by 3%.

Fixing Through Analysis

Repairing problems in the education system is not easy, but some attempt must be made to correctly identify the problem before looking for a solution. Or, restated: you can’t fix what you can’t measure. Collecting and analyzing data is not a perfect cure for every problem in our education system. However, it is a good first step in a chain that will ultimately lift schools out of a cycle of failure and toward the top floor of success.


Part 2 of 2 in this series will dive into how the Department of Education is currently looking into big data to improve information gathering to affect policy.
