Educational Data Mining

Contributors: Mimi Recker

Cyberlearning systems are increasingly engineered to capture fine-grained footprints of users’ online activities. Sometimes referred to as the ‘data deluge’, these voluminous datasets are ripe for analysis using data mining, a field that has evolved over the past several decades to support the discovery and extraction of implicit knowledge from one or more large collections of data.

Educational data mining (EDM), in turn, is the application of this process to educational datasets in order to better understand a system’s impact and its users. For example, EDM results might help researchers more rapidly understand how learners access and use learning materials, how they interact with teachers and peers, and how they create new content. We might also understand how different individuals engage with or potentially ‘game’ the system. Taken together, these learning analytics offer rich feedback to those implementing cyberlearning systems, to those using them, and even to policy makers.

Transformative Potential

Transformative potentials exist at many levels, including students, classrooms, teachers, schools, districts, families, and communities. In addition, EDM results are critical in informing aspects of successful and unsuccessful system design. For example, parents may be interested in better understanding how learners’ behaviors relate to academic performance, or how particular school, district, and/or classroom initiatives are impacting student learning. Administrators may use these rich data to inform resource allocations.

Here, we highlight some of these potentials.

1. Collating and mining student cyberlearning activities (learning analytics) can

  • Support the compilation of a rich record of student activities, sometimes called a Lifelong Learning Chronicle (CRA, 2005).
  • Enable a better understanding of students’ knowledge status and behaviors.
  • Support real-time diagnostics of learning, enabling instant and customized feedback. This better enables the student to work in his or her zone of proximal development.
  • Support customized delivery and personalization of instruction.
  • Enable student profiling and dynamic construction of student groups for collaborative work (based on either shared or dissimilar patterns).
  • Support more seamless transitions between formal and informal schooling.
  • Provide data in e-portfolios, for school mobility or workforce purposes.
  • Support validation and development of instructional practices and learning theories.
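
As one illustration of the grouping idea in the list above, the following minimal sketch pairs students whose activity profiles are most alike, or least alike. It is only an assumption about how such grouping might work: the profile format (counts of resource use per topic), the use of cosine similarity, the greedy pairing strategy, and the names `cosine_similarity` and `pair_students` are all hypothetical and are not drawn from any particular EDM system.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length activity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a student with no recorded activity matches no one
    return dot / (norm_a * norm_b)


def pair_students(profiles, similar=True):
    """Greedily pair students by most similar (or most dissimilar) profiles.

    `profiles` maps a student name to a list of activity counts
    (hypothetical format). With an odd number of students, one is left out.
    """
    scored = sorted(
        (
            (cosine_similarity(profiles[a], profiles[b]), a, b)
            for a, b in combinations(sorted(profiles), 2)
        ),
        reverse=similar,  # highest similarity first when pairing like with like
    )
    paired, pairs = set(), []
    for _, a, b in scored:
        if a not in paired and b not in paired:
            pairs.append((a, b))
            paired.update((a, b))
    return pairs


# Hypothetical counts of engineering, reading, and art resources used.
profiles = {
    "ana": [5, 0, 1],
    "ben": [4, 1, 0],
    "cy": [0, 6, 2],
    "dee": [1, 5, 3],
}
shared_pairs = pair_students(profiles, similar=True)
mixed_pairs = pair_students(profiles, similar=False)
```

Greedy pairing is simple but not optimal; a real system would likely use richer features and a proper clustering or matching method.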

2. Collating and mining teacher activities (teaching analytics) can

  • Identify teachers with shared interests to support a more connected professional practice, including sharing, coaching, and professional development.
  • Recommend instructional resources for students with different abilities and demographic backgrounds.
  • Optimize the system to improve the feedback loop.


Over the weekend, Rafaela visits a local science exhibit on building and testing bridge designs. Her smart learning device captures a record of these interactions, along with engagement data measured by skin temperature. These are processed and stored in her personal LifeLongLearning (LLL) Portfolio in the Education Data Cloud.

Over her morning coffee, her principal, Ms. Jenkins, accesses visual analytics of her students’ recent activities, extracted from their LLL Portfolios. The display shows that many students have, of late, been engaged by engineering ideas. The system also recommends several age-appropriate learning resources in engineering, including an engineering game called BuildThis. Armed with this information, Ms. Jenkins gets assistance from online mentors to compile a short online professional development segment to help her teachers integrate these concepts into the week’s lesson plans. This information is also delivered to parents of home-bound and home-schooled students. In class that day, Rafaela is partnered with a student with similar interests to learn physics concepts in the context of a national design competition using BuildThis.


The following are but some of the challenges in this area:

  1. What emerging standards best balance the need for data privacy with the need to link student and teacher data across distributed systems? How can users best be informed about what data are collected in order to control access, anonymize data, and opt in or out?
  2. What data should we be capturing and in what formats? What are the incentives for various stakeholders to collect, share, and use their data? What kinds of resources are necessary to aggregate and curate these data on a continuing basis?
  3. What data mining algorithms show the most promise and in what situations? What processes best support synchronizing heterogeneous data streams from different sources, such as clickstream, audio, video, and GPS?
  4. This research paradigm is deeply interdisciplinary. How can we support information and computer scientists, who bring knowledge of emerging algorithms, in working productively with statisticians and psychometricians, educational and cognitive psychologists, and educational researchers and practitioners?
  5. What data displays, visualizations, and visual analysis (format, time scale) are most informative and support effective decision making for different stakeholders? For example, teachers, in their busy, chaotic classrooms, will need different visualizations than parents interested in their child’s progress, or administrators interested in their school or district’s performance.
  6. As part of model interpretation, how can EDM results be combined and/or triangulated with other, more conventional kinds of data, for example, test scores, surveys, observations, interviews?
  7. How do we know the appropriateness of a decision made based on EDM findings, such as putting students into different curricula? How can the resulting claims and actions be validated?
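
To make two of these challenges concrete (linking records without exposing identities, and synchronizing streams from different sources), here is a minimal sketch under stated assumptions. The event tuple layout, the secret key handling, and the function names `pseudonymize` and `merge_streams` are hypothetical. Keyed hashing gives pseudonymization only, not anonymization, and it is shown as one possible approach rather than an emerging standard.

```python
import hashlib
import hmac
from heapq import merge


def pseudonymize(student_id, secret_key):
    """Return a stable keyed hash so records can be linked without the raw ID.

    `secret_key` must be bytes; anyone holding it can recompute the mapping,
    so it has to be protected (key management is out of scope here).
    """
    digest = hmac.new(secret_key, student_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability (an assumption)


def merge_streams(secret_key, *streams):
    """Merge time-sorted event streams into one pseudonymized timeline.

    Each event is a (timestamp_seconds, student_id, source, payload) tuple
    (hypothetical format), and each stream must already be sorted by time.
    """
    timeline = merge(*streams, key=lambda event: event[0])
    return [
        (ts, pseudonymize(student_id, secret_key), source, payload)
        for ts, student_id, source, payload in timeline
    ]


# Hypothetical events from two sources for the same student.
clicks = [(10, "s1", "clickstream", "open"), (30, "s1", "clickstream", "submit")]
gps = [(20, "s1", "gps", "science exhibit")]
timeline = merge_streams(b"example-key", clicks, gps)
```

Real streams rarely share a clock, so a practical pipeline would also need to correct for clock offsets and differing sampling rates before merging.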


Baker, R. S. J. D., & Yacef, K. (2009). The state of educational data mining in 2009: A review and future visions. Journal of Educational Data Mining, 1(1), 3-17.

Baker, R. S. J. D. (in press). Data mining for education. In B. McGaw, P. Peterson, & E. Baker (Eds.), International encyclopedia of education (3rd ed.). Oxford, UK: Elsevier.

Borgman, C. L., Abelson, H., Dirks, L., Johnson, R., Koedinger, K. R., Linn, M. C., … Szalay, A. (2008). Fostering learning in the networked world – The CyberLearning opportunity and challenge: A 21st century agenda for the National Science Foundation (Report of the NSF Task Force on CyberLearning). Arlington, VA: NSF.

Computing Research Association. (2005). Cyberinfrastructure for education and learning for the future: A vision and research agenda. Washington, DC: Computing Research Association.

Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques (2nd ed.). San Francisco, CA: Morgan Kaufmann.

Romero, C., & Ventura, S. (2007). Educational data mining: A survey from 1995 to 2005. Expert Systems with Applications, 33(1), 135-146. doi: 10.1016/j.eswa.2006.04.005.

Romero, C. R., & Ventura, S. (2010). Educational data mining: A review of the state of the art. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 40(6), 601-618.