Posts by James Locus:


City Hall is Getting Schooled

Nothing happens in a vacuum anymore. Cities can now use information collected from a massive variety of sources to help solve common city problems. The information can arise from anywhere – tweets, blog posts, and meter readings can all serve to inform public officials (and citizens as a whole) about how to better interact in a data-drenched world.

Most famously, IBM’s Smart Cities initiative looks at how city governments meet the needs of their expanding populations by using available resources more efficiently. This is in direct contrast to the older practice of extracting ever-greater amounts of natural resources. For example, optimizing how power plants distribute energy to city grids can alleviate power concerns during the summer months, when A/C usage creates huge power demands. Data-driven insight into how to do this is always better than blind foresight.

(IBM has a white paper about their smarter cities initiative.)

Yet, just a single person can make a difference. The Gothamist has an article about one observant filmmaker who decided to record a video of NYC subway goers tripping over the same staircase step in the course of a single day. He then uploaded the video to YouTube, where it immediately went viral. What’s more impressive is that city workers repaired the staircase step later that same day.

The same can be said for StreetBump, a smartphone app reviewed by the Huffington Post.  The app works by using a smartphone’s accelerometer to record the exact GPS location of potholes when a driver passes over cracks in the road.  This information can be relayed back to cities to improve the road conditions on a more dynamically rich scale than otherwise possible.
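
The idea of pairing accelerometer readings with GPS fixes can be sketched in a few lines of Python. This is only an illustration and not StreetBump's actual algorithm: the `Sample` type, the `detect_bumps` function, and the 4.0 m/s² threshold are all assumptions made for the example.

```python
from dataclasses import dataclass

GRAVITY = 9.81  # m/s^2, the vertical reading of a phone at rest


@dataclass
class Sample:
    lat: float
    lon: float
    vertical_accel: float  # m/s^2, as reported by the accelerometer


def detect_bumps(samples, threshold=4.0):
    """Return the (lat, lon) of every sample whose vertical acceleration
    deviates from gravity by more than `threshold` (an assumed cutoff)."""
    return [
        (s.lat, s.lon)
        for s in samples
        if abs(s.vertical_accel - GRAVITY) > threshold
    ]


# Hypothetical readings from a short drive; only the second one is a jolt.
samples = [
    Sample(42.3601, -71.0589, 9.8),
    Sample(42.3602, -71.0590, 15.2),
    Sample(42.3603, -71.0591, 9.6),
]
print(detect_bumps(samples))
```

A real app would likely also need to filter out speed bumps, door slams, and dropped phones, for instance by requiring several drivers to report the same location.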

Mayors of cities have also taken the lead in communicating with their constituents using big data-enabled technologies. New Jersey’s Star Ledger recently ran a report on Cory Booker, the mayor of Newark, and his persistent use of technology to directly (and personally) address the needs of individual Newarkers. In the past, he has accepted tweets to fix potholes and repair stoplights in an aim to make the position of mayor more accessible to the average person.

All of these points of data can be used to improve the way we interact with our increasingly connected world. Officials can use all of this information to help improve the lives of everyone and work toward creating more livable cities.

The Coming Majority: Mainstream Adoption and Entrepreneurship

Small companies, big data.

Big data is sometimes at odds with the business-savvy entrepreneur who wants to exploit its full potential.   In essence, the business potential of big data is the massive (but promising) elephant in the room that remains invisible because the available talent necessary to take full advantage of the technology is difficult to obtain.

Inventing new technology for the platform is critical, but so too is making it easier to use.

The future of big data may not be a technological breakthrough by a select core of contributing engineers, but rather a platform that allows common, non-PhD holding entrepreneurs and developers to innovate.  Some incredible progress has been made in Apache Hadoop with Hortonworks’ HDP (Hortonworks Data Platform) in minimizing the installation process required for full implementation.  Further, the improved MapReduce v2 framework also greatly lowers the risk of adoption for businesses by expressly creating features designed to increase efficiency and usability (e.g. backward and forward compatibility).  Finally, with HCatalog, the platform is opened up to integrate with new and existing enterprise applications.

What kinds of opportunities lie ahead when more barriers are eliminated?

The current situation is similar to data processing servers before cloud-based solutions like Amazon’s S3 and Elastic MapReduce (EMR). In the early 2000s, entrepreneurs had to spend a great deal of time running and maintaining the in-house servers that ran their business. When cloud-based solutions entered, they allowed developers to focus on using servers to enhance their business rather than be bogged down by their limitations. This revolution allowed a small 10-person startup to focus 100% of its attention on innovation and bringing value to its customers rather than on the limitations of the technology. Making the platform simpler and easier to use will have the same effect for big data.

Greater Adoption through Innovation

Enterprise Software

Buoyed by the efforts of the Apache Hadoop community, key enterprise software players have improved access to the platform. Hadoop platforms like HDP democratize big data by providing easy-to-use and widespread access for the greater community. Efforts like these help to push the technology past the early adopters to mass adoption markets. However, companies at this level focus on the invention of the platform. Sustainable technological growth arises only when companies use that invention in new, unexpected ways.

Business-to-Business (B2B) Applications

Beyond large players like Yahoo! and Netflix, smaller (often non-Hadoop) operations have sprung up all across the country around the idea of big data. One well-known example is Splunk, which created its own proprietary platform to process and analyze big data on a large scale for companies that need it. The benefit of companies like Splunk is their ability to identify desired elements from a variety of sources – machine data, cloud architectures, visual dashboards, and Hadoop – and package their offerings into a single product.

Another, more recent entry is a Durham, NC-based company named EvoApp. The company has built a big data platform called Bermuda specializing in customer and market intelligence. Continuing the trend begun by Splunk, it focuses primarily on analytics, betting that its speedy and accurate runtimes will be a significant differentiator in the marketplace.

Business-to-Consumer Applications

Startups are also working toward using big data to solve difficult problems for the everyday consumer.

One innovative use of big data is with a mobile app called Parker by Streetline. In major cities, locating empty parking spaces can be a major concern for commuters.  City governments and app developers alike are using big data to help car drivers locate available parking spaces more effectively by having modified parking meters broadcast their availability to the targeted servers that are paired with a notification system.

Another, The Climate Corporation, tailors its insurance policies based on weather-related risk factors that could negatively affect or potentially destroy entire crop yields.  The company uses big data to make weather and soil predictions to more intelligently bet against crop failure and issue policies accordingly.  The customer may not know (nor care) how the system works, but recognizes the value in being issued tailored insurance policies based on their personal risk factors.

Limits to Widespread Adoption

Imagine the possibilities of every high school student dreaming of the software possible with Hadoop in much the same way they now do for smartphone apps.  While technology champions are necessary to invent and evangelize young technologies, the real technological boom occurs when mainstream developers get involved and begin to push the limits of the platform.  As more startups innovate using big data technologies, we can look forward to seeing a new majority.

Big Data in Education (Part 2 of 2)

The following is Part 2 of 2 on data in education. The first article introduces the concept and application of data in education. The second article looks at recent movements by the Department of Education in data mining, modeling and learning systems.

Big data analytics are coming to public education. In 2012, the US Department of Education (DOE) was part of a host of agencies to share a $200 million initiative to begin applying big data analytics to their respective functions. The DOE targeted its $25 million share of the budget toward efforts to understand how students learn at an individualized level. This segment reviews the efforts enumerated in the draft paper released by the DOE on their big data analytics.

The ultimate goal of incorporating big data analytics in education is to improve student outcomes – as determined by common metrics like end-of-grade testing, attendance, and dropout rates. Currently, the education sector’s application of big data analytics is to create “learning analytic systems” – here defined as a connected framework of data mining, modeling, and use-case applications.

The hope of these systems is to offer educators better, more accurate information to answer the “how” question in student learning. Is a student performing poorly because she is distracted by her environment? Does a failing mark on the end-of-year test mean that the student did not fully grasp the year’s material, or was she having an off day? Learning analytics can help provide information to help educators answer some of these tough, real-world questions.

Data Mining to Answer Questions

Educational data mining is a major part in the move toward big data learning analytics. Recent trends in education have allowed researchers to amass large volumes of unstructured data. Structured data has been collected for years in the education sector, typically in the form of grades or attendance records. New methods of interactive learning have led to more unstructured data through intelligent tutoring systems, simulations, and learning games. This allows for the collection of richer data sets than previously possible, creating new research opportunities into students’ learning environment.

Educational data has several unique characteristics. Summarized:

…[E]ducational data is … hierarchical. Data at the keystroke level, the answer level, the session level, the student level, the classroom level, the teacher level, and the school level are nested inside one another. (DOE: Learning Analytics, pg. 18, 2012)

Thus, when a student answers a single question, several variables are being simultaneously analyzed.
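
To make the nesting concrete, here is a minimal Python sketch in which every answer event carries an identifier for each level it sits inside, so the same rows can be rolled up at any level. The field names, the sample rows, and the `accuracy_by` helper are hypothetical and are not taken from the DOE paper.

```python
from collections import defaultdict

# Hypothetical answer events; each row names the levels it is nested in.
events = [
    {"school": "S1", "classroom": "C1", "student": "amy", "session": 1, "correct": True},
    {"school": "S1", "classroom": "C1", "student": "amy", "session": 1, "correct": False},
    {"school": "S1", "classroom": "C1", "student": "ben", "session": 1, "correct": True},
    {"school": "S1", "classroom": "C2", "student": "cat", "session": 1, "correct": False},
]


def accuracy_by(level, rows):
    """Share of correct answers, grouped by one level of the hierarchy."""
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, attempts]
    for row in rows:
        totals[row[level]][0] += int(row["correct"])
        totals[row[level]][1] += 1
    return {key: correct / attempts for key, (correct, attempts) in totals.items()}


print(accuracy_by("student", events))
print(accuracy_by("classroom", events))
print(accuracy_by("school", events))
```

The same four rows answer a question about one student, one classroom, or the whole school, which is what makes a single answer informative at several levels at once.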

Time is also an important factor. Do longer gaps between answers translate into better answers? Does a student spend too much time on the first parts of exams only to rush the latter parts?

The order, sequence, and context in which the questions are answered provide even greater amounts of data researchers can use to uncover patterns in student learning. Students may perform better when asked a series of increasingly difficult but related questions rather than random selections of questions from a common pot. The move toward adaptive testing in the GRE (standardized testing for graduate school) shows a trend toward this effort.

Researchers can use all of this data to answer important questions about what makes the best learning environment for students. Understanding important academic questions can help educators create models about student learning efforts.

How the data is collected is important for its future usability. A challenge behind receiving the influx of data will be to standardize it on the front end so it can be usefully dissected. This does not mean converting unstructured data to structured data, but rather using intuitive methods of categorizing incoming information, similar to how YouTube has users categorize their videos during an upload. The DOE would need to be a standard-bearer for organizing how this information is incorporated into databases for modeling purposes.

User Knowledge and Behavior Modeling

Monitoring “how” a student tests has enabled researchers to model student behavior effectively. Beyond simply getting the correct answer, how a student works toward that goal can be just as important:

• How long has the student taken between questions?

• What previous kinds of questions has the student gotten correct/wrong?

• What kind of hints does the student benefit from most?

Monitoring these interactions can help create a behavior profile for individual students that can help educators understand the specific processes a student goes through in order to grasp the material.
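
As a rough illustration, the three questions above can be turned into a small behavior profile computed from an interaction log. The log format, the hint names, and the `build_profile` function below are assumptions made for this sketch, not a description of any real tutoring system.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical log entries:
# (seconds since session start, question kind, hint used or None, answered correctly)
log = [
    (0, "fractions", None, False),
    (40, "fractions", "worked_example", True),
    (70, "decimals", None, True),
    (130, "fractions", "worked_example", True),
    (150, "decimals", "definition", False),
]


def build_profile(entries):
    """Summarize pacing, accuracy per question kind, and the hint that
    most often preceded a correct answer."""
    times = [t for t, _, _, _ in entries]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]

    by_kind = defaultdict(list)
    hint_wins = Counter()
    for _, kind, hint, correct in entries:
        by_kind[kind].append(correct)
        if hint is not None and correct:
            hint_wins[hint] += 1

    return {
        "mean_gap_seconds": mean(gaps) if gaps else None,
        "accuracy_by_kind": {k: sum(v) / len(v) for k, v in by_kind.items()},
        "most_helpful_hint": hint_wins.most_common(1)[0][0] if hint_wins else None,
    }


profile = build_profile(log)
print(profile)
```

Counting a hint as "helpful" whenever it accompanies a correct answer is a deliberately crude choice; a real system would need to separate the effect of the hint from the student's prior ability.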

Creating adaptive learning systems using these student behavior profiles can enhance the effect. Armed with the information of “how” a student learns, developers can then tailor future questions and hints designed to increase the retention and synthesis of information. Developers like DreamBox Learning and Knewton have created and released their versions of an adaptive learning system. Their software provides millions of ways students can work through the program based on how they complete their assignments.

Education Use-Cases

Educators and researchers have developed five major techniques for extracting value from educators’ big data.

• Prediction – for understanding the likelihood of expected events. For example, having the ability to know when a student intentionally misses a question despite actual ability.

• Clustering – Discovering data points that naturally go together. Useful for putting together students of similar academic ability.

• Relationship Mining – discovering relationships between variables and encoding them for later use. Useful for detecting if a student reliably gets the correct answer after seeking help.

• Distillation for human judgment – building visual models for human parsing to aid machine learning models.

• Discovery with models – meta-study using models developed using big data analytics.
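
Of the techniques listed above, clustering is the easiest to show in a few lines. The sketch below groups students by a single test score using a simple one-dimensional k-means; the choice of k-means, the `kmeans_1d` and `assign` helpers, and the sample scores are assumptions for illustration, not a method named in the DOE paper.

```python
def assign(values, centers):
    """Put each value in the group of its nearest center."""
    groups = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        groups[nearest].append(v)
    return groups


def kmeans_1d(values, k=2, iterations=10):
    """Group a non-empty list of numbers into k clusters of similar values."""
    ordered = sorted(values)
    # Spread the starting centers evenly across the sorted values.
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        groups = assign(values, centers)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return assign(values, centers)


# Hypothetical end-of-term scores for six students.
scores = [52, 55, 58, 88, 91, 95]
print(kmeans_1d(scores))
```

In practice students would be described by many variables at once, and a library implementation would be a better fit than this hand-rolled version.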

Researchers believe these techniques will help educators more effectively guide students toward a more individualized learning process.

What is striking is how these education use-cases overlap with other common uses of big data analytic systems. For example, commercial banks may use clustering algorithms for profiles of purchases that will allow them to more readily detect fraud in a system. These uses provide a framework for the creation of useful learning analytic systems.

Learning Analytic Systems

The implementation of all of these leads to the creation of a learning analytics system, a set of techniques that holds the promise of improving the academic outcomes of students. While similar systems have been in place in the commercial sector for years, the education sector has many challenges ahead before it truly becomes a success story.

Acquiring the data presents its own set of challenges. For college-age and mature students, data collection is not a major issue. For school-age students, however, collectors must clear some hurdles to avoid potentially identifying individual students. Some hurdles are legal, while others are ethical. Regardless, this does slow down the overall process of collection.

The number and skill of data collectors is also an issue. Websites’ use of cookies is a common method by which companies can uniformly gather information. The DOE, however, would have to rely on thousands of school districts and networks of researchers to refine and certify data.

Even with their innate challenges, learning analytics represent a quantum leap in creating a customized learning environment for each student. Custom-fit learning curricula handed daily to each student, early detection systems designed to find the warning signs of potential disenrollment and dropouts, multi-year learning plans designed to challenge rather than induce boredom. All made possible through the use of big data analytics.

Big Data in Education (Part 1 of 2)

The following is Part 1 of 2 on data in education.  The first article introduces the concepts of how data is used in education.  The second article looks at recent movements by the Department of Education in data mining, modeling and learning systems.

Learning to Learn

The education industry is transforming into a 21st century data-driven enterprise. Metrics-based assessment has been a powerful force that has swept the national education community in response to widespread policy reform. Passed in 2001, the No Child Left Behind Act pushed the idea of standards-based education whereby schoolteachers and administrators are held accountable for the performance of their students. The law elevated standardized tests and dropout rates as the primary way officials measure student outcomes and achievement. Underperforming schools can be placed on probation, and if no improvement is seen after 3-4 years, the entire staff of the school can be replaced.

The political ramifications of the law inspire much debate amongst policy analysts.  However, from a data perspective, it is more informative to understand how advances in technology can help educators both meet the policy’s guidelines and work to create better student outcomes.

Measurements

The emphasis on measurable outcomes has shifted the priorities of schools toward capturing data linking student performance with positive outcomes – from primary to higher education. Positive “outcomes” translate to higher student attendance, improved test scores, and more students matriculating into college.

Everything is being measured – suspension from school, end-of-term testing (also known as “high-stakes” testing), academic degree history of teachers, minutes of recess, and almost anything else that can be assigned a number.

Predictably, this has also led to an explosion of data – the education sector has accumulated 269 petabytes of information (and growing). Further, they keep the data for at least 10 years, creating problems for storage and analysis.

In the past, all of these measurements went toward targeted statistical analyses to determine the correlative or causal effect different stimuli have on positive outcomes. Studies have looked at topics from SAT psychometric techniques to the performance outcomes of school uniforms (which, interestingly, have no positive effect on students’ test scores).

A significant problem with this is the incredible number of variables that need to be accounted for in attempts to create an accurate reflection of the learning environment. Not only must all those measurements be collected (which presents its own set of significant challenges), but they must also be replicated and compared across all other schools in the country. However, the data sets are simply too immense, pushing reviewers to take only tiny fractions of the data to perform their analysis.

Enter Big Data Analytics

There is an incredible opportunity to begin harvesting that information for the benefit of students everywhere.  The National Center for Education Statistics stores the equivalent of several libraries of information researchers can use for their analysis.   The platform would allow researchers to look beyond the tiny slivers of data gathered from individual schools and begin to work toward harnessing the power of the entire repository.

Startups and major companies are now turning their eye toward big data in the education sphere.

Civitas Learning is a young startup focused on using predictive analytics, machine learning, and recommendation engines to improve student outcomes.  The company built the largest cross-institutional learning data network in higher education to allow them to see major trends in grades, dropout and retention rates, access to online materials, and other metrics.

With a data set of over one million student records and over seven million course records, their software lets them detect known warning signs that lead to dropouts and poor performance. Additionally, it allows them to compare specific courses and degree paths that lead to attrition and reveals which resources and interventions are most successful.

Traditional Analytics

IBM has been at the forefront of using large educational data sets in the education sphere. The significance of having one of the world’s greatest problem solvers turn its eye toward solving large problems in education is a powerful statement of the social good of technology.  While their research has not explicitly used Apache Hadoop, their work in data analytics can provide lessons for future tech forays into education.

IBM’s work with Mobile County Public Schools shows the impact information can have on schools in need. When IBM entered the picture, the county was facing yet another increase in a dropout rate that was already at 48%. The school system was in such dire straits that it was under threat of probation stemming from the No Child Left Behind law, which penalizes and disciplines schools with overall poor student performance. To combat this, the county had instituted a dropout indicator tool based on data gathered about students and used it to inform decision-making at the county level. However, this approach hit a few road bumps. As the IBM case study reads:

Having an early warning system to spot at-risk patterns among students is necessary, but not sufficient for dropout mitigation.  Schools systems must also have consistent retools for intervention and the means to carry them out effectively.

With lessons learned, they then sought to turn the dropout indicator tool into an actionable early warning system of possible conflict in a student’s household, sending officers and social workers home with students to help mitigate family stressors. In doing this, the county reversed years of stagnant or increasing dropout rates, ultimately lowering the rate by 3%.

Fixing Through Analysis

Repairing problems in the education system is not easy, but some attempt must be made to correctly identify the problem before looking for a solution. Or, restated: you can’t fix what you can’t measure. Collecting and analyzing data is not the perfect cure for every problem in our education system. However, it is a good first step in a chain that will ultimately lift schools out of a cycle of failure and toward the top floor of success.

Part 2 of 2 in this series will dive into how the Department of Education is currently looking into big data to improve information gathering to affect policy.

Lessons from Anime and Big Data (Ghost in the Shell)

What lessons might the anime (Japanese animation) “Ghost in the Shell” teach us about the future of big data? The show, originally a graphic novel from creator Masamune Shirow, explores the consequences of a “hyper”-connected society so advanced one is able to download one’s consciousness temporarily into human-like android shells (hence the work’s title). If this sounds familiar, it’s because Ghost in the Shell was a major point of inspiration for the Wachowski brothers, the creators of the Matrix Trilogy.

The ability to handle, process, and manipulate big data is a major theme of the show, which focuses on the challenges of a high-tech police unit in thwarting potential cyber crimes. The graphic novel was originally created in 1991, long before the concept of big data had grown to prominence (and, for all intents and purposes, even before what we now think of as the internet…)

Visions of a “Big Data” Future

While such visions of an interconnected techno-future are common in anime, what makes Ghost in the Shell special is its treatment of the power of big data.  Technology is not used simply for its exploitative value, but as a means to create a greater, more capable society.  Data becomes the engine that drives an entire civilization towards achieving taller buildings, faster cars, and yes – even androids.

Big data puts many of Ghost in the Shell’s “technological advances” just within reach.  The show features almost instantaneous transfers of petabyte hard drives and facial recognition searches about as fast as a Google search.

Far off? Or is it?

So when will we be able to control androids? Pretty soon, apparently. Doctors and scientists have successfully linked the human nervous system with electronics, allowing amputees to make macro movements using artificial arms and finer movements with their fingers.

But the prosthetic arms go beyond simply allowing movement. They also provide tactile sensory information, like vibrations and pressure (the sense of “touch”), by connecting the unit directly to the patient’s nerve endings. More significantly, patients can learn to use the arm in as little as five hours.

The most amazing thing about science fiction is how fast it becomes science fact.

Are We There Yet?

The ability to give the human brain control over more and more of an artificial body is a major point of overlap between the anime and the progress of actual science. Treating the brain as data to be interfaced with, stored, and transferred is not new to science or science fiction. In general terms, the brain operates by sending electrical signals across a host of different connectors which receive, process, and store information that the conscious mind can use to interact with the world around it. Professor Paul Reber of Northwestern University estimates the human brain can hold 2.5 petabytes of data – information that can be transferred, managed, and processed like any other.

Naturally, the Apache Hadoop platform would be the obvious choice to handle the massive storage and analysis of this unstructured data. While our understanding of how the brain works is in its earliest stages, the possibility of being able to capture the massive amount of data necessary for brain functions represents an alluring and attractive goal for many scientists. High goals for a platform less than a decade old, but what better way to store the cornucopia of unstructured information in the human brain than Hadoop?

And Returning to Reality…

No one can know where science and technology will go in the future. However, the emergence of forward-looking shows that allow us to look and dream about a more advanced tomorrow certainly has enabled us to hope for that day when fiction like that shown in Ghost in the Shell becomes reality.

Kiss the Weatherman

Weather Hurts

Catastrophic weather events like the historic 2011 floods in Pakistan or prolonged droughts in the Horn of Africa make living conditions unspeakably harsh for tens of millions of families living in these affected areas. In the US, the winter storms of 2009-2010 and 2010-2011 brought record-setting snowfall, forcing mighty metropolises into an icy standstill. Extreme weather can profoundly impact humankind.

The effects of extreme weather can send terrible ripples throughout an entire community. Unexpected cold snaps or overly hot summers can devastate crop yields and force producers to raise prices. When food prices rise, it becomes more difficult for some people to earn enough money to provide for their families, creating even larger problems for societies as a whole.

The central problem is the inability of current forecasting models to accurately predict large-scale weather patterns. Weathermen are good at predicting weather but poor at predicting climate. Weather occurs over a shorter period of time and can be reliably predicted within a 3-day timeframe. Climate stretches over many months, years, or even centuries. Matching historical climate data with current weather data to make future weather and climate predictions is a major challenge for scientists.
