Hadoop is Transforming the Public Sector

Use Apache Hadoop for Efficient Government and National Defense

The public sector is charged with protecting citizens, responding to constituents, providing services and maintaining infrastructure. In many instances, the demands of these responsibilities increase while government resources simultaneously shrink under budget pressures.

How can government, defense and intelligence agencies and government contractors do more with less? Apache Hadoop is part of the answer.

The open source Apache Hadoop framework is philosophically aligned with the transparency we expect from good government. At Hortonworks we offer one-year support contracts, so we know that every year is an election year for our customers. We work hard to earn that vote by supporting HDP in production and innovating Hadoop to better enable public agencies to meet their mandates.

The following are some of the ways public sector customers use Hadoop.

Use Machine and Sensor Data to Proactively Maintain Public Infrastructure

Metro Transit of St. Louis (MTL) operates the public transportation system for the St. Louis metropolitan region. Hortonworks Data Platform helps MTL meet its mission by storing and analyzing IoT data from the city’s Smart Buses, which helped the agency cut the average cost per mile driven by its buses from $0.92 to $0.43. It achieved that cost reduction while simultaneously doubling the annual miles driven per bus.

Hortonworks delivered the MTL solution in partnership with LHP Telematics, an industry leader in creating custom telematics solutions for connected vehicles in the heavy equipment OEM marketplace, transportation, service, and construction fleets. The combined solution is making MTL bus service more reliable, improving the Mean Time Between Failures (MTBF) for metro buses by a factor of more than five, from four thousand to twenty-one thousand miles.
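The two headline figures above can be checked with simple arithmetic. The short Python sketch below uses only the numbers quoted in the text. The helper names are illustrative and not part of any Hortonworks tooling.

```python
# Figures quoted in the text above.
cost_before, cost_after = 0.92, 0.43      # dollars per mile driven
mtbf_before, mtbf_after = 4_000, 21_000   # miles between failures


def percent_reduction(before, after):
    """Return the reduction from `before` to `after` as a percentage."""
    return (before - after) / before * 100


def improvement_factor(before, after):
    """Return how many times larger `after` is than `before`."""
    return after / before


cost_cut = percent_reduction(cost_before, cost_after)
mtbf_gain = improvement_factor(mtbf_before, mtbf_after)

print(f"Cost per mile reduced by {cost_cut:.1f}%")
print(f"MTBF improved by a factor of {mtbf_gain:.2f}")
```

In other words, the cost per mile fell by a little more than half, and buses now travel a little more than five times as far between failures.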


Understand Public Sentiment About Government Performance

One federal ministry in a European country wanted to better understand the views of its constituents related to a major initiative to reduce obesity. Direct outreach for feedback might have been effective for a few high-quality interactions with a small number of citizens or school age children, but those methods lacked both reach and persistence.

So the ministry started analyzing social media posts related to its program to reduce obesity. Every day, a team uses HDP to analyze tweets, posts and chat sessions and give daily sentiment reports to members of parliament for rapid feedback on which policies work and which flop.

Protect Critical Networks from Threats (Both Internal and External)

Large IT networks generate server logs with data on who accesses the network and the actions that they take. Server log data is typically seen as exhaust data, characterized by a “needle-in-a-haystack” dilemma: almost all server logs have no value, but some logs contain information critical to national defense. The challenge is to identify actual risks amongst the noise, before they lead to loss of classified information.

Now intruders plan long-term, strategic campaigns referred to as “Advanced Persistent Threats” (APTs). Both internal actors like Edward Snowden and external attackers in foreign governments conduct sophisticated, multi-year intrusion campaigns. Hadoop’s processing power makes it easier to find the “needles” left by these intruders across the different data “haystacks”.

This generalized approach is described in a paper published by Lockheed Martin entitled “Intelligence-Driven Computer Network Defense Informed by Analysis of Adversary Campaigns and Intrusion Kill Chains”:

Network defense techniques which leverage knowledge about these adversaries can create an intelligence feedback loop, enabling defenders to establish a state of information superiority which decreases the adversary’s likelihood of success with each subsequent intrusion attempt.

Apache Hadoop can provide that information superiority to protect against sustained campaigns by malicious users.
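As a rough illustration of the “needle-in-a-haystack” idea, the sketch below scans log lines and flags users with repeated failed logins. It is plain, single-machine Python and not Hadoop code. The log format, the sample records and the threshold are all assumptions made up for this example. On a cluster the same filter-and-count pattern would run in parallel over far larger log sets.

```python
from collections import Counter

# Hypothetical log format (an assumption): "timestamp user action status"
SAMPLE_LOGS = [
    "2015-03-01T02:14:07 alice login ok",
    "2015-03-01T02:14:09 mallory login failed",
    "2015-03-01T02:14:11 mallory login failed",
    "2015-03-01T02:15:30 bob read ok",
    "2015-03-01T02:16:02 mallory login failed",
    "2015-03-01T02:17:45 bob login failed",
]


def parse(line):
    """Split one log line into named fields."""
    timestamp, user, action, status = line.split()
    return {"timestamp": timestamp, "user": user,
            "action": action, "status": status}


def failed_logins(lines):
    """Count failed login attempts per user."""
    counts = Counter()
    for line in lines:
        record = parse(line)
        if record["action"] == "login" and record["status"] == "failed":
            counts[record["user"]] += 1
    return counts


def suspicious_users(lines, threshold=3):
    """Return users whose failed-login count reaches the threshold."""
    counts = failed_logins(lines)
    return sorted(user for user, n in counts.items() if n >= threshold)


print(suspicious_users(SAMPLE_LOGS))
```

Most lines are discarded by the filter, which is the point: only the few records that match a pattern of interest are surfaced for an analyst.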

Prevent Fraud and Waste

One federal agency with a large pool of beneficiaries turned to Apache Hadoop and the Hortonworks Data Platform to discover fraudulent claims for benefits. The implementation reduced ETL processing from 9 hours to 1 hour, which allowed the agency to create new data models around fraud, waste and abuse.

After speeding up the ETL process, the agency used that efficiency to triple the data included in its daily processing. Because Hadoop is a “schema on read” system, rather than the traditional “schema on load” platform, the agency now plans to search additional legacy systems and include more upstream contextual data (such as social media and online content) in its analysis. All of this will make it easier to identify and stop fraud, waste and abuse.
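To make “schema on read” concrete, here is a minimal Python sketch. Raw records are stored exactly as they arrive, and field names and types are applied only when a question is asked. The record layout, field names and sample values are hypothetical and do not come from the agency described above. This is plain Python rather than a Hadoop API.

```python
import json

# Raw records kept exactly as received (hypothetical sample data).
# Note that the sources disagree on field names and carry extra fields.
RAW_RECORDS = [
    '{"claim_id": "C1", "amount": "120.50", "source": "legacy_a"}',
    '{"claim_id": "C2", "amount": "9800", "note": "resubmitted"}',
    '{"id": "C3", "amount": "75"}',
]


def read_claims(raw_lines):
    """Apply a schema at read time: pick fields and coerce types."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            # Tolerate either identifier field used by the sources.
            "claim_id": record.get("claim_id") or record.get("id"),
            "amount": float(record["amount"]),
        }


def large_claims(raw_lines, minimum):
    """Return the IDs of claims at or above a dollar amount."""
    return [claim["claim_id"]
            for claim in read_claims(raw_lines)
            if claim["amount"] >= minimum]


print(large_claims(RAW_RECORDS, 100))
```

Because nothing was rejected or reshaped at load time, a new source with a different layout can be added later by adjusting only the read-time function.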

Analyze Social Media to Identify Terrorist Threats

Terrorist networks attempt to avoid detection by organizing and communicating across diffuse, informal networks. Yet the nature of these social networks contains information that can be used to detect and thwart malicious activity. With social network analysis over huge sets of data, intelligence agencies that identify one malicious individual can find accomplices within six degrees of separation from the known bad guy.

Apache Hadoop makes this analysis efficient. Of course, not everyone in contact with a known terrorist is complicit. In fact, most are uninvolved in any wrongdoing. Social data analysis at scale gives agencies actionable intelligence, helping them protect innocents and effectively focus on those intending harm.
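The “degrees of separation” search described above is, at its core, a breadth-first traversal of a contact graph. The sketch below shows the idea on a tiny, invented graph in plain Python. The names and connections are hypothetical, and a real deployment would run a distributed graph algorithm over far more data.

```python
from collections import deque

# Hypothetical contact graph: each person maps to their direct contacts.
CONTACTS = {
    "suspect": ["a", "b"],
    "a": ["suspect", "c"],
    "b": ["suspect"],
    "c": ["a", "d"],
    "d": ["c"],
    "e": [],
}


def within_degrees(graph, start, max_degrees):
    """Return {person: degrees} for everyone reachable within max_degrees."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distances[person] == max_degrees:
            continue  # do not expand beyond the hop limit
        for contact in graph.get(person, []):
            if contact not in distances:
                distances[contact] = distances[person] + 1
                queue.append(contact)
    del distances[start]  # the starting person is not their own contact
    return distances


print(within_degrees(CONTACTS, "suspect", 2))
```

The result is only a list of people and their distance from the starting point. As the text notes, proximity alone does not imply involvement, so the output is a starting point for analysts rather than a conclusion.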

Decrease Budget Pressures by Offloading Expensive SQL Workloads

During the recent sequestration standoff within the United States federal government, IT budgets came under increased scrutiny and budgetary pressure. Many agencies turned to a major consulting firm that recommended Hortonworks Data Platform for offloading certain data sets to Hadoop.

This recommendation was based on the best practice of putting each and every data workload in the most appropriate place. HDP interoperates with all of the major relational data warehouse platforms used by federal agencies. It doesn’t make economic sense to store certain types of data in those platforms, so transitioning less structured data sets to Hadoop reduced expenses without disrupting any existing data or operations.

Now the same data is accessible as before but stored at a lower cost.

Crowdsource Reporting for Repairs to Roads and Public Infrastructure

Any large city has a backlog of physical repairs to roads and infrastructure. This is a major prioritization challenge. Citizen complaints like, “They don’t give a darn about fixing potholes in my neighborhood,” might indicate a lack of information more than a lack of civic responsibility. No government can fix problems that haven’t come to its attention.

Cities like Palo Alto, California and Boston, Massachusetts are using sensor data and photos captured by citizens’ mobile devices to “crowdsource” reporting on civic infrastructure in need of repair. All of this data can be stored in Apache Hadoop for easy prioritization and rapid response. Compare that to how many municipalities do it today: “cumbersome surveys that involve engineers in pickup trucks dragging chains behind them and measuring the vibrations of the metal.”

Over the long term, data reported by citizens (or transmitted automatically from sensors) can be mined to develop policies that reduce the rate at which public infrastructure degrades.

Fulfill “Open Records” and Freedom of Information Requests

Open records acts, like the Public Information Act in Texas, allow citizens to request data at any time and often require a response within a predetermined amount of time. For example, the Texas statute specifies ten business days:

If an officer for public information cannot produce public information for inspection or duplication within ten business days after the date the information is requested, section 552.221(d) requires the officer to “certify that fact in writing to the requestor and set a date and hour within a reasonable time when the information will be available for inspection or duplication.”

This puts a burden on state and local IT teams, since the data for a particular request may be scattered across multiple legacy data systems. It can be challenging for a small team to fulfill those types of requests in a timely manner.

HDP, as part of a modern data architecture, can store multiple data sets, retain them for decades, and combine them to meet specific information requests, improving efficiency and accountability.

Government agencies at the federal, state and local levels can purchase support for Hortonworks Data Platform (HDP) from Hortonworks partner immixGroup through the IT 70 GSA Schedule.
