The Coming Majority: Mainstream Adoption and Entrepreneurship

Small companies, big data.

Big data is sometimes at odds with the business-savvy entrepreneur who wants to exploit its full potential.   In essence, the business potential of big data is the massive (but promising) elephant in the room that remains invisible because the available talent necessary to take full advantage of the technology is difficult to obtain.

Inventing new technology for the platform is critical, but so too is making it easier to use.

The future of big data may not be a technological breakthrough by a select core of contributing engineers, but rather a platform that allows common, non-PhD holding entrepreneurs and developers to innovate.  Some incredible progress has been made in Apache Hadoop with Hortonworks’ HDP (Hortonworks Data Platform) in minimizing the installation process required for full implementation.  Further, the improved MapReduce v2 framework also greatly lowers the risk of adoption for businesses by expressly creating features designed to increase efficiency and usability (e.g. backward and forward compatibility).  Finally, with HCatalog, the platform is opened up to integrate with new and existing enterprise applications.

What kinds of opportunities lie ahead when more barriers are eliminated?

The current situation is similar to that of data processing servers before cloud-based solutions like Amazon’s S3 and Elastic MapReduce (EMR). In the early 2000s, entrepreneurs had to spend a great deal of time running and maintaining the in-house servers that ran their business. When cloud-based solutions arrived, they allowed developers to focus on using servers to enhance their business rather than be bogged down by the servers’ limitations. This revolution allowed a small 10-person startup to focus 100% of its attention on innovation and bringing value to its customers rather than on the limitations of the technology. Making the platform simpler and easier to use will have the same effect for big data.

Greater Adoption through Innovation

Enterprise Software

Buoyed by the efforts of the Apache Hadoop community, key enterprise software players have improved access to the platform. Hadoop platforms like HDP democratize big data by providing easy-to-use and widespread access for the greater community. Efforts like these help to push the technology past the early adopters to mass adoption markets. However, companies at this level focus on the invention of the platform. Sustainable technological growth arises only when companies use that invention in new, unexpected ways.

Business-to-Business (B2B) Applications

Beyond large players like Yahoo! and Netflix, smaller (often non-Hadoop) operations have sprung up all across the country around the idea of big data. One well-known example is Splunk, which created its own proprietary platform to process and analyze big data on a large scale for companies that need it. The benefit of companies like Splunk is their ability to identify desired elements from a variety of sources – machine data, cloud architectures, visual dashboards, and Hadoop – and package their offerings into a single product.

Another, more recent entry is a Durham, NC-based company named EvoApp. The company has built a big data platform called Bermuda that specializes in customer and market intelligence. Continuing the trend begun by Splunk, it focuses primarily on analytics, though it is betting that its speedy and accurate runtimes will be a significant differentiator in the marketplace.

Business-to-Consumer Applications

Startups are also working toward using big data to solve difficult problems for the everyday consumer.

One innovative use of big data is with a mobile app called Parker by Streetline. In major cities, locating empty parking spaces can be a major concern for commuters.  City governments and app developers alike are using big data to help car drivers locate available parking spaces more effectively by having modified parking meters broadcast their availability to the targeted servers that are paired with a notification system.

Another, The Climate Corporation, tailors its insurance policies based on weather-related risk factors that could negatively affect or potentially destroy entire crop yields.  The company uses big data to make weather and soil predictions to more intelligently bet against crop failure and issue policies accordingly.  The customer may not know (nor care) how the system works, but recognizes the value in being issued tailored insurance policies based on their personal risk factors.

Limits to Widespread Adoption

Imagine the possibilities of every high school student dreaming of the software possible with Hadoop in much the same way they now do for smartphone apps.  While technology champions are necessary to invent and evangelize young technologies, the real technological boom occurs when mainstream developers get involved and begin to push the limits of the platform.  As more startups innovate using big data technologies, we can look forward to seeing a new majority.

Happy Birthday Hortonworks!

Last week was an important milestone for Hortonworks: our one year anniversary. Given all of the activity around Apache Hadoop and Hortonworks, it’s hard to believe it’s only been one year. In honor of our birthday, I thought I would look back to contrast our original intentions with what we delivered over the past year.

Hortonworks was officially announced at Hadoop Summit 2011. At that time, I published a blog on the Hortonworks Manifesto. This blog told our story, including where we came from, what motivated the original founders and what our plans were for the company. I wanted to address many of the important statements from this blog here:

Hortonworks was formed to “accelerate the development and adoption of Apache Hadoop”. I returned to this point often throughout the manifesto. We committed to working with the community to accelerate the development and adoption of Apache Hadoop and we absolutely delivered on this promise. Over the past year, Apache Hadoop released Hadoop-1.0, the most stable line of Apache Hadoop ever. Hadoop-2.0, including the next-generation architectures for both MapReduce and HDFS, was also released in alpha form. Apache Hadoop continues to gain momentum as proven by every important metric (downloads, web traffic, press & analyst coverage, conference and Meetup attendance, etc.). It was a banner year for Apache Hadoop and we are proud to have played an important role in making it happen.

We are “committed to open source” and commit that “all core code will remain open source”. This commitment is as solid today as it was a year ago. All code developed by Hortonworks has been contributed back to open source. In addition to our significant contributions to core Hadoop projects (MapReduce and HDFS), we have also made significant contributions to other Hadoop ecosystem projects including Ambari, HCatalog, Pig and ZooKeeper. We will continue to be a leader in the Hadoop community process and will continue to contribute all of our Hadoop development efforts back into the Apache community development process.

We will “make Apache Hadoop easier to install, manage and use”. This was a key focus for Hortonworks over the past year. We quickly learned that it would be beneficial to the market to offer a Hortonworks distribution of Apache Hadoop that delivered on this promise. Hortonworks Data Platform, which we recently made available to the entire ecosystem, addresses each of these areas. We have included an installer that greatly simplifies the installation process for Apache Hadoop. We included, for the first time, Apache Ambari, which allows organizations to manage and monitor their Hadoop clusters. We also tightly integrated Hortonworks Data Platform with Talend Open Studio for Big Data, which provides a visual design environment for connecting Hadoop with hundreds of enterprise data systems in order to make Hadoop easier to use. The result is a greatly simplified process for organizations that are looking for a pure Apache Hadoop distribution.

We will “make Apache Hadoop more robust”. Again, I’m pleased that we delivered on this promise. We were instrumental in the re-architectures of MapReduce and HDFS to address the enterprise needs of each of these core components. Our team has written a number of blogs and presentations on these topics that I strongly recommend you read if you haven’t already. Among the most significant are the following: NextGen MapReduce presentation, NextGen MapReduce Hits Mainline, Delivering on Hadoop .NEXT, Benchmarking Performance, Apache Hadoop 2.0 (Alpha) Released, Data Integrity and Availability in Apache Hadoop HDFS, An Introduction to HDFS Federation, NameNode HA Reaches an Important Milestone, Snapshots for HDFS and High Availability and Hadoop 1.0 – Perfect Together . The last post covers the ability to add new HA capabilities to the stable and proven Hadoop-1.0 line.

We will “make Apache Hadoop easier to integrate and extend”. We have made some important advancements in this area that may have gone unnoticed. Much of this work is related to HCatalog, an Apache project that provides a metadata and table management system for Hadoop. We feel strongly that HCatalog is the preferred path for simplifying data sharing between Hadoop and other enterprise data systems and have invested heavily into advancing this project and related APIs for HCatalog. By tightly integrating Talend Open Studio for Big Data, we have also made it much easier for a broader audience to integrate Hadoop with hundreds of existing data systems. We have also formed important partnerships with leaders such as Microsoft and Teradata to ensure that their platforms and applications are tightly integrated and optimized to work with Apache Hadoop.

We will “deliver an ever-increasing array of services aimed at improving the Hadoop experience and support in the growing needs of enterprises, systems integrators and technology vendors”. Over the past year, we have made available Hortonworks University, an exceptional Hadoop training program for developers, administrators and analysts; and Hortonworks Services, which leverages the deep domain experience of the Hortonworks technical staff to provide technical support to enterprises, systems integrators and technology vendors. Our training courses, in particular, have been very well received by students who have continually praised our hands-on lab exercises as the best in the industry. We have recently expanded our training schedule, so check it out.

There were certainly many other notable achievements over the past year, including:

  • The Hortonworks team grew significantly and now numbers around 90 people. We are hiring too!
  • We established partnerships with major enterprise software vendors including Microsoft and Teradata that are changing the way Hadoop will be consumed.
  • We hosted the 5th annual Hadoop Summit with great success and rave reviews and over 2250 attendees.

As you can see, we are very proud of our accomplishments in our first year. We were also glad to be recognized by Forrester as a leader in the Forrester Wave on Enterprise Hadoop Solutions. Really, how often do companies get recognized as leaders by Forrester in their very first year of existence?

While this blog took a look back at last year, stay tuned for another blog that looks forward to what we have planned for year two.

~ E14

 

Big Data in Education (Part 2 of 2)

The following is Part 2 of 2 on data in education. The first article introduces the concept and application of data in education. The second article looks at recent movements by the Department of Education in data mining, modeling and learning systems.

Big data analytics are coming to public education. In 2012, the US Department of Education (DOE) was one of several agencies sharing in a $200 million initiative to begin applying big data analytics to their respective functions. The DOE targeted its $25 million share of the budget toward efforts to understand how students learn at an individualized level. This segment reviews the efforts enumerated in the draft paper released by the DOE on its big data analytics.

The ultimate goal of incorporating big data analytics in education is to improve student outcomes – as determined by common metrics like end-of-grade testing, attendance, and dropout rates. Currently, the education sector’s application of big data analytics is to create “learning analytic systems” – here defined as a connected framework of data mining, modeling, and use-case applications.

The hope is that these systems will offer educators better, more accurate information to answer the “how” question in student learning. Is a student performing poorly because she is distracted by her environment? Does a failing mark on the end-of-year test mean that the student did not fully grasp the year’s material, or was she having an off day? Learning analytics can provide information to help educators answer some of these tough, real-world questions.

Data Mining to Answer Questions

Educational data mining is a major part in the move toward big data learning analytics. Recent trends in education have allowed researchers to amass large volumes of unstructured data. Structured data has been collected for years in the education sector, typically in the form of grades or attendance records. New methods of interactive learning have led to more unstructured data through intelligent tutoring systems, simulations, and learning games. This allows for the collection of richer data sets than previously possible, creating new research opportunities into students’ learning environment.

Educational data has several unique characteristics. As the DOE draft summarizes:

…[E]ducational data is … hierarchical. Data at the keystroke level, the answer level, the session level, the student level, the classroom level, the teacher level, and the school level are nested inside one another. (DOE: Learning Analytics, pg. 18, 2012)

Thus, when a student answers a single question, several variables are being simultaneously analyzed.
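
One way to picture this nesting is as a small set of record types, each contained in the next. The sketch below is illustrative only: the class and field names (Keystroke, Answer, Session, Student) and the sample values are assumptions made for this example and are not taken from the DOE draft.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Keystroke:
    key: str
    seconds_since_start: float


@dataclass
class Answer:
    question_id: str
    correct: bool
    keystrokes: List[Keystroke] = field(default_factory=list)


@dataclass
class Session:
    session_id: str
    answers: List[Answer] = field(default_factory=list)


@dataclass
class Student:
    student_id: str
    sessions: List[Session] = field(default_factory=list)


def count_events(student: Student) -> dict:
    """Count the records at each level nested under one student."""
    answers = [a for s in student.sessions for a in s.answers]
    keystrokes = [k for a in answers for k in a.keystrokes]
    return {
        "sessions": len(student.sessions),
        "answers": len(answers),
        "keystrokes": len(keystrokes),
    }


# Hypothetical data: one session, two answers, three keystrokes.
example = Student(
    "s-001",
    [
        Session(
            "monday",
            [
                Answer("q1", True, [Keystroke("4", 1.2), Keystroke("2", 1.9)]),
                Answer("q2", False, [Keystroke("7", 3.0)]),
            ],
        )
    ],
)
print(count_events(example))
```

Classroom, teacher, and school levels would wrap around Student in the same way; they are left out here to keep the sketch short.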

Time is also an important factor. Do large gaps between answering questions translate into better answers? Does a student spend too much time on the first parts of exams only to rush the latter parts?

The order, sequence, and context in which the questions are answered provide even greater amounts of data that researchers can use to uncover patterns in student learning. Students may perform better when asked a series of increasingly difficult but related questions rather than random selections of questions from a common pot. The move toward adaptive testing in the GRE (standardized testing for graduate school) shows a trend toward this effort.

Researchers can use all of this data to answer important questions about what makes the best learning environment for students. Understanding these important academic questions can help educators create models of student learning efforts.

How the data is collected is important for its future usability. A challenge in receiving the influx of data will be to standardize it on the front end so it can be usefully dissected. This does not mean converting unstructured data to structured data, but rather adopting intuitive methods of categorizing incoming information, similar to how YouTube has users categorize their videos during an upload. The DOE would need to be a standard-bearer for organizing how this information is incorporated into databases for modeling purposes.

User Knowledge and Behavior Modeling

Monitoring “how” a student tests has enabled researchers to model student behavior effectively. Beyond simply getting the correct answer, how a student works toward that goal can be just as important:

• How long has the student taken between questions?

• What previous kinds of questions has the student gotten correct/wrong?

• What kind of hints does the student benefit from most?

Monitoring these interactions can help create a behavior profile for individual students that can help educators understand the specific processes a student goes through in order to grasp the material.
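
As a rough illustration, the three questions above can be answered from a simple interaction log. Everything in this sketch is hypothetical: the log format, the field order, and the idea of treating a hint as "helpful" when the answer that followed it was correct are assumptions for the example, not a description of any real tutoring system.

```python
from collections import Counter
from statistics import mean

# Hypothetical log entries:
# (seconds since previous question, question kind, answered correctly?, hint used or None)
log = [
    (12.0, "fractions", True, None),
    (45.0, "fractions", False, "definition"),
    (30.0, "decimals", True, "worked_example"),
    (20.0, "decimals", True, None),
    (50.0, "fractions", True, "worked_example"),
]


def build_profile(events):
    """Summarize pacing, accuracy per question kind, and the most helpful hint."""
    gaps = [gap for gap, _, _, _ in events]

    # Track (correct, total) per question kind.
    by_kind = {}
    for _, kind, correct, _ in events:
        right, total = by_kind.get(kind, (0, 0))
        by_kind[kind] = (right + int(correct), total + 1)

    # Crude proxy: a hint counts as helpful if the answer it accompanied was correct.
    helpful = Counter(hint for _, _, correct, hint in events if hint and correct)

    return {
        "mean_gap_seconds": mean(gaps),
        "accuracy_by_kind": {k: r / t for k, (r, t) in by_kind.items()},
        "most_helpful_hint": helpful.most_common(1)[0][0] if helpful else None,
    }


profile = build_profile(log)
print(profile)
```

A real system would need far more careful measures, but even this small summary shows how raw interactions turn into a per-student profile.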

Creating adaptive learning systems using these student behavior profiles can enhance the effect. Armed with the information of “how” a student learns, developers can then tailor future questions and hints designed to increase the retention and synthesis of information. Developers like DreamBox Learning and Knewton have created and released their versions of an adaptive learning system. Their software provides millions of ways students can work through the program based on how they complete their assignments.

Education Use-Cases

Educators and researchers have developed five major techniques for extracting value from education’s big data.

• Prediction – understanding the likelihood of expected events. For example, having the ability to know when a student intentionally misses a question despite actual ability.

• Clustering – discovering data points that naturally go together. Useful for grouping students of similar academic ability.

• Relationship Mining – discovering relationships between variables and encoding them for later use. Useful for detecting whether a student reliably gets the correct answer after seeking help.

• Distillation for human judgment – presenting data visually so that humans can parse it and help inform machine learning models.

• Discovery with models – meta-study using models developed using big data analytics.

Researchers believe these techniques will help educators more effectively guide students toward a more individualized learning process.
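
To make the clustering technique above concrete, here is a minimal sketch of one-dimensional k-means applied to made-up test scores. The scores, the function names, and the choice of k-means itself are assumptions for illustration; the DOE draft does not prescribe a particular algorithm.

```python
def assign(values, centroids):
    """Return, for each value, the index of the nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
        for v in values
    ]


def kmeans_1d(values, k=2, iterations=20):
    """Tiny 1-D k-means for k >= 2. Returns (centroids, assignments)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted values.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]

    for _ in range(iterations):
        assignments = assign(values, centroids)
        new_centroids = []
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids

    return centroids, assign(values, centroids)


# Hypothetical end-of-grade scores for eight students.
scores = [52, 55, 58, 61, 88, 90, 93, 95]
centroids, groups = kmeans_1d(scores, k=2)
print(centroids)  # [56.5, 91.5]
print(groups)     # [0, 0, 0, 0, 1, 1, 1, 1]
```

With real student data there would be many more dimensions than a single score, and a library implementation would be a better choice than this hand-rolled loop.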

What is striking is how these education use-cases overlap with other common uses of big data analytic systems. For example, commercial banks may use clustering algorithms to build profiles of purchases that allow them to more readily detect fraud in a system. These uses provide a framework for the creation of useful learning analytic systems.

Learning Analytic Systems

The implementation of all of these techniques leads to the creation of a learning analytics system, which holds the promise of improving the academic outcomes of students. While similar systems have been in place in the commercial sector for years, the education sector has many challenges ahead before it truly becomes a success story.

Acquiring the data presents its own set of challenges. For college-age and mature students, data collection is not a major issue. For school-age students, however, collectors must clear some hurdles to avoid potentially identifying individual students. Some hurdles are legal, while others are ethical. Regardless, this slows down the overall process of collection.

The number and skill of data collectors is also an issue. Websites’ use of cookies is a common way for companies to gather information uniformly. The DOE, however, would have to rely on thousands of school districts and networks of researchers to refine and certify data.

Even with its innate challenges, learning analytics represents a quantum leap in creating a customized learning environment for each student: custom-fit learning curricula handed daily to each student, early detection systems designed to find the warning signs of potential disenrollment and dropouts, and multi-year learning plans designed to challenge rather than induce boredom. All are made possible through the use of big data analytics.

Hadoop: A Powerful Weapon for Retailers

With big data basking in the limelight, it is no surprise that large retailers have been closely watching its development… and more power to them! By learning to effectively utilize big data, retailers can significantly mold the market to their advantage, making themselves more competitive and increasing the likelihood that they will come out on top as a successful retailer. Now that there are open source analytical platforms like Hadoop, which allow for unstructured data to be transformed and organized, large retailers are able to make smart business decisions using the information they collect about customers’ habits, preferences, and needs.

As IT industry analyst Jeff Kelly explained on Wikibon, “Big Data combined with sophisticated business analytics have the potential to give enterprises unprecedented insights into customer behavior and volatile market conditions, allowing them to make data-driven business decisions faster and more effectively than the competition.” Predicting what customers want to buy, without a doubt, affects how many products they want to buy (especially if retailers add on a few of those wonderful customer discounts). Not only will big data analytics prove financially beneficial, it will also present the opportunity for customers to have a more individualized shopping experience.

This all sounds very promising but the difficulty lies in the fact that there are many channels in the consumer business now, such as online, in-store, call centers, mobile, social, etc., each with its own target-marketing advantage. In order for retailers to thrive in the market, they must learn to manage and hone in on all (or at least most) of these facets of business, which can be difficult if you keep in mind the amount of data that each channel generates. Sam Sliman, president at Optimal Solutions Integration, summarizes it perfectly: “Transparency rules the day. Inconsistency turns customers away. Retailer missteps can be glaring and costly.” By making fast market decisions, retailers can increase sales, win and maintain customers, improve margins, and boost market share, but this can really only be done with the right business analytics tools.

Who’s doing it right?

One impressive example of analytics usage is @WalmartLabs, which deals with the social and mobile aspects of retail to redefine commerce for Walmart and help its customers have a more positive shopping experience. Through its Social Genome knowledge base, @WalmartLabs zeroes in on entities, relationships, and events in the social world (for instance, a tweet about a specific movie title) in order to send out appropriate suggestions to customers. “We do this using public data on the Web, proprietary data, and a lot of social media. From such data we identify interesting entities and relationships, extract them, augment them with as much information as we can find, then add them to the Social Genome.” @WalmartLabs uses its own in-house data platform, called Muppet, that is meant to process data at lightning speed.

Sears is another retailer that is focused on the advantages of big data and is using Hadoop to develop its business. If you were able to make it to Hadoop Summit 2012, you had the chance to see Phil Shelley speak about the company’s use of Hadoop and provide some interesting insight about the benefits of the open source platform (if you couldn’t make it, you can find the session slides here). Through Hadoop, Sears is able to compare and organize information about product availability, competitors’ prices, local economic conditions, etc. Before Hadoop, Sears was only using 10% of the information it had in store and was using most of its money and resources on running price elasticity algorithms. Rachael King of the CIO Journal explains, “The company now offloads data from its mainframe computers onto servers using Hadoop to run algorithms that analyze the data and feeds the results back into the mainframe. The retailer is able to use 100% of the data it collects.”

Big Insights

Eric Williams, CIO at Catalina Marketing, offered some helpful information in an interview by Alison Bolen of SAS. According to Williams, retailers can use big data and business analytics to answer questions like, “what products are selling, what’s the association of one product to another, what do my consumers look like, what is the marketplace doing?” With 20,000 new products being introduced in the United States annually, it is essential for companies to sort the information about all of these products in order to gauge which ones worked well and which ones were a total flop.

The sorting of information through platforms like Hadoop will allow for extensive feedback in finance, marketing, operations, sales, and other areas of a business, which, in turn, will offer a more “per-customer profitability” approach. Sales associates will be able to access information on the spot (through a mobile device, for example) about which products are up-and-coming or which items a customer may be interested in based on the questions they might ask in the store. So, not only will online shopping continue to become more and more personalized but in-store experiences will also be highly sensitive to what each customer is looking for.

A white paper by Keplar LLP goes through the process of using Hadoop for retail business analytics and offers a list of ways to use the information that is collected through different channels:

  • Learn more about the customer, including who she is, how she engages with the product, company or brand, how she feels about the product and what role she plays in evangelizing it to others
  • Identify ways to better tailor the product and service to that customer or customer segment, improving customer loyalty
  • Identify ways to improve the product for all users, by comparing the way that this customer used it with other customers. Are there particular workflows that customers struggled with or abandoned?
  • Identify new products / services to offer that customer, or a segment of customers made up of people like her
  • Grow customer lifetime value, and hence profit

The white paper explains that both consumer and product analytics are significantly affected by the presence of big data and to manage both of these, Hadoop is a great solution. Since it uses a parallel structure, Hadoop can run various analyses on smaller data sets which makes it easy for retailers to compare and contrast various products, customer feedback, and the mass of social information that is generated every minute of every day. The possibilities for a thriving retail business are endless.
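
The idea of running an analysis over smaller pieces of data and then combining the results can be sketched in plain Python. This is not Hadoop’s API: it only imitates the map, shuffle, and reduce phases inside a single process, and the sales records and function names are made up for illustration.

```python
from collections import defaultdict

# Made-up sales records: (product, units sold). In Hadoop these chunks would be
# blocks of a much larger file spread across many machines.
chunks = [
    [("shoes", 2), ("hat", 1), ("shoes", 1)],
    [("hat", 4), ("scarf", 2)],
    [("shoes", 3), ("scarf", 1)],
]


def map_phase(chunk):
    """Emit (key, value) pairs. Each chunk can be processed independently."""
    return [(product, units) for product, units in chunk]


def shuffle(mapped_chunks):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


totals = reduce_phase(shuffle(map_phase(c) for c in chunks))
print(totals)  # {'shoes': 6, 'hat': 5, 'scarf': 3}
```

Because each call to map_phase touches only its own chunk, the chunks could be handed to different machines, which is the property that lets a cluster scale the same analysis to much larger data.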

Without a platform like Hadoop, retailers have to spend big bucks on designing the appropriate data warehouses for the information they collect. Hadoop doesn’t require a pre-defined schema, so storing and interpreting unstructured data like product descriptions or social media conversations between users becomes considerably easier.
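
A small sketch may help show what storing data without a pre-defined schema looks like in practice. The records, field names, and the read_reviews helper below are hypothetical; the point is only that differently shaped records can sit side by side, with structure applied when the data is read rather than when it is stored.

```python
import json

# Hypothetical raw records with different shapes, stored as-is (one JSON object per line).
raw_lines = [
    '{"type": "review", "product": "shoes", "text": "Love them", "stars": 5}',
    '{"type": "tweet", "user": "@sam", "text": "new shoes today"}',
    '{"type": "review", "product": "hat", "text": "Too small"}',
]


def read_reviews(lines):
    """Apply a 'review' view at read time; fields a record lacks become None."""
    for line in lines:
        record = json.loads(line)
        if record.get("type") == "review":
            yield {
                "product": record.get("product"),
                "stars": record.get("stars"),
                "text": record.get("text"),
            }


reviews = list(read_reviews(raw_lines))
for review in reviews:
    print(review)
```

A different analysis, such as one focused on the tweets, could read the same stored lines with a different view, without anyone having redesigned a warehouse table first.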

Ready to Check Out?

In such a consumer-driven society it seems almost necessary to establish a system of organization that could help make sense of consumer behaviors and trends; Apache Hadoop is a smart (and affordable) way to do this. With the social and technological worlds advancing at such an incredible speed, online, mobile, and social consumerism is becoming more of a norm rather than an option. Retail companies can truly receive the most from their business (and provide a positive experience for customers) if they happily open their arms to the big data coming their way and simultaneously understand how to transform this data into a positive business model.

We’re Heading to OSCON, Are You?

We’re heading to our very first OSCON, the biggest gathering for the entire open source community, in Portland, Oregon, to talk all things Apache Hadoop, and we would love to meet you there!

Meet our founders, Arun Murthy and Mahadev Konar, along with others from the Hortonworks team at this year’s conference.

There are many ways to meet the Hortonworks team and we would love to chat with you about how you are considering using Hadoop.

We’ll be speaking!

Arun Murthy will be presenting “Apache Hadoop- The future is Now” on Wednesday, July 18 @ 10:40am in Portland 252

Mahadev Konar will present “Apache ZooKeeper in Action” on Wednesday, 7/18 @ 2:30pm in D139-140

And hosting!

Birds of a Feather (BoF) session on the Next Generation of Apache Hadoop, Wednesday 7/18 @ 7pm

And we’re exhibiting!

Come by booth #207, say hello, geek out to Hadoop and big data and pick up an awesome shirt while you’re at it.

See you there!

Big Data in Education (Part 1 of 2)

The following is Part 1 of 2 on data in education.  The first article introduces the concepts of how data is used in education.  The second article looks at recent movements by the Department of Education in data mining, modeling and learning systems.

Learning to Learn

The education industry is transforming into a 21st century data-driven enterprise. Metrics-based assessment has been a powerful force that has swept the national education community in response to widespread policy reform. Passed in 2001, the No Child Left Behind Act pushed the idea of standards-based education whereby schoolteachers and administrators are held accountable for the performance of their students. The law elevated standardized tests and dropout rates as the primary way officials measure student outcomes and achievement. Underperforming schools can be placed on probation, and if no improvement is seen after 3-4 years, the entire staff of the school can be replaced.

The political ramifications of the law inspire much debate amongst policy analysts.  However, from a data perspective, it is more informative to understand how advances in technology can help educators both meet the policy’s guidelines and work to create better student outcomes.

Measurements

The emphasis on measurable outcomes has shifted the priorities of schools toward capturing data linking student performance with positive outcomes – from primary through higher education. Positive “outcomes” translate to higher student attendance, improved test scores, and more students matriculating into college.

Everything is being measured – suspension from school, end-of-term testing (also known as “high-stakes” testing), academic degree history of teachers, minutes of recess and almost anything else that can be assigned a number.

Predictably, this has also led to an explosion of data – the education sector has accumulated 269 petabytes of information (and growing). Further, schools keep the data for at least 10 years, creating problems for storage and analysis.

In the past, all of these measurements went toward targeted statistical analyses to determine the correlative or causal effect different stimuli have on positive outcomes. Studies have looked at topics from SAT psychometric techniques to the performance outcomes of school uniforms (which, interestingly, have no positive effect on students’ test scores).

A significant problem with this is the incredible number of variables that need to be accounted for in attempts to create an accurate reflection of the learning environment. Not only must all those measurements be collected (which presents its own set of significant challenges), but they must also be replicated and compared to those of all other schools across the country. However, the data sets are simply too immense, pushing reviewers to take only tiny fractions of data to perform their analysis.

Enter Big Data Analytics

There is an incredible opportunity to begin harvesting that information for the benefit of students everywhere. The National Center for Education Statistics stores the equivalent of several libraries of information researchers can use for their analysis. A big data platform would allow researchers to look beyond the tiny slivers of data gathered from individual schools and begin to work toward harnessing the power of the entire repository.

Startups and major companies are now turning their eye toward big data in the education sphere.

Civitas Learning is a young startup focused on using predictive analytics, machine learning, and recommendation engines to improve student outcomes.  The company built the largest cross-institutional learning data network in higher education to allow them to see major trends in grades, dropout and retention rates, access to online materials, and other metrics.

With a data set of over one million student records and over seven million course records, their software lets them detect known warning signs that lead to dropouts and poor performance.  Additionally, it will allow them to compare specific courses and degree paths that lead to attrition and reveal which resources and interventions are most successful.

Traditional Analytics

IBM has been at the forefront of using large data sets in the education sphere. Having one of the world’s greatest problem solvers turn its eye toward large problems in education is a powerful statement of the social good of technology.  While their research has not explicitly used Apache Hadoop, their work in data analytics can provide lessons for future tech forays into education.

IBM’s work with Mobile County Public Schools shows the impact information can have on schools in need.  When IBM entered the picture, the county was facing yet another increase in a dropout rate that was already at 48%.  The school system was in such dire straits that it was under threat of probation stemming from the No Child Left Behind law, which penalizes schools with poor overall student performance.  To combat this, the county had instituted a dropout indicator tool based on data gathered about students and used it to inform decision-making at the county level. However, this approach hit a few road bumps.  As the IBM case study reads:

Having an early warning system to spot at-risk patterns among students is necessary, but not sufficient for dropout mitigation.  Schools systems must also have consistent retools for intervention and the means to carry them out effectively.

With lessons learned, the county then sought to turn the dropout indicator tool into an actionable early warning system for possible conflict in a student’s household – sending officers and social workers home with students to help mitigate family stressors.  In doing this, the county reversed years of stagnant or increasing dropout rates, ultimately lowering the rate by 3%.

Fixing Through Analysis

Repairing problems in the education system is not easy, but some attempt must be made to correctly identify the problem before looking for a solution.  Or, restated: you can’t fix what you can’t measure.  Collecting and analyzing data is not a perfect cure for every problem in our education system.  However, it is a good first step in a chain that will ultimately lift schools out of a cycle of failure and toward the top floor of success.

 

Part 2 of 2 in this series will dive into how the Department of Education is currently looking into big data to improve information gathering to affect policy.

Data Integration Services & Hortonworks Data Platform

What’s possible with all this data?

Data Integration is a key component of the Hadoop solution architecture. It is the first obstacle encountered once your cluster is up and running. Ok, I have a cluster… now what? Do I write a script to move the data? What is the language? Isn’t this just ETL with HDFS as another target? Well, yes…

Sure you can write custom scripts to perform a load, but that is hardly repeatable and not viable in the long term. You could also use Apache Sqoop (available in HDP today), which is a tool to push bulk data from relational stores into HDFS. While effective and great for basic loads, there is work to be done on the connections and transforms necessary in these types of flows. While custom scripts and Sqoop are both viable alternatives, they won’t cover everything and you still need to be a bit technical to be successful.
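For a sense of what a basic Sqoop load involves, here is a minimal Python sketch that assembles a `sqoop import` command line. The JDBC URL, table name, user name, and HDFS directory are placeholders invented for illustration, and the helper function is hypothetical. The flags shown (`--connect`, `--username`, `-P`, `--table`, `--target-dir`, `--num-mappers`) are common Sqoop 1 import options, but check them against the Sqoop version on your cluster.

```python
import shlex


def build_sqoop_import(jdbc_url, table, target_dir, username, num_mappers=4):
    """Return the argument list for a basic `sqoop import` of one table.

    Hypothetical helper for illustration; it only builds the command,
    it does not run it.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC URL of the relational source
        "--username", username,
        "-P",                               # prompt for the password at run time
        "--table", table,                   # source table to copy
        "--target-dir", target_dir,         # destination directory in HDFS
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


# All values below are made-up placeholders.
command = build_sqoop_import(
    jdbc_url="jdbc:mysql://db.example.com/sales",
    table="orders",
    target_dir="/data/raw/orders",
    username="etl_user",
)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in command))
```

On a machine with the Sqoop client installed, the same list could be handed to `subprocess.run(command, check=True)`. Even in this small case you still have to know the JDBC URL, the table layout, and the target path, which is the "bit technical" part mentioned above.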

For wide scale adoption of Apache Hadoop, tools that abstract integration complexity are necessary for the rest of us.  Enter Talend Open Studio for Big Data. We have worked with Talend in order to deeply integrate their graphical data integration tools with HDP as well as extend their offering beyond HDFS, Hive, Pig and HBase into HCatalog (metadata service) and Oozie (workflow and job scheduler).

Talend addresses four key concerns for those using HDP:

  • Bridge the skills gap – Not everyone has a PhD in computer science…  Talend presents a graphical tool where you drag and drop pre-built components onto a canvas, configure them, and then all the underlying code is created for you.  This is Java code that can be executed anywhere Java runs and can even be packaged as a service.  You can also customize the code however you see fit or use it within another IDE.  This radically simplifies the data load process.  All you need to know is the basic configuration and voila!… your data is loaded.
      
  • HCatalog Integration – Hortonworks and Talend engineering teams have partnered to bring HCatalog-specific components and functions, deeply integrated with the Talend connectors.  Components allow you to easily create, drop and modify tables and databases, check for existence, etc. Also, when storing data you can choose HCatalog as a storage option.  This provides the developer with options within the specific tools for Hive and Pig to integrate with HCatalog and share data and its structure much more easily. HCatalog then provides the metadata services for the underlying data and opens up the platform.
  • Connect to the entire enterprise – The enterprise is full of different sources and targets for data.  These can be databases, applications, files, services and even data warehouses and cubes.  Integration with these resources is not always simple.  We could take the top ten and provide connectors and call it a day, but enterprise data centers are not so homogeneous. With Talend we are able to present a palette full of options; in fact, they have over 400 connectors available.  In this video, you can see us grab and parse an Apache log file in seconds using a component.  These pre-tested components save integration time by providing proven and tested APIs and schemas to make these connections.  Want to pull data from Salesforce.com?  …drop a component, configure your login credentials and your Salesforce metadata and data are at your fingertips.
  • Graphic Pig Script Creation – Talend also provides components to deliver Pig scripts without writing a line of code.  Components for join, aggregate, filtering, cross and others are all included.  Again you drop a component, connect the schema, configure the function, and then all the underlying code is written for you… making your time to delivery all the faster.
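
To make the idea of generated Pig code concrete, here is a small Python sketch that turns a handful of configuration values into a Pig Latin script with a load, a filter, and a grouped count. This is not Talend's generator and not its actual output; the function name, paths, and schema are all hypothetical, chosen only to show the kind of script such a component saves you from writing by hand.

```python
def build_pig_script(input_path, output_path, schema, filter_expr, group_field):
    """Assemble a simple Pig Latin script: load, filter, group, count, store.

    Hypothetical helper for illustration. `schema` is a list of
    (field_name, pig_type) pairs.
    """
    fields = ", ".join(f"{name}:{pig_type}" for name, pig_type in schema)
    lines = [
        # Read comma-separated text with a declared schema.
        f"records = LOAD '{input_path}' USING PigStorage(',') AS ({fields});",
        # Keep only the rows matching the configured condition.
        f"filtered = FILTER records BY {filter_expr};",
        # Group the remaining rows and count each group.
        f"grouped = GROUP filtered BY {group_field};",
        "counts = FOREACH grouped GENERATE group, COUNT(filtered);",
        f"STORE counts INTO '{output_path}';",
    ]
    return "\n".join(lines)


# Made-up example: count successful requests per host in a web log extract.
script = build_pig_script(
    input_path="/logs/access.csv",
    output_path="/reports/hits_by_host",
    schema=[("host", "chararray"), ("status", "int")],
    filter_expr="status == 200",
    group_field="host",
)
print(script)
```

A graphical tool does the same kind of assembly behind the canvas: each dropped component contributes a statement, and the connected schema supplies the field names and types.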

This approach can help all of your Hadoop-related projects move a lot faster so you can quickly move past the “where do I start?” question to the more interesting “what’s possible with all this data?”.


The Data Lifecycle, Part Three: Booting HCatalog on Elastic MapReduce

Series Introduction

This is part three of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data.  In a series of posts, we’re exploring the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in Hive, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

  • Series Part One: Avroizing the Enron Emails. In that post, we used Pig to extract, transform and load a MySQL database of the Enron emails to document format and serialize them in Avro. The Enron emails are available in Avro format here.
  • Series Part Two: Mining Avros with Pig, Consuming Data with Hive. In part two of the series, we extracted new and interesting properties from our data for consumption by analysts and users, using Pig, EC2 and Hive. Code examples for this post are available here: https://github.com/rjurney/enron-hcatalog.
  • Series Part Three: Booting HCatalog on Elastic MapReduce. Here we will use HCatalog to streamline the sharing of data between Pig and Hive, and to aid data discovery for consumers of processed data.


Kiss the Weatherman

Weather Hurts

Catastrophic weather events like the historic 2011 floods in Pakistan or prolonged droughts in the Horn of Africa make living conditions unspeakably harsh for tens of millions of families living in the affected areas.  In the US, the winter storms of 2009-2010 and 2010-2011 brought record-setting snowfall, forcing mighty metropolises into an icy standstill. Extreme weather can profoundly impact humankind.

The effects of extreme weather can send terrible ripples throughout an entire community.  Unexpected cold snaps or overly hot summers can devastate crop yields and force producers to raise prices. When food prices rise, it becomes more difficult for some people to earn enough money to provide for their families, creating even larger problems for societies as a whole.

The central problem is the inability of current forecasting models to accurately predict large-scale weather patterns.  Weathermen are good at predicting weather but poor at predicting climate.  Weather occurs over a shorter period of time and can be reliably predicted within a 3-day timeframe.  Climate stretches over many months, years, or even centuries.  Matching historical climate data with current weather data to make future weather and climate predictions is a major challenge for scientists.


High Availability and Hadoop 1.0 – Perfect Together

In Shaun Connolly’s post about balancing community innovation and enterprise stability, he discussed the importance of an open source project forging ahead with big improvements that are expected to be initially buggy and incomplete functionally but then stabilize over time. In the case of Apache Hadoop 2.0, currently in community Alpha release, the big improvements have been underway for the past 3 years and include such things as:

  1. Next-gen MapReduce (aka YARN) that opens up Hadoop’s job processing architecture to other application workloads beyond MapReduce,
  2. New HDFS pipeline to support append and flush,
  3. HDFS federation and performance improvements that enable Hadoop to scale to thousands more nodes in a cluster, and
  4. High availability improvements that address some of the single point of failure issues that are often used as examples of how Hadoop may not be as enterprise-ready as it could be.

In the case of high availability (HA), it can take many months or years to get these types of solutions rock solid. While Hadoop 2.0 contains important HA-related features such as HDFS hot standby, we want to make sure we give it time to complete its community release process and allow extra time after that for bugs to be found and fixed to harden it for broad enterprise production use.


Hortonworks Recognized as a Leader in Forrester Wave Report

I am pleased to report that Hortonworks has been named a leader in the recently released Forrester Wave report on Enterprise Hadoop Solutions. We scored well across all three rating areas: current offering, market presence and strategy.

We appreciate the recognition, particularly this sentence that highlighted our role in the marketplace: “(Hortonworks) is the technology leader and ecosystem building for the entire Hadoop industry and has recently released its Hortonworks Data Platform, which incorporates purely open-source Apache Hadoop software.”

Being named a Leader in the Forrester Wave on Enterprise Hadoop Solutions is one of many achievements for Hortonworks over the past seven months (stay tuned for a blog on this topic). While we are proud of our past, we are much more focused on our future. We know that we must continue to drive innovation and work with the community to deliver high-quality Apache Hadoop releases. It’s important that the Apache Hadoop core remains strong in order to avoid forking. A strong code base, rapid innovation and a vibrant ecosystem will ensure Apache Hadoop remains unified and well positioned to become the foundation for the next-generation data platform. This has always been our focus and we appreciate Forrester’s recognition of this commitment.

~E14

Apache Hadoop Reaches Milestone: Release 1.0.0

Congratulations! The Hadoop Community has given itself a big holiday present: Release 1.0.0! This release has been six years in the making, and has involved:

  • Hard work and cooperation from dozens of software developers and contributors from across the industry, including of course Doug Cutting and Mike Cafarella’s early work in Nutch and the founding Hadoop team at Yahoo, Doug, Owen O’Malley and many others, with leadership from Eric14.  Special thanks to all the Hadoop committers.
  • Commitment to stability, joined with testing and indispensable production experience at scale, at industry-leading companies like Yahoo!, Facebook, LinkedIn, and others, including hundreds of millions of compute-hours and exabytes of data processed.
  • Feedback from hundreds of knowledgeable users, data scientists, systems engineers and architects.
  • Commitment to the philosophy and practice of open source from Google, who published their seminal papers and have long supported Apache.
  • The Apache Software Foundation, which provided a structured home for the growth of the ecosystem and blossoming of multiple associated projects.

