Category Archives: Hadoop Ecosystem


City Hall is Getting Schooled

Nothing happens in a vacuum anymore.  Cities now have the ability to use information collected from a massive variety of sources in order to help solve common city problems.  The information can arise from anywhere – tweets, blog posts, and meter readings can all serve to inform public officials (and citizens as a whole) about how to better interact in a data-drenched world.

Most famously, IBM’s Smarter Cities initiative looks at how city governments meet the needs of their expanding populations by using available resources more efficiently.  This is in direct contrast to the older practice of extracting ever-greater amounts of natural resources.  For example, optimizing how power plants distribute energy to city grids can alleviate power concerns during the summer months, when A/C usage creates huge power demands.  Data-driven insight into how to do this is always better than blind guesswork.

(IBM has a white paper about their smarter cities initiative.)

Yet just a single person can make a difference.  The Gothamist has an article about one observant filmmaker who recorded a video of NYC subway riders tripping over the same staircase step over the course of a single day.  He then uploaded the video to YouTube, where it immediately went viral.  What’s more impressive is that city workers went on to repair the staircase step later that same day.

The same can be said for StreetBump, a smartphone app reviewed by the Huffington Post.  The app uses a smartphone’s accelerometer to detect the jolt when a driver passes over a pothole or crack in the road, and records the exact GPS location where it happened.  This information can be relayed back to cities to improve road conditions more quickly and at a finer level of detail than otherwise possible.
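The idea of pairing accelerometer jolts with GPS positions can be sketched in a few lines of Python. This is only an illustration, not StreetBump’s actual algorithm: the `Reading` type, the `detect_bumps` function, the sample coordinates, and the 0.4 g threshold are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One hypothetical sensor sample: GPS position plus vertical acceleration."""
    lat: float
    lon: float
    vertical_g: float  # vertical acceleration in g, with gravity already removed


def detect_bumps(readings, threshold_g=0.4):
    """Return the (lat, lon) of every sample whose vertical jolt meets the threshold.

    The threshold is an assumed value for illustration only.
    """
    return [(r.lat, r.lon) for r in readings if abs(r.vertical_g) >= threshold_g]


# Made-up samples from a short drive.
readings = [
    Reading(42.3601, -71.0589, 0.05),
    Reading(42.3602, -71.0590, 0.62),   # sharp upward jolt
    Reading(42.3603, -71.0591, -0.10),
    Reading(42.3604, -71.0592, -0.55),  # sharp downward jolt
]
bumps = detect_bumps(readings)
print(bumps)
```

A real app would presumably also need to filter out speed bumps, door slams, and dropped phones, which a single threshold cannot do.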

Mayors of cities have also taken the lead in communicating with their constituents using big data-enabled technologies.  New Jersey’s Star Ledger recently ran a report on Cory Booker, the mayor of Newark, and his persistent use of technology to directly (and personally) address the needs of individual Newarkers.  In the past, he has accepted tweets asking him to fix potholes and repair stoplights, in an effort to make the position of mayor more accessible to the average person.

All of these data points can be used to improve the way we interact with our increasingly connected world.  Officials can use this information to help improve the lives of everyone and work toward creating more livable cities.

The Coming Majority: Mainstream Adoption and Entrepreneurship

Small companies, big data.

Big data is sometimes at odds with the business-savvy entrepreneur who wants to exploit its full potential.   In essence, the business potential of big data is the massive (but promising) elephant in the room that remains invisible because the available talent necessary to take full advantage of the technology is difficult to obtain.

Inventing new technology for the platform is critical, but so too is making it easier to use.

The future of big data may not be a technological breakthrough by a select core of contributing engineers, but rather a platform that allows common, non-PhD holding entrepreneurs and developers to innovate.  Some incredible progress has been made in Apache Hadoop with Hortonworks’ HDP (Hortonworks Data Platform) in minimizing the installation process required for full implementation.  Further, the improved MapReduce v2 framework also greatly lowers the risk of adoption for businesses by expressly creating features designed to increase efficiency and usability (e.g. backward and forward compatibility).  Finally, with HCatalog, the platform is opened up to integrate with new and existing enterprise applications.

What kinds of opportunities lie ahead when more barriers are eliminated?

The current situation is similar to that of data processing servers before cloud-based solutions like Amazon’s S3 and Elastic MapReduce (EMR).   In the early 2000s, entrepreneurs had to spend a great deal of time running and maintaining the in-house servers that ran their business.  When cloud-based solutions arrived, they allowed developers to focus on using servers to enhance their business rather than be bogged down by the servers’ limitations.  This revolution allowed a small 10-person startup to focus 100% of its attention on innovation and bringing value to its customers rather than on the limitations of the technology. Making the platform simpler and easier to use will have the same effect for big data.

Greater Adoption through Innovation

Enterprise Software

Buoyed by the efforts of the Apache Hadoop community, key enterprise software players have improved access to the platform.  Hadoop platforms like HDP democratize big data by providing easy-to-use and widespread access for the greater community.  Efforts like these help to push the technology past the early adopters to mass adoption markets.  However, companies at this level focus on the invention of the platform.  Sustainable technological growth arises only when companies use that invention in new, unexpected ways.

Business-to-Business (B2B) Applications

Beyond the large players like Yahoo! and Netflix, smaller (often non-Hadoop) operations have sprung up all across the country around the idea of big data.  One well-known example is Splunk, which created its own proprietary platform to process and analyze big data on a large scale for companies that need it.  The benefit of companies like Splunk is their ability to identify desired elements from a variety of sources – machine data, cloud architectures, visual dashboards, and Hadoop – and package their offerings into a single product.

Another, more recent entry is a Durham, NC-based company named EvoApp.  The company has built a big data platform called Bermuda that specializes in customer and market intelligence.  Continuing the trend begun by Splunk, it focuses primarily on analytics, betting that its speedy and accurate runtimes will be a significant differentiator in the marketplace.

Business-to-Consumer Applications

Startups are also working toward using big data to solve difficult problems for the everyday consumer.

One innovative use of big data is with a mobile app called Parker by Streetline. In major cities, locating empty parking spaces can be a major concern for commuters.  City governments and app developers alike are using big data to help car drivers locate available parking spaces more effectively by having modified parking meters broadcast their availability to the targeted servers that are paired with a notification system.

Another, The Climate Corporation, tailors its insurance policies based on weather-related risk factors that could negatively affect or potentially destroy entire crop yields.  The company uses big data to make weather and soil predictions to more intelligently bet against crop failure and issue policies accordingly.  The customer may not know (nor care) how the system works, but recognizes the value in being issued tailored insurance policies based on their personal risk factors.

Limits to Widespread Adoption

Imagine the possibilities of every high school student dreaming of the software possible with Hadoop in much the same way they now do for smartphone apps.  While technology champions are necessary to invent and evangelize young technologies, the real technological boom occurs when mainstream developers get involved and begin to push the limits of the platform.  As more startups innovate using big data technologies, we can look forward to seeing a new majority.

Big Data in Education (Part 2 of 2)

The following is Part 2 of 2 on data in education. The first article introduces the concept and application of data in education. The second article looks at recent movements by the Department of Education in data mining, modeling and learning systems.

Big data analytics are coming to public education. In 2012, the US Department of Education (DOE) was one of a host of agencies sharing in a $200 million initiative to begin applying big data analytics to their respective functions. The DOE targeted its $25 million share of the budget toward efforts to understand how students learn at an individualized level. This segment reviews the efforts enumerated in the draft paper released by the DOE on its big data analytics.

The ultimate goal of incorporating big data analytics in education is to improve student outcomes – as determined by common metrics like end-of-grade testing, attendance, and dropout rates. Currently, the education sector’s application of big data analytics is to create “learning analytic systems” – here defined as a connected framework of data mining, modeling, and use-case applications.

The hope of these systems is to offer educators better, more accurate information to answer the “how” question in student learning. Is a student performing poorly because she is distracted by her environment? Does a failing mark on the end-of-year test mean that the student did not fully grasp the year’s material, or was she having an off day? Learning analytics can help provide information to help educators answer some of these tough, real-world questions.

Data Mining to Answer Questions

Educational data mining is a major part in the move toward big data learning analytics. Recent trends in education have allowed researchers to amass large volumes of unstructured data. Structured data has been collected for years in the education sector, typically in the form of grades or attendance records. New methods of interactive learning have led to more unstructured data through intelligent tutoring systems, simulations, and learning games. This allows for the collection of richer data sets than previously possible, creating new research opportunities into students’ learning environment.

Educational data has several unique characteristics. Summarized:

…[E]ducational data is … hierarchical. Data at the keystroke level, the answer level, the session level, the student level, the classroom level, the teacher level, and the school level are nested inside one another. (DOE: Learning Analytics, pg. 18, 2012)

Thus, when a student answers a single question, several variables are being simultaneously analyzed.
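The nesting described in the quotation can be pictured as a tree of records. The Python sketch below is purely illustrative: the field names (`classrooms`, `students`, `sessions`, `answers`, `keystrokes`) and the sample values are assumptions, not a schema taken from the DOE paper. It shows how a single answer carries its session, student, and teacher context along with it.

```python
# Hypothetical nested record: school -> classroom/teacher -> student -> session -> answer.
school = {
    "school": "Example High",
    "classrooms": [
        {
            "teacher": "T-01",
            "students": [
                {
                    "student": "S-001",
                    "sessions": [
                        {"answers": [
                            {"question": "q1", "correct": True, "keystrokes": 14},
                            {"question": "q2", "correct": False, "keystrokes": 31},
                        ]},
                        {"answers": [
                            {"question": "q3", "correct": True, "keystrokes": 9},
                        ]},
                    ],
                },
            ],
        },
    ],
}


def iter_answers(school):
    """Yield (teacher, student, session_index, answer) for every answer in the tree."""
    for room in school["classrooms"]:
        for student in room["students"]:
            for index, session in enumerate(student["sessions"]):
                for answer in session["answers"]:
                    yield room["teacher"], student["student"], index, answer


rows = list(iter_answers(school))
# Count correct answers across all levels of the hierarchy.
correct = sum(1 for row in rows if row[3]["correct"])
print(len(rows), correct)
```

Flattening the tree this way keeps every level attached to each answer, so the same rows can be grouped by student, by session, or by teacher.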

Time is also an important factor. Do longer gaps between answers translate into better answers? Does a student spend too much time on the first parts of exams only to rush the latter parts?

The order, sequence, and context in which the questions are answered provide even greater amounts of data researchers can use to uncover patterns in student learning. Students may perform better when asked a series of increasingly difficult but related questions rather than random selections of questions from a common pot. The move toward adaptive testing in the GRE (standardized testing for graduate school) shows a trend toward this effort.

Researchers can use all of this data to answer important questions about what makes the best learning environment for students. Understanding important academic questions can help educators create models of student learning.

How the data is collected is important for its future usability. A challenge in receiving the influx of data will be to standardize it on the front end so it can be usefully dissected. This does not mean converting unstructured data to structured data, but rather finding intuitive methods of categorizing incoming information, similar to how YouTube has users categorize their videos during an upload. The DOE would need to be a standard-bearer for organizing how this information is incorporated into databases for use in modeling.

User Knowledge and Behavior Modeling

Monitoring “how” a student tests has enabled researchers to model student behavior effectively. Beyond simply getting the correct answer, how a student works toward that goal can be just as important:

• How long has the student taken between questions?

• What previous kinds of questions has the student gotten correct/wrong?

• What kind of hints does the student benefit from most?

Monitoring these interactions can help create a behavior profile for individual students that can help educators understand the specific processes a student goes through in order to grasp the material.
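A behavior profile of this kind could be assembled from a simple event log. The Python sketch below is an assumption-laden illustration: the event tuple layout, the `build_profile` function, and the sample data are invented here and do not come from any particular tutoring system. It covers two of the questions above: time between questions and accuracy by kind of question.

```python
from collections import defaultdict

# Hypothetical event log: (seconds since session start, question kind, answered correctly?)
events = [
    (0, "fractions", True),
    (40, "fractions", False),
    (130, "geometry", True),
    (150, "geometry", True),
]


def build_profile(events):
    """Summarize pacing and per-kind accuracy for one student's session."""
    # Time elapsed between each pair of consecutive answers.
    gaps = [later[0] - earlier[0] for earlier, later in zip(events, events[1:])]

    # kind -> [number correct, number attempted]
    tally = defaultdict(lambda: [0, 0])
    for _, kind, correct in events:
        tally[kind][1] += 1
        if correct:
            tally[kind][0] += 1

    return {
        "mean_gap_seconds": sum(gaps) / len(gaps) if gaps else 0.0,
        "accuracy_by_kind": {kind: c / n for kind, (c, n) in tally.items()},
    }


profile = build_profile(events)
print(profile)
```

An adaptive system could then use such a profile to decide, for instance, that this hypothetical student needs more practice on fractions than on geometry.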

Creating adaptive learning systems using these student behavior profiles can enhance the effect. Armed with the information of “how” a student learns, developers can then tailor future questions and hints designed to increase the retention and synthesis of information. Developers like DreamBox Learning and Knewton have created and released their versions of an adaptive learning system. Their software provides millions of ways students can work through the program based on how they complete their assignments.

Education Use-Cases

Educators and researchers have developed five major techniques for extracting value from educators’ big data.

• Prediction – for understanding the likelihood of expected events. For example, having the ability to know when a student intentionally misses a question despite actual ability.

• Clustering – Discovering data points that naturally go together. Useful for putting together students of similar academic ability.

• Relationship Mining – discovering relationships between variables and encoding them for later use. Useful for detecting if a student reliably gets the correct answer after seeking help.

• Distillation for human judgment – building visual models for human parsing to aid machine learning models.

• Discovery with models – meta-study using models developed using big data analytics.

Researchers believe these techniques will help educators more effectively guide students toward a more individualized learning process.
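To make one of these techniques concrete, clustering can be sketched with a tiny one-dimensional k-means written in plain Python. This is a minimal illustration under stated assumptions: the test scores are made up, the `kmeans_1d` function is written for this example, and real learning analytics systems would likely cluster on many variables with a dedicated library.

```python
def kmeans_1d(values, k, iterations=20):
    """Group numbers into k clusters with a basic k-means loop.

    Returns a list of k lists, ordered from the lowest cluster to the highest.
    """
    ordered = sorted(values)
    # Seed the centers by spreading them evenly across the sorted values.
    if k == 1:
        centers = [ordered[0]]
    else:
        centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]

    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each value joins its nearest center.
        for value in ordered:
            nearest = min(range(k), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return clusters


# Made-up end-of-grade test scores.
scores = [52, 55, 58, 71, 74, 76, 91, 93, 95]
groups = kmeans_1d(scores, k=3)
print(groups)
```

Here the three groups fall out of the data itself rather than from grade cutoffs chosen in advance, which is the point of clustering students by similar academic ability.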

What is striking is how these education use-cases overlap with other common uses of big data analytic systems. For example, commercial banks may use clustering algorithms for profiles of purchases that will allow them to more readily detect fraud in a system. These uses provide a framework for the creation of useful learning analytic systems.

Learning Analytic Systems

The implementation of all of these leads to the creation of a learning analytics system – a set of techniques that holds the promise of improving the academic outcomes of students. While similar systems have been in place in the commercial sector for years, the education sector has many challenges ahead before it truly becomes a success story.

Acquiring the data presents its own set of challenges. For college-age and mature students, data collection is not a major issue.  For school-age students, however, collectors must clear some hurdles to avoid potentially identifying individual students. Some hurdles are legal, while others are ethical. Regardless, this does slow down the overall process of collection.

The number and skill of data collectors is also an issue. Websites’ use of cookies is a common method by which companies can uniformly gather information. The DOE, however, would have to rely on thousands of school districts and networks of researchers to refine and certify data.

Even with its innate challenges, learning analytics represent a quantum leap in creating a customized learning environment for each student. Custom-fit learning curricula handed daily to each student, early detection systems designed to find the warning signs of potential disenrollment and dropouts, and multi-year learning plans designed to challenge rather than induce boredom are all made possible through the use of big data analytics.

Hadoop: A Powerful Weapon for Retailers

Big Data Shopping Bag

With big data basking in the limelight, it is no surprise that large retailers have been closely watching its development… and more power to them! By learning to effectively utilize big data, retailers can significantly mold the market to their advantage, making themselves more competitive and increasing the likelihood that they will come out on top as a successful retailer. Now that there are open source analytical platforms like Hadoop, which allow for unstructured data to be transformed and organized, large retailers are able to make smart business decisions using the information they collect about customers’ habits, preferences, and needs.

As IT industry analyst Jeff Kelly explained on Wikibon, “Big Data combined with sophisticated business analytics have the potential to give enterprises unprecedented insights into customer behavior and volatile market conditions, allowing them to make data-driven business decisions faster and more effectively than the competition.” Predicting what customers want to buy, without a doubt, affects how many products they want to buy (especially if retailers add on a few of those wonderful customer discounts). Not only will big data analytics prove financially beneficial, it will also present the opportunity for customers to have a more individualized shopping experience.

This all sounds very promising but the difficulty lies in the fact that there are many channels in the consumer business now, such as online, in-store, call centers, mobile, social, etc., each with its own target-marketing advantage. In order for retailers to thrive in the market, they must learn to manage and hone in on all (or at least most) of these facets of business, which can be difficult if you keep in mind the amount of data that each channel generates. Sam Sliman, president at Optimal Solutions Integration, summarizes it perfectly: “Transparency rules the day. Inconsistency turns customers away. Retailer missteps can be glaring and costly.” By making fast market decisions, retailers can increase sales, win and maintain customers, improve margins, and boost market share, but this can really only be done with the right business analytics tools.

Who’s doing it right?

One impressive example of analytics usage is @WalmartLabs, which deals with the social and mobile aspects of retail to redefine commerce for Walmart and help its customers have a more positive shopping experience. Through its Social Genome knowledge base, @WalmartLabs zones in on entities, relationships, and events in the social world (for instance, a tweet about a specific movie title) in order to send out appropriate suggestions to customers. “We do this using public data on the Web, proprietary data, and a lot of social media. From such data we identify interesting entities and relationships, extract them, augment them with as much information as we can find, then add them to the Social Genome.” @WalmartLabs uses its own, in-house data platform called Muppet that is meant to process data at lightning speed.

Sears is another retailer that is focused on the advantages of big data and is using Hadoop to develop its business. If you were able to make it to Hadoop Summit 2012, you had the chance to see Phil Shelley speak about the company’s use of Hadoop and provide some interesting insight about the benefits of the open source platform. (If you couldn’t make it, you can find the session slides here.) Through Hadoop, Sears is able to compare and organize information about product availability, competitors’ prices, local economic conditions, etc. Before Hadoop, Sears was only using 10% of the information it had in store and was using most of its money and resources on running price elasticity algorithms. Rachael King of the CIO Journal explains, “The company now offloads data from its mainframe computers onto servers using Hadoop to run algorithms that analyze the data and feeds the results back into the mainframe. The retailer is able to use 100% of the data it collects.”

Big Insights

Eric Williams, CIO at Catalina Marketing, offered some helpful information in an interview by Alison Bolen of SAS. According to Williams, retailers can use big data and business analytics to answer questions like, “what products are selling, what’s the association of one product to another, what do my consumers look like, what is the marketplace doing?” With 20,000 new products being introduced in the United States annually it is essential for companies to sort the information about all of these products in order to gauge which ones worked great and which ones were a total flop.

The sorting of information through platforms like Hadoop will allow for extensive feedback in finance, marketing, operations, sales, and other areas of a business, which, in turn, will offer a more “per-customer profitability” approach. Sales associates will be able to access information on the spot (through a mobile device, for example) about which products are up-and-coming or which items a customer may be interested in based on the questions they might ask in the store. So, not only will online shopping continue to become more and more personalized but in-store experiences will also be highly sensitive to what each customer is looking for.

A white paper by Keplar LLP goes through the process of using Hadoop for retail business analytics and offers a list of ways to use the information that is collected through different channels:

  • Learn more about the customer, including who she is, how she engages with the product, company or brand, how she feels about the product and what role she plays in evangelizing it to others
  • Identify ways to better tailor the product and service to that customer or customer segment, improving customer loyalty
  • Identify ways to improve the product for all users, by comparing the way that this customer used it with other customers. Are there particular workflows that customers struggled with or abandoned?
  • Identify new products / services to offer that customer, or a segment of customers made up of people like her
  • Grow customer lifetime value, and hence profit

The white paper explains that both consumer and product analytics are significantly affected by the presence of big data and to manage both of these, Hadoop is a great solution. Since it uses a parallel structure, Hadoop can run various analyses on smaller data sets which makes it easy for retailers to compare and contrast various products, customer feedback, and the mass of social information that is generated every minute of every day. The possibilities for a thriving retail business are endless.
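The parallel structure mentioned above follows the map-and-reduce pattern: the data is split into chunks, each chunk is analyzed independently, and the partial results are merged. The Python sketch below imitates that pattern on a single machine. It is not Hadoop code and uses no Hadoop API; the function names and the sample reviews are assumptions made up for the example.

```python
from collections import Counter
from functools import reduce


def map_chunk(reviews):
    """Map step: count product mentions within one chunk of the data."""
    return Counter(product for product, _text in reviews)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Made-up customer feedback, already split into three chunks.
chunks = [
    [("tv", "great picture"), ("phone", "battery dies")],
    [("tv", "too expensive"), ("tv", "love it")],
    [("phone", "nice camera")],
]

# On a cluster, each chunk would be processed on a different node;
# here the chunks are simply processed one after another.
partials = [map_chunk(chunk) for chunk in chunks]
totals = reduce(reduce_counts, partials, Counter())
print(dict(totals))
```

Because no chunk depends on any other during the map step, adding more machines lets more chunks be processed at the same time.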

Without a platform like Hadoop, retailers have to spend big bucks on designing the appropriate data warehouses for the information they collect. Hadoop doesn’t require a pre-defined schema, so storing and interpreting unstructured data like product descriptions or social media conversations between users becomes considerably easier.

Ready to Check Out?

In such a consumer-driven society it seems almost necessary to establish a system of organization that could help make sense of consumer behaviors and trends; Apache Hadoop is a smart (and affordable) way to do this. With the social and technological worlds advancing at such an incredible speed, online, mobile, and social consumerism is becoming more of a norm rather than an option. Retail companies can truly receive the most from their business (and provide a positive experience for customers) if they happily open their arms to the big data coming their way and simultaneously understand how to transform this data into a positive business model.

Search Data at Scale in Five Minutes with Pig, Wonderdog and ElasticSearch

Working code examples for this post (for both Pig 0.10 and ElasticSearch 0.18.6) are available here.

ElasticSearch makes search simple. ElasticSearch is built over Lucene and provides a simple but rich JSON over HTTP query interface to search clusters of one or one hundred machines. You can get started with ElasticSearch in five minutes, and it can scale to support heavy loads in the enterprise. ElasticSearch has a Whirr Recipe, and there is even a Platform-as-a-Service provider, Bonsai.io.
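To give a feel for what “JSON over HTTP” means, the Python sketch below builds (but does not send) a search request using only the standard library. The host `localhost:9200`, the index name `enron`, and the `subject` field are assumptions chosen for illustration; they are not taken from the linked code examples. The `_search` endpoint and the `query_string` query are part of ElasticSearch’s query interface, though details may differ between versions.

```python
import json
from urllib.request import Request


def build_search_request(host, index, text, size=10):
    """Build an HTTP search request whose body is a JSON query document."""
    body = {"query": {"query_string": {"query": text}}, "size": size}
    return Request(
        url=f"http://{host}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, index, and field names.
request = build_search_request("localhost:9200", "enron", "subject:meeting")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With a cluster actually running at that address, passing `request` to `urllib.request.urlopen` would send the query and return the matching documents as JSON.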

Apache Pig makes Hadoop simple. In a previous post, we prepared the Berkeley Enron Emails in Avro format. The entire dataset is available in Avro format here: https://s3.amazonaws.com/rjurney.public/enron.avro. Let’s check them out:

Read More

Big Data in Education (Part 1 of 2)

The following is Part 1 of 2 on data in education.  The first article introduces the concepts of how data is used in education.  The second article looks at recent movements by the Department of Education in data mining, modeling and learning systems.

Learning to Learn

The education industry is transforming into a 21st century data-driven enterprise.   Metrics-based assessment has been a powerful force that has swept the national education community in response to widespread policy reform.  Passed in 2001, the No Child Left Behind Act pushed the idea of standards-based education whereby schoolteachers and administrators are held accountable for the performance of their students.  The law elevated standardized tests and dropout rates as the primary way officials measure student outcomes and achievement.  Underperforming schools can be placed on probation, and if no improvement is seen after 3-4 years, the entire staff of the school can be replaced.

The political ramifications of the law inspire much debate amongst policy analysts.  However, from a data perspective, it is more informative to understand how advances in technology can help educators both meet the policy’s guidelines and work to create better student outcomes.

Measurements

The emphasis on measurable outcomes has shifted the priorities of schools toward capturing data linking student performance with positive outcomes – from primary through higher education.  Positive “outcomes” translate to higher student attendance, improved test scores, and more students matriculating into college.

Everything is being measured – suspension from school, end-of-term testing (also known as “high-stakes” testing), academic degree history of teachers, minutes of recess, and almost anything else that can be assigned a number.

Predictably, this has also led to an explosion of data – the education sector has accumulated 269 petabytes of information (and growing).  Further, they keep the data for at least 10 years, creating problems for storage and analysis.

In the past, all of these measurements went toward targeted statistical analyses to determine the correlative or causal effect different stimuli have on positive outcomes.  Studies have looked at topics from SAT psychometric techniques to the performance outcomes of school uniforms (which, interestingly, have no positive effect on students’ test scores).

A significant problem with this is the incredible number of variables that need to be accounted for in attempts to create an accurate reflection of the learning environment.  Not only must all those measurements be collected (which presents its own set of significant challenges), but they must also be replicated and compared across all other schools in the country. However, the data sets are simply too immense, pushing reviewers to take only tiny fractions of the data to perform their analysis.

Enter Big Data Analytics

There is an incredible opportunity to begin harvesting that information for the benefit of students everywhere.  The National Center for Education Statistics stores the equivalent of several libraries of information researchers can use for their analysis.   A big data platform would allow researchers to look beyond the tiny slivers of data gathered from individual schools and begin to work toward harnessing the power of the entire repository.

Startups and major companies are now turning their eye toward big data in the education sphere.

Civitas Learning is a young startup focused on using predictive analytics, machine learning, and recommendation engines to improve student outcomes.  The company built the largest cross-institutional learning data network in higher education to allow them to see major trends in grades, dropout and retention rates, access to online materials, and other metrics.

With a data set of over one million student records and over seven million course records, their software lets them detect known warning signs that lead to dropouts and poor performance.  Additionally, it will allow them to compare specific courses and degree paths that lead to attrition and also reveal which resources and interventions are most successful.

Traditional Analytics

IBM has been at the forefront of using large data sets in the education sphere. The significance of having one of the world’s greatest problem solvers turn its eye toward solving large problems in education is a powerful statement of the social good of technology.  While their research has not explicitly used Apache Hadoop, their work in data analytics can provide lessons for future tech forays into education.

IBM’s work with Mobile County Public Schools shows the impact information can have on schools in need.  When IBM entered the picture, the county was facing yet another increase in a dropout rate that was already at 48%.  The school system was in such dire straits that it was under threat of probation stemming from the No Child Left Behind law, which penalizes and disciplines schools with overall poor student performance.  To combat this, the county had instituted a dropout indicator tool based on data gathered about students and used it to inform decision-making at the county level. However, this approach hit a few road bumps.  As the IBM case study reads:

Having an early warning system to spot at-risk patterns among students is necessary, but not sufficient for dropout mitigation.  Schools systems must also have consistent retools for intervention and the means to carry them out effectively.

With lessons learned, the county then sought to turn the dropout indicator tool into an actionable early warning system for possible conflict in a student’s household – sending officers and social workers home with students to help mitigate family stressors.  In doing this, the county reversed years of stagnant or increasing dropout rates, ultimately lowering the rate by 3%.

Fixing Through Analysis

Repairing problems in the education system is not easy, but some attempt must be made to correctly identify the problem before looking for a solution.  Or, restated: you can’t fix what you can’t measure.   Collecting and analyzing data is not a perfect cure for every problem in our education system.  However, it is a good first step in a chain that will ultimately lift schools out of a cycle of failure and toward the top floor of success.

 

Part 2 of 2 in this series will dive into how the Department of Education is currently looking into big data to improve information gathering to affect policy.

Lessons from Anime and Big Data (Ghost in the Shell)

What lessons might the anime (Japanese animation) “Ghost in the Shell” teach us about the future of big data?  The show, originally a graphic novel from creator Masamune Shirow, explores the consequences of a “hyper”-connected society so advanced one is able to download one’s consciousness temporarily into human-like android shells (hence the work’s title).  If this sounds familiar, it’s because Ghost in the Shell was a major point of inspiration for the Wachowski brothers, the creators of the Matrix Trilogy.

The ability to handle, process, and manipulate big data is a major theme of the show, which focuses on the challenges a high-tech police unit faces in thwarting potential cyber crimes.  The graphic novel was originally created in 1991, long before the concept of big data had grown to prominence (and, for all intents and purposes, even before what we now think of as the internet…)

Visions of a “Big Data” Future

While such visions of an interconnected techno-future are common in anime, what makes Ghost in the Shell special is its treatment of the power of big data.  Technology is not used simply for its exploitative value, but as a means to create a greater, more capable society.  Data becomes the engine that drives an entire civilization towards achieving taller buildings, faster cars, and yes – even androids.

Big data puts many of Ghost in the Shell’s “technological advances” just within reach.  The show features almost instantaneous transfers of petabyte hard drives and facial recognition searches about as fast as a Google search.

Far off? Or is it?

So when will we be able to control androids?  Pretty soon, apparently.  Doctors and scientists have successfully linked the human nervous system with electronics, allowing amputees to make macro movements using artificial arms and finer movements with the arms’ fingers.

But the prosthetic arms go beyond simply allowing movement.  They also provide tactile sensory information, like vibrations and pressure (the sense of “touch”), by connecting the unit directly to the patient’s nerve ends.  More significantly, patients can learn to use the arm in as little as five hours.

The most amazing thing about science fiction is how fast it becomes science fact.

Are We There Yet?

Giving the human brain control over more and more of an artificial body is a major point of overlap between the anime and the progress of actual science. Treating the brain as data to be interfaced with, stored and transferred is not new to science or science fiction.  In general terms, the brain operates by sending electrical signals across a host of different connectors which receive, process and store information that the conscious mind can use to interact with the world around it.  Professor Paul Reber of Northwestern University estimates the human brain can hold 2.5 petabytes of data – information that can be transferred, managed, and processed like any other.

Naturally, the Apache Hadoop platform would be a fitting platform to handle the massive storage and analysis needs of this unstructured data.  While our understanding of how the brain works is in its earliest stages, the possibility of being able to capture the massive amount of data necessary for brain functions represents an alluring goal for many scientists.  These are high goals for a platform less than a decade old, but what better way to store the cornucopia of unstructured information in the human brain than Hadoop?

And Returning to Reality…

No one can know where science and technology will go in the future. However, the emergence of forward-looking shows that allow us to look and dream about a more advanced tomorrow certainly has enabled us to hope for the day when fiction like that shown in Ghost in the Shell becomes reality.

Recap of Hadoop Summit 2012

I wanted to take this opportunity to say thanks to the more than 2,200 attendees, speakers and sponsors that helped to make Hadoop Summit 2012 a great success. There was tremendous buzz throughout the conference, exceeding the excitement levels of all past Hadoop conferences. It’s a great indicator for the future of Apache Hadoop and the broader big data ecosystem.

The content from this conference was outstanding, from the opening keynotes to the last round of breakout sessions. I wanted to thank the track chairs (Abhishek Mehta, Ashish Thusoo, Avik Dey, Ben Reed, Peter Sirota and Val Bercovici) for making the hard decisions that led to such an outstanding agenda. I thought the group did a great job selecting the right mix of technical, use case and best practices sessions for developers, operators and analysts. I would also like to thank the more than 110 speakers for putting in the time and effort to share their Apache Hadoop experiences.

All of the sessions at this year’s conference were recorded and we are in the process of editing these videos for placement on the Hadoop Summit website. We have also posted most of the slides. Simply visit the Sessions page to access the slides and recordings.

I am pleased to announce that all of the keynote session recordings are now available. These include compelling presentations from the following speakers:

Geoffrey Moore (author of “Crossing the Chasm” and “Escape Velocity”)

Scott Burke (SVP, Advertising & Data, Yahoo!)

Dr. Philip Shelley (CTO, Sears)

Scott Gnau (VP and GM of R&D, Teradata)

Shaun Connolly (VP of Corporate Strategy, Hortonworks)

Eric Baldeschwieler (CTO, Hortonworks)

Also, if you have not yet seen the introductory video from Hadoop Summit, I strongly encourage you to watch it now (below). I have heard from quite a few folks that this video got them even more excited about the role they have played in the Apache Hadoop ecosystem.

(click HERE for a full screen version on Vimeo)

On behalf of this year’s co-hosts Hortonworks and Yahoo!, let me again thank everyone for their role in making Hadoop Summit 2012 such a success. Because of the emergence of Apache Hadoop as the foundation of the next generation enterprise data architecture, I have no doubt that next year’s conference will be even bigger and better. I can’t wait.

~ John Kreisa

Data Integration Services & Hortonworks Data Platform

What’s possible with all this data?

Data Integration is a key component of the Hadoop solution architecture. It is the first obstacle encountered once your cluster is up and running. Ok, I have a cluster… now what? Do I write a script to move the data? What is the language? Isn’t this just ETL with HDFS as another target? Well, yes…

Sure you can write custom scripts to perform a load, but that is hardly repeatable and not viable in the long term. You could also use Apache Sqoop (available in HDP today), which is a tool to push bulk data from relational stores into HDFS. While effective and great for basic loads, there is work to be done on the connections and transforms necessary in these types of flows. While custom scripts and Sqoop are both viable alternatives, they won’t cover everything and you still need to be a bit technical to be successful.

For wide scale adoption of Apache Hadoop, tools that abstract integration complexity are necessary for the rest of us.  Enter Talend Open Studio for Big Data. We have worked with Talend in order to deeply integrate their graphical data integration tools with HDP as well as extend their offering beyond HDFS, Hive, Pig and HBase into HCatalog (metadata service) and Oozie (workflow and job scheduler).

Talend addresses four key concerns for those using HDP:

  • Bridge the skills gap – Not everyone has a PhD in computer science…  Talend presents a graphical tool where you drag and drop pre-built components onto a canvas and configure them, and then all the underlying code is created for you.  This is Java code that can be executed anywhere Java runs and can even be packaged as a service.  You can also customize the code however you see fit or use it within another IDE.  This radically simplifies the data load process.  All you need to know is the basic configurations and voila!… your data is loaded.
      
  • HCatalog Integration – Hortonworks and Talend engineering teams have partnered to bring HCatalog-specific components and functions that are deeply integrated with the Talend connectors.  Components allow you to easily create, drop and modify tables and databases, check for existence, etc. Also, when storing data you can choose HCatalog as a storage option.  This provides the developer with options within the specific tools for Hive and Pig to integrate with HCatalog and share data and its structure much more easily. HCatalog then provides the metadata services for the underlying data and opens up the platform.
  • Connect to the entire enterprise – The enterprise is full of different sources and targets for data.  These can be databases, applications, files, services and even data warehouses and cubes.  Integration with these resources is not always simple.  We could take the top ten and provide connectors and call it a day, but enterprise data centers are not so homogeneous. With Talend we are able to present a palette full of options; in fact, they have over 400 connectors available.  In this video, you can see us grab and parse an Apache log file in seconds using a component.  These pre-tested components save integration time by providing proven and tested APIs and schemas to make these connections.  Want to pull data from Salesforce.com?  Just drop a component, configure your login credentials, and your Salesforce metadata and data are at your fingertips.
  • Graphic Pig Script Creation – Talend also provides components to deliver Pig Scripts without writing a line of code.  Components for join, aggregate, filtering, cross and others are all included.  Again you drop a component, connect schema, configure the function, and then all the underlying code is written for you… making your time to delivery that much faster.
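
To make the log-parsing and filter/aggregate ideas above a little more concrete, here is a minimal hand-written sketch in Python.  To be clear, this is not what Talend generates (Talend produces Java, and its components are configured graphically); it only illustrates the kind of work such a component takes off your hands.  It assumes the log uses the Apache Common Log Format, and the names LOG_PATTERN, parse_line, count_by_status and SAMPLE_LINES are hypothetical, invented for this illustration.

```python
import re
from collections import Counter

# Assumed layout (Apache Common Log Format):
#   host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent; treat it as zero bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def count_by_status(lines):
    """Drop unparseable lines (filter), then count requests per status (aggregate)."""
    records = (parse_line(line) for line in lines)
    return Counter(r["status"] for r in records if r is not None)


# Hypothetical sample input: two well-formed lines and one line of junk.
SAMPLE_LINES = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /missing.html HTTP/1.0" 404 -',
    'not a log line',
]

if __name__ == "__main__":
    print(count_by_status(SAMPLE_LINES))
```

Even this toy version has to decide what to do with malformed lines and missing byte counts, which is exactly the sort of detail a pre-tested component is meant to handle for you.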

This approach can help all of your Hadoop-related projects move a lot faster so you can quickly move past the “where do I start?” question to the more interesting “what’s possible with all this data?”.

Kiss the Weatherman

Weather Hurts

Catastrophic weather events like the historic 2011 floods in Pakistan or prolonged droughts in the Horn of Africa make living conditions unspeakably harsh for tens of millions of families living in these affected areas.  In the US, the winter storms of 2009-2010 and 2010-2011 brought record-setting snowfall, forcing mighty metropolises into an icy standstill. Extreme weather can profoundly impact humankind.

The effects of extreme weather can send terrible ripples throughout an entire community.  Unexpected cold snaps or overly hot summers can devastate crop yields and force producers to raise prices. When food prices rise, it becomes more difficult for some people to earn enough money to provide for their families, creating even larger problems for societies as a whole.

The central problem is the inability of current forecasting models to accurately predict large-scale weather patterns.  Weathermen are good at predicting weather but poor at predicting climate.  Weather occurs over a shorter period of time and can be reliably predicted within a 3-day timeframe.  Climate stretches over many months, years, or even centuries.  Matching historical climate data with current weather data to make future weather and climate predictions is a major challenge for scientists.

Read More

Big Data in Genomics and Cancer Treatment

 

Why genomics?

Big data. These are two words the world has been hearing a lot lately, in connection with a wide array of use cases in social media, government regulation, auto insurance, retail targeting, etc. The list goes on. However, a very important topic that should receive the same (if not more) recognition is the presence of big data in human genome research.

Three billion base pairs make up the DNA present in humans. It’s probably safe to say that such a massive amount of data should be organized in a useful way, especially if it presents the possibility of eliminating cancer. Cancer treatment has been around since its first documented case in Egypt (1500 BC) when humans began distinguishing between malignant and benign tumors by learning how to surgically remove them. It is intriguing and scientifically helpful to take a look at how far the world’s knowledge of cancer has progressed since that time and what kind of role big data (and its management and analysis) plays in the search for a cure.

The most concerning issue with cancer, and the ultimate reason why it still hasn’t been completely cured, is that it mutates differently for every individual and reacts in unexpected ways with people’s genetic makeup. Professionals and researchers in the field of oncology have to recognize that each patient requires personalized treatment and medication in order to manage the specific type of cancer that they have. Elaine Mardis, PhD, co-director of the Genome Institute at the School of Medicine, believes that it is essential to identify mutations at the root of each tumor and to map their genetic evolution in order to make progress in the battle against cancer. “Genome analysis can play a role at multiple time points during a patient’s treatment, to identify ‘driver’ mutations in the tumor genome and to determine whether cells carrying those mutations have been eliminated by treatment.”

Read More

My Review of Hadoop Summit 2012

The fifth annual Hadoop Summit drew to a close last week, with over 2200 Hadoopniks in attendance. While there were many innovations demonstrated, for me the best action was about Pig, HCatalog and Hive from Hortonworks and Twitter.

At the Hadoop Summit Pig Meetup, Twitter announced Ambrose, which now includes an excellent graph layout of Pig EXPLAIN data. This visualization can be used to debug and better understand your Pig scripts.

Read More

Teradata Aster & Hortonworks Webinar on Thursday

I wanted to draw your attention to a Webinar taking place this Thursday at 1pm EDT, 10am PDT. “Back to the Future – MapReduce, Hadoop and the Data Scientist” will highlight the benefits of Apache Hadoop and the role that data scientists are playing in big data. The speakers include:

  • Colin White – Founder of BI Research, a leading research, education and consulting firm helping companies understand and benefit from evolving and leading edge technologies in the areas of business intelligence and data management.
  • Tasso Argyros – Co-President of Teradata Aster
  • Ari Zilka – Chief Products Officer for Hortonworks

Among the topics discussed during this free Webinar are:

  • MapReduce for the data scientist: Hadoop/Hive and RDBMS approaches
  • Back to the future: file systems vs. database systems
  • Hadoop and RDBMS coexistence strategies
  • Bridging the gap: new approaches for analyzing data using Hadoop

This promises to be a very interesting and informative presentation so please Register today.

~ Lisa Sensmeier

Introducing Hortonworks Data Platform v1.0

I wanted to take this opportunity to share some important news. Today, Hortonworks announced version 1.0 of the Hortonworks Data Platform, a 100% open source data management platform based on Apache Hadoop. We believe strongly that Apache Hadoop, and therefore, Hortonworks Data Platform, will become the foundation for the next generation enterprise data architecture, helping companies to load, store, process, manage and ultimately benefit from the growing volume and variety of data entering into, and flowing throughout their organizations. The imminent release of Hortonworks Data Platform v1.0 represents a major step forward for achieving this vision.

You can read the full press release here. You can also read what many of our partners have to say about this announcement here. We were extremely pleased that industry leaders such as Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata and VMware all expressed their support and excitement for Hortonworks Data Platform.

Those who have followed Hortonworks since our initial launch already know that we are absolutely committed to open source and the Apache Software Foundation. You will be glad to know that our commitment remains the same today. We don’t hold anything back. No proprietary code is being developed at Hortonworks.

Read More

An Advance Look at Hadoop Summit

Hadoop Summit is just around the corner and by that, I mean next week! There is still time to register for the conference but please do it soon as the conference is filling up quickly. Today is also the last day on which online registration will remain open. After today, you will need to register on-site at the conference itself.

This year’s Hadoop Summit conference, now in its fifth year, promises to be the biggest and best yet. In fact, there are already more people registered for Hadoop Summit 2012 than any other Hadoop conference ever!

I wanted to take this opportunity to share some of the highlights for next week’s conference:

Geoffrey Moore and Other Compelling Keynote Speakers:

Geoffrey Moore, author of “Crossing the Chasm” and “Escape Velocity”, will share his views on “Digitizing the World, the Driving Force Behind Apache Hadoop’s Adoption Life Cycle”. You will also hear from other industry luminaries, who will share their vision for where Apache Hadoop is going and how it is destined to become the foundation for the next generation enterprise data platform.

Read More
