Posts by Lisa Sensmeier:


Optimizing Hadoop for Microservers

There are plenty of server and storage options for the wave of data that is being collected and analyzed.  New platforms such as Apache™ Hadoop® provide the opportunity to make all the new data types being collected useful.  However, as with any other platform, performance varies depending on the underlying servers.  There is great promise in what Hadoop can deliver in terms of business value, and the ecosystem keeps growing as companies make Hadoop easier to deploy and manage.

One area that has seen huge advances is the data center server.  Power and cooling requirements have become an important issue for data centers, and the major vendors are all focused on helping the industry become cleaner and greener.  AMD SeaMicro has been a leader in this area: it reimagined the server and pioneered the dense, fabric-based microserver, with technology that interconnects pools of resources over a supercompute fabric with 1.28 Tbps of bisection bandwidth and can access more than five petabytes of direct attached storage.  The SeaMicro Freedom™ Fabric removes the constraints of the traditional server and allows data centers to expand in multiple dimensions without adding unneeded hardware and costs. Hadoop does not need the fastest processor, but it does need servers that are affordable and easy to scale out as the amount of data collected and analyzed increases.

The data center server is the key underlying infrastructure that enables all of these new innovative services.  Though the amount of data being collected is unlimited, data center capacity clearly is not.  The industry is realizing that data center servers need real innovation that extends beyond the individual server components and takes into account the end-to-end perspective encompassing compute, storage and networking. It’s time to re-imagine the data center server, and deliver what the industry needs.  Companies are experiencing problems that just cannot be solved with traditional servers.

To hear more about how microservers can improve your Hadoop performance and minimize operations, join AMD SeaMicro and Hortonworks for a provocative discussion in the June 18 webinar: How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers, hosted by the Linux Journal.

To learn more about AMD SeaMicro visit: www.seamicro.com

For more on delivering a modern data architecture for your business, click here.

 

 

Hadoop Use Case: Harnessing Big Data in the Social Advertising Industry

Successful social advertising campaigns today take a special blend of data intelligence and automation, enabling businesses to link fluctuations in media and tactics to sales and revenues.  Those with better data than their competitors will be positioned to outperform their peers tactically and, if the data is used effectively, strategically.  At one of the fastest growing Advertising Technology startups, harnessing Big Data made big sense in a highly competitive business environment.

The Advertising Technology startup sells Social Ad Campaign management software and wanted its in-house engineering team to focus on its core product and to outsource certain areas of its non-core technology needs. The non-core portion of its technology stack required cutting edge computing skills and entailed creating a Big Data Analytics infrastructure built on the Hortonworks Data Platform (HDP), and hosted on the Amazon Cloud.

A key component of the system was a scalable crawler that aggregates social data while meeting demanding latency requirements.  The crawl infrastructure had to meet two requirements: (1) support the timely refresh of data (e.g. existing social profile data), and (2) keep up with the exponential growth of data collection requirements.  Both requirements led the startup logically to a Hadoop-based framework, with Hortonworks Data Platform (HDP) as the platform of choice.

Another factor the Advertising Technology startup needed to address was reducing the cost of designing and maintaining a highly available and scalable HDP infrastructure with robust analytics and a predictive modeling backend to meet evolving business initiatives.  To accomplish this, the startup enlisted the services of Serendio, provider of DisKoveror, a Big Data Science platform designed to enable enterprises to aggregate, discover, analyze, visualize, and predict business outcomes from seemingly unrelated facts and relationships buried in all forms of digital assets, for holistic intelligence and insights.

Harnessing Big Data made big sense for the Advertising Technology startup: it created new revenue streams and reduced operational costs by 60%.

“We chose Hortonworks Data Platform (HDP) and Serendio for our implementation because collectively they had the right technology, skills and expertise to scale our infrastructure rapidly to keep up with our fast growing business.”

Thank you to our partner Serendio for this HDP (Hadoop) use case. For more use cases, visit Serendio’s case studies

Serendio’s Big Data Science solutions help in driving Decisions and Actions for a wide variety of businesses in Retail, Insurance, Media, Education, and Healthcare. Visit Serendio at http://www.serendio.com

For more on delivering a modern data architecture for your business, click here.

Get Started with Analytics with Alteryx and Hortonworks

It’s an exciting time in the analytics space. The promise of big data analytics is driving big investment in the companies that are multiplying the benefits of big data by putting it into the hands of business users.

Talk of Big Data and the Ramifications

Every day we hear of the coming benefits of big data. Some benefits have huge ramifications for us – think about how faster medical diagnoses will impact you and your family. Some benefits are more incremental – saving time by having retailers suggest purchases based upon people like us. Businesses are seeing these benefits starting to come to fruition – being more frugal with marketing efforts by targeting only the customers that are most likely to respond, or being able to retain customers that might otherwise switch to another provider. These capabilities drive quantifiable benefits because they are forward looking – they can suggest actions that change outcomes.

Investments in Companies to drive Big Data benefits

As businesses reap these benefits, investors are seeing a tremendous growth opportunity in the companies that can provide these insights. In the past week we have seen significant investments in two of these companies, Alteryx and Tableau, both partners of Hortonworks. But beyond providing insight, there is another key difference in what these companies offer.

Access to Analytics is Multiplying the Benefits

Instead of providing analytics capabilities that are accessible only to IT developers and data scientists, a new breed of tools is making analytics available to business people: the people who understand the business questions and can find the answers. Alteryx and Tableau are great examples of companies that have developed tools that provide analytics capabilities on a personal level, for business people who can quickly master them and put them to use.

Alteryx Project Edition can access the data in Hortonworks Data Platform (HDP) and perform statistical, geographic, and even predictive analytics. But the ability to add context by blending Big Data stored in sources such as HDP with other data sources using Alteryx gives analysts a deeper understanding of the reason for a trend and helps predict future outcomes. Alteryx also offers a range of packaged industry-specific analytic solutions that enable organizations to overlay internal data with U.S. census data and syndicated data from dozens of providers, including Dun & Bradstreet, Experian, TomTom, and many others.

Three steps to implementing Hadoop-based Analytics

Alteryx and Hortonworks have just released a whitepaper, “The Business Analyst’s Guide to Hadoop.” Business analysts can read this paper to learn about Hadoop and get a practical three-step guide to implementing Hadoop-based analytics.  Get the paper and begin experiencing the results that are generating so much excitement in the world of analytics.

Download “The Business Analyst’s Guide to Hadoop” Whitepaper to learn more.

Mobile Telco Dials In and Harnesses Big Data with Hadoop

Smartphones have transformed our daily lives. A key indicator of this trend is our increased spend on data plans versus voice. We are a new generation of people who are in a constant state of activity, communication, and community building wherever we go, including the couch in front of the television where we can multi-screen and multi-task!

What does this mean for the Mobile Telecom industry?  One of the top five mobile phone service providers in the world develops and manages advanced data services for European countries, including mobile internet access for various devices, mobile email, instant messaging, news, weather updates and traffic reports. For this provider, it means that as mobile data services grow in revenue, so does the need to monitor that contribution easily and accurately. While that sounds obvious, mobile telecom has grown so rapidly that the company’s existing systems could not keep up. And once the business leaders had the data, they would not trust it. Making accurate business decisions at the right time was essential for the company’s success and growth.

Big Data Challenge

The customer, a Mobile Telecom giant, had an existing method for determining business performance that was ad hoc and decentralized. There was no single system to extract the information in a reliable and consistent manner. “We had a mix of systems and information which needed lots of cross-checking – if indeed this was even possible. Getting access to data took a long time and, even then, the business users in marketing had no real confidence in the information they were getting.” This in turn compromised their ability to develop and manage these services.

In order to gain market share and stay competitive, the customer had to be able to:

  • Leverage the data from mobile usage to get accurate information about real customer activity to provide improved levels of customer satisfaction.
  • Spot upcoming trends in mobile use to drive intelligent marketing.
  • Improve the information on customer usage, which drives the changes needed to their service offerings, such as the ability to offer the latest mobile phone technologies.
  • Handle large volumes of data, be easily configurable by in-house business users, and provide graphical representations of the results

Hadoop Solution

The strategy included harnessing Hadoop to handle the large volume of data, 36 terabytes, that had to be consolidated into a single environment. Our Mobile Telecom customer decided to use Actuate, a Hortonworks partner whose technology, based on the open source Business Intelligence and Reporting Tools (BIRT) project, connects analytics capabilities directly to Hadoop. Actuate’s ability to report directly against the Hadoop big data source allows business users to generate on-demand analytics and reports consisting of thousands of pages in a matter of seconds through an easy-to-use web portal, with negligible training.

The Mobile Telecom giant now has a single source of clean data they can stand behind with absolute confidence in making the right decisions to stay competitive, and keep customer satisfaction levels high.  In addition, the consumer data services division is now in a position where it can replace several of its older systems, dropping extra licenses and hardware, because of the ability to do all of its business analytics in one place.

A Business Intelligence Analyst at the company stated: “It’s all automatic. Before, business users would be sending emails and calls to chase the data. Anyone across the whole business can have access to the information they need, and find it on their own. I particularly like the ability to drill down into the figures. You can now see at a glance what’s happening right across our activities.”

Customers want accurate and fast analytics reporting without a lot of training, so a partnership between Hortonworks and Actuate just makes big data sense.

Thank you to our partner Actuate for this Hadoop use case. Find more partners here.

 Actuate founded and co-leads the BIRT (Business Intelligence and Reporting Tools) open source project with the Eclipse Foundation, the home of the open source Eclipse Development Framework, the leading IDE worldwide. The BIRT project’s goal was to bring the web design metaphor to creating visualizations of data. 

Advanced Analytics: Making Decisions at the Speed of Business

Retailers today are faced with addressing the new behaviors of an evolving customer base by leveraging the changing landscape and its new dynamics.  Retail consumers online are sharing, friend validating, researching, learning and developing a point of view ─ offline they are touching, brand comparing and brand associating.  Retailers now more than ever before have to think in terms of “integrated commerce” and leverage Big Data for big results in the marketplace.

Forward-thinking organizations are discovering the possibilities of unconstrained analytics and quickly realizing the potential of accelerating the spread of analytics across the company, ultimately speeding up how they acquire new customers, respond to consumer and market change, and increase their “share of wallet”. Retail analysts want to spend more time in the analytic discovery process, and less time acquiring and preparing data, so they can uncover new market opportunities and reduce risks. Their goal is to create a sustainable competitive advantage that lets retailers predict consumer shopping patterns, increase market basket size by small percentages and better target new customers, which quickly translates into millions or billions of dollars.

Hortonworks partner ParAccel has an Analytic Platform with parallel, bi-directional integration between ParAccel and Hortonworks Data Platform, enabling cooperative analytic processing that leverages the data and analytic functions of both systems. The ParAccel Analytic Platform is built to run deep “in-database” analytics on massive amounts of data across systems, extending Hortonworks Data Platform for big data analytics. Joint customers find the integrated platforms provide a powerful, cost-effective solution for big data management and advanced analytics.

The architecture creates an open environment where analysts can bring in data from data warehouses and leverage data in Hortonworks Data Platform before or in the middle of a query. ParAccel also recently added support for HCatalog, making this integration speedy and efficient. It’s a great solution for offloading analytics from traditional platforms or bringing in internet, sensor data or normalized (structured) social media data. These out of the box modules give analytic-driven retailers access to the full range of data needed to make the right marketing, merchandising, and store operations decisions every time at the speed of business.

Learn more – join Hortonworks in the upcoming ParAccel webinar “Advanced Analytics on Hadoop Data” May 21st at 10 am PT.

Enterprise Big Data Analytics with Hortonworks and Datameer

Today, 94% of Hadoop users perform analytics on large volumes of data that were not possible before. How do they do it? Cool applications, that’s how.

You have seen various stats that indicate enterprises need better ways of making use of data but they bear repeating: The volume of business data worldwide, across all companies, doubles every 1.2 years, according to a study published by eBay in May, 2012. And market research firm IDC released a forecast showing the big data market may grow from $3.2 billion in 2010 to $16.9 billion in 2015. Clearly, enterprises need better ways of making use of all of this data, which contains innumerable insights for improving business processes and profitability.

Hortonworks partner Datameer has a horizontal application for big data discovery that includes self-service data integration, analytics and visualization on top of Hadoop, including pre-built analytic applications.

Beyond the core application, Datameer takes it one step further and offers pre-built analytic applications. Datameer’s Analytics App Market is the world’s first marketplace for buying and selling analytic applications that allows users to simply plug in their own data and see the final results visualized, without having to do the work of building the analysis.

The applications are downloaded with a single click, and range from broad, horizontal use cases that most any organization could utilize like email analytics or social media brand sentiment analysis to very specific use-case driven applications like Zendesk Forum analytics or JIRA ticket analyses. The best part is the marketplace is constantly growing as data scientists and subject matter experts from around the world create and contribute new applications for virtually any structured or unstructured data source.

Betting on Hadoop

Joe Nicholson, VP of marketing at Datameer, explains that the idea of Big Data analytics has exploded in the past 5 years. Business intelligence is not new, he said. What changed is the rise of so-called unstructured data. “Today, companies want to track things like customer paths taken through a website, email network usage, comments posted on websites or collaborative tools, or find the useful information hidden in millions and millions of tweets,” said Nicholson.

“There’s no way to do it all in any sort of timely fashion, or without breaking the bank, without first getting all of your data in Hadoop. So first and foremost comes the need to be able to get that data in yourself, without relying on IT. Then you want to point-and-click your way through your analysis and get instant feedback so you can analyze the same way you think. When you’ve built your analysis, that’s when you want to run it against your entire dataset. And finally, you want to visualize your results with just a few clicks. Import corporate logos, add text, make the report your own. We do all of that in Datameer, and we couldn’t do it if we hadn’t made this fundamental bet on Hadoop.”

Datameer partnered with Hortonworks back in 2011, and the two companies have been working together to accelerate the development and adoption of big data analytic solutions that leverage or extend the Apache Hadoop platform, and allow users to tap into the massive amounts of unstructured data.

The joint webinar conducted earlier this year, “Big Data Analytics: Is Your Elephant Enterprise Ready?” addressed critical project components such as data security, high availability, user training and use case development.

 

For more information on Datameer visit www.Datameer.com or @datameer

6 Key Hardware Considerations for Deploying Hadoop in Your Environment

There is a lot to consider when you deploy, configure, manage and scale Hadoop clusters in a way that optimizes performance and resource utilization. Here are 6 key things to think about as part of your planning:

  1. Operating system:  Using a 64-bit operating system helps to avoid constraining the amount of memory that can be used on worker nodes. For example, 64-bit Red Hat Enterprise Linux 6.1 or greater is often preferred, due to better ecosystem support and more comprehensive functionality for components such as RAID controllers.
  2. Computation: Computational (or processing) capacity is determined by the aggregate number of Map/Reduce slots available across all nodes in a cluster. Map/Reduce slots are configured on a per-server basis. I/O performance issues can arise from sub-optimal disk-to-core ratios (too many slots and too few disks). HyperThreading improves process scheduling, allowing you to configure more Map/Reduce slots.
  3. Memory: Depending on the application, your system’s memory requirements will vary. They differ between the management services and the worker services. For the worker services, sufficient memory is needed to manage the TaskTracker and FileServer services in addition to the sum of all the memory assigned to each of the Map/Reduce slots. If you have a memory-bound Map/Reduce Job, you may need to increase the amount of memory on all the nodes running worker services. When increasing memory, you should always populate all the memory channels available to ensure optimum performance.
  4. Storage: A Hadoop platform that’s designed to achieve performance and scalability by moving the compute activity to the data is preferable. Using this approach, jobs are distributed to nodes close to the associated data, and tasks are run against data on local disks. Data storage requirements for the worker nodes may be best met by direct attached storage (DAS) in a Just a Bunch of Disks (JBOD) configuration and not as DAS with RAID or Network Attached Storage (NAS).
  5. Capacity:  The number of disks and their corresponding storage capacity determine the total FileServer storage capacity of your cluster. Large Form Factor (3.5”) disks cost less and store more, compared to Small Form Factor disks. A number of block copies should be available to provide redundancy. The more disks you have, the less likely it is that you will have multiple tasks accessing a given disk at the same time. More tasks will be able to run against node-local data, as well.
  6. Network: Configuring only a single Top of Rack (TOR) switch per rack introduces a single point of failure for each rack. In a multi-rack system, such a failure will result in a flood of network traffic as Hadoop rebalances storage. In a single-rack system, this type of failure can bring down the whole cluster. Configuring two TOR switches per rack provides better redundancy, especially if link aggregation is configured between the switches. This way, if either switch fails, the servers will still have full network functionality. Not all switches have the ability to do link aggregation from individual servers to multiple switches. Incorporating dual power supplies for the switches can also help mitigate failures.
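To make the computation, memory and capacity points above a little more concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the HP report: the node specification, the per-slot memory figure, the memory reserved for worker services and the default of three block copies are all illustrative assumptions that you would replace with numbers from your own environment.

```python
from dataclasses import dataclass


@dataclass
class WorkerNode:
    """Hypothetical worker-node spec; every value here is an assumption."""
    map_slots: int
    reduce_slots: int
    memory_per_slot_gb: float
    service_memory_gb: float   # assumed reserve for worker services and the OS
    disks: int
    disk_capacity_tb: float

    @property
    def slots(self) -> int:
        # Total Map/Reduce slots configured on this one server.
        return self.map_slots + self.reduce_slots


def node_memory_gb(node: WorkerNode) -> float:
    # Memory for worker services plus the sum of memory assigned to each slot.
    return node.service_memory_gb + node.slots * node.memory_per_slot_gb


def slots_per_disk(node: WorkerNode) -> float:
    # A high ratio (many slots, few disks) hints at possible I/O contention.
    return node.slots / node.disks


def cluster_slots(node: WorkerNode, node_count: int) -> int:
    # Aggregate processing capacity across identical worker nodes.
    return node.slots * node_count


def usable_storage_tb(node: WorkerNode, node_count: int, block_copies: int = 3) -> float:
    # Raw disk capacity divided by the number of block copies kept for redundancy.
    raw_tb = node.disks * node.disk_capacity_tb * node_count
    return raw_tb / block_copies


if __name__ == "__main__":
    node = WorkerNode(map_slots=8, reduce_slots=4, memory_per_slot_gb=2.0,
                      service_memory_gb=4.0, disks=12, disk_capacity_tb=2.0)
    print("Memory needed per node (GB):", node_memory_gb(node))
    print("Slots per disk:", slots_per_disk(node))
    print("Cluster slots (10 nodes):", cluster_slots(node, 10))
    print("Usable storage (TB, 10 nodes):", usable_storage_tb(node, 10))
```

The sketch ignores space for intermediate job output, compression and future growth, so treat its results as a starting point for planning rather than a sizing recommendation.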

Thanks to HP for pulling this information together and testing Hortonworks Data Platform on HP hardware. For the full report, download the “HP Reference Architecture for Hortonworks Data Platform” whitepaper.

Integrating Apache Hadoop and SAP

With any enterprise software implementation, the challenge is often the integration of a chosen system with the existing enterprise systems architecture. One such existing investment may be an ERP system (and related systems) such as those provided by SAP. In this real-world instance, SAP partnered with Hortonworks to enable integration of Apache Hadoop into SAP Real-Time Data Platforms using Hortonworks Data Platform to facilitate business intelligence and analysis of Big Data.

The business challenges at hand will be familiar to everyone and are a great fit for a Hadoop solution. These are:

  • Data does not fit neatly in a relational format. The customer gathers more than one hundred million surveys each year. The most valuable data is in the “comments” field which is unstructured and therefore not analyzed.
  • The business cannot view data across departments. Customer training data, for example, is not typically joined across departments with the call center’s CRM application to help tailor a support call to the customer’s expertise.
  • Even if custom solutions are built to handle free-form, unstructured data like comment fields, and custom logic associates training and certification data with CRM data, there is no model to deal with the next unstructured data set or join together previously unrelated data in a powerful manner.

The customer, a major hardware manufacturer, has operated on the combination of the SAP ERP application, Oracle RAC, and SAP Sybase® IQ software for years. The company’s business processes, from customer relationship management (CRM) to inventory management, manufacturing, and fulfillment, all run on SAP software. Oracle RAC supports the system’s transactional data flow, and SAP analytics solutions are used to analyze and report on data stored in SAP Sybase IQ. This two-database architecture helps improve throughput by separating out transactional and analytic workloads.

The company chose to implement Hortonworks Data Platform to refine previously unstructured data sets and to begin to explore the relationships among previously unrelated data. Within the first half of the year, these explorations proved valuable. Today, the company enriches the view of the customer over time and across systems to improve customer satisfaction, leading to improved retention and repeat business.

New business capabilities that this enables include automatic support escalation, improved customer records, better customer insight and improved customer support.

We want to thank our partner SAP for documenting this with us. For more SAP and Hortonworks use cases, business impacts, architectural patterns and reference architectures, get the whitepaper: Combining SAP Real-Time Data Platform with Hortonworks Data Platform.

 

 

Seamless Reporting & Analytics for Apache Hadoop & Big Data Users

Jaspersoft, a Hortonworks certified technology partner, recently completed a survey on the early use of Apache Hadoop in the enterprise. The company found 38% of respondents require real-time or near real-time analytics for their Big Data with Hadoop. Also, within the enterprise, there is a diverse group of people who use Hadoop for such insights: 63% are application developers, 15% are BI report developers and 10% are BI admins or casual business users. Register for a free webinar to hear more.

So, for Hadoop users, the partnership between Hortonworks and Jaspersoft provides a good combination: Jaspersoft complements Hadoop-based Big Data systems with reporting and analysis through a full suite of ETL, Apache Hive, and native Apache HBase connectors for low-latency data exploration. Not only does the company have an open source model that empowers users to deploy Big Data reporting and analytics quickly and cost-effectively, but its pre-defined reports also make it easy for a wide group of users to gain and share immediate insight.

Jaspersoft joined the Hortonworks Technology Partner Program in 2012, extending advanced reporting capabilities to Hadoop users. The Hortonworks Technology Partner Program is designed to assist ISVs and other solution providers to integrate and extend their solutions for Hadoop, and includes a variety of technical enablement, technical support and training offerings. According to Hortonworks’ CTO Eric Baldeschwieler, “Jaspersoft’s industry-leading reporting, analysis, and dashboard products, together with the Hortonworks Data Platform, make it easy and cost-effective for customers to derive maximum insights and value from their largest data stores.”

Choosing the right analytical approach

As easy as this sounds, there are still several approaches to analyzing and reporting on Big Data and numerous use cases: web analytics, fraud detection, security monitoring and healthcare, just to name a few. Choosing the right approach depends on what insights you need and why you need them, and can make all the difference in how much value you extract from your data.

An upcoming webinar hosted by Hortonworks and Jaspersoft on March 13 will delve into the various architectural choices used in Hadoop reporting and analytics, and several use cases will be discussed. Register now.

 

The Hadoop Ecosystem: Big Data Analytics Meets Advertising (Webinar)

Please join Hortonworks, Impetus and Entravision/Luminar for a webinar on how big data analytics is being used in the advertising industry to identify predictability models of consumer behavior. The webinar will take place on Tuesday, February 12th at 1pm (EST), 10am (PST).

Register Now

Big data analytics is becoming increasingly useful to professionals in digital media, gaming, healthcare, security, finance and government, and nearly every industry you can name. Companies are analyzing vast amounts of data from various sources to shed light on customer behaviors, accelerate lead conversion, pinpoint security threats and enrich social media marketing efforts. In fact, new tools and technologies are making it easier to harness the power of Big Data and put it to use, and businesses are quickly uncovering valuable insights that were previously unavailable.

Entravision Communications Corporation is one company looking to reap the benefit of big data through careful analytics. The diversified Spanish-language media company has created an analytics, modeling and insights division, called Luminar, with the goal of expanding the value of its traditional advertisement services.

Luminar is the first big data analytics and modeling provider connecting marketeers with U.S. Latino consumers. The division was made possible by a partnership between Impetus Technologies, a Big Data thought leader, and Hortonworks, a leading commercial vendor who promotes and develops support for the Apache Hadoop platform. The two companies have partnered to create a solution for using the Hortonworks Data Platform powered by Apache Hadoop to access big data environments and third-party data sources.

Impetus’ experience with building big data frameworks like LaDaP has helped Luminar set up an analytics infrastructure that can scale linearly up to thousands of nodes using commodity hardware.

The successful launch of Luminar has included a number of powerful offerings, including an Insights App, a Customer Decision Engine, Real time Cloud Insights, and Analytics. Today, Luminar is helping clients identify predictability models of consumer behavior to allow companies to reach, upsell and retain Latino consumers more effectively.

Join Hortonworks, Impetus, and Entravision/Luminar on Tuesday to learn more about how Entravision is putting big data to work. This free webinar will explore how the company is leveraging big data to obtain valuable insights and expand the value of its traditional advertisement services.

Register Now

The Hadoop Ecosystem: Bigger Data on Your Budget (Webinar)

Please join Hortonworks and Appnovation for a webinar titled “Bigger Data on Your Budget” taking place on Wednesday, February 13th at 2pm EST, 11am PST.

Register Now

Appnovation is a new Hortonworks Systems Integrator partner that is focused on cutting edge open source technologies. They are experts in Drupal, Alfresco, SproutCore and now Apache Hadoop.

In advance of this webinar, I interviewed Dave Porter, Appnovation & SproutCore Lead Developer, about the technologies they support and how Appnovation and Hortonworks are working together to provide big insights without breaking the bank.

Question: In your opinion, what are the best technologies to combine with Apache Hadoop?

Dave: Any stack is going to require a place to store your Hadoop insights, a way to get at that data (say, as a web API), and a way to view the data. My favorite stack is Hadoop for processing and storage, node.js for the web API, and SproutCore for the rich, data-driven sophistication that it brings to web application development. I also like MongoDB because it’s an agile and scalable open source NoSQL database.

Question: Why those technologies, and why is this solution unique?

Dave: Each interface (e.g. Hadoop to Mongo, Mongo to node) is clear, well established, and best-in-class. One of the biggest challenges to heterogeneous systems is cleanly translating the data formats between layers. This system doesn’t have that problem, because the data is JSON all the way down.

Hadoop and MongoDB work very well together, as do MongoDB and node. I’m a node acolyte myself, but I know that Ruby can do a good job here as well. If your dashboard needs are very simple – for example, reload to view an updated pie chart – then SproutCore is overkill. However, if you’re looking for an interactive, live-updating, drillable dashboard then SproutCore has all the tools you need to build sophisticated, data-driven rich web apps.

The best thing about this solution is that it’s high profile open-source from tip to toe. So just like Hadoop means bigger data on a smaller budget, this entire solution allows you to put insights gained from Hadoop in front of important eyeballs without licensing fees. Plus, all of these technologies are at the core of Appnovation’s competencies. We know how to build great products with each technology and we can provide ongoing support and peace of mind.

Question: What use cases can this solution solve? What’s the real value to customers here? 

Dave: Let’s say you’re a regional retail giant. Your inventory management system runs on an overnight batch cycle, so if some radio DJ in Framingham unexpectedly plugs Widget A and your Framingham store is sold out of it by 10AM, your inventory guy doesn’t know about it until the next morning and probably can’t restock until day 2. By that time, the DJ is talking about something else.

By moving your batch cycle analysis to Hadoop, you can scale your system with commodity hardware and run that batch cycle every two hours. Your inventory system knows that Framingham is selling more Widget As than usual by 10AM, and it knows you’re sold out by noon. The data pipes through the system almost instantly, and your SproutCore dashboard, which is open on your inventory guy’s computer and automatically updating itself, is flashing red forty-five seconds later. By 1PM, he’s got an overnight truck full of widgets scheduled from the warehouse to Framingham for arrival the next morning. You’ve cut your real-world, widget-on-the-shelf reaction time down from two days to less than one, allowing you to take quicker advantage of facts on the ground and increase your sales of Widget A.

It’s important to understand that Hadoop is very focused on the Big Data problem. It knows that its job is to crunch massive amounts of unstructured, opaque data down to small, structured insights as quickly and inexpensively as possible, and it’s very good at that job. What Hadoop doesn’t do is show you those insights in a way that makes sense to us humans. Taking the insights and getting them in front of your CEO’s eyeballs is still your responsibility. Luckily, there are a lot of great technologies to help you with that.

Conclusion

By attending this webinar from Hortonworks and Appnovation, you will get a better understanding of what Big Data is all about, the challenges associated with accumulating exceedingly large amounts of complex data, what your options are to handle this information, and most importantly, what this data can mean for your business once it has been translated into a usable format.

You don’t want to miss this webinar, so please register now.

The Hadoop Ecosystem: Unleashing the Marketing Potential of Big Data

The customer data that companies collect from websites, social media, blogs, digital advertising and mobile is exploding. And as big data gets bigger, the number of untapped insights available from analyzing that data is also growing exponentially. Marketers covet those insights as a way to better understand and engage with their customers and ultimately drive revenue, but how do they get to them?

According to Gartner, organizations that successfully integrate high-value, diverse new information types and sources into a coherent information management infrastructure will outperform their industry peers financially by more than 20 percent.* Fortunately, a new solution that combines Hortonworks Data Platform (HDP) with the expertise of eSage Group allows marketing professionals to extract value from Big Data, quickly and with relative ease.

We interviewed eSage’s Dean Bedard, COO, about how the combination helps marketers unleash the power of Big Data and put it to use:

Q. Why is eSage Group’s solution for mining big data important to marketing professionals?

Dean: Marketing organizations need a robust solution that can provide actionable customer and campaign insights from the large amounts of structured and unstructured data they collect.  These insights can be used to create better-targeted cross-channel campaigns and provide timely information to help tune marketing campaigns as they’re running. For example, a certain percentage of the original investment might be dispersed differently between digital advertising and social outreach at a certain point during the campaign, and big data can lend insight into what split will be most effective.

Q. How are eSage and Hortonworks working together to enable this insight?

Dean: eSage and Hortonworks are unleashing the potential of big data in a matter of weeks with a flexible solution that gives marketers an unlimited level of detailed cross-channel analysis that they previously didn’t have. HDP provides a big data foundation to efficiently store and process all this data, while eSage Group helps extract business intelligence through a combination of user-friendly analysis technology, deep understanding of marketing analytics and business-focused delivery methodologies. This combination of technology and process provides a robust, flexible and extremely efficient solution that allows rapid development of rich and powerful analytics.

Q. How do the two platforms interact?

Dean: eSage Group’s Enterprise Marketing Platform includes connectors to HDP that enable rapid extraction of the most valuable marketing data. Once the data is within the eSage platform, logic can be implemented for powerful cross-channel analytics and key performance indicators.  With this layer of intelligence in place, marketers can begin to make sense of data and gain the kind of insight they need to support and shape their efforts.

Q. Marketers typically aren’t very technical. Can they still use the platform?

Dean: Certainly. eSage Group provides marketers with access to Technical Business Analysts who understand the technology as well as the business needs. The Analysts can help Marketing personnel identify what goals to measure, how to measure them and what data is required, then work with IT to obtain that data and get it into user-friendly analysis tools. eSage enables data access and analysis using the business tools marketers are already familiar with, such as Microsoft Excel, PowerPivot for Excel and PowerView for SharePoint.

Q. So, the solution bridges the gap between enterprise data and marketing?

Dean: Absolutely!  Hortonworks can collect and process terabytes, even petabytes of both structured and unstructured data very cost-effectively. With eSage Group’s intelligence laid on top of the platform, marketers can now extract and analyze this information in a very cost-effective and rapid manner.

Conclusion

It’s clear that big data offers huge potential for marketing organizations that can uncover customer and campaign insights from the large volumes of structured and unstructured data they are collecting. Together, Hortonworks and eSage Group are helping marketing organizations to realize this value quickly and with relative ease.

For more information about how eSage Group and Hortonworks are partnering to make key information available to marketing organizations, please visit eSagegroup.com. You can also follow eSage Group on Twitter (@eSageGroup) or by reading their blog.

~ Lisa Sensmeier

 

*Gartner, July 2012

Proper Care and Feeding of Drives in a Hadoop Cluster: A Conversation with StackIQ’s Dr. Bruno

In a recent blog post, Hortonworks’ Steve Loughran discussed Apache Hadoop’s preference for JBOD-configured storage vs. the allure of RAID-0. As more enterprises move beyond the science experiment stage and begin deploying Hadoop in their production environments, they are learning that Hadoop is quite different from other services in their data centers, such as web, mail, and database servers. They are also learning that achieving optimal performance requires particular attention to how the underlying hardware is configured.

To find out more, we had a chat with Dr. Greg Bruno, VP of Engineering, and co-founder of StackIQ, a Hortonworks partner, about the real life implications of managing hard drives (HDDs) in a modern Hadoop cluster.

Q. Why isn’t it considered good practice to configure drives in Hadoop clusters as RAID-0 disk arrays?

A. Hadoop prefers a set of separate disks to the same set managed as a RAID-0 disk array. Read speeds are particularly important to the performance of a Hadoop cluster, and in his post, Steve makes the point that since drive speeds vary, and RAID-0 reads occur at the speed of the slowest disk in the array, a RAID-0 configuration may well be slower than a non-RAID configuration. The bigger issue, in my opinion, is reliability. If a set of disks is configured as a RAID-0 array, then one disk failure in that array will take that entire volume down, and if all the disks in a node are configured as a single RAID-0 array, then a single disk failure will take all the node’s data down. By configuring multiple disks in a RAID-0 array, you magnify the probability of that volume going offline due to a single disk failure and you maximize the amount of data that goes offline when that single failure occurs.
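The reliability argument can be put into rough numbers. The following is a minimal Python sketch, not anything from StackIQ or Hortonworks: it assumes drive failures are independent, and the 5% annual per-drive failure probability is a hypothetical value chosen only for illustration.

```python
def raid0_loss_probability(per_disk_failure_prob: float, disks: int) -> float:
    """Probability that a RAID-0 volume is lost in the period considered.

    A striped volume survives only if every member disk survives, so it
    fails with probability 1 - (1 - p)^n (independent failures assumed).
    """
    if not 0.0 <= per_disk_failure_prob <= 1.0:
        raise ValueError("per_disk_failure_prob must be between 0 and 1")
    if disks < 1:
        raise ValueError("disks must be at least 1")
    return 1.0 - (1.0 - per_disk_failure_prob) ** disks


# Hypothetical annual failure probability for one drive (not from the interview).
P_DISK = 0.05

for disks in (1, 6, 12):
    p_volume = raid0_loss_probability(P_DISK, disks)
    print(f"{disks:2d}-disk RAID-0 array: {p_volume:.1%} chance of losing the whole volume")
```

Under these assumptions, a 12-disk array is roughly nine times as likely to go offline in a year as a single disk (about 46% versus 5%), and each such failure takes all 12 disks' worth of data with it.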

Q: Modern servers have a lot of disks. What’s the impact of losing a single disk when you have 12 x 3TB drives in each node?

A: When a single drive fails and Hadoop is configured in its default state, the ENTIRE NODE gets taken offline. Back when servers typically had 6 x 1.5TB drives in them, losing a single disk would cause the loss of 0.02% of total storage in a typical 10PB, three-replica setup. With today’s hardware, typically 12 x 3TB drives per node, losing a single disk results in the loss of five times as much data.

Q: Aren’t today’s HDDs much more reliable than they used to be? Is it worth the extra work to handle the rare cases when a drive fails?

A: While drives are much more reliable than they used to be, they are still the cause of the lion’s share of support tickets in a Hadoop cluster. In fact, according to Bharath Mundlapudi, speaking from his time as a core Hadoop engineer at Yahoo, disk drive failures account for fully 50% of siteops trouble tickets. That’s more than three times the next highest source of tickets.

Q: What does that represent in real terms?

A: It represents a lot of work for systems administrators. How much depends on the size and age of the cluster in question. For example, Facebook, which has some very large clusters, reports that their failure detection and automated repair system is doing the work of approximately 200 full time system administrators.

Q: OK, but not many organizations have clusters that large. What about a typical enterprise setup?

A: Our experience indicates that a 1,000-node cluster containing 12,000 drives, for a total raw storage capacity of 48 petabytes, can expect about 3 drive failures a day in its third year of operation. Drive failure rates rise as the devices age. For a 500-node cluster, you’re looking at a drive failure every 17 hours or so.
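To see how those figures scale with cluster size, here is a small Python sketch. It assumes the failure rate scales linearly with the number of drives and that every node holds 12 drives; both are simplifying assumptions for illustration, not claims made by StackIQ.

```python
REFERENCE_DRIVES = 12_000          # drives in the 1,000-node cluster quoted above
REFERENCE_FAILURES_PER_DAY = 3.0   # failures per day quoted for that cluster


def failures_per_day(nodes: int, drives_per_node: int = 12) -> float:
    """Scale the quoted failure rate linearly with the drive count (assumption)."""
    drives = nodes * drives_per_node
    return REFERENCE_FAILURES_PER_DAY * drives / REFERENCE_DRIVES


def hours_between_failures(nodes: int, drives_per_node: int = 12) -> float:
    """Average gap between drive failures, in hours."""
    return 24.0 / failures_per_day(nodes, drives_per_node)


def implied_annual_failure_rate() -> float:
    """Fraction of drives failing per year implied by the reference figures."""
    return REFERENCE_FAILURES_PER_DAY * 365 / REFERENCE_DRIVES


for nodes in (1000, 500, 100):
    print(f"{nodes:5d} nodes: one failure every {hours_between_failures(nodes):.0f} hours")
print(f"Implied annual failure rate: {implied_annual_failure_rate():.1%}")
```

Linear scaling gives a failure about every 16 hours for 500 nodes, in line with the "every 17 hours or so" quoted above, and implies that roughly 9% of drives fail per year in the reference cluster.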

Q: Doesn’t this make it hard for the cluster operator to manage? How do they keep up?

A: Without the right tools and methodology, it is very difficult for cluster operators to manage clusters at scale. They typically have to write scripts to scan the cluster, detect disk failures, and report them. Then, once the offending drive has been replaced, commands must be run for the controller to recognize the new drive, OS commands need to be executed to format the drive, and then some Hadoop commands are required to add the disk back to the configuration.
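As an illustration of the kind of detection script described here, the following Python sketch probes each data directory on a node with a small write and read-back test. It is a simplified assumption of what such a script might look like, not StackIQ's software or Hadoop's own disk checker; the mount points are hypothetical, and the controller, formatting, and Hadoop re-integration steps are hardware-specific and not shown.

```python
import os
import tempfile
from typing import Dict, Iterable, Optional


def check_data_dir(path: str) -> Optional[str]:
    """Return None if a write/read probe succeeds, else a short error description."""
    try:
        # The probe file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"ok")
            probe.flush()
            os.fsync(probe.fileno())  # push the write through to the device
            probe.seek(0)
            if probe.read() != b"ok":
                return "read-back mismatch"
    except OSError as exc:
        # Covers missing mounts, read-only filesystems, and I/O errors.
        return f"{type(exc).__name__}: {exc}"
    return None


def scan_node(data_dirs: Iterable[str]) -> Dict[str, str]:
    """Map each failing data directory to its error; healthy ones are omitted."""
    failures = {}
    for data_dir in data_dirs:
        error = check_data_dir(data_dir)
        if error is not None:
            failures[data_dir] = error
    return failures


if __name__ == "__main__":
    # Hypothetical mount points; a real script would read them from the node's configuration.
    for path, error in scan_node(["/data/1", "/data/2"]).items():
        print(f"FAILED {path}: {error}")
```

A script like this still has to be run on every node and its output collected centrally, which is the part that becomes hard to maintain at scale.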

Q: Presumably it’s not quite as challenging for StackIQ customers?

A: StackIQ’s mission is to make cluster operation as painless as possible, which is why we have developed tools to manage the entire lifecycle of the disk. While we haven’t figured out how to get our software to physically pull a bad drive and replace it with a new one, we automate the rest of it: the initial deployment of the drive, detecting and reporting errors, and re-integrating the replacement drive into the configuration.

One of the features we’ve developed in StackIQ’s management software automatically configures chassis with LSI MegaRAID controllers into “JBODs”; that is, every disk in the chassis is configured as an individual device.

In addition, a user can specify which disk in the chassis should be the boot disk via an attribute (e.g., “bootdisk0”), and if an optional secondary boot disk attribute is specified (“bootdisk1”), then our code will configure both those disks as a “mirror” (RAID1) while still making all the other non-boot disks available to Hadoop as individual disks. A recent StackIQ customer made their purchasing decision on this feature alone, as they had recently gone through the painful exercise of changing a mid-size cluster’s RAID configuration by booting each server, one by one, catching a key press at the controller prompt, and fixing the configuration by hand. Not a fun exercise when you are under the gun from management to get a production cluster online.

Q: With that many drive failures, clusters will be chewing through disks at a brisk rate. That could get expensive. That works out to something like 1000 drives/year X $100/drive = $100k per year just for replacement drives.

A: True, which speaks to the need for software that makes the most efficient use of your resources. Intelligent, automated cluster management software can find faulty drives automatically and bring up a replacement drive quickly.
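The replacement-cost estimate in the question can be checked with a few lines of Python. This sketch takes the 3 failures per day quoted earlier for a 1,000-node cluster and the $100 per drive from the question; it counts hardware only and assumes the failure rate stays constant across the year.

```python
from typing import Tuple


def annual_replacement_cost(failures_per_day: float,
                            cost_per_drive: float,
                            days_per_year: int = 365) -> Tuple[float, float]:
    """Return (drives replaced per year, total hardware cost per year)."""
    drives_per_year = failures_per_day * days_per_year
    return drives_per_year, drives_per_year * cost_per_drive


drives, cost = annual_replacement_cost(failures_per_day=3, cost_per_drive=100)
print(f"{drives:.0f} drives/year, ${cost:,.0f} per year in replacement drives")
```

At 3 failures a day this comes to 1,095 drives and $109,500 a year, which the question rounds to 1,000 drives and $100k. Administrator time is not included.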

Q: Doesn’t automation take control out of the hands of the skilled cluster operators?

A: We believe it should be up to the cluster operator to set policies on how much automation to incorporate into their workflows. Our software reflects that philosophy, letting operators choose from a range of policies that runs from having the operator execute every command manually to a fully automated repair where all the operator needs to do is push in the new drive and let StackIQ’s software do the rest.

Q: Can’t this be done with a simple command script that runs on all nodes?

A: That might be workable in a homogeneous environment, where all the nodes are the same. But in the real world, different nodes require different configurations. Even the disks are likely configured differently in nodes within the clusters. Handling those variables in a static script would be very difficult to accomplish. For example, if your cluster expands over time, you may be adding chassis with different drive configurations. Static scripts wouldn’t be able to deal with this situation. The StackIQ management software has intimate knowledge of the hardware and software in the cluster, so it knows exactly how to handle each drive in each node across the entire cluster, even in a heterogeneous environment.

Conclusion

So there you have it. The folks behind StackIQ cluster management software agree with Steve Loughran’s recommendation to forgo RAID-0 for Hadoop clusters. In fact, they provide the management tools to make it easier to do. So take the advice of our experts, and configure your cluster servers as “Just a Bunch of Disks.”

For more information on StackIQ, please visit their website or follow their Twitter handle (@StackIQ). You can also follow Dr. Greg Bruno directly on his Twitter handle (@itsDrBruno).

~ Lisa Sensmeier

Teradata Webinar: Business Value with Big Analytics

Back in June we joined Teradata Aster in a webcast, “Back to the Future – MapReduce, Hadoop and the Data Scientist,” to highlight the benefits of Apache Hadoop and the role that data scientists are playing in big data. You can check out the replay here. The discussion focused on how big data architectures could bring more value to businesses using relational DBMS technology and Hadoop, and how the two can coexist.

On October 17th at 10am PDT, Teradata will host a webcast that raises the level and builds on the important theme of Hadoop and business value, recognizing that many are deeply involved with discovering the easiest and best way to bring their data to life. Teradata Aster plans to show how executives, analysts and IT managers can leverage breakthrough enterprise class big analytics solutions to inject innovative analytics into business processes for better data-driven decisions. All this while minimizing risk, maximizing ROI and accelerating time-to-value.

Read more or register for this webcast and join speakers Scott Gnau, President, Teradata Labs, Teradata Corporation, and Tasso Argyros, Co-President, Teradata Aster and get the inside scoop on Teradata Aster’s newest big analytics technology.

Answer Big Questions with Big Data

Partner Webinar Series

On September 18 at 10am PT/1pm ET we join our partner Datameer in a webcast aimed at providing answers to some common questions we hear in the industry. Specifically, what are some of the use cases that big data analytics is perfect for?

By looking at some common uses we are seeing, you’ll be able to envision how you can leverage the analytics results from your own data. Ultimately these analytics will lead to uncovering ideas for new business approaches you can use for a huge competitive advantage.

Obviously you need to weigh the costs involved so you can determine whether the payoff is worth the investment for your business. What should you consider when deciding whether Hadoop and big data analytics are going to pay off?

These questions will be the topic for our webinar on September 18 at 10am PT. Join our speakers Matt Schumpert, Director of Solutions Engineering at Datameer and Jim Walker, Director of Product Marketing at Hortonworks in this Big Data Analytics webcast.

Register here.

Hortonworks boasts a rich and vibrant ecosystem of partners representing a huge array of solutions that leverage Hadoop, and specifically Hortonworks Data Platform, to provide big data insights for customers. The goal of our Partner Webinar Series is to help communicate the value and benefit of our partners’ solutions and how they connect and use Hortonworks Data Platform.
