Last week, we hosted the “Get Started with Big Data in the Cloud ASAP” webinar with speakers Shaun Connolly from Hortonworks and Tony Baer from Ovum. The webinar provided a very informative overview of the challenges enterprises face given the overwhelming number of choices available in the cloud. It covered how businesses can get over that hurdle and focus on a “lift and reshape” cloud strategy that enables organizations to take full advantage of the benefits of their cloud deployments.
Some great questions came in during the webinar, and as promised, here is a brief capture of that Q&A along with the slides.
A: The best way to start is to understand your cloud strategy over the next few years and work with a vendor that can grow with that strategy and remain flexible. Some customers prefer a certain cloud vendor, depending on whether they have other services with that vendor or whether that vendor is compatible with other third-party apps and services.
A: We see a lot of customers using a hybrid approach, especially businesses with over $1 billion in revenue that have been established in the data center for a long time. We are starting to see more customers going cloud first, especially organizations under $1B in revenue. In these situations, they often consider running their entire business, team, or department in the cloud.
A: This depends on whether you want to manage your own environment or not. If you prefer managed Hadoop-as-a-Service (the managed cloud-as-a-software solution), where the vendor manages your cloud infrastructure and provides support, Microsoft Azure HDInsight is a powerful option.
If you want Platform-as-a-Service (the managed cloud-as-a-platform), a more self-service-oriented solution for selecting pre-tuned workloads for Data Science and Exploration, ETL & Data Preparation, and Analytics, Hortonworks Data Cloud for AWS is a great option.
Both choices can be considered a “graceful” way to transition into the cloud. But keep in mind: depending on your situation, the transition could take time, and ultimately there might never be a full transition to the cloud (due to regulatory or data security requirements). Be sure to consider how you will span both the data center and the cloud, at least for a period of time.
A: To date, most of the activity has centered on “lift and shift,” owing to the tactical nature of early cloud workloads, such as test/development or launching new standalone cloud-native workloads. But as managed services grow, we expect the tide to turn: the brunt of new big data workloads deployed to the cloud will follow the “lift and reshape” pattern, both because of the need for simplification and the reality of data gravity. We also expect that, over time, organizations that have lifted and shifted heartbeat workloads such as online transaction systems will gradually look for new optimization opportunities as process transformation opportunities arise.
A: The definition of Hadoop has expanded quite a bit over the years; today it accommodates a growing array of processing and storage engines and can support a variety of workloads through YARN. Hadoop, in conjunction with core and related open source technologies such as Apache Hive LLAP, Apache Beam, Apache Kafka, and Apache Spark, is becoming more supportive of real-time, interactive, and streaming workloads. While we don’t expect Hadoop to replace data warehouses, we do expect that advances in Apache projects and in hardware (such as flash and emerging NVRAM high-speed storage technologies) will enable Hadoop to take on more real-time workloads.
A: Securing any environment, cloud or otherwise, involves looking at the system from multiple perspectives and minimizing the area of exposure. You start at the network and work your way in to the data sets. You want to make sure endpoints and communications are protected, all the way down to controlling access to data through authentication, authorization, and data encryption (all of which is powered by technologies such as Apache Ranger, Apache Atlas, and Apache Knox).
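To make the layering concrete, here is a minimal, purely illustrative sketch of the “authenticate first, then authorize per resource” idea described above. It is not real Ranger, Atlas, or Knox code; the policy table, `User` type, and `check_access` function are hypothetical stand-ins for what those technologies manage at enterprise scale.

```python
# Illustrative sketch only: the layered network -> authentication ->
# authorization idea from the answer above. NOT real Apache Ranger/Knox code.
from dataclasses import dataclass

# Hypothetical resource-based policies, similar in spirit to Ranger policies:
# which role may perform which actions on which data set.
POLICIES = {
    "sales_db": {"analyst": {"read"}, "admin": {"read", "write"}},
}

@dataclass
class User:
    name: str
    role: str
    authenticated: bool  # in practice established via Kerberos/LDAP, etc.

def check_access(user: User, resource: str, action: str) -> bool:
    """Deny unless every layer passes: authentication, then authorization."""
    if not user.authenticated:                      # authentication layer
        return False
    allowed = POLICIES.get(resource, {}).get(user.role, set())
    return action in allowed                        # authorization layer

# An authenticated analyst can read, but not write, the sales data set.
alice = User("alice", "analyst", authenticated=True)
print(check_access(alice, "sales_db", "read"))   # True
print(check_access(alice, "sales_db", "write"))  # False
```

The key design point is that each layer can only deny, never grant, on its own; a request must pass all of them, which is the “minimize the area of exposure” stance in practice.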
A: At a high level, evaluate the workloads your organization is running and those on the wish list. The workloads best suited for the cloud are those that are highly changeable and/or volatile: workloads that might be extremely transient, fired up to address a specific problem and then shut down.
Begin as you would any IT project: start small with a pilot, then steadily grow and learn from success. As you monitor projects and track resource consumption and service levels against requirements, you can determine whether specific workloads are affordable. Understand that changing the mix of compute, storage, and service levels impacts the cost of the workloads.
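As a back-of-envelope illustration of how the compute mix drives workload cost, the sketch below compares a steady, always-on workload against a transient one under the three common cloud pricing models. All rates and hours here are hypothetical placeholders, not actual AWS prices.

```python
# Rough cost-comparison sketch. The hourly rates below are made-up
# illustrative numbers, not real cloud provider pricing.
HOURLY_RATE = {          # $ per instance-hour (hypothetical)
    "on_demand": 0.40,
    "reserved": 0.25,    # effective rate after an upfront commitment
    "spot": 0.12,        # interruptible, bid-based capacity
}

def monthly_cost(pricing: str, instances: int, hours_per_month: float) -> float:
    """Estimated monthly cost of a workload under a given pricing model."""
    return HOURLY_RATE[pricing] * instances * hours_per_month

# A steady 24x7 cluster vs. a transient job that runs 40 hours a month.
steady = {p: monthly_cost(p, instances=10, hours_per_month=730) for p in HOURLY_RATE}
burst = {p: monthly_cost(p, instances=10, hours_per_month=40) for p in HOURLY_RATE}
print(steady)
print(burst)
```

Running numbers like these per workload is one simple way to decide which workloads merit reserved, on-demand, or spot pricing.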
Keep in mind that when running in the cloud, you are managing elastic compute, not fixed capacity. This provides ample opportunity to experiment and find the best combination for each workload. You can prioritize workloads based on whether they merit reserved, on-demand, or spot pricing. And of course, don’t neglect security. When deciding whether to store specific data sets in the cloud and/or run workloads there, ask yourself the following questions:
A: Yes, we are seeing Data Science become a very common workload for the cloud. Data scientists need access to good-sized data sets for model building and validation, and their workload profile can vary greatly over time. The integration with cloud storage, coupled with the agility of cloud infrastructure, makes “Data Science + Cloud” a great combination. Check out this white paper, Powering Data Science with Apache Spark in the Cloud, which shows how best to solve data science problems in the cloud. By the way, Hortonworks has a great solution for Data Science workloads in the cloud: Hortonworks Data Cloud for AWS.
A: With Hortonworks Data Platform (HDP) deployed directly on EC2, you have access to the most configuration and customization options, all powered by a certified HDP stack. HDP is 100% open source, developed by talented contributors from the open source community around Apache Hadoop, and you can obtain best-in-class enterprise support directly from Hortonworks. In addition, you pay for the AWS infrastructure to run HDP in the cloud.
With AWS EMR, you get a package of Hadoop projects. You pay for the AWS infrastructure to run it in the cloud, in addition to support from AWS.
Of course, we believe going the HDP route is the preferred option, since Hortonworks focuses on providing the expertise and open source leadership to help make your data processing deployment successful.
A: This answer depends on many factors, including the size of your data set, the types of processing you plan to perform, and the type of access you plan to provide to your end users. The cloud certainly is a quick and easy way to evaluate big data solutions for your workloads. Check out Hortonworks Data Cloud for AWS.
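To see how data set size alone shapes the answer, here is a rough sizing sketch. The HDFS default replication factor of 3 is real; the disk-per-node and headroom figures are hypothetical assumptions chosen only to illustrate the arithmetic.

```python
# Back-of-envelope sketch: worker nodes needed to hold a data set in HDFS.
# Replication factor 3 is the HDFS default; disk size per node and the
# usable-capacity fraction are hypothetical illustrative assumptions.
import math

def nodes_needed(raw_tb: float, disk_per_node_tb: float = 8.0,
                 replication: int = 3, usable_fraction: float = 0.7) -> int:
    """Estimate worker nodes required to store `raw_tb` of raw data."""
    stored = raw_tb * replication                        # replicas multiply footprint
    usable_per_node = disk_per_node_tb * usable_fraction # leave working headroom
    return math.ceil(stored / usable_per_node)

# e.g. 50 TB raw -> 150 TB replicated -> 150 / 5.6 usable TB per node -> 27 nodes
print(nodes_needed(50))
```

Processing type and user access patterns then adjust this baseline, which is why a quick cloud evaluation is often the fastest way to find the right size.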
If you didn’t get a chance to watch the webinar, you can check out the replay here:
To learn more about Hortonworks Cloud Solutions:
Hortonworks Data Cloud Documentation: http://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.14.4/index.html
“Get Started with HDCloud” Webinar: https://hortonworks.com/webinar/hadoop-in-the-cloud-aws/
5-day free trial of Hortonworks Data Cloud for AWS: https://aws.amazon.com/marketplace/pp/B01M193KGR