Top Questions from Get Started with Big Data in the Cloud ASAP: Webinar Recap
Last week, we hosted the Get Started with Big Data in the Cloud ASAP webinar with speakers Shaun Connolly of Hortonworks and Tony Baer of Ovum. The webinar provided a very informative overview of the challenges enterprises face with the overwhelming number of choices available in the cloud. It covered how businesses can get over that hurdle and focus on a "lift and reshape" cloud strategy that enables them to take full advantage of the benefits of their cloud deployment.[1]
Some great questions came across during the webinar, and as promised, here is a brief capture of that Q&A and the slides.
- What is the best way to choose the right cloud, given the many cloud vendors in the market?
A: The best way to start is to understand your cloud strategy over the next few years and work with a vendor that can grow with that strategy and stay flexible as it evolves. Some customers have a preference for certain cloud vendors, depending on whether they already have other services with that vendor or whether that vendor is compatible with other third-party apps and services.
- Is it a good idea to have "hybrid clouds," from both a technical and a business point of view?
A: We see a lot of customers that use a hybrid approach, especially businesses with over $1 billion in revenue that have been established in the data center for a long time. We are starting to see more customers going cloud first, especially organizations under $1 billion in revenue. In these situations, they often consider running their entire business, team, or department in the cloud.
- Which is a more graceful way to transition to the cloud from our current structure (i.e., the data center): using managed cloud as a platform and then managing our own clusters, or using managed cloud as software (with HDP pre-configured)?
A: This depends on whether you want to manage your own environment or not. If you prefer a managed Hadoop-as-a-Service (the managed cloud-as-software option), where the vendor manages your cloud infrastructure and provides support, Microsoft Azure HDInsight is a powerful option.[2]
If you want a Platform-as-a-Service (the managed cloud-as-a-platform option), which is a more self-service-oriented solution for selecting pre-tuned workloads for Data Science and Exploration, ETL & Data Preparation, and Analytics, Hortonworks Data Cloud for AWS is a great option.[3]
Both choices can be considered a "graceful" way to transition to the cloud. But keep in mind: depending on your situation, the transition could take time, and ultimately there might not be a full transition to the cloud (due to regulatory or data security requirements). Be sure to consider how you will span both data center and cloud, at least for a period of time.
- In your experience, do your clients tend to do lift and shift or lift and reshape?
A: To date, most of the activity has centered around "lift and shift," owing to the tactical nature of early cloud workloads, such as conducting test/development or launching new standalone cloud-native workloads. But as we see the growth of managed services, we expect that the tide will turn. We expect that the brunt of new big data workloads deployed to the cloud will follow the "lift and reshape" pattern, both because of the need for simplification and the reality of data gravity. We also expect that over time, organizations that have lifted and shifted heartbeat workloads, such as online transaction systems, will gradually look for new optimization opportunities as process transformation opportunities arise.
- Do you see Apache Hadoop getting real-time/near-real-time?
A: The definition of Hadoop has expanded quite a bit over the years; today it accommodates a growing array of processing and storage engines and can support a variety of workloads through YARN. Hadoop, in conjunction with core and related open source technologies such as Apache Hive LLAP, Apache Beam, Apache Kafka, and Apache Spark, is becoming more supportive of real-time interactive and streaming workloads. While we don't expect Hadoop to replace data warehouses, we do expect that advances in Apache projects and hardware (such as Flash and emerging NVRAM high-speed storage technologies) will enable Hadoop to take on more real-time workloads.[4]
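To make the streaming point concrete, here is a minimal sketch of a Spark Structured Streaming job reading from Kafka; the broker address and topic name are hypothetical, and the job assumes the spark-sql-kafka connector package is on the classpath.

```python
from pyspark.sql import SparkSession

# Hypothetical broker and topic; run with the Kafka connector on the classpath,
# e.g. spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream")
          .load())

# Kafka delivers raw bytes; cast the payload to a string and stream it to the console.
query = (events.selectExpr("CAST(value AS STRING) AS event")
         .writeStream
         .outputMode("append")
         .format("console")
         .start())

query.awaitTermination()
```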
- How is security ensured in a cloud environment?
A: Securing any environment, cloud or otherwise, involves looking at the system from multiple perspectives and trying to minimize the area of exposure. You start at the network and work your way into the data sets. You want to make sure endpoints and communications are protected, all the way to controlling access to data based on authentication, authorization, and data encryption (all powered by technologies such as Apache Ranger, Apache Atlas, and Apache Knox).[5]
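As an illustration of the authorization layer, here is a rough sketch of creating an HDFS access policy through Apache Ranger's public REST API; the admin URL, credentials, service name, path, and user are all hypothetical placeholders, and the field names follow Ranger's public v2 policy model.

```python
import requests

# Hypothetical Ranger admin endpoint and credentials; adjust for your cluster.
RANGER_URL = "http://ranger-admin:6080/service/public/v2/api/policy"

policy = {
    "service": "cluster1_hadoop",                 # hypothetical Ranger service name
    "name": "analytics-read-only",
    "resources": {
        "path": {"values": ["/data/analytics"], "isRecursive": True}
    },
    # Grant one (hypothetical) user read/execute, and nothing else, on that path.
    "policyItems": [{
        "users": ["data_scientist"],
        "accesses": [{"type": "read", "isAllowed": True},
                     {"type": "execute", "isAllowed": True}],
    }],
}

resp = requests.post(RANGER_URL, json=policy, auth=("admin", "admin"))
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))
```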
- As a refresher, what are the main things that I need to focus on regarding big data analytics in the cloud?
A: At a high level, evaluate the workloads that your organization is running and those that are on the wish list. The workloads best suited for the cloud are those that are highly changeable and/or volatile. These are workloads that might be extremely transient, fired up only to address a specific problem.
Begin like you would any IT project: start small with a pilot, then steadily grow and learn from success. As you monitor projects and track resource consumption and service levels against requirements, you can determine whether specific workloads are affordable. Understand that changing the mix of compute, storage, and service levels impacts the cost of the workloads.
Keep in mind that when running in the cloud, you are managing elastic compute, not capacity. This provides ample opportunity to experiment and find the best combination for the workload. You can prioritize workloads based on whether they merit reserved, on-demand, or spot pricing (a rough cost sketch follows the questions below). And of course, don't neglect security. When deciding whether to store specific data sets in the cloud and/or run workloads there, ask yourself the following questions:
- Can I trust the data to be stored in the public cloud?
- What data security/governance policies will be applicable for specific workloads/data sets that are deployed in the public cloud?
- Are there any relevant data sovereignty issues that must be factored in?
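To make the pricing trade-off mentioned above concrete, here is a back-of-the-envelope sketch; the hourly rates are invented purely for illustration, and real prices vary by provider, instance type, region, and commitment term.

```python
# Invented per-node hourly rates for illustration only; real prices vary
# by provider, instance type, region, and commitment term.
RATES = {"on_demand": 0.40, "reserved": 0.25, "spot": 0.12}  # USD per node-hour

def monthly_cost(pricing, nodes, billed_hours_per_day, days=30):
    """Rough monthly compute cost for one workload under a pricing model."""
    return RATES[pricing] * nodes * billed_hours_per_day * days

# A nightly ETL job running 4 hours on 10 nodes: on-demand bills only while
# running, a reservation bills around the clock, and spot is cheapest but
# can be interrupted mid-run.
print("on-demand: $%.2f" % monthly_cost("on_demand", 10, 4))   # $480.00
print("reserved:  $%.2f" % monthly_cost("reserved", 10, 24))   # $1800.00
print("spot:      $%.2f" % monthly_cost("spot", 10, 4))        # $144.00
```

At these made-up rates, a short nightly job is far cheaper on demand than on an always-billed reservation, while a 24/7 workload would flip that comparison; spot pricing suits fault-tolerant, interruptible work.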
- Do you see more Data Scientists running workloads in the cloud?
A: Yes, we are seeing data science become a very common workload for the cloud. Data Scientists need access to good-sized data sets for model building and validation, and their workload profile can vary greatly over time. The integration with cloud storage, coupled with the agility of cloud infrastructure, makes "Data Science + Cloud" a great combination. Check out this white paper, Powering Data Science with Apache Spark in the Cloud[6]; it shows how to best take advantage of the cloud for solving data science problems. By the way, Hortonworks has a great solution for data science workloads in the cloud: Hortonworks Data Cloud for AWS.[7]
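For a flavor of the "Data Science + Cloud" combination, here is a minimal PySpark sketch that reads a data set from cloud storage and fits a model; the bucket path and column names are hypothetical, and reading s3a:// paths assumes the hadoop-aws connector and AWS credentials are configured on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Hypothetical bucket and column names; substitute your own data set.
spark = SparkSession.builder.appName("cloud-data-science-sketch").getOrCreate()
df = spark.read.parquet("s3a://example-bucket/clickstream/")

# Assemble the feature columns into the single vector column Spark ML expects.
assembled = VectorAssembler(
    inputCols=["page_views", "session_seconds"],
    outputCol="features").transform(df)

model = LinearRegression(featuresCol="features",
                         labelCol="conversions").fit(assembled)
print("coefficients:", model.coefficients)
```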
- On AWS, which one is preferred: using AWS EMR or deploying HDP on EC2 instances?
A: With Hortonworks Data Platform (HDP) deployed directly on EC2, you have access to the most configuration and customization options, all powered by a certified HDP stack. HDP is 100% open source and developed by the talent and contributors of the open source community around Apache Hadoop, and you can obtain best-in-class enterprise support directly from Hortonworks. In addition, you pay for the AWS infrastructure to run HDP in the cloud.
With AWS EMR, you get a package of Hadoop projects. You pay for the AWS infrastructure to run it in the cloud, in addition to support from AWS.
Of course, we believe going the HDP route is the preferred option, since Hortonworks focuses on providing the expertise and open source leadership to help make your data processing deployment successful.
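For a taste of the customization available when running HDP directly on EC2, here is a rough sketch of registering a cluster blueprint through Apache Ambari's REST API; the host, credentials, stack version, and component list are hypothetical placeholders, and a real blueprint typically needs more components (e.g., ZooKeeper) to pass Ambari's topology validation.

```python
import requests

# Hypothetical Ambari server and credentials; adjust for your deployment.
AMBARI = "http://ambari-host:8080/api/v1"
AUTH = ("admin", "admin")
HEADERS = {"X-Requested-By": "ambari"}  # Ambari requires this header on write calls

# A minimal (illustrative) blueprint: one host group running HDFS and YARN roles.
blueprint = {
    "Blueprints": {"blueprint_name": "single-node-sketch",
                   "stack_name": "HDP", "stack_version": "2.6"},
    "host_groups": [{
        "name": "master",
        "cardinality": "1",
        "components": [{"name": "NAMENODE"}, {"name": "DATANODE"},
                       {"name": "RESOURCEMANAGER"}, {"name": "NODEMANAGER"}],
    }],
}

resp = requests.post(AMBARI + "/blueprints/single-node-sketch",
                     json=blueprint, auth=AUTH, headers=HEADERS)
resp.raise_for_status()
```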
- We have production data hosted in the cloud in many isolated SQL databases, which is obfuscated and collated in a different store for analytical purposes. Do you suggest NoSQL or document-based storage for optimal big data analysis, or have you seen highly successful big data solutions in SQL databases?
A: This answer depends on many factors, including the size of your data set, the types of processing you plan to perform, and the type of access you plan to provide to your end users. The cloud certainly is a quick and easy way to evaluate big data solutions for your workloads. Check out Hortonworks Data Cloud for AWS.[8]
If you didn't get a chance to watch the webinar, you can check out the replay here:
Get Started with Big Data in the Cloud ASAP: Maximize Agility, Minimize Time-to-Benefit[9]
To learn more, explore Hortonworks Cloud Solutions.
References
- ^ Get Started with Big Data in the Cloud ASAP (hortonworks.com)
- ^ Microsoft Azure HDInsight (hortonworks.com)
- ^ Hortonworks Data Cloud for AWS (hortonworks.com)
- ^ Apache Hive LLAP (hortonworks.com)
- ^ Apache Ranger (hortonworks.com)
- ^ Powering Data Science with Apache Spark in the Cloud (hortonworks.com)
- ^ Hortonworks Data Cloud for AWS (hortonworks.com)
- ^ Hortonworks Data Cloud for AWS (hortonworks.com)
- ^ Get Started with Big Data in the Cloud ASAP: Maximize Agility, Minimize Time-to-Benefit (hortonworks.com)