Data Integration Services & Hortonworks Data Platform

What’s possible with all this data?

Data Integration is a key component of the Hadoop solution architecture. It is the first obstacle encountered once your cluster is up and running. OK, I have a cluster… now what? Do I write a script to move the data? What is the language? Isn’t this just ETL with HDFS as another target? Well, yes…

Sure, you can write custom scripts to perform a load, but that is hardly repeatable and not viable in the long term. You could also use Apache Sqoop (available in HDP today), a tool for pushing bulk data from relational stores into HDFS. While effective and great for basic loads, Sqoop still leaves you to build the connections and transforms these flows require. Custom scripts and Sqoop are both viable alternatives, but they won’t cover everything, and you still need to be fairly technical to be successful.
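For a concrete point of reference, here is a minimal sketch of that kind of bulk load using Sqoop 1.x’s Java entry point (org.apache.sqoop.Sqoop.runTool); the JDBC URL, credentials, table and HDFS target directory are hypothetical placeholders, not values from this post:

    import org.apache.sqoop.Sqoop;

    // Minimal sketch: run a Sqoop import programmatically. This is
    // equivalent to "sqoop import ..." on the command line; all the
    // connection details below are hypothetical placeholders.
    public class OrdersImport {
        public static void main(String[] args) {
            String[] sqoopArgs = {
                "import",
                "--connect", "jdbc:mysql://dbhost/sales", // hypothetical source database
                "--username", "etl_user",
                "--table", "orders",
                "--target-dir", "/data/raw/orders",       // HDFS destination
                "--num-mappers", "4"                      // degree of parallelism for the load
            };
            System.exit(Sqoop.runTool(sqoopArgs));
        }
    }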

For wide-scale adoption of Apache Hadoop, tools that abstract integration complexity are necessary for the rest of us. Enter Talend Open Studio for Big Data. We have worked with Talend to deeply integrate their graphical data integration tools with HDP and to extend their offering beyond HDFS, Hive, Pig and HBase into HCatalog (the metadata service) and Oozie (the workflow and job scheduler).

Talend addresses four key concerns for those using HDP:

  • Bridge the skills gap – Not everyone has a PhD in computer science… Talend provides a graphical tool where you drag and drop pre-built components onto a canvas, configure them, and all the underlying code is generated for you. This is Java code that can be executed anywhere Java runs, and it can even be packaged as a service. You can also customize the code however you see fit or use it within another IDE. This radically simplifies the data load process. All you need to know is the basic configuration and voila!… your data is loaded.
  • HCatalog Integration – Hortonworks and Talend engineering teams have partnered to deliver HCatalog-specific components and functions deeply integrated with the Talend connectors. These components let you easily create, drop and modify tables and databases, check whether they exist, and more; when storing data, you can also choose HCatalog as a storage option (see the first sketch after this list). This gives the developer options within the Hive and Pig tools to integrate with HCatalog and share data and its structure much more easily. HCatalog then provides the metadata services for the underlying data and opens up the platform.
  • Connect to the entire enterprise – The enterprise is full of different sources and targets for data: databases, applications, files, services, even data warehouses and cubes. Integration with these resources is not always simple. We could take the top ten, provide connectors and call it a day, but enterprise data centers are not so homogeneous. With Talend we are able to present a palette full of options; in fact, they have over 400 connectors available. In this video, you can see us grab and parse an Apache log file in seconds using a component. These pre-tested components save integration time by providing proven APIs and schemas for these connections. Want to pull data from Salesforce.com? …drop a component, configure your login credentials, and your Salesforce metadata and data are at your fingertips.
  • Graphical Pig Script Creation – Talend also provides components that deliver Pig scripts without your writing a line of code. Components for join, aggregate, filter, cross and others are all included. Again, you drop a component, connect the schema, configure the function, and all the underlying code is written for you (see the second sketch after this list), making your time to delivery that much faster.
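To ground the HCatalog bullet above, here is a hedged sketch of the table management those components automate, written against the HCatClient Java API that ships with HCatalog (org.apache.hcatalog.api); the database, table and columns are hypothetical, and a generated Talend job would wrap comparable calls rather than this exact code:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hcatalog.api.HCatClient;
    import org.apache.hcatalog.api.HCatCreateTableDesc;
    import org.apache.hcatalog.data.schema.HCatFieldSchema;

    // Hedged sketch: register a table through HCatalog's metadata service
    // so Hive, Pig and MapReduce all see the same schema. All names here
    // are hypothetical placeholders.
    public class PageViewsTable {
        public static void main(String[] args) throws Exception {
            // Connects using the Hive/HCatalog configuration on the classpath.
            HCatClient client = HCatClient.create(new Configuration());

            List<HCatFieldSchema> columns = new ArrayList<HCatFieldSchema>();
            columns.add(new HCatFieldSchema("user_id", HCatFieldSchema.Type.STRING, "visitor id"));
            columns.add(new HCatFieldSchema("url", HCatFieldSchema.Type.STRING, "page requested"));

            // The "check for existence" concern is handled by ifNotExists:
            // the create becomes a no-op when the table is already registered.
            HCatCreateTableDesc pageViews =
                HCatCreateTableDesc.create("default", "page_views", columns)
                                   .ifNotExists(true)
                                   .build();
            client.createTable(pageViews);
            client.close();
        }
    }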
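Likewise, for the Pig bullet, here is a minimal sketch of the filter-and-aggregate style of pipeline such components emit, expressed through Pig’s embedded PigServer API and reading its schema from the hypothetical HCatalog table above via HCatLoader; the output path is also a placeholder:

    import java.util.Properties;

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    // Minimal sketch: the filter/group/aggregate kind of Pig job that the
    // graphical components generate, embedded in plain Java.
    public class ErrorPageCounts {
        public static void main(String[] args) throws Exception {
            // Use ExecType.LOCAL instead to test on a single machine.
            PigServer pig = new PigServer(ExecType.MAPREDUCE, new Properties());

            // HCatLoader pulls the schema from HCatalog, so no column list
            // needs to be redeclared here.
            pig.registerQuery(
                "views = LOAD 'default.page_views' "
                + "USING org.apache.hcatalog.pig.HCatLoader();");
            pig.registerQuery("errors = FILTER views BY url MATCHES '.*/error.*';");
            pig.registerQuery("by_user = GROUP errors BY user_id;");
            pig.registerQuery("counts = FOREACH by_user GENERATE group, COUNT(errors);");

            // Materialize the result in HDFS.
            pig.store("counts", "/data/out/error_counts");
            pig.shutdown();
        }
    }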

This approach can help all of your Hadoop-related projects move a lot faster so you can quickly move past the “where do I start?” question to the more interesting “what’s possible with all this data?”.

