Meet the Committer: 3 Minutes on Apache Sqoop with Venkat Ranganathan

We’re continuing our series of quick interviews with Apache Hadoop project committers at Hortonworks.

This week Venkat Ranganathan discusses using Apache Sqoop for bulk data movement from enterprise data stores into Hadoop. Sqoop can also move data the other way, from Hadoop into an EDW.

Venkat is a Hortonworks engineer and Apache Sqoop committer who wrote the connector between Sqoop and the Netezza data warehousing platform. He also worked with colleagues at Hortonworks and in the Apache community to improve integration between Sqoop and Apache HCatalog, delivered in Sqoop 1.4.4.

Better Sqoop/HCatalog integration resolved known data fidelity issues with two very common use cases:

  • importing data from enterprise data stores into Apache Hive, and
  • exporting data from Hive into relational data stores.

Importing into Hive with Sqoop

Originally, Sqoop supported only text formats for importing into Hive, and text fields with embedded field or record delimiters caused errors during import.

This meant that users who wanted more efficient (non-text) Hive storage formats had to import as text first, then insert the data from the imported table into a new table stored in the desired format, adding steps that slowed processing.
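
As a rough illustration of that older two-step workflow (the connection string, credentials, and table names here are placeholders), an import-then-convert sequence might have looked like this:

    sqoop import \
        --connect jdbc:mysql://db.example.com/sales \
        --username sqoop_user -P \
        --table transactions \
        --hive-import

    # Second step: convert the text-backed Hive table into an ORC-backed table
    hive -e "CREATE TABLE transactions_orc STORED AS ORC AS SELECT * FROM transactions;"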

To fix these issues, Venkat and the rest of the engineers in the Apache Sqoop community adjusted the code so that an HCatalog table can now be specified directly as the target of a Sqoop import. They did this without sacrificing Sqoop's automatic schema mapping.
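
For example, with Sqoop 1.4.4 an import can point straight at an HCatalog table (again, the connection string, credentials, and table names below are illustrative):

    sqoop import \
        --connect jdbc:mysql://db.example.com/sales \
        --username sqoop_user -P \
        --table transactions \
        --hcatalog-database default \
        --hcatalog-table transactions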

Also, the team is continuing to improve HCatalog integration with Sqoop by enabling high-speed connectors (such as the Netezza connector) to work with HCatalog tables.

Now HCatalog abstracts away the storage format, making Sqoop jobs format-agnostic. And since formats like RCFile or ORCFile handle text fields containing delimiter characters without trouble, there is no need to massage the data before importing it into Hive.
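
Sqoop can even create the HCatalog table as part of the import and declare its storage format with a storage stanza, for instance ORC (table and connection names here are placeholders):

    sqoop import \
        --connect jdbc:mysql://db.example.com/sales \
        --username sqoop_user -P \
        --table transactions \
        --hcatalog-table transactions_orc \
        --create-hcatalog-table \
        --hcatalog-storage-stanza "stored as orcfile"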

Exporting out of Hive with Sqoop

Another advantage of Sqoop/HCatalog integration concerns data moving in the other direction: exports from Hive into RDBMS stores. Before, a Sqoop export job could only read text files. Now, a Sqoop export job can use any HCatalog table as the source of an export to a relational database, regardless of the format of the data in that table, so users can easily export data stored as SequenceFiles, RCFiles, or ORCFiles.
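
An export reads much the same way: point the job at the HCatalog table that holds the data, whatever its underlying format (the JDBC URL and table names below are illustrative):

    sqoop export \
        --connect jdbc:postgresql://warehouse.example.com/analytics \
        --username sqoop_user -P \
        --table daily_summary \
        --hcatalog-table daily_summary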

Learn more about Sqoop at the Apache Sqoop project site.
