September 11, 2013

Meet the Committer: 3 Minutes on Apache Sqoop with Venkat Ranganathan

We’re continuing our series of quick interviews with Apache Hadoop project committers at Hortonworks.

This week Venkat Ranganathan discusses using Apache Sqoop for bulk data movement between Hadoop and enterprise data stores. Sqoop can also move data the other way, from Hadoop into an enterprise data warehouse (EDW).

Venkat is a Hortonworks engineer and Apache Sqoop committer who wrote the connector between Sqoop and the Netezza data warehousing platform. He also worked with colleagues at Hortonworks and in the Apache community to improve integration between Sqoop and Apache HCatalog, delivered in Sqoop 1.4.4.

Better Sqoop/HCatalog integration resolved known data fidelity issues with two very common use cases:

  • importing data from enterprise data stores into Apache Hive, and
  • exporting data from Hive into relational data stores.



Importing into Hive with Sqoop

Originally, Sqoop only supported text formats for importing into Hive. Text fields with embedded field and record delimiters caused errors when the data was imported into Hive.

This meant that users who wanted more efficient (non-text) Hive storage formats had to import the data as text first, then insert it from the imported table into a new table in the desired format, adding steps that slowed processing.
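The two-step workaround looked roughly like the following sketch. The connection string, credentials, and table names are illustrative, not from the interview:

```shell
# Pre-Sqoop-1.4.4 workaround (illustrative names and connection details):
# Step 1: import from the source database as delimited text into a staging Hive table.
sqoop import \
  --connect jdbc:netezza://dw-host:5480/sales \
  --username etl_user -P \
  --table ORDERS \
  --hive-import \
  --hive-table orders_staging

# Step 2: copy the staging table into an ORC-backed table,
# paying for a second full pass over the data.
hive -e "CREATE TABLE orders STORED AS ORC AS SELECT * FROM orders_staging;"
```

The second step is pure overhead: it exists only because the import itself could not write a non-text format.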

To fix these issues, Venkat and the rest of the engineers in the Apache Sqoop community adjusted the code so that an HCatalog table can now be the direct target of a Sqoop import. They did this without sacrificing Sqoop's automatic schema mapping.

Also, the team is continuing to improve HCatalog integration with Sqoop by enabling high speed connectors (such as the Netezza connector) to work with HCatalog tables.

Now HCatalog abstracts the storage formats and makes Sqoop jobs agnostic to them. Since formats like RCFile and ORCFile readily handle text fields containing delimiter characters, there is no need to massage the data before importing it into Hive.
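With the HCatalog integration, the same import becomes a single step. A minimal sketch, using Sqoop's HCatalog options; the connection details and table names are again illustrative:

```shell
# Single-step import directly into an ORC-backed HCatalog table
# (illustrative names and connection details):
sqoop import \
  --connect jdbc:netezza://dw-host:5480/sales \
  --username etl_user -P \
  --table ORDERS \
  --hcatalog-database default \
  --hcatalog-table orders \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile"
```

The `--hcatalog-storage-stanza` option lets the user pick the storage format at import time; Sqoop itself never needs to know how the bytes are laid out on disk.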

Exporting out of Hive with Sqoop

Another advantage of Sqoop/HCatalog integration concerns data moving in the other direction: exports from Hive into RDBMS stores. Previously, Sqoop export jobs could only export text files. Now a Sqoop export job can use any HCatalog table as the source of an export to a relational database, regardless of the format of the data in that table. Users can easily export formats such as SequenceFile, RCFile, or ORCFile.
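An export works symmetrically: the HCatalog table is named as the source instead of the target. A sketch with illustrative names and a hypothetical target database:

```shell
# Export an ORC- or RCFile-backed HCatalog table to a relational database
# (illustrative names and connection details):
sqoop export \
  --connect jdbc:postgresql://db-host:5432/reporting \
  --username etl_user -P \
  --table orders_summary \
  --hcatalog-database default \
  --hcatalog-table orders_summary
```

Because HCatalog handles deserialization, the same command works whether the underlying files are text, SequenceFile, RCFile, or ORCFile.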

Learn more about Sqoop at the Apache Sqoop project site.

