We’re continuing our series of quick interviews with Apache Hadoop project committers at Hortonworks.
This week, Venkat Ranganathan discusses using Apache Sqoop for bulk data movement from enterprise data stores into Hadoop. Sqoop can also move data the other way, from Hadoop into an enterprise data warehouse (EDW).
Venkat is a Hortonworks engineer and Apache Sqoop committer who wrote the connector between Sqoop and the Netezza data warehousing platform. He also worked with colleagues at Hortonworks and in the Apache community to improve integration between Sqoop and Apache HCatalog, delivered in Sqoop 1.4.4.
Better Sqoop/HCatalog integration resolved known data fidelity issues with two very common use cases:
Originally, Sqoop only supported text formats for importing into Hive. Text fields with embedded field and record delimiters caused errors when the data was imported into Hive.
This meant that users who wanted to use more efficient (non-text) Hive storage formats had to import the data as text and then insert it from the imported table into a new table stored in the desired format. Those extra steps slowed processing.
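To make the delimiter problem concrete, here is a minimal Python sketch. It is not Sqoop or Hive code: the comma and newline delimiters, the sample rows, and the helper names `write_text` and `read_text` are all assumptions chosen for illustration. It simply shows how a text field that contains the field and record delimiters no longer round-trips through unescaped delimited text.

```python
# Illustration only: a naive delimited-text writer and reader.
# The delimiters and sample data are assumptions, not Sqoop or Hive internals.
FIELD_DELIM = ","
RECORD_DELIM = "\n"


def write_text(rows):
    """Serialize rows as delimited text with no escaping or enclosing."""
    return RECORD_DELIM.join(FIELD_DELIM.join(row) for row in rows)


def read_text(blob):
    """Parse delimited text by splitting on the record and field delimiters."""
    return [line.split(FIELD_DELIM) for line in blob.split(RECORD_DELIM)]


source_rows = [
    ["1", "plain comment"],
    # The second field embeds both the field and the record delimiter.
    ["2", "comment with, a comma\nand a newline"],
]

parsed = read_text(write_text(source_rows))

print(len(source_rows), "rows written;", len(parsed), "rows read back")
for row in parsed:
    print(row)
```

The second source row comes back split across two records with the wrong number of fields, which is the kind of fidelity problem described above.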
To fix these issues, Venkat and the rest of the engineers in the Apache Sqoop community adjusted the code so that an HCatalog table can now be specified as the target of a Sqoop import. They did this without sacrificing Sqoop's automatic schema mapping.
The team is also continuing to improve HCatalog integration with Sqoop by enabling high-speed connectors (such as the Netezza connector) to work with HCatalog tables.
HCatalog now abstracts the storage formats, which makes Sqoop jobs agnostic to them. Since formats like RCFile or ORCFile readily handle text fields that contain delimiter characters, there is no need to massage the data before importing it into Hive.
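As a rough sketch of what such an import could look like, the Python snippet below assembles a Sqoop command line that targets an HCatalog table. The `--hcatalog-table`, `--create-hcatalog-table`, and `--hcatalog-storage-stanza` options come from the Sqoop HCatalog integration. The JDBC URL, the table names, and the helper `build_hcatalog_import` are hypothetical, and a real job would also need credentials and other site-specific options. The snippet only prints the command; it does not run Sqoop.

```python
import shlex


def build_hcatalog_import(jdbc_url, source_table, hcatalog_table,
                          storage_stanza="stored as rcfile"):
    """Return the argument list for a Sqoop import into an HCatalog table.

    Hypothetical helper for illustration; credentials are omitted.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", source_table,
        # Import into an HCatalog table instead of a delimited text table.
        "--hcatalog-table", hcatalog_table,
        # Ask Sqoop to create the HCatalog table during the import.
        "--create-hcatalog-table",
        # Storage clause used when the table is created.
        "--hcatalog-storage-stanza", storage_stanza,
    ]


# Placeholder connection string and table names (assumptions).
cmd = build_hcatalog_import(
    "jdbc:mysql://db.example.com/sales", "ORDERS", "orders"
)
print(shlex.join(cmd))
```

Because the storage format is carried by the table definition, the rest of the command stays the same whichever format the stanza names.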
Another advantage of Sqoop/HCatalog integration concerns data moving in the other direction: exports from Hive into RDBMS stores. Previously, Sqoop export jobs could only export text files. Now a Sqoop export job can use any HCatalog table as the source of an export to a relational database, regardless of the format of the data in that table, so users can easily export data stored as SequenceFiles, RCFiles, or ORCFiles.
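The export direction can be sketched the same way. In this minimal Python example, `--hcatalog-table` and `--hcatalog-database` are options from the Sqoop HCatalog integration, while the JDBC URL, the database and table names, and the helper `build_hcatalog_export` are assumptions made up for illustration. Credentials are again omitted, and the snippet only prints the command.

```python
import shlex


def build_hcatalog_export(jdbc_url, target_table, hcatalog_table,
                          hcatalog_database=None):
    """Return the argument list for a Sqoop export sourced from HCatalog.

    Hypothetical helper for illustration; credentials are omitted.
    """
    args = [
        "sqoop", "export",
        "--connect", jdbc_url,
        # Destination table in the relational database.
        "--table", target_table,
        # Source HCatalog table; no storage format appears in the command.
        "--hcatalog-table", hcatalog_table,
    ]
    if hcatalog_database is not None:
        args += ["--hcatalog-database", hcatalog_database]
    return args


# Placeholder connection string and names (assumptions).
export_cmd = build_hcatalog_export(
    "jdbc:mysql://db.example.com/sales",
    "ORDERS_SUMMARY",
    "orders_summary",
    hcatalog_database="reporting",
)
print(shlex.join(export_cmd))
```

Nothing in the command names a file format; the job reads whatever storage format the HCatalog table metadata describes.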