Apache Sqoop

Efficiently transfers bulk data between Apache Hadoop and structured datastores

Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks, such as ETL processing, from the enterprise data warehouse (EDW) to Hadoop for efficient execution at much lower cost. Sqoop can also extract data from Hadoop and export it to external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, PostgreSQL, and HSQLDB.
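For example, a single command can pull a relational table into HDFS. The following is a minimal sketch; the connection string, credentials, table, and target directory are illustrative placeholders:

    # Import the "orders" table from a hypothetical MySQL database into HDFS
    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username sqoop_user -P \
      --table orders \
      --target-dir /data/sales/orders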

What Sqoop Does

Apache Sqoop provides the following capabilities for bulk data movement between Hadoop and structured datastores:

• Import sequential datasets from mainframe: satisfies the growing need to move data from mainframes to HDFS
• Import directly to ORCFiles: improved compression and lightweight indexing for better query performance
• Data imports: moves certain data from external stores and EDWs into Hadoop to optimize the cost-effectiveness of combined data storage and processing
• Parallel data transfer: faster performance and optimal system utilization (see the sketch after this list)
• Fast data copies: from external systems into Hadoop
• Efficient data analysis: improves the efficiency of data analysis by combining structured data with unstructured data in a schema-on-read data lake
• Load balancing: mitigates excessive storage and processing loads on other systems
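Parallel transfer is controlled by the number of map tasks and a split column. A minimal sketch, assuming a table with a numeric primary key column named id (a hypothetical name):

    # Split the import across 8 map tasks, partitioned on the id column
    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username sqoop_user -P \
      --table orders \
      --split-by id \
      --num-mappers 8 \
      --target-dir /data/sales/orders

Each map task issues its own bounded query against the source table, so throughput scales with the number of tasks up to whatever load the source database can sustain.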

YARN coordinates data ingest from Apache Sqoop and other services that deliver data into the Enterprise Hadoop cluster.

How Sqoop Works

Sqoop provides a pluggable mechanism for optimal connectivity to external systems. The Sqoop extension API provides a convenient framework for building new connectors which can be dropped into Sqoop installations to provide connectivity to various systems. Sqoop itself comes bundled with various connectors that can be used for popular database and data warehousing systems.
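When the driver class is named explicitly, Sqoop uses its generic JDBC code path rather than a database-specific connector. A sketch of that fallback, where the Netezza URL and the org.netezza.Driver class name are assumptions for illustration:

    # Force the generic JDBC connector by naming the driver class directly
    sqoop import \
      --connect jdbc:netezza://nz.example.com/analytics \
      --driver org.netezza.Driver \
      --username sqoop_user -P \
      --table customers \
      --target-dir /data/analytics/customers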


Hortonworks Focus for Sqoop

The Apache Sqoop community is working on improvements around security, support for additional data platforms, and closer integration with other Hadoop components.

Ease of Use
  • Integration with Hive Query View
  • Connection Builder with test capability
  • Hive Merge (Upsert)
Enterprise Readiness
  • Improved error handling and REST API
  • Improved handling of temporary tables
  • Target DBAs and deliver ETL in under an hour, regardless of the source

Recent Progress in Sqoop

Recent Sqoop releases extended data integration between Apache Hadoop and relational data stores.

Sqoop Version Progress
1.4.6 – July 2015
  • Support for importing mainframe sequential datasets
  • Export data from HDFS back to an RDBMS
  • Import data from a database into Hive as Parquet files (see the sketch after this list)
  • Import data to HDFS as a set of Parquet files
  • Upsert export for SQL Server
1.4.5 – August 2014
  • Added high-performance Oracle connector
  • Enhancements to HCatalog integration
  • Support for Apache Accumulo
1.4.4 – July 2013
  • HCatalog integration
  • Support for Oracle Wallet
  • Bulk load to PostgreSQL
  • Support for composite HBase keys
1.4.3 – March 2013
  • Custom schemas for SQL Server and PostgreSQL
  • Bulk import from PostgreSQL
  • Support for using stored procedures during exports
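To illustrate two of the 1.4.6 additions, the sketch below imports a table into Hive stored as Parquet and then exports processed results back to the database; the database, table, and path names are placeholders:

    # Import into Hive, storing the table as Parquet files (new in 1.4.6)
    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username sqoop_user -P \
      --table orders \
      --hive-import \
      --as-parquetfile

    # Export results from HDFS back to an RDBMS table
    sqoop export \
      --connect jdbc:mysql://db.example.com/sales \
      --username sqoop_user -P \
      --table order_summaries \
      --export-dir /data/sales/order_summaries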

