Apache Sqoop

OVERVIEW

Efficiently transfers bulk data between Apache Hadoop and structured datastores

Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks, such as ETL processing, from the enterprise data warehouse (EDW) to Hadoop, where they can run efficiently at a much lower cost. Sqoop can also extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, PostgreSQL, and HSQLDB.
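As a rough illustration of both directions, the sketch below assembles a `sqoop import` and a `sqoop export` command line in Python without running them. The JDBC URL, user name, password file, table names, and HDFS paths are hypothetical placeholders; the flags are standard Sqoop 1.4.x options.

```python
import shlex

# Hypothetical connection details, used only for illustration.
JDBC_URL = "jdbc:mysql://db.example.com/sales"
USER = "etl_user"
PASSWORD_FILE = "/user/etl_user/.db-password"


def sqoop_import(table, target_dir, mappers=4):
    """Build argv for importing one table from the database into HDFS."""
    return [
        "sqoop", "import",
        "--connect", JDBC_URL,
        "--username", USER,
        "--password-file", PASSWORD_FILE,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]


def sqoop_export(table, export_dir):
    """Build argv for exporting HDFS files back into a database table."""
    return [
        "sqoop", "export",
        "--connect", JDBC_URL,
        "--username", USER,
        "--password-file", PASSWORD_FILE,
        "--table", table,
        "--export-dir", export_dir,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; Sqoop may not be installed.
    print(shlex.join(sqoop_import("orders", "/data/raw/orders")))
    print(shlex.join(sqoop_export("order_summary", "/data/out/order_summary")))
```

Building the command as a list keeps each argument separate, so it could be passed to `subprocess.run` on a machine where Sqoop is available.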

What Sqoop Does

Apache Sqoop does the following to integrate bulk data movement between Hadoop and structured datastores:

Function | Benefit
Import sequential datasets from mainframe | Satisfies the growing need to move data from mainframe to HDFS
Import direct to ORCFiles | Improved compression and light-weight indexing for improved query performance
Data imports | Moves certain data from external stores and EDWs into Hadoop to optimize cost-effectiveness of combined data storage and processing
Parallel data transfer | For faster performance and optimal system utilization
Fast data copies | From external systems into Hadoop
Efficient data analysis | Improves efficiency of data analysis by combining structured data with unstructured data in a schema-on-read data lake
Load balancing | Mitigates excessive storage and processing loads to other systems
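
Parallel data transfer rests on dividing a table into key ranges that separate map tasks import concurrently. The following is a simplified sketch of that idea for an integer split column, not Sqoop's actual implementation; the function name, the even-division rule, and the `orders` table are assumptions chosen for illustration.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into contiguous chunks.

    Returns a list of (start, end) inclusive bounds, one per map task.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if hi < lo:
        return []
    total = hi - lo + 1
    # Never create more chunks than there are key values.
    parts = min(num_mappers, total)
    base, extra = divmod(total, parts)
    ranges = []
    start = lo
    for i in range(parts):
        # The first `extra` chunks take one additional key each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


if __name__ == "__main__":
    # Each range corresponds to one bounded query issued by one map task.
    for start, end in split_ranges(1, 1000, 4):
        print(f"SELECT * FROM orders WHERE id >= {start} AND id <= {end}")
```

In Sqoop itself, the split column is selected with `--split-by` and the number of parallel tasks with `--num-mappers`.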

YARN coordinates data ingest from Apache Sqoop and other services that deliver data into the Enterprise Hadoop cluster.

How Sqoop Works

Sqoop provides a pluggable mechanism for optimal connectivity to external systems. The Sqoop extension API provides a convenient framework for building new connectors which can be dropped into Sqoop installations to provide connectivity to various systems. Sqoop itself comes bundled with various connectors that can be used for popular database and data warehousing systems.

[Figure: Sqoop illustration]

Hortonworks Focus for Sqoop

The Apache Sqoop community is working on improvements around security, support for additional data platforms, and closer integration with other Hadoop components.

Theme | Planned Enhancements
Ease of Use
  • Integration with Hive Query View
  • Connection Builder with test capability
  • Hive Merge (Upsert)
Enterprise Readiness
  • Improved error handling and REST API
  • Improved handling of temporary tables
Simplicity
  • Target DBAs and deliver ETL in under an hour regardless of the source

Recent Progress in Sqoop

Recent Sqoop releases extended data integration between Apache Hadoop and relational data stores.

Sqoop Version | Progress
1.4.6 – July 2015
  • Support importing mainframe sequential datasets
  • Export data from HDFS back to an RDBMS
  • Import data from database to Hive as Parquet files
  • Import data to HDFS as a set of Parquet files
  • Upsert export for SQL Server
1.4.5 – August 2014
  • Added high-performance Oracle connector
  • Enhancements to HCatalog integration
  • Support for Apache Accumulo
1.4.4 – July 2013
  • HCatalog integration
  • Support for Oracle Wallet
  • Bulk load to PostgreSQL
  • Support for composite HBase keys
1.4.3 – March 2013
  • Custom schemas for SQL Server and PostgreSQL
  • Bulk import from PostgreSQL
  • Support for using stored procedures during exports
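
Several of the features listed above surface as command-line options. The sketch below builds, without executing, a Parquet import, an upsert-style export, and a mainframe import. The host names, table and dataset names, and paths are hypothetical, and credentials are omitted for brevity; whether `--update-mode allowinsert` is honored depends on the connector and database in use.

```python
import shlex

# Hypothetical SQL Server connection string, used only for illustration.
JDBC_URL = "jdbc:sqlserver://db.example.com;databaseName=sales"


def parquet_import(table, target_dir):
    """Build argv for importing a table into HDFS as Parquet files."""
    return ["sqoop", "import", "--connect", JDBC_URL,
            "--table", table, "--target-dir", target_dir,
            "--as-parquetfile"]


def upsert_export(table, export_dir, key_column):
    """Build argv for an export that updates matching rows and inserts new ones."""
    return ["sqoop", "export", "--connect", JDBC_URL,
            "--table", table, "--export-dir", export_dir,
            "--update-key", key_column,
            "--update-mode", "allowinsert"]


def mainframe_import(host, dataset, target_dir):
    """Build argv for importing the sequential datasets in a mainframe dataset."""
    return ["sqoop", "import-mainframe", "--connect", host,
            "--dataset", dataset, "--target-dir", target_dir]


if __name__ == "__main__":
    # shlex.join quotes the semicolon in the JDBC URL for safe shell use.
    print(shlex.join(parquet_import("orders", "/data/parquet/orders")))
    print(shlex.join(upsert_export("orders", "/data/out/orders", "order_id")))
    print(shlex.join(mainframe_import("mainframe.example.com", "EMPLOYEES",
                                      "/data/mainframe/employees")))
```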
