Apache Sqoop

Efficiently transfers bulk data between Apache Hadoop and structured datastores

Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, PostgreSQL, and HSQLDB.
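
For example, a table import pulls rows from a relational database into HDFS, and an export pushes HDFS data back out to a database table. A minimal sketch of both directions (the connect string, credentials, table names, and HDFS paths below are hypothetical placeholders):

    # Import a relational table into HDFS; -P prompts for the password
    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username analyst -P \
      --table orders \
      --target-dir /user/analyst/orders

    # Export HDFS data back into a (pre-created) database table
    sqoop export \
      --connect jdbc:mysql://db.example.com/sales \
      --username analyst -P \
      --table order_summary \
      --export-dir /user/analyst/order_summary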

What Sqoop Does

Apache Sqoop provides the following functions for bulk data movement between Hadoop and structured datastores:

  • Data imports: moves data from external stores and EDWs into Hadoop to optimize the cost-effectiveness of combined data storage and processing
  • Parallel data transfer: spreads transfers across map tasks for faster performance and optimal system utilization (see the sketch after this list)
  • Fast data copies: copies data quickly from external systems into Hadoop
  • Efficient data analysis: improves the efficiency of data analysis by combining structured data with unstructured data in a schema-on-read data lake
  • Load balancing: mitigates excessive storage and processing loads on other systems
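
To illustrate the parallel transfer above: Sqoop partitions an import across map tasks by splitting on a column, so throughput scales with the number of mappers the cluster and the database can sustain. A sketch, with a hypothetical table and split column:

    # Split the import into 8 parallel map tasks on the order_id column
    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username analyst -P \
      --table orders \
      --split-by order_id \
      --num-mappers 8 \
      --target-dir /user/analyst/orders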

YARN coordinates data ingest from Apache Sqoop and other services that deliver data into the Enterprise Hadoop cluster.

How Sqoop Works

Sqoop provides a pluggable mechanism for optimal connectivity to external systems. The Sqoop extension API provides a convenient framework for building new connectors which can be dropped into Sqoop installations to provide connectivity to various systems. Sqoop itself comes bundled with various connectors that can be used for popular database and data warehousing systems.
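
In practice, connector selection is usually automatic based on the JDBC connect string, but it can be overridden: supplying an explicit JDBC driver class makes Sqoop fall back to its generic JDBC connector, and --connection-manager selects a specific connector class. A sketch (the host, database, and table are hypothetical; the driver class shown is the vendor's JDBC driver):

    # Force the generic JDBC connector with an explicit driver class
    sqoop import \
      --connect jdbc:netezza://dw.example.com/edw \
      --driver org.netezza.Driver \
      --table facts \
      --target-dir /user/analyst/facts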

Hortonworks Focus for Sqoop

The Apache Sqoop community is working on improvements around security, support for additional data platforms, and closer integration with other Hadoop components.

  • Security: password management through the Hadoop credential provider API
  • Parquet support: export data from and import data to Parquet files
  • Integration: improved integration with Apache Hive (a sketch of the existing Hive import path follows this list)
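
As context for the Hive theme above, Sqoop can already load an import straight into a Hive table, generating the table definition from the source schema; a sketch with hypothetical names:

    # Import into Hive, creating the Hive table from the source metadata
    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username analyst -P \
      --table orders \
      --hive-import \
      --hive-table orders_hive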

Recent Progress in Sqoop

Recent Sqoop releases have extended data integration between Apache Hadoop and relational datastores.

Sqoop Version Progress
1.4.5 – August 2014
  • Added high-performance Oracle connector
  • Enhancements to HCatalog integration (a usage sketch follows this list)
  • Support for Apache Accumulo
1.4.4 – July 2013
  • HCatalog integration
  • Support for Oracle Wallet
  • Bulk load to PostgreSQL
  • Support for composite HBase keys
1.4.3 – March 2013
  • Custom schemas for SQL Server and PostgreSQL
  • Bulk import from PostgreSQL
  • Support for using stored procedures during exports
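
The HCatalog integration introduced in 1.4.4 and extended in 1.4.5 lets an import land directly in an HCatalog-managed table, making the data immediately visible to Hive and Pig; a sketch with hypothetical database and table names:

    # Import into an existing HCatalog-managed table
    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username analyst -P \
      --table orders \
      --hcatalog-database sales \
      --hcatalog-table orders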


