Efficiently transfers bulk data between Apache Hadoop and structured datastores
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks, such as ETL processing, from the enterprise data warehouse (EDW) to Hadoop, where they can run efficiently at a much lower cost. Sqoop can also extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, PostgreSQL, and HSQLDB.
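The two directions of transfer described above map onto the `sqoop import` and `sqoop export` tools. The following is a minimal sketch, not an official recipe, that only assembles the command lines in Python so they can be inspected before being run. The JDBC URL, user name, table names, and HDFS paths are hypothetical placeholders, and the helper function names are ours.

```python
import shlex


def sqoop_import_args(jdbc_url, username, password_file, table, target_dir):
    """Build the argument list for copying one database table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC connection string of the source database
        "--username", username,
        "--password-file", password_file,  # file holding the password, instead of a literal
        "--table", table,                  # source table to read
        "--target-dir", target_dir,        # HDFS directory that receives the data
    ]


def sqoop_export_args(jdbc_url, username, password_file, table, export_dir):
    """Build the argument list for pushing HDFS files back into a database table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,                  # destination table, which must already exist
        "--export-dir", export_dir,        # HDFS directory to read from
    ]


def as_command_line(args):
    """Render an argument list as a shell-safe string for display or logging."""
    return " ".join(shlex.quote(a) for a in args)


# Hypothetical values, for illustration only.
URL = "jdbc:mysql://db.example.com/sales"
import_cmd = sqoop_import_args(URL, "etl_user", "/user/etl/.db_password", "orders", "/data/raw/orders")
export_cmd = sqoop_export_args(URL, "etl_user", "/user/etl/.db_password", "order_summary", "/data/out/order_summary")

print(as_command_line(import_cmd))
print(as_command_line(export_cmd))
```

On a machine with Sqoop installed, either list could be passed to `subprocess.run(args, check=True)`. Keeping the arguments as a list avoids shell-quoting problems with connection strings.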
What Sqoop Does
Apache Sqoop does the following to integrate bulk data movement between Hadoop and structured datastores:
|Data imports||Moves certain data from external stores and EDWs into Hadoop to optimize cost-effectiveness of combined data storage and processing|
|Parallel data transfer||For faster performance and optimal system utilization|
|Fast data copies||From external systems into Hadoop|
|Efficient data analysis||Improves efficiency of data analysis by combining structured data with unstructured data in a schema-on-read data lake|
|Load balancing||Mitigates excessive storage and processing loads on other systems by offloading them to Hadoop|
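To illustrate the parallel data transfer listed above: Sqoop divides an import among several map tasks by partitioning the value range of a splitting column, and each task reads its own slice of the table. The sketch below is a simplified illustration of that idea for an integer key. It is not Sqoop's actual splitting code, and the function name, table name, and column name are hypothetical.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Partition the inclusive key range [min_id, max_id] into contiguous
    half-open ranges [start, end), one per mapper, as evenly as possible."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        # The first `extra` mappers take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer keys than mappers: remaining mappers get no work
        ranges.append((start, start + size))
        start += size
    return ranges


def range_query(table, column, start, end):
    """Render the bounded SELECT that a single mapper would issue."""
    return f"SELECT * FROM {table} WHERE {column} >= {start} AND {column} < {end}"


# Hypothetical table with ids 0..1000, imported with four parallel tasks.
for start, end in split_ranges(0, 1000, 4):
    print(range_query("orders", "id", start, end))
```

With the real tool, the degree of parallelism and the splitting column are chosen with the `--num-mappers` and `--split-by` options of `sqoop import`. An evenly distributed key gives each task a similar amount of work, while a skewed key leaves some tasks with far more rows than others.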
YARN coordinates data ingest from Apache Sqoop and other services that deliver data into the Enterprise Hadoop cluster.
How Sqoop Works
Sqoop provides a pluggable mechanism for optimal connectivity to external systems. The Sqoop extension API provides a convenient framework for building new connectors, which can be dropped into Sqoop installations to provide connectivity to additional systems. Sqoop itself comes bundled with connectors for popular database and data warehousing systems.
Hortonworks Focus for Sqoop
The Apache Sqoop community is working on improvements around security, support for additional data platforms, and closer integration with other Hadoop components.
|Security||Password management through Hadoop credential provider API|
|Parquet support||Export data from and import data to Parquet files|
|Integration||Improved integration with Apache Hive|
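As a rough illustration of the security and Parquet items above, the sketch below assembles two command lines: one that stores a database password under an alias in a Hadoop credential store, and one that runs an import that reads the password through that alias and writes Parquet files. It assumes a Sqoop release that supports `--password-alias` and `--as-parquetfile`. The alias, keystore path, JDBC URL, and table names are hypothetical, and the helper function names are ours.

```python
def credential_create_args(alias, provider_path):
    """Arguments for `hadoop credential create`, which prompts for the secret
    and stores it under `alias` in the given credential provider."""
    return ["hadoop", "credential", "create", alias, "-provider", provider_path]


def parquet_import_args(jdbc_url, username, alias, provider_path, table, target_dir):
    """Arguments for an import that reads the password from a credential
    store and writes the table as Parquet files."""
    return [
        "sqoop", "import",
        # Generic Hadoop options (-D) must come before tool-specific options.
        "-D", "hadoop.security.credential.provider.path=" + provider_path,
        "--connect", jdbc_url,
        "--username", username,
        "--password-alias", alias,  # look the password up by alias, not in plain text
        "--table", table,
        "--target-dir", target_dir,
        "--as-parquetfile",         # store the imported data as Parquet
    ]


# Hypothetical values, for illustration only.
PROVIDER = "jceks://hdfs/user/etl/sales.jceks"
create_cmd = credential_create_args("sales.db.password", PROVIDER)
import_cmd = parquet_import_args(
    "jdbc:mysql://db.example.com/sales", "etl_user",
    "sales.db.password", PROVIDER, "orders", "/data/parquet/orders",
)

print(" ".join(create_cmd))
print(" ".join(import_cmd))
```

Keeping the password in a credential store means it appears neither on the command line nor in a world-readable file. For the Hive integration item, `sqoop import` also accepts `--hive-import` to load the imported data into a Hive table.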
Recent Progress in Sqoop
Recent Sqoop releases extended data integration between Apache Hadoop and relational data stores.
|1.4.5 – August 2014||
|1.4.4 – July 2013||
|1.4.3 – March 2013||