
The legacy Hortonworks Forum is now closed. You can view a read-only version of the former site by clicking here. The site will be taken offline on January 31, 2016.

Sqoop Forum

Importing Data Using Sqoop

  • #55837
    Vijay Kumar

    Suppose we have a table with two columns in a SQL database (the source table has no created_date, updated_date, or flag columns, and we are not allowed to modify the source table).

    id is the primary key:
    id name
    1 AAAAA
    2 BBBBB
    3 CCCCC
    4 ADAEAB
    5 GGAGAG
    I pull the data into a main table in Hive using Sqoop, and that works fine. But then the source data is updated, as below:
    id name
    1 ACACA
    2 BASBA
    3 CCHAH
    4 AASDA1
    5 GGAGAG

    Problem:

    Without re-importing the whole main table in Hive, I need to pull only the updated, inserted, or deleted rows using Sqoop, and simultaneously apply them to the Hive main table without duplicating the existing rows.
    I have tried the --incremental options and related properties, but with no result.

    Result should be:

    After my import, the main table contains all 10 records, when it should contain only 5.

    On day 1 I have 1 million records. On day 2 I have the original 1 million plus the current day's new and updated rows, say 2 million in total. On day 2 I want to pull only the updated and newly inserted data rather than the whole data set.
    Also, can anyone tell me how to combine the day-1 Hive data with the day-2 updated data?

    If anyone has an alternative solution, please explain it clearly, because I am new to Hadoop.
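    For the inserts-only part of the problem, a Sqoop incremental append import on the primary key is the usual starting point. The sketch below is illustrative only: the connection string, credentials, table name, and paths are placeholders, and this mode captures new rows but not updates or deletes.

    ```shell
    # Sketch only: host, database, credentials, and paths are placeholders.
    # '--incremental append' imports rows whose id is greater than the value
    # given in --last-value; updated and deleted rows are NOT captured.
    sqoop import \
      --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
      --username sqoop_user --password-file /user/sqoop/pwd \
      --table customers \
      --check-column id \
      --incremental append \
      --last-value 5 \
      --target-dir /data/customers/incr
    ```

    When this is defined as a saved Sqoop job (`sqoop job --create ...`), Sqoop records the last imported id value itself, so `--last-value` does not have to be tracked by hand.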


  • #55846
    MC Brown


    For incremental imports to work with Sqoop, you must update the table to contain an identifier so that you can pull the changes. Take a look at this article for more information.

    What data are you using for the source data?
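    If the source table could be given an update-timestamp column (which the original poster says is not allowed, so this is purely for illustration), an incremental import on that column would also capture updates. The connection string, table, and column names below are placeholders:

    ```shell
    # Sketch only: all names and credentials are placeholders.
    # '--incremental lastmodified' re-imports rows whose timestamp column is
    # newer than --last-value, so updates are captured; deletes still are not.
    # '--merge-key' tells Sqoop to reconcile re-imported rows with existing
    # ones by primary key instead of appending duplicates.
    sqoop import \
      --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
      --username sqoop_user --password-file /user/sqoop/pwd \
      --table customers \
      --check-column updated_at \
      --incremental lastmodified \
      --last-value "2016-01-30 00:00:00" \
      --merge-key id \
      --target-dir /data/customers
    ```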


    Vijay Kumar

    Thank you for your reply.

    I am working with telecom data, which is structured and stored in SQL Server or MySQL; the volume is large, e.g. 10 million records daily.
    We cannot make modifications to the source, such as adding a trigger.
    The source data is modified on a daily basis (insertion, deletion, updates).
    In our scenario, take two fields, id and name, where id is the primary key and name is modified daily.
    I have to pull only the modified data. Is Hive a good choice here, or HBase? Does HBase directly support this kind of situation, and can it be linked with Sqoop for daily modifications? Or would Hive-HBase integration solve this problem?
    I am confused; please help me.

    MC Brown


    This is difficult to do with Sqoop, since it expects either to take everything, or to be able to identify only the changes from the source table, using either a unique ID or an update timestamp from which to perform the data movement.

    For the type of movement you are looking for, you need some form of replication that captures the changes. Since you are using MySQL, have you looked at Tungsten Replicator? That might suit your needs better for getting a live stream of changes.
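    However the day-2 changes are extracted, folding them into the day-1 snapshot follows "newest row per key wins" semantics. That can be illustrated with plain POSIX tools on made-up sample data; Sqoop's own `sqoop merge` tool (with `--merge-key`) performs the same operation at HDFS scale.

    ```shell
    # Tiny made-up snapshots: day1 is the full load, day2 holds changed/new rows.
    printf '1,AAAAA\n2,BBBBB\n3,CCCCC\n' > day1.csv
    printf '1,ACACA\n2,BASBA\n4,DDDDD\n' > day2.csv

    # Feed day1 first and day2 second; awk keeps the LAST row seen per key,
    # so day-2 rows override day-1 rows with the same id.
    cat day1.csv day2.csv \
      | awk -F, '{row[$1] = $0} END {for (k in row) print row[k]}' \
      | sort -t, -k1,1n > merged.csv

    cat merged.csv
    ```

    The merged file holds one row per id: ids 1 and 2 carry their day-2 values, id 3 keeps its day-1 value, and id 4 is the newly inserted row.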


    Vijay Kumar

    Yes, that's true, but my source is actually coming from SQL Server; saying MySQL was my mistake.

    Please tell me the best approach to use.


    Hi Vijay,

    Per the information provided, the only thing I can think of is to create a view that selects the most recent changes from the actual table, and then point Sqoop at that view as if it were a table to load the data into HDFS.
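    A free-form query import can play the same role as a dedicated view. The sketch below assumes the DBA exposes a `recent_changes_view` (or an equivalent audit mechanism), since the original table has no change-tracking column; that view name, the connection string, and the credentials are all hypothetical.

    ```shell
    # Sketch only: connection details and the view name are placeholders.
    # $CONDITIONS is required by Sqoop in free-form queries so it can split
    # the work across mappers; --split-by names the splitting column.
    # Single quotes keep the local shell from expanding $CONDITIONS.
    sqoop import \
      --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
      --username sqoop_user --password-file /user/sqoop/pwd \
      --query 'SELECT id, name FROM recent_changes_view WHERE $CONDITIONS' \
      --split-by id \
      --target-dir /data/customers/changes
    ```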


The topic ‘Importing Data Using Sqoop’ is closed to new replies.
