Importing Data Using Sqoop



This topic contains 5 replies, has 3 voices, and was last updated by  Mahesh Balakrishnan 9 months, 1 week ago.

  • Creator
  • #55837

    Vijay Kumar

    Suppose we have a table with two columns in our SQL source.
    (The source table has no created_date, updated_date, or flag columns, and we are not allowed to modify the source table.)

    id is primary key
    id name
    1 AAAAA
    2 BBBBB
    3 CCCCC
    4 ADAEAB
    5 GGAGAG
    I pull the data into a main table in Hive using Sqoop, and that works fine.
    But then the source data is updated, like this:
    id name
    1 ACACA
    2 BASBA
    3 CCHAH
    4 AASDA1
    5 GGAGAG

    Problem :

    My issue is that, without disturbing the existing data in the Hive main table, I need to pull only the updated, inserted, or deleted rows using Sqoop, and then apply those changes to the Hive main table.
    I have tried the --incremental import options and so on, but with no result.

    Result should be:

    The main table should contain 5 records, but after my import it contains all 10.

    On day 1 I have 1 million records.
    On day 2 I have the original 1 million plus the current day's new and updated rows, say 2 million in total.
    On day 2 I want to pull only the updated and newly inserted data rather than the whole table.
    Also, can anyone help me combine the day-1 Hive data with the day-2 updated data?

    If anyone has an alternative solution, please suggest it clearly, because I am new to Hadoop.
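    The "combine day 1 with day 2" step is essentially an upsert keyed on id: a day-2 row replaces the day-1 row with the same id, and unmatched day-1 rows are kept. In Hive this is usually done with a join or, in newer Sqoop, with --merge-key; the key-wins logic itself can be sketched in plain shell (the file names and sample rows below are hypothetical, not from the poster's data):

    ```shell
    #!/bin/sh
    # Sketch of the merge ("upsert") logic: the later file wins per id.
    # day1.tsv / day2.tsv are hypothetical tab-separated exports.
    printf '1\tAAAAA\n2\tBBBBB\n3\tCCCCC\n' > day1.tsv
    printf '1\tACACA\n4\tAASDA\n'           > day2.tsv

    # Read day1 first, then let day2 overwrite rows with the same id;
    # unmatched day1 rows survive, and the result is sorted by id.
    awk -F'\t' '{row[$1]=$0} END {for (k in row) print row[k]}' \
        day1.tsv day2.tsv | sort -n > merged.tsv

    cat merged.tsv
    ```

    Here id 1 ends up with the day-2 value and ids 2 and 3 keep their day-1 values, so the result has 4 rows rather than 5 — the same "latest version per key" outcome the poster wants for the Hive main table.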




  • Author
  • #56012

    Hi Vijay,

    Based on the information provided, the only thing I can think of is to create a view in the source database that selects the most recent changes from the actual table, and then have Sqoop treat that view as a table when loading the data into HDFS.
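    A rough sketch of this view-based suggestion (all host, database, table, and column names below are hypothetical). Note the caveat: the view needs some column to filter "recent" on, which is exactly what the poster's source table lacks:

    ```shell
    # Step 1 (run on the database side): create a view over recent changes.
    #   CREATE VIEW recent_changes AS
    #   SELECT id, name FROM mytable WHERE updated_at > '2016-01-01';
    # (This presumes some column such as updated_at exists to filter on.)

    # Step 2: point Sqoop at the view as if it were a table.
    sqoop import \
      --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
      --username myuser -P \
      --table recent_changes \
      --target-dir /staging/recent_changes \
      --split-by id
    ```

    Because a view has no primary key for Sqoop to split on, you must supply --split-by explicitly (or run a single mapper with -m 1).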



    Vijay Kumar

    Yes, that's true, but my source is actually coming from SQL Server; that was my mistake earlier.

    Please tell me the best approach to use.


    MC Brown


    This is difficult to do with Sqoop, since it expects either to take everything, or to identify only the changes from the source tables, using either a unique ID or an update timestamp to drive the data movement.

    For the type of movement you are looking for, you need some form of replication that captures the changes as they happen. Since you are using MySQL, have you looked at Tungsten Replicator? That might suit your needs better for getting a live stream of changes.



    Vijay Kumar

    Thank you for your reply.

    I am working with telecom data, which is structured and stored in SQL Server, MySQL, or similar, and the volume is large: around 10 million records daily.
    No modification to the source is allowed, such as adding a trigger.
    The source data is modified daily (inserts, deletions, updates).
    In our scenario, take two fields, id and name, where id is the primary key and rows are modified daily.
    I have to pull only the modified data. Is Hive a good choice, or HBase? Will HBase directly support this kind of situation? Can HBase link up with Sqoop for daily modifications, or would Hive-HBase integration solve this problem?
    I am confused; please help.


    MC Brown


    For incremental imports to work with Sqoop, you must update the table to contain an identifier, such as a strictly increasing ID or a last-modified timestamp, so that you can pull only the changes. Take a look at this article for more information.

    What are you using for the source data?
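    To illustrate what "an identifier" buys you (a sketch only; the connection details are hypothetical, and it assumes a last_modified column was added on the source side, which the original poster says is not allowed): Sqoop's lastmodified mode combined with --merge-key can both pull only changed rows and fold updates into the existing HDFS data set by primary key.

    ```shell
    # Sketch: incremental pull of rows changed since the last run,
    # merged into the existing target directory by primary key (id).
    # Requires a change-tracking column (here: last_modified) on the source.
    sqoop import \
      --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
      --username myuser -P \
      --table mytable \
      --target-dir /data/mytable \
      --incremental lastmodified \
      --check-column last_modified \
      --last-value "2016-01-01 00:00:00" \
      --merge-key id
    ```

    On each run Sqoop reports the new --last-value to use next time; a saved Sqoop job can track that value automatically.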

