Hive / HCatalog Forum

Importing Data From Sqoop to Hive

  • #55838
    Vijay Kumar
    Participant

    Suppose we have a source table in SQL with 2 columns.
    (The source table has no created_date, updated_date, or flag columns, and we cannot modify it.)

    id is the primary key.
    id name
    1 AAAAA
    2 BBBBB
    3 CCCCC
    4 ADAEAB
    5 GGAGAG
    I pull the data into Hive as a main table using Sqoop, and that works fine.
    But suppose the source data is later updated as below:
    id name
    1 ACACA
    2 BASBA
    3 CCHAH
    4 AASDA1
    5 GGAGAG

    Problem :
    —————–

    Without affecting the existing data in the Hive main table, I need to pull only the updated, inserted, or deleted rows using Sqoop and apply them to the Hive main table at the same time.
    I have tried using
    --incremental and related properties, but with no result.

    Result Should be:
    ——————————–

    The output main table ends up with all 10 records, but it should have only 5 records.

    Requirement:
    ——————————
    On day 1 I have 1 million records.
    On day 2 I have the 1 million records plus the current day's new and updated records, say 2 million in total.
    On day 2 I need to pull only the updated and newly inserted data rather than the whole data set.
    Also, can anyone help me with how to combine the day 1 Hive data with the day 2 updated data?

    If anyone has an alternative solution, please explain it clearly, because I am new to Hadoop.


  • #55957
    Tom Hanlon
    Moderator

    Synchronizing a source table in SQL with a Hive table using Sqoop will be challenging.

    Can you simply load the complete current state of the table once a day? If so, that is the simplest solution.

    That is, a daily Sqoop import of all the records into a new or empty Hive table?

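    As far as I recall, the incremental option (in append mode) only tracks the largest value of the check column, typically an auto_increment primary key, and imports rows with keys larger than that. Basically: select * from table where primary_key > last imported primary_key.

    So Sqoop incremental import (unless it has changed in ways I am not aware of) is not going to help here: it would pick up newly inserted rows, but not updates to or deletes of existing rows.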
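
    To illustrate why a key-based incremental filter misses updates, here is a minimal Python sketch using the rows from the question. The function append_style_import is a hypothetical stand-in for a "where primary key > last imported key" filter; it is not Sqoop's actual implementation, only an illustration of the selection logic described above.

```python
# Source table on day 1 and day 2, keyed by the primary key `id`
# (values taken from the example in the question).
day1 = {1: "AAAAA", 2: "BBBBB", 3: "CCCCC", 4: "ADAEAB", 5: "GGAGAG"}
day2 = {1: "ACACA", 2: "BASBA", 3: "CCHAH", 4: "AASDA1", 5: "GGAGAG"}


def append_style_import(source, last_value):
    """Return rows whose key is greater than last_value.

    Hypothetical helper that mimics `WHERE id > last_value`;
    an illustration only, not how Sqoop is implemented.
    """
    return {k: v for k, v in source.items() if k > last_value}


# After the first full import, the largest imported key is 5.
last_value = max(day1)

# Day 2: no key is greater than 5, so the filter selects nothing,
# even though rows 1 to 4 were updated in the source.
delta = append_style_import(day2, last_value)
changed_ids = sorted(k for k in day2 if day1.get(k) != day2[k])

print(delta)        # rows the key-based filter would import
print(changed_ids)  # ids of rows that actually changed
```

    With this data the filter returns no rows while four rows have in fact changed, which is the gap between "new keys" and "changed rows".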

    Without knowing more about your system it is hard to advise.

    I can try to monitor this forum and if you provide more information perhaps I can advise.

    If you can clarify your question about how to combine the day 1 Hive data with the day 2 updated data, perhaps I can help.

    Are you saying that on day 1 you pull all of the data, and on day 2 you pull all of the data, and you want a result set of the rows that have changed?
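
    If that is what you mean, the comparison of two full snapshots could be sketched roughly as below. This is only an illustration in plain Python, assuming each day's full import fits in a dictionary keyed by id; diff_snapshots is a hypothetical helper name, and on real data volumes the same comparison would more likely be done as a join between the two Hive tables.

```python
def diff_snapshots(old, new):
    """Compare two full snapshots keyed by primary key.

    Returns three dicts: rows only in `new` (inserted), rows present in
    both with a different value (updated, new value kept), and rows only
    in `old` (deleted).
    """
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserted, updated, deleted


# Day 1 and day 2 snapshots from the question.
day1 = {1: "AAAAA", 2: "BBBBB", 3: "CCCCC", 4: "ADAEAB", 5: "GGAGAG"}
day2 = {1: "ACACA", 2: "BASBA", 3: "CCHAH", 4: "AASDA1", 5: "GGAGAG"}

inserted, updated, deleted = diff_snapshots(day1, day2)

print(sorted(updated))  # ids whose name changed between the two days
print(inserted, deleted)
```

    Note that the day 2 snapshot on its own already is the desired 5-row table; the diff is only needed if you want to know which rows changed.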

    In general I think the best start would be to pull all the data at once, and repeat once a day. Each day's import would have the up-to-date records. Why do you need old versions? The database is not keeping old versions.

