select * from nyse_stocks where stock_symbol="IBM" fails

This topic contains 7 replies, has 4 voices, and was last updated by tedr 1 year, 8 months ago.

  • #16269

    Brian Feeny
    Member

    I am going through tutorial 1, and the query select * from nyse_stocks where stock_symbol="IBM" returns no rows. If I view the file via the File Browser, I can see all the data is there. But when I view the nyse_stocks table in HCatalog, it only contains stocks whose symbols begin with the letter A. So it's as though creating the table from the file produced a truncated table. All I did was enter the table name and description and select the file (which I verified has all of the stocks in it). On the next page I just accepted the defaults (tab delimiter), and on the final page for HCatalog I set my columns to the appropriate data types. My “describe nyse_stocks” output matches the tutorial exactly.

    Does anyone have any idea what may be causing this strange behavior? I have dropped the table in HCatalog and tried to re-create but I get the same results. I even dropped the file and re-uploaded it. Same result.
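One way to double-check the "file is complete" side of this before blaming the table is to count rows per first letter of the symbol in the local file. The sketch below is only illustrative: it assumes the file is tab-delimited (as in the post), that it has a header row, and that the symbol column is named stock_symbol. The function name is made up for this example.

```python
import csv
from collections import Counter


def symbol_initials(path, column="stock_symbol"):
    """Count data rows per first letter of the symbol column.

    Assumes a tab-delimited file whose first line is a header row
    containing `column` (an assumption, not confirmed by the thread).
    """
    counts = Counter()
    with open(path, newline="") as handle:
        # QUOTE_NONE: treat quote characters as ordinary data.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # A short (for example, truncated) row yields None for missing fields.
            symbol = (row.get(column) or "").strip()
            if symbol:
                counts[symbol[0].upper()] += 1
    return counts
```

If the result only has an entry for "A", the file itself is truncated; if it covers many letters while the table does not, the loss happened during the import.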

  • #17461

    tedr
    Member

    Hi Brian,

    Thanks for the information.

    Ted.

    #17460

    Bill Blaney
    Member

    FWIW, the problem will also occur if you unzip the file in Windows and upload the unzipped version.

    #16311

    Brian Feeny
    Member

    Thanks for letting me know. I must have skipped past reading that in my excitement to get started. I appreciate all the help, and things are working well now!

    #16309

    Yi Zhang
    Moderator

    Hi Brian,

    This is a known issue stated here:

    http://hortonworks.com/products/sandbox-instructions/

    The Hortonworks Sandbox is built on the Hortonworks Data Platform 1.2. However, excluded from this are:

    Third party tools and downloads (like Talend)
    The Hortonworks Management Console (Apache Ambari)
    Data sets uncompressed by Safari from .gz extension to .tsv extensions may not fully import. To solve this issue, using Safari on a Mac, please ensure that the following configuration is set in Preferences: General->uncheck “Open “safe” files after downloading”.


    Thanks,

    Yi.
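A quick way to tell whether a browser has already decompressed a download is to look at the first two bytes of the file: gzip data starts with the magic bytes 0x1f 0x8b. The helper below is a minimal sketch; its name and the file name in the usage note are hypothetical, and it only checks the header, not whether the rest of the archive is intact.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

For example, looks_gzipped("downloaded_file.tsv.gz") returning False would suggest the file was unpacked on download, whatever its extension says.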

    #16308

    Brian Feeny
    Member

    OK, so here is the update. If I download the NYSE stock data and let Safari automatically unzip it (Preferences->General->Open “safe” files after downloading is checked), then the situation is as I explained.

    If I have Safari not automatically decompress the file, then it works fine! This is concerning, because it should not matter, should it?

    As a test, I uploaded the file uncompressed and called the file and table “test”, and then I uploaded the file compressed and called it “nyse_stocks”. You can see the difference: the test file is 1048446 in size and the nyse_stocks file is 44005832. If someone from Hortonworks wishes to replicate this, just decompress the file before uploading. It's almost as if Hive clips the data after a certain size, so if you upload it compressed you're good, but if not, you will lose data. It doesn't make sense to me, but it is repeatable: set Safari to automatically decompress files on download, and then upload the decompressed stock data.
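To put the two sizes in perspective, the arithmetic can be sketched as below. This assumes both numbers are byte counts, which the post does not state. Under that assumption the truncated table sits just under 2**20 bytes (1 MiB), which fits the impression of a size cutoff but does not prove where the cutoff happens. The function name is made up for this example.

```python
MIB = 2 ** 20  # 1048576 bytes


def describe_truncation(stored_size, full_size):
    """Summarize how much of the full data set ended up stored."""
    return {
        "fraction_kept": stored_size / full_size,
        "bytes_below_1_mib": MIB - stored_size,
    }


# Sizes quoted in the post, assumed to be in bytes.
report = describe_truncation(1048446, 44005832)
print(f"kept {report['fraction_kept']:.2%} of the data")
print(f"{report['bytes_below_1_mib']} bytes short of 1 MiB")
```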

    #16304

    Yi Zhang
    Moderator

    Hi Brian,

    Can you browse into /apps/hive/warehouse/nyse_stocks and see if it has the complete data? There should be many blocks there.

    If the data is not complete there, can you try another upload while watching the logs in /var/log/webhcat? Is there any clue there?

    Thanks,

    Yi.

    #16279

    Brian Feeny
    Member

    I just reinstalled and tested this again. Same outcome, so I think there is an issue with the latest Sandbox build. I am using the Fusion version. It appears that perhaps the file is partitioned into multiple parts and HCatalog is only importing/using one of those parts (stock symbols that begin with “A”).

    The file itself is correct, but when moved into HCatalog it's truncated, and so the select on stock_symbol IBM fails. And of course the SELECT count(*) returns fewer rows than are shown.
