Importing Web pages into Hadoop

This topic contains 1 reply, has 2 voices, and was last updated by  tedr 1 year, 1 month ago.

  • Creator
    Topic
  • #29568

    TomaszB
    Member

    Hi,

    I have watched several Hadoop presentations, and they all say the same thing: import everything, import web pages, import JPG files, etc.
    I have a few questions, especially about importing web pages.

    For example, our company currently has a lot of different internal web sites (using Windows domain authentication; I have permission to load them between 23:00 and 7:00 each day).
    I want to import those web pages into Hadoop.
    How can this be done? What tool should I pick? Can I load only new pages?

    Then, once I have my pages (in HTML, I assume?), I want to search for information within them.
    Do I have to write a separate “distributed MapReduce Java HTML parser” for each problem (one that checks whether the innerHTML matches “SAP codes”, etc.)?
    Is there any wrapper for HTML pages in Hadoop? Can Pig or Hive query such dumped pages, or do they need to be in CSV?

    Sorry if I have missed the Hadoop concept entirely, but I can’t find an easy solution for my problem.
    It seems that Sqoop is for importing databases and Flume is for collecting logs.
    I find myself hitting a wall, and of course, after three Big Data presentations, I had assumed that importing web pages was the bread and butter of Hadoop.

Viewing 1 replies (of 1 total)

The topic ‘Importing Web pages into Hadoop’ is closed to new replies.

  • Author
    Replies
  • #29587

    tedr
    Moderator

    Hi Tomasz,

    Importing the web pages into Hadoop is really quite easy. You copy them from the server they reside on to the local file system of one of the nodes in your cluster (typically a client node), then in the shell on that node execute a ‘hadoop fs -put’ command whose source argument is a wildcard pattern matching all of the files you want copied to Hadoop.

    Once in Hadoop, web pages are pretty much just text files and can be queried as such; initially it would be easier to do this with Pig. For Hive to be able to query them, they would need to have some sort of table structure applied. This structure can be applied with a SerDe. You may have to write this SerDe yourself or find one on the internet, as I am sure you’re not the first person to want to process HTML in Hive.
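
    As a rough illustration of the copy step, here is a minimal Python sketch that expands a wildcard on the local file system and builds the matching ‘hadoop fs -put’ command. The function names, the dry_run flag, and any paths you pass in are hypothetical; the only real interface assumed is that ‘hadoop fs -put’ takes one or more local sources followed by a single destination.

```python
import glob
import subprocess


def build_put_command(local_pattern, hdfs_dir):
    """Return the argv list for `hadoop fs -put` covering every local file
    that matches local_pattern (a shell-style wildcard, not a regex)."""
    files = sorted(glob.glob(local_pattern))
    if not files:
        raise FileNotFoundError(f"no local files match {local_pattern!r}")
    # `hadoop fs -put` accepts several local sources, then one destination.
    return ["hadoop", "fs", "-put", *files, hdfs_dir]


def upload_pages(local_pattern, hdfs_dir, dry_run=True):
    """Build the command and, unless dry_run is set, execute it.

    Assumes the `hadoop` client is on PATH of the node running this."""
    cmd = build_put_command(local_pattern, hdfs_dir)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

    Calling upload_pages with dry_run left at its default only returns the command, which makes it easy to check what would be copied before the nightly load window.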

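    To give an idea of what treating the pages as text can look like, here is a small Python sketch that strips the markup from a page and searches what is left for codes. It is only an illustration: the “SAP-” plus six digits format is a made-up stand-in for whatever your codes really look like, and the class and function names are hypothetical. Logic like this could be one way to build a mapper (for example with Hadoop Streaming), but that wiring is an assumption and is not shown.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical code format, for illustration only.
CODE_PATTERN = re.compile(r"\bSAP-\d{6}\b")


def find_codes(html):
    """Return every code-like token found in the page's visible text."""
    return CODE_PATTERN.findall(html_to_text(html))
```

    The point of the design is that the HTML handling lives in one reusable function, so a new question usually means a new pattern rather than a new parser.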
    Thanks,
    Ted.
