HDFS Forum

Importing Web pages into Hadoop

  • #29568


    I have watched all these Hadoop presentations, and I keep seeing: import everything, import web pages, import JPG files, etc.
    I have a few questions, especially about importing web pages.

    For example, we currently have a lot of different internal web sites in our company (Windows domain authentication; I have permission to load them between 23:00 and 7:00 each day).
    I want to import those web pages into Hadoop.
    How can it be done? Which tool should I pick? Can I load only new pages?

    Then, when I have my pages (in HTML, I assume?), I want to search for information within them.
    Do I have to write a separate “distributed MapReduce Java HTML parser” for each problem (one which checks whether the innerHTML matches “SAP codes”, etc.)?
    Is there any wrapper for HTML pages in Hadoop? Can Pig or Hive query such a dumped page, or does it need to be in CSV?

    Sorry if I have missed the Hadoop concept entirely, but I can’t find an easy solution to my problem.
    It seems that Sqoop is for importing databases and Flume is for collecting logs.
    I found myself hitting a wall, and of course, after three Big Data presentations, I assumed that importing web pages was the bread and butter of Hadoop.


  • Author
  • #29587

    Hi Tomasz,

    Importing the web pages into Hadoop is really quite easy. You copy them from the server they reside on to the local file system of one of the nodes in your cluster (typically a client node), then in the shell on that node execute a ‘hadoop fs -put <localsrc> <dst>’, where <localsrc> is a file glob that will match all of the files you want copied into Hadoop.

    Once in Hadoop, web pages are pretty much just text files and can be queried as such; initially it would be easier to do this with Pig. For Hive to be able to query them, some sort of table structure would need to be applied. That structure can be applied via a SerDe, which you may have to write yourself or find out there on the internet, as I am sure you’re not the first person who wants to process HTML in Hive.
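
    A minimal sketch of that flow from a client node’s shell, assuming the pages have already been fetched to a local staging directory. The paths, script name, table name, and search phrase (‘SAP code’) below are made up for illustration:

# Copy the fetched HTML into HDFS. 'hadoop fs -put' accepts file globs,
# so one command picks up every page in the staging directory.
hadoop fs -mkdir -p /data/intranet/html
hadoop fs -put /tmp/pages/*.html /data/intranet/html/

# Query the pages as plain text with Pig: load each line as a string
# and keep the lines that mention the phrase we are searching for.
cat > grep_pages.pig <<'EOF'
raw  = LOAD '/data/intranet/html' USING TextLoader() AS (line:chararray);
hits = FILTER raw BY line MATCHES '.*SAP code.*';
STORE hits INTO '/data/intranet/sap_hits';
EOF
pig grep_pages.pig

# Hive can query the same files once a table structure is applied. The
# simplest structure is a one-column text table over the HTML directory;
# a custom SerDe is only needed to split pages into real columns.
hive -e "
CREATE EXTERNAL TABLE intranet_pages (line STRING)
STORED AS TEXTFILE
LOCATION '/data/intranet/html';
SELECT line FROM intranet_pages WHERE line LIKE '%SAP code%' LIMIT 10;
"

    Note that both queries treat each line of HTML as an opaque string; for structured extraction you would still need the MapReduce parser or SerDe mentioned above.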


The topic ‘Importing Web pages into Hadoop’ is closed to new replies.
