Importing the web pages into Hadoop is really quite easy. Copy them from the server they reside on to the local file system of one of the nodes in your cluster (typically a client node), then in a shell on that node run 'hadoop fs -put <files> <hdfs-dir>', where <files> is a wildcard pattern (a shell glob rather than a true regex) that matches all of the files you want copied into Hadoop and <hdfs-dir> is the target directory. Once in Hadoop, web pages are pretty much just text files and can be queried as such; initially it would be easier to do this with Pig. For Hive to be able to query them, they need some sort of table structure applied. That structure can be applied with a SerDe. You may have to write the SerDe yourself, or you may find one on the internet, as I am sure you're not the first person to want to process HTML in Hive.
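To make "applying a table structure" a bit more concrete, here is a minimal sketch of one possible shortcut. It is not a SerDe. It is a small Python preprocessing step that flattens each page into one tab-separated row before the files are put into Hadoop. The column layout (file name, title, visible text) and the names PageToRow and html_to_row are my own assumptions for illustration, and it uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser


class PageToRow(HTMLParser):
    """Collects the <title> and the visible text of one HTML page.

    Hypothetical helper for illustration only; a real Hive SerDe would be
    written in Java against Hive's own interfaces.
    """

    SKIP = {"script", "style"}  # content of these tags is not visible text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0:
            self.text_parts.append(data)


def _clean(parts):
    # Collapse all whitespace (including tabs and newlines) to single spaces
    # so the field cannot break the tab-separated row format.
    return " ".join(" ".join(parts).split())


def html_to_row(name, html):
    """Return one tab-separated row: file name, page title, visible text."""
    parser = PageToRow()
    parser.feed(html)
    parser.close()
    return "\t".join([name, _clean(parser.title_parts), _clean(parser.text_parts)])


if __name__ == "__main__":
    page = "<html><head><title>Hi</title></head><body><p>Hello world</p></body></html>"
    print(html_to_row("a.html", page))
```

If rows like these are what you put into Hadoop, Hive should be able to read them with an ordinary table declared with ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t', so no custom SerDe is needed. The trade-off is that you lose the raw markup, so this only makes sense if title and text are all you want to query.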