Pig Forum

Process and analyze text documents

  • #49076
    Martin Meier


    I am new to the Hadoop world, and the range of tools and frameworks is a bit confusing. It would be great if someone could tell me where to start, and which architecture and tools are recommended for the following scenario:

    I want to analyze several hundred to several thousand documents (PDF, text, etc.) and cluster them by content using machine learning algorithms. The end result should be a tree that represents the content categories. My question is not how to cluster the documents, but rather:

    .) Is Hadoop the right tool to process the text files?
    .) My idea is to use HDFS to store the text files and create a table that contains the documentID and the document content.
    .) What is the best approach for importing the files into HDFS? First I have to convert them to text files (e.g. with some external tool), but then how do I import thousands of text files into HDFS (Pig?)?
    .) I am not sure at which layer I should normalize the texts to generate text vectors. Should I use Java to access the text files and then store the result in a relational database?

    It would be great if someone could give me some hints or a starting point for my requirement.
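
To make the documentID/content table from the second point concrete, here is a minimal sketch of one possible approach, not a recommendation. It assumes the documents have already been converted to UTF-8 `.txt` files in a single local directory and that the file name (without extension) can serve as the documentID. The function name `build_document_table` and all paths are hypothetical. It packs many small files into one tab-separated file with one document per line, which can then be uploaded as a single file.

```python
from pathlib import Path


def build_document_table(input_dir, output_path):
    """Write one line per document: <documentID><TAB><content on a single line>.

    Assumption: the file name without its extension is used as the documentID.
    Returns the number of documents written.
    """
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(input_dir).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            # Collapse all whitespace (tabs and newlines included) so that
            # each document occupies exactly one line and one content field.
            content = " ".join(text.split())
            out.write(f"{path.stem}\t{content}\n")
            count += 1
    return count
```

The resulting file could then be copied with `hdfs dfs -put` and read in Pig with `PigStorage('\t')`, for example as `(docid:chararray, content:chararray)`. Whether a single combined file is the right layout depends on the document sizes and on the tools used downstream.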

