Process and analyze text documents

This topic contains 0 replies, has 1 voice, and was last updated by Martin Meier 5 months, 1 week ago.

  • Creator
  • #49076

    Martin Meier


    I am new to the Hadoop world, and the choice of different tools and frameworks is a bit confusing for me. It would be great if someone could help me out with where I should start, and which architecture and tools would be recommended for the following scenario:

    I want to analyze some hundreds or thousands of documents (PDF, text, etc.) and cluster them according to their content using machine learning algorithms. In the end, the result should be a tree that represents categories of the content. But my question is not how to cluster the documents; rather:

    .) Is Hadoop the right tool to process the text files?
    .) My idea was to use HDFS to store the text files and create a table that contains the document ID and the document content.
    .) What is the best approach to import the files into HDFS? First I have to convert them to text files (e.g. with some external tool), and then how do I import the thousands of text files into HDFS (Pig?)?
    .) I am not sure at which layer I should normalize the texts to generate text vectors. Should I use Java to access the text files and then store the result in a relational database?
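    To make the last point concrete, here is a minimal sketch in plain Python of the normalization step: mapping a document-ID-to-content table to term-frequency vectors. This is only an illustration of the idea, not Hadoop code; in a real setup the same logic could run inside a map task or a Pig UDF. The stopword list, the toy corpus, and all function names are my own assumptions for the example.

    ```python
    import re
    from collections import Counter

    # Illustrative stopword list; a real pipeline would use a fuller one.
    STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "on"}

    def normalize(text):
        """Lowercase, keep only letter runs, and drop stopwords."""
        tokens = re.findall(r"[a-z]+", text.lower())
        return [t for t in tokens if t not in STOPWORDS]

    def build_vectors(docs):
        """Map {doc_id: raw text} to {doc_id: term-frequency Counter}."""
        return {doc_id: Counter(normalize(text)) for doc_id, text in docs.items()}

    # Toy corpus standing in for the converted text files on HDFS.
    docs = {
        "doc1": "The cat sat on the mat.",
        "doc2": "A dog chased the cat.",
    }
    vectors = build_vectors(docs)
    ```

    The resulting per-document term counts are what a clustering step would consume; whether this runs in Java against HDFS or as a Pig script is exactly the architectural choice the question is about.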

    It would be great if someone could give me some hints or a starting point for my requirement.

