Process and analyze text documents
I am new to the Hadoop world, and the choice of tools and frameworks is a bit confusing for me. It would be great if someone could tell me where I should start, and which architecture and tools would be recommended for the following scenario:
I want to analyze several hundred to a few thousand documents (PDF, text, etc.) and cluster them by content using machine learning algorithms. The end result should be a tree that represents the categories of the content. My question is not how to cluster the documents, but rather:
.) Is Hadoop the right tool to process the text files?
.) My idea was to use HDFS to store the text files and to create a table that contains the documentID and the document content.
.) What is the best approach to import the files into HDFS? First I have to convert them to text files (e.g. with some external tool), but how do I then import the thousands of text files into HDFS (Pig?)
.) I am not sure at which layer I should normalize the texts to generate text vectors. Should I use Java to read the text files and then store the result in a relational database?
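To make the documentID/content table and the normalization step more concrete, here is a minimal sketch of what I have in mind. It is plain Python with no Hadoop involved. The function names `normalize` and `to_vectors`, the sample documents, and the simple term-count representation are all my own assumptions for illustration, not a recommendation of how it should finally be done.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def to_vectors(documents):
    """Turn {documentID: content} into term-count vectors.

    Returns the shared, sorted vocabulary and one vector per document,
    so that position i in every vector refers to vocabulary[i].
    """
    counts = {doc_id: Counter(normalize(content))
              for doc_id, content in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    vectors = {doc_id: [c[term] for term in vocabulary]
               for doc_id, c in counts.items()}
    return vocabulary, vectors

# Hypothetical stand-in for the documentID/content table.
documents = {
    "doc1": "Hadoop stores files in HDFS.",
    "doc2": "Pig scripts load files into Hadoop.",
}

vocabulary, vectors = to_vectors(documents)
print(vocabulary)
print(vectors["doc1"])
print(vectors["doc2"])
```

What I cannot judge is where this logic should live in a Hadoop setup, and whether the resulting vectors belong in HDFS or in a relational database.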
It would be great if someone could give me some hints or a starting point for my requirement.