Process and analyze text documents

to create new topics or reply. | New User Registration

This topic contains 0 replies, has 1 voice, and was last updated by  Martin Meier 1 year, 5 months ago.

  • Creator
  • #49076

    Martin Meier


    I am new in the HADOOP world, the choice of different tools and frameworks is a bit confusing for me. Would be great if someone could help me out where I should start, which architecture and tool would be recommended for the following scenario:

    I want to analyze some hundreds/thousands of documents (pdf, text, etc.) and cluster them according to their content using machine learning algorithms. At the end the result should be a tree, that represents categories of the content. But my question is not how to cluster the documents, but more:

    .) is HADOOP the right tool to process the text files?
    .) my idea was to use HDF to store the text file and create a table, that contains the documentID and the document content.
    .) To import the files to HDFS, what is the best approach? At first I have to convert it to text files (e.g. with some external tool), and then how to import the thousands of text files to HDFS (pig?).
    .) I am not sure at which layer I should normalize the texts to generate text vectors. Should I use java to access the text files and then store the result in a relational database?

    It would be great if someone could give me some hints or a starting point for me requirement.


You must be to reply to this topic. | Create Account

Hortonworks Data Platform
The Hortonworks Data Platform is a 100% open source distribution of Apache Hadoop that is truly enterprise grade having been built, tested and hardened with enterprise rigor.
Get started with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.
Modern Data Architecture
Tackle the challenges of big data. Hadoop integrates with existing EDW, RDBMS and MPP systems to deliver lower cost, higher capacity infrastructure.