HDFS Forum

Need Redundancy, Not Big Data

  • #46244
    Jeff Bowman
    Participant

    I have a customer on whose behalf I’m investigating Hadoop on Windows. He’s running Windows Server 2012 Essentials Edition in a Hyper-V VM. His is a small company, with fewer than 25 workstations and less than 4TB of storage requirements.

    However, redundant and reliable offsite backup is very important.

    I’m wondering whether HDFS can fill this need. Is it possible to set up a few remote machines and configure them as nodes, and then be able to remove any one of them at any time without impacting the data store as a whole?

    Thanks,
    Jeff Bowman
    Fairbanks, Alaska

  • #46356
    Robert Molina
    Moderator

    Hi Jeff,
    HDFS should be able to fulfill the need for redundant and reliable storage to back up your data. Yes, it is possible to set up a few machines, configure them as nodes, and then remove them at a later time. There is a decommissioning feature that lets you remove a node from the cluster; it redistributes the node’s blocks to other nodes before the node is fully removed, which preserves data integrity and redundancy.
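
    In case it helps, here is a minimal sketch of how you could watch a node’s status during decommissioning from Java, assuming the stock Hadoop DistributedFileSystem client and a cluster reachable via fs.defaultFS. The decommission itself is triggered on the admin side by adding the host to the file named by dfs.hosts.exclude and running hdfs dfsadmin -refreshNodes; this snippet only reports progress.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.hdfs.DistributedFileSystem;
        import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

        public class DecommissionWatcher {
            public static void main(String[] args) throws Exception {
                // Assumes fs.defaultFS points at the cluster, e.g. hdfs://namenode:8020
                Configuration conf = new Configuration();
                DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

                // Print the admin state of every DataNode the NameNode knows about
                // (in service, decommission in progress, or decommissioned).
                for (DatanodeInfo node : dfs.getDataNodeStats()) {
                    System.out.printf("%-30s %s%n", node.getHostName(), node.getAdminState());
                }
            }
        }

    Once a node shows as decommissioned, its blocks have been re-replicated elsewhere and it can be shut down safely.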

    Regards,
    Robert

    #46558
    Jeff Bowman
    Participant

    Hi Robert

    Thanks for this–it helps. However, we’re not quite all the way there yet.

    I could be mistaken, but I thought one of the features of HDFS (being based on Google’s GFS) was fail-safe redundancy: that in the event of a machine failure, the failed node could simply be taken offline and replaced, and the file system would then rebuild itself back to its previous hardened state.

    Sort of a RAID6 across the WAN, if you will.

    Am I misunderstanding this part of it?

    Thanks,
    Jeff Bowman
    Fairbanks, Alaska

    #47214
    Robert Molina
    Moderator

    Hi Jeff,
    If there are under-replicated blocks, HDFS should automatically add replicas for them. The same goes for excess replicas: HDFS will try to remove the surplus copies. So yes, if a node fails, you can take it offline. Decommissioning is the graceful way of removing a node from the cluster when you plan the removal ahead of time.
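
    To make that concrete, here is a small sketch using the standard Hadoop FileSystem API that compares a file’s target replication factor with the number of DataNodes actually holding each of its blocks; the path /backups/archive.dat is just a hypothetical example. After a node failure, the per-block count dips below the target until the NameNode finishes re-replicating.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.BlockLocation;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ReplicationCheck {
            public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());

                // Hypothetical file path used for illustration.
                FileStatus status = fs.getFileStatus(new Path("/backups/archive.dat"));
                short target = status.getReplication();

                // Ask the NameNode which DataNodes hold each block of the file.
                BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
                for (BlockLocation block : blocks) {
                    int actual = block.getHosts().length;
                    System.out.printf("block at offset %d: %d of %d replicas%s%n",
                            block.getOffset(), actual, target,
                            actual < target ? "  <-- under-replicated" : "");
                }
            }
        }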

    Regards,
    Robert

    #47229
    Jeff Bowman
    Participant

    Hi Robert

    This sounds like good news.

    OK then, just to recap: we can have a node fail unexpectedly, such as with a hard drive crash, and then simply replace it, confident that no files were or will be lost.

    Thanks,
    Jeff Bowman
    Fairbanks, Alaska

    #48088
    Jeff Bowman
    Participant

    Is my understanding correct?

    Thanks,
    Jeff Bowman
    Fairbanks, Alaska

    #49270
    Robert Molina
    Moderator

    Hi Jeff,
    Correct. As long as the cluster is reporting healthy and you have enough nodes to accommodate the default of 3 replicas, the data should be fine. But keep in mind that you still have to be aware of what is being done on the cluster. For instance, someone can manually override replication and set the replication factor to 1 for a file; if a node goes down and that file’s blocks were only on that one node, the file would not be accessible.
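
    To illustrate that caveat, here is a minimal sketch: the per-file replication factor can be changed through the standard FileSystem API (hdfs dfs -setrep does the same thing from the shell). The path below is just a hypothetical example.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ReplicationOverride {
            public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());

                // Hypothetical file path used for illustration.
                Path file = new Path("/backups/archive.dat");

                // Dropping a file to a single replica removes the safety net:
                // lose the one DataNode holding its blocks and the file is unreadable.
                fs.setReplication(file, (short) 1);

                // Raising it back to the default of 3 tells the NameNode to re-replicate.
                fs.setReplication(file, (short) 3);
            }
        }

    The cluster-wide default comes from dfs.replication in hdfs-site.xml, and hdfs fsck / will flag any blocks that are under-replicated, so it is worth checking periodically.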

    Regards,
    Robert

    #49278
    Jeff Bowman
    Participant

    Hi Robert

    This is very, very good news. Thank you so much.

    Thanks,
    Jeff Bowman
    Fairbanks, Alaska
