[colug-432] Hadoop follow-up questions

Christopher Stolfi stolfi at gmail.com
Thu Mar 24 11:15:37 EDT 2011


You can use HDFS independently of any Map/Reduce functionality if you
choose (i.e. as a generic data store). You really don't even need the
jobtracker and tasktracker running at all to use HDFS.
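
If it helps, here's a rough sketch (mine, not from any particular
cluster; the namenode host/port and the path are placeholders) of
talking to HDFS directly through the Java FileSystem API:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsOnly {
        public static void main(String[] args) throws Exception {
            // Point the client at the namenode; host/port are placeholders.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode.example.com:8020"), conf);

            Path p = new Path("/tmp/hello.txt");

            // Write a small file straight into HDFS.
            FSDataOutputStream out = fs.create(p, true);
            out.writeBytes("streamed straight into HDFS\n");
            out.close();

            // Read it back.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(p)));
            System.out.println(in.readLine());

            in.close();
            fs.close();
        }
    }

Same idea as "hadoop fs -put" / "hadoop fs -cat" from the command
line; only the namenode and datanodes need to be up for it to work.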

When reading/writing it's not spinning up new JVMs; I believe it uses
the existing namenode and datanode processes for any activity.  I
can't really speak to its speed, but I've never had a problem. My
usage is really limited to streaming files in over the course of a
day, not doing a lot of bulk uploads, so I'm not sure I'm the best
judge.  During one Hadoop migration we were able to saturate a few
gigabit links, but migration traffic is very different and highly
parallel (it's migrating blocks, not files).  Also, three replicas of
each block is just the default...you can lower it to 1 or increase it
to 5 if you so choose.
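
For example (again just a sketch, with a made-up host and paths), a
client can create its files at a lower replication and bump an
existing file up to 5 copies, which is the same thing
"hadoop fs -setrep" does:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TweakReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // dfs.replication only affects files this client creates from now on.
            conf.setInt("dfs.replication", 1);   // default is 3

            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode.example.com:8020"), conf);

            // Raise an existing (hypothetical) file to 5 copies; the namenode
            // schedules the extra block replicas in the background.
            fs.setReplication(new Path("/data/important.log"), (short) 5);

            fs.close();
        }
    }

Since replication is per-file, you can keep 5 copies of the stuff you
care about and a single copy of scratch data on the same cluster.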

The most important thing to commit to multiple locations
(local/NAS/tape) is the namenode data. The namenode is really what
maps files to blocks and blocks to nodes.  The primary and secondary
namenode processes keep track of the filesystem metadata and make
sure the in-memory image and edit logs are regularly checkpointed to
disk.  Without that *one* process and *one* machine, the datanodes
are worthless.
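
The usual safeguard is to point dfs.name.dir in the namenode's
hdfs-site.xml at several directories (comma-separated), say a local
disk plus an NFS mount, so the fsimage and edit log get written to
each one. Here's a throwaway sketch (0.20-era property name, example
paths) that just prints what's configured:

    import org.apache.hadoop.conf.Configuration;

    public class CheckNameDirs {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.addResource("hdfs-site.xml");   // picked up from the classpath

            // dfs.name.dir takes a comma-separated list; the namenode writes a
            // full copy of the fsimage and edit log to every directory listed,
            // e.g. /var/hadoop/name,/mnt/nfs/hadoop/name
            String[] dirs = conf.getStrings("dfs.name.dir");
            if (dirs == null) {
                System.out.println("dfs.name.dir not set; metadata lives under hadoop.tmp.dir");
            } else {
                for (String d : dirs) {
                    System.out.println("namenode metadata dir: " + d);
                }
                if (dirs.length < 2) {
                    System.out.println("WARNING: only one copy of the namenode metadata");
                }
            }
        }
    }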

Even with all that, if the data is that important, it's really a
business decision on spending the money to back it up.  I don't know
that I would back up my datanodes to tape, since they just have
arbitrary blocks on them...so coming up with a backup plan would be
interesting.

-s

On Thu, Mar 24, 2011 at 9:58 AM, Scott Merrill <skippy at skippy.net> wrote:
> On Thu, Mar 24, 2011 at 9:40 AM, Scott Merrill <skippy at skippy.net> wrote:
>> I have a couple of questions this morning. Anyone should feel free to
>> answer, not just Tom, if you have any insight.
>
> One more question: since HDFS redundantly stores data blocks in
> triplicate, does it make sense to still use traditional backup methods
> on data stored in HDFS? If one puts data into HDFS, can one reasonably
> rely on the built-in fault-tolerance of the triplicate copies of that
> data, or should one still be putting data to tape?