[colug-432] Hadoop follow-up questions

Tom Hanlon tom at functionalmedia.com
Fri Mar 25 19:31:44 EDT 2011


On Mar 24, 2011, at 11:15 AM, Christopher Stolfi wrote:

> You can use HDFS independent of any Map/Reduce functionality if you
> choose (ie a generic data store). You really don't even need the
> jobtracker and tasktracker to be running at all to use HDFS.
> 
> When reading/writing it's not spinning up new JVMs, I believe it's
> using the existing Namenode process and Datanode processes for any
> activities.  I can't really comment on its speed, but I've never had
> a problem. My usage is really limited to streaming files in over the
> course of a day, not doing a lot of bulk uploads, so not sure I'm the
> best judge.  During one hadoop migration we were able to saturate a
> few gig links, but migration traffic is very different and highly
> parallel (it's migrating blocks, not files).  Also, 3 replicas of
> each block is just the default...you can lower it to 1 or increase it
> to 5 if you so choose.
> 
> The most important thing to commit to multiple locations
> (local/NAS/tape) is the namenode data. The namenode is really what
> maps files to blocks and blocks to nodes.  The primary and secondary
> namenode processes keep track of the filesystem mappings and make sure
> the current in-memory image and edit logs are kept in check (and on
> disk).  Without this *one* process and *one* machine, the datanodes
> are worthless.
> 
> Even with all that, if the data is that important, it's really a
> business decision on spending the money to back it up.  I don't know
> that I would backup my datanodes to tape, since they just have
> arbitrary blocks on them...so coming up with a backup plan would be
> interesting.
> 


Conceptually the backup plan might seem challenging at first, but not really once you get familiar with the tools.

distcp will let you copy from HDFS to the local filesystem:
hadoop distcp -jt local hdfs://name-node/path/to/dir file:///path/to/local/dir
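
It also works cluster-to-cluster, which is the more common backup pattern (copy everything to a second, smaller cluster). A rough sketch of that, with made-up namenode hostnames and ports:

hadoop distcp hdfs://prod-nn:8020/path/to/dir hdfs://backup-nn:8020/path/to/dir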


Or you could simply script the hadoop fs commands.
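
A minimal sketch of that, assuming made-up paths and a made-up backup mount point:

#!/bin/sh
# Sketch only: pull everything under /data out of HDFS into a dated
# directory on the backup mount, then let the normal tape/NAS backup
# pick it up from there.
SRC=/data
DEST=/mnt/backup/hdfs-dump/$(date +%Y%m%d)

mkdir -p "$DEST"
hadoop fs -copyToLocal "$SRC" "$DEST"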

Or you could use the Java FileSystem API:

http://search-hadoop.com/jd/hcommon/org/apache/hadoop/fs/FileSystem.html#FileSystem()

Or perhaps you can get the FUSE-DFS stuff to work. I tried for about half an hour and decided the effort wasn't worth the time, but supposedly you can mount HDFS using FUSE.

Not that you get in-place updates, but otherwise it looks like a Unix filesystem.
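
If you do get it going, I believe the mount looks roughly like this (the hadoop-fuse-dfs wrapper here is from the CDH packaging; hostname, port, and mount point are made up):

hadoop-fuse-dfs dfs://name-node:8020 /mnt/hdfs
ls /mnt/hdfs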

--
Tom 

> -s
> 
> On Thu, Mar 24, 2011 at 9:58 AM, Scott Merrill <skippy at skippy.net> wrote:
>> On Thu, Mar 24, 2011 at 9:40 AM, Scott Merrill <skippy at skippy.net> wrote:
>>> I have a couple of questions this morning. Anyone should feel free to
>>> answer, not just Tom, if you have any insight.
>> 
>> One more question: since HDFS redundantly stores data blocks in
>> triplicate, does it make sense to still use traditional backup methods
>> on data stored in HDFS? If one puts data into HDFS, can one reasonably
>> rely on the built-in fault-tolerance of the triplicate copies of that
>> data, or should one still be putting data to tape?

Tom Hanlon
tom at functionalmedia.com
Cloudera Certified Hadoop Developer
Certified MySQL DBA



