[colug-432] Hadoop follow-up questions

Thu Mar 24 09:40:34 EDT 2011

Last night's COLUG was an interesting introduction to Hadoop. Tom
Hanlon did a great job cutting through the hype to present the
strengths and weaknesses of Hadoop. Thanks, Tom!

I have a couple of questions this morning. Anyone should feel free to
answer, not just Tom, if you have any insight.

Since Hadoop is built atop HDFS, is it possible to use the underlying
HDFS directly, independently of Hadoop's job scheduling functions,
while still using that job scheduling for other data? For example,
would it be possible or advisable to store a large amount of binary
data in HDFS and access it with HDFS's GET and PUT operations, while
simultaneously putting other data into HDFS to be processed by
MapReduce jobs?

If the above is possible, can anyone speak to the general performance
of the HDFS GET and PUT operations? I understand that MapReduce is a
batch process, and that spinning up the JVM adds startup overhead. But
for simply reading a file stored in HDFS with a GET command, what kind
of latency and throughput can one expect?
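
To make the question concrete, one rough way to get a first number is
to time the `hadoop fs -put` and `hadoop fs -get` shell commands from a
small script. The sketch below assumes the `hadoop` command is on the
PATH and pointed at a running cluster. The helper names and file paths
are hypothetical. Each `hadoop fs` invocation starts its own JVM, so
these timings include JVM startup and overstate the cost of the
transfer itself.

```python
import subprocess
import time


def hdfs_put_cmd(local_path, hdfs_path):
    """Build the argument list for copying a local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]


def hdfs_get_cmd(hdfs_path, local_path):
    """Build the argument list for copying a file out of HDFS."""
    return ["hadoop", "fs", "-get", hdfs_path, local_path]


def time_command(cmd):
    """Run cmd to completion and return the wall-clock seconds it took.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


# Example usage (hypothetical paths; requires a working Hadoop install):
#   put_s = time_command(hdfs_put_cmd("/tmp/blob.bin", "/data/blob.bin"))
#   get_s = time_command(hdfs_get_cmd("/data/blob.bin", "/tmp/blob.copy"))
#   print(f"PUT {put_s:.2f} s, GET {get_s:.2f} s")
```

Repeating the measurement with files of different sizes would help
separate the fixed startup cost from the per-byte transfer cost.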

Thanks again, Tom, for the presentation!

Cheers,
Scott