[colug-432] Splunk Storm

Wed Jan 2 19:32:36 EST 2013

On Jan 2, 2013, at 12:56, Tom Hanlon <tom at functionalmedia.com> wrote:

> Colug, 
> 
> I saw the title to this and I thought to myself .... 
> " is this a combination of splunk with Nathan Marz's from Twitter, Storm product" 
> https://github.com/nathanmarz/storm
> 
> Sort of in the same space. I wonder if Splunk added the word 'Storm' to a product to muddy the context for customers ? 
> 
> Anyhow..
> I do think that Nathan Marz's Storm is an interesting project. 
> 
> I teach hadoop, so along with ingestion of data, often with flume or some other log collection tool, folks often want something close to near real time event processing. 
> 
> There is esper, there is splunk , there is storm. 
> http://esper.codehaus.org/
> 
> As far as Open Source (apache licensed) Nathan Marz's Storm is getting a lot of attention, not sure how much adoption. 

My large undisclosed company is using Storm as a conduit for moving medical data in real time between our legacy systems and Hadoop for our mobile tablet app.  I can't even pretend to understand the difference between a mauled zookeeper and a dead region server :) The guy on my team who understands Hadoop the best is scary smart.

> Thanks for the tip on splunkstorm. Are the features identical to commercial splunk ? I might want to mess around with Splunk so I can get a feel for the features. I tend to talk about it in vague terms, I think I cover the basics, but a little hands-on experience would be nice. 

I don't have a whole lot of experience with Splunk from the user side, but so far it seems like all the major features are there. What seems to be missing is the typical stuff I'm used to seeing as a Splunk server admin.  We just started rolling it out about 6 months ago, so we're all learning how to manage it as my clients are finding ways to (ab|mis)use it. :)

For Splunk Storm, setting up the forwarder (the Splunk client) on my Mac was really easy.

There are a couple of things to watch out for when configuring it that have already bitten us:

- don't tell splunk to monitor a _directory_ of logs unless:
    - the directory only contains live logs (not archived, not managed by e.g. logrotate)
    - the directory contains archived logs, but you've blacklisted the archive mask/regex in the splunk config
- don't set the sourcetype yourself unless you really want to override Splunk's detection, or Splunk has no chance of telling the type because the log is entirely custom (e.g. the modem signal strength logs) and you want to set a value. Splunk auto-recognizes lots of log formats, like NCSA, CLF, etc.
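
To make the directory warning concrete, here is a rough Python sketch of what an archive blacklist regex has to do. This is not Splunk config, and the pattern and file names are made up for illustration; I'm only assuming the blacklist is a regex tested against each file name, so check what your own rotation scheme produces before relying on anything like it.

```python
import re

# Hypothetical blacklist regex (an assumption, not taken from any Splunk docs):
# matches names ending in a rotation suffix such as ".1", "-20130101",
# ".2.gz", or a bare ".gz"/".bz2".
ARCHIVE_BLACKLIST = re.compile(r"(\.\d+|-\d{8})(\.(gz|bz2))?$|\.(gz|bz2)$")


def split_live_and_archived(names, blacklist=ARCHIVE_BLACKLIST):
    """Return (live, archived); a name counts as archived if the regex matches."""
    live, archived = [], []
    for name in names:
        (archived if blacklist.search(name) else live).append(name)
    return live, archived


if __name__ == "__main__":
    # Made-up directory listing, for illustration only.
    listing = [
        "access.log",
        "access.log.1",
        "access.log.2.gz",
        "error.log",
        "error.log-20130101",
        "messages.gz",
    ]
    live, archived = split_live_and_archived(listing)
    print("would be monitored:", live)
    print("would be skipped:  ", archived)
```

Running a candidate regex over a real listing of the directory like this is a cheap way to see which files would still get picked up before you point the forwarder at it.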

Hmm, it occurs to me that Splunk Storm only allows one index per project. That makes sense, though: an index is roughly equivalent to a database, so it's one db per project.

There is a #splunk IRC channel on EFnet. If IRC is blocked at your company, I think the web client is http://irc.efnet.org:9090

I think, especially for starting out, Splunk Storm is a less painful way to go than trying to build your own server.