Russel's Havens
  • Home
  • Log Analysis
  • Blog of Time

Work Time - Scalable Monitoring

9/14/2015

0 Comments

 
My professional space, monitoring, is deceptive.  How hard could it be, after all -- you just make sure stuff is up.  Worst case, you figure out Nagios or Zabbix or something, and set that up.

Well, scalable enterprise-level monitoring is actually quite hard.  Every internal team has a different idea of what monitoring is -- if you ask 10 people what monitoring is, you'll get 8 to 12 answers, with plenty of overlap and plenty of differences in opinion.

There are at least a dozen other decent open source monitoring tools, plus several dozen commercial (small and large) tools out there.  I work in an open source shop, and a long career has taught me that getting funding for an expensive monitoring package is unlikely to happen.  And when it does, then 3 or 4 years later, somebody will refuse to pay the maintenance fee.  So open source it is for me.  I'll just mention a couple in this post, a couple that we use heavily.

Nagios is not a bad first step -- at its core, it's just a time-based script executor with the ability to call alert scripts. -- conditional cron=simple  It can tell you when stuff is down, when it's low on resources, when it's not responding.  It has plugins for all sorts of things.  But those plugins are clunky.  The configuration is super clunky.  Any time you need to change something, you have to restart Nagios.  That's just plain icky. By default you get no historical data, no intelligent escalation, and by the time you've added plugins to help those out, your little Nagios instances has morphed into something quite different.  And scaling that different thing is not pretty.  Trust me -- I have 120 Nagios servers in my environment.

Zabbix gives you more functionality out of the box.  I personally find the model awkward, but it gives you nice graphing, a decent UI (as opposed to Nagios config files and VI), a decent API, etc.  But scale it out beyond a few thousand hosts and MySQL melts down.  I have 50,000 hosts to monitor, with 350,000 services (by Nagios reckoning).  And as soon as anybody sees graphs, they want 10 times as many metrics collected.

It's funny, but speaking of graphs, most people nowadays see graphs and dashboarding as central to monitoring.  Not that these buy you alerts, but they are considered critical for troubleshooting and trend analysis.  And, to be honest, they lend themselves nicely to proactive analysis, which can save you from downtime generating crashes.

So, my team's solution?  Something like this:

Picture
0 Comments



Leave a Reply.

    Author

    Russel is a senior career IT guy and relatively new manager with an academic interest in log management and log data analysis, a professional interest in monitoring and management systems. database management, and programming languages, and personal interests in family, photography, reading, and the outdoors.

    Archives

    January 2023
    August 2022
    February 2022
    January 2022
    November 2021
    October 2021
    February 2021
    January 2021
    December 2020
    August 2020
    July 2020
    July 2018
    October 2017
    September 2017
    July 2017
    January 2017
    December 2016
    July 2016
    January 2016
    September 2015
    July 2014
    April 2014
    July 2013
    June 2013
    December 2012
    August 2012
    July 2012
    May 2012
    January 2012
    December 2011
    November 2011
    October 2011
    September 2011

    Categories

    All
    Academic
    Aws
    Family
    Gratitude
    Home
    Language
    Reading
    Travel
    Work

    View my profile on LinkedIn

    RSS Feed

Powered by Create your own unique website with customizable templates.