Russel's Havens

Velocity 2016

7/13/2016

Picture
A few weeks ago, I was able to attend the O'Reilly Velocity Conference.  It was my first time at this particular conference, and I was very impressed with the breadth of coverage.  As with all conferences, some sessions were much better than others.  I rarely enjoy the infomercial-style sessions or the "we need to change our culture" whining sessions.  But there were several excellent technical sessions, and a few great sleepers (e.g. the one on anxiety in IT ops).

I put together a tl;dr list for my team:
  • It’s still a “choose your own adventure” experience in monitoring
  • Monitoring means different things to different people
  • There are lots of tools, and lots of new tools being created
  • The big guys still mostly roll their own or stitch together open source tools with custom code (lots of “dev” in their “ops”)
  • Anybody can build something, but to build something maintainable and effective requires specialization and a deep understanding of the tools available
  • Microservices on containers: everybody was talking about them, but most were saying “no silver bullet” and “it’s a mixed bag”
  • Advanced mathematical analytics is hard, and someday we’ll be able to use it without a PhD…but probably not today
  • To provide real value, make your services as easy as possible for the customer (engineering, ops on call, et al) – create tools, ChatOps, APIs, UIs, data access, etc.
  • SRE/DevOps = 60% development/automation + 40% operations (the work is more dev than ops)

Slides and videos can be found (at the moment) at http://conferences.oreilly.com/velocity/devops-web-performance-ca/public/schedule/proceedings

Some Fun Snowshoeing Adventures

1/9/2016

Picture
Snowshoeing on Rocky Ridge.
Picture
The well-traveled canal trail in Payson Canyon
I was lucky enough to go snowshoeing several times during the Christmas holiday. Here are three examples. Not as much snow as I would like, but plenty of cold, white landscape, exercise and solitude.
Picture
Same amount of time but too many things to fill it

1/9/2016

Ah, the holidays.  The wonderful time of quiet and rest.  Just enough to remind you that you have a lot of work ahead.


In addition to the day job, I'm teaching a SysAdmin class at BYU this semester.  And taking a writing class for my PhD program at Capella University. And going to a PhD colloquium in February to find a good dissertation topic and mentor, which is, essentially, another class with its own, shorter, track.  This will be a fun few months.

Work Time - Scalable Monitoring

9/14/2015

My professional space, monitoring, is deceptive.  How hard could it be, after all -- you just make sure stuff is up.  Worst case, you figure out Nagios or Zabbix or something, and set that up.

Well, scalable enterprise-level monitoring is actually quite hard.  Every internal team has a different idea of what monitoring is -- if you ask 10 people what monitoring is, you'll get 8 to 12 answers, with plenty of overlap and plenty of differences in opinion.

There are at least a dozen other decent open source monitoring tools, plus several dozen commercial (small and large) tools out there.  I work in an open source shop, and a long career has taught me that getting funding for an expensive monitoring package is unlikely to happen.  And when it does, then 3 or 4 years later, somebody will refuse to pay the maintenance fee.  So open source it is for me.  I'll just mention a couple in this post, a couple that we use heavily.

Nagios is not a bad first step -- at its core, it's just a time-based script executor with the ability to call alert scripts.  Think of it as a conditional cron, which keeps it simple.  It can tell you when stuff is down, when it's low on resources, when it's not responding.  It has plugins for all sorts of things.  But those plugins are clunky.  The configuration is super clunky.  Any time you need to change something, you have to restart Nagios.  That's just plain icky.  By default you get no historical data and no intelligent escalation, and by the time you've added plugins to help those out, your little Nagios instance has morphed into something quite different.  And scaling that different thing is not pretty.  Trust me -- I have 120 Nagios servers in my environment.
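To make the "script executor" idea concrete, here is a minimal sketch of the kind of check script Nagios runs on a schedule.  It follows the usual plugin convention of printing one status line and returning exit code 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).  The function names and the 80/90 percent thresholds are made up for illustration; this is not one of the plugins from my environment.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_usage(used_pct, warn_pct=80.0, crit_pct=90.0):
    """Map a percent-used figure to an (exit_code, label) pair.

    The default thresholds are illustrative assumptions.
    """
    if used_pct >= crit_pct:
        return CRITICAL, "CRITICAL"
    if used_pct >= warn_pct:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path="/", warn_pct=80.0, crit_pct=90.0):
    """Return (exit_code, one-line status message) for the filesystem at path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # A check that cannot run reports UNKNOWN instead of guessing.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero total size"
    used_pct = 100.0 * usage.used / usage.total
    code, label = classify_usage(used_pct, warn_pct, crit_pct)
    return code, f"DISK {label} - {used_pct:.1f}% used on {path}"


def main():
    """Print the status line and return the exit code for the scheduler."""
    code, message = check_disk()
    print(message)
    return code

# When installed as a plugin, the script would end with:
#   import sys; sys.exit(main())
```

The scheduler only looks at the exit code and the first line of output, which is why these checks are so easy to write and why everything beyond up/down (history, graphs, escalation) has to be bolted on.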

Zabbix gives you more functionality out of the box.  I personally find the model awkward, but it gives you nice graphing, a decent UI (as opposed to Nagios config files and vi), a decent API, etc.  But scale it out beyond a few thousand hosts and MySQL melts down.  I have 50,000 hosts to monitor, with 350,000 services (by Nagios reckoning).  And as soon as anybody sees graphs, they want 10 times as many metrics collected.

It's funny, but speaking of graphs, most people nowadays see graphs and dashboarding as central to monitoring.  Not that these buy you alerts, but they are considered critical for troubleshooting and trend analysis.  And, to be honest, they lend themselves nicely to proactive analysis, which can save you from downtime-generating crashes.

So, my team's solution?  Something like this:

Picture

Work time (-ish): Monitorama

7/23/2014

A couple of months ago, my monitoring team, including my manager, attended the Monitorama PDX 2014 conference.


The conference started out slowly, but turned out to be quite useful in shaping the group's mindset about monitoring.  Many important principles were discussed at length during the conference's 3 days.


Stepping away from the running system and thinking about the space with the broader perspective was hugely helpful.  In researching the space after the conference, I ran across this brilliant USENIX LISA conference presentation by Caskey Dickson.  This presentation helped us to take what we had learned at Monitorama and frame that into manageable portions of work.  

My manager and I spent the next several weeks putting together a conceptual framework that covers various aspects of monitoring, many of which were addressed in the conference.  The framework is based around Caskey's monitoring components diagram, but, of course, is customized for our environment.
Then, of course, the day job caught up with us, and we found ourselves spending June and part of July finishing a number of projects to consolidate and stabilize our current environment.  After all, we have an ongoing monitoring job to do, so we have to replace the jet's engines while still in flight -- no ignoring running systems while we go and build a new one.

One of the more interesting take-aways from the conference was this: one of the most valuable things we got from the conference was moving the members of our team closer together in our understanding of the problem space and various aspects of possible solutions to current problems.  This is absolutely huge in almost every way measurable for us.

Outside of the conference, we made a couple of evening trips to beautiful areas.  I finally got around to dealing with those photos.  Here are some of those.

Logstash, openssl Certs and Madness

4/2/2014

Picture
Late last fall, I set up a Logstash/ElasticSearch/Kibana instance to collect logs from some of my monitoring servers.  The logs were coming through Logstash-forwarder with almost no overhead, and it was a beautiful thing.
Then, sometime in late December, the logs stopped going into ElasticSearch.  No errors were to be seen anywhere on the Logstash server, just no logs in ElasticSearch.

Life is busy, between the many things happening at Adobe and the class I'm teaching at BYU, so I poked at it on and off again for a few weeks.  I fired an email to a contact I have at ES, but got dead air.  I eventually joined the #logstash IRC channel on freenode, but never found the time to troubleshoot to a point where I could succinctly describe the situation and environment.

This situation has driven me crazy ever since.  I've looked at it again several times, but never found any debug info or logs that helped me figure out what was going on.  netstat and tcpdump showed that the traffic was flowing, but nothing was going to ES.  Upgrading Logstash and ElasticSearch, while adding cool features, did not help in the slightest.
Changing the output to a file didn't help either, so it likely wasn't ES or the ES plugin.  Data was coming in, so the logstash-forwarders were doing their jobs. 

And, on top of that, not having very much time to work on it meant that it was going to be a pebble in my shoe for some time.  Adobe uses Splunk, so this was something of a skunkworks project anyway, and it would be hard to push for a product that just stops working for no apparent reason.  So, to keep my sanity (or what's left of it), I eventually turned off the logstash-forwarders and let the logstash VM lie fallow.

Fast forward a couple of months and now I'm looking for a new monitoring solution for my business unit.  I've come to understand that at our scale (40,000+ devices), tracking monitoring data, especially with historical data, graphs, etc., really is a big data problem.  In fact, just tracking the configuration of monitoring is a small-ish big data problem.  So, I've been studying up on big data tools.  Among the tools that popped up was ElasticSearch.

Ah, ElasticSearch and that beautiful Logstash-forwarded data.  It was so beautiful while it lasted.  Maybe it would be worth another go-around to try to get it going.  Wouldn't it be beautiful to get that going again and actually be able to use my server and application logs?

There's no time at work, so, a couple of nights ago I decided to set it up as if from scratch.  I updated to the current version of Logstash and ElasticSearch.  No change yet.  I double-checked connections with netcat and netstat.  Still reachable and connections still there.  I recreated the Logstash and logstash-forwarder configs for a couple of servers.  Nothing. 

Then I replaced all the certificates used by logstash-forwarder and voilà.  Logs started flowing again.

After some investigation, I found that the original cert I'd created had expired at the end of December -- the default expiration for openssl certificates is 1 month (30 days, unless you ask for more with -days). While I'm sure this is very secure, 1 month is not very long in the real world.
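A check like the following would have caught this weeks earlier.  It is only a sketch: it assumes you already have the certificate's notAfter string, for example from the output of `openssl x509 -enddate -noout -in cert.pem`, and the function names and the 30-day warning window are made up for illustration.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Return the days left before a certificate's notAfter timestamp.

    not_after uses the format openssl prints, e.g. "Dec 31 23:59:59 2013 GMT".
    A negative result means the certificate has already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    if now is None:
        now = datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400.0


def expiry_status(not_after, warn_days=30, now=None):
    """Classify a certificate as EXPIRED, EXPIRING SOON or OK.

    warn_days is an illustrative threshold, not a recommendation.
    """
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return "EXPIRED"
    if remaining < warn_days:
        return "EXPIRING SOON"
    return "OK"
```

Run from cron (or as a Nagios check) against every cert the forwarders use, this turns a silent failure into an alert you get while there is still time to reissue.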

So, now, I have a few years before this new cert expires.  Maybe I'll be better at debugging Logstash by then.  Or maybe I'll find this blog post.  Either way, I'll probably facepalm, generate & distribute a new cert and go for a few more years.

While listening to LS and ES YouTube videos, I heard Jordan Sissel (who is awesome, by the way) describe himself as a SysAdmin like this: "I'm a SysAdmin, which means that I like being angry with computers."
How true that is at times!  I love his concept of "anger-driven development"--something makes him angry, so he writes some project that fixes the problem.  That's cathartic in the best sense of the word, I think.  Logstash, fpm and his other projects are awesome, and awesomely cathartic for me, too.

Work Time: Oregon

7/27/2013

Picture
The front of the building. Inside, there seem to be guards and badge readers everywhere, but the outside is colorful and friendly.
Mid-week, I spent a couple of days in Oregon, getting to know Adobe's new data center.  Though it is dwarfed by the Intel facility just up the street, it is a very respectable facility, with a planned capacity of 7.5 megawatts.

As with everything Adobe does, the design is brilliant -- both in looks (as would be expected of Adobe) and in functionality (as would be expected of a world-class SaaS-maintaining IT organization).

Here are a few shots of the building, inside and out.

No Time: And Plenty to Do

7/21/2013

The knee is moving towards recovery pretty much as my physical therapist expects (though to me, it seems to be taking a long time).  Thankfully, I no longer need the crutches or knee brace for normal movement (though I'll be using a brace for more adventurous activities, like hiking, for a while).

As I'm recovering, there seems to be no shortage of things that need to be done.  At church, I'm a counselor in the ward bishopric, which takes my Sundays and one or two other nights per week.  My daughter is getting married in August (as seen on the Wedding page on this site), which somehow seems to take time once in a while.  And I'm getting an IT 515R SysAdmin class ready to teach at BYU this fall, for which I still need to rework a couple of labs.  (I've also been asked to teach the Operating Systems class next Winter, which is totally awesome.)

This brings me to thinking about time management.  At work, I'm finding that my days are often sliced up by meetings, each of which gives me more to do.  The larger projects I work on tend to have long timelines and demand focus for long periods of time.  I think I need to work on being better at using small time slices.  

Using a ticketing system like Jira helps a bit, at least for tracking the work.  That sort of makes the work feel like queue work, which usually consists of lower-level request-response tasks.  So in some ways, it breaks the large projects into workable-sized bits.  That helps with the "top dead-center" feeling that you get on months-long projects that don't move much on any given day.

Home Time: Knee Recovery and Image Recognition

6/7/2013

Picture
This is a photo of a very sick knee, my very sick knee, in fact.

Earlier this Spring, I dislocated my patella, twice, about 2 weeks apart.  This comes with some history, and was not the first time, but this time I decided it's time to do something about it.

And thus I have a whole video of these sort of interesting views, recorded last week during knee surgery.  Although it's intellectually interesting, I'm still swollen and recovering, so it's actually slightly painful to watch the surgeon cleaning up, cauterizing and tightening the various parts of the inside of my knee.  Maybe with some time, the pain will fade and it will just be interesting.

This image is from the before part, so, most of the loose blobs there were removed during the surgery.  

As a data analyst, I'm intrigued by these sorts of images -- it takes a highly trained and experienced surgeon to really know what they are looking at.  I could only guess, and it's my knee.  If an untrained human can't really make heads or tails out of it, what sorts of knowledge would a computer need to "understand" it?  Would there be a way to arthroscope a joint with a single camera pass, then have a computer analyze what's normal and what's not, and highlight some of those parts for a doctor in real time?  Would that be of benefit?  I suppose that as an academic, it's just an intriguing problem.  But as an operational realist (and one with a bad knee), I wonder what those benefits would be and if such a technology might help me (or might have helped me).

Academic and Career time: USENIX LISA '12 Poster Session

12/2/2012

Picture
The USENIX LISA '12 paper that Bret Swan and I created (he wrote it; I provided the knowledge) was not accepted, but early in November, I received a surprise email stating that they would like us to present a poster in their poster session.  

So, I've gathered together many of the "Lessons Learned" points we discussed in the paper, with more that have come to light since then, and put them into a poster form.  The poster is entitled: Enterprise Monitoring Visualization for SaaS: Lessons Learned in Developing Adobe’s Digital Marketing NOC
The screen grab at the left is my first stab, but it will get a facelift from the brilliant designer who did the design work for our new NOC wall (some of whose work you see here, in thumbnail form), so it will soon look much better.
The conference is in San Diego in a couple of weeks.  I couldn't be more excited.  I'm looking forward to soaking up as much knowledge as I possibly can.  Being able to actually present there is just a very tasty icing on an already near-perfect cake.


    Author

    Russel is a mid-career IT guy and new manager with an academic interest in log management and log data analysis, a professional interest in monitoring and management systems, database management, and programming languages, and personal interests in family, photography, reading, and the outdoors.
