A couple of months ago, my monitoring team, including my manager, attended the Monitorama PDX 2014 conference. The conference started out slowly, but turned out to be quite useful in shaping the group's mindset about monitoring. Many important principles were discussed at length during the conference's 3 days. Stepping away from the running system and thinking about the space with a broader perspective was hugely helpful.

In researching the space after the conference, I ran across this brilliant USENIX LISA conference presentation by Caskey Dickson. This presentation helped us to take what we had learned at Monitorama and frame it into manageable portions of work. My manager and I spent the next several weeks putting together a conceptual framework that covers various aspects of monitoring, many of which were addressed in the conference. The framework is based around Caskey's monitoring components diagram, but, of course, is customized for our environment.

Then, of course, the day job caught up with us, and we found ourselves spending June and part of July finishing a number of projects to consolidate and stabilize our current environment. After all, we have an ongoing monitoring job to do, so we have to replace the jet's engines while still in flight -- no ignoring running systems while we go and build a new one.

One of the more interesting take-aways was this: one of the most valuable things we got from the conference was moving the members of our team closer together in our understanding of the problem space and various aspects of possible solutions to current problems. This is absolutely huge in almost every way measurable for us.

Outside of the conference, we made a couple of evening trips to beautiful areas. I finally got around to dealing with those photos. Here are some of those.
Late last fall, I set up a Logstash/ElasticSearch/Kibana instance to collect logs from some of my monitoring servers. The logs were coming through Logstash-forwarder, with almost no overhead, and it was a beautiful thing. Then, sometime in late December, the logs stopped going into ElasticSearch. No errors were to be seen anywhere on the Logstash server, just no logs in ElasticSearch.

Life is busy, between the many things happening at Adobe and the class I'm teaching at BYU, so I poked at it on and off for a few weeks. I fired an email to a contact I have at ES, but got dead air. I eventually joined the #logstash IRC channel on freenode, but never found the time to troubleshoot to a point where I could succinctly describe the situation and environment.

This situation has driven me crazy ever since. I've looked at it again several times, but never found any debug info or logs that helped me figure out what was going on. netstat and tcpdump showed that the traffic was flowing, but nothing was going to ES. Upgrading Logstash and ElasticSearch, while adding cool features, did not help in the slightest. Changing the output to a file didn't help either, so it likely wasn't ES or the ES plugin. Data was coming in, so the logstash-forwarders were doing their jobs. And, on top of that, not having very much time to work on it meant that it was going to be a pebble in my shoe for some time.

Adobe uses Splunk, so this was something of a skunkworks project anyway, and it would be hard to push for a product that just stops working for no apparent reason. So, to keep my sanity (or what's left of it), I eventually turned off the logstash-forwarders and let the Logstash VM lie fallow.

Fast forward a couple of months, and now I'm looking for a new monitoring solution for my business unit. I've come to understand that at our scale (40,000+ devices), tracking monitoring data, especially with historical data, graphs, etc., really is a big data problem.
In fact, just tracking the configuration of monitoring is a small-ish big data problem. So, I've been studying up on big data tools. Among the tools that popped up was ElasticSearch. Ah, ElasticSearch and that beautiful Logstash-forwarded data. It was so beautiful while it lasted. Maybe it would be worth another go-around to try to get it going. Wouldn't it be beautiful to get that going again and actually be able to use my server and application logs?

There's no time at work, so a couple of nights ago I decided to set it up as if from scratch. I updated to the current version of Logstash and ElasticSearch. No change yet. I double-checked connections with netcat and netstat. Still reachable and connections still there. I recreated the Logstash and logstash-forwarder configs for a couple of servers. Nothing. Then I replaced all the certificates used by logstash-forwarder and voilà. Logs started flowing again.

Upon some investigation, the original cert I'd created had expired at the end of December -- the default validity for a certificate generated with openssl is 1 month (30 days) unless you ask for longer. While I'm sure this is very secure, 1 month is not very long in the real world. So, now, I have a few years before this new cert expires. Maybe I'll be better at debugging Logstash by then. Or maybe I'll find this blog post. Either way, I'll probably facepalm, generate and distribute a new cert, and go for a few more years.

In the process of listening to LS and ES YouTube videos, I heard Jordan Sissel (who is awesome, by the way) describe himself as a SysAdmin like this: "I'm a SysAdmin, which means that I like being angry with computers." How true that is at times! I love his concept of "anger-driven development" -- something makes him angry, so he writes some project that fixes the problem. That's cathartic in the best sense of the word, I think. Logstash, fpm and his other projects are awesome, and awesomely cathartic for me, too.
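
A silently expiring certificate like the one above is the kind of thing a small scheduled check can flag ahead of time. What follows is only a minimal sketch, not anything that ships with Logstash or logstash-forwarder: it assumes the input is the `notAfter=...` line printed by `openssl x509 -enddate -noout -in <cert>`, and the function name `days_until_expiry` is made up for illustration.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(enddate_line, now=None):
    """Return the number of days (rounded down) until the notAfter date.

    enddate_line is assumed to be the text printed by
    `openssl x509 -enddate -noout`, for example
    "notAfter=Dec 31 23:59:59 2013 GMT". The result is negative once
    the certificate has expired.
    """
    # Drop the "notAfter=" label if present; keep only the timestamp.
    timestamp = enddate_line.strip().split("=", 1)[-1]
    # ssl.cert_time_to_seconds parses OpenSSL's certificate time format
    # and returns seconds since the epoch (UTC).
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(timestamp), tz=timezone.utc
    )
    # Allow the caller to pass a fixed "now" so the check is easy to test.
    now = now or datetime.now(timezone.utc)
    return (expires - now).days
```

Run from cron or wrapped in a monitoring check, the returned number could be compared against a threshold of your choosing (say, warn when fewer than 30 days remain), so the renewal happens before the logs go quiet.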
The front of the building. Inside, there seem to be guards and badge readers everywhere, but the outside is colorful and friendly.

Mid-week, I spent a couple of days in Oregon, getting to know Adobe's new data center. Though it is dwarfed by the Intel facility just up the street, it is a very respectable facility, with a planned capacity of 7.5 megawatts. As with everything Adobe does, the design is brilliant -- both in looks (as would be expected of Adobe) and in functionality (as would be expected of a world-class SaaS-maintaining IT organization). Here are a few shots of the building, inside and out.

The USENIX LISA '12 paper that Bret Swan and I created (he wrote it; I provided the knowledge) was not accepted, but early in November, I received a surprise email stating that they would like us to present a poster in their poster session. So, I've gathered together many of the "Lessons Learned" points we discussed in the paper, with more that have come to light since then, and put them into poster form. The poster is entitled:

Enterprise Monitoring Visualization for SaaS: Lessons Learned in Developing Adobe’s Digital Marketing NOC

The screen grab at the left is my first stab, but it will get a facelift from the brilliant designer who did the design work for our new NOC wall (some of whose work you see here, in thumbnail form), so it will soon look much better. The conference is in San Diego in a couple of weeks. I couldn't be more excited. I'm looking forward to soaking up as much knowledge as I possibly can. Being able to actually present there is just a very tasty icing on an already near-perfect cake.

The answer to that is: to work. This is the new Adobe Digital Marketing Business Unit NOC, which I've spearheaded for the last few months. (And worked It's a huge change from the previous NOC, with much more room and much more emphasis on showing off what we are doing. With around 150 square feet of video wall, there's a lot to be seen.
There's also a lot to be worked on, and that's where I've been for 50-60-ish hours a week for the last 2+ months. It's funny how you can see something as cool as this and still think only about how much more there is that needs to be done, how much better it should be. Funny, except that it's easy to let that grind you down.

As I've worked these many hours lately, I've wondered if I'm experiencing, in a small way, something like what anorexics experience: they may be terribly thin, but they still only see themselves as too fat, and becoming thinner doesn't help. Similarly, as I've been burning my evenings and Saturdays, I find myself often thinking that I'm not getting enough done, not working hard enough. That can't be right.

I've seen this lighting on our amazing new building maybe too often. So, as awesome as the new building is, I'm hoping to see a little less of it each week moving forward. Hiring on a couple of new people will help tremendously.

Okay, I'll admit this one thing: it's very fun that lots of people want tours of the place. I can see the allure of just becoming a tour guide and showing people cool places and cool sights. This is no Machu Picchu, but it's still kind of cool.

Adobe's new Lehi facility is built to be a show-off facility. It's shiny, glassy, and bold, built right over a road, with a basketball court that you look into as you come over the point of the mountain on I-15. It's full of amazing work and play areas, making it even cooler on the inside than the outside.

And I get to help design the look of one of the main stops on the tour route: the NOC. We've gotten some brilliant design work from our own product designers, and a state-of-the-art 24-monitor video wall. We'll start out with quite a bit of meaningful data, but hope to ramp that up a great deal over the coming months and years. It's an exciting and time-consuming project.
When it's up and going, it'll be fun saying: "I helped with that" and "I wrote the back end for everything you see here" again, like I did back in the Novell NOC days.

If that were the only project I was working on, I think it would be enough. However, I'm also working on a project to update all our Nagios instances to a new architecture and new Nagios version -- not a small feat while we are monitoring 25,000+ devices with the existing systems. While that is going on, I have two projects to automate Nagios configuration file generation, plus a project to get Zabbix up and running at 4 sites. There's plenty of work going on. Luckily, I'm not alone in this -- I have a great team of people around me.

Holy holiday afterburners, Batman! Like never before, the holiday season has put everything on fast-forward.
This is probably partly due to the exuberance that Adobe has for the holiday -- 2 or 3 org parties and a couple of "give during the holidays" charity events. It's also probably due to the standard required set of holiday concerts, recitals, church parties and other activities. While the latter can be quite fun (or terribly boring, depending on the activity), the former is just impressive. I was even able to convince my wife to go to the company '80s-themed party -- mostly because we haven't had a real company party in many years.

Unfortunately, even with all the holiday activities, I still have tasks to do and deadlines to meet. While this can be stressful, it's also exciting. So, at least for now, fire up those afterburners.

It's a good thing that the computer world isn't simple. After all, who'd pay me to work on fun things like AWS Elastic Load Balancer monitoring if it were totally straightforward?
I'm currently wrestling with a nice Ruby Nagios plugin for doing CloudWatch monitoring. I've got it doing AWS/EC2 metrics. Now I just have to figure out how to get it to report AWS/ELB metrics. Full documentation would have been nice, but, like I said, if it were easy, who'd pay me to figure it out?

Last week was my first as a NOC architect at Adobe's Omniture Business Unit. Each day, I was scheduled for three to five meetings with the managers or members of various teams and given high-level overviews of the various activities in the business unit. Overall, it has been the most thorough and timely organizational training I have ever received.
With tens of thousands of servers, switches, routers and applications to watch, this NOC architect position promises to be both challenging and exciting. The environment already has an amazing set of tools, some of which are quite mature, while others are in need of some work. My first order of business is going to be documenting and getting to know these systems. Wish me luck. There's a lot to get, so I'll need it!
Author

Russel is a senior career IT guy and relatively new manager with an academic interest in log management and log data analysis; a professional interest in monitoring and management systems, database management, and programming languages; and personal interests in family, photography, reading, and the outdoors.