Seeing as I haven’t written anything in 2 months (oops), I thought I’d take some time to write a quick post about the project that I worked on during my placement at CERN.

To give a brief summary, I worked in the IT department, in the databases group, in the Infrastructure and Middleware Services section (IT-DB-IMS). My workload involved a few mini-projects, but the overall goal was essentially to examine and improve some of the IT monitoring infrastructure at CERN.

What is monitoring?

“Monitoring” is a bit of a catch-all term for a number of processes, really. It covers various tasks, including logging, metrics, alerts, notifications, and all of the processes that come with collecting and analysing that data.

The first mini-project involved improving the log monitoring at CERN.

Logging (ELK stack)

My first task involved setting up an ELK (Elasticsearch, Logstash, and Kibana) stack with Puppet. I’d never used Puppet before (only Ansible), so there was quite a lot of reading I needed to do to learn best practices and how to use the language. Puppet has a fairly steep learning curve, but is very rewarding when used properly. On the other hand, when it’s used wrongly? Bad things happen. I found that out the hard way, unfortunately.

Despite the fact that the ELK stack is already used at CERN, this was a little more complicated than it sounds; each ELK component in the CERN RPM repositories was outdated, so I spent quite a lot of time trying to get updated packages into the RPM repos. As it turned out, this was quite a fiddly process that involved building custom RPM packages with the Koji build system.

So, what does the ELK stack do? Well, it’s generally used for storing and analysing logging information for a large set of clients. In short, clients will forward logs to an Elasticsearch server (a NoSQL data store with some very nice text-searching capabilities), where log messages can be analysed and graphed in near-real-time. This is extremely useful for loads of things - for example, graphing usage patterns on servers, diagnosing system errors, and more. I’ve included a screenshot of Kibana below as an example of some of the nice visualisations that it can create. (Bear in mind, this isn’t actually my image - it was taken from elastic.co).
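To give a rough idea of what "forwarding logs to Elasticsearch" looks like as data, here is a minimal Python sketch that wraps a couple of log lines as JSON documents and builds a body in the newline-delimited shape that the Elasticsearch bulk API expects. The host names, the index name "logs-example", the field names and the timestamps are all made up for illustration; in a real ELK deployment Logstash typically handles this step, so this only shows the shape of the data, not what I actually ran at CERN.

```python
import json
from datetime import datetime, timezone


def to_log_document(host, message, timestamp):
    """Wrap a raw log line in a JSON-friendly dict (field names are illustrative)."""
    return {
        "@timestamp": timestamp.isoformat(),
        "host": host,
        "message": message,
    }


def to_bulk_body(index_name, documents):
    """Build a newline-delimited bulk body: one action line, then one
    document line, for every document."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical log entries, purely for illustration.
docs = [
    to_log_document("db-node-01", "connection pool exhausted",
                    datetime(2015, 9, 1, 12, 0, 0, tzinfo=timezone.utc)),
    to_log_document("db-node-02", "service restarted",
                    datetime(2015, 9, 1, 12, 0, 5, tzinfo=timezone.utc)),
]
body = to_bulk_body("logs-example", docs)
print(body)
```

Once documents like these are indexed, the searching and graphing in Kibana work on the individual fields (timestamp, host, message) rather than on raw text files.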

ELK/Kibana dashboard

Metrics (OpenTSDB)

Metric collection involves monitoring system statistics: things like CPU load average, the amount of free memory, and the number of read/write operations on each disk.

OpenTSDB is a fast and efficient time-series database that runs on top of HBase, which in turn stores its data in Hadoop’s HDFS. HDFS is a clustered, distributed filesystem that replicates data across a number of machines to ensure fast and reliable access to data. This means the consequences of machine failures are reduced - if a machine stops working, another machine holding a copy of the same data can serve it instead, making HDFS extremely useful as a high-availability filesystem.
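The replication idea can be shown with a toy model in Python. This is not how HDFS actually places blocks (its real placement policy is more sophisticated); the round-robin placement, the node names and the block names below are assumptions made purely for illustration. The only detail borrowed from HDFS is the default replication factor of 3.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy placement)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


# Hypothetical cluster: four machines, five blocks of data.
nodes = ["node1", "node2", "node3", "node4"]
blocks = ["b0", "b1", "b2", "b3", "b4"]
placement = place_replicas(blocks, nodes)

# With three copies of every block, losing any two machines loses no data.
print(sorted(readable_blocks(placement, {"node1", "node2"})))
```

In this toy model, data only becomes unreadable when every machine holding a given block is down at the same time.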

Anyway, OpenTSDB is great for storing metrics on a long-term basis. It had never been used at CERN before, so my first task was to figure out how to install and configure it. It actually turned out to be rather simple. For the metric collector, I used scollector instead of the tcollector that is usually recommended by OpenTSDB, because tcollector ended up being rather inefficient.

It’s possible to visualise time-series data from right within OpenTSDB, too - it’s rudimentary and not very pretty, but it’s functional and fast. I’ve included a screenshot below. (Again, the image isn’t mine, it’s from opentsdb.net).

The OpenTSDB web interface

Analytics on Oracle WebLogic Logs

One of the other tasks at CERN was to create a number of analytics dashboards for several Oracle WebLogic clusters. WebLogic servers are essentially enterprise application servers from Oracle that run Java applications. Extracting information from the log files generated by WebLogic machines and providing useful metrics on the data proved to be a surprisingly complex task - actually, if you watch the lightning talk below, I spend a few minutes ranting about the design of the logging architecture within the WebLogic software.

Lightning Talk

As part of the placement, it was a requirement that we put together and present a lightning talk covering some of the key elements of our projects.

Despite being nervous, I think it turned out reasonably well (my talk was awarded “Best OpenLab student talk of the week”), so I’ll embed the stream here. It’ll probably give a better idea of what I did than this blog post will :). Apologies in advance for the bad jokes.

You can watch my lightning talk here.

Would I do it again?

Hell, yes. Spending 2 months abroad at CERN was one of the best experiences of my life, and I can genuinely say that it’s changed me. It’s a beautiful area, I got to work on some fascinating projects with exciting tools, I met some fantastic people and did some fantastic things - from a spontaneous sailing trip on Lake Geneva, to lots of barbecues with good friends, to meeting Tim Berners-Lee’s old supervisor over lunch, to exploring the maze of tunnels underneath CERN. The entire experience was a blast, and I’ll remember it for the rest of my life. I hope that I will be able to return one day!