While investigating our MongoDB lock ratio I asked around to see what lock percentages were common among people I know who use MongoDB. I discovered that despite having setups similar to ours, they didn't know how to get the lock percentage from Graphite.
Using Diamond you can easily get all of the MongoDB server status metrics into Graphite, but the globalLock.ratio is a bit misleading in that it is based on the total uptime of Mongo, which could be a while, and not on recent usage patterns. And in 2.2 it disappears anyway!
The metrics that do help, though, are globalLock.totalTime and globalLock.lockTime, which can be used to find the lock ratio/percentage over whatever sampling period you use. The percentage winds up being scale(divideSeries(derivative(servers.MONGOHOSTNAME.MongoDBCollector.globalLock.lockTime),derivative(servers.MONGOHOSTNAME.MongoDBCollector.globalLock.totalTime)),100).
You can remove the scale function to get the ratio. This doesn't work with globbing in Graphite though, since divideSeries won't accept multiple series as the divisor. You can instead scale the lockTime alone to get a globbable lock ratio for all of your Mongo servers; the exact factor will depend on the sampling period.
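Since Diamond collects on a fixed interval, derivative(totalTime) is effectively a constant, so it can be folded into scale. As a sketch, assuming a 10-second collection interval (totalTime advances roughly 10,000,000 microseconds per sample, and 100 / 10,000,000 = 0.00001):

{% highlight text %}
scale(derivative(servers.*.MongoDBCollector.globalLock.lockTime), 0.00001)
{% endhighlight %}

Adjust the constant for your own collection interval.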
I released a package, SQL-to-Graphite, that aims to make it easy to save the results of SQL queries into Graphite.
We use this and similar scripts (I'm going to move over to using this) at work in order to collect global metrics about our systems. I typically count any table that has a status column and track the average/max age of any records that should be updated periodically.
I made this package once I hit the second repository where I would have to write a script to do this. It should be compatible with any database supported by SQLAlchemy.
After installing it (pip install sql-to-graphite) you can run the sql-to-graphite command with a file like the following (a hypothetical sketch; see the project README for the exact format):
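{% highlight yaml %}
# Hypothetical config sketch; the real format is in the sql-to-graphite README.
url: postgresql://user:password@localhost/app
queries:
  - path: app.orders.pending.count
    sql: SELECT COUNT(*) FROM orders WHERE status = 'pending'
{% endhighlight %}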
And start getting metrics into Graphite!
I noticed recently that my site didn't have the feed icon anymore. I'm not sure when I lost it, but to add it back I just added the link field that lets browsers know where my Atom feed is. Simple enough to add <link type="application/atom+xml" rel="alternate" href="/atom.xml"/> to the head of the page.
A recent task of mine was to add some metric collection to a Rails application at SeatGeek. One of the main components (and critical if there is a problem) is the set of Resque background workers. There is actually a Resque plugin that will collect stats; the original is abandoned, but a maintained fork exists. Sadly the gem itself is not maintained, so I forked the maintained repo in order to provide a stable source. I use the commit hash to make sure I get the exact version, but if the repository we used disappeared that would cause problems, so a fork solves that.

My fork doesn't change much except for some of the paths used for the metrics. At some point I may clean up the README and package my first gem.
Categories in Jekyll have annoyed me for a while because of the URLs generated. The path would be something like /tag1/tag2/year/month/day/title, which works so long as you don't change the categories used. Since tags are also an option and don't have the same issue, I've switched. I followed this post about tagging archive pages in Jekyll, which made it rather painless.
I just pushed a Python package, Klout-to-Graphite, that will easily allow you to graph your Klout score within Graphite.
This started a few minutes after lunch at SeatGeek, where we were checking various Klout scores. Since I tend to graph... everything... I quickly set up a cron script to start collecting the metrics for Graphite.
To run it, something like the following (a hypothetical invocation; the exact command name and arguments are in the project README):
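{% highlight bash %}
# Hypothetical invocation; check the README for the exact
# command name and any required arguments (API key, etc).
klout-to-graphite
{% endhighlight %}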
Ideally this is run in cron; we use a 30-minute interval. Over the course of 2 weeks there have already been a few rank changes and large jumps due to adding new social networks to Klout.
A recent metric I've started paying attention to is the duration of the health check for services behind HAProxy. This is reported in the admin interface's CSV and can easily be added to your metric systems. This is what a few nodes started doing yesterday:
This service usually hits the 50ms range for health checks, although it started getting much worse. The service is written in Tornado but makes a few blocking calls. Non-blocking IO should allow the health checks to respond very quickly, since in this case the check returns a static response.
The root cause of the problem is that calls to MongoDB in a particular handler were taking longer than before, which holds back the other handlers since the call is currently a blocking operation. If a health check passes a threshold, HAProxy will remove the node from the pool, a good precaution, although in our case it can cause flickering if MongoDB takes longer than expected.
I did receive alerts thanks to per-service health check alerting with Graphite Pager.
We are using Diamond at SeatGeek, which easily collects metrics from HAProxy. Check duration is (by default) stored at servers.HAPROXY-SERVER.haproxy.BACKEND.HOST-SERVER.check_duration. The metric we alert on is the moving median for each server, regardless of the HAProxy server: aliasByNode(movingMedian(groupByNode(servers.*.haproxy.*.*.check_duration,3,"averageSeries"),10),0).
I've started working on a project that makes it easy to send alerts from Graphite. We had this problem previously at AWeber and used Nagios (not that easy), which wasn't a great experience. Given that PagerDuty can handle the notification part (Yay APIs!), all that was left was reading Graphite's rawData output and triggering an alert.
Right now I'm testing it at SeatGeek, running it on Heroku; the example of how to set up Graphite Pager on Heroku is small and straightforward. It has already helped detect a few problems before our other monitoring tools did, and (eventually) it can alert on actual business metrics!
The alert format looks something like this (a sketch from memory; check the Graphite Pager README for the exact fields):
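{% highlight yaml %}
# Sketch from memory; field names may differ, check the README.
alerts:
  - name: HAProxy check duration
    target: aliasByNode(movingMedian(groupByNode(servers.*.haproxy.*.*.check_duration,3,"averageSeries"),10),0)
    warning: 100
    critical: 500
{% endhighlight %}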
Pretty simple. It supports globbing, with unique alerts for each metric. Graphite Pager can't detect a host disappearing from the glob, maybe in the future, but it will set the alert for all metrics returned.
Friday August 17th was my last day at AWeber. I've accepted a position at SeatGeek as a Web Engineer.
AWeber was a great opportunity that turned me from a programmer into a developer. I wish everyone there the best of luck.
Now that I've left, I'm upset that most of what I worked on there was not open source, so I will no longer be able to use the things that I've built. Towards the end of my time there I was trying to put more projects on Github that could be useful to someone. Hopefully at SeatGeek I will have the chance to make some larger contributions!
During Philly Tech Week I gave a talk about Kanban. This was my first time speaking professionally, and I have since volunteered to give a talk at Philly Coders.
The official blurb for the Kanban talk:
When faced with the challenges of managing a growing email marketing software and 40-person development team, AWeber turned to the project management system Kanban. In this presentation with Ethan McCreadie and Philip Cristiano, they shared AWeber's journey into Kanban, how it functions within AWeber's team structure, and the advantages and disadvantages other companies should take into consideration before implementing the system.
If you want someone to wear your shirt (or promote your company through any free merch) then you might as well spend a little more and give away something great! In the case of shirts, I am a huge fan of the companies that give away American Apparel shirts instead of something crappy.
Companies that I've seen give away American Apparel shirts include:
Even when the company is a competitor to my employer (like MailChimp), I'm likely to wear the shirt.
Shirts from companies that at least make a comfortable shirt without crappy graphics may still get worn after I'm through the American Apparel shirts.
Like in software, even if you are giving away something for free, you should focus on quality.
I've switched over to using Twitter's Bootstrap library. It took about 30 minutes to set up, but I'll have to look around for the best way to set the active tab in the navbar.
I have spent a significant amount of time recently looking into monitoring and metrics collection. At work we have Nagios and Cacti currently and are looking at other options. After setting up Ganglia we decided to give Graphite a try. There is a script to send data from Ganglia to Graphite although the whole system gets to be more complex than I'd like. The chain winds up being: monitored server -> Ganglia -> Ganglia parser -> Graphite.
Looking at the Graphite Tools page I learned about Diamond, which can collect metrics using a Python agent and send them directly to Graphite. It has some very useful collectors already and, the big benefit, makes it very easy to add a new collector!
In about an hour I was able to make a collector for MongoDB and submit a pull request. The changeset was accepted, making it my first contribution to an existing open-source project! I plan on creating more collectors for software that we are using at work. Next up is likely to be RabbitMQ.
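To give a sense of why it is so easy, a collector is just a subclass with a collect method. A minimal sketch of the pattern (illustrative names, not the actual MongoDB collector):

{% highlight python %}
# Minimal Diamond collector sketch; names are illustrative.
import diamond.collector


class ExampleCollector(diamond.collector.Collector):

    def collect(self):
        # Gather a value from whatever you are monitoring,
        # then publish it under a metric name.
        value = 42
        self.publish('example.value', value)
{% endhighlight %}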
And if you haven't, try Graphite!
I've switched back to Jekyll! This time it's hosted on my Linode account. After checking the analytics for my domain while it was hosted with Flavors.me, I realized it wasn't worth $20 a year for the few hits it was getting when I was already paying for this server. It did serve as a good place for recruiters to contact me; we'll see if that still happens here.
I've kept my VIM/dot files on Github for a while, but I recently spent some time updating my .vimrc file.
One of the things that bugged me on OSX is that it ships with VIM 7.2, which doesn't have ColorColumn support. I like highlighting the 80th column in Python. The code to do this is:
{% highlight vim %}
" ColorColumn arrived in VIM 7.3, i.e. version 703
if version >= 703
  autocmd FileType python set cc=80
  hi ColorColumn ctermbg=darkgrey guibg=darkgrey
endif
{% endhighlight %}
Another change was to add Gundo support. This adds a window to navigate your undo tree, an incredibly useful feature.
Since I heard about the addition of branch coverage tracking for Coverage I've wanted to give it a try. Originally it required a beta release which somehow I never got working.
Once it was in a normal release I somehow forgot about it. There is still no command-line argument to turn it on when using Nose, but you can use the .coveragerc file to enable it.
In .coveragerc simply put:
{% highlight bash %}
[run]
branch = True
{% endhighlight %}
And next time you run Nose with coverage you'll have branch coverage too! I was finally reminded about this when coming across the Test Coverage Analysis post by Kai Lautaportti.
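If coverage isn't already wired up, an invocation like this picks up the .coveragerc automatically (assuming the standard Nose coverage plugin; substitute your own package name):

{% highlight bash %}
# coverage reads .coveragerc from the working directory
nosetests --with-coverage --cover-package=yourpackage
{% endhighlight %}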
I've added categories to posts, although you can't really see them on each page due to an annoying quirk in Jekyll where you can't get the category for a post unless you are listing all of the posts. Github also doesn't seem to load the plugin properly...
A few weeks ago at work we improved our code review process by using Git more effectively. Previously a code review happened after the topic branch was merged into master. This obviously was not very effective, as changes could have broken master without a proper review, and there was less incentive to perform a careful review since the code was "working" already. This was carried over from when we were using SVN, until we realized we were no longer forced to work in the dark ages.
Since we were already using Git we could easily change our workflow for a better review process. Once a topic branch is ready for review we push a remote branch. Our remote branches take the form review/{initials}/{topic}.
To push a new remote branch:
{% highlight bash %}
git push origin {branch name}:review/{initials}/{branch name}
{% endhighlight %}
And then we would move the Kanban card to the review column and find someone to review our changes.
When the code has been reviewed and any necessary changes made the reviewer will merge them into master.
{% highlight bash %}
git checkout master
git merge --no-ff --no-commit review/{initials}/{branch name}
git commit -s
{% endhighlight %}
The merge command turns off fast-forwarding and auto-committing so that when we commit with -s we can sign off on the changes. This shows who reviewed the code. We don't permit anyone to push their own changes without review, although Git doesn't prevent you from changing the committer or the sign-off.
Then finally remove the remote branch by pushing an empty branch over it and delete your local copy of the branch.
{% highlight bash %}
git push origin :review/{initials}/{branch name}
git branch -d review/{initials}/{branch name}
{% endhighlight %}
This release fixes a bug caused by linking all plug service instances to the installed plug. Runit uses a ./supervise directory inside the service's directory to maintain state, which would be clobbered when multiple services linked to the same plug.
Now the virtualenv is copied to /srv/plug/plug_instances/ and linked into runit.
There is also a fix for uninstalling plugs leaving orphaned processes: Plug now stops the processes before removing them.
Release v0.1.1 adds an uninstall command to Plug that takes a --plug= option and removes the virtualenv and all runit links.
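Something like this (a hypothetical invocation; the exact command name may differ):

{% highlight bash %}
# Hypothetical invocation; check the Plug docs for the command name.
plug uninstall --plug=example-service
{% endhighlight %}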
You can get Plug on PyPi and try it out. As always, report any issues.