
PuppetCamp 2010

I’m off to PuppetCamp this week, October 7th and 8th at the City Club of San Francisco. Hopefully there will be some interesting conversations with others in charge of non-trivial, multi-site, multi-environment puppet systems.  At PowerReviews, we rely heavily upon puppet for automation and I’m hoping to pick up a few new tricks.

We just went through a big upgrade to version 2.5.5 and we are already looking at 2.6.1. The REST interface is pretty exciting and the stored configs are already simplifying the configuration of the distributed applications that need to know about the other hosts at their layer within the stack. Stored configs have almost automated our Nagios configuration.  We are getting close!  In earlier versions of puppet, global variables were used to hack a config together and I’m glad to see that they will be going away.

I have spent a fair amount of time working on cloud configurations lately and comparing the cloud offerings from AWS and Rackspace. Using puppet with a cloud platform seems to be bleeding edge at the moment. There are certainly blog posts that go back 18 months but they are a bit trivial and seem to be more proof of concept instead of examples of architecture at scale. I’m hoping to find others with real world puppet in the cloud experience. How do other shops set this up? There seem to be several ways of doing it – each with their associated pitfalls…


There are no shortcuts

Once upon a time, I was playing golf at Chimney Rock in Napa.  The course has been replaced by a vineyard now, but I remember playing a round there with my brothers.  I'm not a very good golfer: I have both a hook and a slice.  One of the holes had a sharp dogleg left around another hole.  I shanked my drive so badly that I landed on the fairway of the adjacent hole.  But my ball had taken a shortcut through the dogleg and I had a good look at the pin.  (We didn't recognize the out of bounds markers during our rounds.)

Anyway I was squaring up for my shot when the course marshal rolled up in his cart.  He pulled within 20 feet or so and paused for a moment.  He didn’t get out of the cart.  It was just the two of us since my brothers had been lucky enough to keep their balls on the green.  I eyed him warily not sure what was going to happen.

“Son, there are no shortcuts in golf.”

After delivering this pearl, his golf cart roared to life and he was gone.  I pondered his advice as I took my next shot.  And the next.  And the next.  And so on.  It took me a while to get onto the green.

Anyway, I remembered this story today while I was trying to build a custom RPM for ruby.  You would think this would be pretty easy and maybe even irrelevant since ruby is pretty popular these days and Linux is a pretty mainstream platform too.  I thought for sure someone else would have already done the hard work.  Alas, no.

We are building some new OS images for use in templating virtual machine builds in the cloud.  All of our machines will be configured using puppet and capistrano because each of these tools makes system administration easier.  And easy is a requirement if we are going to start proliferating cloud hosts.  Puppet and capistrano are built with ruby, so we need ruby.  For some arcane reason involving ruby's openssl libs, puppet would not work with our vintage ruby-1.8.7-p174 rpm, so it was time to build some new hotness.

I can build and install from the tarball just fine.

tar xvfz ruby-1.8.7-p302.tar.gz
cd ruby-1.8.7-p302
./configure --prefix=/usr
make
make install

But I have this thing about installing all of my system software via the native package manager for the platform.  I really like to use RPM on Red Hat; otherwise dependency resolution is impossible.  I found that the tarball install worked great, but my puppet recipes used yum to install several gems and other ruby-ness that require a base ruby install.  And since RPM didn't know about the tarball install, I needed to install these RPMs with --nodeps.  Unfortunately, puppet doesn't use that flag.  Instead it would try to install dependencies, including the old vintage ruby-1.8.5 from the vanilla EL5 distribution.  Installing this RPM breaks the host, and our puppeteering and capistrano magic each require ruby features that are only available in ruby versions greater than 1.8.6.  The whole point of installing ruby is to make puppet and capistrano function.

Welcome to dependency hell.  This is what sysadmins do.  They solve annoying little situations like this.  I haven’t solved this particular one yet as the .spec file format for rolling your own RPM is not for the faint of heart.  But I will…
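For what it's worth, the skeleton of a ruby .spec is less scary than its reputation. This is an untested sketch with the file lists and dependencies abbreviated; the Provides: line is the piece that matters for the yum dependency mess described above, and the exact ruby(abi) value is my assumption:

```spec
# untested sketch of a minimal ruby.spec -- file lists and deps abbreviated
Name:           ruby
Version:        1.8.7
Release:        p302
Summary:        Object-oriented scripting language
License:        Ruby License
Source0:        ruby-%{version}-p302.tar.gz
# lets yum-installed rubygem packages resolve their ruby dependency
# against this custom build (assumed abi string)
Provides:       ruby(abi) = 1.8

%description
Custom build of ruby 1.8.7-p302.

%prep
%setup -q -n ruby-%{version}-p302

%build
./configure --prefix=/usr
make

%install
make install DESTDIR=$RPM_BUILD_ROOT

%files
/usr/bin/*
/usr/lib/ruby/
```

With a spec like this in place, `rpmbuild -ba ruby.spec` produces a package that satisfies the dependency chain instead of dragging in the distro's 1.8.5.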


Postgresql 8.4.3 upgrade troubles

I recently upgraded several of our postgres DB servers to the latest version, 8.4.3.  The process was pretty simple for the 8.3.x clusters.

Make a copy of the postgres configuration files.

cp /var/lib/pgsql/data/*.conf /tmp/.

pg_dump each DB instance in plaintext format:

pg_dump -Fp -f ./dbname.dmp.sql dbname

Stop postgres:

pg_ctl stop

Configure the new pgdg yum repository & install updated rpms.  (I used puppet for this – so I’m not going to expand on that step here…)

Initialize the postgres cluster:

service postgresql initdb

Make a backup of the new 8.4.3 configuration files.

cd /var/lib/pgsql/data
for f in *.conf; do cp $f $f.843.orig; done

Restore the 8.3 configuration files:

mv /tmp/*.conf /var/lib/pgsql/data/

Start postgres:

pg_ctl start
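The config-file shuffle in the steps above is easy to get wrong, so here is the same sequence sketched end to end against scratch directories (stand-ins for /var/lib/pgsql/data and /tmp):

```shell
# sketch of the config backup/restore around the upgrade,
# using scratch dirs in place of the real paths
PGDATA=$(mktemp -d)    # stands in for /var/lib/pgsql/data
SAVE=$(mktemp -d)      # stands in for /tmp

# pretend these are the 8.3 configs
echo "max_connections = 200" > "$PGDATA/postgresql.conf"
echo "host all all 0.0.0.0/0 md5" > "$PGDATA/pg_hba.conf"

# 1. save the old configs before touching anything
cp "$PGDATA"/*.conf "$SAVE"/

# ... pg_dump, stop postgres, upgrade rpms, initdb happen here ...
echo "max_connections = 100" > "$PGDATA/postgresql.conf"   # initdb wrote fresh defaults

# 2. keep a copy of the freshly generated 8.4.3 configs
for f in "$PGDATA"/*.conf; do cp "$f" "$f.843.orig"; done

# 3. restore the 8.3 configs over the new defaults
cp "$SAVE"/*.conf "$PGDATA"/

grep max_connections "$PGDATA/postgresql.conf"   # prints the restored 8.3 value
```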

At this point, I thought I had everything dialed in and I was good to go.  Alas, I was wrong and the server wouldn’t start.  The pgstartup.log had a message complaining about IPv6.  I spent a few minutes on that and discovered that IPv6 was already disabled in the kernel and that this was a red herring.  Nothing in /var/log/messages.  No trail of bread crumbs.  What was I missing?

I diffed the conf files, comparing version 8.3.x with 8.4.3, and discovered that pg_hba.conf was the problem.  The first thing I noticed was that the default authentication method used by postgres has been switched from trust to ident.  Trust had always seemed a bit too insecure to me, so I suppose this was a good change, but I was surprised to see ident.  I haven't done anything with ident since I had a crash course in sendmail administration back in the 90s.  I had to read the bat book to fix some open relay issues.  I learned more about sendmail and the M4 macro processor than I ever really wanted to know.  But I digress.  Needless to say, I didn't want to use ident for authentication on my postgres server either.  In fact, I use MD5 for authentication.  And that is when I realized the source of my problem.

When I upgraded to 8.4.3 I forgot to backup and restore my users and roles.  My 8.3 pg_hba.conf specified MD5 authentication for all users… but I no longer had any users, roles, or MD5 passwords installed on the cluster.  Luckily I had a backup on another host so this was a quick fix.  What I should have done was run this before blowing away the old 8.3 cluster:

pg_dumpall -g > globals.sql

So what should we take away from all this?  Well a couple of things.

  1. Make sure you have backups.  They save you when you do something stupid.
  2. Postgres startup error messages are non-existent for this scenario.  This is a silent failure, which is too bad because I am probably not the only one to have encountered this problem.  Perhaps a future release will write a log message or an entry in pgstartup.log for this condition.

Database Schema Migration Tool

I have been working on database schema migration tools lately. We have 25+ database instances spread across 7 servers and sometimes it can get difficult to track which SQL changes got run against each of the databases (especially in the development environments).

First I surveyed some of the existing tools.  I looked at Rails Migrations, c5-db-migration, and several other tools.

I already use a nifty nagios check that handles schema diffs.  Unfortunately, it only reports on differences and doesn't apply changes.

Some other tools that seem to overlap the c5 functionality:

Post Facto (from pgcon2009 – in alpha so I didn’t want to use it yet)

There is a good presentation that details the pros and cons of current db schema migration tools.
Running through it provides a quick overview of why multi-environment schema management is hard.

After reading documentation for all these tools, downloading some of them and even configuring them and trying them out, I had a much better idea of what I wanted.  Now I could draft requirements:

  1. Postgres support (don’t need to support other DBs)
  2. Must support running .sql files (don’t want to model the DB structure in another ORM tool)
  3. Integrate with SVN and store DDL in the repository
  4. Perform all schema updates in a single command (a la rails migrations)
  5. Low overhead
  6. Simple

The two top contenders were Rails Migrations and the C5 tool.

We liked the C5 tool since we had a few of the authors in to write some Java for us and they did good work.  Unfortunately the tool requires Java, Maven, Maven plugins, etc and since I am not too familiar with that environment the learning curve was too steep.  The developers sure did like it though.

As far as I could tell, Rails Migrations came along with a certain mindset about the database.  Rails Migrations seemed to be tightly integrated with ActiveRecord and to encourage its use as an ORM tool.  A “pure” Rails migration does not use SQL but rather the ActiveRecord models.  There were numerous patches available for running SQL statements or .sql files, but they didn't seem to be mainstream.  I got the feeling that using Rails Migrations in this way would be some kind of perversion and that ultimately I would find the process painful.  This was a bummer since I was hoping to do some more ruby now that I have been hacking on puppet for a while.

So… I decided to write my own tool.  What did I pick?  Bash.  Yes.  It's the environment I know best, and while I have written code in many different languages over the years, it has inevitably been my bash scripting that has survived for years in production environments.  It's portable, it's the lowest common denominator, and I'm lazy.  I certainly know that bash is not ideal.  It's slow.  OO is nearly impossible.  Arrays (and non-trivial data structures) are difficult.  The syntax is obscure.  Etc…

Anyway, I banged it out: 250 lines of shell scripting, and it works great so far.  There are a few prerequisites to get it to work:

  • .pgpass setup for password-free login for the user running these migrations (I used postgres)
  • setup for password-free login to the svn repository (.shosts, authorized_keys, hosts.equiv, whatever…)
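
For the first prerequisite, a ~/.pgpass file (mode 0600) does the trick.  The host and password below are placeholders:

```
# ~/.pgpass -- format is hostname:port:database:username:password
# chmod 600 ~/.pgpass or libpq will refuse to read it
localhost:5432:*:postgres:yourpassword
```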


#!/bin/bash
#$Id: 23349 2009-10-27 21:14:53Z kevin $

ProgName=$(basename $0)

# site-specific settings -- placeholders, adjust for your environment
WIPdir=${WIPdir:-/tmp/${ProgName}.work}    # scratch dir for exported DDL
WIPFile=${WIPdir}/wip.sql                  # assembled SQL to execute
svnrepo=${svnrepo:-"svn+ssh://yoursvnhost/repos/db/ddl"}   # placeholder URL
who=$(id -un)                              # user recorded in schema_version

setup () {
  if [ ! -d "${WIPdir}" ]; then
	mkdir -p "${WIPdir}"
  fi
  cd "${WIPdir}"
}

usage () {
  echo '$Id: 23349 2009-10-27 21:14:53Z kevin $'
  echo "  Applies database schema changes."
  echo ""
  echo " $(basename $0) -d <dbname> (-t) -f <ddl_file>"
  echo " $(basename $0) -d <dbname> -g"
  echo " $(basename $0) -d <dbname> -c"
  echo " $(basename $0) -d <dbname> -a"
  echo " $(basename $0) -d <dbname> -r <revision_id>"
  echo ""
  echo "Options:"
  echo " -d db name"
  echo ""
  echo "Run a single update:"
  echo " -f flag specifies file with DDL (i.e. SQL)."
  echo "   filename format includes revision_id for the change."
  echo "   ** this format for filename: YYYYMMDDHHMMSS_some_description.sql (use Military Time for HH field ex: 13:00 vs 1:00)"
  echo "   ** revision_id is subset of filename field: YYYYMMDDHHMMSS (first 14 digits)"
  echo ""
  echo "Get the DB version:"
  echo " -g gets most recent schema revision id from db"
  echo ""
  echo "List new DB schema revisions available:"
  echo " -c check svn repo for new schema revisions"
  echo ""
  echo "Create an empty DB for testing:"
  echo " -t flag is optional testing flag which creates an empty db w/ the latest production schema"
  echo ""
  echo "Apply all migrations:"
  echo " -a apply all available migrations in svn repo to target db (like Rails Migrations)"
  echo ""
  echo "Examples:"
  echo "    $(basename $0) -d kevindb -f /tmp/20091021030723_my_new_table.sql"
  echo "    $(basename $0) -d kevindb -r 20091017143100"
  echo "    $(basename $0) -d kevindb -c"
  echo "    $(basename $0) -d kevindb -t"
  echo "    $(basename $0) -d kevindb -g"
}

cleanup () {
  rm -rf $WIPdir
}

get_db_revision_id () {
  if [[ -z "$db" ]]; then
	echo "abort. db not set."
	exit 71
  fi
  db_revision_id=$(psql -A -t -d "${db}" -c 'SELECT max(revision_id) FROM schema_version;')
  echo "${db}" current revision_id = "${db_revision_id}"
}

check_revision () {
  # sets $apply to yes when revision $1 is newer than the db's revision $2
  if [[ "${1}" -gt "${2}" ]]; then
	apply="yes"
	#echo "${1} is new."
  else
	apply="no"
	#echo "${1} already applied."
  fi
}

list_ddl () {
  svn ls "${svnrepo}" || {
	echo "error checking svn repo: ${svnrepo}"
	exit 88
  }
}

export_ddl () {
  svn export "${svnrepo}/${1}"
}

extract_revision_id () {
	echo "${1}" | egrep -e '(^[0-9]{14}.*.sql)' | cut -c1-14
}

get_svn_ci_mesg () {
	svn log --limit 1 "${svnrepo}/${1}" | egrep -v '(--------)' | tail -1
}

find_new_revisions () {
  get_db_revision_id
  echo "Unapplied schema changes from ${svnrepo} listed below:"
  list_ddl |
  while read line; do
	local rev_id=$(echo "${line}" | egrep -e '(^[0-9]{14}.*.sql)' | cut -c1-14)
	check_revision "${rev_id}" "${db_revision_id}"
	if [ "${apply}" = "yes" ]; then
		echo "${line}"
	fi
  done
}

# match filenames w/ 14 digit prefix & .sql suffix
validate_naming_convention () {
  basename "${ddl_file}" | egrep -e '(^[0-9]{14}.*.sql)' > /dev/null
  if [[ $? = 0 ]]; then
	revision_id=$(basename "${ddl_file}" | egrep -e '(^[0-9]{14}.*.sql)' | cut -c1-14)
  else
	echo "invalid ddl_file name: ${ddl_file}"
	exit 55
  fi
}

build_ddl () {
	cat $ddl_file > $WIPFile || {
       		echo "error: building ddlfile"
       		exit 45
	}
	echo "$insertsql" >> $WIPFile || {
       		echo "error: building ddlfile"
       		exit 46
	}
}

run_sql () {
  who="$1"; msg="$2"; db="$3"; sqlfile="$4"

  # verify all required variables are set or abort
  if [[ -z "$who" ]]; then
	echo "abort. who not set."
	exit 2
  fi
  if [[ -z "$msg" ]]; then
	echo "abort. msg not set."
	exit 3
  fi
  if [[ -z "$db" ]]; then
	echo "abort. db not set."
	exit 4
  fi
  if [ ! -r "$sqlfile" -a ! -w "$sqlfile" ]; then
	echo "abort. sqlfile $sqlfile issues."
	exit 5
  fi

  validate_naming_convention   # sets $revision_id from the filename

  insertsql="INSERT INTO schema_version (revision_id, date, who, msg) VALUES ('$revision_id', now(), '$who', '$msg');"

  build_ddl "${insertsql}" "${ddl_file}" "${WIPFile}"

  psql -A -e -t -1 -f ${WIPFile} -d ${db} || {
      	echo "error: executing sql in $sqlfile"
   	exit 7
  }
  rm $WIPFile
}

create_schema_only_db () {
  # support for test mode
    if [[ -z "$db" ]]; then
	echo "abort. db not set."
	exit 20
    fi
    cd $WIPdir
    svn export svn+ssh://          # production schema export (repo path elided)
    psql -A -e -t -c "CREATE DATABASE ${db};" || {
       	echo "error: creating db"
       	exit 1
    }
    psql -A -e -t -f $WIPdir/production.schema.sql -d ${db} || {
       	echo "error: creating schema"
       	exit 1
    }
    rm $WIPdir/production.schema.sql
}


setup

while getopts 'd:f:r:agcth' OPT; do
   case $OPT in
	d)	db="$OPTARG";;
	g)	get_db_revision_id
		exit 0;;
	c)	find_new_revisions
		exit 0;;
	f)	full_path_ddl_file="$OPTARG"
		ddl_file=$(basename $full_path_ddl_file)
		cd $WIPdir
		export_ddl "${ddl_file}"
		msg=$(get_svn_ci_mesg "${ddl_file}")
		run_sql "${who}" "${msg}" "${db}" "${ddl_file}" || {
			echo "something bad has happened"
			exit 30
		}
		rm "${ddl_file}";;
	a)	find_new_revisions | egrep -v '(current|Unapplied)' |
		while read unapplied_ddl_file; do
			echo "recursive call: $0 -d ${db} -f ${unapplied_ddl_file}"
			$0 -d "${db}" -f "${unapplied_ddl_file}"
		done
		exit 0;;
	r)	urev_id="$OPTARG"
		find_new_revisions | egrep -v '(current|Unapplied)' |
		while read unapplied_ddl_file; do
		  file_rev_id=$(extract_revision_id "${unapplied_ddl_file}")
 	      	  if [[ "${file_rev_id}" -le "${urev_id}" ]]; then
		    echo "recursive call: $0 -d ${db} -f ${unapplied_ddl_file}"
		    $0 -d "${db}" -f "${unapplied_ddl_file}"
		  fi
		done
		exit 0;;
	t)	create_schema_only_db;;
	h)	usage
		exit 0;;
   esac
done
shift $(($OPTIND - 1))

exit 0

YOUR FIRST SCHEMA CHANGE – Define the schema_version table

Now you need a SQL file to initialize your DB with a table to hold your audit log of schema changes. Create a file named: YYYYMMDDHHMMSS_schema_version.sql with the content below and check it into your SVN repo.

CREATE TABLE schema_version (
    revision_id bigint not null PRIMARY KEY,
    date timestamp not null,
    who text not null,
    msg text not null
);
Now you are ready to track all of your future updates across any of the DB instances.
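
A tiny helper (hypothetical, not part of the tool above) makes it easy to mint migration files that satisfy the naming convention:

```shell
# mint a migration filename with the required 14-digit revision_id prefix
new_migration () {
  # $1 = short description, e.g. "schema_version"
  echo "$(date +%Y%m%d%H%M%S)_${1}.sql"
}

fname=$(new_migration schema_version)
echo "$fname"
# sanity check: matches the tool's naming convention
echo "$fname" | grep -qE '^[0-9]{14}.*\.sql$' && echo "valid name"
```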

Tagged ,

SQL Hacks

Most of my SQL is usually something along the lines of:

SELECT count(*) FROM some_table WHERE status = 'active';
But recently I have been working on a report that has been causing some trouble. The report issues a large SQL query for each of our partners and then a fancy perl script filters, sorts, aggregates, and formats the data on the client. This report now runs for over 20 hours per day. This is a problem. But just by reviewing the algorithm used for report generation it seems like there is some opportunity for improvement.

I have been working to convert this report into SQL so Postgres can handle the sorting, filtering, formatting, et al. It seemed like this strategy might run much faster than having all the logic done in Perl on a remote client. It’s too early to tell since I haven’t finished my port from Perl to SQL yet, but I did learn a few things along the way.

Temporary Tables
Temporary tables are awesome. Using a statement of the form:

CREATE TEMPORARY TABLE my_temp_table AS
SELECT ... ;
You can save the results of a complicated SELECT statement and then use it later on for future JOIN fun. This way you can accomplish a couple of nice things.

  1. Logically separate the SQL so the next poor sap who has to debug this monster has any chance at all to understand the basics of what is going on.
  2. Temporary tables can be created early during a session for use as a lookup by subsequent queries. This can save the DB some work since it only has to generate the lookup table once instead of n times.

Crosstab & Pivot
The format of the report I need to generate got a bit hairy once I compared it to our DB schema. I found that I needed to translate rows into columns. This is a PITA, but I RTFM and discovered the tablefunc module from postgres-contrib. The tablefunc module includes a handy crosstab function that translates rows into columns. This sounded perfect for my needs, but alas it didn’t work for me. You see, the crosstab function is very particular about the number of columns it returns. (You have to specify the exact set of output columns in advance.) Unfortunately, for my dataset the number of rows and columns was not constant. It varied. This was demoralizing.

But I discovered a workaround. Before the tablefunc module existed, lots of clever folks figured out how to hack up a crosstab using ANSI-SQL, typically by wrapping CASE expressions in aggregates.

So I stopped using the fancy new crosstab built-in and implemented my solution in ANSI-SQL.
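
For the curious, the hack looks roughly like this (table and column names are made up for illustration): each output column is an aggregate wrapped around a CASE expression, grouped by the row key:

```sql
-- pivot sales rows (partner, month, amount) into one row per partner
SELECT partner,
       sum(CASE WHEN month = 1 THEN amount ELSE 0 END) AS jan,
       sum(CASE WHEN month = 2 THEN amount ELSE 0 END) AS feb,
       sum(CASE WHEN month = 3 THEN amount ELSE 0 END) AS mar
FROM sales
GROUP BY partner;
```

Unlike crosstab, adding or removing an output column is just another line of SQL, which matters when the column set varies.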

I hope this frankenstein performs better than our current Perl-based solution.


Books I Have Been Reading

I just finished The Wisdom of Crowds by James Surowiecki and How to Lie with Statistics by Darrell Huff. The first book was a very accessible read on the subject of behavioral finance, which I have always found fascinating. (Basically, prices in the stock market don’t necessarily reflect the true value of any particular security. Rather, they reflect the value that I think other people will put on the security. And their values are in turn influenced by the value that they think others will have for it. Very fractal.)

How to Lie with Statistics is a classic that kept coming up on bibliographies and was cited by various authors enough times that I finally picked up a copy from the local library. The copy I found was dated and there were several passages that were misogynistic and racist. Despite those problems it was entertaining and reminded me of reading Cialdini during Palmer’s classes at UC Davis. I have always been wary of statistics that I didn’t calculate myself. (Some of this is my DIY bent.) But in most cases statistics are not presented properly and it is impossible to determine how the metrics were gathered. Huff does a great job of showing you how to twist numbers to your own advantage. It is all written tongue in cheek but not quite. I can swear that I have been in meetings where people have purposefully used methods like these to paint a flattering picture of failure.

As they say in the recording industry, “You can’t polish a turd.”

This morning I started Smart Mobs by Howard Rheingold. The prose seems a bit breathless in the early bits, but I have high hopes since I enjoyed some of Howard’s other books.

Export to Stone Tablet

We have been trying to come up with a reliable backup solution for our laptops. Our users connect via a mix of wi-fi and cabled connections and they move around a lot, often attending multiple meetings per day. A CDP (Continuous Data Protection) product would be perfect for our needs, but the products I have found are either much too expensive for a startup-stage business or they do not support Macs. We have a few folks on our sysadmin team who have years of enterprise backup experience and who could put together a killer system using either Legato or Netbackup, but both of those are cost-prohibitive. The small business solutions seem to be targeted at the windows platform and we just don’t have an interest in trying to get up to speed on Windows at this point. Our core competency is UNIX in general and Linux in particular, so our backup solution needs to be on that platform too.

So at this point we use Zmanda for Linux and the windows laptops. The feature set is way behind the enterprise vendors and it doesn’t support CDP. But it is UNIX based and you can script it. So we will see what happens… If it becomes too painful we will bite the bullet and pay the 50% premium to license Netbackup or Legato.

For the Macs we give everyone an external hard drive, make sure they have upgraded to Leopard and tell them to use Time Machine regularly. Since the video cards on the macbook pros have started to die in large numbers we have had ample opportunity to restore laptops from Time Machine. So far it has been easy and trouble free. Apple got this one right. Too bad we can’t find anything like it for the other platforms.

Sometimes I just wish we had an option to export data to something reliable like stone tablets…


Londiste process dead this morning

I came in this morning and found that the londiste replay job that replicates into our disaster recovery database had silently failed. The process was still running. Replication lag was over 11 hours at that point so that implies that something strange occurred at 10 pm last night. The postgres logs for the DR database were completely empty after the midnight rotation which is very unusual since we log events at the “mod” level so we can see all the row level replication activity in real-time if we need to.

I tried to gracefully stop the replay job but it had no effect so I had to kill it. Restarting the job did the trick and replication lag immediately started to come down.

My nagios check didn’t fire off an alert, which was a bit puzzling until I realized that the monitoring system is running an older version of the replication lag check script that only measures a single londiste job. Now that we have multiple replay jobs, it needs to be updated.

I already have a new version of the nagios plugin that handles lag measurements for multiple replicas tested and ready to go. Now I guess I need to actually put it into production.
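
The threshold logic in the updated plugin is simple. Here is a hypothetical sketch: the real check pulls lag numbers from londiste/pgq, but given "job|lag_seconds" lines the rest is just:

```shell
# hypothetical nagios-style check: read "job|lag_seconds" lines on stdin,
# go CRITICAL if any replay job lags past the threshold (in seconds)
check_lag () {
  threshold=$1
  awk -F'|' -v t="$threshold" '
    $2 > t { bad = 1; printf "CRITICAL: %s lag %ss\n", $1, $2 }
    END    { if (!bad) print "OK: all replicas within threshold"; exit bad }'
}

# one healthy replica, one 11 hours behind
status=0
printf 'dr_replica|12\nreporting_replica|39600\n' | check_lag 300 || status=$?
echo "check exit status: $status"
```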

I wish I could figure out what caused londiste to fail last night. There was nothing in any of the system logs and no trail of breadcrumbs to follow for further investigation. Londiste has been pretty trouble free since we started using it in January. I’m not sure why we are starting to see issues now…


Londiste Memory Usage

During our last major software release we made a pretty massive schema update to the primary postgres database. We pulled several columns out of one of our core tables and moved them into a separate table. We added triggers, dropped tables, added columns, etc. After all the dust had settled, one of the lead developers noticed that our londiste replay processes were taking up 7.6GB of resident memory (per top output). Needless to say this was troubling since we had two of these processes running and only 32GB of RAM in the server.

I did find some info suggesting that a massive number of events submitted to the event queue in a relatively small number of ticks would increase CPU and memory usage. So I updated my .ini file with pgq_lazy_fetch and reloaded the londiste config.
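
For reference, the knob lives in the replay job's .ini file. Something along these lines, with made-up job and queue names:

```ini
; hypothetical londiste job config -- names are placeholders
[londiste]
job_name = dr_replica
provider_db = dbname=proddb
subscriber_db = dbname=drdb
pgq_queue_name = replication_queue
; fetch events in batches of this size instead of a whole batch at once
pgq_lazy_fetch = 500
```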

No change in memory usage. Bummer.

Luckily we still had our regular batch processing disabled while we sorted out the changes for the software release, so replication lag was <5 seconds on my replicas. So I stopped the replay processes and restarted them.

This freed up 15GB of memory. Sweet. Since this was the first time we left replication running during a software release, it was also the first time we encountered this issue. Now that we have lazy_fetch enabled I don’t expect to see it again.


Reinventing the Wheel

Last month we reinvented postgres table partitioning. Last year we reinvented cacti. We probably have our own string class.

There is so much knowledge to assimilate that it is practically impossible for a software engineering team NOT to reinvent the wheel. They cannot possibly know their entire subject domain inside and out. Also, engineers cannot resist building stuff. They would rather write code than search the Internet for some open source library or standard template. Besides, that library was probably implemented by some junior woodchuck somewhere and is buggy to boot. Let’s write our own.

Reinventing the wheel is an unfortunate inevitability. But try hard not to. Use someone else’s wheel. It’s already round!