About

James Golick

James Golick's software experience ranges from artificial intelligence to web front-end and JavaScript development. Most recently, James has fallen back in love with web development thanks to Ruby on Rails.

Since discovering Rails, James has become a prolific contributor to its open source ecosystem. He is the author of several popular plug-ins and gems, and a contributor to countless others, including the framework itself.

James is an advocate for well-written, well-tested code and he blogs regularly about the practice of developing software. He speaks regularly at software development conferences and user groups. James is a partner in Nine Lives, Inc.

Introducing rollout: Conditionally roll out features with redis

Aug 01 2010

When we work on new features, we like to push them to production regularly. We've found that long-lived branches tend to introduce more defects than short-lived ones. And as useful as staging can be, it's hard to beat seeing and tweaking new features on the real, production site and infrastructure.

When we're ready to alpha the feature, we'll roll it out to staff. For beta, we might roll it out to some specific friends or people who request access. Then, when it's time to go live, we'll roll it out to a percentage of people at a time to make sure that any remaining performance issues are caught without bringing down the entire application.

If we do find a problem, we need to be able to disable the feature in real-time.

We do all of this using a tool we put together called rollout. It allows us to roll out features to specific users, to pre-defined groups, to a percentage of users, or to any number of combinations of those options. It uses redis to store all of the configuration, so we can easily manipulate rollouts in real-time.

How it works

gem install rollout

I like to assign an instance of Rollout to a global variable.

$redis = Redis.new
$rollout = Rollout.new($redis)

I can check whether a user has access to a feature like this:

$rollout.active?(:chat, User.first) # => true/false

Let's say I want to roll out a chat feature. I'd wrap any chat-related code in:

if $rollout.active?(:chat, @current_user)
  # chat-related code
end

The simplest way to start rolling out our chat feature is by giving access to a single user:

$rollout.activate_user(:chat, User.find_by_nickname("jamesgolick"))
$rollout.active?(:chat, User.find_by_nickname("jamesgolick")) # => true

When alpha testing, it's convenient to be able to provide access to whole groups of users (staff, for example) at once. We define several groups when we initialize Rollout.

$rollout.define_group(:caretakers) do |user|
  user.caretaker?
end

To provide access to a group:

$rollout.activate_group(:chat, :caretakers)

When it's time to go live, we can slowly ramp up access:

$rollout.activate_percentage(:chat, 10)

Performance issue? Bug? Remove everybody's access while you retool:

$rollout.deactivate_all(:chat)

More fine-grained deactivation controls exist. See the README for more details.
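
To make the mechanics above a little more concrete, here is a minimal sketch of how a check like active? could combine individual users, groups, and a percentage. This is not rollout's actual implementation: the TinyRollout and DemoUser names, the in-memory Hash standing in for redis, and the "user id modulo 100" percentage rule are all assumptions made for illustration.

```ruby
require "set"

# Hypothetical, in-memory stand-in for a feature-rollout tool.
# NOT the rollout gem's code: a Hash replaces redis, and the percentage
# rule (user id modulo 100) is an assumption for this sketch.
class TinyRollout
  def initialize
    @groups   = {} # group name => predicate block
    @features = Hash.new do |hash, name|
      hash[name] = { users: Set.new, groups: Set.new, percentage: 0 }
    end
  end

  def define_group(name, &block)
    @groups[name] = block
  end

  def activate_user(feature, user)
    @features[feature][:users] << user.id
  end

  def activate_group(feature, group)
    @features[feature][:groups] << group
  end

  def activate_percentage(feature, percentage)
    @features[feature][:percentage] = percentage
  end

  # Forget everything about the feature, so nobody has access.
  def deactivate_all(feature)
    @features.delete(feature)
  end

  # A user is active if any one of the three rules matches.
  def active?(feature, user)
    return false unless @features.key?(feature)

    config = @features[feature]
    config[:users].include?(user.id) ||
      config[:groups].any? { |group| @groups.key?(group) && @groups[group].call(user) } ||
      user.id % 100 < config[:percentage]
  end
end

# Hypothetical user object with just enough behaviour for the example.
DemoUser = Struct.new(:id, :caretaker) do
  def caretaker?
    caretaker
  end
end

rollout = TinyRollout.new
rollout.define_group(:caretakers) { |user| user.caretaker? }
rollout.activate_group(:chat, :caretakers)
rollout.activate_percentage(:chat, 10)

staff = DemoUser.new(250, true)  # in the caretakers group
early = DemoUser.new(105, false) # 105 % 100 = 5, inside the 10% bucket
later = DemoUser.new(150, false) # 150 % 100 = 50, outside it

rollout.active?(:chat, staff) # => true
rollout.active?(:chat, early) # => true
rollout.active?(:chat, later) # => false
```

Because every rule lives in one shared store, changing a percentage or calling deactivate_all takes effect on the very next active? check, which is the property that makes real-time rollbacks possible.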

Get it!

gem install rollout

The code is on github.


Two Weeks With Cassandra

May 03 2010

Note: I wrote this article a month ago, but decided it was too boring to post. Since then, various people have told me they were interested in reading it, so here it is. When I have a chance, I'll post an update.

In an attempt to sidestep the inevitable flames, I'm going to omit from this article the reasons we selected cassandra, and just dive right in to our experiences.

After evaluating locally, we needed to do some testing under real load. It's usually most effective to do this with actual production traffic — simulations are never accurate enough.

So, after implementing a little scala client (our app is a mix of ruby and scala), we ordered some hardware. (We're now using Coda Hale's cassie instead. Don't use what I wrote: it sucks.) We decided to run with 2 nodes and a replication factor of 2. We will likely add a third if and when we start using cassandra for non-transient data. For the time being, all reads and writes are being performed at a consistency level of ONE.

Here are the specs of the machines:

  • 2 x E5520 (Nehalem, 4 Cores, 8M Cache, 2.26 GHz)
  • 24GB Memory
  • 2 x 500GB SATA Drives
  • Gigabit Ethernet

About two weeks ago, we started writing to the cluster alongside the datastore that was serving production requests at the time. This was shortly before we were trending cassandra metrics, but I believe they were peaking out at about 120 w/sec per node. Since then, we've peaked out around 300 w/sec per node.

There is absolutely no noticeable load on the machines.

We did have a few problems with nodes hanging, but that seems to have been a configuration issue on my part. The problem disappeared after we moved cassandra out of runit.

After about 5 days of double-writing, we started experimenting with cutting over some reads to cassandra. Our queries consist of getting a slice (count = 20) of a CF which contains UUIDs which we multiget from another CF. What we found when we started reading was that the first query was plenty fast, but the multiget was quite slow (35+ms).

After some discussion on the mailing list, we cranked up the size of the row cache. That brought the multiget time down to around 20ms, which was a big improvement, but still far beyond acceptable performance for our application.

The folks on the mailing list were saying the numbers sounded high though, so I poked around a little more. I ran a profile of the multiget and found that virtually all of the time was being spent in thrift. I quickly implemented multiget in my scala client and found that the same query was taking around 4ms. Much better.

We wound up deciding to write through to memcached so that this particular query could be mostly (99.9%) satisfied by cache. That brought the query time down to about 6ms which is plenty fast. The moral of the story is that it can be slow to access cassandra from ruby.
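
For readers unfamiliar with the pattern, here is a rough sketch of what writing through to a cache looks like. It is not our production code: the WriteThroughStore class is hypothetical, and plain Hashes stand in for both memcached and the cassandra column family, so no real client API is shown.

```ruby
# Hypothetical write-through cache. Plain Hashes stand in for memcached and
# for the cassandra column family; neither real client API is used here.
class WriteThroughStore
  attr_reader :backend_reads

  def initialize(cache: {}, backend: {})
    @cache         = cache
    @backend       = backend
    @backend_reads = 0 # counts how often a read had to fall through
  end

  # Write to the backing store, then to the cache, so that a later read
  # of this key can be answered from cache.
  def write(key, value)
    @backend[key] = value
    @cache[key]   = value
  end

  # Fetch many keys at once, touching the backend only for cache misses.
  def multiget(keys)
    found   = {}
    missing = []

    keys.each do |key|
      if @cache.key?(key)
        found[key] = @cache[key]
      else
        missing << key
      end
    end

    missing.each do |key|
      next unless @backend.key?(key)

      @backend_reads += 1
      found[key] = @cache[key] = @backend[key] # populate cache on a miss
    end

    found
  end
end

store = WriteThroughStore.new(backend: { "old" => "written before the cache existed" })
store.write("a", "first")
store.write("b", "second")

store.multiget(%w[a b old]) # "a" and "b" come from cache, "old" from the backend
store.backend_reads         # => 1
```

Only data written before the cache was introduced (or evicted since) ever reaches the slower path, which is how the vast majority of reads can end up satisfied by cache.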

Fortunately, in a future release of cassandra (it's already in trunk), apache avro will be supported as an alternative to thrift. I know the twitter guys are planning to do some work on making that really fast from ruby. So, hopefully things will be better soon.

Aside from those two issues, running cassandra has been an absolute pleasure. Our cassandra cluster is serving 100% of production requests, and the CPU usage is still hovering around 0. Let's hope the follow-up post is as boring as this one.

Update: A lot of people have been asking about the munin plugins we use to trend cassandra metrics. So, I pulled them all in to a git repo.


What does "scalable database" mean?

Mar 30 2010

All over the interwebs, people are equating NoSQL with scalability. Yesterday, I wrote a blog post explaining that most NoSQL dbs aren't, in fact, scalable. The general consensus in the comments seemed to be that my article was missing substantiation: that without explaining why the dbs weren't scalable, I wasn't adding anything to the discussion. So, what does “scalable database” mean?

There are two kinds of scalability: vertical and horizontal. Vertical scaling is just adding more capacity to a single machine. Virtually every database product is vertically scalable to the extent that they can make good use of more CPU cores[1], RAM, and disk space. With a horizontally scalable system, it's possible to add capacity by adding more machines. By far, most database products are not horizontally scalable.

But, people have been scaling products like MySQL for years, so how'd they do it?

Vertically is one way. 37signals has written about the mammoth (128GB of RAM and 8 x 15,000 RPM SAS drives) machines they run their dbs on — and that was over a year ago. I bet they've at least doubled the capacity in those machines by now.

It's also common to see RDBMS-backed applications run one or more read-slaves. With master-slave replication, it's possible to scale reads horizontally. But, there's a trade-off there. Since most (all?) replication systems are asynchronous, reads from slaves may be somewhat stale. It's not uncommon for replication lag times to be measured in minutes or hours in the event of network partitioning (for example). So, read-sharding trades some consistency (the “C” in ACID) for additional read capacity[2].

When an application needs more write capacity than they can get out of a single machine, they're forced to partition (shard) their data across multiple database servers. This is how companies like facebook and twitter have scaled their MySQL installations to massive proportions. This is the closest you can get to horizontal scalability with most database products.

Sharding is a client-side affair — that is, the database server doesn't do it for you. In this kind of environment, when you access data, the data access layer uses consistent hashing to determine which machine in the cluster a specific piece of data should be written to (or read from). Adding capacity to (or alleviating “hot spots” from) a sharded system is a process of manually rebalancing the data across the cluster. So, while it's possible to add capacity to a mysql-backed application by adding machines, mysql itself is not horizontally scalable.
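
As a rough illustration of the client-side lookup described above, here is a small consistent-hash ring. It is a sketch under assumptions, not any particular company's data access layer: the HashRing class, the server names, and the choice of 100 points per server are all made up for the example.

```ruby
require "digest"

# Hypothetical client-side shard picker built on a consistent-hash ring.
# Server names and the number of points per server are made up.
class HashRing
  def initialize(servers, points_per_server: 100)
    @points_per_server = points_per_server
    @ring = {} # ring position => server name
    servers.each { |server| add(server) }
  end

  # Place several points for the server around the ring.
  def add(server)
    @points_per_server.times do |i|
      @ring[position("#{server}:#{i}")] = server
    end
    @sorted = @ring.keys.sort
  end

  # Walk clockwise from the key's position to the first server point,
  # wrapping around to the start of the ring if necessary.
  def server_for(key)
    pos   = position(key)
    index = @sorted.bsearch_index { |point| point >= pos } || 0
    @ring[@sorted[index]]
  end

  private

  # Map a string to a 32-bit position on the ring.
  def position(string)
    Digest::MD5.hexdigest(string)[0, 8].to_i(16)
  end
end

keys   = (1..1000).map { |n| "user:#{n}" }
ring   = HashRing.new(%w[db1 db2 db3])
before = keys.map { |key| [key, ring.server_for(key)] }.to_h

ring.add("db4")
moved = keys.reject { |key| ring.server_for(key) == before[key] }
moved.all? { |key| ring.server_for(key) == "db4" } # => true
```

The ring only answers the question of where a key should live. After db4 is added, the keys that now map to it still have to be copied over by hand (or by tooling you write yourself), which is exactly the manual rebalancing described above.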

On the other hand, products like cassandra, riak, and voldemort automatically partition your data. It's possible to add capacity to these systems by simply turning on a new machine and starting the service (actually, I only know this for certain about cassandra, but I'm reasonably certain it's true for the others). The database system itself takes care of rebalancing the data and ensuring that it is sufficiently replicated across the cluster. This is what it means for a database to be horizontally scalable[3].

  • [1] Many databases are not capable of making use of additional CPU cores. Many NoSQL dbs, such as redis, are event-driven and run in a single thread.
  • [2] For the sake of completeness, I am aware of some tools (such as mysql-mmm) that monitor replication lag, and remove overly lagged shards from read rotation to help maintain consistency.
  • [3] Truly distributed dbs like cassandra also provide other goodies like per-query consistency and availability settings. But, that's beyond the scope of this article. For more on that (and many other things on this topic), the dynamo paper is a good place to start.

Most NoSQL DBs Are Not Scalable

Mar 29 2010

Every blog post in this (stupid) debate seems to equate NoSQL with scalability. You either do or don't have to be google to use NoSQL depending on who is talking. It's time to stop doing that.

Here are the well-known NoSQL dbs that are not scalable (at least, not differently than MySQL):

  • MongoDB (Some form of horizontal scalability is planned but not yet production ready. By far, most users of mongo are not using it for its scalability promises.)
  • CouchDB
  • Redis
  • Tokyo cabinet
  • MemcacheDB
  • Berkeley DB

Here are the well-known NoSQL dbs that are scalable:

  • Cassandra
  • Riak
  • Voldemort

At this point, I'd say that couch, mongo, redis, and cassandra are by far the most well-known products in the space. Of those solutions, only one is scalable. So, people must be using NoSQL for other reasons.

In fact, nearly every one of the databases I listed above has its own specific reason somebody might want to use it. Redis has a rich set of data structures; it can be used as a queue server, for example. Mongo is document oriented, so schema evolutions are very natural, you don't really need an ORM, and very rich queries are supported against collections (among other things). Couch shares some of the advantages of mongo, plus it has very nice master-master replication support (not doing it justice, but that's not the point). Cassandra is horizontally scalable. I could go on, but I think I've made my point.

People use NoSQL for all kinds of different reasons. Of all of the people using NoSQL dbs, most of them are probably not motivated by scalability. So, let's stop saying NoSQL when we mean scalable database. It only obscures the discussion.

Update: The consensus in the comments here seems to be that this article is missing substantiation. So, I followed up with another article titled: What does “scalable database” mean?


Opinionated Modular Code

Mar 28 2010

When you start writing modular code, and using techniques like dependency injection, you end up with a lot of pieces, but not necessarily an obvious whole. Working with java libraries can be an exercise in instantiating huge towers of dependencies to finally get to the object you actually need.

  val socket    = new TSocket(host, port)
  val protocol  = new TBinaryProtocol(socket)
  val client    = new TCassandra.Client(protocol)
  val cassandra = new Cassandra(keyspace, client)

My scala wrapper around cassandra's thrift bindings depends on an instance of TCassandra.Client, which depends on an instance of TProtocol (of which TBinaryProtocol is a subclass), which depends on an instance of TTransport (of which TSocket is a subclass).

Thrift is necessarily modular. The user of a thrift service might wish to use a different TProtocol implementation or a non-blocking socket (TNonblockingSocket). Still, though, 4 lines of setup code just to get a simple cassandra client is cumbersome.

Contrasted with tightly coupled code, this appears to be a major drawback of composability. With coupled code, if you need an object, you just instantiate it and it creates — or refers directly to — all of the collaborators it needs. It's easy to imagine a tightly coupled version of my wrapper that instantiates hard dependencies.

class Cassandra(host: String, port: Int) {
  val socket   = new TSocket(host, port)
  val protocol = new TBinaryProtocol(socket)
  val client   = new TCassandra.Client(protocol)
}

In a relatively high percentage of use cases, this set of defaults is perfectly acceptable. This kind of opinionated code is really easy to get started with, but can become problematic later on when its user decides she really does need that non-blocking socket. So, what to do?

It turns out it's possible to have it both ways. This is going to sound so simple that it's silly… but just create a second constructor that invokes the first one.

class Cassandra(client: TCassandra.Client) {
  def this(host: String, port: Int) = this(
    new TCassandra.Client(
      new TBinaryProtocol(
        new TSocket(host, port)
      )
    )
  )
}

This lowers the bar considerably to getting started with my Cassandra client. If somebody just wants to give it a try or whip up a quick and dirty program, they can do it easily — likely without even reading any documentation. But as the user's needs become more complex, I haven't shut them out of customizing the client to their heart's content.

Of course, as your object models become increasingly complex, there will be some additional effort required to maintain these auxiliary constructors. And arguably, you will create some degree of coupling between the classes. But it's mostly harmless. Provided it's possible to supply alternate dependencies, you have accomplished modularity. We're just adding sane defaults.

In ruby, I use default arguments to accomplish the same goal.

def initialize(klass, storage_factory = StorageFactory.new, table_creator = TableCreator.new)
  @klass           = klass
  @storage_factory = storage_factory
  @table_creator   = table_creator
end

This is an example from friendly's code base. The StorageProxy needs a StorageFactory and a TableCreator, so I create them if they aren't supplied.

This is all possible because the StorageFactory and TableCreator's default dependencies are set in exactly the same way. I think the only place I ever supply alternates is when I inject test doubles in StorageProxy's specs.
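
Here is a self-contained sketch of that idea, including what supplying an alternate looks like. The class names echo the example above, but the method bodies, the setup method, and the RecordingTableCreator double are invented for illustration and are not friendly's actual code.

```ruby
# Hypothetical classes illustrating default-argument injection. The names
# echo the example above; the bodies are invented for this sketch.
class StorageFactory
  def create(klass)
    "storage for #{klass}"
  end
end

class TableCreator
  def create(klass)
    "table for #{klass}"
  end
end

class StorageProxy
  # Collaborators default to sensible implementations but can be replaced.
  def initialize(klass, storage_factory = StorageFactory.new, table_creator = TableCreator.new)
    @klass           = klass
    @storage_factory = storage_factory
    @table_creator   = table_creator
  end

  def setup
    [@table_creator.create(@klass), @storage_factory.create(@klass)]
  end
end

# A hand-rolled test double that records what it was asked to create.
class RecordingTableCreator
  attr_reader :created

  def initialize
    @created = []
  end

  def create(klass)
    @created << klass
    "fake table"
  end
end

# Convenient: no collaborators to wire up.
StorageProxy.new(String).setup
# => ["table for String", "storage for String"]

# Modular: swap in a double where a spec needs one.
double = RecordingTableCreator.new
StorageProxy.new(String, StorageFactory.new, double).setup
# => ["fake table", "storage for String"]
double.created # => [String]
```

The one-argument call is what most callers will ever write; the three-argument form exists for the rare caller (usually a spec) that needs to reach in.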

This makes for an object model that is really easy to work with, yet still highly modular. It's quick to get started with, but doesn't get in your way when you need to reach in and change some shit around. Making your classes both modular and convenient gets you the best of both worlds.