About

James Golick

James Golick is an engineer, entrepreneur, speaker, and above all else, a grinder.

As CTO (or something?) of BitLove, he scaled FetLife.com's traffic by more than an order of magnitude (and counting).

James spends most of his time writing ruby and scala, building infrastructure, and extinguishing fires.

He speaks regularly at conferences and blogs periodically, but James values shipping code over just about anything else.


Why RubyGems Needs Loren Segal

Jun 01 2011

Full disclosure before I get started here. Loren and I are friends. I'd like to think that this blog post is mostly unbiased, but I'll let you come to your own conclusions.

Maintaining a piece of core infrastructure for a growing community is hard. Even if the code isn't especially complex, getting the release management issues right and keeping everybody happy is incredibly challenging.

Sometimes, that means making concessions, like continuing to maintain an API you don't like — or code that gets in the way of a refactoring you want to do to make your own life easier. But that's the challenge of maintaining software that tens of thousands of people depend on every day.

On Legacy Code

Did you know that in the linux kernel project, every commit absolutely must build cleanly and run the tests successfully to be accepted? Kernel hackers go to great lengths to make every patch fit these requirements. It's an enormous pain in the ass for the developers, but it means that when something breaks, git-bisect will find it for them.

Going to that much trouble just to be able to use git-bisect is a huge headache for developers. But it's worth it because the linux kernel is an important project that gets used by millions of people.

The problem with RubyGems isn't that the tests don't pass on every commit, but that APIs have been disappearing too quickly. It's an important enough project, used by enough people, that deprecations should be measured in years, not months.

The current maintainers of RubyGems don't want to live in that kind of world. They want to move quickly to refactor the codebase, deprecating and removing APIs where it suits them. And who could blame them? I wouldn't want to maintain a huge pile of legacy code either. Most programmers wouldn't.

You'd have to be crazy...

Actually, Loren is just about the only guy I know who isn't bothered by this sort of thing in the slightest. In fact, he seems to enjoy maintaining legacy code. It's a sick pleasure, and I know a little something about sick pleasures...

In all seriousness, a guy who cares about release management as deeply as Loren is one in a million. The fact that he's also an extremely talented engineer makes him one in a hundred million. When he told me a couple of months ago that he'd be willing to maintain RubyGems, I gave him an "are you serious?" look. Turns out, he was.

So, here we've got a ridiculously talented programmer who wants to make all of our lives easier by living with a whole bunch of legacy code for us. It really is a no-brainer.

The RubyGems team should bring Loren on board to run the project. He's more than willing to put in the time and effort, and he's the best person I can imagine for the job. The proof is in the pudding: Loren (and team) have put together a fork of RubyGems (1.3.7) that maintains backwards compatibility, and backports all of the performance improvements made since then.

SlimGems

SlimGems is a really great project with a really horrible name. It's an effort to make a RubyGems with a stable API (the one from 1.3.7), a better code base, and faster gem installs. And there are a lot more exciting plans for the future. Check out Loren's blog post for more info.

More important than any of that, though, is how helpful and friendly Loren is when it comes to bug reports and pull requests. He did it with YARD, and now he's doing it with SlimGems. I've never heard anybody give him anything but rave reviews.

Until Loren is an official RubyGems maintainer, I'll be running SlimGems. I moved our 30 or so servers at work over too. If that sounds interesting to you, 'gem install slimgems' is all it takes. Oh, and if you want to revert to your original RG install, just 'gem uninstall slimgems'. That's just how we roll.

If you have any trouble or feedback, jump into #slimgems on freenode, or open a GitHub issue, and we'll be happy to help you out.

The Future

Forks are good for communities. They're a great place for new ideas to be proven (think rails/merb). SlimGems has already demonstrated that its goals are possible. They've already achieved many of them. I'm ready to see the code and teams merged. Until then, do yourself a favour and 'gem install slimgems'.


VERIFY_NONE

Feb 15 2011

A while back, it came to my attention that ruby's net/https implementation had an insecure default: not verifying TLS certificates (OpenSSL::SSL::VERIFY_NONE). I wrote an article about it for RubyInside, and helped @geemus fix the issue in his excon gem. Despite this being an incredibly serious security issue, nobody really seemed to care. Oh well.

Then today, one of the biggest names in the ruby community, Aaron Patterson (aka @tenderlove), posted a gist of a little campfire bot that he wrote that forced net/https into this insecure mode.

Yes, a campfire bot is relatively unimportant security-wise (except that if there's a man-in-the-middle, he now has credentials to access your campfire room, which may or may not contain company secrets — but I digress). Eventually I remarked that despite the relative unimportance of a campfire bot, tenderlove is a leader in the ruby community, and leaders should set good examples.

A few other people also responded. @joedamato posted an admittedly less constructive response. And Ben Black posted a somewhat snarky, but not particularly harsh, suggestion. That's when the hate started pouring in.

Here's the thing: this is a very serious security issue, and nearly every rubygem that uses net/http is guilty of it (yes, even active_merchant, the thing that everybody uses to interact with payment gateways). Why? Because of the prevalence of copy and paste coding. Yes, I do it, and so do you.

And nearly every net/https example uses VERIFY_NONE. It's so common in example code that in the related links on the RubyInside article about the perils of VERIFY_NONE, there's a link to example code that uses it (lol?).
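For contrast, here's a rough sketch of what a copy-and-paste-safe example could look like. The helper name verified_https_client is made up for illustration; the Net::HTTP and OpenSSL calls come from ruby's standard library. It assumes the system's default CA bundle is the set of certificates you want to trust, which may not hold on every machine.

```ruby
require "net/http"
require "openssl"

# Hypothetical helper (not from any library): builds a Net::HTTP client
# that refuses to talk to a server whose certificate can't be verified.
def verified_https_client(host, port = 443)
  http = Net::HTTP.new(host, port)
  http.use_ssl = true
  # The important line: VERIFY_PEER instead of VERIFY_NONE.
  http.verify_mode = OpenSSL::SSL::VERIFY_PEER

  # Assumption: trust the system's default CA certificates.
  store = OpenSSL::X509::Store.new
  store.set_default_paths
  http.cert_store = store
  http
end

http = verified_https_client("example.com")
# No connection is opened until a request is made, e.g. http.get("/")
```

With this setup, a request to a server presenting an untrusted certificate should fail with an OpenSSL::SSL::SSLError rather than silently succeeding.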

Aaron is one of a small group of people in the ruby community who actually has the power to do something about this problem. By setting the right example, people will copy and paste good code instead of bad code. That's more useful than a million tweets or blog posts.

Yes, this may all seem trivial to you. It's just a hack, after all. Obviously, Aaron wasn't being deliberately insecure. He was just hacking, which is perfectly fine. But, we all know that hacks have a way of ending up in production.

It probably won't be tenderlove's app; it might be some noob who found and modified his code. But sometimes one man's hack winds up (however indirectly) being another man's business.


We are experiencing too much load. Let's add a new server.

Oct 27 2010

Taking a look through antirez's redis sharding slides tonight, one of the bullet points really jumped out at me. From slide #11:

We are experiencing too much load. Let's add a new server.

There's this idea floating around that we can scale out our data services "just in time". Proponents of cloud computing frequently tout this as an advantage of such a platform. Got a load spike? No problem, just spin up a few new instances to handle the demand. It's a great sounding story, but sadly, things don't quite work that way.

First of all, we don't have any information about the "load". From the statement, we can assume that we're running low on some kind of resource. But, what kind of resource? Network resources, disk resources, CPU resources, memory resources? All of the above?

Moving data onto a new machine is a resource-intensive process. At the very least, if the data is stored entirely in memory, replicating data to a new machine and splitting a partition is going to require CPU and network resources (and probably some memory). If the data is stored on hard drives, bootstrapping a new node will consume network, disk, CPU and memory resources. So, the process of adding capacity is going to add load before it can relieve any.

If we're maxed out on one of the resources we need to add capacity, attempting to spin up a new node is only going to make the situation worse. If you're maxing out your disk array's IOPS capacity, for example, and bootstrapping a new node requires reading data from that disk array, you're in trouble. If you need memory to add a node, and you're out of memory, you're in trouble. Let me say it again: adding a node to a storage cluster is not free.
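To put rough numbers on that, here's a back-of-the-envelope sketch. Every figure in it is made up for illustration, and the Resource struct is a hypothetical helper, not part of any tool. The idea is just to compare what bootstrapping a node would cost against the headroom each resource has left.

```ruby
# All names and numbers below are hypothetical, for illustration only.
Resource = Struct.new(:name, :capacity, :in_use, :bootstrap_cost) do
  # Capacity left over after serving current traffic.
  def headroom
    capacity - in_use
  end

  # True when bootstrapping a new node fits inside the remaining headroom.
  def can_absorb_bootstrap?
    bootstrap_cost <= headroom
  end
end

resources = [
  Resource.new("disk IOPS",    10_000, 9_800, 1_500),
  Resource.new("network Mbps",  1_000,   400,   300),
  Resource.new("memory GB",        64,    48,     4),
]

# Resources that adding a node right now would push past capacity.
blockers = resources.reject(&:can_absorb_bootstrap?).map(&:name)

if blockers.empty?
  puts "safe to add a node"
else
  puts "adding a node would exhaust: #{blockers.join(', ')}"
end
```

In this made-up scenario the disk array has only 200 IOPS to spare and the bootstrap needs 1,500, so the new node would make things worse even though network and memory look fine.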

If you have enough data and traffic to make a clustered database relevant, capacity has to be planned carefully. You have to use your system and application metrics to develop an understanding of your usage patterns. You may not always be able to predict when traffic spikes will occur, but in general, it's possible to know in advance roughly how large they might be.

You can focus on writing your app and let Mongo focus on scaling it. - Kristina Chodorow

If you take at face value the marketing materials of many NoSQL database vendors, you'd think that with a horizontally scalable data store, operations engineering simply isn't necessary. Recent high profile outages suggest otherwise.

How is this issue triggered? ... Systems are at or over capacity.

MongoDB, Redis-cluster (if and when it ships), Cassandra, Riak, Voldemort, and friends are tools that may be able to help you scale your data storage to varying degrees. Compared to sharding a relational database by hand, using a partitioned data store may even reduce operations costs at scale. But fundamentally, no software can use system resources that aren't there.


Want to work on a tiny team that makes an enormous impact?

Sep 06 2010

A year and a half ago, I met a guy at a conference who had a website that was getting pretty popular. At the time, he was doing it all: design, code, and management. After hanging out quite a bit, we both knew that working together was in our future. A few weeks later, I came on board as CTO.

That website was fetlife.com, the most popular social network in the kinky and BDSM community. We have hundreds of millions of pageviews, half a million members, millions of pictures, and tens of thousands of groups. And the numbers are only getting bigger.

Incidentally, I have given a couple of talks about some of the challenges we've faced scaling FetLife. One such talk, at GoRuCo 2010, was recorded; the video is below.

Unlike most other sites of our size, we're not VC-funded and our engineering team is tiny. Like, really tiny. John designs. Neko handles our email operations. I write code and do the rest of the ops stuff. That's it.

With FetLife's accelerating growth, scaling has become a full-time job. So, lately, the product hasn't been getting its much needed love and attention. That's why we're looking for an amazing person to join our team.

We titled the job "Rails Guy (Or Gal) with a Design Eye". That's a joking way of saying that we're looking for somebody who can kick ass with rails, but also has an eye for design. They don't have to be a designer (although that'd be awesome!). But since the product will be their focus, it's extremely important that this person can iterate over designs with John, push him to make them better, and then be able to wire things up correctly.

I can't think of many other companies where you could be part of such a small group with such a big impact. If this all sounds like something that might interest you, have a look at the job posting, then if it still sounds cool, send us an email at jointhefamily+jgc@fetlife.com. If you're on the fence or have any questions at all, please don't hesitate to drop me a line at james@fetlife.com.

Now back to our (ir)regularly scheduled programming!


Introducing rollout: Conditionally roll out features with redis

Aug 01 2010

When we work on new features, we like to push them to production regularly. We've found that long-lived branches tend to introduce more defects than short-lived ones. And as useful as staging can be, it's hard to beat seeing and tweaking new features on the real, production site and infrastructure.

When we're ready to alpha the feature, we'll roll it out to staff. For beta, we might roll it out to some specific friends or people who request access. Then, when it's time to go live, we'll roll it out to a percentage of people at a time to make sure that any remaining performance issues are caught without bringing down the entire application.

If we do find a problem, we need to be able to disable the feature in real-time.

We do all of this using a tool we put together called rollout. It allows us to roll out features to specific users, to pre-defined groups, to a percentage of users, or to any number of combinations of those options. It uses redis to store all of the configuration, so we can easily manipulate rollouts in real-time.

How it works

gem install rollout

I like to assign an instance of Rollout to a global variable.

require "redis"
require "rollout"
$redis = Redis.new
$rollout = Rollout.new($redis)

I can check whether a user has access to a feature like this:

$rollout.active?(:chat, User.first) # => true/false

Let's say I want to roll out a chat feature. I'd wrap any chat-related code in:

if $rollout.active?(:chat, @current_user)
  # chat-related code
end

The simplest way to start rolling out our chat feature is by giving access to a single user:

$rollout.activate_user(:chat, User.find_by_nickname("jamesgolick"))
$rollout.active?(:chat, User.find_by_nickname("jamesgolick")) # => true

When alpha testing, it's convenient to be able to provide access to whole groups of users (staff, for example) at once. We define several groups when we initialize Rollout.

$rollout.define_group(:caretakers) do |user|
  user.caretaker?
end

To provide access to a group:

$rollout.activate_group(:chat, :caretakers)

When it's time to go live, we can slowly ramp up access:

$rollout.activate_percentage(:chat, 10)

Performance issue? Bug? Remove everybody's access while you retool:

$rollout.deactivate_all(:chat)

More fine-grained deactivation controls exist. See the README for more details.
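If you're curious how the three kinds of checks fit together, here's a toy, in-memory sketch you can run without redis. To be clear, TinyRollout is a hypothetical stand-in and not the gem's actual implementation: it keeps state in plain hashes, and it assumes the percentage check is a simple user.id % 100 comparison. The User struct is also made up for the example.

```ruby
require "set"

# Hypothetical in-memory stand-in, NOT the rollout gem's implementation.
class TinyRollout
  def initialize
    @groups      = {}                                  # group name => predicate block
    @user_ids    = Hash.new { |h, k| h[k] = Set.new }  # feature => user ids
    @group_names = Hash.new { |h, k| h[k] = Set.new }  # feature => group names
    @percentages = Hash.new(0)                         # feature => 0..100
  end

  def define_group(name, &block)
    @groups[name] = block
  end

  def activate_user(feature, user)
    @user_ids[feature] << user.id
  end

  def activate_group(feature, group)
    @group_names[feature] << group
  end

  def activate_percentage(feature, percentage)
    @percentages[feature] = percentage
  end

  def deactivate_all(feature)
    @user_ids.delete(feature)
    @group_names.delete(feature)
    @percentages.delete(feature)
  end

  # A user is active if any one of the three rules matches.
  def active?(feature, user)
    @user_ids[feature].include?(user.id) ||
      @group_names[feature].any? { |g| @groups.fetch(g).call(user) } ||
      user.id % 100 < @percentages[feature] # assumed bucketing scheme
  end
end

# Made-up user model for the example.
User = Struct.new(:id, :caretaker) do
  def caretaker?
    caretaker
  end
end

rollout = TinyRollout.new
rollout.define_group(:caretakers) { |user| user.caretaker? }

staff  = User.new(150, true)
member = User.new(205, false)

rollout.activate_group(:chat, :caretakers)
rollout.active?(:chat, staff)   # => true
rollout.active?(:chat, member)  # => false

rollout.activate_percentage(:chat, 10)
rollout.active?(:chat, member)  # => true (205 % 100 is 5, which is < 10)

rollout.deactivate_all(:chat)
rollout.active?(:chat, staff)   # => false
```

Bucketing by user id keeps each user's answer stable between requests, so someone who sees the feature at 10% still sees it at 20%.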

Get it!

gem install rollout

The code is on github.

