This weekend on Hacker News I stumbled across a great video showing how Facebook pushes new code live.
It’s the most advanced push deployment system I have seen.
The video is excellent. If you are into software, looking for advanced build and deploy techniques, or just looking for ammunition to improve your own build and deploy process, watch this video.
Here are some of the highlights.
Culture
We operate at ludicrous speed and massive scale.
Tools are not going to solve your problem.
It’s about culture.
No big fat layer of QA, managers and adult supervision.
500 core engineers.
As a developer you will shepherd your changes out from the time you check them into trunk, to the time you release them out to your mom.
There is no army of people who are going to vet it and check it.
You are accountable.
Subversion and git are used for version control.
UI tests are Watir and Selenium.
Oncall duties are serious. When you are on call, you are the guy.
Branching
Generally don’t branch. All work is done off trunk. You work. You check in. Bam. You’re done.
Release branches are cut once a week.
Your change can go out with the weekly deploy, or you can bump it up into the daily deploy.
Testing
Everyone tests at Facebook all the time.
Anyone can open a bug – there is a Facebook group internally anyone can go to to see the latest bugs, and open any new ones they find.
Everything is automated. Here are a few of the tools Facebook uses internally to do pushes.
IRC bots
These bots are there to tell you the state of your push.
Don’t bug a deployment engineer asking where your code is.
Ask the bot.
When your push is going live, the bot will ping you and ask you if you are here.
You are to respond, and let support know that you are here to help if needed.
You are on standby.
For a daily push, if you don’t do this, your rev doesn’t go out.
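The check-in rule above can be sketched in a few lines. This is only an illustration of the idea, not Facebook's bot: the `DailyPush` class, the message wording and the revision names are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DailyPush:
    """Hypothetical model of a daily push: revisions wait on their authors."""

    # Maps revision id -> author handle (both invented for this sketch).
    pending: Dict[str, str]
    checked_in: Set[str] = field(default_factory=set)

    def ping_messages(self) -> List[str]:
        # What a bot might say in the channel before the push starts.
        return [
            f"{author}: your rev {rev} is about to go out, are you here?"
            for rev, author in sorted(self.pending.items())
        ]

    def check_in(self, author: str) -> None:
        # The author answered the ping and is on standby.
        self.checked_in.add(author)

    def revisions_to_ship(self) -> List[str]:
        # Only revisions whose author checked in are allowed out.
        return sorted(
            rev for rev, author in self.pending.items()
            if author in self.checked_in
        )


push = DailyPush(pending={"r101": "alice", "r102": "bob"})
push.check_in("alice")
print(push.revisions_to_ship())
```

Here only alice answered the ping, so bob's revision is held back until a later push.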
Test Console
Built their own test console to show the state of their tests.
Uses Watir, Selenium, + a big suite of unit tests.
The console will not only show which tests are broken, it will also show when the test broke, and whose change broke it.
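One way a console could answer "when did it break, and whose change was it" is to walk back through a test's history to the start of the current failure streak. The sketch below assumes one test run per revision; the `RunResult` record and `find_breaking_run` helper are hypothetical names, and the video does not say how Facebook's console actually works.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RunResult:
    """One run of a single test against one revision (hypothetical record)."""
    revision: int
    author: str
    passed: bool


def find_breaking_run(runs: List[RunResult]) -> Optional[RunResult]:
    """Return the run that started the current failure streak.

    Returns None when there is no history or the latest run passes.
    """
    ordered = sorted(runs, key=lambda r: r.revision)
    if not ordered or ordered[-1].passed:
        return None
    breaking = None
    # Walk back from the newest run until we hit the last passing one.
    for run in reversed(ordered):
        if run.passed:
            break
        breaking = run
    return breaking


history = [
    RunResult(revision=40, author="alice", passed=True),
    RunResult(revision=41, author="bob", passed=False),
    RunResult(revision=42, author="carol", passed=False),
]
culprit = find_breaking_run(history)
print(culprit.revision, culprit.author)
```

In this made-up history the test has been red since revision 41, so that is the change the console would point at, even though revision 42 is also failing.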
Shadow branch
Production + changes + tests
Shadowing prod.
This is the working copy of prod that changes are merged to.
Error tracking (18min)
PHP errors. Exceptions. Fatals. All the things going wrong.
Can see the call stack for all errors on the site.
Will show subversion blame for that line of code.
Gatekeeper (24min)
This is the tool/process that impressed me most out of the entire presentation.
Gatekeeper allows Facebook to incrementally push changes to the live website, and then turn them on or off in very complicated, selective ways (basically a big conditional).
For example, you could push some new changes live (that you aren’t sure about) and then only expose them to:
– Employees only
– By country
– Age
– IP
– East coast/West coast
– Anyone but TechCrunch 🙂
You can bump public up to 1%. In minutes you will get a million hits.
You can grab the data, turn it back down. Make changes. And then turn it on again.
Super cool feature that lets you ease it out.
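To make the "big conditional" idea concrete, here is a minimal sketch of a feature gate with a percentage rollout. It is not Gatekeeper's real API, which the video does not show: the `Gate` and `User` classes, the rule helpers and the hash-bucket scheme are all assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class User:
    """Hypothetical user record with a few of the attributes listed above."""
    user_id: int
    is_employee: bool = False
    country: str = "US"


Rule = Callable[[User], bool]


def employees_only(user: User) -> bool:
    return user.is_employee


def in_countries(*codes: str) -> Rule:
    allowed = set(codes)
    return lambda user: user.country in allowed


def percent_of_public(feature: str, percent: float) -> Rule:
    """Expose the feature to a stable slice of users, e.g. percent=1.0."""
    def rule(user: User) -> bool:
        # Hash feature + user id so each user always lands in the same bucket.
        key = f"{feature}:{user.user_id}".encode()
        bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000
        return bucket < percent * 100
    return rule


@dataclass
class Gate:
    name: str
    rules: List[Rule] = field(default_factory=list)
    enabled: bool = True  # flip to False to turn the feature off everywhere

    def allows(self, user: User) -> bool:
        # A user sees the feature if the gate is on and any rule matches.
        return self.enabled and any(rule(user) for rule in self.rules)


gate = Gate("new_profile", [employees_only, percent_of_public("new_profile", 1.0)])
print(gate.allows(User(user_id=7, is_employee=True)))
```

Hashing the user id keeps the 1% slice stable between requests, so turning the percentage down and back up shows the feature to the same people rather than a fresh random sample each time.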
Push Karma (25min)
Basically a Karma system whereby you are assigned a Karma score (4 stars), and every time you screw up a push, you lose Karma.
Great way to put accountability on the engineers to make sure their changes make it live OK, and don’t cause the build engineers any pain.
HipHop for PHP (29min)
PHP compiler.
PHP is crappy and slow. So they made a compiled version of PHP.
Generates highly optimized C++ and converts it into a giant 1 GB binary – which is Facebook in its entirety.
Takes a couple minutes.
The payoff is a 50% performance boost.
Less hardware required.
Open source.
BitTorrent (31min)
Facebook pushes its 1 GB binary of compiled PHP to its 10,000s of servers using BitTorrent.
Very cool. Rack affinity – looks locally first before going out to neighbours.
Ridiculous data speeds.
Can roll Facebook.com in about 15min.
Whole site.
Incredible.
Minimal user impact.
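The rack affinity mentioned above boils down to preferring peers in your own rack before reaching out to neighbours. A rough sketch of that ordering follows; the `Peer` record and `order_peers` function are invented for the example and say nothing about how Facebook's BitTorrent setup is actually implemented.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Peer:
    """Hypothetical description of a server taking part in the push."""
    host: str
    rack: str
    has_binary: bool


def order_peers(me: Peer, peers: List[Peer]) -> List[Peer]:
    """Return peers that can serve the binary, same-rack peers first."""
    candidates = [p for p in peers if p.has_binary and p.host != me.host]
    # False sorts before True, so peers sharing our rack come first.
    return sorted(candidates, key=lambda p: (p.rack != me.rack, p.host))


me = Peer("web-12", rack="r1", has_binary=False)
swarm = [
    Peer("web-90", rack="r7", has_binary=True),
    Peer("web-13", rack="r1", has_binary=True),
    Peer("web-14", rack="r1", has_binary=False),
]
print([p.host for p in order_peers(me, swarm)])
```

Keeping most transfers inside a rack means the links between racks carry far fewer copies of the binary, which is presumably part of why the whole site can roll so quickly.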
Summary
Tools alone won’t save you.
You also need the right people.
You need the right culture.
You need the right company.
May 28, 2011 @ 17:01:24
Interesting video. I usually do integration development using webMethods middleware. As I’ve mentioned, we have to code directly on the server using a Java client, so we have to keep notes on what code we change. Automated testing is difficult and rarely done, as it interacts with multiple applications and remote partners, and as usual there is little emphasis on this type of activity.
My last client had a team of QA people who ran rigorous tests on the entire system before each deployment. They would check emails, check Trading Networks to ensure attributes were extracted from the XML, check files, SOAP messages, etc. This provided a level of comfort, although it is a rarity. Also, many changes require modifying core framework code, which affects the entire system, not just a small area that is new and can be gradually rolled out to more users. To summarize, this type of development is a bitch! 🙂
May 29, 2011 @ 12:31:01
I hear you Will – that sounds tough.
Not being able to make changes without confidence can be expensive and error prone.
The one thing that gives me hope is we seem to get better at this (as an industry) every year.
10 years ago I remember getting challenged on the merits of writing unit tests (which Facebook does in spades).
No one challenges us on that any more.
Gotta believe webMethods will eventually make it easier for its developers to test too.
Thx for the comment.
May 28, 2011 @ 18:06:19
They removed the video 😦 Does anyone have another link?
May 29, 2011 @ 12:31:53
Just tried it again this morning – worked for me (although I might somehow have it cached).
Does it work for anyone else?
May 31, 2011 @ 11:49:02
Works fine for me.
Thanks for the link and the short summary, very interesting.
Jun 01, 2011 @ 00:14:51
Works for me and I appreciate the write-up. I now know I have to watch the entire thing even though I ran out of time at work.