Sunday, June 24, 2012

Data-center Rolling Upgrades coordinated by ZooKeeper

Still playing around trying to improve the daily deploy work in the data-centers.

The idea is to replace a sequential/semi-manual process with something more automatic that doesn't need human intervention unless a failure happens.

Services and Deploy rules:
  • Services have dependencies (Service B depends on Service A), so deploy order matters!
  • You can't bring down all the machines at the same time!
  • One or more machines can be unreachable during the deploy (network problems, hardware failures, ...).
  • Each machine needs to be self-sufficient!
Must Have (Monitoring)
  • Current service state of each machine (online/offline, service v1, v2)
  • Current "deploy" state (Ready to roll?)

The idea is quite simple: use ZooKeeper to keep track of each service (A, B, ..., K) with the list of available machines (ephemeral znodes) and to keep track of the deploy state ("staging").
  • /dc/current: Contains a list of services with the list of online machines (and their service versions).
  • /dc/staging: Contains a list of services with the list of machines ready to roll.
  • /dc/deploy: Deploy order queue; each node represents the service to upgrade.
When you're ready to deploy something new, you create the new znodes (see the sketch after this list):
  • Add services to "staging" with the relevant metadata (version, download path, ...)
  • Define a deploy "order" queue
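
A minimal sketch of this setup step, using the kazoo ZooKeeper client (the paths, metadata and version names are just examples of how the znodes could be laid out):

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')
zk.start()

# Publish the new version of "service-a" in staging, with its metadata.
meta = json.dumps({'version': 'v2', 'torrent': 'service-a-v2.torrent'})
zk.ensure_path('/dc/staging')
zk.create('/dc/staging/service-a', meta.encode('utf-8'))

# Define the deploy order as a queue of sequential znodes (service-a first, then service-b).
zk.ensure_path('/dc/deploy')
zk.create('/dc/deploy/item-', b'service-a', sequence=True)
zk.create('/dc/deploy/item-', b'service-b', sequence=True)

zk.stop()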
Each service is notified about the new staging version and starts downloading (see the "Data-center deploy using torrent and mlock()" post). Once the download is complete, the service registers itself in the "staging" queue.

Now the tricky part: when can I start switching to the new version? The idea is to specify a quorum for each service. The first machine in the "staging" queue for the first service in the "deploy" queue checks for the quorum and, when it's time, shuts itself down and restarts with the new service. Once done, it adds itself to the "current" list and removes itself from the staging queue.

One by one, each machine upgrades itself until the deploy queue is empty. If a machine was down during the deploy, the "current" node is checked to find which version is the most popular, and that version of the service is started.
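
A rough sketch of the per-machine upgrade step described above (the znode layout, machine id and quorum logic are illustrative, and the actual restart is left as a placeholder):

from kazoo.client import KazooClient

QUORUM = 3          # minimum number of machines that must stay online for the service
ME = 'machine-042'  # hypothetical machine id

def restart_with_new_version(service):
    # Placeholder: stop the old service and start the freshly downloaded version.
    print('restarting %s on %s' % (service, ME))

zk = KazooClient(hosts='zk1:2181')
zk.start()

deploy_queue = sorted(zk.get_children('/dc/deploy'))
if deploy_queue:
    # First service in the deploy queue.
    service = zk.get('/dc/deploy/' + deploy_queue[0])[0].decode('utf-8')
    staging = sorted(zk.get_children('/dc/staging/' + service))
    online = zk.get_children('/dc/current/' + service)

    # Only the head of the staging queue rolls, and only while enough
    # machines stay online to satisfy the quorum.
    if staging and staging[0].startswith(ME) and len(online) > QUORUM:
        restart_with_new_version(service)
        zk.create('/dc/current/%s/%s' % (service, ME), b'v2', ephemeral=True, makepath=True)
        zk.delete('/dc/staging/%s/%s' % (service, staging[0]))

zk.stop()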

Saturday, June 2, 2012

Data-center deploy using torrent and mlock()

Every morning you come into the office hoping that the nightly job that produces your blobs has finished... and if everything is fine, you spend the rest of the day hoping that none of the machines fails during the transfer...
If you have a service that consumes static data and you have more than one data-center, you probably face the problem of distributing data to all your service machines every day.

Remember: 60MiB/sec * 1hour = ~210GiB

So what are the possible solutions for transferring these blobs?
The first solution is to copy all the data to one machine in each data-center; each of those machines is then responsible for copying everything to all the other machines inside its data-center.
Note: prefer rsync over scp, since if you lose the connection with scp you need to retransfer everything from byte zero.
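
For example, something like this keeps partially transferred files around so a re-run resumes instead of starting over (paths and host are made up):

rsync -av --partial --progress /data/blobs/ dist-host:/data/blobs/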

But what happens if a machine is down?
One solution is to make all the machines part of the distribution, removing identities. Every machine is equal, and every machine needs to fetch these blobs. So, instead of rsyncing from the "build" machine to the dist-host and from the dist-host to the service machines, the "build" machine just announces "I have new data" and each machine starts fetching the data in a collaborative way (using BitTorrent).

Each machine (build machine, dist-hosts, services) needs to run a torrent client; you can implement your own in a few lines of Python using libtorrent. The idea is to fetch the latest blobs.torrent from a feed hosted on the build machine and start downloading. The build machine is the initial seeder, but then every machine becomes part of the data distribution. By writing your own tracker you can also tune the peer selection, preferring machines inside your data-center or inside your rack to avoid cross-site latency.
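
A minimal sketch of such a client with the python libtorrent bindings (ports, paths and the polling loop are just an example, and the API differs slightly between libtorrent versions):

import time
import libtorrent as lt

session = lt.session()
session.listen_on(6881, 6891)

info = lt.torrent_info('blobs.torrent')
handle = session.add_torrent({'ti': info, 'save_path': '/data/blobs'})

# Download, then keep running as a seeder so the other machines can fetch from us.
while not handle.is_seed():
    status = handle.status()
    print('%.1f%% done, %d peers' % (status.progress * 100, status.num_peers))
    time.sleep(5)
print('download complete, seeding...')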

Another important thing to remember, if your service relies on the buffer cache to keep data in memory, is to tell the system not to evict your pages; otherwise you'll probably see your service slow down once you start copying data to that machine... So make sure to mlock() your heavily used pages, or, if your blobs fit in memory, use vmtouch to do the trick (vmtouch -l -d -m 64G blob). Remember to add a memlock entry for your user in /etc/security/limits.d/, otherwise mlock() will fail.
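
The memlock entry looks something like this (the file name and user are just examples):

# /etc/security/limits.d/memlock.conf
# <domain>   <type>  <item>    <value>
myservice    soft    memlock   unlimited
myservice    hard    memlock   unlimited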

You can find the source code of a simple command-line BitTorrent client and a tracker at https://github.com/matteobertozzi/misc-common/tree/master/torrent.

Saturday, April 21, 2012

Improve and Tune your service/app with some statistics

One of the good things about being in a data-driven company is that every decision must be based on the data you've collected. For some people this means just marketing decisions, but you can do the same thing to improve your services, applications and code.

Think about these questions:
  • Is my code faster/slower between version A and B?
  • Is my code using more/less memory between version A and B?
  • Is someone still using feature A?
  • Which are the most used features?
If you look at these questions, you can easily realize that these are not problems related only to big companies with lots of data, so even your small application can benefit from some stats.

One of the main blockers is that it's really difficult to modify your application to add stats support, because you don't really know what your questions are and you don't know what kind of output you want.

What you want is just a tiny call like: collect("func(x) time", 10sec)
And some time later you can decide: ok, I want to see the average of func(x) time between Jan-Feb (version A) and Mar-Apr (version B).
Or if you want to keep track of the features used, you can call something like: collect("feature-used", "My Feature A"). Later you can decide to query for a specific feature X to see when it was last used, or query for the most used features, or something else... but it's really difficult to know in advance what you want to keep track of.

Ok, now that you understand a bit of the problem we want to solve, the fun part begins.
The main idea is to have a super lightweight "Stats Uploader" that collects your data with a single line of code and sends it to a collector; later on you can ask questions of your data (completely detached from your application).
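
Just to make that single call concrete, here is a tiny sketch of what collect() could look like, shipping every sample to the collector over UDP (the endpoint and wire format are invented for the example; this is not skvoz's actual API):

import json
import socket
import time

COLLECTOR_ADDR = ('stats-collector.example.com', 9999)   # hypothetical collector endpoint
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def collect(key, value):
    # Fire-and-forget: one datagram per sample, the application never blocks on stats.
    sample = json.dumps({'ts': time.time(), 'key': key, 'value': value})
    try:
        _sock.sendto(sample.encode('utf-8'), COLLECTOR_ADDR)
    except socket.error:
        pass   # stats must never break the application

collect('func(x) time', 10)
collect('feature-used', 'My Feature A')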

As you can see from the schema above, your application sends data to a "Collector" service that can store your information in different ways (you can write a custom Sink to take care of specific keys and store them in the format that best fits your needs).
The Aggregator fetches the data required to answer your question and applies your custom logic to extract the answer.
The Viewer is just a front-end to display your data nicely, like a web service that plots some charts and tables. It asks the aggregator questions and displays the results to you.

The code is available on github at https://github.com/matteobertozzi/skvoz.
I'll probably give a talk about this at EuroPython (EP2012).

Friday, April 20, 2012

Embedded Store, under the hood...

This week I found an interesting bug that can be summarized this way: the user has no idea of what happens under the hood, and their main usage is always against your design.

To give you more context, the bug was related to embedded storage systems: something like bsddb, sqlite, or just a simple B+Tree or on-disk HashTable.

So, how is an embedded storage system designed?
As you can see from the simplified schema on the right:
  • The lowest level is the raw access to the on-disk data structure (e.g. B+Tree, HashTable, ...), so each request goes directly to disk.
  • On top of that, to speed things up, you add a cache to avoid fetching data from disk all the time.
  • And everything is wrapped in a nice API that provides some sort of get and put functionality, at maximum speed thanks to the cache.

Everything seems fine. You can create an application on top of your library capable of handling tons of requests without touching the disk, thanks to the cache layer, and you can even think of building a service to query your storage from a remote host... and here the problems begin.

Your first super happy user arrives and decides to build their super scalable infrastructure with your embedded storage.
...and what is the easy way to get a super scalable service? Obviously, adding some threads... but threads are not a problem, because the user has learned the lesson and now knows not to use shared variables. So the brilliant solution is that each thread has its own instance of the "storage object"; to be clearer, each thread does something like db.open("super-storage.db").

Everything seems fine... but after a couple of days the user starts crying... sometimes data is missing, logs contain strange "page not found" messages, some parts of the store are corrupted, and so on...

Can you spot the problem? Yes, it's the cache...
No one is aware of changes to the file: every thread uses its own cache, and the first request for a block not in cache ends up creating some bad state.

So the solution for the user is to use the library as it was designed, with just one thread/process/whatever accessing the store file.

But if you want to slow down your super cool library and make the user happy, you can always add an ID to the super-block, and every time the user requests something... you fetch the super-block from disk, compare it with the one in cache, and if they are different you just invalidate all the caches...
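
A sketch of that check (the super-block format, field names and block size are invented for the example):

import struct

SUPER_BLOCK_FORMAT = '<QQ'   # hypothetical super-block: magic + change-id
SUPER_BLOCK_SIZE = struct.calcsize(SUPER_BLOCK_FORMAT)
BLOCK_SIZE = 4096

class Store(object):
    def __init__(self, path):
        self.fd = open(path, 'rb')
        self.cache = {}
        self.cached_id = self._read_change_id()

    def _read_change_id(self):
        self.fd.seek(0)
        magic, change_id = struct.unpack(SUPER_BLOCK_FORMAT, self.fd.read(SUPER_BLOCK_SIZE))
        return change_id

    def get(self, block_offset):
        # Every request pays an extra read of the super-block: if another
        # process changed the file, drop the whole cache before answering.
        current_id = self._read_change_id()
        if current_id != self.cached_id:
            self.cache.clear()
            self.cached_id = current_id
        if block_offset not in self.cache:
            self.fd.seek(block_offset)
            self.cache[block_offset] = self.fd.read(BLOCK_SIZE)
        return self.cache[block_offset]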

Sunday, February 26, 2012

Thoughts on bucket flushing, block size and compression...

This post is just a collection of thoughts on how to store data. I hope to get some feedback and new ideas from you guys, thanks!

Since I started working on B*Trees, I've always assumed a fixed block size. With this "restriction" & "simplification", you can easily come up with a block in this format:
Keys are packed together at the beginning of the block and values are packed together at the end of the block, growing toward the center. In this way you don't have to move data when a new key is added, or move keys & values when a value grows.
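
A tiny sketch of that layout (the block size is arbitrary, and the per-entry offsets/lengths that a real block also needs are omitted for brevity):

BLOCK_SIZE = 4096

class Block(object):
    def __init__(self):
        self.data = bytearray(BLOCK_SIZE)
        self.key_offset = 0             # keys grow from the beginning...
        self.value_offset = BLOCK_SIZE  # ...values grow from the end, toward the center

    def has_space(self, key, value):
        return self.key_offset + len(key) <= self.value_offset - len(value)

    def add(self, key, value):
        if not self.has_space(key, value):
            return False   # block is full: flush it and start a new one
        self.data[self.key_offset:self.key_offset + len(key)] = key
        self.key_offset += len(key)
        self.value_offset -= len(value)
        self.data[self.value_offset:self.value_offset + len(value)] = value
        return True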

Assuming that your inserts into the block are already sorted (e.g. flushing a "memstore" to disk), you can even compress keys & values with different algorithms.

...but with the fixed-size block and values starting from the end, you need to allocate a full block.

In contrast, you can "stream" your data and flush it just a few KB at a time. In this case you don't have to allocate a full block, but you lose the ability to use different compression for keys & values and the ability to keep only the keys in memory without doing memcpys.

Another possibility is to traverse the data twice, writing the keys the first time and the values the second time. In this case you gain back the per-type compression, but if you're using compression you're not able to stop at a specified block-size threshold, because you don't know how much space each key & value will take.

Any comments, thoughts, ideas?

Moved to Stockholm & Back to Music and Data!

A couple of weeks ago I started my new job at Spotify AB (Stockholm, Sweden).

The last two years at Develer (Florence, Italy) were fantastic: great environment, great people, great conferences (PyCon4, EuroPython, QtDay). I especially have to thank Giovanni Bajo, Lorenzo Mancini, Simone Roselli and Francesco Pallanti (AriaZero), and many more, for all the support during these two years. Thanks again guys!

...but now I'm here: new company, new country, new language (a funny language) and new challenges.

Stockholm is beautiful and not as cold as I had imagined (even at -18 degrees Celsius), but I still can't find good biscuits. How can you live without biscuits?

Since my new job is more about networking & data, new blog posts will be slightly different from the previous ones... less UI-oriented and more data & statistics-oriented.

Keep an eye out for interesting meetups in Stockholm, I will be there. (The next one is Python Stockholm.)

...And don't forget to use Spotify (Love, Discover, Share Music)!

Sunday, February 5, 2012

AesFS - Encrypted File-System layer

Last week I spent a couple of hours playing with OpenSSL and CommonCrypto, and the result is a tiny layer on top of the file-system to encrypt/decrypt files using AES.

The source code is available on my github at misc-common/aesfs. To build it, just run 'python build.py'; the result is placed in ./build/aesfs/ and ./build/aespack/. AesPack is a command-line utility to pack and unpack single files, while aesfs is a FUSE file-system.

You can run aesfs with:
aesfs -o root=/my/plaindir /mnt/encrypted
Since AesFS sits on top of your file-system, you have to specify (with -o root) the directory where you want to store the files, while /mnt/encrypted is the mount point where you can read/write your files in the clear.

Files are written in blocks of 4 KiB (8 bytes of header, 16 extra bytes for AES, and 4072 bytes of data). Check block.h and block.c for more information.
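
For reference, those numbers add up to a 4096-byte on-disk block; here is a tiny sketch of the bookkeeping (the actual header fields live in block.h and are not shown here):

BLOCK_SIZE  = 4096   # size of each on-disk block
HEADER_SIZE = 8      # per-block header
AES_EXTRA   = 16     # extra bytes used for AES
DATA_SIZE   = BLOCK_SIZE - HEADER_SIZE - AES_EXTRA   # 4072 bytes of payload per block

def disk_size(plain_size):
    # Bytes used on disk for a file of 'plain_size' clear-text bytes.
    nblocks = (plain_size + DATA_SIZE - 1) // DATA_SIZE
    return nblocks * BLOCK_SIZE

print(disk_size(10 * 1024 * 1024))   # e.g. a 10 MiB file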