Saturday, April 21, 2012

Improve and Tune your service/app with some statistics

One of the good things about being in a data-driven company is that every decision must be based on the data you've collected. For some people this just means marketing decisions, but you can apply the same approach to improve your services, applications and code.

Think about these questions:
  • Is my code faster/slower between version A and B?
  • Is my code using more or less memory between version A and B?
  • Is someone still using feature A?
  • Which are the most used features?
If you look at these questions, you can easily see that these are not problems related only to big companies with lots of data, so even your small application can benefit from some stats.

One of the main blockers is that it is really difficult to modify your application to add some stats support, because you don't yet know what your questions are and what kind of output you want.

What you want is just a tiny call like: collect("func(x) time", 10sec)
And some time later you can decide: ok, I want to see the average of func(x) time between Jan-Feb (version A) and Mar-Apr (version B).
Or, if you want to keep track of the features used, you can call something like: collect("feature-used", "My Feature A"). Later you can query for a specific feature X to see when it was last used, or query for the most used features, or something else... but it is really difficult to know in advance what you want to keep track of.
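
To make the idea concrete, here is a minimal sketch of what such a collect() call could look like in Python. The function body, the UDP transport, the collector address and the payload format are all assumptions made for illustration; they are not necessarily what skvoz actually implements.

    # Hypothetical fire-and-forget collect(): not skvoz's actual API.
    import json
    import socket
    import time

    COLLECTOR_ADDR = ('127.0.0.1', 9999)   # assumed collector host/port
    _sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def collect(key, value):
        """Send a (timestamp, key, value) event to the collector without blocking the app."""
        event = json.dumps({'ts': time.time(), 'key': key, 'value': value})
        try:
            _sock.sendto(event.encode('utf-8'), COLLECTOR_ADDR)
        except socket.error:
            pass  # collecting stats must never break the application

    # Usage: time a function and record which features are used.
    start = time.time()
    # ... func(x) ...
    collect("func(x) time", time.time() - start)
    collect("feature-used", "My Feature A")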

Ok, now that you understand a bit of the problem we want to solve, the fun part begins.
The main idea is to have a super lightweight "Stats Uploader" that collects your data with a single line of code and sends it to a collector; later on, you can ask questions of your data, completely detached from your application.

As you can see from the schema above, your application sends data to a "Collector" service, which can store your information in different ways (you can write a custom Sink to take care of specific keys and store them in a format that better fits your needs).
The Aggregator fetches the data required to answer your question and applies your custom logic to extract the answer.
The Viewer is just a front-end to display your data nicely, like a web service that plots some charts and tables. It asks questions of the aggregator and displays the results to you.
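
Just to give a rough idea of what a custom Sink might look like, here is a sketch that appends the events of the keys it handles to a CSV file. The class name and method names are made up for illustration; check the skvoz code for the real interfaces.

    # Hypothetical Sink: append events for the keys it handles to a CSV file.
    import csv

    class CsvSink(object):
        def __init__(self, path, keys):
            self.path = path
            self.keys = set(keys)          # keys this sink is responsible for

        def handles(self, key):
            return key in self.keys

        def store(self, timestamp, key, value):
            with open(self.path, 'a') as f:
                csv.writer(f).writerow([timestamp, key, value])

    # The collector would route each incoming event to every sink whose
    # handles() returns True, e.g. CsvSink('timings.csv', ['func(x) time']).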

The code is available on github at https://github.com/matteobertozzi/skvoz.
I'll probably give a talk about this at EuroPython (EP2012).

Friday, April 20, 2012

Embedded Store, under the hood...

This week I found an interesting bug that can be summarized this way: the user has no idea what happens under the hood, and their main usage is always against your design.

To give you more context, the bug was related to embedded storage systems: something like bsddb, sqlite, or just your simple B+Tree or on-disk HashTable.

So, how is an embedded storage system designed?
As you can see from the simplified schema on the right:
  • The lowest level is raw access to the on-disk data structure (e.g. B+Tree, HashTable, ...), so each request goes directly to disk.
  • On top of that, to speed things up, you add a cache to avoid fetching data from disk all the time.
  • And everything is wrapped in a nice API that provides some sort of get and put functionality, at maximum speed thanks to the cache (a toy sketch follows below).
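
The sketch below is a deliberately toy version of this layering (block size, cache policy and file format are arbitrary choices for illustration), just to make the roles of the two layers explicit.

    # Toy embedded store: raw disk access at the bottom, a block cache on top,
    # and a tiny get API wrapping both.
    class DiskStore(object):
        BLOCK_SIZE = 4096

        def __init__(self, path):
            self.path = path

        def read_block(self, block_id):
            # Raw access: every call hits the disk.
            with open(self.path, 'rb') as f:
                f.seek(block_id * self.BLOCK_SIZE)
                return f.read(self.BLOCK_SIZE)

    class CachedStore(object):
        def __init__(self, path):
            self.disk = DiskStore(path)
            self.cache = {}                # block_id -> bytes, private to this instance

        def get_block(self, block_id):
            # Only the first miss touches the disk; later reads are served from memory.
            if block_id not in self.cache:
                self.cache[block_id] = self.disk.read_block(block_id)
            return self.cache[block_id]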

Everything seems fine. You can build an application on top of your library that handles tons of requests without touching the disk, thanks to the cache layer, and so on; you can even think of building a service to query your storage from a remote host... and here the problems begin.


Your first super happy user arrives and decides to build their super scalable infrastructure with your embedded storage.
...and what is the easy way to get a super scalable service? Obviously, adding some threads... but threads are not a problem, because the user has learned the lesson and knows not to use shared variables. So the brilliant solution is that each thread gets its own instance of the "storage object"; to be clearer, each thread does something like db.open("super-storage.db"), as sketched below.
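
In code, the pattern looks something like the following (hypothetical, reusing the toy CachedStore sketched above): the file on disk is shared, but every handle silently carries its own private block cache.

    # Each thread opens its own handle on the same file: the data on disk is
    # shared, but every instance has its own private cache.
    import threading

    def worker():
        db = CachedStore("super-storage.db")   # per-thread instance, per-thread cache
        # ... serve requests using db.get_block(...) ...
        # Blocks already cached here will never reflect what other
        # threads write to the file afterwards.

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads:
        t.start()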

Everything seems fine... but after a couple of days the user starts crying... sometimes data is missing, the logs contain strange "page not found" messages, some parts of the store are corrupted, and so on...

Can you spot the problem? Yes, it's the cache...
No instance is aware of the changes made to the file by the others: every thread uses its own cache, and the first request for a block that is not in cache ends up creating some bad state.

So the solution for the user is to use the library as you designed it, with just one thread/process/whatever accessing the store file.

But if you want to slow down your super cool library and make the user happy, you can always add an ID to the super-block, and every time the user requests something... you fetch the super-block from disk, compare it with the one in cache, and if they are different you just invalidate all the caches...
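
A sketch of that workaround, again on top of the toy CachedStore above: keep the super-block (with its ID) around, re-read it on every request, and drop the whole cache when it no longer matches what is on disk. The extra read per request is exactly the slowdown mentioned above.

    # Hypothetical workaround: validate the cache against the super-block on every request.
    class CheckedStore(CachedStore):
        SUPER_BLOCK = 0                    # assume block 0 holds the store metadata/ID

        def __init__(self, path):
            CachedStore.__init__(self, path)
            self.super_block = None

        def get_block(self, block_id):
            # Re-read the super-block from disk; if someone else changed the
            # file, its ID differs and everything cached here is stale.
            current = self.disk.read_block(self.SUPER_BLOCK)
            if current != self.super_block:
                self.cache.clear()
                self.super_block = current
            return CachedStore.get_block(self, block_id)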