Sunday, February 26, 2012

Thoughts on bucket flushing, block size and compression...

This post is just a collection of thoughts, on how to store data. I hope to get some feedback and new ideas from you guys, thanks!

Since I've started working on B*Trees, I've always assumed to have a fixed block size. With this "restriction" & "simplification", you can easily come up with a block in this format:
Keys are packed together in the beginning of the block and values are packed together in the end of the block, growing up toward the center. In this way you don't have to move data when a new key is added or keys & values when one value grows.

Assuming that your inserts in the block are already sorted (e.g. flush "memstore" to disk), in this way you can even compress keys & values with different algorithms.

...but, with the fixed size block and values starting from the end you need to allocate a full block.

In contrast, you can "stream" your data and flush it, just few kb of data at the time. In this case you don't have to allocate a full block, but you've lost the ability to use different compressions for keys & values and the ability to keep in memory only the keys without doing memcpys.

Another possibility is traverse the data twice, the first time writing the keys and the second time writing the values. In this case you gained the different compression features but if you're using compression you're not able to stop after a specified block size threshold because you don't know how much space each key & value takes.

Any comments, thoughts, ideas?

Moved to Stockholm & Back to Music and Data!

A couple of weeks ago I've started my new job at Spotify AB (Stockholm, Sweden).

The last two years at Develer (Florence, Italy) were fantastic, great environment, great people, great conferences PyCon4, Euro Python, QtDay, and I've to thank especially Giovanni Bajo, Lorenzo Mancini, Simone Roselli and Francesco Pallanti (AriaZero), and many more, for all the support in these two years. Thanks again guys!

...but now I'm here, new company, new country, new language (funny language) and new challenges.

Stockholm is beautiful and is not that cold as I had imagined, (even with -18 degrees celsius), but I'm still don't able to find good biscuits, how can you live without biscuits?

Since my new job is more about networking & data, new blog post will be slightly different, from the previous ones... less ui oriented and more data & statistic oriented.

Keep a look at interesting meetup in stockholm, I will be there. (Next one is Python Stockholm).

...And don't forget to use Spotify (Love, Discover, Share Music)!

Sunday, February 5, 2012

AesFS - Encrypted File-System layer

Last week I've spent a couple of hours playing with OpenSSL and CommonCrypto, and the result is a tiny layer on top the file-system to encrypt/decrypt files using AES.

The source code is available on my github at misc-common/aesfs. To build just run 'python build.py' and the result is placed in ./build/aesfs/ and ./build/aespack/. AesPack is a command line utility to pack and unpack single files, while aesfs is a fuse file-system.

You can run aesfs with:
aesfs -o root=/my/plaindir /mnt/encrypted
Since AesFS is on top your file-system you've to specify (with -o root) the directory where you want to store files, while the /mnt/ is the mount point where you can read/write your files in clear.

Files are written in blocks of 4k (8 bytes of header, 16 extra bytes for aes, and 4072 bytes of data). Check block.h and block.c for more information.