Sunday, January 30, 2011

Zero-Copy Memory

Looking at my code paths, I've seen many methods that read from a data-structure to a buffer and then write back it to another data structure, and so on...

For example read from disk, write to a block cache, read a piece of data and put it in another other data-structure for processing. 

One way to avoid all this memcpy and data duplication is to use a copy-on-write data-structure. For strings or "single-block-data" is pretty easy. As you can see on picture above, you've a chunk of memory that is referenced from different objects. Each object keeps a pointer to the data and if needed an offset and a length to reference just a part of the data.

If someone call write() on one of those objects, internally data-blob is duplicated and modified. In this way the modified objects points to a new block and the others points to the old block.

This is a really simple copy-on-write technique and the string class is really suitable for that. But looking at my code, this case doesn't happen often... In the common case, I've object that references two or more blocks and sometimes happens that i need to remove or inject data in the middle of the buffer.

Extent-tree that i use for the Flow Objects in RaleighFS plus copy-on-write on blocks has the behavior that I need to avoid all this memcpy and data duplication.

Extent-Tree allows you to read from a list of blocks as a single block. Each block as an offset and a length, and blocks are sorted by offset, in the tree. This allows you to reach your specified offset fast and insert or remove at any specified offset.

Using this extent-tree has the advantage that you can avoid to memcpy large blocks from a data-structure to another, you can merge two extent tree or do other fancy thing without allocate new memory and copy data to it. But insert and split require a malloc, so you've to tune your node and extent allocation with an object pool, and this is pretty simple due to the fixed size of those object.

I've made a draft implementation that you can find on my github repo.

Sunday, January 16, 2011

Hadoop I/O: Sequence, Map, Set, Array, BloomMap Files

Hadoop' SequenceFile provide a persistent data structure for binary key-value pairs. In contrast with other persistent key-value data structures like B-Trees, you can't seek to a specified key editing, adding or removing it. This file is append-only.

SequenceFile has 3 available formats: An "Uncompressed" format, A "Record Compressed" format and a "Block-Compressed". All of them share a header that contains a couple of information that allows the reader to recognize is format. There're Key and Value Class Name that allows the Reader to instantiate those classes, via reflection, for reading. The version number and format (Is Compressed, Is Block Compressed), if compression is enabled the Compression Codec class name field is added to the header.


The sequence file also can contains a "secondary" key-value list that can be used as file Metadata. This key-value list can be just a Text/Text pair, and is written to the file during the initialization that happens in the SequenceFile.Writer constructor, so you can't edit your metadata.

As seen Sequence File has 3 available formats, the "Uncompressed" and the "Record Compressed" are really similar. Each call to the append() method adds a record to the sequence file the record contains the length of the whole record (key length + value length), the length of the key and the raw data of key and value. The difference between the compressed and the uncompressed version is that the value raw data is compressed, with the specified codec, or not.
In contrast the "Block-Compressed" format is more compression-aggressive. Data is not written until it reach a threshold, and when the threshold is reached all keys are compressed together, the same happens for the values and the auxiliary lists of key and value lengths.
As you can see in the figure on the left, a block record contains a VInt with the number of the buffered records and 4 compressed blocks that contains a list with the length of the keys, the list of keys, another list with the length of the values and finally the list of values. Before each block a sync marker is written.

Hadoop SequenceFile is the base data structure for the other types of files, like MapFile, SetFile, ArrayFile and BloomMapFile.

The MapFile is a directory that contains two SequenceFile: the data file ("/data") and the index file ("/index"). The data contains all the key, value records but key N + 1 must be greater then or equal to the key N. This condition is checked during the append() operation, if checkKey fail it throws an IOException "Key out of order".
The Index file is populated with the key and a LongWritable that contains the starting byte position of the record. Index does't contains all the keys but just a fraction of the keys, you can specify the indexInterval calling setIndexInterval() method. The Index is read enteirely into memory, so if you've large map you can set a index skip value that allows you to keep in memory just a fraction of the index keys.

SetFile and ArrayFile are based on MapFile, and their implementation are
just few lines of code.  The SetFile instead of append(key, value) as just the key field append(key) and the value is always the NullWritable instance. The ArrayFile as just the value field append(value) and the key is a LongWritable that contains the record number, count + 1. The BloomMapFile extends the MapFile adding another file, the bloom file "/bloom", and this file contains a serialization of the DynamicBloomFilter filled with the added keys. The bloom file is written entirely during the close operation.

If you want to play with SequenceFile, MapFile, SetFile, ArrayFile without using Java, I've written a naive implementation in python. You can find it, in my github repository python-hadoop.

Saturday, January 8, 2011

[TIP] Daily Repository Diff via Mail

If your 200 mail every morning are still not enough, you can add this script to your daily cron.

The following scripts that you can find here, allows you to send diffs of repositories that you follow. The  bash scripts below allows you to update your git/svn/hg repository and keep a diff in a plain and html format (using ansi2html).

git_diff() {
    cd $repo_url/$1

    git_repo_url=`git remote show origin | grep "Fetch URL" | cut -d ' ' -f 5-`
    echo "GIT Diff $1 ($2) - $git_repo_url"

    git fetch
    git diff --color HEAD origin/HEAD | $ansi2html > $diff_dir/$2.html
    git diff HEAD origin/HEAD > $diff_dir/$2.diff
    git merge origin/HEAD
}

hg_diff() {
    cd $repo_url/$1

    hg_repo_url=`hg showconfig | grep paths\.default | cut -d '=' -f 2-`
    echo "HG Diff $1 ($2) - $hg_repo_url"

    hg incoming --patch --git | $ansi2html > $diff_dir/$2.html
    hg incoming --patch --git  > $diff_dir/$2.diff
    hg pull -u
}

svn_diff() {
    cd $repo_url/$1

    svn_repo_url=`svn info | grep URL | cut -d ' ' -f 2-`
    svn_repo_rev=`svn info | grep "Last Changed Rev" | cut -d ' ' -f 4-`
    echo "SVN Diff $1 ($2) - $svn_repo_url"

    svn di $svn_repo_url -r$svn_repo_rev | $ansi2html > $diff_dir/$2.html
    svn di $svn_repo_url -r$svn_repo_rev > $diff_dir/$2.diff
    svn up
}

# Fetch my repos (xxx_diff repo_path diff_name)
git_diff "linux/linux-2.6" "linux-2.6"
svn_diff "apache/lucene" "lucene"
hg_diff "java/jdk" "hotspot-jdk7"

After running repo-diff script that allows you to update your favorites repositories and saving diff files, you can send them using he send-mail script.

diff_dir="~/.repo-diffs"
mail_address="th30z@localhost"

for html_file in `ls -1 $diff_dir/*.html` ; do
    repo_name=`basename $html_file | sed 's/\.html$//g'`
    diff_file=`echo $html_file | sed 's/\.html$/\.diff/g'`

    boundary="==`echo $repo_name | md5sum | cut -d ' ' -f -1`"
    alt_boundary="==`echo $boundary | md5sum | cut -d ' ' -f -1`"

    echo "Send Repo Diff $repo_name - $html_file"
    (
        echo "MIME-Version: 1.0"
        echo "Subject: Repo-Diff: $repo_name"
        echo "To: $mail_address"
        echo "Content-Type: multipart/mixed; boundary=$boundary"

        echo "--$boundary"
        echo "Content-Type: multipart/alternative; boundary=$alt_boundary"
        echo

        echo "--$alt_boundary"
        echo "Content-Type: text/plain"
        echo
        cat $diff_file

        echo "--$alt_boundary"
        echo "Content-Type: text/html"
        echo
        cat $html_file

        echo
        echo "--$alt_boundary--"
        echo "--$boundary"
        echo "Content-Type: Application/Binary_Attachment; 
                            name=\"`basename $diff_file`\""
        echo "Content-Disposition: attachment; 
                                   filename=\"`basename $diff_file`\""
        echo "Content-Transfer-Encoding: uuencode"
        echo
        uuencode $diff_file $diff_file
    ) | sendmail $mail_address
done

This script, for each file generated from the repo-diff script sends you a mail with the diff as body and attachment.

Scripts are available on my github repository under blog-code/repo-mail-diff.