Friday, October 15, 2010

Python: Inverted Index for dummies

An Inverted Index is an index data structure storing a mapping from content, such as words or numbers, to its document locations and is generally used to allow fast full text searches.



The first step of Inverted Index creation is Document Processing In our case is word_index() that consist of word_split(), normalization and the deletion of stop words ("the", "then", "that"...).
def word_split(text):
    word_list = []
    wcurrent = []
    windex = None

    for i, c in enumerate(text):
        if c.isalnum():
            wcurrent.append(c)
            windex = i
        elif wcurrent:
            word = u''.join(wcurrent)
            word_list.append((windex - len(word) + 1, word))
            wcurrent = []

    if wcurrent:
        word = u''.join(wcurrent)
        word_list.append((windex - len(word) + 1, word))

    return word_list
word_split() is quite a long function that does a really simple job split words. You can rewrite it with just one line using something like re.split('\W+', text).
def words_cleanup(words):
    cleaned_words = []
    for index, word in words:
        if len(word) < _WORD_MIN_LENGTH or word in _STOP_WORDS:
            continue
        cleaned_words.append((index, word))
    return cleaned_words

def words_normalize(words):
    normalized_words = []
    for index, word in words:
        wnormalized = word.lower()
        normalized_words.append((index, wnormalized))
    return normalized_words
Cleanup and Normalize are just to function filters to apply after word_split(). In this case cleanup remove the stopwords (frozenset of strings) and normalize convert word in lower case. But you can do something more like removing accents, transform to singular or something else.
def word_index(text):
    words = word_split(text)
    words = words_normalize(words)
    words = words_cleanup(words)
    return words
word_index() is just an helper, take an input text and does all the word splitting/normalization job and the result is a list of tuple that contains position and word. [(1, u'niners'), (13, u'coach')].
def inverted_index(text):
    inverted = {}

    for index, word in word_index(text):
        locations = inverted.setdefault(word, [])
        locations.append(index)

    return inverted
Finally we've our invertex_index() method that take a text as input and returns a dictionary with words as keys and locations (position of the words in the document) as values.
def inverted_index_add(inverted, doc_id, doc_index):
    for word, locations in doc_index.iteritems():
        indices = inverted.setdefault(word, {})
        indices[doc_id] = locations
    return inverted
The Previous method, inverted_index(), returns a dictionary with just the information for the specified document. So inverted_index_add() add Document's inverted index to a Multi-Document Inverted Index. Here we've words that are keys of the dictionary and values are dictionary with doc_id as key and document location as values. {'week':{'doc2': [149], 'doc1': [179, 187]}}.
def search(inverted, query):
    words = [word for _, word in word_index(query) if word in inverted]
    results = [set(inverted[word].keys()) for word in words]
    return reduce(lambda x, y: x & y, results) if results else []
Now that we've the inverted index, we're able to do queries on it. The function above takes a Multi-Document Inverted index and a query, and returns the set of documents that contains all the words that you've searched.

Obviously to make a serious search function you need to add ranking, phrase matches, stemming and other features of full-text search. And you need to write your dictionary using an on disk btree. But this is a basis example, of how to build an inverted index.

Source code can be found on my GitHub repository under py-inverted-index:
git clone http://github.com/matteobertozzi/blog-code.git

Saturday, October 9, 2010

Unsigned Int of specified bit length

Finally I've added 128+ bit support to RaleighFS Table Object Row Index. I don't need much operation on indices only inc, dec and compare, but I've implemented other couple of methods (add, and, or, xor) and now there's a mini uintX library available at GitHub.

...
uint8_t index[16];    // 128bit unsigned integer

uintx_init(index, 128U);   // Initialize 128bit index to zero.

uintx_inc(index, 128U);    // Increment Index (Now is One)
uintx_dec(index, 128U);    // Decrement Index (Now is Zero)

uintx_from_u64(index, 128U, 8192U);    // Now index is 8192

uintx_add64(index, 128U, 5U);          // Add 5 to index

uintx_compare(index, 128U, 8197U);     // Return 0
uintx_compare(index, 128U, 9197U);     // Return -1
uintx_compare(index, 128U, 0U);        // Return 1
...

The API is quite simple pass your object (uint8_t vector) and its bit size (uint8_t[16] is 128bit) and other args needed to the method. Of course you can replace 128 with 256, 512 or everything else.

The source code can be found at my GitHub in the uintx folder:
http://github.com/matteobertozzi/carthage.git