#!/usr/bin/python

# tz_bloom_filter.py
#       --copyright--                   Copyright 2013 (C) Tranzoa, Co. All rights reserved.    Warranty: You're free and on your own here. This code is not necessarily up-to-date or of public quality.
#       --url--                         http://www.tranzoa.net/tzpython/
#       --email--                       pycode is the name to send to. tranzoa.com is the place to send to.
#       --bodstamps--
#       January 25, 2013        bar
#       September 23, 2013      bar     correct namespace
#       --eodstamps--
##      \file
#       \namespace              tzpython.tzlib
#
#
"""
Notes about bloom filters from: http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html

The idea is to quickly and cheaply know whether you've seen a "thang" before (perhaps thangs are big and stored somewhere).
There are around N thangs.

Instead of checking the N things to see if a query thang is among the stored thangs,
have a bunch of hash functions that each return an index in a bit-table that's "M" bits long.

You have K hash functions and bit-tables.

Hash the query thang using each of the hash functions. Is the bit set in each and every hash function table?
The query thang is probably stored.
If any one of the bits is not set, then the query thang is not in the memory. This thang is new.


Pick K to be                        log(2.0) * M / N
That's the best K, giving good odds of this bloom filter not returning a false positive.
The odds of a false positive are    0.6185 ** (M / N)
The page contains a table to show the effects of M/N and K values.

Instead of storing K bit tables,
the page describes storing a 4-bit count for each bit.
The count is incremented for each bit set when an item is added to the N items.
And the counts are decremented when items go out of the N items.
The count maxes out at 15, which can cause a problem.
But that never happens in their application because the tables are rebuilt as a normal consequence of operations before it happens.

"""


#
#
# eof
