Bloom Filter

Sometimes, we need to find out if an element is a member of a given set or not. There are many algorithms which solve this problem. However, many are not efficient for large datasets and restricted memory. A good example is a dictionary on an iPhone. While typing, the application needs to verify if the word is in the dictionary or not. The dictionary is huge, and there is just not enough RAM to hold all words. How to find out all the typos? Luckily, there’s a way.

A Bloom filter is a data structure which verifies the presence of an element in a set, rapidly and memory-efficiently. This efficiency comes at a cost. It tells us with 100% accuracy, that the element is absent in the set, but sometimes it will report that the element is in the set when it isn’t. So it achieves the memory and speed efficiency at the cost of false positives. In simpler terms, when the Bloom filter tells us that element x is in set S, it may or may not be in set S. But if it tells us that it is not in S, then we can be sure that it isn’t.

Bloom filters are ideal for situations where a false negative is a very bad thing and a false positive is acceptable. A good example would be caching. It would be good to know if the resource requested by the client is in the cache, before we make the long trip to the server. For example, when we try to load a page on Chrome, the browser would love to know if the page is stored in the cache. If the filter returns no, the browser knows for sure that the page is not in the cache. Then it can make the expensive trip to the server to retrieve the page. If the filter returns yes, then the page might be in the cache, or it might be not. Accessing cache to verify its presence is less expensive than the trip to the server.

The implementation of Bloom filters is simple. We take a big array of bits, e.g. BitArray in C#, where all the bits are set to zero. Then we map all the elements in S to this array, by using any decent hash algorithms, such as MD5 or RSA256. For each element, we first find the hashes and set the corresponding bits in the array to 1. There can be collisions where two or more elements are mapped to the same location in the array. It doesn’t matter.

To verify if the element is in the set, we hash the element and verify if the corresponding bits are set. If even one of the bits is not set, then we can be sure that the element is not in S. If they are all set, then it means that the element might be in the set. The reason being, the same bits could have been set for different elements. Then we can always search S to confirm its presence.

I implemented the Bloom filter in C#. The interface for the filter has only two methods, add(word) and test(word).

  • void add(word): adds a word in the filter
  • bool test(word): returns true if word exists in the set.
public interface IFilter
{
    void Add(string word);
    bool Test(string word);
}

public class BloomFilter : IFilter
{
    private readonly BitArray _bitArray;
    
    public BloomFilter(int size)
    {
        _bitArray = new BitArray(size);
    }
    
    public void Add(string word)
    {
        var md5Hash = GetMd5Hash(word);
        _bitArray.Set(md5Hash, true);
        var sha256Hash = GetSha256Hash(word);
        _bitArray.Set(sha256Hash, true);
    }
    
    public bool Test(string word)
    {
        var md5Hash = GetMd5Hash(word);
        var sha256Hash = GetSha256Hash(word);
        
        return _bitArray.Get(md5Hash) && _bitArray.Get(sha256Hash);
    }
    
    private static int GetMd5Hash(string word)
    {
        var md5Hasher = MD5.Create();
        var hashed = md5Hasher.ComputeHash(Encoding.UTF8.GetBytes(word));
        var hashedValue = BitConverter.ToInt32(hashed, 0);
        return Math.Abs(hashedValue);
    }
    
    private static int GetSha256Hash(string word)
    {
        var sha256Hasher = SHA256.Create();
        var hashed = sha256Hasher.ComputeHash(Encoding.UTF8.GetBytes(word));
        var hashedValue = BitConverter.ToInt32(hashed, 0);
        return Math.Abs(hashedValue);
    }
}

To reduce the rate of false positives, we can use more hash functions. This means more bits will be set for each element, one bit per hashing algorithm. It results in stricter verification. However, this might affect the speed of the algorithm. It’s a tricky optimization problem, but can be solved if given the number of elements, the number of bits, and the number of hash functions.

Written on May 27, 2018