Bloom Filters

How does Google’s Gmail so quickly tell you that the chosen username is available or the opposite? How does Medium recommend you the articles to read? How does Netflix knows whether you have already watched a movie?

Scanning millions of rows for websites like Google, Medium & Netflix is hard, especially considering the number of active users they have who keep on sending data requests to them. Even there’s a limit to the database index that can reside in memory or the caching of REDIS that we employ to make API responses faster.

In this article, we will get to know about Bloom Filter, a data structure that extends the concept of the hash table to restrict memory usage to filter a large dataset. You may have heard of it as a probabilistic data structure designed to answer only Yes or No!

Introduction

Bloom filter was invented by Burton Bloom, a computer scientist in 1970 in an attempt to track which keys have been inserted while brutally optimizing the memory usage and execution time. In the quest for ultimate efficiency, the bloom filter is designed in such a way that it can lead to false positives, a situation when the key is not present but it returns true. But will never result in false negatives, a situation where the key is present but it returns false. So if a key is not present, it will always return false.

A bloom filter is highly space efficient and does not store the actual keys. Even for the insertion and lookup, it’s extremely fast. Both insertion and lookup operations are O(1) along with the space efficiency which is O(1)

Internals

Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception since they can share storage between elements with equal prefixes). Linked structures incur an additional linear space overhead for pointers. A Bloom filter with a 1% error and an optimal value of k, on the other hand, requires only about 9.6 bits per element — regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each time we add about 4.8 bits per element we decrease it by ten times. 1

The bloom filter interface only supports 2 operations: Insertion & Lookup. Sadly you cannot delete any key previously inserted and you cannot iterate over the set of keys.

At the core, a bloom filter is an array of binary values i.e. 0 or 1. Every bit corresponds to the hash value of the given key! If the bit is 1 the key is present if it’s 0 the key is not present.

Hashing

A hash is like the fingerprint of the given data

For example, the MD5 hash value of 'Apple' is '9f6290f4436e5a2351f12e03b6433c3c'

Hashing is the key to the bloom filter’s success. First, we’ll use a single hash function to see how it goes with the filter and once we catch the limitation of a single hash function, we will introduce multiple hash functions for efficient results.

Many hash functions use numeric or alphanumeric keys. The commonly used hash functions are:

  • Division Method
  • Mid Square Method
  • Folding Method
  • Multiplication Method

Assume that we have an array of 10 elements to store the bitmap of the keys we will be provided. And let’s assume a simple hash function which will return us the array index for a given key like this:

function HashA(key: number) {
  return key % 10;
}

So assume we have to insert the following numbers in the bool filter: 2, 18, 45, the resultant index value from the hash will be:

function HashA(key: number) {
  return key % 10;
}

HashA(2); // return index value 2;
HashA(18); // return index value 8;
HashA(45); // return index value 5;

hash-A

A simple hash function to evaluate the index of the key


Collisions

You might have already guessed what can wrong with using only a single hash function, no matter how strong you implement it, will always result in a collision as your data set grows. Here we see an example:

function HashA(key: number) {
  return key % 10;
}

HashA(2); // return index value 2;
HashA(18); // return index value 8;
HashA(45); // return index value 5;

/* warning: collision ahead!*/
HashA(22); // return index value 2;

hash-A-collision

A simple hash function resulting in collision


To avoid the collision, the bloom filter uses a set of k independent hash functions with variations in the hashing algorithms. While insertion all the independent hash functions map to the same key to yield different index values which looks like this in code:

function HashA(key: number) {
  return key % 10;
}

function HashB(key: number) {
  const A = 0.12345;
  return Math.floor((key * A) % 1);
}

function HashC(key: number) {
  const stringValue = String(key);
  const PRIME_MULTIPLIER = 1801; // Random prime number
  let hashVal = 0;

  for (let i = 0; i < stringValue.length; i += 1) {
    hashVal += stringValue.charCodeAt(i) - 96;
    hashVal *= PRIME_MULTIPLIER;
  }

  return Math.abs(hashVal % 10);
}

/* trying out triple hashing of 2 and 22 */
HashA(2) // return index value 2;
HashB(2) // return index value 0;
HashC(2) // return index value 6;

HashA(22) // return index value 2;
HashB(22) // return index value 0;
HashC(22) // return index value 2;

triple-hashing

A combination of 3 hash functions yielding unique bitmaps avoiding collision


As you can see the bitmap above for 2 is now 2, 0, 6 while that for 22 is 2, 0, 2 which justifies the use of 3 different hash functions for unique identifications of keys instead of one or even the use of 2 different hashing functions which can result in collision quickly.

Tuning Parameters

Bloom filter’s false positives can be tuned using parameters like the size of the underlying array, and the number of hash functions. One common and simple mathematical approximation of the false positive rate is given by:

FalsePositiveRate = (1 - (1 - 1/m)nk)k

  • Increase the size of array (m) always decreases the false positive rate.
  • Increasing the number of items inserted (n) increases the false positive rate.
  • Increasing the number of hash functions (k) can increase or decrease the false positive rate depending on the above parameter.

Implementation

It’s time to code up the full implementation of the bloom filter (in TypeScript)

type HashFunction = (arg: any) => number; 

class BloomFilter<T> {
    private size: number;
    private hash functions: Array<HashFunction>;
    private bitmap: Array<number>;
    
    constructor(size: number, hashFunctions:Array<HashFunction>){
      if (hashFunctions.length < 3) {
        throw new Error('Need at least 3 hash functions');
      }
      this.size = size;
      this.hashFunctions = hashFunctions;
      this.bitmap = new Array<number>(this.size).fill(0);
    }

    public add(key: T) {
      this.hashFunctions.forEach(e => {
        this.bitmap[e(key) % this.size] = 1;
      });
    }

    public lookup(key: T): boolean {
      for (let i =0; i < this.hashFunctions.length; i += 1) {
        const index = this.hashFunctions[i](key) % this.size;
        if (!this.bitmap[index]) return false;
      }
      return true;
    }
}

const hashF: Array<HashFunction> = [
    function hashA(key: number): number {
      return key % 10;
    },

    function hashB(key: number): number {
      const A = 0.12345;
      return Math.floor((key * A) % 1);
    },

    function hashC(key: number): number {
      const stringValue = String(key);
      const PRIME_MULTIPLIER = 1801; // Random prime number
      let hashVal = 0;

      for (let i = 0; i < stringValue.length; i += 1) {
          hashVal += stringValue.charCodeAt(i) - 96;
          hashVal *= PRIME_MULTIPLIER;
      }

      return Math.abs(hashVal % 10);
    },
];

/* creating an instance of the bloom filter*/
const myFilter = new BloomFilter<number>(10, hashF);

myFilter.add(2);
myFilter.add(2);

console.log(myFilter.lookup(2));
console.log(myFilter.lookup(33));
console.log(myFilter.lookup(22));

Usages

Bloom filters have wide applications in various fields of computer science. The ease and efficiency of the bloom filter made it a popular choice as a filter for large datasets. The use of the bloom filter is not only limited to software applications, but the bloom filter also finds its place in the database systems and kernels.

  • The bloom filter is used to store and track the malicious URLs for safe browsing in Chrome2
  • Bloom filters are used in Google’s BigTable3, Apache Cassandra4, Apache HBase5, and Postgresql use Bloom filters to reduce the disk lookups for non-existent rows or columns.
  • Quora implemented a shared bloom filter in the feed backend to filter out stories that people have seen before.
  • Akamai & Facebook use bloom filters to avoid caching the items that are very rarely searched or searched only once.

Real life example

To help you with the extensible use of bloom filters and make it more clear for you, I found an excellent real-life use case written by Tarun Sharma, Ph.D. @ Caltech:

Let us say I work for Google, in the Chrome team, and I want to add a feature to the browser which notifies the user if the URL he has entered is malicious. So I have a dataset of about 1 million malicious URLs, the size of this file is around 25MB. Since the size is quite big, (big in comparison to the size of the browser itself), I store this data on a remote server.

Case 1: I use a hash function with a hash table. I decide on an efficient hashing function and run all the 1 million URLs through the hashing function to get hash keys. I then make a hash table (an array), where the hash key would give me the index to place that URL. So now once I have hashed and filled the hashing table, I check its size. I have stored all 1 million URLs in the hash table along with their keys. So the size is at least 25 MB. This hash table, due to its size will be stored on a remote server. When a user comes along and enters a URL in the address bar, I need to check if it was malicious. Thus I run the URL through the hash function (the browser itself can do this) and I get a hash key for that URL. I now have to request my remote server with that hash key, to check if the particular URL in my hash table with that particular key, is the same as what the user has entered. If yes then it is malicious and if no then it is not malicious. Thus every time the user enters a URL, a request to the remote server has to be made to check if it is a malicious URL. This would take a lot of time and thus make my browser slow.

Case 2: I use a bloom filter. The entire list of 1 million URLs is run through the bloom filter using multiple hash functions and the respective positions are marked as 1, in a huge array of 0s. Let’s say we want a false positive rate of 1%, using a bloom filter calculator http://hur.st/bloomfilter?n=1000000&p=0.01), we get the size of the bloom filter required as only 1.13 MB. This small size is expected as, even though the size of the array is huge, we are only storing 1s or 0s and not the URLs as in the case of the hash table. This array can be treated as a bit array. That is, since we have only two values 1 and 0, we can set individual bits instead of bytes. This would reduce the space taken by 8 times. This 1.13 MB bloom filter, due to its small size, can be stored in the web browser itself !! Thus when a user comes along and enters a URL, we simply apply the required hash functions (in the browser itself), and check all the positions in the bloom filter (which is stored in the browser). A value of 0 in any of the positions tells us that this URL is DEFINITELY NOT in the list of malicious URLs and the user can proceed freely. Thus we did not make a call to the server and hence save time. A value of 1 tells us that the URL MIGHT be in the list of malicious URLs. In these cases, we make a call to the remote server, and over there we can use some other hash function with some hash table as in the first case to retrieve and check if the URL is present. Since most of the time, a URL is not likely to be a malicious one, the small bloom filter in the browser figures that out and hence saves time by avoiding calls to the remote server. Only in some cases, if the bloom filter tells us that the URL MIGHT be malicious, only those cases we make a call to the server. That ‘MIGHT’ is 99% right.

So by using a small bloom filter in the browser, we have saved a lot of time as we do not need to make server calls for every URL entered.

Variations

There are a few common variations of the Bloom Filter, a list of which you can see in details in this Github repository by Tyler Treat. I have extracted out the most common ones from his repository.

Scalable Bloom Filter

A Scalable Bloom Filter (SBF)6 dynamically adapts to the size of the data set while enforcing a tight upper bound on the rate of false positives and a false-negative probability of zero. This works by adding Bloom filters with geometrically decreasing false-positive rates as filters become full. A tightening ratio, r, controls the filter growth. The compounded probability over the whole series converges to a target value, even accounting for an infinite series.

Scalable Bloom Filters are useful for cases where the size of the data set isn’t known a priori and memory constraints aren’t of particular concern. For situations where memory is bounded, consider using Inverse or Stable Bloom Filters.

Inverse Bloom Filter

An Inverse Bloom Filter7, or “the opposite of a Bloom filter”, is a concurrent, probabilistic data structure used to test whether an item has been observed or not. It was originally described and written by Jeff Hodges7 The Inverse Bloom Filter may report a false negative but can never report a false positive. That is, it may report that an item has not been seen when it actually has, but it will never report an item as seen which it hasn’t come across. This behaves in a similar manner to a fixed-size hashmap which does not handle conflicts.

This structure is particularly well-suited to streams in which duplicates are relatively close together. It uses a CAS-style approach, which makes it thread-safe.

Counting Bloom Filter

A Counting Bloom Filter (CBF)8 provides a way to remove elements by using an array of n-bit buckets. When an element is added, the respective buckets are incremented. To remove an element, the respective buckets are decremented. A query checks that each of the respective buckets are non-zero. Because CBFs allow elements to be removed, they introduce a non-zero probability of false negatives in addition to the possibility of false positives.

Counting Bloom Filters are useful for cases where elements are both added and removed from the data set. Since they use n-bit buckets, CBFs use roughly n-times more memory than traditional Bloom filters. This is an implementation of a Counting Bloom Filter as described by Fan, Cao, Almeida, and Broder in Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol.

Cuckoo Filter

The Cuckoo Filter9 as described by Andersen, Kaminsky, and Mitzenmacher is similar to the Counting Bloom Filter in that it supports adding and removing elements, but it does so in a way that doesn’t significantly degrade space and performance.

It works by using a cuckoo hashing scheme for inserting items. Instead of storing the elements themselves, it stores their fingerprints which also allows for item removal without false negatives (if you don’t attempt to remove an item not contained in the filter).

For applications that store many items and target moderately low false-positive rates, cuckoo filters have lower space overhead than space-optimized Bloom filters.

Limitations

There are some of the limitations of Bloom Filters are:

  • Keys cannot be removed once added.
  • Since the bloom filter does not store the keys, we cannot search or iterate over the collection of the keys.
  • False positives increase with increasing data in the bloom filter.
  • Cost of addition and lookup, it’s directly proportional to the hash functions used in the filter. The insertion and lookup order becomes O(k) for k hash functions.

Here’s an interesting read when When Bloom filters don’t bloom.

Wrap Up

This brings us to the end of our discussion about the Bloom Filter. This is indeed an amazing data structure that has such wide adoption and applications in the computer science domain ranging from software applications to distributed storage to operating system kernels. There are more such interesting data structures that I am planning to cover, I hope you enjoyed the read.

Stay healthy, stay blessed!