The best kind of optimizations are the ones that eliminate the need to do expensive but wasteful work. You would be surprised at how much processing some databases do just to tell you that the thing you queried for doesn’t exist in the database. A Bloom filter (invented in 1970 by Burton Bloom) is a probabilistic data structure that tells you whether something exists in a set, with a tunable level of accuracy: it can report false positives, but never false negatives.
When a new element comes in, its value is hashed, and the hash decides which bit(s) in the bit set are set to record that this value has been seen. Once the bits for two example values, foo and bar, have been set, we can query the Bloom filter to ask whether something has been seen before. The accuracy of a Bloom filter can be adjusted through the size of the bit set and the number of hash functions used. If your application (like a database) stores files on disk, you can drastically reduce file I/O by keeping a Bloom filter that represents the membership of each file and testing the Bloom filters before opening and reading files. If a file’s Bloom filter says the element doesn’t exist in that file, there is no point in reading it.
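The insert and query steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name, the default sizes, and the choice of deriving the hash positions from salted SHA-256 digests are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: a fixed-size bit set plus k hash positions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Both sizes are illustrative defaults, not recommendations.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Assumption: simulate k hash functions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set every bit the item hashes to.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not seen"; True means "probably seen".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("foo")
seen.add("bar")
print(seen.might_contain("foo"))  # True
print(seen.might_contain("baz"))  # usually False; a false positive is possible
```

In the file-per-filter scenario, an application would keep one such filter per file and open a file only when `might_contain` returns true for the key being looked up.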
In a distributed data set with thousands of nodes, querying every node is not very efficient. In some cases you can use a similar optimization to test membership of data on remote nodes and avoid network hops.
Used in the right situations, a Bloom filter gives you a quick check that avoids unnecessary work, which is a nice option to have and can yield good results. Removing an element from this simple Bloom filter is impossible because false negatives are not permitted. One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains the items that have been removed.
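A rough sketch of the second-filter idea follows. The `RemovableBloomFilter` wrapper and the small `BloomFilter` class it uses are hypothetical names introduced for this example, and the salted SHA-256 hashing is an assumption. Note the trade-offs: a false positive in the "removed" filter turns into a false negative for the pair, and a removed item cannot be added back, which is why the removal is one-time only.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter (sizes and hashing are assumptions)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


class RemovableBloomFilter:
    """Simulates one-time removal with a second filter of removed items."""

    def __init__(self):
        self.added = BloomFilter()
        self.removed = BloomFilter()

    def add(self, item):
        self.added.add(item)

    def remove(self, item):
        # Nothing is cleared; the item is only recorded as removed.
        self.removed.add(item)

    def might_contain(self, item):
        return self.added.might_contain(item) and not self.removed.might_contain(item)


f = RemovableBloomFilter()
f.add("foo")
f.add("bar")
f.remove("foo")
print(f.might_contain("foo"))  # False: "foo" is recorded in the removed filter
print(f.might_contain("bar"))  # True, unless "bar" is a false positive in the removed filter
```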
The drawback of this method is the cost of hashing on every insert and update, and keeping the Bloom filter’s bit set up to date has its own difficulties. Sometimes I call myself a “Computer Scientist”, and then I get taken back to school by a 1970s algorithm!

While analyzing the internals of some open source databases, I’ve found that some of them spend a lot of time on the wrong optimizations at the wrong times, resulting in wasteful work. A Bloom filter is backed by a bit set, which can be set to whatever length you want (the length affects accuracy). To query, the element is hashed again, but this time, instead of setting the bits, the filter checks them: if all the bits that would have been set are already set, the Bloom filter returns true, meaning the element has probably been seen before.
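To make the link between bit set length and accuracy concrete, here is a small sketch based on the standard textbook approximation of the false positive rate, (1 - e^(-kn/m))^k, for m bits, k hash functions, and n inserted elements. The function names are made up for this example, and the formula assumes the hash functions spread elements uniformly over the bit set.

```python
import math


def estimated_false_positive_rate(num_bits, num_hashes, num_items):
    """Approximate false positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def optimal_num_hashes(num_bits, num_items):
    """Number of hash functions that minimizes the estimate: (m/n) * ln 2."""
    return max(1, round(num_bits / num_items * math.log(2)))


# Same number of items and hash functions, growing bit set:
# the estimated false positive rate drops as the bit set gets longer.
for num_bits in (1_000, 10_000, 100_000):
    rate = estimated_false_positive_rate(num_bits, num_hashes=3, num_items=100)
    print(num_bits, rate)
```

The estimate is only a guide for sizing; the real rate depends on how well the chosen hash functions distribute the data.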
An element maps to k bits, and although setting any one of those k bits to zero suffices to remove the element, it also results in removing any other elements that happen to map onto that bit.
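The side effect of clearing a bit can be demonstrated with a deliberately tiny bit set. Everything here is an assumption made for illustration: the 16-bit size, the two salted SHA-256 hash positions, and the generated key names. The sketch searches for two keys that share a bit, clears that one bit to "remove" the first key, and then checks both keys.

```python
import hashlib
import itertools

NUM_BITS = 16    # deliberately tiny so that keys collide quickly
NUM_HASHES = 2


def positions(item):
    # Assumption: k positions derived from salted SHA-256 digests.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big")
        % NUM_BITS
        for i in range(NUM_HASHES)
    ]


def find_overlapping_pair():
    """Return two generated keys that map onto at least one common bit."""
    earlier_keys = []
    for n in itertools.count():
        key = f"key{n}"
        for earlier in earlier_keys:
            if set(positions(earlier)) & set(positions(key)):
                return earlier, key
        earlier_keys.append(key)


def clear_one_bit_demo():
    a, b = find_overlapping_pair()
    bits = [0] * NUM_BITS
    for item in (a, b):
        for pos in positions(item):
            bits[pos] = 1
    # Naive removal of `a`: clear a single bit, here one it shares with `b`.
    shared_bit = min(set(positions(a)) & set(positions(b)))
    bits[shared_bit] = 0
    a_found = all(bits[pos] for pos in positions(a))
    b_found = all(bits[pos] for pos in positions(b))
    return a_found, b_found


a_found, b_found = clear_one_bit_demo()
print(a_found)  # False: `a` was removed as intended
print(b_found)  # False: `b` was never removed, so this is a false negative
```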
