In General, A Bloom filter is a
space-efficient probabilistic data structure, conceived by Burton Howard Bloom
in 1970, that is used
to test whether an element is a member of a set. False positive
matches are possible, but false negatives are not – in other words, a query
returns either "possibly in set" or "definitely not in
set".
In Cassandra, Bloom filters are
used to boost the performance of reads. It is non-deterministic algorithms for testing
whether an element is a member of a set. They are
non-deterministic because it is possible to get a false-positive read from a
Bloom filter, but not a false-negative.
Bloom filters work by mapping
the values in a data set into a bit array and condensing a larger data set into a
digest string using a hash function. The digest, by definition, uses a much smaller
amount of memory than the original data would. The filters are stored in memory and are
used to improve performance by reducing the need for disk access on key look-ups.
Disk access is typically much slower than memory access. So, in a way, a Bloom filter is
a special kind of cache.
When a query is performed, the Bloom
filter is checked first before accessing disk. Because false-negatives are not possible,
if the filter indicates that the element does not exist in the set, it
certainly doesn’t; but if the filter thinks that the element is in the set, the
disk is accessed to make sure.
Bloom filters are implemented
by the org.apache.cassandra.utils.BloomFilter
class. Cassandra provides the
ability to increase Bloom filter accuracy (reducing the number of false positives) by
increasing the filter size, at the cost of more memory. This false positive chance is
tuneable per table.
An example of a Bloom filter,
representing the set {x, y, z}. The colored arrows show the positions in the
bit array that each set element is mapped to. The element w is not in the set
{x, y, z}, because it hashes to one bit-array position containing 0. For this
figure, m = 18 and k = 3.
References: wikipedia, Cassandra The Definitive Guide
No comments:
Post a Comment