Approximate membership query filter

Approximate membership query filters (hereafter, AMQ filters) comprise a group of space-efficient probabilistic data structures that support approximate membership queries. An approximate membership query answers whether an element is in a set or not with a false positive rate of .

Bloom filters are the most known AMQ filter, but there are other AMQ filters that support additional operations or have different space requirements.

AMQ filters have numerous applications, mainly in distributed systems and databases. There, they are often used to avoid network request or I/O operations that result from requesting elements that do not exist.

Approximate membership query problem

edit

The approximate membership query problem is to store information about a set of elements S in a space-efficient way. The goal is to answer queries about whether an element x is in the set S or not, while constraining false positives to a maximal probability  . All AMQ filters support this lookup operation. Dynamic AMQ filters allow insertions at any time whereas static AMQ filters must be rebuilt after inserting additional elements. Some AMQ filters support additional operations such as deleting elements, or merging two filters.

Lookup

edit

An AMQ filter lookup will determine whether an element is definitely not in the set or probably in the set.

In other words, if the filter represents a set S and we are interested in a value s, then the lookup function applied to s behaves as follows:

  • if  : always returns true.
  • if  : returns false with probability  .

A false positive is a lookup of an element that is not part of the set, but where the lookup returns true. The probability of this happening is the false positive rate  . False negatives (the lookup returns false although the element is part of the set) are not allowed for AMQ filters.

Insertion

edit

After an element is inserted the lookup for this element must return true. Dynamic AMQ filters support inserting elements one at a time without rebuilding the data structure. Other AMQ filters have to be rebuilt after each insertion. Those are called static AMQ filters.

False positive rate vs. space

edit

There is a tradeoff between storage size and the false positive rate  . Increasing the storage space reduces the false positive rate. The theoretical lower bound is   bits for each element.[1] Dynamic AMQ filters cannot reach this lower bound. They need at least   bits for   insertions.[2] Different AMQ filters have different ranges of false positive rates and space requirements. Choosing the best AMQ filter depends on the application.

Data structures

edit

There are different ways to solve the approximate membership query problem. The most known data structure are Bloom filters, but there are other data structures that perform better for some false positive rates and space requirements, support additional operations, or have other insertion and lookup times. Below we describe some well known AMQ filters.

Bloom filter

edit

A Bloom filter is a bit array of   bits with   hash functions. Each hash function maps an element to one of the   positions in the array. In the beginning, all bits of the array are set to zero. To insert an element, all hash functions are calculated and all corresponding bits in the array are set to one. To lookup an element, all   hash functions are calculated. If all corresponding bits are set, true is returned. To reduce the false positive rate, the number of hash functions and   can be increased.

Quotient filter

edit

The idea of quotient filters is to hash an element and to split its fingerprint into the   least significant bits called the remainder   and the most significant bits called the quotient  . The quotient determines where in the hash table the remainder is stored. Additional three bits for every slot in the hash table are used to resolve soft collisions (same quotient but different remainders).

The space used by quotient filters is comparable to Bloom filters, but quotient filters can be merged without affecting their false positive rate.

Cuckoo filter

edit

Cuckoo filters are based on cuckoo hashing, but only fingerprints of the elements are stored in the hash table. Each element has two possible locations. The second location is calculated based on the first location and the fingerprint of the element. This is necessary to enable moving already inserted elements if both possible slots for an element are full.

After reaching a load threshold the insertion speed of cuckoo filter degrades. It is possible that an insertion fails, and the table must be rehashed. Whereas Bloom filters have always constant insertion time, but as the load factor increases the false positive rate increases as well.

A cuckoo filter supports deleting elements in the case where we know for certain that the element was in fact previously inserted. This is an advantage over Bloom filters and quotient filters which do not support this operation.

XOR filter

edit

XOR filters[3] are static AMQ filters that are based on a Bloomier filter and use the idea of perfect hash tables. Similar to cuckoo filters, they save fingerprints of the elements in a hash table. The idea is that a query for an element   is true if the XOR of three given hash functions   is the fingerprint of  . While building the hash table, each element is assigned one of its three slots in a way that no other elements are assigned to this slot. After all elements are assigned, we set for each element the value of its slot to the XOR of the two other (not assigned) slots of the element and the fingerprint of the element. This construction algorithm can fail in a way such that no dynamic insertions are possible without rebuilding the hash table. This hash table can be constructed using only   bits per element.

The disadvantage of this filter is that the data structure has to be rebuilt if additional elements are added. They are used in applications where no elements have to added afterwards and space is of importance.

Application

edit

Typical applications of AMQ filters are distributed systems and database systems. The AMQ filter functions as a proxy to the set of keys of a database or remote memory. Before a presumably slow query to the database or to remote memory is performed, the AMQ filter is used to give an approximate answer as to whether the key is in the database or in remote memory. The slow query is only performed when the AMQ filter returns true. Only in the case of a (hopefully rare) false positive is an unnecessary I/O or remote access performed. The applications are numerous and include package and resource routing, P2P networks, and distributed caching.[4]

AMQ filters are often used as an in-memory data structure to avoid expensive disk accesses. One application is Log-structured merge-trees or LSM trees. They have a fast in-memory component and one or multiple components on a disk which are trees themselves. Elements are inserted into the in-memory component until it reaches its maximal size, than the in-memory component is merged with the disk components. To speed up the lookup many LSM trees implement AMQ filters like Bloom filters or quotient filters. Those filters approximate for each component which elements are stored in it. LSM trees are used in databases like Apache AsterixDB, Bigtable, HBase, LevelDB, SQLite4.

Networks offer a lot of applications for AMQ filters. They are used to approximate a set of data that is located on a different servers. In many cases those AMQ filters can be seen as immutable. Even if the set on the remote server changes the AMQ filter is often not updated right away, but some false positives are tolerated. One example of this application is web cache sharing. If a proxy has a cache miss it wants to determine if another proxy has the requested data. Therefore, the proxy must know or at least approximate if another proxy holds the requested web page. This can be archived by periodically broadcasting a static AMQ filter of the URLs of the web pages a proxy has cached instead of broadcasting URL lists. In this setting, false negatives can occur if the cache changed in between periodic updates.

The same concept can be applied to P2P networks. AMQ filters can be used to approximate what is stored at each node of the network. The filter can be filled with ids or keywords of the actual documents of the nodes. False positives only lead to some unnecessary requests. AMQ filters have further applications in P2P networks for example finding the difference or intersection between sets stored on different nodes.

See also

edit

References

edit
  1. ^ Carter; Larry (1978). "Exact and approximate membership testers". Proceedings of the tenth annual ACM symposium on Theory of computing - STOC '78. pp. 59–65. doi:10.1145/800133.804332. S2CID 6465743.
  2. ^ Lovett; Shachar (2010). "A Lower Bound for Dynamic Approximate Membership Data Structures". 2010 IEEE 51st Annual Symposium on Foundations of Computer Science. pp. 797–804. doi:10.1109/FOCS.2010.81. ISBN 978-1-4244-8525-3. S2CID 7904735.
  3. ^ Graf; Lemire (2020). "Xor Filters". ACM Journal of Experimental Algorithmics. 25: 1–16. arXiv:1912.08258. doi:10.1145/3376122. S2CID 209405019.
  4. ^ Broder, Andrei; Mitzenmacher, Michael (2002). "Network Applications of Bloom Filters: A Survey". Internet Mathematics: 636–646. CiteSeerX 10.1.1.20.98.