Filter rows using a probabilistic filter (bloom filter or cuckoo filter) in Postgres
Demir
I have a large table (~1B rows) with a btree index on a single bytea column (~20GB, but fits in memory).
I'd like to send a query from the client with a probabilistic filter (e.g. a bloom filter or a cuckoo filter) and have the database scan the index, collecting all rows which might match the filter. This should return all possible matches and some number of false positives. The goal is to avoid providing a full list of items requested (for privacy – the client has a list, but can't share it with the server).
I see that Postgres has a bloom module (https://www.postgresql.org/docs/10/bloom.html) for creating bloom filter-based indexes, but I don't think that can be used to test a bloom filter provided in a query against an indexed column.
I imagine the final result looks something like:
SELECT key, value FROM table WHERE matches_filter(key, b'01010101010101');
Is there some way to re-use internal methods from the bloom module here? Or are there any extensions which expose probabilistic filter functions?
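For what it's worth, the membership test a hypothetical matches_filter() would have to perform is straightforward. Here is a minimal client-side sketch in Python of one possible bloom filter layout (k independent hashes derived from SHA-256 with an index prefix; all names and parameters are my own, not anything Postgres provides), just to make the semantics concrete:

```python
import hashlib

def _bit_position(i: int, key: bytes, m: int) -> int:
    # Derive the i-th hash by prefixing the key with the hash index.
    digest = hashlib.sha256(i.to_bytes(4, "big") + key).digest()
    return int.from_bytes(digest, "big") % m

def bloom_add(filter_bits: bytearray, n_hashes: int, key: bytes) -> None:
    """Set the k bits for `key` in the filter (built on the client)."""
    m = len(filter_bits) * 8
    for i in range(n_hashes):
        bit = _bit_position(i, key, m)
        filter_bits[bit // 8] |= 1 << (bit % 8)

def bloom_contains(filter_bits: bytes, n_hashes: int, key: bytes) -> bool:
    """True if `key` *might* be in the set; false positives are possible,
    false negatives are not."""
    m = len(filter_bits) * 8
    for i in range(n_hashes):
        bit = _bit_position(i, key, m)
        if not (filter_bits[bit // 8] >> (bit % 8)) & 1:
            return False
    return True
```

The server-side version would have to evaluate bloom_contains() for every indexed key, which is why this is really a scan over the index rather than an index lookup.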
You can't have a 20GB bytea value, no matter how much RAM you have; the maximum is 1GB. And you can't btree-index values anywhere near that large. It's also not clear why you specify btree when you are talking about new index types anyway.
What you could do is use a well-known hash function (like md5 or sha256) to digest the full value on the client, truncate the digest to a length short enough that collisions are plausible, and send only the truncated value to the server to search with. The server could then send back all matching full-length digests. (Or it could send back the full-length documents, but not if they are >1GB.) That way you would learn whether your document (or at least its hash digest) is already in the database, without the database knowing with certainty which of the colliding entries (if any) was the one you wanted. A regular btree index on the digest column would be sufficient for the indexing part of this.
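To illustrate the truncation step: a sketch of the client side, assuming sha256 and a 3-byte prefix (the prefix length, column name, and table name below are illustrative; you would tune the length to the collision rate you want):

```python
import hashlib

PREFIX_BYTES = 3  # short enough that many rows share a prefix

def truncated_digest(value: bytes, prefix_bytes: int = PREFIX_BYTES) -> bytes:
    # The full digest stays on the client; only the short prefix is sent.
    return hashlib.sha256(value).digest()[:prefix_bytes]

# The server-side query (hypothetical schema: docs(digest bytea, body bytea),
# with a btree index on a stored prefix column) might look like:
#
#   SELECT digest FROM docs WHERE digest_prefix = %s;
#
# The client then compares the returned full digests against its own
# sha256(value) locally to find the real match.
```

With a 3-byte prefix there are only ~16.7M distinct prefixes, so in a 1B-row table each query would match on the order of 60 rows, which is what provides the plausible deniability.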