``Random access'' may be the cause of abnormally slow program operation.



When programming, it is necessary to consider factors such as choosing a language according to the purpose, the beauty of coding, and the cost of calculation. However,

Marek Majkowski , an engineer at Cloudflare, says based on his own experience that in addition to the type of language and computational cost, it is also necessary to pay attention to the memory access characteristics of the CPU.

When Bloom filters don't bloom
https://blog.cloudflare.com/when-bloom-filters-dont-bloom/

In order to investigate IP spoofing attacks that spoof IP addresses, Marek tried to determine whether a packet's geographical route is legitimate from a list of IP addresses like the one in the image. For example, it would not be legitimate for a packet sent by an Italian provider to go through a data center in Brazil, which is far away from Italy.



Marek collected a list of IP addresses and ended up with a list of 1 billion lines. Usually, duplicates are removed from a list using a Bash

one-liner that combines the sort and uniq commands, but it takes too much time to calculate a list this large. In Marek's environment, even a file with a list of 40 million lines and a capacity of 600 MiB would take more than two minutes to process with the script.



That's when I came up with the idea of using

a Bloom filter . The operation of the Bloom filter is basically the same as the ' set type', which is one of the data types used in programming, and it is a computationally efficient data structure that determines whether the same element exists in one data structure. is. Depending on the probability , false positives (determining that data is present even though it is not) may occur, but false negatives (determining that data is not present even though data is present) will not occur.

Like the set type, Bloom filters are limited by the collision of hash functions used and the memory capacity for storing bit arrays. A Bloom filter has four variables: 'n' representing the number of input elements, 'm' representing the memory capacity used by the bit array , 'k' representing the number of hash functions, and the probability of false positives occurring. It is made up of 'p' which stands for 'p'. Using the site below, you can find the appropriate memory capacity and number of hash functions from the specified number of elements and the probability of false positives.

Bloom filter calculator
https://hur.st/bloomfilter/

Marek implemented a program called ``mmuniq-bloom'' in C language using a bloom filter. You can specify four variables as arguments.

cloudflare-blog/mmuniq-bloom.c at master · cloudflare/cloudflare-blog · GitHub
https://github.com/cloudflare/cloudflare-blog/blob/master/2020-02-mmuniq/mmuniq-bloom.c

I immediately calculated the time it would take to process the list using mmuniq-bloom, and found that it was approximately 12 seconds. This is a big improvement compared to the 2 minutes it takes when using the sort and uniq commands...



The 'wc -l' command that outputs the number of line breaks was calculated to complete in about 0.4 seconds. Even though mmuniq-bloom uses a bloom filter with high computational efficiency, Marek was not satisfied with the calculation time of 12 seconds.



When I ran the program without Bloom filter processing, the execution speed was reduced to about 2 seconds. In other words, Bloom filter processing alone takes about 10 seconds.



In order to check the operation of the program, I used

the strace command to monitor system calls , but I did not see any noticeable abnormalities.



Next, when we analyzed the performance using

the perf command, we found that mmuniq-bloom's process_line function consumed 87.2% of CPU resources.



Further analysis revealed that the ``MOV instruction'', an assembler instruction that copies data, was consuming resources.



After compilation, the part of the source code that uses the MOV instruction 'uint64_t v = *p' occupies most of the program's processing cycles.



I also tested it with '

google-perftools ', which is the same performance analysis tool as perf, but the result was that the bitmap_getset function, which includes 'uint64_t v = *p', consumed a lot of resources.



Looking back on the fact that ``wc -l'', which counts line breaks, is fast, mmuniq-bloom is slow, Marek says, ``Bloom filter uses

random access , and its speed is very slow.'' Reach a conclusion.

The speed of one fetch from random access memory (RAM) is said to be 100 nanoseconds, so it would take 32 seconds to read a list of 40 million rows hashed with 8 hash values. It will be.



Also, since the memory capacity used as the bit array of the created Bloom filter was 128MiB, the bit array could not fit into the CPU's L3 cache, which is a high-speed memory. If you check the performance measurement using the perf command, you will see that 'LLC-load-misses', which indicates a load miss of the final level cache, is occurring.



Marek's colleagues often say, ``You can think of the computational speed of modern CPUs as infinite, until you hit the wall of memory read speed .'' Modern CPUs are good at sequential access , which reads a storage medium sequentially from the end, but are weak at random access . Marek says that when using a data structure like a Bloom filter, you need to check whether the data structure will fit in the CPU's cache.

in Software,   , Posted by darkhorse_log