Is there anything we can do about this when designing algorithms and data-structures? Yes there is a lot we can do. If we perform chunks of work on data that is co-located, and we stride around memory in a predictable fashion, then our algorithms can be many times faster. For example rather than using bucket and chain hash tables, like in the JDK, we can employ hash tables using open-addressing with linear-probing. Rather than using linked-lists or trees with single items in each node, we can store an array of many items in each node. - http://mechanical-sympathy.blogspot.com.au[..]emory-access-patterns-are-important.html
RangeBitmap produces a RoaringBitmap of the indexes which satisfy a predicate, and can take RoaringBitmap parameters as inputs to skip over rows already filtered out. The Streams API code used before is translated into RangeBitmap API calls: - https://richardstartin.github.io/posts/range-predicates