-
Hi @ben-manes, I am very happy to read your feedback (I should have asked earlier)!

Some context on the simulator

Yes, there are discrepancies between simulators, e.g., whether (1) the object size is considered, (2) changes to the object size are considered, (3) metadata size is accounted for, and (4) the insertion happens before or after eviction (not a complete list). That's why we chose to implement all algorithms within libCacheSim. I agree that implementing an algorithm correctly is non-trivial. Our co-author @yazhuo found two bugs in some open-source LIRS implementations. Our early TinyLFU implementation also had bugs (maybe we still have some...). Our W-TinyLFU is implemented by @ziyueqiu, so I am looping her in.

Some context on the simulations

In my earlier work, I noticed that object size is very important for many workloads, and size-aware algorithms significantly outperform the ones that are not size-aware when looking at the request miss ratio. In this work, I ignored the object size to focus on the access patterns. You can add …
-
The results you shared

First, S3-FIFO is not the best algorithm for every trace, and it is not always better than TinyLFU, as we showed.

trace.csv

This is the same trace as … On this trace, W-TinyLFU is indeed better than S3-FIFO when considering object size, and you can verify this by adding TinyLFU to the algorithm list. (Note that the TinyLFU results are from libCacheSim, which is different from Caffeine.)
However, when not considering object size, W-TinyLFU is worse than S3-FIFO.
Similar results can be found on other traces. The results in the blog are not a complete picture of what we have performed, and I am happy to share the paper with you. Where do you see the 10% difference between S3-FIFO and W-TinyLFU? I don't think that is true in general; the confusion could come from a mistake on my part, and I am happy to correct it. BTW, the first two traces in your benchmark are not used in our benchmark. Both traces are just tiny subsets of longer (one-week-long) traces.
-
Discussions

Adaptive window size

I believe an adaptive window size will improve the efficiency of TinyLFU on the tail traces, but I am not sure about the impact on the mean. We will probably look into this in the future. My guess is that, similar to the adaptive algorithm ARC, adding adaptivity hurts the mean efficiency a bit.

Multi-threading

Directly comparing the throughput of Caffeine and Segcache is not fair because the throughput we reported for Segcache includes reading from the traces. I agree that decoupling cache management from data access is an increasingly popular way to improve scalability, but it is unclear whether the management can keep up with the data accesses when the throughput is very high. It is an effective approach, but I would view it as a compromise made to improve scalability (a similar solution in some systems is to use a try-lock). Moreover, decoupling is non-trivial in many systems and can introduce bugs (compared to a simple eviction algorithm).

BTW, we are working on the camera-ready of the paper, and I will add a discussion of the dynamic window size in TinyLFU (since we did not compare with it). If you would like to take a look and give more feedback, I am more than happy to send you the draft.
-
Sorry for the late reply. I had hoped to put time in this weekend towards a thoughtful response, but, like the work week, it was too busy to sneak away for this fun stuff. Hopefully next weekend I will be able to do additional analysis and provide some commentary.
-
I attempted to reproduce the hit rate results in your blog article, where you observed that TinyLFU had lower performance than S3Fifo. I ran your workloads against the reference implementation, Caffeine.
Background
Caffeine uses the adaptive windowing technique, so this does not fully discount your observations. The paper recommended a 1% window as a starting point, since that works well in many typical workloads like database, search, and analytics. It concludes by showing workloads where a larger window is required and leaves dynamically adjusting it for future work. That subsequent work evaluating adaptive approaches does appear to correct this deficiency, and the adaptive version is what is most often used in practice (available in Java, Rust, Go, Python, TypeScript).
Caffeine starts with a 1% window, monitors the hit rate over the TinyLFU sample period, and uses hill climbing to adjust the window size. The number of frequency counters is not known upfront, so, like a hash table, the capacity grows based on the entry count. This resizing loses the popularity scores, and the policy may have started at a poor window size. Thus, like a JIT compiler's warmup, the hit rate may be lower early in a run but quickly reaches peak performance, so this is noise except in very short runs. The admission decisions vary slightly between runs due to jitter introduced to mitigate hash flooding attacks.
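To make the mechanism concrete, here is a rough sketch of hit-rate hill climbing. This is not Caffeine's code; the class name, sample period, initial step size, and decay factor are all invented for illustration.

```java
// Rough sketch of hit-rate hill climbing, not Caffeine's implementation.
// The class name, sample period, step size, and decay factor are invented.
final class WindowClimber {
  private static final int SAMPLE_PERIOD = 100_000; // requests between adjustments
  private double step = 0.05;            // fraction of capacity to shift per step
  private double previousHitRate = 0.0;  // hit rate of the last sample period
  private long hits, requests;

  /** Records one access; returns the window change to apply (0 between samples). */
  double record(boolean hit) {
    requests++;
    if (hit) {
      hits++;
    }
    if (requests < SAMPLE_PERIOD) {
      return 0.0;
    }
    double hitRate = (double) hits / requests;
    if (hitRate < previousHitRate) {
      step = -step;        // the last move hurt, so reverse direction
    }
    step *= 0.98;          // decay the step so the climber settles
    previousHitRate = hitRate;
    hits = 0;
    requests = 0;
    return step;           // positive grows the window, negative shrinks it
  }
}
```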
Analysis
I ran the two simulators side by side, comparing the LRU hit rates to ensure that the traces are being evaluated correctly. Caffeine's simulator does not have rich variable size support (entry weight), as this is less common in Java because an object's heap usage is mostly hidden from the developer. There is a fork that extends support in order to evaluate size-aware TinyLFU strategies (research paper), where they proposed an adaptation to further improve the hit rate. I hope to someday evaluate that proposal and incorporate the improvements.
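For reference, the library itself does support weighted entries through its builder (it is the simulator that lacks rich size-aware policies). A minimal usage example, with an illustrative class name and weight function:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class WeightedCacheExample {
  public static void main(String[] args) {
    // The cache is bounded by the sum of entry weights (bytes here),
    // not by the number of entries.
    Cache<String, byte[]> cache = Caffeine.newBuilder()
        .maximumWeight(1L << 30)                             // ~1 GiB of weight
        .weigher((String key, byte[] value) -> value.length) // per-entry weight
        .build();
    cache.put("example", new byte[1024]);
  }
}
```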
I used libcachesim at 0f4d135 (current master) and Caffeine at cb5f75d (current master) with this patch to include the new trace formats.
In some traces the two simulators report FIFO and LRU results that differ by up to 0.25%. I determined that this was because libcachesim does not update the entry's size on a cache hit but does record the byte hit with the new weight.
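To illustrate the discrepancy, here is a toy sketch (neither simulator's actual code; the names are invented) of the two ways a simulator can handle a hit whose object size has changed:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the discrepancy, not libCacheSim or Caffeine code.
// If the stored size is not refreshed on a hit, the occupancy (and thus later
// evictions) is computed from a stale weight, even though the byte-hit counter
// was credited with the new weight.
final class SizeUpdateDemo {
  private final Map<Long, Long> sizeOf = new HashMap<>(); // entry is assumed present on a hit
  private long usedBytes;
  private long byteHits;

  void onHit(long key, long newSize, boolean updateSizeOnHit) {
    byteHits += newSize;                  // both variants credit the new weight
    if (updateSizeOnHit) {
      long oldSize = sizeOf.put(key, newSize);
      usedBytes += newSize - oldSize;     // keep the occupancy consistent
    }
    // otherwise the entry keeps its stale size until it is evicted
  }
}
```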
trace.csv
./cachesim ../../data/trace.csv csv fifo,lru,qdlp,s3fifo,sieve 1gb -t "time-col=2, obj-id-col=5, obj-size-col=4"
Caffeine leads Sieve (the best algorithm) by 3.60% HR / 2.19% BHR, and S3Fifo by 8.85% HR / 9.12% BHR.
twitter_cluster52.csv
./cachesim ../../data/twitter_cluster52.csv csv fifo,lru,qdlp,s3fifo,sieve 1mb -t "time-col=1, obj-id-col=2, obj-size-col=3"
Caffeine trails S3Fifo (the best algorithm) by 0.75% HR / 0.90% BHR. That is very different from the 10% deficit that was observed in the blog article.
hm_0.csv.gz
./cachesim ../../data/hm_0.csv csv fifo,lru,qdlp,s3fifo,sieve 1gb -t "time-col=2, obj-id-col=5, obj-size-col=6"
Caffeine leads QDLP (the best algorithm) by 0.38% HR / 0.45% BHR, and S3Fifo by 0.43% HR / 0.75% BHR.
Multithreading
The article states that LRU is a scalability bottleneck due to reordering under a lock. This assertion might be misleading because many caches decouple their eviction policy from the way that they manage concurrency, rendering LRU's characteristics here irrelevant. This allows the cache designer to consider a broader set of algorithms and data structures that may allow for better time and space efficiency, or that enable additional features. For example, Caffeine scaled linearly to 380M reads/s on 16 cores (2015 hardware). In comparison, Segcache's FIFO policy was reported at 70M reads/s on 24 cores (2020 hardware). The benchmarks may differ, but this shows that the policy's concurrency does not need to be a bottleneck, and designers can focus on other areas to optimize.
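For illustration, here is a minimal sketch of that decoupling, assuming a simple LRU replayed from a bounded read buffer under a try-lock. It is not Caffeine's implementation; the class, field names, and buffer size are invented, and eviction itself is omitted for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of decoupling the eviction policy from concurrent reads:
// lookups go straight to a concurrent hash table, the access is recorded into
// a bounded buffer, and the policy is replayed later under a try-lock.
final class DecoupledCache<K, V> {
  private final Map<K, V> data = new ConcurrentHashMap<>();
  private final ArrayBlockingQueue<K> readBuffer = new ArrayBlockingQueue<>(1024);
  private final ReentrantLock policyLock = new ReentrantLock();
  private final LinkedHashMap<K, Boolean> lru =
      new LinkedHashMap<>(16, 0.75f, true); // access-ordered LRU bookkeeping

  V getIfPresent(K key) {
    V value = data.get(key);
    if (value != null) {
      // Record the access without touching the policy; dropping a recording
      // on overflow is acceptable because the policy is only a heuristic.
      readBuffer.offer(key);
      tryDrain();
    }
    return value;
  }

  void put(K key, V value) {
    data.put(key, value);
    policyLock.lock();
    try {
      lru.put(key, Boolean.TRUE); // writes update the policy directly
    } finally {
      policyLock.unlock();
    }
  }

  private void tryDrain() {
    if (policyLock.tryLock()) { // one thread replays the buffer; others skip
      try {
        K key;
        while ((key = readBuffer.poll()) != null) {
          lru.get(key); // replay the access to refresh the LRU ordering
        }
      } finally {
        policyLock.unlock();
      }
    }
  }
}
```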
Conclusion
In the workloads that were considered exemplary by this project and used for its policy designs, we can observe that adaptive W-TinyLFU (Caffeine) demonstrates competitive performance. It is important to compare reference implementations to avoid accidental differences, and I have not attempted to debug your implementation to explain the degradation. I believe any static configuration will underperform in some scenario, and that designers should continue to explore ways to make more effective and adaptive choices.
P.S. This is an issue since you disabled the discussion tab. You are welcome to close this after reading (self destructs in 10, 9, ...)