-
Hi @ben-manes, I am very happy to read your feedback (I should have asked earlier)!

Some context on the simulator

Yes, there are discrepancies between simulators, e.g., whether (1) the object size is considered, (2) changes to the object size are considered, (3) metadata size is accounted for, and (4) the insertion happens before or after eviction (not a complete list). That's why we chose to implement all algorithms within libCacheSim. I agree that implementing an algorithm correctly is non-trivial. Our co-author @yazhuo found two bugs in some open-source LIRS implementations. Our early TinyLFU implementation also had bugs (maybe we still have some...). Our W-TinyLFU is implemented by @ziyueqiu, so I am looping her in.

Some context on the simulations

In my earlier work, I noticed that object size is very important for many workloads, and size-aware algorithms significantly outperform the ones that are not size-aware when looking at the request miss ratio. In this work, I ignored the object size to focus on the access patterns. You can add …
-
The results you shared

First, S3-FIFO is not the best algorithm for every trace, and it is not always better than TinyLFU, as we showed.

trace.csv

This is the same trace as … On this trace, W-TinyLFU is indeed better than S3-FIFO when considering object size, and you can verify this by adding TinyLFU to the algorithm list. (Note that the TinyLFU results are from libCacheSim, which is different from Caffeine.)
However, when not considering object size, W-TinyLFU is worse than S3-FIFO.
Similar results can be found on other traces. The results in the blog are not a complete picture of what we have performed, and I am happy to share the paper with you. Where do you see the 10% difference between S3-FIFO and W-TinyLFU? I don't think that is true in general; the confusion could come from a mistake on my part, and I am happy to correct it. BTW, the first two traces in your benchmark are not used in our benchmark. Both traces are just tiny subsets of longer (one-week-long) traces.
-
Discussions

Adaptive window size

I believe an adaptive window size will improve the efficiency of TinyLFU on the tail traces, but I am not sure about the impact on the mean. We will probably look into this in the future. My guess is that, similar to the adaptive algorithm ARC, adding adaptivity hurts the mean efficiency a bit.

Multi-threading

Directly comparing the throughput of Caffeine and Segcache is not fair because the throughput we reported for Segcache includes reading from the traces. I agree that decoupling cache management from data access is an increasingly popular way to improve scalability, but it is unclear whether the management can keep up with the data accesses when the throughput is very high. It is an effective approach, but I would view it as a compromise made to improve scalability (a similar solution in some systems is to use a try-lock). Moreover, decoupling is non-trivial in many systems and can introduce bugs (compared to a simple eviction algorithm).

BTW, we are working on the camera-ready of the paper, and I will add a discussion of the dynamic window size in TinyLFU (since we did not compare with it). If you would like to take a look and give more feedback, I am more than happy to send you the draft.
-
Sorry for the late reply. I had hoped to put time in this weekend towards a thoughtful response, but, like the work week, it was too busy to sneak away for this fun stuff. Hopefully next weekend I will be able to do additional analysis and provide some commentary.
-
I attempted to reproduce the hit rate results in your blog article, where you observed that TinyLFU had lower performance than S3Fifo. I ran your workloads against the reference implementation, Caffeine.
Background
Caffeine uses the adaptive windowing technique, so this does not fully discount your observations. The paper recommended a 1% window as a starting point, since that works well in many typical workloads like database, search, and analytics. It concludes by showing workloads where a larger window is required and leaves dynamically adjusting it for future work. That subsequent work evaluating adaptive approaches does appear to correct this deficiency, and the adaptive version is what is most often used in practice (available in Java, Rust, Go, Python, TypeScript).
Caffeine starts with a 1% window, monitors the hit rate over the TinyLFU sample period, and uses hill climbing to adjust the window size. The number of frequency counters is not known upfront, so, like a hash table, the capacity grows based on the entry count. This resizing loses the popularity scores, and the policy may have started at a poor window size. Thus, like a JIT compiler's warmup, the hit rate may be lower early in a run but quickly reaches peak performance, so this is noise except in very short runs. The admission decisions vary slightly between runs due to jitter introduced to mitigate hash flooding attacks.
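To make the mechanism concrete, here is a rough sketch of hit-rate hill climbing. This is not Caffeine's code; the class name, sample period, initial step size, and decay factor are all invented for illustration.

```java
// Rough sketch of hit-rate hill climbing, not Caffeine's implementation.
// The class name, sample period, step size, and decay factor are invented.
final class WindowClimber {
  private static final int SAMPLE_PERIOD = 100_000; // requests between adjustments
  private double step = 0.05;            // fraction of capacity to shift per step
  private double previousHitRate = 0.0;  // hit rate of the last sample period
  private long hits, requests;

  /** Records one access; returns the window change to apply (0 between samples). */
  double record(boolean hit) {
    requests++;
    if (hit) {
      hits++;
    }
    if (requests < SAMPLE_PERIOD) {
      return 0.0;
    }
    double hitRate = (double) hits / requests;
    if (hitRate < previousHitRate) {
      step = -step;        // the last move hurt, so reverse direction
    }
    step *= 0.98;          // decay the step so the climber settles
    previousHitRate = hitRate;
    hits = 0;
    requests = 0;
    return step;           // positive grows the window, negative shrinks it
  }
}
```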
Analysis
I ran the two simulators side by side, comparing the LRU hit rates to ensure that the traces are being evaluated correctly. Caffeine's simulator does not have rich variable size support (entry weight), as this is less common in Java because an object's heap usage is mostly hidden from the developer. There is a fork that extends support in order to evaluate size-aware TinyLFU strategies (research paper), where they proposed an adaptation to further improve the hit rate. I hope to someday evaluate that proposal and incorporate the improvements.
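For reference, the library itself does support weighted entries through its builder (it is the simulator that lacks rich size-aware policies). A minimal usage example, with an illustrative class name and weight function:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class WeightedCacheExample {
  public static void main(String[] args) {
    // The cache is bounded by the sum of entry weights (bytes here),
    // not by the number of entries.
    Cache<String, byte[]> cache = Caffeine.newBuilder()
        .maximumWeight(1L << 30)                             // ~1 GiB of weight
        .weigher((String key, byte[] value) -> value.length) // per-entry weight
        .build();
    cache.put("example", new byte[1024]);
  }
}
```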
I used libcachesim at 0f4d135 (current master) and Caffeine at cb5f75d (current master) with this patch to include the new trace formats.
In some traces the two simulators report FIFO and LRU results that differ by up to 0.25%. I determined that this was because libcachesim does not update the entry's size on a cache hit but does record the byte hit with the new weight.
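To illustrate the discrepancy, here is a toy sketch (neither simulator's actual code; the names are invented) of the two ways a simulator can handle a hit whose object size has changed:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the discrepancy, not libCacheSim or Caffeine code.
// If the stored size is not refreshed on a hit, the occupancy (and thus later
// evictions) is computed from a stale weight, even though the byte-hit counter
// was credited with the new weight.
final class SizeUpdateDemo {
  private final Map<Long, Long> sizeOf = new HashMap<>(); // entry is assumed present on a hit
  private long usedBytes;
  private long byteHits;

  void onHit(long key, long newSize, boolean updateSizeOnHit) {
    byteHits += newSize;                  // both variants credit the new weight
    if (updateSizeOnHit) {
      long oldSize = sizeOf.put(key, newSize);
      usedBytes += newSize - oldSize;     // keep the occupancy consistent
    }
    // otherwise the entry keeps its stale size until it is evicted
  }
}
```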
trace.csv
./cachesim ../../data/trace.csv csv fifo,lru,qdlp,s3fifo,sieve 1gb -t "time-col=2, obj-id-col=5, obj-size-col=4"
Caffeine leads Sieve (the best algorithm) by 3.60% HR / 2.19% BHR, and S3Fifo by 8.85% HR / 9.12% BHR.
twitter_cluster52.csv
./cachesim ../../data/twitter_cluster52.csv csv fifo,lru,qdlp,s3fifo,sieve 1mb -t "time-col=1, obj-id-col=2, obj-size-col=3"
Caffeine trails S3Fifo (the best algorithm) by 0.75% HR / 0.90% BHR. That is very different from the 10% deficit that was observed in the blog article.
hm_0.csv.gz
./cachesim ../../data/hm_0.csv csv fifo,lru,qdlp,s3fifo,sieve 1gb -t "time-col=2, obj-id-col=5, obj-size-col=6"
Caffeine leads QDLP (the best algorithm) by 0.38% HR / 0.45% BHR, and S3Fifo by 0.43% HR / 0.75% BHR.
Multithreading
The article states that LRU is a scalability bottleneck due to reordering under a lock. This assertion might be misleading because many caches decouple their eviction policy from the way that they manage concurrency, rendering LRU's characteristics here irrelevant. This allows the cache designer to consider a broader set of algorithms and data structures that may allow for better time and space efficiency, or that enable additional features. For example, Caffeine scaled linearly to 380M reads/s on 16 cores (2015 hardware). In comparison, Segcache's FIFO policy was reported at 70M reads/s on 24 cores (2020 hardware). The benchmarks may differ, but this shows that the policy's concurrency does not need to be a bottleneck, and designers can focus on other areas to optimize.
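For illustration, here is a minimal sketch of that decoupling, assuming a simple LRU replayed from a bounded read buffer under a try-lock. It is not Caffeine's implementation; the class, field names, and buffer size are invented, and eviction itself is omitted for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of decoupling the eviction policy from concurrent reads:
// lookups go straight to a concurrent hash table, the access is recorded into
// a bounded buffer, and the policy is replayed later under a try-lock.
final class DecoupledCache<K, V> {
  private final Map<K, V> data = new ConcurrentHashMap<>();
  private final ArrayBlockingQueue<K> readBuffer = new ArrayBlockingQueue<>(1024);
  private final ReentrantLock policyLock = new ReentrantLock();
  private final LinkedHashMap<K, Boolean> lru =
      new LinkedHashMap<>(16, 0.75f, true); // access-ordered LRU bookkeeping

  V getIfPresent(K key) {
    V value = data.get(key);
    if (value != null) {
      // Record the access without touching the policy; dropping a recording
      // on overflow is acceptable because the policy is only a heuristic.
      readBuffer.offer(key);
      tryDrain();
    }
    return value;
  }

  void put(K key, V value) {
    data.put(key, value);
    policyLock.lock();
    try {
      lru.put(key, Boolean.TRUE); // writes update the policy directly
    } finally {
      policyLock.unlock();
    }
  }

  private void tryDrain() {
    if (policyLock.tryLock()) { // one thread replays the buffer; others skip
      try {
        K key;
        while ((key = readBuffer.poll()) != null) {
          lru.get(key); // replay the access to refresh the LRU ordering
        }
      } finally {
        policyLock.unlock();
      }
    }
  }
}
```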
Conclusion
In the workloads that were considered exemplary by this project and used for its policy designs, we can observe that adaptive W-TinyLFU (Caffeine) demonstrates competitive performance. It is important to compare reference implementations to avoid accidental differences, and I have not attempted to debug your implementation to explain the degradation. I believe any static configuration will underperform in some scenario, and that designers should continue to explore ways to make more effective and adaptive choices.
P.S. This is an issue since you disabled the discussion tab. You are welcome to close this after reading (self destructs in 10, 9, ...)