RISE RP005 QEMU weekly report 2024-12-11

Milestone 2

Make requested revisions to Paolo's upstream patches.

IN PROGRESS.
This work is now being handled by Craig Blackmore.
We have a further reworked version of the first patch, which addresses the issue of whether an exception during the load/store could lead to an invalid value in the vstart CSR (see this mailing list post) and also the issue raised by Max Chou (see this mailing list post). This is now posted upstream (see this mailing list post). See the detailed discussion below.
The second patch, has been rewritten to address the issue raised in this mailing list post. We do not think our approach is optimal and have posted a RFQ upstream asking for advice (see this mailing list post). Detailed discussion below.

SiFive benchmarks.

COMPLETE.

Patch 1 details

Making the accesses atomic didn't help in the end as vstart does need to point to the exact element on which the trap was taken. We've updated the patch to only access up to element size. The new version bypasses a simple call out to memcpy, instead doing single byte accesses with a lot more indirection/instructions executed.

However, benchmarking the SiFive memcpy suggests the patch is improving performance for length < 12 bytes. This version of the patch, pending submission upstream is commit 1c035a7319 in Craig Blackmore's QEMU fork.

We also experimented with avoiding the memcpy directly in vext_continuos_ldst_host for < 12 bytes and instead doing ldst_host for each element. However that was less beneficial.

Patch 2 details

The current version of the patch is on GitHub commit 8229b8e308 in Craig Blackmore's QEMU fork.

The code is rewritten using atomic load/stores. Now using 2, 4, 8 or 16 byte accesses if there are enough elements (the previous version only used 16 byte accesses). The patch uses functions provided by accel/tcg/ldst_atomicity.c.inc. The way we get at them from the riscv target is a bit dirty. Craig Blackmore has posted to seek advice on how better to do this.

Note. The SiFive benchmarks only use 1 byte element size insns vle8/vse8 so they won't show the benefit from this patch. Jeremy Bennett will create alternative benchmarks to measure the impact.

Milestone 3

Generate TCG Ops for vector whole word load/store.

IN PROGRESS.
We have benchmarked TCG operations generation for the whole register loads/stores using a modified version of vector memcpy. For VLEN=128 this almost doubles performance for the very smallest data sizes (1, 2 bytes), while yielding nearly treble performance for larger data sizes. For VLEN=1024 the speedup ranges from 50% for the very smallest data sizes to 2.5-fold for larger data. The work is in commit e473afe818 in Palo Savini's QEMU fork.
We are currently cleaning up the patch, particularly to ensure that the stat of the CPU is always consistent with RISC-V semantics with respect to the vstart CSR (this may reduce the performance). We will submit this upstream shortly when this work is complete.

Improve first-fault handling for vector load/store helper functions.

IN PROGRESS.
No new work to report this week.

Improve strided load/store helper functions.

IN PROGRESS.
No new work to report this week.

Statistics

SiFive benchmarks: Patch 1 comparison

Baseline commit: 248f9209ed
Patch commit: 9e5145d045
See report-2024-12-05-16-41-57.pdf.

As noted above, there is a performance benefit of up to 30% for memcpy with small data. This is hard to see in the graphs, but is clear in the detailed results.

TCGOp generation for memcpy (modified)

Baseline commit: 25fdf6a6ec
Patch commit: e473afe818
See report-2024-12-09-09-33-28.pdf.

This is a version of the SiFive memcpy benchmark modified to use whole word load/store.

Actions

2024-11-13:

Craig Blackmore (transferred from Paolo Savini) to investigate whether patch #1 may cause exceptions to be taken at the wrong byte address, thereby causing a failure on resumption.
- COMPLETE.
- Updated patch benchmarked and submitted.
Craig Blackmore (transferred from Paolo Savini) to investigate whether patch #2 can use larger memcpy than Max and therefore may offer further improvements.
- ONGOING.
- RFQ posted to improve initial implementation.

Other

Jeremy Bennett, Craig Blackmore and Paolo Savini will be on vacation 23 December to 3 January.

Next meeting 18 December 2024.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

20241211.md

20241211.md

RISE RP005 QEMU weekly report 2024-12-11

Milestone 2

Patch 1 details

Patch 2 details

Milestone 3

Statistics

SiFive benchmarks: Patch 1 comparison

TCGOp generation for memcpy (modified)

Actions

Other

Files

20241211.md

Latest commit

History

20241211.md

File metadata and controls

RISE RP005 QEMU weekly report 2024-12-11

Milestone 2

Patch 1 details

Patch 2 details

Milestone 3

Statistics

SiFive benchmarks: Patch 1 comparison

TCGOp generation for memcpy (modified)

Actions

Other