- Paolo Savini (Embecosm)
- Helene Chelin (Embecosm)
- Jeremy Bennett (Embecosm)
- Hugh O'Keeffe (Ashling)
- Nadim Shehayed (Ashling)
- Daniel Barboza (Ventana)
- WP2:
- Get SPECCPU results the optimization of the vle8.v instruction.
- IN PROGRESS: some tests show a little improvement but a few tests fail. We're investigating.
- We fixed a bug appearing when increasing the data size and VLEN above 512 bytes with LMUL 8.
- Optimize the tail bytes of vle8.v.
- Deferred.
- Add optimization for vse8.v.
- The optimization was extended to support stores too. See below.
- Explore optimization through the usage of builtins like
__builtin_memcpy
.- bswap.h contains already examples of this and a useful pool of functions for simd stores and loads.
- we are trying to see if we can optimize 64 bit vle/vse and by extension the others with 128 bit operations.
- We started setting up the infrastructure to work on ARM.
- Get SPECCPU results the optimization of the vle8.v instruction.
- WP2:
- Get SPECCPU results the optimization of the vle8.v/vse8.v instruction.
- Triage the remaining failures.
- Optimize the tail bytes of vle8.v.
- Experiment the 128 bit optimization.
- Get SPECCPU results the optimization of the vle8.v/vse8.v instruction.
Our current set of agreed priorities are taken from the Statement of Work. This has the following priorities, which trade off functionality targeted versus architectures supported.
- vector load/store ops for x86_64 AVX
- vector load/store ops for AArch64/Neon
- vector integer ALU ops for x86_64 AVX
- vector load/store ops for Intel AVX10
For each of these there will be an analysis phase and an optimization phase, leading to the following set of work packages.
- WP0: Infrastructure
- WP1: Analysis of vector load/store ops on x86_64 AVX
- WP2: Optimization of vector load/store ops on x86_64 AVX
- WP3: Analysis of vector load/store ops on AArch64/Neon
- WP4: Optimization of vector load/store ops on AArch64/Neon
- WP5: Analysis of integer ALU ops on x86_64 AVX
- WP6: Optimization of integer ALU ops on x86_64 AVX
- WP7: Analysis of vector load/store ops on Intel AVX10
- WP8: Optimization of vector load/store ops on Intel AVX10
These priorities can be revised by agreement with RISE during the project.
We pushed a pull request to Embecosm's public repo with the changes made so far to optimize the vle*.v
/vse*.v
instructions:
https://github.com/embecosm/rise-rvv-tcg-qemu/pull/1/files
The pull request contains the initial implementation plus the later addition of the extra function pointer used to select the right double word load or store function to speed up the current load/store instruction in the vext_ldst_us
helper function.
GEN_VEXT_LD_US(vle8_v, int8_t, lde_b, lde_d)
GEN_VEXT_LD_US(vle16_v, int16_t, lde_h, lde_d)
GEN_VEXT_LD_US(vle32_v, int32_t, lde_w, lde_d)
GEN_VEXT_LD_US(vle64_v, int64_t, lde_d, lde_d)
...
GEN_VEXT_ST_US(vse8_v, int8_t, ste_b, ste_d)
GEN_VEXT_ST_US(vse16_v, int16_t, ste_h, ste_d)
GEN_VEXT_ST_US(vse32_v, int32_t, ste_w, ste_d)
GEN_VEXT_ST_US(vse64_v, int64_t, ste_d, ste_d)
The pull request also contains a fix to the management of the loop indexes that solves the segmentation fault issue we were having with large data sizes.
This doesn't have an impact on the performance but only on the correctness. We added some correctness checks to the memcpy benchmarks (check that the contents of the destination register and source register coincide) and all the tests now pass. The next steps involve try to optimize beyond 64 bits and extend this optimization to other load/store helper functions.
We fixed the bug was causing the QEMU error for data sizes of 1024 and VLEN=1024, LMUL=8.
Size | VLEN=128,LMUL=1 | VLEN=1024,LMUL=8 |
---|---|---|
1 | 0.92 | 0.93 |
2 | 0.83 | 0.94 |
4 | 1.20 | 0.90 |
8 | 2.31 | 2.14 |
16 | 3.60 | 3.12 |
32 | 3.62 | 4.17 |
64 | 3.66 | 5.23 |
128 | 3.89 | 5.93 |
256 | 3.91 | 6.14 |
512 | 3.99 | 6.49 |
1024 | 4.00 | 6.72 |
2048 | 4.01 | 6.75 |
These results are in ns/instruction.
No changes to the results as the latest patch achieves correctness rather then performance.
length | master | HC | ratio | master | HC | ratio |
---|---|---|---|---|---|---|
1 | 21.13 | 71.49 | 0.30 | 21.12 | 21.41 | 0.99 |
2 | 33.44 | 43.11 | 0.78 | 33.50 | 33.02 | 1.01 |
4 | 62.16 | 72.75 | 0.85 | 62.06 | 58.31 | 1.06 |
8 | 108.88 | 26.50 | 4.11 | 109.00 | 22.00 | 4.95 |
16 | 209.88 | 38.88 | 5.40 | 209.88 | 34.00 | 6.17 |
32 | 210.50 | 55.00 | 3.83 | 396.25 | 72.00 | 5.50 |
64 | 209.00 | 52.00 | 4.02 | 771.50 | 131.50 | 5.87 |
128 | 207.00 | 48.00 | 4.31 | 1523.00 | 250.00 | 6.09 |
256 | 212.00 | 51.00 | 4.16 | 3012.00 | 481.00 | 6.26 |
512 | 210.00 | 50.00 | 4.20 | 5998.00 | 938.00 | 6.39 |
1,024 | 213.00 | 50.00 | 4.26 | 5995.00 | 939.00 | 6.38 |
2,048 | 207.00 | 36.00 | 5.75 | 6000.00 | 772.00 | 7.77 |
4,096 | 210.00 | 38.00 | 5.53 | 5996.00 | 772.00 | 7.77 |
8,192 | 208.00 | 34.00 | 6.12 | 5994.00 | 770.00 | 7.78 |
8,192 | 208.00 | 34.00 | 6.12 | 5994.00 | 770.00 | 7.78 |
We haven't been able to collect all the results we need with SPECCPU and what we have so far will need more analysis but you can see the progress here
2024-06-05
- Paolo Check behaviour of QEMU with tail bytes.
- IN PROGRESS
- Paolo Subscribe to QEMU mailing list.
- COMPLETE
- Paolo Look at the patches from Max Chou.
- Deferred to prioritize debugging.
2024-05-15
- Jeremy to look at impact of masked v unmasked and strided v unstrided on vector operations.
- lower proirity.
2024-05-08
- Jeremy to characterise QEMU floating point performance and file it as a performance regression issue in QEMU GitLab.
- low priority, deferred to prioritize the smoke tests work.
2024-05-01
- Paolo to review the generic issue from Palmer Dabbelt to identify ideas for optimization and benchmarks to reuse.
- Reproduction deferred to prioritize ARM analysis.
- So far we didn't see the execution time difference reported in the issue. Need to check the context.
- The bionic benchmarks may be a useful source of small benchmarks.
- Taken the ARM example: it might be tricky to map each load/store with the right host operations but that's the kind of optimization we are aiming at.
- PAUSED
- Daniel to advise Paolo on best practice for preparing QEMU upstream submissions.
The risk register is held in a shared spreadsheet
We will keep it updated continuously and report any changes each week.
There are no changes to the risk register this week.
Jeremy will be on vacation from the 7th to the 16th of June. Paolo will be on vacation from the 20th to the 24th of June.