mirror of https://github.com/xmrig/xmrig.git synced 2026-06-20 11:32:41 -04:00

Compare commits

3 Commits

Author SHA1 Message Date
xmrig
3fb851d91d Merge pull request #3820 from aa022/dev
ARM64 RandomX JIT: dataset prefetch + non-temporal loads (+~8% on M4 base)
2026-05-26 00:21:52 +07:00
aa022
9ac373fea5 ARM64 RandomX JIT: drop early dataset prefetch
2026-05-25 18:05:50 +02:00
aa022
978720462d ARM64 RandomX JIT: dataset prefetch + non-temporal loads
Two Apple-silicon-targeted tweaks to the aarch64 RandomX JIT:

- Early dataset prefetch: when readReg2/readReg3 are finalized well before
  the end of the program body, emit the next iteration's dataset-line prefetch
  early to hide more DRAM latency on the serial scalar chain.
- Non-temporal dataset loads: each 64-byte dataset line is read once and never
  reused, so ldp -> ldnp avoids evicting the hot scratchpad, and the prefetch
  hint moves pldl2strm -> pldl1strm to match the longer lead time.

Measured ~8% hashrate gain on Apple M4 base over dev (7eadfdc9).
2026-05-25 13:46:41 +02:00
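
The prefetch idea in the commit message above can be sketched in C. This is an illustration, not code from the repository: `prefetch_streaming`, `fold_lines`, and the precomputed `idx` array are hypothetical names, and in the real JIT the next line's address comes from register values rather than an index list. `__builtin_prefetch` is a GCC/Clang builtin; which hint instruction it lowers to is up to the compiler. C has no portable counterpart to `ldnp`, so the loads here are ordinary ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_WORDS 8 /* one 64-byte dataset line = 8 x 64-bit words */

/* Hypothetical helper: hint that `p` will be read once, soon. */
static inline void prefetch_streaming(const void *p)
{
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p, 0 /* read */, 0 /* no temporal locality */);
#else
    (void)p; /* no prefetch builtin available: do nothing */
#endif
}

/* Visit `count` dataset lines in the order given by `idx`, xor-ing each
 * line into acc[]. The prefetch for line i+1 is issued before line i is
 * consumed, so the memory fetch overlaps with useful work. */
static void fold_lines(const uint64_t *dataset, const size_t *idx,
                       size_t count, uint64_t acc[LINE_WORDS])
{
    assert(dataset != NULL && idx != NULL && acc != NULL);
    for (size_t i = 0; i < count; ++i) {
        if (i + 1 < count)
            prefetch_streaming(dataset + idx[i + 1] * LINE_WORDS);

        const uint64_t *line = dataset + idx[i] * LINE_WORDS;
        for (size_t w = 0; w < LINE_WORDS; ++w)
            acc[w] ^= line[w];
    }
}
```

The earlier the next address is known, the earlier the hint can be issued. That is the motivation the commit message gives for the early prefetch, which the follow-up commit 9ac373fea5 then drops.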


@@ -303,7 +303,7 @@ DECL(randomx_program_aarch64_cacheline_align_mask1):
 add x20, x20, x1
 # Prefetch dataset data
-prfm pldl2strm, [x20]
+prfm pldl1strm, [x20]
 DECL(randomx_program_aarch64_cacheline_align_mask2):
 # Actual mask will be inserted by JIT compiler
@@ -312,16 +312,16 @@ DECL(randomx_program_aarch64_cacheline_align_mask2):
 DECL(randomx_program_aarch64_xor_with_dataset_line):
 # xor integer registers with dataset data
-ldp x20, x19, [x10]
+ldnp x20, x19, [x10]
 eor x4, x4, x20
 eor x5, x5, x19
-ldp x20, x19, [x10, 16]
+ldnp x20, x19, [x10, 16]
 eor x6, x6, x20
 eor x7, x7, x19
-ldp x20, x19, [x10, 32]
+ldnp x20, x19, [x10, 32]
 eor x12, x12, x20
 eor x13, x13, x19
-ldp x20, x19, [x10, 48]
+ldnp x20, x19, [x10, 48]
 eor x14, x14, x20
 eor x15, x15, x19
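
For readers less used to AArch64 assembly, the second hunk can be read as the following C. This is a rough equivalent under stated assumptions, not code from the repository: the function name and the `r[]` array are hypothetical, with `r[0..7]` standing for `x4, x5, x6, x7, x12, x13, x14, x15` in listing order and `line` standing for the address in `x10`. The switch from `ldp` to `ldnp` changes only the cache hint, not the values loaded, so the C is the same before and after the change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C view of the xor_with_dataset_line block: four paired
 * 16-byte loads at offsets 0, 16, 32 and 48, each xor-ed into two of
 * the eight integer registers. */
static void xor_with_dataset_line(uint64_t r[8], const unsigned char *line)
{
    assert(r != NULL && line != NULL);
    for (int pair = 0; pair < 4; ++pair) {
        uint64_t lo, hi;
        /* memcpy keeps the loads well-defined for any alignment */
        memcpy(&lo, line + 16 * pair, sizeof lo);     /* first register of the pair (x20) */
        memcpy(&hi, line + 16 * pair + 8, sizeof hi); /* second register of the pair (x19) */
        r[2 * pair] ^= lo;
        r[2 * pair + 1] ^= hi;
    }
}
```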