xmrig/doc/RISCV_PERF_TUNING.md

# RISC-V Performance Optimization Guide

This guide provides comprehensive instructions for optimizing XMRig on RISC-V architectures.

## Build Optimizations

### Compiler Flags Applied Automatically

The CMake build now applies aggressive RISC-V-specific optimizations:

```cmake
# RISC-V ISA with extensions
-march=rv64gcv_zba_zbb_zbc_zbs

# Aggressive compiler optimizations
-funroll-loops              # Unroll loops for ILP (instruction-level parallelism)
-fomit-frame-pointer        # Free up frame pointer register (RISC-V has limited registers)
-fno-common                 # Better code generation for global variables
-finline-functions          # Inline more functions for better cache locality
-ffast-math                 # Relaxed FP semantics (safe for mining)
-flto                       # Link-time optimization for cross-module inlining

# Release build additions
-minline-atomics            # Inline atomic operations for faster synchronization
```

### Optimal Build Command

```bash
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc)
```

**Expected build time**: 5-15 minutes depending on CPU

## Runtime Optimizations

### 1. Memory Configuration (Most Important)

Enable huge pages to reduce TLB misses and fragmentation:

#### Enable 2MB Huge Pages
```bash
# Calculate required huge pages (1 page = 2MB)
# For 2 GB dataset: 1024 pages
# For cache + dataset: 1536 pages minimum
sudo sysctl -w vm.nr_hugepages=2048
```

Verify:
```bash
grep HugePages /proc/meminfo
# Expected: HugePages_Free should be close to nr_hugepages
```

#### Enable 1GB Huge Pages (Optional but Recommended)

```bash
# Run provided helper script
sudo ./scripts/enable_1gb_pages.sh

# Verify 1GB pages are available
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# Should be: >= 1 (one 1GB page)
```

Update config.json:
```json
{
    "cpu": {
        "huge-pages": true
    },
    "randomx": {
        "1gb-pages": true
    }
}
```

### 2. RandomX Mode Selection

| Mode | Memory | Init Time | Throughput | Recommendation |
|------|--------|-----------|-----------|-----------------|
| **light** | 256 MB | 10 sec | Low | Testing, resource-constrained |
| **fast** | 2 GB | 2-5 min* | High | Production (with huge pages) |
| **auto** | 2 GB | Varies | High | Default (uses fast if possible) |

*With optimizations; can be 30+ minutes without huge pages

**For RISC-V, use fast mode with huge pages enabled.**

### 3. Dataset Initialization Threads

Optimal thread count = 60-75% of CPU cores (leaves headroom for OS/other tasks)

```json
{
    "randomx": {
        "init": 4
    }
}
```

Or auto-detect (rewritten for RISC-V):
```json
{
    "randomx": {
        "init": -1
    }
}
```

### 4. CPU Affinity (Optional)

Pin threads to specific cores for better cache locality:

```json
{
    "cpu": {
        "rx/0": [
            { "threads": 1, "affinity": 0 },
            { "threads": 1, "affinity": 1 },
            { "threads": 1, "affinity": 2 },
            { "threads": 1, "affinity": 3 }
        ]
    }
}
```

### 5. CPU Governor (Linux)

Set to performance mode for maximum throughput:

```bash
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Set to performance (requires root)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Verify
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Should output: performance
```

## Configuration Examples

### Minimum (Testing)
```json
{
    "randomx": {
        "mode": "light"
    },
    "cpu": {
        "huge-pages": false
    }
}
```

### Recommended (Balanced)
```json
{
    "randomx": {
        "mode": "auto",
        "init": 4,
        "1gb-pages": true
    },
    "cpu": {
        "huge-pages": true,
        "priority": 2
    }
}
```

### Maximum Performance (Production)
```json
{
    "randomx": {
        "mode": "fast",
        "init": -1,
        "1gb-pages": true,
        "scratchpad_prefetch_mode": 1
    },
    "cpu": {
        "huge-pages": true,
        "priority": 3,
        "yield": false
    }
}
```

## CLI Equivalents

```bash
# Light mode
./xmrig --randomx-mode=light

# Fast mode with 4 init threads
./xmrig --randomx-mode=fast --randomx-init=4

# Benchmark
./xmrig --bench=1M --algo=rx/0

# Benchmark Wownero variant (1 MB scratchpad)
./xmrig --bench=1M --algo=rx/wow

# Mine to pool
./xmrig -o pool.example.com:3333 -u YOUR_WALLET -p x
```

## Performance Diagnostics

### Check if Vector Extensions are Detected

Look for `FEATURES:` line in output:
```
 * CPU:       ky,x60 (uarch ky,x1)
 * FEATURES:  rv64imafdcv zba zbb zbc zbs
```

- `v`: Vector extension (RVV) ✓
- `zba`, `zbb`, `zbc`, `zbs`: Bit manipulation ✓
- If missing, make sure build used `-march=rv64gcv_zba_zbb_zbc_zbs`

### Verify Huge Pages at Runtime

```bash
# Run xmrig with --bench=1M and check output
./xmrig --bench=1M

# Look for line like:
# HUGE PAGES   100%  1 / 1 (1024 MB)
```

- Should show 100% for dataset AND threads
- If less, increase `vm.nr_hugepages` and reboot

### Monitor Performance

```bash
# Run benchmark multiple times to find stable hashrate
./xmrig --bench=1M --algo=rx/0
./xmrig --bench=10M --algo=rx/0
./xmrig --bench=100M --algo=rx/0

# Check system load and memory during mining
while true; do free -h; grep HugePages /proc/meminfo; sleep 2; done
```

## Expected Performance

### Hardware: Orange Pi RV2 (Ky X1, 8 cores @ ~1.5 GHz)

| Config | Mode | Hashrate | Init Time |
|--------|------|----------|-----------|
| Scalar (baseline) | fast | 30 H/s | 10 min |
| Scalar + huge pages | fast | 33 H/s | 2 min |
| RVV (if enabled) | fast | 70-100 H/s | 3 min |

*Actual results depend on CPU frequency, memory speed, and load*

## Troubleshooting

### Long Initialization Times (30+ minutes)

**Cause**: Huge pages not enabled, system using swap
**Solution**:
1. Enable huge pages: `sudo sysctl -w vm.nr_hugepages=2048`
2. Reboot: `sudo reboot`
3. Reduce mining threads to free memory
4. Check available memory: `free -h`

### Low Hashrate (50% of expected)

**Cause**: CPU governor set to power-save, no huge pages, high contention
**Solution**:
1. Set governor to performance: `echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`
2. Enable huge pages
3. Reduce number of mining threads
4. Check system load: `top` or `htop`

### Dataset Init Crashes or Hangs

**Cause**: Insufficient memory, corrupted huge pages
**Solution**:
1. Disable huge pages temporarily: set `huge-pages: false` in config
2. Reduce mining threads
3. Reboot and re-enable huge pages
4. Try light mode: `--randomx-mode=light`

### Out of Memory During Benchmark

**Cause**: Not enough RAM for dataset + cache + threads
**Solution**:
1. Use light mode: `--randomx-mode=light`
2. Reduce mining threads: `--threads=1`
3. Increase available memory (kill other processes)
4. Check: `free -h` before mining

## Advanced Tuning

### Vector Length (VLEN) Detection

RISC-V vector extension variable length (VLEN) affects performance:

```bash
# Check VLEN on your CPU
cat /proc/cpuinfo | grep vlen

# Expected values:
# - 128 bits (16 bytes) = minimum
# - 256 bits (32 bytes) = common
# - 512 bits (64 bytes) = high performance
```

Larger VLEN generally means better performance for vectorized operations.

### Prefetch Optimization

The code automatically optimizes memory prefetching for RISC-V:

```
scratchpad_prefetch_mode: 0 = disabled (slowest)
scratchpad_prefetch_mode: 1 = prefetch.r (default, recommended)
scratchpad_prefetch_mode: 2 = prefetch.w (experimental)
```

### Memory Bandwidth Saturation

If experiencing memory bandwidth saturation (high latency):

1. Reduce mining threads
2. Increase L2/L3 cache by mining fewer threads per core
3. Enable cache QoS (AMD Ryzen): `cache_qos: true`

## Building with Custom Flags

To build with custom RISC-V flags:

```bash
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_C_FLAGS="-march=rv64gcv_zba_zbb_zbc_zbs -O3 -funroll-loops -fomit-frame-pointer" \
      ..
make -j$(nproc)
```

## Future Optimizations

- [ ] Zbk* (crypto) support detection and usage
- [ ] Optimal VLEN-aware algorithm selection
- [ ] Per-core memory affinity (NUMA support)
- [ ] Dynamic thread count adjustment based on thermals
- [ ] Cross-compile optimizations for various RISC-V cores

## References

- [RISC-V Vector Extension Spec](https://github.com/riscv/riscv-v-spec)
- [RISC-V Bit Manipulation Spec](https://github.com/riscv/riscv-bitmanip)
- [RISC-V Crypto Spec](https://github.com/riscv/riscv-crypto)
- [XMRig Documentation](https://xmrig.com/docs)

---

For further optimization, enable RVV intrinsics by replacing `sse2rvv.h` with `sse2rvv_optimized.h` in the build.