mirror of https://github.com/xmrig/xmrig.git synced 2025-12-22 21:12:43 -05:00

Added RandomX code here

This commit is contained in:
MoneroOcean
2019-06-18 14:56:52 -07:00
parent fb832d2650
commit d12cf28ebf
123 changed files with 17981 additions and 25 deletions

# RandomX configuration
RandomX has 45 customizable parameters (see table below). We recommend that each project using RandomX select a unique configuration to prevent network attacks from hashpower rental services.
These parameters can be modified in source file [configuration.h](../src/configuration.h).
|parameter|description|default value|
|---------|-----|-------|
|`RANDOMX_ARGON_MEMORY`|The number of 1 KiB Argon2 blocks in the Cache| `262144`|
|`RANDOMX_ARGON_ITERATIONS`|The number of Argon2d iterations for Cache initialization|`3`|
|`RANDOMX_ARGON_LANES`|The number of parallel lanes for Cache initialization|`1`|
|`RANDOMX_ARGON_SALT`|Argon2 salt|`"RandomX\x03"`|
|`RANDOMX_CACHE_ACCESSES`|The number of random Cache accesses per Dataset item|`8`|
|`RANDOMX_SUPERSCALAR_LATENCY`|Target latency for SuperscalarHash (in cycles of the reference CPU)|`170`|
|`RANDOMX_DATASET_BASE_SIZE`|Dataset base size in bytes|`2147483648`|
|`RANDOMX_DATASET_EXTRA_SIZE`|Dataset extra size in bytes|`33554368`|
|`RANDOMX_PROGRAM_SIZE`|The number of instructions in a RandomX program|`256`|
|`RANDOMX_PROGRAM_ITERATIONS`|The number of iterations per program|`2048`|
|`RANDOMX_PROGRAM_COUNT`|The number of programs per hash|`8`|
|`RANDOMX_JUMP_BITS`|Jump condition mask size in bits|`8`|
|`RANDOMX_JUMP_OFFSET`|Jump condition mask offset in bits|`8`|
|`RANDOMX_SCRATCHPAD_L3`|Scratchpad size in bytes|`2097152`|
|`RANDOMX_SCRATCHPAD_L2`|Scratchpad L2 size in bytes|`262144`|
|`RANDOMX_SCRATCHPAD_L1`|Scratchpad L1 size in bytes|`16384`|
|`RANDOMX_FREQ_*` (29x)|Instruction frequencies|multiple values|
Not all of the parameters can be changed safely, and most parameters have some constraints on what values can be selected. Follow the guidelines below.
### RANDOMX_ARGON_MEMORY
This parameter determines the amount of memory needed in the light mode. Memory is specified in KiB (1 KiB = 1024 bytes).
#### Permitted values
Any integer power of 2.
#### Notes
Lower sizes will reduce the memory-hardness of the algorithm.
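The power-of-two constraint can be validated with a standard bit trick. This is a generic sketch, not code from the RandomX source:

```python
def is_power_of_two(n: int) -> bool:
    # A power of two has exactly one bit set, so n & (n - 1) clears it to zero.
    return n > 0 and (n & (n - 1)) == 0

# The default of 262144 KiB (= 2^18, i.e. 256 MiB) passes the check.
assert is_power_of_two(262144)
assert not is_power_of_two(262143)
```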
### RANDOMX_ARGON_ITERATIONS
Determines the number of passes of Argon2 that are used to generate the Cache.
#### Permitted values
Any positive integer.
#### Notes
The time needed to initialize the Cache is proportional to the value of this constant.
### RANDOMX_ARGON_LANES
The number of parallel lanes for Cache initialization.
#### Permitted values
Any positive integer.
#### Notes
This parameter determines how many threads can be used for Cache initialization.
### RANDOMX_ARGON_SALT
Salt value for Cache initialization.
#### Permitted values
Any string of byte values.
#### Note
Every implementation should choose a unique salt value.
### RANDOMX_CACHE_ACCESSES
The number of random Cache accesses per Dataset item.
#### Permitted values
Any integer greater than 1.
#### Notes
This value directly determines the performance ratio between the 'fast' and 'light' modes.
### RANDOMX_SUPERSCALAR_LATENCY
Target latency for SuperscalarHash, in cycles of the reference CPU.
#### Permitted values
Any positive integer.
#### Notes
The default value was tuned so that a high-performance superscalar CPU running at 2-4 GHz will execute SuperscalarHash in roughly the time it takes to load data from RAM (40-80 ns). Using a lower value will make Dataset generation (and light mode) more memory bound, while increasing this value will make Dataset generation (and light mode) more compute bound.
### RANDOMX_DATASET_BASE_SIZE
Dataset base size in bytes.
#### Permitted values
Integer powers of 2 in the range 64 - 4294967296 (inclusive).
#### Note
This constant affects the memory requirements in fast mode. Some values are unsafe depending on other parameters. See [Unsafe configurations](#unsafe-configurations).
### RANDOMX_DATASET_EXTRA_SIZE
Dataset extra size in bytes.
#### Permitted values
Non-negative integer divisible by 64.
#### Note
This constant affects the memory requirements in fast mode. Some values are unsafe depending on other parameters. See [Unsafe configurations](#unsafe-configurations).
### RANDOMX_PROGRAM_SIZE
The number of instructions in a RandomX program.
#### Permitted values
Any positive integer divisible by 8.
#### Notes
Smaller values will make RandomX more DRAM-latency bound, while higher values will make RandomX more compute-bound. Some values are unsafe. See [Unsafe configurations](#unsafe-configurations).
### RANDOMX_PROGRAM_ITERATIONS
The number of iterations per program.
#### Permitted values
Any positive integer.
#### Notes
Time per hash increases linearly with this constant. Smaller values will increase the overhead of program compilation, while larger values may allow more time for optimizations. Some values are unsafe. See [Unsafe configurations](#unsafe-configurations).
### RANDOMX_PROGRAM_COUNT
The number of programs per hash.
#### Permitted values
Any positive integer.
#### Notes
Time per hash increases linearly with this constant. Some values are unsafe. See [Unsafe configurations](#unsafe-configurations).
### RANDOMX_JUMP_BITS
Jump condition mask size in bits.
#### Permitted values
Positive integers. The sum of `RANDOMX_JUMP_BITS` and `RANDOMX_JUMP_OFFSET` must not exceed 16.
#### Notes
This determines the jump probability of the CBRANCH instruction. The default value of 8 results in jump probability of <code>1/2<sup>8</sup> = 1/256</code>. Increasing this constant will decrease the rate of jumps (and vice versa).
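The relationship between the mask parameters and the jump probability can be sketched as follows; the exact mask construction here is illustrative, consistent with the text but not copied from the RandomX source:

```python
from fractions import Fraction

RANDOMX_JUMP_BITS = 8
RANDOMX_JUMP_OFFSET = 8

# Condition mask: JUMP_BITS ones, shifted up by JUMP_OFFSET bits
# (so the condition tests higher, less biased register bits).
mask = ((1 << RANDOMX_JUMP_BITS) - 1) << RANDOMX_JUMP_OFFSET

# For a uniformly random register value, the masked bits match a fixed
# pattern with probability 1/2^JUMP_BITS.
jump_probability = Fraction(1, 2 ** RANDOMX_JUMP_BITS)

assert mask == 0xFF00
assert jump_probability == Fraction(1, 256)
```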
### RANDOMX_JUMP_OFFSET
Jump condition mask offset in bits.
#### Permitted values
Non-negative integers. The sum of `RANDOMX_JUMP_BITS` and `RANDOMX_JUMP_OFFSET` must not exceed 16.
#### Notes
Since the low-order bits of RandomX registers are slightly biased, this offset moves the condition mask to higher bits, which are less biased. Using values smaller than the default may result in a slightly lower jump probability than the theoretical value calculated from `RANDOMX_JUMP_BITS`.
### RANDOMX_SCRATCHPAD_L3
RandomX Scratchpad size in bytes.
#### Permitted values
Any integer power of 2. Must be larger than or equal to `RANDOMX_SCRATCHPAD_L2`.
#### Notes
The default value of 2 MiB was selected to match the typical cache/core ratio of desktop processors. Using a lower value will make RandomX more core-bound, while using larger values will make the algorithm more latency-bound. Some values are unsafe depending on other parameters. See [Unsafe configurations](#unsafe-configurations).
### RANDOMX_SCRATCHPAD_L2
Scratchpad L2 size in bytes.
#### Permitted values
Any integer power of 2. Must be larger than or equal to `RANDOMX_SCRATCHPAD_L1`.
#### Notes
The default value of 256 KiB was selected to match the typical per-core L2 cache size of desktop processors. Using a lower value will make RandomX more core-bound, while using larger values will make the algorithm more latency-bound.
### RANDOMX_SCRATCHPAD_L1
Scratchpad L1 size in bytes.
#### Permitted values
Any integer power of 2. The minimum is 64 bytes.
#### Notes
The default value of 16 KiB was selected to be about half of the per-core L1 cache size of desktop processors. Using a lower value will make RandomX more core-bound, while using larger values will make the algorithm more latency-bound.
### RANDOMX_FREQ_*
Instruction frequencies (per 256 instructions).
#### Permitted values
There are a total of 29 different instructions. The sum of all frequencies must be equal to 256.
#### Notes
Making large changes to the default values is not recommended. The only exceptions are the instruction pairs IROR_R/IROL_R, FADD_R/FSUB_R and FADD_M/FSUB_M, which are functionally equivalent.
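A configuration tool might validate the frequency constraint like this; the frequency values in the example are hypothetical, not the RandomX defaults:

```python
def validate_frequencies(freqs: dict) -> None:
    # All RANDOMX_FREQ_* values together must account for exactly
    # 256 instruction slots per program.
    total = sum(freqs.values())
    if total != 256:
        raise ValueError(f"frequencies sum to {total}, expected 256")

# Hypothetical example values (NOT the RandomX defaults):
validate_frequencies({"IADD_RS": 96, "IMUL_R": 64, "IXOR_R": 64, "ISTORE": 32})
```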
## Unsafe configurations
There are some configurations that are considered 'unsafe' because they affect the security of the algorithm against attacks. If the conditions listed below are not satisfied, the configuration is unsafe and a compilation error is emitted when building the RandomX library.
These checks can be disabled by defining `RANDOMX_UNSAFE` when building RandomX, e.g. by using the `-DRANDOMX_UNSAFE` command line switch with GCC or MSVC. Disabling these checks is not recommended except for testing purposes.
### 1. Memory-time tradeoffs
#### Condition
```
RANDOMX_CACHE_ACCESSES * RANDOMX_ARGON_MEMORY * 1024 + 33554432 >= RANDOMX_DATASET_BASE_SIZE + RANDOMX_DATASET_EXTRA_SIZE
```
Configurations not satisfying this condition are vulnerable to memory-time tradeoffs, which enable efficient mining in light mode.
#### Solutions
* Increase `RANDOMX_CACHE_ACCESSES` or `RANDOMX_ARGON_MEMORY`.
* Decrease `RANDOMX_DATASET_BASE_SIZE` or `RANDOMX_DATASET_EXTRA_SIZE`.
### 2. Insufficient Scratchpad writes
#### Condition
```
(128 + RANDOMX_PROGRAM_SIZE * RANDOMX_FREQ_ISTORE / 256) * (RANDOMX_PROGRAM_COUNT * RANDOMX_PROGRAM_ITERATIONS) >= RANDOMX_SCRATCHPAD_L3
```
Configurations not satisfying this condition are vulnerable to Scratchpad size optimizations due to the low number of writes.
#### Solutions
* Increase `RANDOMX_PROGRAM_SIZE`, `RANDOMX_FREQ_ISTORE`, `RANDOMX_PROGRAM_COUNT` or `RANDOMX_PROGRAM_ITERATIONS`.
* Decrease `RANDOMX_SCRATCHPAD_L3`.
### 3. Program filtering strategies
#### Condition
```
RANDOMX_PROGRAM_COUNT > 1
```
Configurations not satisfying this condition are vulnerable to program filtering strategies.
#### Solution
* Increase `RANDOMX_PROGRAM_COUNT` to at least 2.
### 4. Low program entropy
#### Condition
```
RANDOMX_PROGRAM_SIZE >= 64
```
Configurations not satisfying this condition do not have a sufficient number of instruction combinations.
#### Solution
* Increase `RANDOMX_PROGRAM_SIZE` to at least 64.
### 5. High compilation overhead
#### Condition
```
RANDOMX_PROGRAM_ITERATIONS >= 400
```
Configurations not satisfying this condition have a program compilation overhead exceeding 10%.
#### Solution
* Increase `RANDOMX_PROGRAM_ITERATIONS` to at least 400.
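All five conditions can be checked mechanically against the default parameters from the table at the top of this document. This is a sketch of such a checker, with `FREQ_ISTORE = 16` inferred from the average of 16 stores per 256-instruction program mentioned in the design document:

```python
# Default parameter values from the configuration table.
p = dict(
    CACHE_ACCESSES=8, ARGON_MEMORY=262144,
    DATASET_BASE_SIZE=2147483648, DATASET_EXTRA_SIZE=33554368,
    PROGRAM_SIZE=256, FREQ_ISTORE=16,
    PROGRAM_COUNT=8, PROGRAM_ITERATIONS=2048,
    SCRATCHPAD_L3=2097152,
)

checks = [
    # 1. Memory-time tradeoffs
    p["CACHE_ACCESSES"] * p["ARGON_MEMORY"] * 1024 + 33554432
        >= p["DATASET_BASE_SIZE"] + p["DATASET_EXTRA_SIZE"],
    # 2. Sufficient Scratchpad writes
    (128 + p["PROGRAM_SIZE"] * p["FREQ_ISTORE"] // 256)
        * (p["PROGRAM_COUNT"] * p["PROGRAM_ITERATIONS"]) >= p["SCRATCHPAD_L3"],
    # 3. Program filtering strategies
    p["PROGRAM_COUNT"] > 1,
    # 4. Program entropy
    p["PROGRAM_SIZE"] >= 64,
    # 5. Compilation overhead
    p["PROGRAM_ITERATIONS"] >= 400,
]

assert all(checks)
```

The default configuration satisfies all five conditions, as expected.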

RandomX/doc/design.md

@@ -0,0 +1,530 @@
# RandomX design
To minimize the performance advantage of specialized hardware, a proof of work (PoW) algorithm must achieve *device binding* by targeting specific features of existing general-purpose hardware. This is a complex task because we have to target a large class of devices with different architectures from different manufacturers.
There are two distinct classes of general processing devices: central processing units (CPUs) and graphics processing units (GPUs). RandomX targets CPUs for the following reasons:
* CPUs, being less specialized devices, are more prevalent and widely accessible. A CPU-bound algorithm is more egalitarian and allows more participants to join the network. This is one of the goals stated in the original CryptoNote whitepaper [[1](https://cryptonote.org/whitepaper.pdf)].
* A large common subset of native hardware instructions exists among different CPU architectures. The same cannot be said about GPUs. For example, there is no common integer multiplication instruction for NVIDIA and AMD GPUs [[2](https://github.com/ifdefelse/ProgPOW/issues/16)].
* All major CPU instruction sets are well documented with multiple open source compilers available. In comparison, GPU instruction sets are usually proprietary and may require vendor specific closed-source drivers for maximum performance.
## 1. Design considerations
The most basic idea of a CPU-bound proof of work is that the "work" must be dynamic. This takes advantage of the fact that CPUs accept two kinds of inputs: *data* (the main input) and *code* (which specifies what to perform with the data).
Conversely, typical cryptographic hashing functions [[3](https://en.wikipedia.org/wiki/Cryptographic_hash_function)] do not represent suitable work for the CPU because their only input is *data*, while the sequence of operations is fixed and can be performed more efficiently by a specialized integrated circuit.
### 1.1 Dynamic proof of work
A dynamic proof of work algorithm can generally consist of the following 4 steps:
1) Generate a random program.
2) Translate it into the native machine code of the CPU.
3) Execute the program.
4) Transform the output of the program into a cryptographically secure value.
The actual 'useful' CPU-bound work is performed in step 3, so the algorithm must be tuned to minimize the overhead of the remaining steps.
#### 1.1.1 Generating a random program
Early attempts at a dynamic proof of work design were based on generating a program in a high-level language, such as C or JavaScript [[4](https://github.com/hyc/randprog), [5](https://github.com/tevador/RandomJS)]. However, this is very inefficient for two main reasons:
* High-level languages have a complex syntax, so generating a valid program is relatively slow since it requires the creation of an abstract syntax tree (AST).
* Once the source code of the program is generated, the compiler will generally parse the textual representation back into an AST, which makes the whole process of generating source code redundant.
The fastest way to generate a random program is to use a *logic-less* generator - simply filling a buffer with random data. This of course requires designing a syntaxless programming language (or instruction set) in which all random bit strings represent valid programs.
#### 1.1.2 Translating the program into machine code
This step is inevitable because we don't want to limit the algorithm to a specific CPU architecture. In order to generate machine code as fast as possible, we need our instruction set to be as close to native hardware as possible, while still generic enough to support different architectures. There is not enough time for expensive optimizations during code compilation.
#### 1.1.3 Executing the program
The actual program execution should utilize as many CPU components as possible. Some of the features that should be utilized in the program are:
* multi-level caches (L1, L2, L3)
* μop cache [[6](https://en.wikipedia.org/wiki/CPU_cache#Micro-operation_(%CE%BCop_or_uop)_cache)]
* arithmetic logic unit (ALU)
* floating point unit (FPU)
* memory controller
* instruction level parallelism [[7](https://en.wikipedia.org/wiki/Instruction-level_parallelism)]
* superscalar execution [[8](https://en.wikipedia.org/wiki/Superscalar_processor)]
* out-of-order execution [[9](https://en.wikipedia.org/wiki/Out-of-order_execution)]
* speculative execution [[10](https://en.wikipedia.org/wiki/Speculative_execution)]
* register renaming [[11](https://en.wikipedia.org/wiki/Register_renaming)]
Chapter 2 describes how the RandomX VM takes advantage of these features.
#### 1.1.4 Calculating the final result
Blake2b [[12](https://blake2.net/)] is a cryptographically secure hashing function that was specifically designed to be fast in software, especially on modern 64-bit processors, where it's around three times faster than SHA-3 and can run at a speed of around 3 clock cycles per byte of input. This function is an ideal candidate to be used in a CPU-friendly proof of work.
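Blake2b is widely available in standard libraries; for example, Python's `hashlib` exposes it with a configurable digest size. A generic usage sketch, not RandomX code:

```python
import hashlib

# Blake2b with a 32-byte output; digest_size can be tuned to whatever
# output width a protocol needs (up to 64 bytes).
digest = hashlib.blake2b(b"example input", digest_size=32).hexdigest()
assert len(digest) == 64  # 32 bytes, hex-encoded
```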
For processing larger amounts of data in a cryptographically secure way, the Advanced Encryption Standard (AES) [[13](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard)] can provide the fastest processing speed because many modern CPUs support hardware acceleration of these operations. See chapter 3 for more details about the use of AES in RandomX.
### 1.2 The "Easy program problem"
When a random program is generated, one may choose to execute it only when it's favorable. This strategy is viable for two main reasons:
1. The runtime of randomly generated programs typically follows a log-normal distribution [[14](https://en.wikipedia.org/wiki/Log-normal_distribution)] (also see Appendix C). A generated program may be quickly analyzed and if it's likely to have above-average runtime, program execution may be skipped and a new program may be generated instead. This can significantly boost performance especially in case the runtime distribution has a heavy tail (many long-running outliers) and if program generation is cheap.
2. An implementation may choose to optimize for a subset of the features required for program execution. For example, the support for some operations (such as division) may be dropped or some instruction sequences may be implemented more efficiently. Generated programs would then be analyzed and be executed only if they match the specific requirements of the optimized implementation.
These strategies of searching for programs of particular properties deviate from the objectives of this proof of work, so they must be eliminated. This can be achieved by requiring a sequence of *N* random programs to be executed such that each program is generated from the output of the previous one. The output of the final program is then used as the result.
```
+---------------+ +---------------+ +---------------+ +---------------+
| | | | | | | |
input --> | program 1 | --> | program 2 | --> ... --> | program (N-1) | --> | program N | --> result
| | | | | | | |
+---------------+ +---------------+ +---------------+ +---------------+
```
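The chaining scheme above can be sketched as follows, with a hash function standing in for the (much more complex) program generation and execution; the value of `N` and the use of blake2b here are illustrative only:

```python
import hashlib

def run_program_stub(seed: bytes) -> bytes:
    # Stand-in for: generate a random program from `seed`, execute it,
    # and return its output. A plain hash models that transformation here.
    return hashlib.blake2b(seed, digest_size=64).digest()

def chained_result(initial_input: bytes, n_programs: int = 8) -> bytes:
    out = initial_input
    for _ in range(n_programs):
        # Each program is generated from the output of the previous one,
        # so a miner cannot skip an unfavorable program without
        # discarding the entire chain.
        out = run_program_stub(out)
    return out

result = chained_result(b"block template")
```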
The principle is that after the first program is executed, a miner has to either commit to finishing the whole chain (which may include unfavorable programs) or start over and waste the effort expended on the unfinished chain. Examples of how this affects the hashrate of different mining strategies are given in Appendix A.
Additionally, this chained program execution has the benefit of equalizing the runtime for the whole chain since the relative deviation of a sum of identically distributed runtimes is decreased.
### 1.3 Verification time
Since the purpose of the proof of work is to be used in a trustless peer-to-peer network, network participants must be able to quickly verify if a proof is valid or not. This puts an upper bound on the complexity of the proof of work algorithm. In particular, we set a goal for RandomX to be at least as fast to verify as the CryptoNight hash function [[15](https://cryptonote.org/cns/cns008.txt)], which it aims to replace.
### 1.4 Memory-hardness
Besides pure computational resources, such as ALUs and FPUs, CPUs usually have access to a large amount of memory in the form of DRAM [[16](https://en.wikipedia.org/wiki/Dynamic_random-access_memory)]. The performance of the memory subsystem is typically tuned to match the compute capabilities, for example [[17](https://en.wikipedia.org/wiki/Multi-channel_memory_architecture)]:
* single channel memory for embedded and low power CPUs
* dual channel memory for desktop CPUs
* triple or quad channel memory for workstation CPUs
* six or eight channel memory for high-end server CPUs
In order to utilize the external memory as well as the on-chip memory controllers, the proof of work algorithm should access a large memory buffer (called the "Dataset"). The Dataset must be:
1. larger than what can be stored on-chip (to require external memory)
2. dynamic (to require writable memory)
The maximum amount of SRAM that can be put on a single chip is more than 512 MiB for a 16 nm process and more than 2 GiB for a 7 nm process [[18](https://www.grin-forum.org/t/obelisk-grn1-chip-details/4571)]. Ideally, the size of the Dataset should be at least 4 GiB. However, due to constraints on the verification time (see below), the size used by RandomX was selected to be 2080 MiB. While a single chip can theoretically be made with this amount of SRAM using current technology (7 nm in 2019), the feasibility of such a solution is questionable, at least in the near future.
#### 1.4.1 Light-client verification
While it's reasonable to require >2 GiB for dedicated mining systems that solve the proof of work, an option must be provided for light clients to verify the proof using a much lower amount of memory.
The ratio of memory required for the 'fast' and 'light' modes must be chosen carefully not to make the light mode viable for mining. In particular, the area-time (AT) product of the light mode should not be smaller than the AT product of the fast mode. Reduction of the AT product is a common way of measuring tradeoff attacks [[19](https://eprint.iacr.org/2015/227.pdf)].
Given the constraints described in the previous chapters, the maximum possible performance ratio between the fast and the light verification modes was empirically determined to be 8. This is because:
1. Further increase of the light verification time would violate the constraints set out in chapter 1.3.
2. Further decrease of the fast mode runtime would violate the constraints set out in chapter 1.1, in particular the overhead time of program generation and result calculation would become too high.
Additionally, 256 MiB was selected as the maximum amount of memory that can be required in the light-client mode. This amount is acceptable even for small single-board computers such as the Raspberry Pi.
To keep a constant memory-time product, the maximum fast-mode memory requirement is:
```
8 * 256 MiB = 2048 MiB
```
This can be further increased since the light mode requires additional chip area for the SuperscalarHash function (see chapter 3.4 and chapter 6 of the Specification). Assuming a conservative estimate of 0.2 mm<sup>2</sup> per SuperscalarHash core and DRAM density of 0.149 Gb/mm<sup>2</sup> [[20](http://en.thelec.kr/news/articleView.html?idxno=20)], the additional memory is:
```
8 * 0.2 * 0.149 * 1024 / 8 = 30.5 MiB
```
or 32 MiB when rounded to the nearest power of 2. The total memory requirement of the fast mode can be 2080 MiB with a roughly constant AT product.
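The arithmetic from the two preceding formulas can be verified directly:

```python
# Fast mode: 8x the 256 MiB light-mode requirement.
fast_mode_base_mib = 8 * 256

# Extra allowance for SuperscalarHash chip area, using the estimates from
# the text: 8 cores * 0.2 mm^2 * 0.149 Gb/mm^2, converted from Gb to GiB-
# scale MiB (* 1024 MiB per Gib, / 8 bits per byte).
extra_mib = 8 * 0.2 * 0.149 * 1024 / 8

assert fast_mode_base_mib == 2048
assert round(extra_mib, 1) == 30.5
# Rounded up to 32 MiB, this gives the total of 2048 + 32 = 2080 MiB.
```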
## 2. Virtual machine architecture
This section describes the design of the RandomX virtual machine (VM).
### 2.1 Instruction set
RandomX uses a fixed-length instruction encoding with 8 bytes per instruction. This allows a 32-bit immediate value to be included in the instruction word. The interpretation of the instruction word bits was chosen so that any 8-byte word is a valid instruction. This allows for very efficient random program generation (see chapter 1.1.1).
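The fixed 8-byte encoding can be illustrated with a small decoder. The field layout below (one opcode byte, destination, source and mod bytes, then a 32-bit little-endian immediate) follows the general shape described here, but treat the exact field order as illustrative rather than a copy of the specification:

```python
import struct

def decode_instruction(word: bytes) -> dict:
    # 8-byte instruction word: opcode, dst, src, mod, then a 32-bit
    # little-endian immediate. Any 8-byte value decodes successfully,
    # which is what makes logic-less program generation possible.
    opcode, dst, src, mod, imm32 = struct.unpack("<BBBBI", word)
    return {"opcode": opcode, "dst": dst, "src": src, "mod": mod, "imm32": imm32}

insn = decode_instruction(bytes([0x12, 0x03, 0x05, 0xC0, 0x78, 0x56, 0x34, 0x12]))
assert insn["imm32"] == 0x12345678
```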
### 2.2 Program
The program executed by the VM has the form of a loop consisting of 256 random instructions.
* 256 instructions is long enough to provide a large number of possible programs and enough space for branches. The number of different programs that can be generated is limited to 2<sup>512</sup> = 1.3e+154, which is the number of possible seed values of the random generator.
* 256 instructions is short enough so that high-performance CPUs can execute one iteration in roughly the time it takes to fetch data from DRAM. This is advantageous because it allows Dataset accesses to be synchronized and fully prefetchable (see chapter 2.9).
* Since the program is a loop, it can take advantage of the μop cache [[6](https://en.wikipedia.org/wiki/CPU_cache#Micro-operation_(%CE%BCop_or_uop)_cache)] that is present in some x86 CPUs. Running a loop from the μop cache allows the CPU to power down the x86 instruction decoders, which should help to equalize the power efficiency between x86 and architectures with simple instruction decoding.
### 2.3 Registers
The VM uses 8 integer registers and 12 floating point registers. This is the maximum that can be allocated as physical registers in x86-64, which has the fewest architectural registers among common 64-bit CPU architectures. Using more registers would put x86 CPUs at a disadvantage since they would have to use memory to store VM register contents.
### 2.4 Integer operations
RandomX uses all primitive integer operations that have high output entropy: addition (IADD_RS, IADD_M), subtraction (ISUB_R, ISUB_M, INEG_R), multiplication (IMUL_R, IMUL_M, IMULH_R, IMULH_M, ISMULH_R, ISMULH_M, IMUL_RCP), exclusive or (IXOR_R, IXOR_M) and rotation (IROR_R, IROL_R).
#### 2.4.1 IADD_RS
The IADD_RS instruction utilizes the address calculation logic of CPUs and can be performed in a single hardware instruction by most CPUs (x86 `lea`, arm `add`).
#### 2.4.2 IMUL_RCP
Because integer division is not fully pipelined in CPUs and can be made faster in ASICs, the IMUL_RCP instruction requires only one division per program to calculate the reciprocal. This forces an ASIC to include a hardware divider without giving it a performance advantage during program execution.
#### 2.4.3 ISWAP_R
This instruction can be executed efficiently by CPUs that support register renaming/move elimination.
### 2.5 Floating point operations
RandomX uses double precision floating point operations, which are supported by the majority of CPUs and require more complex hardware than single precision. All operations are performed as 128-bit vector operations, which are also supported by all major CPU architectures.
RandomX uses five operations that are guaranteed by the IEEE 754 standard to give correctly rounded results: addition, subtraction, multiplication, division and square root. All 4 rounding modes defined by the standard are used.
#### 2.5.1 Floating point register groups
The domains of floating point operations are separated into "additive" operations, which use register group F, and "multiplicative" operations, which use register group E. This is done to prevent addition/subtraction from becoming a no-op when a small number is added to a large number. Since the range of the F group registers is limited to around `±3.0e+14`, adding or subtracting a floating point number with absolute value larger than 1 always changes at least 5 fraction bits.
Because the limited range of group F registers would allow the use of a more efficient fixed-point representation (with 80-bit numbers), the FSCAL instruction manipulates the binary representation of the floating point format to make this optimization more difficult.
Group E registers are restricted to positive values, which avoids `NaN` results (such as the square root of a negative number or `0 * ∞`). Division uses only a memory source operand to avoid being optimized into multiplication by a constant reciprocal. The exponent of group E memory operands is set to a value between -255 and 0 to avoid division and multiplication by 0 and to increase the range of numbers that can be obtained. The approximate range of possible group E register values is `1.7E-77` to `infinity`.
Approximate distribution of floating point register values at the end of each program loop is shown in these figures (left - group F, right - group E):
![Imgur](https://i.imgur.com/64G4qE8.png)
*(Note: bins are marked by the left-side value of the interval, e.g. bin marked `1e-40` contains values from `1e-40` to `1e-20`.)*
The small number of F register values at `1e+14` is caused by the FSCAL instruction, which significantly increases the range of the register values.
Group E registers cover a very large range of values. About 2% of programs produce at least one `infinity` value.
To maximize entropy and also to fit into one 64-byte cache line, floating point registers are combined using the XOR operation at the end of each iteration before being stored into the Scratchpad.
### 2.6 Branches
Modern CPUs invest a lot of die area and energy to handle branches. This includes:
* Branch predictor unit [[21](https://en.wikipedia.org/wiki/Branch_predictor)]
* Checkpoint/rollback states that allow the CPU to recover in case of a branch misprediction.
To take advantage of speculative designs, the random programs should contain branches. However, if branch prediction fails, the speculatively executed instructions are thrown away, which results in a certain amount of wasted energy with each misprediction. Therefore we should aim to minimize the number of mispredictions.
Additionally, branches in the code are essential because they significantly reduce the amount of static optimizations that can be made. For example, consider the following x86 instruction sequence:
```asm
...
branch_target_00:
...
xor r8, r9
test r10, 2088960
je branch_target_00
xor r8, r9
...
```
The XOR operations would normally cancel out, but cannot be optimized away due to the branch because the result will be different if the branch is taken. Similarly, the ISWAP_R instruction could be always statically optimized out if it wasn't for branches.
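The cancellation argument can be demonstrated with plain integers:

```python
r8, r9 = 0xDEADBEEF, 0x12345678

# Without the branch, the two XORs cancel and r8 is unchanged...
no_branch = (r8 ^ r9) ^ r9
assert no_branch == r8

# ...but if the branch jumps back between the two XORs, an odd number of
# XORs has executed on this path, and the register value differs:
one_xor = r8 ^ r9
assert one_xor != r8
```

A compiler may only fold the pair of XORs if it can prove the branch between them is never taken, which a runtime-dependent condition prevents.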
In general, random branches must be designed in such a way that:
1. Infinite loops are not possible.
1. The number of mispredicted branches is small.
1. Branch condition depends on a runtime value to disable static branch optimizations.
#### 2.6.1 Branch prediction
Unfortunately, we haven't found a way to utilize branch prediction in RandomX. Because RandomX is a consensus protocol, all the rules must be set out in advance, which includes the rules for branches. Fully predictable branches cannot depend on the runtime value of any VM register (since register values are pseudorandom and unpredictable), so they would have to be static and therefore easily optimizable by specialized hardware.
#### 2.6.2 CBRANCH instruction
RandomX therefore uses random branches with a jump probability of 1/256 and a branch condition that depends on an integer register value. These branches will be predicted as "not taken" by the CPU. Such branches are "free" in most CPU designs unless they are taken. While this doesn't take advantage of the branch predictors, speculative designs will see a significant performance boost compared to non-speculative branch handling - see Appendix B for more information.
The branching conditions and jump targets are chosen in such a way that infinite loops in RandomX code are impossible because the register controlling the branch will never be modified in the repeated code block. Each CBRANCH instruction can jump at most twice in a row. Handling CBRANCH using predicated execution [[22](https://en.wikipedia.org/wiki/Predication_(computer_architecture))] is impractical because the branch is not taken most of the time.
### 2.7 Instruction-level parallelism
CPUs improve their performance using several techniques that utilize instruction-level parallelism of the executed code. These techniques include:
* Having multiple execution units that can execute operations in parallel (*superscalar execution*).
* Executing instructions not in program order, but in the order of operand availability (*out-of-order execution*).
* Predicting which way branches will go to enhance the benefits of both superscalar and out-of-order execution.
RandomX benefits from all these optimizations. See Appendix B for a detailed analysis.
### 2.8 Scratchpad
The Scratchpad is used as read-write memory. Its size was selected to fit entirely into CPU cache.
#### 2.8.1 Scratchpad levels
The Scratchpad is split into 3 levels to mimic the typical CPU cache hierarchy [[23](https://en.wikipedia.org/wiki/CPU_cache)]. Most VM instructions access "L1" and "L2" Scratchpad because L1 and L2 CPU caches are located close to the CPU execution units and provide the best random access latency. The ratio of reads from L1 and L2 is 3:1, which matches the inverse ratio of typical latencies (see table below).
|CPU μ-architecture|L1 latency|L2 latency|L3 latency|source|
|----------------|----------|----------|----------|------|
|ARM Cortex A55|2|6|-|[[24](https://www.anandtech.com/show/11441/dynamiq-and-arms-new-cpus-cortex-a75-a55/4)]|
|AMD Zen+|4|12|40|[[25](https://en.wikichip.org/wiki/amd/microarchitectures/zen%2B#Memory_Hierarchy)]|
|Intel Skylake|4|12|42|[[26](https://en.wikichip.org/wiki/amd/microarchitectures/zen%2B#Memory_Hierarchy)]|
The L3 cache is much larger and located further from the CPU core. As a result, its access latencies are much higher and can cause stalls in program execution.
RandomX therefore performs only 2 random accesses into "L3" Scratchpad per program iteration (steps 2 and 3 in chapter 4.6.2 of the Specification). Register values from a given iteration are written into the same locations they were loaded from, which guarantees that the required cache lines have been moved into the faster L1 or L2 caches.
Additionally, integer instructions that read from a fixed address also use the whole "L3" Scratchpad (Table 5.1.4 of the Specification) because repetitive accesses will ensure that the cache line will be placed in the L1 cache of the CPU. This shows that the Scratchpad level doesn't always directly correspond to the same CPU cache level.
#### 2.8.2 Scratchpad writes
There are two ways the Scratchpad is modified during VM execution:
1. At the end of each program iteration, all register values are written into "L3" Scratchpad (see Specification chapter 4.6.2, steps 9 and 11). This writes a total of 128 bytes per iteration in two 64-byte blocks.
2. The ISTORE instruction does explicit stores. On average, there are 16 stores per program, out of which 2 stores are into the "L3" level. Each ISTORE instruction writes 8 bytes.
The image below shows an example of the distribution of writes to the Scratchpad. Each pixel in the image represents 8 bytes of the Scratchpad. Red pixels represent portions of the Scratchpad that have been overwritten at least once during hash calculation. The "L1" and "L2" levels are on the left side (almost completely overwritten). The right side of the image represents the bottom 1792 KiB of the Scratchpad. Only about 66% of it is overwritten, but the writes are spread uniformly and randomly.
![Imgur](https://i.imgur.com/pRz6aBG.png)
See Appendix D for the analysis of Scratchpad entropy.
#### 2.8.3 Read-write ratio
Programs make, on average, 39 reads (instructions IADD_M, ISUB_M, IMUL_M, IMULH_M, ISMULH_M, IXOR_M, FADD_M, FSUB_M, FDIV_M) and 16 writes (instruction ISTORE) to the Scratchpad per program iteration. An additional 128 bytes are read and written implicitly to initialize and store register values. 64 bytes of data are read from the Dataset per iteration. In total:
* The average amount of data read from memory per program iteration is: 39 * 8 + 128 + 64 = **504 bytes**.
* The average amount of data written to memory per program iteration is: 16 * 8 + 128 = **256 bytes**.
This is close to a 2:1 read/write ratio, which CPUs are optimized for.
### 2.9 Dataset
Since the Scratchpad is usually stored in the CPU cache, only Dataset accesses utilize the memory controllers.
RandomX randomly reads from the Dataset once per program iteration (16384 times per hash result). Since the Dataset must be stored in DRAM, it provides a natural parallelization limit, because DRAM cannot do more than about 25 million random accesses per second per bank group. Each separately addressable bank group allows a throughput of around 1500 H/s.
All Dataset accesses read one CPU cache line (64 bytes) and are fully prefetched. The time to execute one program iteration described in chapter 4.6.2 of the Specification is about the same as typical DRAM access latency (50-100 ns).
#### 2.9.1 Cache
The Cache, which is used for light verification and Dataset construction, is about 8 times smaller than the Dataset. To keep a constant area-time product, each Dataset item is constructed from 8 random Cache accesses.
Because 256 MiB is small enough to be included on-chip, RandomX uses a custom high-latency, high-power mixing function ("SuperscalarHash"), which defeats the benefits of using low-latency memory, and the energy required to calculate SuperscalarHash makes light mode very inefficient for mining (see chapter 3.4).
Using less than 256 MiB of memory is not possible due to the use of tradeoff-resistant Argon2d with 3 iterations. When using 3 iterations (passes), halving the memory usage increases computational cost 3423 times for the best tradeoff attack [[27](https://eprint.iacr.org/2015/430.pdf)].
## 3. Custom functions
### 3.1 AesGenerator1R
AesGenerator1R was designed for the fastest possible generation of pseudorandom data to fill the Scratchpad. It takes advantage of hardware accelerated AES in modern CPUs. Only one AES round is performed per 16 bytes of output, which results in throughput exceeding 20 GB/s in most modern CPUs. While 1 AES round is not sufficient for a good distribution of random values, this is not an issue because the purpose is just to initialize the Scratchpad with random non-zero data.
### 3.2 AesGenerator4R
AesGenerator4R uses 4 AES rounds to generate pseudorandom data for Program Buffer initialization. Since 2 AES rounds are sufficient for full avalanche of all input bits [[28](https://csrc.nist.gov/csrc/media/projects/cryptographic-standards-and-guidelines/documents/aes-development/rijndael-ammended.pdf)], AesGenerator4R provides an excellent output distribution while maintaining very good performance.
The reversible nature of this generator is not an issue since the generator state is always initialized using the output of a non-reversible hashing function (Blake2b).
### 3.3 AesHash1R
AesHash was designed for the fastest possible calculation of the Scratchpad fingerprint. It interprets the Scratchpad as a set of AES round keys, so it's equivalent to AES encryption with 32768 rounds. Two extra rounds are performed at the end to ensure avalanche of all Scratchpad bits in each lane. The output of the AesHash is fed into the Blake2b hashing function to calculate the final PoW hash.
### 3.4 SuperscalarHash
SuperscalarHash was designed to burn as much power as possible while the CPU is waiting for data to be loaded from DRAM. The target latency of 170 cycles corresponds to the usual DRAM latency of 40-80 ns and clock frequency of 2-4 GHz. ASIC devices designed for light-mode mining with low-latency memory will be bottlenecked by SuperscalarHash when calculating Dataset items and their efficiency will be destroyed by the high power usage of SuperscalarHash.
The average SuperscalarHash function contains a total of 450 instructions, out of which 155 are 64-bit multiplications. On average, the longest dependency chain is 95 instructions long. An ASIC design for light-mode mining, with 256 MiB of on-die memory and 1-cycle latency for all operations, will need on average 95 * 8 = 760 cycles to construct a Dataset item, assuming unlimited parallelization. It will have to execute 155 * 8 = 1240 64-bit multiplications per item, which will consume energy comparable to loading 64 bytes from DRAM.
## Appendix
### A. The effect of chaining VM executions
Chapter 1.2 describes why `N` random programs are chained to prevent mining strategies that search for 'easy' programs. RandomX uses a value of `N = 8`.
Let's define `Q` as the ratio of acceptable programs in a strategy that uses filtering. For example `Q = 0.75` means that 25% of programs are rejected.
For `N = 1`, there are no wasted program executions and the only cost is program generation and the filtering itself. The calculations below assume that these costs are zero and the only real cost is program execution. However, this is a simplification because program generation in RandomX is not free (the first program generation requires full Scratchpad initialization), but it describes a best-case scenario for an attacker.
For `N > 1`, the first program can be filtered as usual, but after the program is executed, there is a chance of `1-Q` that the next program should be rejected and we have wasted one program execution.
For `N` chained executions, the chance is only <code>Q<sup>N</sup></code> that all programs in the chain are acceptable. However, during each attempt to find such chain, we will waste the execution of some programs. For `N = 8`, the number of wasted programs per attempt is equal to <code>(1-Q)*(1+2\*Q+3\*Q<sup>2</sup>+4\*Q<sup>3</sup>+5\*Q<sup>4</sup>+6\*Q<sup>5</sup>+7\*Q<sup>6</sup>)</code> (approximately 2.5 for `Q = 0.75`).
Let's consider 3 mining strategies:
#### Strategy I
Honest miner that doesn't reject any programs (`Q = 1`).
#### Strategy II
Miner that uses optimized custom hardware that cannot execute 25% of programs (`Q = 0.75`), but supported programs can be executed 50% faster.
#### Strategy III
Miner that can execute all programs, but rejects 25% of the slowest programs for the first program in the chain. This gives a 5% performance boost for the first program in the chain (this matches the runtime distribution from Appendix C).
#### Results
The table below lists the results for the above 3 strategies and different values of `N`. The columns **N(I)**, **N(II)** and **N(III)** list the number of programs that each strategy has to execute on average to get one valid hash result (this includes programs wasted in rejected chains). Columns **Speed(I)**, **Speed(II)** and **Speed(III)** list the average mining performance relative to strategy I.
|N|N(I)|N(II)|N(III)|Speed(I)|Speed(II)|Speed(III)|
|---|----|----|----|---------|---------|---------|
|1|1|1|1|1.00|1.50|1.05|
|2|2|2.3|2|1.00|1.28|1.02|
|4|4|6.5|4|1.00|0.92|1.01|
|8|8|27.0|8|1.00|0.44|1.00|
For `N = 8`, strategy II will perform at less than half the speed of the honest miner despite having a 50% performance advantage for selected programs. The small statistical advantage of strategy III is negligible with `N = 8`.
### B. Performance simulation
As discussed in chapter 2.7, RandomX aims to take advantage of the complex design of modern high-performance CPUs. To evaluate the impact of superscalar, out-of-order and speculative execution, we performed a simplified CPU simulation. Source code is available in [perf-simulation.cpp](../src/tests/perf-simulation.cpp).
#### CPU model
The model CPU uses a 3-stage pipeline to achieve an ideal throughput of 1 instruction per cycle:
```
(1) (2) (3)
+------------------+ +----------------+ +----------------+
| Instruction | | | | |
| fetch | ---> | Memory access | ---> | Execute |
| + decode | | | | |
+------------------+ +----------------+ +----------------+
```
The 3 stages are:
1. Instruction fetch and decode. This stage loads the instruction from the Program Buffer and decodes the instruction operation and operands.
2. Memory access. If this instruction uses a memory operand, it is loaded from the Scratchpad in this stage. This includes the calculation of the memory address. Stores are also performed in this stage. The value of the address register must be available in this stage.
3. Execute. This stage executes the instruction using the operands retrieved in the previous stages and writes the results into the register file.
Note that this is an optimistically short pipeline that would not allow very high clock speeds. Designs using a longer pipeline would significantly increase the benefits of speculative execution.
#### Superscalar execution
Our model CPU contains two kinds of components:
* Execution unit (EXU) - it is used to perform the actual integer or floating point operation. All RandomX instructions except ISTORE must use an execution unit in the 3rd pipeline stage. All operations are considered to take only 1 clock cycle.
* Memory unit (MEM) - it is used for loads and stores into Scratchpad. All memory instructions (including ISTORE) use a memory unit in the 2nd pipeline stage.
A superscalar design will contain multiple execution or memory units to improve performance.
#### Out-of-order execution
The simulation model supports two designs:
1. **In-order** - all instructions are executed in the order they appear in the Program Buffer. This design will stall if a dependency is encountered or the required EXU/MEM unit is not available.
2. **Out-of-order** - doesn't execute instructions in program order, but an instruction can be executed when its operands are ready and the required EXU/MEM units are available.
#### Branch handling
The simulation model supports two types of branch handling:
1. **Non-speculative** - when a branch is encountered, the pipeline is stalled. This typically adds a 3-cycle penalty for each branch.
2. **Speculative** - all branches are predicted not taken and the pipeline is flushed if a misprediction occurs (probability of 1/256).
#### Results
The following 10 designs were simulated and the average number of clock cycles to execute a RandomX program (256 instructions) was measured.
|design|superscalar config.|reordering|branch handling|execution time [cycles]|IPC|
|-------|-----------|----------|---------------|-----------------------|---|
|#1|1 EXU + 1 MEM|in-order|non-speculative|293|0.87|
|#2|1 EXU + 1 MEM|in-order|speculative|262|0.98|
|#3|2 EXU + 1 MEM|in-order|non-speculative|197|1.3|
|#4|2 EXU + 1 MEM|in-order|speculative|161|1.6|
|#5|2 EXU + 1 MEM|out-of-order|non-speculative|144|1.8|
|#6|2 EXU + 1 MEM|out-of-order|speculative|122|2.1|
|#7|4 EXU + 2 MEM|in-order|non-speculative|135|1.9|
|#8|4 EXU + 2 MEM|in-order|speculative|99|2.6|
|#9|4 EXU + 2 MEM|out-of-order|non-speculative|89|2.9|
|#10|4 EXU + 2 MEM|out-of-order|speculative|64|4.0|
The benefits of superscalar, out-of-order and speculative designs are clearly demonstrated.
### C. RandomX runtime distribution
Runtime numbers were measured on AMD Ryzen 7 1700 running at 3.0 GHz using 1 core. Source code to measure program execution and verification times is available in [runtime-distr.cpp](../src/tests/runtime-distr.cpp). Source code to measure the performance of the x86 JIT compiler is available in [jit-performance.cpp](../src/tests/jit-performance.cpp).
#### Fast mode - program execution
The following figure shows the distribution of the runtimes of a single VM program (in fast mode). This includes: program generation, JIT compilation, VM execution and Blake2b hash of the register file. Program generation and JIT compilation was measured to take 3.6 μs per program.
![Imgur](https://i.imgur.com/ikv2z2i.png)
AMD Ryzen 7 1700 can calculate 625 hashes per second in fast mode (using 1 thread), which means a single hash result takes 1600 μs (1.6 ms). This consists of (approximately):
* 1480 μs for VM execution (8 programs)
* 45 μs for initial Scratchpad fill (AesGenerator1R).
* 45 μs for final Scratchpad hash (AesHash1R).
* 30 μs for program generation and JIT compilation (8 programs)
This gives a total overhead of 7.5% (time per hash spent not executing VM).
#### Light mode - verification time
The following figure shows the distribution of times to calculate 1 hash result using the light mode. Most of the time is spent executing SuperscalarHash to calculate Dataset items (13.2 ms out of 14.8 ms). The average verification time exactly matches the performance of the CryptoNight algorithm.
![Imgur](https://i.imgur.com/VtwwJT8.png)
### D. Scratchpad entropy analysis
The average entropy of the Scratchpad after 8 program executions was approximated using the LZMA compression algorithm:
1. Hash results were calculated and the final scratchpads were written to disk as files with the '.spad' extension (source code: [scratchpad-entropy.cpp](../src/tests/scratchpad-entropy.cpp))
2. The files were compressed using 7-Zip [[29](https://www.7-zip.org/)] in Ultra compression mode: `7z.exe a -t7z -m0=lzma2 -mx=9 scratchpads.7z *.spad`
The size of the resulting archive is approximately 99.98% of the uncompressed size of the scratchpad files. This shows that the Scratchpad retains high entropy during VM execution.
### E. SuperscalarHash analysis
SuperscalarHash is a custom function used by RandomX to generate Dataset items. It operates on 8 integer registers and uses a random sequence of instructions. About 1/3 of the instructions are multiplications.
The following figure shows the sensitivity of SuperscalarHash to changing a single bit of an input register:
![Imgur](https://i.imgur.com/ztZ0V0G.png)
This shows that SuperscalarHash has quite low sensitivity to high-order bits and somewhat decreased sensitivity to the lowest-order bits. Sensitivity is highest for bits 3-53 (inclusive).
When calculating a Dataset item, the input of the first SuperscalarHash depends only on the item number. To ensure a good distribution of results, the constants described in section 7.3 of the Specification were chosen to provide unique values of bits 3-53 for *all* item numbers in the range 0-34078718 (the Dataset contains 34078719 items). All initial register values for all Dataset item numbers were checked to make sure bits 3-53 of each register are unique and there are no collisions (source code: [superscalar-init.cpp](../src/tests/superscalar-init.cpp)). While this is not strictly necessary to get unique output from SuperscalarHash, it's a security precaution that mitigates the non-perfect avalanche properties of the randomly generated SuperscalarHash instances.
## References
[1] CryptoNote whitepaper - https://cryptonote.org/whitepaper.pdf
[2] ProgPoW: Inefficient integer multiplications - https://github.com/ifdefelse/ProgPOW/issues/16
[3] Cryptographic Hashing function - https://en.wikipedia.org/wiki/Cryptographic_hash_function
[4] randprog - https://github.com/hyc/randprog
[5] RandomJS - https://github.com/tevador/RandomJS
[6] μop cache - https://en.wikipedia.org/wiki/CPU_cache#Micro-operation_(%CE%BCop_or_uop)_cache
[7] Instruction-level parallelism - https://en.wikipedia.org/wiki/Instruction-level_parallelism
[8] Superscalar processor - https://en.wikipedia.org/wiki/Superscalar_processor
[9] Out-of-order execution - https://en.wikipedia.org/wiki/Out-of-order_execution
[10] Speculative execution - https://en.wikipedia.org/wiki/Speculative_execution
[11] Register renaming - https://en.wikipedia.org/wiki/Register_renaming
[12] Blake2 hashing function - https://blake2.net/
[13] Advanced Encryption Standard - https://en.wikipedia.org/wiki/Advanced_Encryption_Standard
[14] Log-normal distribution - https://en.wikipedia.org/wiki/Log-normal_distribution
[15] CryptoNight hash function - https://cryptonote.org/cns/cns008.txt
[16] Dynamic random-access memory - https://en.wikipedia.org/wiki/Dynamic_random-access_memory
[17] Multi-channel memory architecture - https://en.wikipedia.org/wiki/Multi-channel_memory_architecture
[18] Obelisk GRN1 chip details - https://www.grin-forum.org/t/obelisk-grn1-chip-details/4571
[19] Biryukov et al.: Tradeoff Cryptanalysis of Memory-Hard Functions - https://eprint.iacr.org/2015/227.pdf
[20] SK Hynix 20nm DRAM density - http://en.thelec.kr/news/articleView.html?idxno=20
[21] Branch predictor - https://en.wikipedia.org/wiki/Branch_predictor
[22] Predication - https://en.wikipedia.org/wiki/Predication_(computer_architecture)
[23] CPU cache - https://en.wikipedia.org/wiki/CPU_cache
[24] Cortex-A55 Microarchitecture - https://www.anandtech.com/show/11441/dynamiq-and-arms-new-cpus-cortex-a75-a55/4
[25] AMD Zen+ Microarchitecture - https://en.wikichip.org/wiki/amd/microarchitectures/zen%2B#Memory_Hierarchy
[26] Intel Skylake Microarchitecture - https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Memory_Hierarchy
[27] Biryukov et al.: Fast and Tradeoff-Resilient Memory-Hard Functions for Cryptocurrencies and Password Hashing - https://eprint.iacr.org/2015/430.pdf Table 2, page 8
[28] J. Daemen, V. Rijmen: AES Proposal: Rijndael - https://csrc.nist.gov/csrc/media/projects/cryptographic-standards-and-guidelines/documents/aes-development/rijndael-ammended.pdf page 28
[29] 7-Zip File archiver - https://www.7-zip.org/
---

**RandomX/doc/program.asm** (983 lines):
randomx_isn_0:
; ISMULH_R r0, r7
mov rax, r8
imul r15
mov r8, rdx
randomx_isn_1:
; IADD_RS r1, r2, SHFT 2
lea r9, [r9+r10*4]
randomx_isn_2:
; ISTORE L1[r6+1506176493], r4
lea eax, [r14d+1506176493]
and eax, 16376
mov qword ptr [rsi+rax], r12
randomx_isn_3:
; IMUL_R r5, r3
imul r13, r11
randomx_isn_4:
; IROR_R r3, r5
mov ecx, r13d
ror r11, cl
randomx_isn_5:
; CBRANCH r7, -1891017657, COND 15
add r15, -1886823353
test r15, 2139095040
jz randomx_isn_0
randomx_isn_6:
; ISUB_M r3, L1[r7-1023302103]
lea eax, [r15d-1023302103]
and eax, 16376
sub r11, qword ptr [rsi+rax]
randomx_isn_7:
; IMUL_R r6, 220479013
imul r14, 220479013
randomx_isn_8:
; IADD_RS r5, r3, -669841607, SHFT 2
lea r13, [r13+r11*4-669841607]
randomx_isn_9:
; IADD_M r3, L3[532344]
add r11, qword ptr [rsi+532344]
randomx_isn_10:
; FADD_R f0, a3
addpd xmm0, xmm11
randomx_isn_11:
; CBRANCH r3, -1981570318, COND 4
add r11, -1981566222
test r11, 1044480
jz randomx_isn_10
randomx_isn_12:
; FSUB_R f0, a1
subpd xmm0, xmm9
randomx_isn_13:
; IADD_RS r1, r6, SHFT 2
lea r9, [r9+r14*4]
randomx_isn_14:
; FSQRT_R e2
sqrtpd xmm6, xmm6
randomx_isn_15:
; CBRANCH r5, -1278791788, COND 14
add r13, -1278791788
test r13, 1069547520
jz randomx_isn_12
randomx_isn_16:
; ISUB_R r3, -1310797453
sub r11, -1310797453
randomx_isn_17:
; IMUL_RCP r3, 2339914445
mov rax, 16929713537937567113
imul r11, rax
randomx_isn_18:
; FADD_R f1, a2
addpd xmm1, xmm10
randomx_isn_19:
; FSUB_R f2, a2
subpd xmm2, xmm10
randomx_isn_20:
; IMUL_R r7, r0
imul r15, r8
randomx_isn_21:
; FADD_M f2, L2[r7-828505656]
lea eax, [r15d-828505656]
and eax, 262136
cvtdq2pd xmm12, qword ptr [rsi+rax]
addpd xmm2, xmm12
randomx_isn_22:
; FDIV_M e1, L1[r1-1542605227]
lea eax, [r9d-1542605227]
and eax, 16376
cvtdq2pd xmm12, qword ptr [rsi+rax]
andps xmm12, xmm13
orps xmm12, xmm14
divpd xmm5, xmm12
randomx_isn_23:
; IMUL_RCP r0, 1878277380
mov rax, 10545322453154434729
imul r8, rax
randomx_isn_24:
; ISUB_R r6, r3
sub r14, r11
randomx_isn_25:
; IMUL_M r1, L1[r3-616171540]
lea eax, [r11d-616171540]
and eax, 16376
imul r9, qword ptr [rsi+rax]
randomx_isn_26:
; FSWAP_R f2
shufpd xmm2, xmm2, 1
randomx_isn_27:
; FSQRT_R e0
sqrtpd xmm4, xmm4
randomx_isn_28:
; IXOR_R r7, r5
xor r15, r13
randomx_isn_29:
; FADD_R f3, a3
addpd xmm3, xmm11
randomx_isn_30:
; FSUB_M f0, L2[r0+1880524670]
lea eax, [r8d+1880524670]
and eax, 262136
cvtdq2pd xmm12, qword ptr [rsi+rax]
subpd xmm0, xmm12
randomx_isn_31:
; IADD_RS r0, r3, SHFT 3
lea r8, [r8+r11*8]
randomx_isn_32:
; FMUL_R e0, a2
mulpd xmm4, xmm10
randomx_isn_33:
; IMUL_M r1, L1[r4-588273594]
lea eax, [r12d-588273594]
and eax, 16376
imul r9, qword ptr [rsi+rax]
randomx_isn_34:
; IADD_M r4, L1[r6+999905907]
lea eax, [r14d+999905907]
and eax, 16376
add r12, qword ptr [rsi+rax]
randomx_isn_35:
; ISUB_R r4, r0
sub r12, r8
randomx_isn_36:
; FMUL_R e0, a3
mulpd xmm4, xmm11
randomx_isn_37:
; ISTORE L1[r4+2027210220], r3
lea eax, [r12d+2027210220]
and eax, 16376
mov qword ptr [rsi+rax], r11
randomx_isn_38:
; FADD_M f1, L2[r3+1451369534]
lea eax, [r11d+1451369534]
and eax, 262136
cvtdq2pd xmm12, qword ptr [rsi+rax]
addpd xmm1, xmm12
randomx_isn_39:
; FMUL_R e1, a1
mulpd xmm5, xmm9
randomx_isn_40:
; FSUB_R f3, a2
subpd xmm3, xmm10
randomx_isn_41:
; IMULH_R r3, r3
mov rax, r11
mul r11
mov r11, rdx
randomx_isn_42:
; ISUB_R r4, r3
sub r12, r11
randomx_isn_43:
; CBRANCH r6, 335851892, COND 5
add r14, 335847796
test r14, 2088960
jz randomx_isn_25
randomx_isn_44:
; IADD_RS r7, r5, SHFT 3
lea r15, [r15+r13*8]
randomx_isn_45:
; CFROUND r6, 48
mov rax, r14
rol rax, 29
and eax, 24576
or eax, 40896
push rax
ldmxcsr dword ptr [rsp]
pop rax
randomx_isn_46:
; IMUL_RCP r6, 2070736307
mov rax, 9565216276746377827
imul r14, rax
randomx_isn_47:
; IXOR_R r2, r4
xor r10, r12
randomx_isn_48:
; IMUL_R r0, r5
imul r8, r13
randomx_isn_49:
; CBRANCH r2, -272659465, COND 15
add r10, -272659465
test r10, 2139095040
jz randomx_isn_48
randomx_isn_50:
; ISTORE L1[r6+1414933948], r5
lea eax, [r14d+1414933948]
and eax, 16376
mov qword ptr [rsi+rax], r13
randomx_isn_51:
; ISTORE L1[r3-1336791747], r6
lea eax, [r11d-1336791747]
and eax, 16376
mov qword ptr [rsi+rax], r14
randomx_isn_52:
; FSCAL_R f1
xorps xmm1, xmm15
randomx_isn_53:
; CBRANCH r6, -2143810604, COND 1
add r14, -2143810860
test r14, 130560
jz randomx_isn_50
randomx_isn_54:
; ISUB_M r3, L1[r1-649360673]
lea eax, [r9d-649360673]
and eax, 16376
sub r11, qword ptr [rsi+rax]
randomx_isn_55:
; FADD_R f2, a3
addpd xmm2, xmm11
randomx_isn_56:
; CFROUND r3, 8
mov rax, r11
rol rax, 5
and eax, 24576
or eax, 40896
push rax
ldmxcsr dword ptr [rsp]
pop rax
randomx_isn_57:
; IROR_R r2, r0
mov ecx, r8d
ror r10, cl
randomx_isn_58:
; IADD_RS r4, r2, SHFT 1
lea r12, [r12+r10*2]
randomx_isn_59:
; CBRANCH r6, -704407571, COND 10
add r14, -704276499
test r14, 66846720
jz randomx_isn_54
randomx_isn_60:
; FSUB_R f1, a3
subpd xmm1, xmm11
randomx_isn_61:
; ISUB_R r3, r7
sub r11, r15
randomx_isn_62:
; FMUL_R e2, a2
mulpd xmm6, xmm10
randomx_isn_63:
; FMUL_R e3, a1
mulpd xmm7, xmm9
randomx_isn_64:
; ISTORE L3[r2+845419810], r0
lea eax, [r10d+845419810]
and eax, 2097144
mov qword ptr [rsi+rax], r8
randomx_isn_65:
; CBRANCH r1, -67701844, COND 5
add r9, -67705940
test r9, 2088960
jz randomx_isn_60
randomx_isn_66:
; IROR_R r3, r1
mov ecx, r9d
ror r11, cl
randomx_isn_67:
; IMUL_R r3, r1
imul r11, r9
randomx_isn_68:
; IROR_R r1, 40
ror r9, 40
randomx_isn_69:
; IMUL_R r3, r0
imul r11, r8
randomx_isn_70:
; IXOR_M r6, L3[1276704]
xor r14, qword ptr [rsi+1276704]
randomx_isn_71:
; FADD_M f0, L1[r1-1097746982]
lea eax, [r9d-1097746982]
and eax, 16376
cvtdq2pd xmm12, qword ptr [rsi+rax]
addpd xmm0, xmm12
randomx_isn_72:
; IMUL_M r7, L1[r2+588700215]
lea eax, [r10d+588700215]
and eax, 16376
imul r15, qword ptr [rsi+rax]
randomx_isn_73:
; IXOR_M r2, L2[r3-1120252909]
lea eax, [r11d-1120252909]
and eax, 262136
xor r10, qword ptr [rsi+rax]
randomx_isn_74:
; FMUL_R e2, a0
mulpd xmm6, xmm8
randomx_isn_75:
; IMULH_R r2, r1
mov rax, r10
mul r9
mov r10, rdx
randomx_isn_76:
; FMUL_R e1, a2
mulpd xmm5, xmm10
randomx_isn_77:
; FSQRT_R e1
sqrtpd xmm5, xmm5
randomx_isn_78:
; FSCAL_R f1
xorps xmm1, xmm15
randomx_isn_79:
; FSWAP_R e1
shufpd xmm5, xmm5, 1
randomx_isn_80:
; IXOR_R r3, 721175561
xor r11, 721175561
randomx_isn_81:
; FSCAL_R f0
xorps xmm0, xmm15
randomx_isn_82:
; IADD_RS r3, r0, SHFT 1
lea r11, [r11+r8*2]
randomx_isn_83:
; ISUB_R r2, -691647438
sub r10, -691647438
randomx_isn_84:
; IXOR_R r1, r3
xor r9, r11
randomx_isn_85:
; IMULH_R r1, r7
mov rax, r9
mul r15
mov r9, rdx
randomx_isn_86:
; IMULH_R r3, r4
mov rax, r11
mul r12
mov r11, rdx
randomx_isn_87:
; CBRANCH r3, -1821955951, COND 5
add r11, -1821955951
test r11, 2088960
jz randomx_isn_87
randomx_isn_88:
; FADD_R f2, a3
addpd xmm2, xmm11
randomx_isn_89:
; IXOR_R r6, r3
xor r14, r11
randomx_isn_90:
; CBRANCH r4, -1780348372, COND 15
add r12, -1784542676
test r12, 2139095040
jz randomx_isn_88
randomx_isn_91:
; IROR_R r4, 55
ror r12, 55
randomx_isn_92:
; FSUB_R f3, a2
subpd xmm3, xmm10
randomx_isn_93:
; FSCAL_R f1
xorps xmm1, xmm15
randomx_isn_94:
; FADD_R f1, a0
addpd xmm1, xmm8
randomx_isn_95:
; ISUB_R r0, r3
sub r8, r11
randomx_isn_96:
; ISMULH_R r5, r7
mov rax, r13
imul r15
mov r13, rdx
randomx_isn_97:
; IADD_RS r0, r5, SHFT 1
lea r8, [r8+r13*2]
randomx_isn_98:
; IMUL_R r7, r3
imul r15, r11
randomx_isn_99:
; IADD_RS r2, r4, SHFT 2
lea r10, [r10+r12*4]
randomx_isn_100:
; ISTORE L3[r2+1641523310], r4
lea eax, [r10d+1641523310]
and eax, 2097144
mov qword ptr [rsi+rax], r12
randomx_isn_101:
; ISTORE L2[r5+1966751371], r5
lea eax, [r13d+1966751371]
and eax, 262136
mov qword ptr [rsi+rax], r13
randomx_isn_102:
; IXOR_R r4, r7
xor r12, r15
randomx_isn_103:
; CBRANCH r7, -607792642, COND 4
add r15, -607792642
test r15, 1044480
jz randomx_isn_99
randomx_isn_104:
; FMUL_R e1, a1
mulpd xmm5, xmm9
randomx_isn_105:
; IMUL_R r2, r3
imul r10, r11
randomx_isn_106:
; IADD_RS r5, r1, -1609896472, SHFT 3
lea r13, [r13+r9*8-1609896472]
randomx_isn_107:
; FMUL_R e2, a2
mulpd xmm6, xmm10
randomx_isn_108:
; ISUB_R r3, r6
sub r11, r14
randomx_isn_109:
; ISUB_R r0, r5
sub r8, r13
randomx_isn_110:
; IMUL_M r2, L3[1548384]
imul r10, qword ptr [rsi+1548384]
randomx_isn_111:
; FADD_R f2, a1
addpd xmm2, xmm9
randomx_isn_112:
; ISUB_M r6, L1[r7+1465746]
lea eax, [r15d+1465746]
and eax, 16376
sub r14, qword ptr [rsi+rax]
randomx_isn_113:
; IMULH_M r3, L1[r6-668730597]
lea ecx, [r14d-668730597]
and ecx, 16376
mov rax, r11
mul qword ptr [rsi+rcx]
mov r11, rdx
randomx_isn_114:
; IMUL_M r3, L2[r6-1549338697]
lea eax, [r14d-1549338697]
and eax, 262136
imul r11, qword ptr [rsi+rax]
randomx_isn_115:
; IMULH_M r4, L1[r6-82240335]
lea ecx, [r14d-82240335]
and ecx, 16376
mov rax, r12
mul qword ptr [rsi+rcx]
mov r12, rdx
randomx_isn_116:
; ISWAP_R r2, r4
xchg r10, r12
randomx_isn_117:
; IADD_RS r1, r0, SHFT 1
lea r9, [r9+r8*2]
randomx_isn_118:
; FSUB_R f0, a1
subpd xmm0, xmm9
randomx_isn_119:
; IADD_M r3, L1[r1-233433054]
lea eax, [r9d-233433054]
and eax, 16376
add r11, qword ptr [rsi+rax]
randomx_isn_120:
; FSUB_R f1, a0
subpd xmm1, xmm8
randomx_isn_121:
; ISUB_R r4, r3
sub r12, r11
randomx_isn_122:
; IXOR_M r6, L2[r1-425418413]
lea eax, [r9d-425418413]
and eax, 262136
xor r14, qword ptr [rsi+rax]
randomx_isn_123:
; FSQRT_R e2
sqrtpd xmm6, xmm6
randomx_isn_124:
; CBRANCH r1, -1807592127, COND 12
add r9, -1806543551
test r9, 267386880
jz randomx_isn_118
randomx_isn_125:
; IADD_RS r4, r4, SHFT 0
lea r12, [r12+r12*1]
randomx_isn_126:
; ISTORE L2[r5-104490218], r0
lea eax, [r13d-104490218]
and eax, 262136
mov qword ptr [rsi+rax], r8
randomx_isn_127:
; IXOR_R r5, r0
xor r13, r8
randomx_isn_128:
; IMUL_M r6, L1[r2-603755642]
lea eax, [r10d-603755642]
and eax, 16376
imul r14, qword ptr [rsi+rax]
randomx_isn_129:
; INEG_R r5
neg r13
randomx_isn_130:
; FMUL_R e0, a0
mulpd xmm4, xmm8
randomx_isn_131:
; ISUB_R r0, -525100988
sub r8, -525100988
randomx_isn_132:
; IMUL_RCP r0, 3636489804
mov rax, 10893494383940851768
imul r8, rax
randomx_isn_133:
; FADD_M f2, L1[r3-768193829]
lea eax, [r11d-768193829]
and eax, 16376
cvtdq2pd xmm12, qword ptr [rsi+rax]
addpd xmm2, xmm12
randomx_isn_134:
; IADD_RS r7, r7, SHFT 3
lea r15, [r15+r15*8]
randomx_isn_135:
; IROR_R r3, r2
mov ecx, r10d
ror r11, cl
randomx_isn_136:
; ISUB_R r1, r4
sub r9, r12
randomx_isn_137:
; FADD_M f2, L1[r3+1221716517]
lea eax, [r11d+1221716517]
and eax, 16376
cvtdq2pd xmm12, qword ptr [rsi+rax]
addpd xmm2, xmm12
randomx_isn_138:
; FDIV_M e2, L1[r3-1258284098]
lea eax, [r11d-1258284098]
and eax, 16376
cvtdq2pd xmm12, qword ptr [rsi+rax]
andps xmm12, xmm13
orps xmm12, xmm14
divpd xmm6, xmm12
randomx_isn_139:
; FSUB_R f1, a0
subpd xmm1, xmm8
randomx_isn_140:
; IADD_RS r5, r6, -1773817530, SHFT 3
lea r13, [r13+r14*8-1773817530]
randomx_isn_141:
; IADD_M r0, L3[540376]
add r8, qword ptr [rsi+540376]
randomx_isn_142:
; FMUL_R e1, a3
mulpd xmm5, xmm11
randomx_isn_143:
; IADD_RS r6, r3, SHFT 2
lea r14, [r14+r11*4]
randomx_isn_144:
; ISTORE L1[r6+1837899146], r5
lea eax, [r14d+1837899146]
and eax, 16376
mov qword ptr [rsi+rax], r13
randomx_isn_145:
; FSWAP_R f2
shufpd xmm2, xmm2, 1
randomx_isn_146:
; FMUL_R e0, a0
mulpd xmm4, xmm8
randomx_isn_147:
; IADD_RS r1, r4, SHFT 3
lea r9, [r9+r12*8]
randomx_isn_148:
; ISUB_M r1, L2[r6-326072101]
lea eax, [r14d-326072101]
and eax, 262136
sub r9, qword ptr [rsi+rax]
randomx_isn_149:
; FSUB_R f1, a1
subpd xmm1, xmm9
randomx_isn_150:
; FADD_M f0, L2[r5+1123208251]
lea eax, [r13d+1123208251]
and eax, 262136
cvtdq2pd xmm12, qword ptr [rsi+rax]
addpd xmm0, xmm12
randomx_isn_151:
; FSWAP_R f1
shufpd xmm1, xmm1, 1
randomx_isn_152:
; IMUL_M r3, L1[r4+522054565]
lea eax, [r12d+522054565]
and eax, 16376
imul r11, qword ptr [rsi+rax]
randomx_isn_153:
; IADD_RS r0, r0, SHFT 1
lea r8, [r8+r8*2]
randomx_isn_154:
; FMUL_R e2, a3
mulpd xmm6, xmm11
randomx_isn_155:
; FSUB_R f1, a2
subpd xmm1, xmm10
randomx_isn_156:
; ISTORE L1[r6+1559762664], r7
lea eax, [r14d+1559762664]
and eax, 16376
mov qword ptr [rsi+rax], r15
randomx_isn_157:
; FSUB_R f0, a1
subpd xmm0, xmm9
randomx_isn_158:
; ISUB_R r5, r6
sub r13, r14
randomx_isn_159:
; FADD_R f0, a0
addpd xmm0, xmm8
randomx_isn_160:
; FMUL_R e1, a0
mulpd xmm5, xmm8
randomx_isn_161:
; FSUB_R f2, a1
subpd xmm2, xmm9
randomx_isn_162:
; ISUB_R r5, r7
sub r13, r15
randomx_isn_163:
; FDIV_M e3, L2[r4-1912085642]
lea eax, [r12d-1912085642]
and eax, 262136
cvtdq2pd xmm12, qword ptr [rsi+rax]
andps xmm12, xmm13
orps xmm12, xmm14
divpd xmm7, xmm12
randomx_isn_164:
; IXOR_M r3, L1[r0-858372123]
lea eax, [r8d-858372123]
and eax, 16376
xor r11, qword ptr [rsi+rax]
randomx_isn_165:
; IXOR_R r4, r6
xor r12, r14
randomx_isn_166:
; IADD_RS r3, r6, SHFT 0
lea r11, [r11+r14*1]
randomx_isn_167:
; FMUL_R e1, a1
mulpd xmm5, xmm9
randomx_isn_168:
; IADD_RS r5, r2, -371238437, SHFT 1
lea r13, [r13+r10*2-371238437]
randomx_isn_169:
; ISTORE L2[r5-633500019], r5
lea eax, [r13d-633500019]
and eax, 262136
mov qword ptr [rsi+rax], r13
randomx_isn_170:
; IXOR_R r4, -246154334
xor r12, -246154334
randomx_isn_171:
; IROR_R r7, r5
mov ecx, r13d
ror r15, cl
randomx_isn_172:
; ISTORE L1[r5+4726218], r2
lea eax, [r13d+4726218]
and eax, 16376
mov qword ptr [rsi+rax], r10
randomx_isn_173:
; IADD_RS r2, r0, SHFT 3
lea r10, [r10+r8*8]
randomx_isn_174:
; IXOR_R r2, r6
xor r10, r14
randomx_isn_175:
; IADD_RS r0, r7, SHFT 0
lea r8, [r8+r15*1]
randomx_isn_176:
; FMUL_R e1, a1
mulpd xmm5, xmm9
randomx_isn_177:
; ISTORE L1[r1+962725405], r0
lea eax, [r9d+962725405]
and eax, 16376
mov qword ptr [rsi+rax], r8
randomx_isn_178:
; ISTORE L1[r5-1472969684], r4
lea eax, [r13d-1472969684]
and eax, 16376
mov qword ptr [rsi+rax], r12
randomx_isn_179:
; FSCAL_R f3
xorps xmm3, xmm15
randomx_isn_180:
; IXOR_M r7, L1[r5+1728657403]
lea eax, [r13d+1728657403]
and eax, 16376
xor r15, qword ptr [rsi+rax]
randomx_isn_181:
; CBRANCH r2, -759703940, COND 2
add r10, -759704452
test r10, 261120
jz randomx_isn_175
randomx_isn_182:
; FADD_R f1, a2
addpd xmm1, xmm10
randomx_isn_183:
; IMULH_R r5, r1
mov rax, r13
mul r9
mov r13, rdx
randomx_isn_184:
; FSUB_R f3, a2
subpd xmm3, xmm10
randomx_isn_185:
; IMUL_R r6, r2
imul r14, r10
randomx_isn_186:
; IROR_R r2, r6
mov ecx, r14d
ror r10, cl
randomx_isn_187:
; FADD_R f2, a3
addpd xmm2, xmm11
randomx_isn_188:
; FSUB_R f3, a2
subpd xmm3, xmm10
randomx_isn_189:
; FSUB_R f0, a1
subpd xmm0, xmm9
randomx_isn_190:
; FSUB_R f1, a2
subpd xmm1, xmm10
randomx_isn_191:
; ISTORE L2[r0+519974891], r5
lea eax, [r8d+519974891]
and eax, 262136
mov qword ptr [rsi+rax], r13
randomx_isn_192:
; IXOR_R r3, r0
xor r11, r8
randomx_isn_193:
; IMUL_RCP r3, 2631645861
mov rax, 15052968123180221777
imul r11, rax
randomx_isn_194:
; FSCAL_R f2
xorps xmm2, xmm15
randomx_isn_195:
; IMUL_RCP r6, 3565118466
mov rax, 11111575010739676440
imul r14, rax
randomx_isn_196:
; IMUL_RCP r7, 2240276148
mov rax, 17682677777245240213
imul r15, rax
randomx_isn_197:
; FADD_R f3, a0
addpd xmm3, xmm8
randomx_isn_198:
; ISTORE L3[r7-908286266], r0
lea eax, [r15d-908286266]
and eax, 2097144
mov qword ptr [rsi+rax], r8
randomx_isn_199:
; FMUL_R e0, a1
mulpd xmm4, xmm9
randomx_isn_200:
; FADD_R f1, a2
addpd xmm1, xmm10
randomx_isn_201:
; IADD_RS r3, r2, SHFT 3
lea r11, [r11+r10*8]
randomx_isn_202:
; FSUB_R f0, a0
subpd xmm0, xmm8
randomx_isn_203:
; CBRANCH r1, -1282235504, COND 2
add r9, -1282234992
test r9, 261120
jz randomx_isn_182
randomx_isn_204:
; IMUL_M r1, L3[176744]
imul r9, qword ptr [rsi+176744]
randomx_isn_205:
; FSWAP_R e1
shufpd xmm5, xmm5, 1
randomx_isn_206:
; CBRANCH r0, -1557284726, COND 14
add r8, -1555187574
test r8, 1069547520
jz randomx_isn_204
randomx_isn_207:
; IADD_M r3, L1[r0+72267507]
lea eax, [r8d+72267507]
and eax, 16376
add r11, qword ptr [rsi+rax]
randomx_isn_208:
; ISUB_R r7, r0
sub r15, r8
randomx_isn_209:
; IROR_R r3, r2
mov ecx, r10d
ror r11, cl
randomx_isn_210:
; ISUB_R r0, r3
sub r8, r11
randomx_isn_211:
; IMUL_RCP r7, 3271526781
mov rax, 12108744298594255889
imul r15, rax
randomx_isn_212:
; FSQRT_R e2
sqrtpd xmm6, xmm6
randomx_isn_213:
; IMUL_R r0, r4
imul r8, r12
randomx_isn_214:
; FSWAP_R f3
shufpd xmm3, xmm3, 1
randomx_isn_215:
; FADD_R f2, a1
addpd xmm2, xmm9
randomx_isn_216:
; ISMULH_M r5, L1[r4-1702277076]
lea ecx, [r12d-1702277076]
and ecx, 16376
mov rax, r13
imul qword ptr [rsi+rcx]
mov r13, rdx
randomx_isn_217:
; ISUB_R r4, r2
sub r12, r10
randomx_isn_218:
; FMUL_R e1, a2
mulpd xmm5, xmm10
randomx_isn_219:
; FSUB_R f3, a1
subpd xmm3, xmm9
randomx_isn_220:
; ISTORE L2[r1+1067932664], r3
lea eax, [r9d+1067932664]
and eax, 262136
mov qword ptr [rsi+rax], r11
randomx_isn_221:
; IROR_R r6, r4
mov ecx, r12d
ror r14, cl
randomx_isn_222:
; FSUB_R f1, a1
subpd xmm1, xmm9
randomx_isn_223:
; ISUB_R r2, r5
sub r10, r13
randomx_isn_224:
; IXOR_R r2, r7
xor r10, r15
randomx_isn_225:
; IXOR_R r7, r5
xor r15, r13
randomx_isn_226:
; IMUL_RCP r4, 1021824288
mov rax, 9691999329617659469
imul r12, rax
randomx_isn_227:
; IROR_R r1, 48
ror r9, 48
randomx_isn_228:
; IMUL_RCP r4, 4042529026
mov rax, 9799331310263836012
imul r12, rax
randomx_isn_229:
; FSQRT_R e1
sqrtpd xmm5, xmm5
randomx_isn_230:
; IROR_R r3, r6
mov ecx, r14d
ror r11, cl
randomx_isn_231:
; FMUL_R e2, a1
mulpd xmm6, xmm9
randomx_isn_232:
; IMULH_M r4, L1[r6+396272725]
lea ecx, [r14d+396272725]
and ecx, 16376
mov rax, r12
mul qword ptr [rsi+rcx]
mov r12, rdx
randomx_isn_233:
; FSUB_R f0, a0
subpd xmm0, xmm8
randomx_isn_234:
; FADD_R f3, a2
addpd xmm3, xmm10
randomx_isn_235:
; IADD_RS r7, r3, SHFT 1
lea r15, [r15+r11*2]
randomx_isn_236:
; ISUB_R r6, r3
sub r14, r11
randomx_isn_237:
; IADD_RS r4, r4, SHFT 2
lea r12, [r12+r12*4]
randomx_isn_238:
; ISUB_R r7, r1
sub r15, r9
randomx_isn_239:
; ISMULH_R r2, r5
mov rax, r10
imul r13
mov r10, rdx
randomx_isn_240:
; FMUL_R e1, a2
mulpd xmm5, xmm10
randomx_isn_241:
; IADD_RS r1, r4, SHFT 2
lea r9, [r9+r12*4]
randomx_isn_242:
; FDIV_M e2, L2[r6+259737107]
lea eax, [r14d+259737107]
and eax, 262136
cvtdq2pd xmm12, qword ptr [rsi+rax]
andps xmm12, xmm13
orps xmm12, xmm14
divpd xmm6, xmm12
randomx_isn_243:
; IADD_M r0, L1[r1+789576070]
lea eax, [r9d+789576070]
and eax, 16376
add r8, qword ptr [rsi+rax]
randomx_isn_244:
; IMUL_R r3, r4
imul r11, r12
randomx_isn_245:
; IMUL_R r3, r1
imul r11, r9
randomx_isn_246:
; IMUL_RCP r4, 1001661150
mov rax, 9887096364157721599
imul r12, rax
randomx_isn_247:
; CBRANCH r3, -722123512, COND 2
add r11, -722123512
test r11, 261120
jz randomx_isn_246
randomx_isn_248:
; ISMULH_R r7, r6
mov rax, r15
imul r14
mov r15, rdx
randomx_isn_249:
; IADD_M r5, L3[1870552]
add r13, qword ptr [rsi+1870552]
randomx_isn_250:
; ISUB_R r0, r1
sub r8, r9
randomx_isn_251:
; IMULH_R r0, r5
mov rax, r8
mul r13
mov r8, rdx
randomx_isn_252:
; FSUB_R f1, a1
subpd xmm1, xmm9
randomx_isn_253:
; ISTORE L2[r3-2010380786], r5
lea eax, [r11d-2010380786]
and eax, 262136
mov qword ptr [rsi+rax], r13
randomx_isn_254:
; FMUL_R e3, a2
mulpd xmm7, xmm10
randomx_isn_255:
; CBRANCH r7, -2007380935, COND 9
add r15, -2007315399
test r15, 33423360
jz randomx_isn_249

RandomX/doc/specs.md Normal file
@@ -0,0 +1,938 @@
# RandomX
RandomX is a proof of work (PoW) algorithm which was designed to close the gap between general-purpose CPUs and specialized hardware. The core of the algorithm is a simulation of a virtual CPU.
#### Table of contents
1. [Definitions](#1-definitions)
1. [Algorithm description](#2-algorithm-description)
1. [Custom functions](#3-custom-functions)
1. [Virtual Machine](#4-virtual-machine)
1. [Instruction set](#5-instruction-set)
1. [SuperscalarHash](#6-superscalarhash)
1. [Dataset](#7-dataset)
## 1. Definitions
### 1.1 General definitions
**Hash256** and **Hash512** refer to the [Blake2b](https://blake2.net/blake2_20130129.pdf) hashing function with a 256-bit and 512-bit output size, respectively.
**Floating point format** refers to the [IEEE-754 double precision floating point format](https://en.wikipedia.org/wiki/Double-precision_floating-point_format) with a sign bit, 11-bit exponent and 52-bit fraction.
**Argon2d** is a tradeoff-resistant variant of [Argon2](https://github.com/P-H-C/phc-winner-argon2/blob/master/argon2-specs.pdf), a memory-hard password derivation function.
**AesGenerator1R** refers to an AES-based pseudo-random number generator described in chapter 3.2. It's initialized with a 512-bit seed value and is capable of producing more than 10 bytes per clock cycle.
**AesGenerator4R** is a slower but more secure AES-based pseudo-random number generator described in chapter 3.3. It's initialized with a 512-bit seed value.
**AesHash1R** refers to an AES-based fingerprinting function described in chapter 3.4. It's capable of processing more than 10 bytes per clock cycle and produces a 512-bit output.
**BlakeGenerator** refers to a custom pseudo-random number generator described in chapter 3.5. It's based on the Blake2b hashing function.
**SuperscalarHash** refers to a custom diffusion function designed to run efficiently on superscalar CPUs (see chapter 7). It transforms a 64-byte input value into a 64-byte output value.
**Virtual Machine** or **VM** refers to the RandomX virtual machine as described in chapter 4.
**Programming the VM** refers to the act of loading a program and configuration into the VM. This is described in chapter 4.5.
**Executing the VM** refers to the act of running the program loop as described in chapter 4.6.
**Scratchpad** refers to the workspace memory of the VM. The whole scratchpad is structured into 3 levels: L3 -> L2 -> L1 with each lower level being a subset of the higher levels.
**Register File** refers to a 256-byte sequence formed by concatenating VM registers in little-endian format in the following order: `r0`-`r7`, `f0`-`f3`, `e0`-`e3` and `a0`-`a3`.
**Program Buffer** refers to the buffer from which the VM reads instructions.
**Cache** refers to a read-only buffer initialized by Argon2d as described in chapter 7.1.
**Dataset** refers to a large read-only buffer described in chapter 7. It is constructed from the Cache using the SuperscalarHash function.
### 1.2 Configurable parameters
RandomX has several configurable parameters that are listed in Table 1.2.1 with their default values.
*Table 1.2.1 - Configurable parameters*
|parameter|description|default value|
|---------|-----|-------|
|`RANDOMX_ARGON_MEMORY`|The number of 1 KiB Argon2 blocks in the Cache| `262144`|
|`RANDOMX_ARGON_ITERATIONS`|The number of Argon2d iterations for Cache initialization|`3`|
|`RANDOMX_ARGON_LANES`|The number of parallel lanes for Cache initialization|`1`|
|`RANDOMX_ARGON_SALT`|Argon2 salt|`"RandomX\x03"`|
|`RANDOMX_CACHE_ACCESSES`|The number of random Cache accesses per Dataset item|`8`|
|`RANDOMX_SUPERSCALAR_LATENCY`|Target latency for SuperscalarHash (in cycles of the reference CPU)|`170`|
|`RANDOMX_DATASET_BASE_SIZE`|Dataset base size in bytes|`2147483648`|
|`RANDOMX_DATASET_EXTRA_SIZE`|Dataset extra size in bytes|`33554368`|
|`RANDOMX_PROGRAM_SIZE`|The number of instructions in a RandomX program|`256`|
|`RANDOMX_PROGRAM_ITERATIONS`|The number of iterations per program|`2048`|
|`RANDOMX_PROGRAM_COUNT`|The number of programs per hash|`8`|
|`RANDOMX_JUMP_BITS`|Jump condition mask size in bits|`8`|
|`RANDOMX_JUMP_OFFSET`|Jump condition mask offset in bits|`8`|
|`RANDOMX_SCRATCHPAD_L3`|Scratchpad L3 size in bytes|`2097152`|
|`RANDOMX_SCRATCHPAD_L2`|Scratchpad L2 size in bytes|`262144`|
|`RANDOMX_SCRATCHPAD_L1`|Scratchpad L1 size in bytes|`16384`|
Instruction frequencies listed in Tables 5.2.1, 5.3.1, 5.4.1 and 5.5.1 are also configurable.
## 2. Algorithm description
The RandomX algorithm accepts two input values:
* String `K` with a size of 0-60 bytes (key)
* String `H` of arbitrary length (the value to be hashed)
and outputs a 256-bit result `R`.
The algorithm consists of the following steps:
1. The Dataset is initialized using the key value `K` (described in chapter 7).
1. 64-byte seed `S` is calculated as `S = Hash512(H)`.
1. Let `gen1 = AesGenerator1R(S)`.
1. The Scratchpad is filled with `RANDOMX_SCRATCHPAD_L3` random bytes using generator `gen1`.
1. Let `gen4 = AesGenerator4R(gen1.state)` (use the final state of `gen1`).
1. The value of the VM register `fprc` is set to 0 (default rounding mode - chapter 4.3).
1. The VM is programmed using `128 + 8 * RANDOMX_PROGRAM_SIZE` random bytes using generator `gen4` (chapter 4.5).
1. The VM is executed (chapter 4.6).
1. A new 64-byte seed is calculated as `S = Hash512(RegisterFile)`.
1. Set `gen4.state = S` (modify the state of the generator).
1. Steps 7-10 are performed a total of `RANDOMX_PROGRAM_COUNT` times. The last iteration skips steps 9 and 10.
1. Scratchpad fingerprint is calculated as `A = AesHash1R(Scratchpad)`.
1. Bytes 192-255 of the Register File are set to the value of `A`.
1. Result is calculated as `R = Hash256(RegisterFile)`.
The input of the `Hash512` function in step 9 is the following 256 bytes:
```
+---------------------------------+
| registers r0-r7 | (64 bytes)
+---------------------------------+
| registers f0-f3 | (64 bytes)
+---------------------------------+
| registers e0-e3 | (64 bytes)
+---------------------------------+
| registers a0-a3 | (64 bytes)
+---------------------------------+
```
The input of the `Hash256` function in step 14 is the following 256 bytes:
```
+---------------------------------+
| registers r0-r7 | (64 bytes)
+---------------------------------+
| registers f0-f3 | (64 bytes)
+---------------------------------+
| registers e0-e3 | (64 bytes)
+---------------------------------+
| AesHash1R(Scratchpad) | (64 bytes)
+---------------------------------+
```
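The hashing primitives used in steps 2, 9 and 14 are plain Blake2b with different output sizes (chapter 1.1). A minimal Python sketch using the standard `hashlib` module (not part of the specification):

```python
import hashlib

def hash512(data: bytes) -> bytes:
    """Hash512: Blake2b with a 512-bit (64-byte) output."""
    return hashlib.blake2b(data, digest_size=64).digest()

def hash256(data: bytes) -> bytes:
    """Hash256: Blake2b with a 256-bit (32-byte) output."""
    return hashlib.blake2b(data, digest_size=32).digest()

# Step 2: the 64-byte seed S derived from the input value H
S = hash512(b"example input H")
```

Both are the unkeyed Blake2b variants; only the digest size differs.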
## 3. Custom functions
### 3.1 Definitions
Two of the custom functions are based on the [Advanced Encryption Standard](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard) (AES).
**AES encryption round** refers to the application of the ShiftRows, SubBytes and MixColumns transformations followed by a XOR with the round key.
**AES decryption round** refers to the application of inverse ShiftRows, inverse SubBytes and inverse MixColumns transformations followed by a XOR with the round key.
### 3.2 AesGenerator1R
AesGenerator1R produces a sequence of pseudo-random bytes.
The internal state of the generator consists of 64 bytes arranged into four columns of 16 bytes each. During each output iteration, every column is decrypted (columns 0, 2) or encrypted (columns 1, 3) with one AES round using the following round keys (one key per column):
```
key0 = 53 a5 ac 6d 09 66 71 62 2b 55 b5 db 17 49 f4 b4
key1 = 07 af 7c 6d 0d 71 6a 84 78 d3 25 17 4e dc a1 0d
key2 = f1 62 12 3f c6 7e 94 9f 4f 79 c0 f4 45 e3 20 3e
key3 = 35 81 ef 6a 7c 31 ba b1 88 4c 31 16 54 91 16 49
```
These keys were generated as:
```
key0, key1, key2, key3 = Hash512("RandomX AesGenerator1R keys")
```
A single iteration produces 64 bytes of output, which also become the new generator state.
```
state0 (16 B) state1 (16 B) state2 (16 B) state3 (16 B)
| | | |
AES decrypt AES encrypt AES decrypt AES encrypt
(key0) (key1) (key2) (key3)
| | | |
v v v v
state0' state1' state2' state3'
```
### 3.3 AesGenerator4R
AesGenerator4R works the same way as AesGenerator1R, except it uses 4 rounds per column:
```
state0 (16 B) state1 (16 B) state2 (16 B) state3 (16 B)
| | | |
AES decrypt AES encrypt AES decrypt AES encrypt
(key0) (key0) (key0) (key0)
| | | |
v v v v
AES decrypt AES encrypt AES decrypt AES encrypt
(key1) (key1) (key1) (key1)
| | | |
v v v v
AES decrypt AES encrypt AES decrypt AES encrypt
(key2) (key2) (key2) (key2)
| | | |
v v v v
AES decrypt AES encrypt AES decrypt AES encrypt
(key3) (key3) (key3) (key3)
| | | |
v v v v
state0' state1' state2' state3'
```
AesGenerator4R uses the following 4 round keys:
```
key0 = 5d 46 90 f8 a6 e4 fb 7f b7 82 1f 14 95 9e 35 cf
key1 = 50 c4 55 6a 8a 27 e8 fe c3 5a 5c bd dc ff 41 67
key2 = a4 47 4c 11 e4 fd 24 d5 d2 9a 27 a7 ac 4a 32 3d
key3 = 2a 3a 0c 81 ff ae a9 99 d9 db d3 42 08 db f6 76
```
These keys were generated as:
```
key0, key1, key2, key3 = Hash512("RandomX AesGenerator4R keys")
```
### 3.4 AesHash1R
AesHash1R calculates a 512-bit fingerprint of its input.
AesHash1R has a 64-byte internal state, which is arranged into four columns of 16 bytes each. The initial state is:
```
state0 = 0d 2c b5 92 de 56 a8 9f 47 db 82 cc ad 3a 98 d7
state1 = 6e 99 8d 33 98 b7 c7 15 5a 12 9e f5 57 80 e7 ac
state2 = 17 00 77 6a d0 c7 62 ae 6b 50 79 50 e4 7c a0 e8
state3 = 0c 24 0a 63 8d 82 ad 07 05 00 a1 79 48 49 99 7e
```
The initial state vectors were generated as:
```
state0, state1, state2, state3 = Hash512("RandomX AesHash1R state")
```
The input is processed in 64-byte blocks. Each input block is considered to be a set of four AES round keys `key0`, `key1`, `key2`, `key3`. Each state column is encrypted (columns 0, 2) or decrypted (columns 1, 3) with one AES round using the corresponding round key:
```
state0 (16 B) state1 (16 B) state2 (16 B) state3 (16 B)
| | | |
AES encrypt AES decrypt AES encrypt AES decrypt
(key0) (key1) (key2) (key3)
| | | |
v v v v
state0' state1' state2' state3'
```
When all input bytes have been processed, the state is processed with two additional AES rounds with the following extra keys (one key per round, same pair of keys for all columns):
```
xkey0 = 89 83 fa f6 9f 94 24 8b bf 56 dc 90 01 02 89 06
xkey1 = d1 63 b2 61 3c e0 f4 51 c6 43 10 ee 9b f9 18 ed
```
The extra keys were generated as:
```
xkey0, xkey1 = Hash256("RandomX AesHash1R xkeys")
```
```
state0 (16 B) state1 (16 B) state2 (16 B) state3 (16 B)
| | | |
AES encrypt AES decrypt AES encrypt AES decrypt
(xkey0) (xkey0) (xkey0) (xkey0)
| | | |
v v v v
AES encrypt AES decrypt AES encrypt AES decrypt
(xkey1) (xkey1) (xkey1) (xkey1)
| | | |
v v v v
finalState0 finalState1 finalState2 finalState3
```
The final state is the output of the function.
### 3.5 BlakeGenerator
BlakeGenerator is a simple pseudo-random number generator based on the Blake2b hashing function. It has a 64-byte internal state `S`.
#### 3.5.1 Initialization
The internal state is initialized from a seed value `K` (0-60 bytes long). The seed value is written into the internal state and padded with zeroes. Then the internal state is initialized as `S = Hash512(S)`.
#### 3.5.2 Random number generation
The generator can generate 1 byte or 4 bytes at a time by supplying data from its internal state `S`. If there are not enough unused bytes left, the internal state is reinitialized as `S = Hash512(S)`.
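A structural Python sketch of such a generator (method names, byte order and the exact reseeding behavior are illustrative assumptions, not taken from the reference implementation):

```python
import hashlib

class BlakeGenerator:
    def __init__(self, seed: bytes):
        assert len(seed) <= 60                    # seed value K is 0-60 bytes long
        padded = seed.ljust(64, b'\x00')          # write seed into state, pad with zeroes
        self.state = hashlib.blake2b(padded, digest_size=64).digest()  # S = Hash512(S)
        self.pos = 0

    def _ensure(self, n: int) -> None:
        if self.pos + n > 64:                     # not enough unused bytes: reinitialize
            self.state = hashlib.blake2b(self.state, digest_size=64).digest()
            self.pos = 0

    def get_byte(self) -> int:
        self._ensure(1)
        b = self.state[self.pos]
        self.pos += 1
        return b

    def get_uint32(self) -> int:
        self._ensure(4)
        v = int.from_bytes(self.state[self.pos:self.pos + 4], 'little')
        self.pos += 4
        return v
```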
## 4. Virtual Machine
The components of the RandomX virtual machine are summarized in Fig. 4.1.
*Figure 4.1 - Virtual Machine*
![Imgur](https://i.imgur.com/Enk42b8.png)
The VM is a complex instruction set computer ([CISC](https://en.wikipedia.org/wiki/Complex_instruction_set_computer)). All data are loaded and stored in little-endian byte order. Signed integer numbers are represented using [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement).
### 4.1 Dataset
Dataset is described in detail in chapter 7. It's a large read-only buffer. Its size is equal to `RANDOMX_DATASET_BASE_SIZE + RANDOMX_DATASET_EXTRA_SIZE` bytes. Each program uses only a random subset of the Dataset of size `RANDOMX_DATASET_BASE_SIZE`. All Dataset accesses read an aligned 64-byte item.
### 4.2 Scratchpad
Scratchpad represents the workspace memory of the VM. Its size is `RANDOMX_SCRATCHPAD_L3` bytes and it's divided into 3 "levels":
* The whole scratchpad is the third level "L3".
* The first `RANDOMX_SCRATCHPAD_L2` bytes of the scratchpad is the second level "L2".
* The first `RANDOMX_SCRATCHPAD_L1` bytes of the scratchpad is the first level "L1".
The scratchpad levels are inclusive, i.e. L3 contains both L2 and L1 and L2 contains L1.
To access a particular scratchpad level, bitwise AND with a mask according to table 4.2.1 is applied to the memory address.
*Table 4.2.1: Scratchpad access masks*
|Level|8-byte aligned mask|64-byte aligned mask|
|---------|-|-|
|L1|`(RANDOMX_SCRATCHPAD_L1 - 1) & ~7`|-|
|L2|`(RANDOMX_SCRATCHPAD_L2 - 1) & ~7`|-|
|L3|`(RANDOMX_SCRATCHPAD_L3 - 1) & ~7`|`(RANDOMX_SCRATCHPAD_L3 - 1) & ~63`|
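With the default parameter values from Table 1.2.1, the mask expressions above evaluate to the constants that appear in compiled RandomX programs (e.g. `and eax, 16376` for an L1 access in the program listing). A Python sketch:

```python
RANDOMX_SCRATCHPAD_L1 = 16384
RANDOMX_SCRATCHPAD_L2 = 262144
RANDOMX_SCRATCHPAD_L3 = 2097152

def mask8(size: int) -> int:
    """8-byte aligned scratchpad access mask."""
    return (size - 1) & ~7

def mask64(size: int) -> int:
    """64-byte aligned scratchpad access mask."""
    return (size - 1) & ~63

# address & mask8(RANDOMX_SCRATCHPAD_L1) yields an 8-byte aligned L1 offset
```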
### 4.3 Registers
The VM has 8 integer registers `r0`-`r7` (group R) and a total of 12 floating point registers split into 3 groups: `f0`-`f3` (group F), `e0`-`e3` (group E) and `a0`-`a3` (group A). Integer registers are 64 bits wide, while floating point registers are 128 bits wide and contain a pair of numbers in floating point format. The lower and upper half of floating point registers are not separately addressable.
Additionally, there are 3 internal registers `ma`, `mx` and `fprc`.
Integer registers `r0`-`r7` can be the source or the destination operands of integer instructions or may be used as address registers for accessing the Scratchpad.
Floating point registers `a0`-`a3` are read-only and their value is fixed for a given VM program. They can be the source operand of any floating point instruction. The value of these registers is restricted to the interval `[1, 4294967296)`.
Floating point registers `f0`-`f3` are the "additive" registers, which can be the destination of floating point addition and subtraction instructions. The absolute value of these registers will not exceed about `3.0e+14`.
Floating point registers `e0`-`e3` are the "multiplicative" registers, which can be the destination of floating point multiplication, division and square root instructions. Their value is always positive.
`ma` and `mx` are the memory registers. Both are 32 bits wide. `ma` contains the memory address of the next Dataset read and `mx` contains the address of the next Dataset prefetch.
The 2-bit `fprc` register determines the rounding mode of all floating point operations according to Table 4.3.1. The four rounding modes are defined by the IEEE 754 standard.
*Table 4.3.1: Rounding modes*
|`fprc`|rounding mode|
|-------|------------|
|0|roundTiesToEven|
|1|roundTowardNegative|
|2|roundTowardPositive|
|3|roundTowardZero|
#### 4.3.1 Group F register conversion
When an 8-byte value read from the memory is to be converted to an F group register value or operand, it is interpreted as a pair of 32-bit signed integers (in little endian, two's complement format) and converted to floating point format. This conversion is exact and doesn't need rounding because only 30 bits of the fraction significand are needed to represent the integer value.
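A Python sketch of this conversion (the function name is illustrative; `struct` interprets the 8 bytes as two little-endian signed 32-bit integers, and the int-to-float conversion is exact):

```python
import struct

def f_group_load(mem8: bytes):
    """Convert an 8-byte memory value into a pair of doubles (group F operand)."""
    lo, hi = struct.unpack('<ii', mem8)  # two little-endian int32, two's complement
    return float(lo), float(hi)
```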
#### 4.3.2 Group E register conversion
When an 8-byte value read from the memory is to be converted to an E group register value or operand, the same conversion procedure is applied as for F group registers (see 4.3.1) with additional post-processing steps for each of the two floating point values:
1. The sign bit is set to `0`.
2. Bits 0-2 of the exponent are set to the constant value of <code>011<sub>2</sub></code>.
3. Bits 3-6 of the exponent are set to the value of the exponent mask described in chapter 4.5.6. This value is fixed for a given VM program.
4. The bottom 22 bits of the fraction significand are set to the value of the fraction mask described in chapter 4.5.6. This value is fixed for a given VM program.
### 4.4 Program buffer
The Program buffer stores the program to be executed by the VM. The program consists of `RANDOMX_PROGRAM_SIZE` instructions. Each instruction is encoded by an 8-byte word. The instruction set is described in chapter 5.
### 4.5 VM programming
The VM requires `128 + 8 * RANDOMX_PROGRAM_SIZE` bytes to be programmed. This is split into two parts:
* `128` bytes of configuration data = 16 quadwords (16×8 bytes), used according to Table 4.5.1
* `8 * RANDOMX_PROGRAM_SIZE` bytes of program data, copied directly into the Program Buffer
*Table 4.5.1 - Configuration data*
|quadword|description|
|-----|-----------|
|0|initialize low half of register `a0`|
|1|initialize high half of register `a0`|
|2|initialize low half of register `a1`|
|3|initialize high half of register `a1`|
|4|initialize low half of register `a2`|
|5|initialize high half of register `a2`|
|6|initialize low half of register `a3`|
|7|initialize high half of register `a3`|
|8|initialize register `ma`|
|9|(reserved)|
|10|initialize register `mx`|
|11|(reserved)|
|12|select address registers|
|13|select Dataset offset|
|14|initialize register masks for low half of group E registers|
|15|initialize register masks for high half of group E registers|
#### 4.5.2 Group A register initialization
The values of the floating point registers `a0`-`a3` are initialized using configuration quadwords 0-7 to have the following value:
<code>+1.fraction x 2<sup>exponent</sup></code>
The fraction has full 52 bits of precision and the exponent value ranges from 0 to 31. These values are obtained from the initialization quadword (in little endian format) according to Table 4.5.2.
*Table 4.5.2 - Group A register initialization*
|bits|description|
|----|-----------|
|0-51|fraction|
|52-58|(reserved)|
|59-63|exponent|
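The layout in Table 4.5.2 can be sketched by assembling the IEEE-754 bit pattern directly (function name is illustrative, not part of the specification):

```python
import struct

def a_register_half(quadword: int) -> float:
    fraction = quadword & ((1 << 52) - 1)        # bits 0-51
    exponent = (quadword >> 59) & 31             # bits 59-63, value 0 to 31
    bits = ((exponent + 1023) << 52) | fraction  # sign = 0, biased exponent
    return struct.unpack('<d', struct.pack('<Q', bits))[0]
```

Every result lies in the interval `[1, 4294967296)`, matching the group A register constraint from chapter 4.3.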
#### 4.5.3 Memory registers
Registers `ma` and `mx` are initialized using the low 32 bits of quadwords 8 and 10 in little endian format.
#### 4.5.4 Address registers
Bits 0-3 of quadword 12 are used to select 4 address registers for program execution. Each bit chooses one register from a pair of integer registers according to Table 4.5.3.
*Table 4.5.3 - Address registers*
|address register (bit)|value = 0|value = 1|
|----------------------|-|-|
|`readReg0` (0)|`r0`|`r1`|
|`readReg1` (1)|`r2`|`r3`|
|`readReg2` (2)|`r4`|`r5`|
|`readReg3` (3)|`r6`|`r7`|
#### 4.5.5 Dataset offset
The `datasetOffset` is calculated by bitwise AND of quadword 13 and the value `RANDOMX_DATASET_EXTRA_SIZE / 64`. The result is multiplied by `64`. This offset is used when reading values from the Dataset.
#### 4.5.6 Group E register masks
These masks are used for the conversion of group E registers (see 4.3.2). The low and high halves each have their own masks initialized from quadwords 14 and 15. The fraction mask is given by bits 0-21 and the exponent mask by bits 60-63 of the initialization quadword.
### 4.6 VM execution
During VM execution, 3 additional temporary registers are used: `ic`, `spAddr0` and `spAddr1`. Program execution consists of initialization and loop execution.
#### 4.6.1 Initialization
1. `ic` register is set to `RANDOMX_PROGRAM_ITERATIONS`.
2. `spAddr0` is set to the value of `mx`.
3. `spAddr1` is set to the value of `ma`.
4. The values of all integer registers `r0`-`r7` are set to zero.
#### 4.6.2 Loop execution
The loop described below is repeated until the value of the `ic` register reaches zero.
1. XOR of registers `readReg0` and `readReg1` (see Table 4.5.3) is calculated and `spAddr0` is XORed with the low 32 bits of the result and `spAddr1` with the high 32 bits.
2. `spAddr0` is used to perform a 64-byte aligned read from Scratchpad level 3 (using mask from Table 4.2.1). The 64 bytes are XORed with all integer registers in order `r0`-`r7`.
3. `spAddr1` is used to perform a 64-byte aligned read from Scratchpad level 3 (using mask from Table 4.2.1). Each floating point register `f0`-`f3` and `e0`-`e3` is initialized using an 8-byte value according to the conversion rules from chapters 4.3.1 and 4.3.2.
4. All `RANDOMX_PROGRAM_SIZE` instructions stored in the Program Buffer are executed.
5. The `mx` register is XORed with the low 32 bits of registers `readReg2` and `readReg3` (see Table 4.5.3).
6. A 64-byte Dataset item at address `datasetOffset + mx % RANDOMX_DATASET_BASE_SIZE` is prefetched from the Dataset (it will be used during the next iteration).
7. A 64-byte Dataset item at address `datasetOffset + ma % RANDOMX_DATASET_BASE_SIZE` is loaded from the Dataset. The 64 bytes are XORed with all integer registers in order `r0`-`r7`.
8. The values of registers `mx` and `ma` are swapped.
9. The values of all integer registers `r0`-`r7` are written to the Scratchpad (L3) at address `spAddr1` (64-byte aligned).
10. Register `f0` is XORed with register `e0` and the result is stored in register `f0`. Register `f1` is XORed with register `e1` and the result is stored in register `f1`. Register `f2` is XORed with register `e2` and the result is stored in register `f2`. Register `f3` is XORed with register `e3` and the result is stored in register `f3`.
11. The values of registers `f0`-`f3` are written to the Scratchpad (L3) at address `spAddr0` (64-byte aligned).
12. `spAddr0` and `spAddr1` are both set to zero.
13. `ic` is decreased by 1.
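Step 1 of the loop above splits a 64-bit XOR between the two address registers. A Python sketch (helper and argument names are illustrative):

```python
def mix_sp_addresses(sp_addr0: int, sp_addr1: int,
                     read_reg0_value: int, read_reg1_value: int):
    mix = read_reg0_value ^ read_reg1_value
    sp_addr0 ^= mix & 0xFFFFFFFF   # low 32 bits of the XOR result
    sp_addr1 ^= mix >> 32          # high 32 bits of the XOR result
    return sp_addr0, sp_addr1
```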
## 5. Instruction set
The VM executes programs in a special instruction set, which was designed in such a way that any random 8-byte word is a valid instruction and any sequence of valid instructions is a valid program. Because there are no "syntax" rules, generating a random program is as easy as filling the program buffer with random data.
### 5.1 Instruction encoding
Each instruction word is 64 bits long. Instruction fields are encoded as shown in Fig. 5.1.
*Figure 5.1 - Instruction encoding*
![Imgur](https://i.imgur.com/FtkWRwe.png)
#### 5.1.1 opcode
There are 256 opcodes, which are distributed between 29 distinct instructions. Each instruction can be encoded using multiple opcodes (the number of opcodes specifies the frequency of the instruction in a random program).
*Table 5.1.1: Instruction groups*
|group|# instructions|# opcodes|share of opcodes|
|---------|-----------------|----|-|
|integer |17|129|50.4%|
|floating point |9|94|36.7%|
|control |2|17|6.6%|
|store |1|16|6.3%|
|**total**|**29**|**256**|**100%**|
All instructions are described below in chapters 5.2 - 5.5.
#### 5.1.2 dst
Destination register. Only bits 0-1 (register groups A, F, E) or 0-2 (groups R, F+E) are used to encode a register according to Table 5.1.2.
*Table 5.1.2: Addressable register groups*
|index|R|A|F|E|F+E|
|--|--|--|--|--|--|
|0|`r0`|`a0`|`f0`|`e0`|`f0`|
|1|`r1`|`a1`|`f1`|`e1`|`f1`|
|2|`r2`|`a2`|`f2`|`e2`|`f2`|
|3|`r3`|`a3`|`f3`|`e3`|`f3`|
|4|`r4`||||`e0`|
|5|`r5`||||`e1`|
|6|`r6`||||`e2`|
|7|`r7`||||`e3`|
#### 5.1.3 src
The `src` flag encodes a source operand register according to Table 5.1.2 (only bits 0-1 or 0-2 are used).
Some integer instructions use a constant value as the source operand in cases when `dst` and `src` encode the same register (see Table 5.2.1).
For register-memory instructions, the source operand is used to calculate the memory address.
#### 5.1.4 mod
The `mod` flag is encoded as:
*Table 5.1.3: mod flag encoding*
|`mod` bits|description|range of values|
|----|--------|----|
|0-1|`mod.mem` flag|0-3|
|2-3|`mod.shift` flag|0-3|
|4-7|`mod.cond` flag|0-15|
The `mod.mem` flag selects between Scratchpad levels L1 and L2 when reading from or writing to memory except for two cases:
* it's a memory read and `dst` and `src` encode the same register
* it's a memory write and `mod.cond` is 14 or 15
In these two cases, the Scratchpad level is L3 (see Table 5.1.4).
*Table 5.1.4: memory access Scratchpad level*
|condition|Scratchpad level|
|---------|-|
|`src == dst` (read)|L3|
|`mod.cond >= 14` (write)|L3|
|`mod.mem == 0`|L2|
|`mod.mem != 0`|L1|
The address for reading/writing is calculated by applying a bitwise AND operation to the address and the 8-byte aligned address mask listed in Table 4.2.1.
The `mod.cond` and `mod.shift` flags are used by some instructions (see 5.2, 5.4).
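The field split from Table 5.1.3 can be sketched as (function name is illustrative):

```python
def decode_mod(mod: int) -> dict:
    """Split the 8-bit mod flag into its three sub-fields."""
    return {
        'mem':   mod & 3,          # bits 0-1, range 0-3
        'shift': (mod >> 2) & 3,   # bits 2-3, range 0-3
        'cond':  (mod >> 4) & 15,  # bits 4-7, range 0-15
    }
```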
#### 5.1.5 imm32
A 32-bit immediate value that can be used as the source operand and is used to calculate addresses for memory operations. The immediate value is sign-extended to 64 bits unless specified otherwise.
### 5.2 Integer instructions
For integer instructions, the destination is always an integer register (register group R). Source operand (if applicable) can be either an integer register or memory value. If `dst` and `src` refer to the same register, most instructions use `0` or `imm32` instead of the register. This is indicated in the 'src == dst' column in Table 5.2.1.
`[mem]` indicates a memory operand loaded as an 8-byte value from the address `src + imm32`.
*Table 5.2.1 Integer instructions*
|frequency|instruction|dst|src|`src == dst ?`|operation|
|-|-|-|-|-|-|
|25/256|IADD_RS|R|R|`src = dst`|`dst = dst + (src << mod.shift) (+ imm32)`|
|7/256|IADD_M|R|R|`src = 0`|`dst = dst + [mem]`|
|16/256|ISUB_R|R|R|`src = imm32`|`dst = dst - src`|
|7/256|ISUB_M|R|R|`src = 0`|`dst = dst - [mem]`|
|16/256|IMUL_R|R|R|`src = imm32`|`dst = dst * src`|
|4/256|IMUL_M|R|R|`src = 0`|`dst = dst * [mem]`|
|4/256|IMULH_R|R|R|`src = dst`|`dst = (dst * src) >> 64`|
|1/256|IMULH_M|R|R|`src = 0`|`dst = (dst * [mem]) >> 64`|
|4/256|ISMULH_R|R|R|`src = dst`|`dst = (dst * src) >> 64` (signed)|
|1/256|ISMULH_M|R|R|`src = 0`|`dst = (dst * [mem]) >> 64` (signed)|
|8/256|IMUL_RCP|R|-|-|<code>dst = 2<sup>x</sup> / imm32 * dst</code>|
|2/256|INEG_R|R|-|-|`dst = -dst`|
|15/256|IXOR_R|R|R|`src = imm32`|`dst = dst ^ src`|
|5/256|IXOR_M|R|R|`src = 0`|`dst = dst ^ [mem]`|
|10/256|IROR_R|R|R|`src = imm32`|`dst = dst >>> src`|
|0/256|IROL_R|R|R|`src = imm32`|`dst = dst <<< src`|
|4/256|ISWAP_R|R|R|`src = dst`|`temp = src; src = dst; dst = temp`|
#### 5.2.1 IADD_RS
This instruction adds the values of two registers (modulo 2<sup>64</sup>). The value of the second operand is shifted left by 0-3 bits (determined by the `mod.shift` flag). Additionally, if `dst` is register `r5`, the immediate value `imm32` is added to the result.
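A Python sketch of these semantics (sign extension of `imm32` is omitted for brevity); for example, `IADD_RS r1, r4, SHFT 3` in the program listing compiles to `lea r9, [r9+r12*8]`:

```python
M64 = (1 << 64) - 1

def iadd_rs(dst: int, src: int, mod_shift: int,
            imm32: int = 0, dst_is_r5: bool = False) -> int:
    result = dst + (src << mod_shift)  # second operand shifted left by 0-3 bits
    if dst_is_r5:                      # imm32 is only added when dst is r5
        result += imm32
    return result & M64                # modulo 2**64
```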
#### 5.2.2 IADD_M
64-bit integer addition operation (performed modulo 2<sup>64</sup>) with a memory source operand.
#### 5.2.3 ISUB_R, ISUB_M
64-bit integer subtraction (performed modulo 2<sup>64</sup>). ISUB_R uses register source operand, ISUB_M uses a memory source operand.
#### 5.2.4 IMUL_R, IMUL_M
64-bit integer multiplication (performed modulo 2<sup>64</sup>). IMUL_R uses a register source operand, IMUL_M uses a memory source operand.
#### 5.2.5 IMULH_R, IMULH_M, ISMULH_R, ISMULH_M
These instructions output the high 64 bits of the whole 128-bit multiplication result. The result differs for signed and unsigned multiplication (IMULH is unsigned, ISMULH is signed). The variants with a register source operand perform a squaring operation if `dst` equals `src`.
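A minimal sketch of the high-half multiplication, assuming a compiler with the `__int128` extension (the function names are ours):

```c
#include <stdint.h>

/* High 64 bits of the 128-bit product, unsigned and signed variants. */
static uint64_t imulh(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

static int64_t ismulh(int64_t a, int64_t b) {
    return (int64_t)(((__int128)a * b) >> 64);
}
```

Note that the two variants differ: for example, `imulh` of `0xFFFFFFFFFFFFFFFF` and `2` is `1`, while `ismulh` of the same bit patterns (`-1` and `2`) is `-1`.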
#### 5.2.6 IMUL_RCP
If `imm32` equals 0 or is a power of 2, IMUL_RCP is a no-op. In other cases, the instruction multiplies the destination register by a reciprocal of `imm32` (the immediate value is zero-extended and treated as unsigned). The reciprocal is calculated as <code>rcp = 2<sup>x</sup> / imm32</code> by choosing the largest integer `x` such that <code>rcp < 2<sup>64</sup></code>.
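One way to compute this reciprocal: for a divisor whose highest set bit is at position `h`, the largest valid exponent is `x = 64 + h`. A sketch using 128-bit arithmetic (our own helper name, assuming a compiler with `__int128`):

```c
#include <stdint.h>

/* floor(2^x / divisor) for the largest x such that the result < 2^64.
   Assumes divisor is neither 0 nor a power of 2 (IMUL_RCP is a no-op
   in those cases). */
static uint64_t reciprocal(uint64_t divisor) {
    unsigned h = 0;                       /* position of the highest set bit */
    for (uint64_t bit = divisor >> 1; bit > 0; bit >>= 1)
        h++;
    return (uint64_t)(((unsigned __int128)1 << (64 + h)) / divisor);
}
```

For example, a divisor of 3 gives <code>rcp = 2<sup>65</sup> / 3</code>, which lies in the interval <code>[2<sup>63</sup>, 2<sup>64</sup>)</code> as required.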
#### 5.2.7 INEG_R
Performs two's complement negation of the destination register.
#### 5.2.8 IXOR_R, IXOR_M
64-bit exclusive OR operation. IXOR_R uses a register source operand, IXOR_M uses a memory source operand.
#### 5.2.9 IROR_R, IROL_R
Performs a cyclic shift (rotation) of the destination register. Source operand (shift count) is implicitly masked to 6 bits. IROR rotates bits right, IROL left.
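The rotations can be sketched as follows (our own helper names; the branch on `c` avoids the undefined 64-bit shift when the masked count is 0):

```c
#include <stdint.h>

/* 64-bit rotate helpers; the count is masked to 6 bits as specified. */
static uint64_t iror(uint64_t v, uint64_t count) {
    unsigned c = (unsigned)(count & 63);
    return c ? (v >> c) | (v << (64 - c)) : v;
}

static uint64_t irol(uint64_t v, uint64_t count) {
    unsigned c = (unsigned)(count & 63);
    return c ? (v << c) | (v >> (64 - c)) : v;
}
```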
#### 5.2.10 ISWAP_R
This instruction swaps the values of two registers. If the source and destination refer to the same register, the instruction is a no-op.
### 5.3 Floating point instructions
For floating point instructions, the destination can be a group F or group E register. Source operand is either a group A register or a memory value.
`[mem]` indicates a memory operand loaded as an 8-byte value from the address `src + imm32` and converted according to the rules in chapters 4.3.1 (group F) or 4.3.2 (group E). The lower and upper memory operands are denoted as `[mem][0]` and `[mem][1]`.
All floating point operations are rounded according to the current value of the `fprc` register (see Table 4.3.1). Due to restrictions on the values of the floating point registers, no operation results in `NaN` or a denormal number.
*Table 5.3.1 Floating point instructions*
|frequency|instruction|dst|src|operation|
|-|-|-|-|-|
|8/256|FSWAP_R|F+E|-|`(dst0, dst1) = (dst1, dst0)`|
|20/256|FADD_R|F|A|`(dst0, dst1) = (dst0 + src0, dst1 + src1)`|
|5/256|FADD_M|F|R|`(dst0, dst1) = (dst0 + [mem][0], dst1 + [mem][1])`|
|20/256|FSUB_R|F|A|`(dst0, dst1) = (dst0 - src0, dst1 - src1)`|
|5/256|FSUB_M|F|R|`(dst0, dst1) = (dst0 - [mem][0], dst1 - [mem][1])`|
|6/256|FSCAL_R|F|-|<code>(dst0, dst1) = (-2<sup>x0</sup> * dst0, -2<sup>x1</sup> * dst1)</code>|
|20/256|FMUL_R|E|A|`(dst0, dst1) = (dst0 * src0, dst1 * src1)`|
|4/256|FDIV_M|E|R|`(dst0, dst1) = (dst0 / [mem][0], dst1 / [mem][1])`|
|6/256|FSQRT_R|E|-|`(dst0, dst1) = (√dst0, √dst1)`|
#### 5.3.1 FSWAP_R
Swaps the lower and upper halves of the destination register. This is the only instruction that is applicable to both F and E register groups.
#### 5.3.2 FADD_R, FADD_M
Double precision floating point addition. FADD_R uses a group A register source operand, FADD_M uses a memory operand.
#### 5.3.3 FSUB_R, FSUB_M
Double precision floating point subtraction. FSUB_R uses a group A register source operand, FSUB_M uses a memory operand.
#### 5.3.4 FSCAL_R
This instruction negates the number and multiplies it by <code>2<sup>x</sup></code>. `x` is calculated by taking the 4 least significant digits of the biased exponent and interpreting them as a binary number using the digit set `{+1, -1}` as opposed to the traditional `{0, 1}`. The possible values of `x` are all odd numbers from -15 to +15.
The mathematical operation described above is equivalent to a bitwise XOR of the binary representation with the value of `0x80F0000000000000`.
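The XOR equivalence can be checked directly on the IEEE 754 bit pattern (a sketch; `memcpy` is the portable way to reinterpret the bits in C):

```c
#include <stdint.h>
#include <string.h>

/* FSCAL semantics via the bitwise form: XOR the binary representation
   with 0x80F0000000000000 (flips the sign bit and the 4 low exponent bits). */
static double fscal(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits ^= 0x80F0000000000000ULL;
    memcpy(&x, &bits, sizeof bits);
    return x;
}
```

For example, `fscal(1.0)` yields <code>-2<sup>-15</sup></code>: the 4 low bits of the biased exponent of 1.0 are `1111`, which the digit set `{+1, -1}` maps to `x = -15`. Since XOR is an involution, applying FSCAL_R twice restores the original value.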
#### 5.3.5 FMUL_R
Double precision floating point multiplication. This instruction uses only a register source operand.
#### 5.3.6 FDIV_M
Double precision floating point division. This instruction uses only a memory source operand.
#### 5.3.7 FSQRT_R
Double precision floating point square root of the destination register.
### 5.4 Control instructions
There are 2 control instructions.
*Table 5.4.1 - Control instructions*
|frequency|instruction|dst|src|operation|
|-|-|-|-|-|
|1/256|CFROUND|-|R|`fprc = src >>> imm32`
|16/256|CBRANCH|R|-|`dst = dst + cimm`, conditional jump
#### 5.4.1 CFROUND
This instruction calculates a 2-bit value by rotating the source register right by `imm32` bits and taking the 2 least significant bits (the value of the source register is unaffected). The result is stored in the `fprc` register. This changes the rounding mode of all subsequent floating point instructions.
#### 5.4.2 CBRANCH
This instruction adds an immediate value `cimm` (constructed from `imm32`, see below) to the destination register and then performs a conditional jump in the Program Buffer based on the value of the destination register. The target of the jump is the instruction following the instruction where register `dst` was last modified.
At the beginning of each program iteration, all registers are considered to be unmodified. A register is considered as modified by an instruction in the following cases:
* It is the destination register of an integer instruction except IMUL_RCP and ISWAP_R.
* It is the destination register of IMUL_RCP and `imm32` is not zero or a power of 2.
* It is the source or the destination register of ISWAP_R and the destination and source registers are distinct.
* The CBRANCH instruction is considered to modify all integer registers.
If register `dst` has not been modified yet, the jump target is the first instruction in the Program Buffer.
The CBRANCH instruction performs the following steps:
1. A constant `b` is calculated as `mod.cond + RANDOMX_JUMP_OFFSET`.
1. A constant `cimm` is constructed as sign-extended `imm32` with bit `b` set to 1 and bit `b-1` set to 0 (if `b > 0`).
1. `cimm` is added to the destination register.
1. If bits `b` to `b + RANDOMX_JUMP_BITS - 1` of the destination register are zero, the jump is executed (target is the instruction following the instruction where `dst` was last modified).
Bits in immediate and register values are numbered from 0 to 63 with 0 being the least significant bit. For example, for `b = 10` and `RANDOMX_JUMP_BITS = 8`, the bits are arranged like this:
```
cimm = SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSMMMMMMMMMMMMMMMMMMMMM10MMMMMMMMM
dst = ..............................................XXXXXXXX..........
```
`S` is a copied sign bit from `imm32`. `M` denotes bits of `imm32`. Bit 9 is set to 0 and bit 10 is set to 1. This value will be added to `dst`.
The second line uses `X` to mark bits of `dst` that will be checked by the condition. If all these bits are 0 after adding `cimm`, the jump is executed.
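Under the default parameters (`RANDOMX_JUMP_BITS = 8`, `RANDOMX_JUMP_OFFSET = 8`), steps 1-2 and the condition test can be sketched as (hypothetical helper names, not the reference code):

```c
#include <stdint.h>

#define JUMP_BITS   8   /* RANDOMX_JUMP_BITS (default) */
#define JUMP_OFFSET 8   /* RANDOMX_JUMP_OFFSET (default) */

/* Build cimm: sign-extend imm32, force bit b to 1 and bit b-1 to 0. */
static uint64_t cbranch_imm(int32_t imm32, unsigned mod_cond) {
    unsigned b = mod_cond + JUMP_OFFSET;
    uint64_t cimm = (uint64_t)(int64_t)imm32;
    cimm |= 1ULL << b;
    if (b > 0)
        cimm &= ~(1ULL << (b - 1));
    return cimm;
}

/* The jump is taken when bits b .. b+JUMP_BITS-1 of dst are all zero. */
static int cbranch_taken(uint64_t dst, unsigned mod_cond) {
    unsigned b = mod_cond + JUMP_OFFSET;
    uint64_t mask = ((1ULL << JUMP_BITS) - 1) << b;
    return (dst & mask) == 0;
}
```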
The construction of the CBRANCH instruction ensures that no infinite loops are possible in the program.
### 5.5 Store instruction
There is one explicit store instruction for integer values.
`[mem]` indicates the destination is an 8-byte value at the address `dst + imm32`.
*Table 5.5.1 - Store instruction*
|frequency|instruction|dst|src|operation|
|-|-|-|-|-|
|16/256|ISTORE|R|R|`[mem] = src`
#### 5.5.1 ISTORE
This instruction stores the value of the source integer register to the memory at the address calculated from the value of the destination register. The `src` and `dst` can be the same register.
## 6. SuperscalarHash
SuperscalarHash is a custom diffusion function that was designed to burn as much power as possible using only the CPU's integer ALUs.
The input and output of SuperscalarHash are 8 integer registers `r0`-`r7`, each 64 bits wide. The output of SuperscalarHash is used to construct the Dataset (see chapter 7.3).
### 6.1 Instructions
The body of SuperscalarHash is a random sequence of instructions that can run on the Virtual Machine. SuperscalarHash uses a reduced set of only integer register-register instructions listed in Table 6.1.1. `dst` refers to the destination register, `src` to the source register.
*Table 6.1.1 - SuperscalarHash instructions*
|freq. †|instruction|Macro-ops|operation|rules|
|-|-|-|-|-|
|0.11|ISUB_R|`sub_rr`|`dst = dst - src`|`dst != src`|
|0.11|IXOR_R|`xor_rr`|`dst = dst ^ src`|`dst != src`|
|0.11|IADD_RS|`lea_sib`|`dst = dst + (src << mod.shift)`|`dst != src`, `dst != r5`
|0.22|IMUL_R|`imul_rr`|`dst = dst * src`|`dst != src`|
|0.11|IROR_C|`ror_ri`|`dst = dst >>> imm32`|`imm32 % 64 != 0`
|0.10|IADD_C|`add_ri`|`dst = dst + imm32`|
|0.10|IXOR_C|`xor_ri`|`dst = dst ^ imm32`|
|0.03|IMULH_R|`mov_rr`,`mul_r`,`mov_rr`|`dst = (dst * src) >> 64`|
|0.03|ISMULH_R|`mov_rr`,`imul_r`,`mov_rr`|`dst = (dst * src) >> 64` (signed)|
|0.06|IMUL_RCP|`mov_ri`,`imul_rr`|<code>dst = 2<sup>x</sup> / imm32 * dst</code>|`imm32 != 0`, <code>imm32 != 2<sup>N</sup></code>|
† Frequencies are approximate. Instructions are generated based on complex rules.
#### 6.1.1 ISUB_R
See chapter 5.2.3. Source and destination are always distinct registers.
#### 6.1.2 IXOR_R
See chapter 5.2.8. Source and destination are always distinct registers.
#### 6.1.3 IADD_RS
See chapter 5.2.1. Source and destination are always distinct registers and register `r5` cannot be the destination.
#### 6.1.4 IMUL_R
See chapter 5.2.4. Source and destination are always distinct registers.
#### 6.1.5 IROR_C
The destination register is rotated right. The rotation count is given by `imm32` masked to 6 bits and cannot be 0.
#### 6.1.6 IADD_C
A sign-extended `imm32` is added to the destination register.
#### 6.1.7 IXOR_C
The destination register is XORed with a sign-extended `imm32`.
#### 6.1.8 IMULH_R, ISMULH_R
See chapter 5.2.5.
#### 6.1.9 IMUL_RCP
See chapter 5.2.6. `imm32` is never 0 or a power of 2.
### 6.2 The reference CPU
Unlike a standard RandomX program, a SuperscalarHash program is generated using a strict set of rules to achieve the maximum performance on a superscalar CPU. For this purpose, the generator runs a simulation of a reference CPU.
The reference CPU is loosely based on the [Intel Ivy Bridge microarchitecture](https://en.wikipedia.org/wiki/Ivy_Bridge_(microarchitecture)). It has the following properties:
* The CPU has 3 integer execution ports P0, P1 and P5 that can execute instructions in parallel. Multiplication can run only on port P1.
* Each of the Superscalar instructions listed in Table 6.1.1 consist of one or more *Macro-ops*. Each Macro-op has certain execution latency (in cycles) and size (in bytes) as shown in Table 6.2.1.
* Each of the Macro-ops listed in Table 6.2.1 consists of 0-2 *Micro-ops* that can go to a subset of the 3 execution ports. If a Macro-op consists of 2 Micro-ops, both must be executed together.
* The CPU can decode at most 16 bytes of code per cycle and at most 4 Micro-ops per cycle.
*Table 6.2.1 - Macro-ops*
|Macro-op|latency|size|1st Micro-op|2nd Micro-op|
|-|-|-|-|-|
|`sub_rr`|1|3|P015|-|
|`xor_rr`|1|3|P015|-|
|`lea_sib`|1|4|P01|-|
|`imul_rr`|3|4|P1|-|
|`ror_ri`|1|4|P05|-|
|`add_ri`|1|7, 8, 9|P015|-|
|`xor_ri`|1|7, 8, 9|P015|-|
|`mov_rr`|0|3|-|-|
|`mul_r`|4|3|P1|P5|
|`imul_r`|4|3|P1|P5|
|`mov_ri`|1|10|P015|-|
* P015 - Micro-op can be executed on any port
* P01 - Micro-op can be executed on ports P0 or P1
* P05 - Micro-op can be executed on ports P0 or P5
* P1 - Micro-op can be executed only on port P1
* P5 - Micro-op can be executed only on port P5
Macro-ops `add_ri` and `xor_ri` can be optionally padded to a size of 8 or 9 bytes for code alignment purposes. `mov_rr` has 0 execution latency and doesn't use an execution port, but still occupies space during the decoding stage (see chapter 6.3.1).
### 6.3 CPU simulation
SuperscalarHash programs are generated to maximize the usage of all 3 execution ports of the reference CPU. The generation consists of 4 stages:
* Decoding stage
* Instruction selection
* Port assignment
* Operand assignment
Program generation is complete when one of two conditions is met:
1. An instruction is scheduled for execution on a cycle that is equal to or greater than `RANDOMX_SUPERSCALAR_LATENCY`.
1. The number of generated instructions reaches `3 * RANDOMX_SUPERSCALAR_LATENCY + 2`.
#### 6.3.1 Decoding stage
The generator produces instructions in groups of 3 or 4 Macro-op slots such that the size of each group is exactly 16 bytes.
*Table 6.3.1 - Decoder configurations*
|decoder group|configuration|
|-------------|-------------|
|0|4-8-4|
|1|7-3-3-3|
|2|3-7-3-3|
|3|4-9-3|
|4|4-4-4-4|
|5|3-3-10|
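Each configuration fills exactly one 16-byte decode window. This invariant can be checked with a small table mirroring Table 6.3.1 (a sketch; zero marks an unused slot):

```c
/* Slot sizes in bytes for decoder groups 0-5 (0 = unused slot). */
static const int decoder_slots[6][4] = {
    {4, 8, 4, 0},   /* group 0 */
    {7, 3, 3, 3},   /* group 1 */
    {3, 7, 3, 3},   /* group 2 */
    {4, 9, 3, 0},   /* group 3 */
    {4, 4, 4, 4},   /* group 4 */
    {3, 3, 10, 0},  /* group 5 */
};

/* Total bytes consumed by one decode group; always 16. */
static int group_size(int g) {
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += decoder_slots[g][i];
    return sum;
}
```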
The rules for the selection of the decoder group are as follows:
* If the currently processed instruction is IMULH_R or ISMULH_R, the next decode group is group 5 (the only group that starts with a 3-byte slot and has only 3 slots).
* If the total number of multiplications that have been generated is less than or equal to the current decoding cycle, the next decode group is group 4.
* If the currently processed instruction is IMUL_RCP, the next decode group is group 0 or 3 (must begin with a 4-byte slot for multiplication).
* Otherwise a random decode group is selected from groups 0-3.
#### 6.3.2 Instruction selection
Instructions are selected based on the size of the current decode group slot - see Table 6.3.2.
*Table 6.3.2 - Instruction selection*
|slot size|note|instructions|
|-------------|-------------|-----|
|3|-|ISUB_R, IXOR_R
|3|last slot in the group|ISUB_R, IXOR_R, IMULH_R, ISMULH_R|
|4|decode group 4, not the last slot|IMUL_R|
|4|-|IROR_C, IADD_RS|
|7,8,9|-|IADD_C, IXOR_C|
|10|-|IMUL_RCP|
#### 6.3.3 Port assignment
Micro-ops are issued to execution ports as soon as a suitable port is available. The scheduling is done optimistically by checking port availability in the order P5 -> P0 -> P1 so that port P1 (multiplication) is not overloaded by instructions that can go to any port. The cycle when all Micro-ops of an instruction can be executed is called the 'scheduleCycle'.
#### 6.3.4 Operand assignment
The source operand (if needed) is selected first. It is selected from the group of registers that are available at the 'scheduleCycle' of the instruction. A register is available if the latency of its last operation has elapsed.
The destination operand is selected with more strict rules (see column 'rules' in Table 6.1.1):
* value must be ready at the required cycle
* cannot be the same as the source register (unless the instruction allows it)
    * this avoids optimizable operations such as `reg ^ reg` or `reg - reg`
    * it also increases intermixing of register values
* register cannot be multiplied twice in a row unless `allowChainedMul` is true
    * this avoids accumulation of trailing zeroes in registers due to excessive multiplication
    * `allowChainedMul` is set to true if an attempt to find source/destination registers failed (this is quite rare, but prevents a catastrophic failure of the generator)
* either the last instruction applied to the register or its source must be different than the current instruction
    * this avoids optimizable instruction sequences such as `r1 = r1 ^ r2; r1 = r1 ^ r2` (can be eliminated) or `reg = reg >>> C1; reg = reg >>> C2` (can be reduced to one rotation) or `reg = reg + C1; reg = reg + C2` (can be reduced to one addition)
* register `r5` cannot be the destination of the IADD_RS instruction (limitation of the x86 LEA instruction)
## 7. Dataset
The Dataset is a read-only memory structure that is used during program execution (chapter 4.6.2, steps 6 and 7). The size of the Dataset is `RANDOMX_DATASET_BASE_SIZE + RANDOMX_DATASET_EXTRA_SIZE` bytes and it's divided into 64-byte 'items'.
In order to allow PoW verification with a lower amount of memory, the Dataset is constructed in two steps using an intermediate structure called the "Cache", which can be used to calculate Dataset items on the fly.
The Dataset is constructed from the key value `K`, which is an input parameter of RandomX, and the whole Dataset must be recalculated every time the key value changes. Fig. 7.1 shows the process of Dataset construction.
*Figure 7.1 - Dataset construction*
![Imgur](https://i.imgur.com/86h5SbW.png)
### 7.1 Cache construction
The key `K` is expanded into the Cache using the "memory fill" function of Argon2d with parameters according to Table 7.1.1. The key is used as the "password" field.
*Table 7.1.1 - Argon2 parameters*
|parameter|value|
|------------|--|
|parallelism|`RANDOMX_ARGON_LANES`|
|output size|0|
|memory|`RANDOMX_ARGON_MEMORY`|
|iterations|`RANDOMX_ARGON_ITERATIONS`|
|version|`0x13`|
|hash type|0 (Argon2d)|
|password|key value `K`|
|salt|`RANDOMX_ARGON_SALT`
|secret size|0|
|assoc. data size|0|
The finalizer and output calculation steps of Argon2 are omitted. The output is the filled memory array.
### 7.2 SuperscalarHash initialization
The key value `K` is used to initialize a BlakeGenerator (see chapter 3.4), which is then used to generate 8 SuperscalarHash instances for Dataset initialization.
### 7.3 Dataset block generation
Dataset items are numbered sequentially with `itemNumber` starting from 0. Each 64-byte Dataset item is generated independently using 8 SuperscalarHash functions (generated according to chapter 7.2) and by XORing randomly selected data from the Cache (constructed according to chapter 7.1).
The item data is represented by 8 64-bit integer registers: `r0`-`r7`.
1. The register values are initialized as follows (`*` = multiplication, `^` = XOR):
* `r0 = (itemNumber + 1) * 6364136223846793005`
* `r1 = r0 ^ 9298411001130361340`
* `r2 = r0 ^ 12065312585734608966`
* `r3 = r0 ^ 9306329213124626780`
* `r4 = r0 ^ 5281919268842080866`
* `r5 = r0 ^ 10536153434571861004`
* `r6 = r0 ^ 3398623926847679864`
* `r7 = r0 ^ 9549104520008361294`
1. Let `cacheIndex = itemNumber`
1. Let `i = 0`
1. Load a 64-byte item from the Cache. The item index is given by `cacheIndex` modulo the total number of 64-byte items in Cache.
1. Execute `SuperscalarHash[i](r0, r1, r2, r3, r4, r5, r6, r7)`, where `SuperscalarHash[i]` refers to the i-th SuperscalarHash function. This modifies the values of the registers `r0`-`r7`.
1. XOR all registers with the 64 bytes loaded in step 4 (8 bytes per column in order `r0`-`r7`).
1. Set `cacheIndex` to the value of the register that has the longest dependency chain in the SuperscalarHash function executed in step 5.
1. Set `i = i + 1` and go back to step 4 if `i < RANDOMX_CACHE_ACCESSES`.
1. Concatenate registers `r0`-`r7` in little endian format to get the final Dataset item data.
The constants used to initialize register values in step 1 were determined as follows:
* Multiplier `6364136223846793005` was selected because it gives an excellent distribution for linear generators (D. Knuth: The Art of Computer Programming Vol 2., also listed in [Commonly used LCG parameters](https://en.wikipedia.org/wiki/Linear_congruential_generator#Parameters_in_common_use))
* XOR constants used to initialize registers `r1`-`r7` were determined by calculating `Hash512` of the ASCII value `"RandomX SuperScalarHash initialize"` and taking bytes 8-63 as 7 little-endian unsigned 64-bit integers. Additionally, the constant for `r1` was increased by <code>2<sup>33</sup>+700</code> and the constant for `r3` was increased by <code>2<sup>14</sup></code> (these changes are necessary to ensure that all registers have unique initial values for all values of `itemNumber`).
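Step 1 of the item generation above can be sketched as follows (constants copied from the list; the remaining steps require the 8 generated SuperscalarHash functions and are omitted):

```c
#include <stdint.h>

/* Initial register values for a Dataset item (step 1 of chapter 7.3).
   Multiplication and XOR wrap modulo 2^64. */
static void init_item_registers(uint64_t itemNumber, uint64_t r[8]) {
    static const uint64_t xorConst[7] = {
        9298411001130361340ULL,  12065312585734608966ULL,
        9306329213124626780ULL,  5281919268842080866ULL,
        10536153434571861004ULL, 3398623926847679864ULL,
        9549104520008361294ULL,
    };
    r[0] = (itemNumber + 1) * 6364136223846793005ULL;
    for (int i = 0; i < 7; i++)
        r[i + 1] = r[0] ^ xorConst[i];
}
```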