tevador / RandomX
Proof of work algorithm based on random code execution
License: BSD 3-Clause "New" or "Revised" License
FYI, SoWs (statements of work) from the audit/review teams will be uploaded to https://github.com/hyc/RandomxAudits
For people looking up ERROR: Cache allocation failed while using --largePages:
dsc@dev6:~/zz$ ./benchmark --nonces 5000 --mine --largePages --init 12 --threads 12
RandomX benchmark
- full memory mode (2080 MiB)
- interpreted mode
- hardware AES mode
- large pages mode
Initializing (12 threads) ...
ERROR: Cache allocation failed
strace:
mmap(NULL, 268435456, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE|MAP_HUGETLB, -1, 0) = -1 ENOMEM (Cannot allocate memory)
futex(0x7f19d15d81a0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(1, "ERROR: Cache allocation failed\n", 31ERROR: Cache allocation failed
The mmap with MAP_HUGETLB fails, so edit /proc/sys/vm/nr_hugepages and set it to a sufficiently high number (I set mine to 1500). This fixed it:
sudo sysctl -w vm.nr_hugepages=1500
Hi @tevador, in the event RandomX fails to deter ASICs, I've got an idea to extend RandomX's lifespan even further.
Emergency failsafe / backup proposal:
How this might work:
There would be huge benefits for the entire computing industry.
I'm thinking this would help the world far more than SHA3.
Could this be done with RandomX's randomization VM engine?
Thanks,
I've been doing some portability tests and found out that the ARM instruction for conversion between integer and floating point (VCVT) uses a hardcoded rounding mode rather than the rounding mode set in the FPSCR register. Reference on page 4-22.
This causes a discrepancy between the results from x86 and ARM (confirmed with Raspberry Pi 3).
Possible solutions:
Not sure which solution is the best. Thoughts?
Hello,
Anytime I set the --largePages option, I get an error. Am I making a mistake? Thanks for your time and effort.
command to set hugepages: sysctl -w vm.nr_hugepages=128
root@DELL-R820-E5-03:/randomx/RandomX/bin# gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.1.0-10ubuntu1~16.04.york0' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.1.0 (Ubuntu 7.1.0-10ubuntu1~16.04.york0)
root@DELL-R820-E5-03:~/randomx/RandomX/bin# ./benchmark --mine --largePages
RandomX benchmark
root@DELL-R820-E5-03:~/randomx/RandomX/bin# ./benchmark --mine --jit
RandomX benchmark
root@DELL-R820-E5-03:~/randomx/RandomX/bin# ./benchmark --mine --jit --threads 40 --init 64
RandomX benchmark
root@DELL-R820-E5-03:~/randomx/RandomX/bin# ./benchmark --mine --jit --threads 40 --init 64 --nonces 1000000
RandomX benchmark
root@DELL-R820-E5-03:~/randomx/RandomX/bin# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 4
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-4640 0 @ 2.40GHz
Stepping: 7
CPU MHz: 1388.062
CPU max MHz: 2800.0000
CPU min MHz: 1200.0000
BogoMIPS: 4803.03
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60
NUMA node1 CPU(s): 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61
NUMA node2 CPU(s): 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62
NUMA node3 CPU(s): 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
root@DELL-R820-E5-03:~/randomx/RandomX/bin# ./benchmark --mine --jit --largePages --threads 40 --init 64 --nonces 1000000
RandomX benchmark
root@DELL-R820-E5-03:~/randomx/RandomX/bin# uname -a
Linux DELL-R820-E5-03 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
root@DELL-R820-E5-03:~/randomx/RandomX/bin# free -m
total used free shared buff/cache available
Mem: 15971 444 14863 9 663 15006
Swap: 16318 0 16318
root@DELL-R820-E5-03:/randomx/RandomX/bin# cat /proc/meminfo
MemTotal: 16354676 kB
MemFree: 15219988 kB
MemAvailable: 15367052 kB
Buffers: 36828 kB
Cached: 481796 kB
SwapCached: 0 kB
Active: 417548 kB
Inactive: 140912 kB
Active(anon): 42720 kB
Inactive(anon): 9168 kB
Active(file): 374828 kB
Inactive(file): 131744 kB
Unevictable: 3652 kB
Mlocked: 3652 kB
SwapTotal: 16710652 kB
SwapFree: 16710652 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 43520 kB
Mapped: 33532 kB
Shmem: 9632 kB
Slab: 160804 kB
SReclaimable: 62612 kB
SUnreclaim: 98192 kB
KernelStack: 10704 kB
PageTables: 2940 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 24756916 kB
Committed_AS: 481120 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 14336 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 128
HugePages_Free: 128
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 134252 kB
DirectMap2M: 6092800 kB
DirectMap1G: 12582912 kB
root@DELL-R820-E5-03:
alignas(64) randomx::Program program;
is a protected member of class randomx_vm:
https://github.com/tevador/RandomX/blob/master/src/virtual_machine.hpp#L54
I don't see any way of accessing it from the derived classes. This makes it impossible to inspect the RandomX code used in generating the PoW in randomx_calculate_hash.
https://github.com/tevador/RandomX/blob/master/src/randomx.cpp#L236
I think such an ability is important and I was wondering if the program variable could be exposed, for example by adding getProgram into class randomx_vm, in the following or similar form:
randomx::Program const* getProgram()
{
return &program;
};
The provisional instruction weights are as follows:
#define WT_ADD_64 10
#define WT_ADD_32 2
#define WT_SUB_64 10
#define WT_SUB_32 2
#define WT_MUL_64 21
#define WT_MULH_64 10
#define WT_MUL_32 15
#define WT_IMUL_32 15
#define WT_IMULH_64 10
#define WT_DIV_64 1
#define WT_IDIV_64 1
#define WT_AND_64 4
#define WT_AND_32 2
#define WT_OR_64 4
#define WT_OR_32 2
#define WT_XOR_64 4
#define WT_XOR_32 2
#define WT_SHL_64 3
#define WT_SHR_64 3
#define WT_SAR_64 3
#define WT_ROL_64 6
#define WT_ROR_64 6
#define WT_FPADD 20
#define WT_FPSUB 20
#define WT_FPMUL 22
#define WT_FPDIV 8
#define WT_FPSQRT 6
#define WT_FPROUND 2
#define WT_CALL 24
#define WT_RET 18
(Divide weight by 256 to get the probability that the instruction is generated.)
The rationale behind these values:
Potential issues:
Does anyone have ideas for a better instruction mix to maximize ASIC resistance?
I have pushed an interpreter for RandomX.
The aim of the interpreter is portability and correctness, not speed.
Tested on the following platforms: Windows (x86_64), Windows (x86), Linux (x86_64), ARM (armv7+neon).
Run the test as ./randomx [--lightClient] [--softAes] [numberOfPrograms=1000]
--lightClient is for a "light client" mode, which requires only ~64 MiB of memory (the full client requires >4 GiB). --softAes uses software AES.
For 1000 programs (default), the output hash must be 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
(applies to commit c9102ee).
There is also a unit test just for ALU and FPU operations: bin/AluFpuTest.
> randomx.exe
Initializing...
Dataset (4 GiB) initialized in 25.4644 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 45.9957 programs per second
> randomx.exe --softAes
Using software AES.
Initializing...
Dataset (4 GiB) initialized in 53.0843 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 45.9795 programs per second
> randomx.exe --lightClient
Initializing...
Cache (64 MiB) initialized in 1.0757 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 16.0407 programs per second
> randomx.exe --lightClient --softAes
Using software AES.
Initializing...
Cache (64 MiB) initialized in 1.01232 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 9.446 programs per second
> randomx.exe --lightClient
Initializing...
Cache (64 MiB) initialized in 2.77849 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 9.43859 programs per second
> randomx.exe --lightClient --softAes
Using software AES.
Initializing...
Cache (64 MiB) initialized in 2.78784 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 6.89343 programs per second
$ bin/randomx
Initializing...
Dataset (4 GiB) initialized in 19.8195 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 58.5224 programs per second
$ bin/randomx --softAes
Using software AES.
Initializing...
Dataset (4 GiB) initialized in 39.8167 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 58.5863 programs per second
$ bin/randomx --lightClient
Initializing...
Cache (64 MiB) initialized in 0.610684 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 21.1308 programs per second
$ bin/randomx --lightClient --softAes
Using software AES.
Initializing...
Cache (64 MiB) initialized in 0.607081 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 12.4759 programs per second
$ bin/randomx --lightClient --softAes
Using software AES.
Initializing...
Cache (64 MiB) initialized in 8.2832 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 1.32003 programs per second
Considering the ongoing discussions, I think a comparison between CNv-8, CN-R, RandomX and RandomX light block verification times could come in handy to clear up misconceptions and speed up the process. I don't know how to do CNv-8 and CN-R and compare them to the RandomX verification mode test.
Lock if deemed unnecessary.
So in the effort to reduce the gap between a consumer CPU and a custom CPU, one approach is exploiting some of the stuff baked into modern CPUs. This probably just buys us time, similar to the AES-NI of cryptonight.
For instance, does randomX do things that take advantage of
CLMUL instruction set
or any of the things on the extension list?
Since there are only 8 integer registers, there is a good chance that two memory operations will end up with the same address. This happens in almost every program. For example in the sample program, instructions 108 and 111 read exactly the same scratchpad location. Same thing happens with store operations, in which case the first ISTORE instruction can be optimized away.
Proposed solution: use imm32 as an instruction-specific displacement. The address calculation will change from:
mov eax, r8d
and eax, 16376
to (for example)
lea eax, [r8+1860630909]
and eax, 16376
This adds one ALU operation per memory instruction, which should have only a small impact on CPU performance.
Issue reported in the context of Kudelski Security's audit
The implementation does not leverage vectorized instructions. For example, on platforms supporting AVX2, a reference, portable implementation is about 40% slower than an AVX2 implementation, as reported in a Cannonlake microarchitecture benchmark from SUPERCOP.
An AVX2 implementation of BLAKE2b can be found in the SUPERCOP archive as well as in Libsodium.
An AVX512-optimized version of BLAKE2s (not BLAKE2b) is used in Wireguard.
Similar techniques may be used to optimize BLAKE2b for the AVX512 instruction set.
I made a quick and dirty test to see how this algo would work with an amd GPU.
This is how I did it:
Before giving the results, some considerations:
Now the results:
I have seen those things:
I checked the consistency of the number of executed GPU instructions:
Hope I didn't make a huge mistake, but the memory bandwidth and the number of executed instructions look ok.
In a "real" miner, I only see L1 Instruction cache as a potential limiting factor. Initially I was thinking of L2/L3, but HBM is really fast. L1 instruction cache misses will be filled from the L2 cache which is apparently not overloaded. Vega10 has also improved instruction cache.
Needs more investigation to see how the L1 instruction cache impacts performance.
I can put the code (Vulkan) in my github if you want to have a look at this prototype.
I have published a RandomX x86 benchmark. It tests all ALU and FPU VM instructions.
Download the release here: https://github.com/tevador/RandomX/releases (precompiled binaries available for Windows x64 and Linux x64).
Paste the result directly into this thread together with your CPU specs and platform.
Intel Core i5-3230M (Ivy Bridge, 2013), Windows 7 x64
operation | time [s] | (result) |
---|---|---|
NOOP | 0.65806 | 13901818124319359487 |
ADD_64 | 0.7371 | 17981554887921175522 |
ADD_32 | 0.740774 | 4040882173786131426 |
SUB_64 | 0.782932 | 2297734475728124768 |
SUB_32 | 0.749162 | 4040882150066715488 |
MUL_64 | 0.82144 | 13108172643246038454 |
MULH_64 | 0.974916 | 4528287945977545198 |
MUL_32 | 0.832865 | 3505774372494007734 |
IMUL_32 | 0.832609 | 111758659619151286 |
IMULH_64 | 0.978652 | 14335428980581912669 |
DIV_64 | 9.17839 | 4040949655748476752 |
IDIV_64 | 9.79582 | 10706389806209703347 |
AND_64 | 0.737244 | 17685105886084186988 |
AND_32 | 0.752489 | 4040880554977510252 |
OR_64 | 0.740858 | 3708075074809940201 |
OR_32 | 0.754382 | 4040882048943442153 |
XOR_64 | 0.737324 | 11373760495026296796 |
XOR_32 | 0.751642 | 4040882115245103068 |
SHL_64 | 0.977999 | 13807543856509042205 |
SHR_64 | 1.00519 | 18303742120199201659 |
SAR_64 | 1.00618 | 13476444829495768773 |
ROR_64 | 0.966894 | 10563782527368126285 |
ROL_64 | 0.966994 | 15157809728722651568 |
FNOOP | 1.1203 | 11251318250348714849 |
FADD | 1.2373 | 14977738299044271065 |
FSUB | 1.23676 | 16485221503511343178 |
FMUL | 1.23883 | 5358969898691635935 |
FDIV | 4.37206 | 4466393864116852366 |
FSQRT | 4.33514 | 3437575554930583295 |
FROUND | 2.01637 | 11251318250348714849 |
Intel Core i3-3220 (Ivy Bridge, 2012), Ubuntu 16.04 x64
operation | time [s] | (result) |
---|---|---|
NOOP | 0.516929 | 13901818124319359487 |
ADD_64 | 0.638101 | 17981554887921175522 |
ADD_32 | 0.745153 | 4040882173786131426 |
SUB_64 | 0.592502 | 2297734475728124768 |
SUB_32 | 0.590001 | 4040882150066715488 |
MUL_64 | 0.638484 | 13108172643246038454 |
MULH_64 | 0.851473 | 4528287945977545198 |
MUL_32 | 0.607774 | 3505774372494007734 |
IMUL_32 | 0.607894 | 111758659619151286 |
IMULH_64 | 0.851494 | 14335428980581912669 |
DIV_64 | 6.97456 | 4040949655748476752 |
IDIV_64 | 7.7931 | 10706389806209703347 |
AND_64 | 0.638668 | 17685105886084186988 |
AND_32 | 0.744427 | 4040880554977510252 |
OR_64 | 0.638064 | 3708075074809940201 |
OR_32 | 0.744875 | 4040882048943442153 |
XOR_64 | 0.638116 | 11373760495026296796 |
XOR_32 | 0.744428 | 4040882115245103068 |
SHL_64 | 0.738194 | 13807543856509042205 |
SHR_64 | 0.729493 | 18303742120199201659 |
SAR_64 | 0.729931 | 13476444829495768773 |
ROR_64 | 0.729517 | 10563782527368126285 |
ROL_64 | 0.729536 | 15157809728722651568 |
FNOOP | 0.733196 | 11251318250348714849 |
FADD | 0.76531 | 14977738299044271065 |
FSUB | 0.765336 | 16485221503511343178 |
FMUL | 0.804609 | 5358969898691635935 |
FDIV | 3.43021 | 4466393864116852366 |
FSQRT | 3.41546 | 3437575554930583295 |
FROUND | 1.54395 | 11251318250348714849 |
AMD Sempron 2800+ (Palermo, 2004), Ubuntu 18.04 x64 (One of the oldest CPUs you can run this benchmark on)
operation | time [s] | (result) |
---|---|---|
NOOP | 1.37092 | 13901818124319359487 |
ADD_64 | 1.62194 | 17981554887921175522 |
ADD_32 | 1.79334 | 4040882173786131426 |
SUB_64 | 1.6724 | 2297734475728124768 |
SUB_32 | 1.4496 | 4040882150066715488 |
MUL_64 | 1.74611 | 13108172643246038454 |
MULH_64 | 2.78505 | 4528287945977545198 |
MUL_32 | 1.97347 | 3505774372494007734 |
IMUL_32 | 1.90647 | 111758659619151286 |
IMULH_64 | 2.48499 | 14335428980581912669 |
DIV_64 | 36.962 | 4040949655748476752 |
IDIV_64 | 37.9536 | 10706389806209703347 |
AND_64 | 1.84943 | 17685105886084186988 |
AND_32 | 1.74499 | 4040880554977510252 |
OR_64 | 1.62137 | 3708075074809940201 |
OR_32 | 1.74484 | 4040882048943442153 |
XOR_64 | 1.61982 | 11373760495026296796 |
XOR_32 | 2.20235 | 4040882115245103068 |
SHL_64 | 1.62928 | 13807543856509042205 |
SHR_64 | 1.63084 | 18303742120199201659 |
SAR_64 | 1.62931 | 13476444829495768773 |
ROR_64 | 1.63083 | 10563782527368126285 |
ROL_64 | 1.62921 | 15157809728722651568 |
FNOOP | 2.39553 | 11251318250348714849 |
FADD | 2.79056 | 14977738299044271065 |
FSUB | 2.64978 | 16485221503511343178 |
FMUL | 2.65083 | 5358969898691635935 |
FDIV | 8.97491 | 4466393864116852366 |
FSQRT | 15.2086 | 3437575554930583295 |
FROUND | 7.25179 | 11251318250348714849 |
https://arxiv.org/abs/1902.00112
"HashCore: Proof-of-Work Functions for General Purpose Processors"
posted by sarang in #monero-pow
Please post:
Precompiled binaries are available for Windows 64-bit and Ubuntu/Debian x86-64. Download the latest release here: https://github.com/tevador/RandomX/releases
Run as:
randomx --verify --softAes
- --jit option: 2-3 times faster verification.
- --softAes option: a small increase in performance.
- --largePages option: a small increase in performance.

Mining mode is currently only supported on 64-bit x86 CPUs. It requires at least 2.25 GiB of RAM.
Run as:
randomx --mine --init Q --threads T --nonces N
- Set Q (the number of initialization threads) equal to the number of hardware threads of your CPU.
- Use the T (number of mining threads) which produces the highest hashrate. A starting point is 1 thread per 2 MB of L3 cache, but some CPUs can benefit from running more threads, while some CPUs cannot run more than 1 thread per core efficiently, depending on other factors such as L1/L2 cache sizes.
- Set N (the number of nonces) to 10000, 100000 or 1000000 depending on the performance of your system. Aim for at least a 60-second mining period for accurate results.
- With the --softAes option, your mining performance will be about 40% lower.
- Use the --largePages option to get a significant increase in performance (up to 25%).

At first I tried to manually compile it using the code below:
sudo apt-get update
sudo apt-get install build-essentials
git clone https://github.com/tevador/RandomX.git
cd RandomX
cmake .
make
cd bin
./benchmark
This has resulted in
ERROR: Dataset allocation failed
Then trying it a second time resulted in
ERROR: Cache allocation failed
So I tried to install the latest precompiled release which gave me this
./benchmark: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by ./benchmark)
And I fixed that using sudo apt-get install --only-upgrade libstdc++6
Then it started giving me
ERROR: Dataset allocation failed
again.
Not sure what's going on here.
Modern Linux distros now compile with -fPIE enabled by default. Linking fails because of absolute references:
/usr/bin/ld: obj/JitCompilerX86-static.o: relocation R_X86_64_32S against `.text' can not be used when making a PIE object; recompile with -fPIC
Adding -no-pie to CFLAGS, CCFLAGS, and LDFLAGS gets around this for now. Is that what we actually want, since PIC is slower?
Floating point instructions FPADD and FPSUB become a no-op if the second operand is larger than about 1e+25 (the mantissa is not big enough to handle adding a small and a very big number).
This happens in about 22% of cases, and in about 7% of cases, both vector values (the whole register) are unchanged. An ASIC could mark registers with large values and optimize out FPADD and FPSUB instructions that use such a register.
Possible solution: separate source and destination registers for FPADD/FPSUB and FPMUL/FPDIV. Large values will be possible only in the second group of registers, so all floating point operations will have an effect.
The current CMakeLists.txt has some issues which make randomx difficult to use as a module in other projects.
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/x86intrin.h:43:0,
from /root/RandomX/src/intrin_portable.h:89,
from /root/RandomX/src/soft_aes.h:32,
from /root/RandomX/src/aes_hash.cpp:29:
/usr/lib/gcc/x86_64-linux-gnu/5/include/wmmintrin.h: In function 'rx_vec_i128 aesenc(rx_vec_i128, rx_vec_i128) [with bool soft = false; rx_vec_i128 = __vector(2) long long int]':
/usr/lib/gcc/x86_64-linux-gnu/5/include/wmmintrin.h:61:1: error: inlining failed in call to always_inline '__m128i _mm_aesenc_si128(__m128i, __m128i)': target specific option mismatch
_mm_aesenc_si128 (__m128i __X, __m128i __Y)
^
In file included from /root/RandomX/src/aes_hash.cpp:29:0:
/root/RandomX/src/soft_aes.h:40:65: error: called from here
return soft ? soft_aesenc(in, key) : rx_aesenc_vec_i128(in, key);
The error is due to the fact that ARCH_ID is not set, as this variable is specific to Monero. Maybe as a fallback, if not explicitly set by the parent project, it could be set to ${CMAKE_HOST_SYSTEM_PROCESSOR}?
Also, there is no cmake_minimum_required provided, which results in warnings when building randomx using cmake.
Possible updated version of the CMakeLists.txt is here:
https://github.com/moneroexamples/RandomX/blob/update_cmake/CMakeLists.txt
The cmake_minimum_required provided is the same as for Monero.
With the changes, randomx can also be built using cmake:
cd RandomX
mkdir build && cd build
cmake ..
make
Issue reported in the context of Kudelski Security's audit
The design document states:
RandomX uses all primitive integer operations that preserve entropy: addition (IADD_RS, IADD_M), subtraction (ISUB_R, ISUB_M, INEG_R), multiplication (IMUL_R, IMUL_M, IMULH_R, IMULH_M, ISMULH_R, ISMULH_M, IMUL_RCP), exclusive or (IXOR_R, IXOR_M) and rotation (IROR_R, IROL_R).
Here "preserve entropy" is probably meant as "is invertible if one of the operands is fixed". However, this is not the case for multiplication. I suggest clarifying that part.
My primary concern with randomX and its predecessor CNVM is that the machine specification itself is too simple, and can be implemented in a very minimally equipped CPU. I also still believe that forcing execution of a language processor (i.e., a compiler or interpreter) is a more extensive "CPU-like" task and we should still be leveraging that.
Our main problem with randomJS is that if you can randomly generate a syntactically correct program source code, you can just take the generated AST and execute it directly, instead of bothering to execute the JS.
So, instead, what if we combine these two approaches - we specify a language and a reference compiler implementation that compiles the given language into bytecode. So a valid implementation must generate identical bytecode for a given source text. And then, we feed the bytecode into randomX - which means the generated AST has nothing to do with the resulting execution, so it can't be used as a shortcut. I.e., the bytecode and the randomX instruction set are completely independent entities. This is just a long-winded way of using a compiler to generate the random bytes that randomX will execute.
Currently, the floating point scheduler is empty about 50% of the time (measured with Ryzen 1700). This means the FPU is underutilized.
Possible solutions:
The compiled VM will crash on macOS when using more than 1 thread. This is because macOS has a reduced stack size of 512 KB for pthreads, while RandomX requires a minimum of 8 MB.
The condition register of the CBRANCH instruction is selected to be the least recently modified register. This maximizes the length of the jump, but has an unwanted side effect by allowing the condition to be evaluated in advance.
I made some simulations to estimate the benefits of speculative execution in RandomX and it turns out that in a large number of cases, speculation is not necessary because the condition result is already known by the time execution arrives at the branch.
Example program: https://pastebin.com/hdSJyuUF
The number in front of each instruction is the cycle when the result of the instruction is known. Execution was simulated as a non-speculative out-of-order ASIC with a 3-cycle pipeline and 4 execution ports (notice there is at least a 3-cycle delay between each CBRANCH instruction and the next instruction).
Here the non-speculative ASIC needed 71 cycles to execute this program. A speculative ASIC would reduce this to 64 cycles.
I propose to remove the complex register selection algorithm in the CBRANCH instruction and instead simply use the existing dst bit field to select the condition register. The branch target will still be the instruction following the one where the condition register was last modified.
This change increases the performance gap between a speculative and non-speculative ASIC:
| | current | proposed |
|---|---|---|
non-speculative | 76.3 cycles | 84.4 cycles |
speculative | 64.7 cycles | 64.8 cycles |
spec-ex advantage | 18% | 30% |
(table lists the average execution time of 10000 random programs)
The rationale for this change is that non-speculative ASIC designs are more likely due to their simplicity and lower power consumption. Most high-performance CPUs utilize speculative execution, so this change would increase the ASIC resistance of RandomX.
Side effects of this change:
Memory operands are loaded as 8-byte values from the address indicated by src. The 8 byte value is interpreted as two 32-bit signed integers and implicitly converted to floating point format
The 64-bit FP format has 52 bits of precision, so the low 20 bits of all FP registers will be 0 right after loading. An ASIC can have optimizations for this, especially for read-only FP registers. We need to load 64-bit signed integers and convert them to double precision FP with the current rounding mode to fill all bits of the FP registers.
I came across this discussion on the Grin forum: https://www.grin-forum.org/t/obelisk-grn1-chip-details/4571
The conversation contains a lot of interesting information. The main takeaway is that ASIC manufacturers prefer single-chip designs even if that may not seem like the best approach.
For RandomX, I see two possible options for how an efficient ASIC might look:
I did a ballpark estimate of the performance and power consumption of the first option (single-die RandomX ASIC). These are my assumptions, mainly based on the information from the Grin discussion:
I think some of the above assumptions are rather optimistic, but I'm interested in the best possible scenario here.
First, let's estimate the time to calculate 1 hash. This requires:
One program iteration consist of:
This gives 311 cycles for the program execution plus 704 cycles to calculate the dataset item. Overall, 1015 cycles per iteration, which is about 1 μs.
Total time per hash is about 16.7 ms, which means exactly 60 H/s per scratchpad. We can see that 2/3 of the time, the core is waiting for the dataset item to be calculated, so we can have 1 core per 3 scratchpads to keep all cores busy 100% of the time.
To max-out our 800 mm2 die, we will have:
Final performance of this chip is 249*60 = 15 KH/s.
Power consumption estimate is 754*0.1 + 50*0.4 = 95 W.
Power efficiency is about 150-160 H/J. While this is about 3x more than a CPU, I would consider this as the upper bound of ASIC efficiency given the above assumptions may not be feasible in practice.
Are you planning to write the ARM support or do we need someone to do this?
It's not clear to me how much needs to be ported at the moment.
It would seem this code is presently licensed under GPLv3, which would require releasing any project including RandomX, such as Monero, under a GPLv3 license.
EDIT: This design is outdated and no longer applies to the current RandomX version.
A similar idea was originally proposed by @cjdelisle for a GPU miner, but I think it's more applicable to an ASIC design.
RandomX ASIC miner:
During dataset expansion, the SRAM is used to store the 64 MiB cache, while dataset blocks are loaded into HBM memory.
When mining, the ASIC runs 256 programs in parallel.
Memory allocation:
Instructions are loaded into the decoder/scheduler core (each core has its own program buffer, program counter and register file). The scheduler cores handle only CALL and RET instructions and pass the rest to one of the 28 worker cores.
Each of the 28 worker cores implements exactly one RandomX instruction pipelined for throughput which matches the instruction weight (for example MUL_64 worker can handle 21 times more instructions per clock than DIV_64 worker). Each worker has an instruction queue fed by the scheduler cores.
The speed of the decoder/scheduler cores would be designed to keep every worker core 100% utilized.
Some complications of the design:
There could be also some dedicated load/store ports for loading instruction operands.
If limited only by HBM bandwidth, this ASIC could do around 120 000 programs per second (480 GiB/s memory read rate), or roughly the same as 30 Ryzen 1700 CPUs. This assumes that sufficient compute power can fit on a single chip. If we estimate power consumption at 300 W (= Vega 64), this ASIC would be around 9 times more power efficient than a CPU.
Improved estimate: #11 (comment)
Disclaimer: I'm not an ASIC designer.
What's the point in having them? ISDIV_C compiles to a lot of code:
; ISDIV_C r4, -692911499
mov rax, -893288710803585809
imul r12
xor eax, eax
sar rdx, 25
sets al
add rdx, rax
add r12, rdx
And it can clearly be made faster on an ASIC. Since it's essentially multiplication by a constant, maybe replace these two instructions with an explicit multiplication by a 64-bit constant? Then we won't need to handle all the edge cases and the code would become something like this:
; IMUL_C r4, -893288710803585809
mov rax, -893288710803585809
imul r12
add r12, rdx
P.S. Since constants are only 32-bit, we can define this instruction as multiplication by reciprocal without further edge case handling:
a *= (2^64-1)/C if C != 0
a *= some fixed 64-bit value if C = 0
Issue reported in the context of Kudelski Security's audit
Collisions and preimages for AESHash1R can be found instantaneously, since the message is input as 16-byte blocks XORed once with part of the state. (The attack is straightforward given the description of AESHash1R.)
This cryptographic insecurity is likely understood by the authors; we nonetheless believe that AESHash1R should be a secure hash function, for otherwise one can easily determine Scratchpad values that map to a given hash. It may not be exploitable, but why take the risk?
Furthermore, in our understanding this part of the PoW is not performance critical, so the BLAKE2b-based hash may be used directly instead.
Issue reported in the context of Kudelski Security's audit
AESGenerator4R processes each 16-byte chunk of the generator's state independently, in such a way that if `state0 == state2`, then the transformed values will be identical as well (same for states 1 and 3).
The designers may want to avoid this property by using a mode similar to CTR or CBC. If CBC is used, CBC decryption might be preferable to the encryption mode, because decryption is parallelizable whereas encryption is not, which allows exploiting, for example, Zen's two parallel AES units.
I tried to compile RandomX with clang by changing the CC and CXX variables in the makefile.
With clang-5, I got a compile error:
```
src/asm/program_prologue_linux.inc:35:28: error: cannot use base register with variable reference
        movapd xmm13, xmmword ptr mantissaMask[rip]
```
With clang-7, a linker error:

```
obj/argon2_core.o: file not recognized: file format not recognized
clang-7: error: linker command failed with exit code 1 (use -v to see invocation)
```
gcc 7.4.0 can compile it with no problem.
Update:
I tried clang-6 and got the same linker error as with clang-7.
Issue reported in the context of Kudelski Security's audit
Division by zero occurs here if `divisor` is zero:

Lines 46 to 50 in 1276d67

I suggest checking for `divisor == 0` and returning an appropriate value.
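The suggested guard could look like this (a minimal sketch; returning the dividend for a zero divisor is my placeholder for "an appropriate value", and the `INT64_MIN / -1` overflow case is worth guarding at the same time, since it is undefined behavior in C):

```c
#include <stdint.h>

/* Guarded signed division. The divisor == 0 fallback is a placeholder,
 * and INT64_MIN / -1 is intercepted because it overflows (UB in C). */
static int64_t safe_div(int64_t dividend, int64_t divisor) {
    if (divisor == 0)
        return dividend;                    /* placeholder fallback value */
    if (dividend == INT64_MIN && divisor == -1)
        return dividend;                    /* quotient would overflow */
    return dividend / divisor;
}
```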
Currently, an average program contains 4 integer division instructions (out of 512).
I'm planning to increase it to 32 division instructions. 28 would be divisions by a runtime constant and 4 divisions by a runtime variable.
Integer division by a runtime constant can be optimized into a multiplication by a fixed-point reciprocal.
I have implemented this optimization in the JIT compiler and there is no measurable performance drop with 32 divisions per program. Without this optimization, performance drops by around 10% (Intel Sandy Bridge).
This attempts to force an ASIC to do some sort of compilation step or face 8 times more division instructions.
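The reciprocal optimization can be sketched generically as follows. This is the textbook technique, not the actual code in `reciprocal.c`, and it assumes `divisor >= 2`: precompute floor(2^64 / divisor) once per program, then replace each division with a high multiply plus a small upward correction.

```c
#include <stdint.h>

/* Division by a runtime constant via a fixed-point reciprocal
 * (a generic sketch, not RandomX's reciprocal.c). Needs divisor >= 2. */
static uint64_t reciprocal64(uint64_t divisor) {
    /* floor(2^64 / divisor), computed with 128-bit arithmetic */
    return (uint64_t)(((unsigned __int128)1 << 64) / divisor);
}

static uint64_t divide_by_reciprocal(uint64_t x, uint64_t divisor, uint64_t rcp) {
    /* the estimate never exceeds x / divisor, so only upward fixes are needed */
    uint64_t q = (uint64_t)(((unsigned __int128)x * rcp) >> 64);
    uint64_t rem = x - q * divisor;
    while (rem >= divisor) { q++; rem -= divisor; }
    return q;
}
```

The precomputation runs once per program, and each division then costs one multiply plus a cheap correction, which is consistent with the JIT showing no measurable drop at 32 divisions per program.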
The most common critique of any effort to create an ASIC-proof PoW is that any PoW can be made into an ASIC. The general thought has been to design the PoW such that the eventual ASIC is no more cost-efficient than consumer-grade hardware.
What if instead we consider design choices that would make a potential ASIC useful for other applications? This would address the real-world side of the ASIC problem: manufacturing centralization and distribution.
Basically, a CPU is going to run RandomX at 100%. A RandomX ASIC could possibly run RandomX at 200%. But what if a RandomX co-processor could be integrated with commodity CPUs to make the final consumer product (some next-gen CPU or motherboard) perform consumer-grade tasks at 170% of a standard CPU?
It's effectively the same model as integrated graphics.
So instead of an ASIC manufacturer only having the option of making RandomX ASICs to mine Monero, the same circuitry (or IP) could be packaged (or sold) for other purposes. This would get us to commoditization faster, because what is more profitable if you stumble across a genius design to hack RandomX: printing a bunch of hardware and setting up mines, or just selling the IP to Intel or AMD? From what I know, these companies are struggling to find ways to make new CPUs better and faster.
For instance, this type of design choice may re-open #2 , because any advances in branch prediction would provide a boon for any situation where a CPU is tasked with running an operating system, or anything (if I understand things correctly).
As a first step towards a fully compiled VM, I have added an assembly generator for RandomX programs. At the moment, it's available only for 64-bit Windows.
A sample program is located in https://github.com/tevador/RandomX/blob/master/src/program.inc
Assembly code can be generated with `randomx --genAsm [nonce]`.
Initial tests show that the compiled VM is approximately 5-6 times faster than the interpreter (doesn't include compilation time). Compiled program size is ~26 KiB, so it fits into L1I cache.
TODO/FIX: Output from the compiled VM doesn't match the output of the reference interpreter.
starting with:
```
randomx/src/intrin_portable.h:25:1: error: unknown type name 'constexpr'
constexpr int32_t unsigned32ToSigned2sCompl(uint32_t x) {
```
Part of the build script for wownerujo is:
https://github.com/wownero/wownerujo/blob/master/script/build-external-libs/build-wownero.sh
Monerujo will presumably run into the same build error.
Currently, branches are taken or not taken randomly. This is not ideal for CPUs, as they rely heavily on branch prediction.
Conditions for branching instructions should be designed such that they produce random repeating patterns.
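One way such a condition could look (a hypothetical sketch of my own, not the scheme RandomX actually adopted): branch only when some masked register bits equal a per-instruction constant, so that as the register value evolves, the outcomes form a mostly-not-taken repeating pattern that predictors handle well.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical branch condition: taken only when the low `bits` bits
 * of the register match the instruction's immediate. With bits = 8,
 * a steadily changing register takes the branch roughly once per 256
 * evaluations, a pattern that branch predictors learn easily. */
static bool branch_taken(uint64_t reg, uint32_t imm, unsigned bits) {
    uint64_t mask = (1ULL << bits) - 1;
    return (reg & mask) == (imm & mask);
}
```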
Some parts of the documentation (README.md, design.md, dataset.md, vm.md) and the code (main.cpp, dataset.cpp) indicate that the dataset has a size of 4 GiB, but if I'm not mistaken, it is now 2 GiB.
MacOS Mojave, Apple LLVM version 10.0.0
Compilation error on both master@6e8c83fd and dev@6b344b81:
```
$ make
cc -march=native -O3 -flto -c src/argon2_core.c -o obj/argon2_core.o
cc -march=native -O3 -flto -c src/argon2_ref.c -o obj/argon2_ref.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/AssemblyGeneratorX86.cpp -o obj/AssemblyGeneratorX86.o
cc -march=native -O3 -flto -c src/blake2/blake2b.c -o obj/blake2b.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/CompiledVirtualMachine.cpp -o obj/CompiledVirtualMachine.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/dataset.cpp -o obj/dataset.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/JitCompilerX86.cpp -o obj/JitCompilerX86.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/instructionsPortable.cpp -o obj/instructionsPortable.o
src/instructionsPortable.cpp:22:26: warning: expected 'ON' or 'OFF' or 'DEFAULT' in pragma [-Wunknown-pragmas]
#pragma STDC FENV_ACCESS on
                         ^
1 warning generated.
c++ -std=c++11 -maes -march=native -O3 -flto -c src/Instruction.cpp -o obj/Instruction.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/InterpretedVirtualMachine.cpp -o obj/InterpretedVirtualMachine.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/main.cpp -o obj/main.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/Program.cpp -o obj/Program.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/softAes.cpp -o obj/softAes.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/VirtualMachine.cpp -o obj/VirtualMachine.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/Cache.cpp -o obj/Cache.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/virtualMemory.cpp -o obj/virtualMemory.o
cc -march=native -O3 -flto -c src/reciprocal.c -o obj/reciprocal.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/LightClientAsyncWorker.cpp -o obj/LightClientAsyncWorker.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/hashAes1Rx4.cpp -o obj/hashAes1Rx4.o
c++ -x assembler-with-cpp -c src/JitCompilerX86-static.S -o obj/JitCompilerX86-static.o
src/JitCompilerX86-static.S:41:8: error: invalid alignment value
.align 64
       ^
src/JitCompilerX86-static.S:45:8: error: invalid alignment value
.align 64
       ^
src/JitCompilerX86-static.S:48:8: error: invalid alignment value
.align 64
       ^
src/JitCompilerX86-static.S:67:8: error: invalid alignment value
.align 64
       ^
src/JitCompilerX86-static.S:71:8: error: invalid alignment value
.align 64
       ^
make: *** [obj/JitCompilerX86-static.o] Error 1
```
On a quad-socket server (4x Xeon E5-4640), a single process with 64 threads maxes out at about 3700 H/s.
Running 4 processes at once (one per physical CPU) with 8 threads each, I get about 2525 H/s per process, for a total of 10100 H/s.
Since real mining software will almost certainly have NUMA support, it would probably be good to implement it here so people get a more accurate idea of actual mining hashrates.
https://github.com/tevador/RandomX
@timolson @hyc Can you review/comment on the initial draft?
There is still a lot of work to do, but it should be a lot simpler than RandomJS to implement and audit while being at least equally ASIC resistant.
Steps to reproduce:
```cpp
#include "src/randomx.h"
#include <iostream>
#include <iomanip>

int main() {
    const char myKey[] = "RandomX example key";
    const char myInput[] = "RandomX example input";
    char hash1[RANDOMX_HASH_SIZE];
    char hash2[RANDOMX_HASH_SIZE];

    randomx_cache *myCache1 = randomx_alloc_cache((randomx_flags)(RANDOMX_FLAG_JIT));
    if (myCache1 == nullptr) {
        std::cout << "Cache allocation failed" << std::endl;
        return 1;
    }
    randomx_init_cache(myCache1, myKey, sizeof myKey);
    randomx_vm *myMachine1 = randomx_create_vm((randomx_flags)(RANDOMX_FLAG_JIT), myCache1, nullptr);
    if (myMachine1 == nullptr) {
        std::cout << "Failed to create a virtual machine" << std::endl;
        return 1;
    }
    randomx_calculate_hash(myMachine1, &myInput, sizeof myInput, hash1);
    randomx_release_cache(myCache1);
    randomx_destroy_vm(myMachine1);

    randomx_cache *myCache2 = randomx_alloc_cache((randomx_flags)(RANDOMX_FLAG_DEFAULT));
    if (myCache2 == nullptr) {
        std::cout << "Cache allocation failed" << std::endl;
        return 1;
    }
    randomx_init_cache(myCache2, myKey, sizeof myKey);
    randomx_vm *myMachine2 = randomx_create_vm((randomx_flags)(RANDOMX_FLAG_DEFAULT), myCache2, nullptr);
    if (myMachine2 == nullptr) {
        std::cout << "Failed to create a virtual machine" << std::endl;
        return 1;
    }
    randomx_calculate_hash(myMachine2, &myInput, sizeof myInput, hash2);
    randomx_release_cache(myCache2);
    randomx_destroy_vm(myMachine2);

    for (unsigned i = 0; i < RANDOMX_HASH_SIZE; ++i)
        std::cout << std::setfill('0') << std::setw(2) << std::hex << ((int)hash1[i] & 0xff);
    std::cout << std::endl;
    for (unsigned i = 0; i < RANDOMX_HASH_SIZE; ++i)
        std::cout << std::setfill('0') << std::setw(2) << std::hex << ((int)hash2[i] & 0xff);
    std::cout << std::endl;
    return 0;
}
```
I've noticed this not just with the JIT flag; I believe all flags create this problem.
Because of the 4 GB requirement both webmining and botnets are hindered. Botnets either don't have the necessary RAM or it will be too disruptive to the machine performance, thus harder to stay undetected.
How much do we know about the hardware capabilities of existing botnets? What systems tend to get infected? A lot of computers out there have 4 GB of RAM or less; even with no OS or apps of any kind, they would be hosed by the 4 GB requirement. What kind of apps do 8 and 16 GB machines generally run, and how disruptive is 4 GB of extra use, especially on 8 GB machines?
Will a purpose built CPU miner want dual channel RAM? With good timings?
You'll go 2x4 GB. Windows 10 with nothing else running should not use more than 1.5 GB (with free RAM, Win 10 will allocate it even when not necessarily needed); maybe you can even strip it down or go with some kind of Linux solution (a Windows license is an unnecessary cost, and you shouldn't pirate). Would 6 GB + 1.5 GB + 500 MB free be cutting it too close?
Does the 4 GB requirement also take into consideration people mining with their PC? 4 GB machines are screwed anyway, and 6 GB is pretty much nonexistent. Looking at the Steam hardware survey: https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam
8GB 37.90%
16+GB ~35%
Is mining with an 8 GB machine while it is not idle even viable with 4 GB of miner RAM usage? If you have spare RAM, how many CPU cores can you reasonably spare? Pretty much all of the 4-core processors have no HT; it used to be an i7 K-series exclusive feature (now only i9, with more cores).
The idea would be to not make it more expensive to build a bare-bones miner (8 GB is needed anyway, so should we waste RAM?) while possibly making it harder for infected machines to mine and stay undetected.
Do we know what kind of performance impact dual- vs. single-channel RAM has, and also different frequencies and timings? Can we make it favor faster RAM?
Will you finish fixing the big endian problems?
I want to start porting the JIT to ppc64le, and ppc64be too if you fix it, so I'd like to know if RandomX will stay little endian only.
Machine code of a RandomX instruction:
This ADD_64 instruction needs a total of 60 bytes of code, of which only one 3-byte instruction is the actual addition operation.
The overhead code consists of:

- instructions at `bd0`, `bd2`: required because control flow instructions like JUMP/CALL make the sequence of instructions unpredictable
- instructions at `bd8`, `bdf`, `bec`
- instructions at `be2`, `be5`, `be7`
- instructions at `bf8`-`c03`

This overhead creates a lot of unavoidable dependency chains, which limit the throughput to ~2 instructions per clock.
I've been thinking about a different structure of the program, which somewhat resembles CryptonightR by @SChernykh.
The program would be a loop exactly long enough to fit into the uOP cache of new CPUs (about 1000 instructions in total). Reading from the scratchpad would be done at the beginning of the loop - all 16 registers would be loaded at once. Then only register-register instructions would be executed (long chain) and at the end, registers would be written to the scratchpad and the loop would continue from the beginning.
Since the instruction reorder buffer of x86 CPUs is >100 instructions, they could essentially eliminate all dependency chains in the loop body and execute close to 4 instructions per clock.
Initial tests show that we could generate and execute 16 chained programs with this structure and still get a performance boost over the old scheme.
Advantages:
Disadvantages:
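The loop structure described above can be sketched as toy C (all names, sizes, and the instruction mix here are mine; only the load-all / register-only chain / store-all / repeat shape comes from the proposal):

```c
#include <stdint.h>

#define REGS 16        /* "all 16 registers would be loaded at once" */
#define LOOP_INSTR 1000
#define ITERATIONS 8   /* placeholder iteration count */

/* Toy sketch of a loop-structured program: the whole loop body fits in
 * the uOP cache, so the CPU only decodes it once. */
static void run_program(uint64_t *scratchpad, uint64_t spWords) {
    uint64_t r[REGS];
    uint64_t addr = 0;
    for (int iter = 0; iter < ITERATIONS; iter++) {
        /* 1. load all registers at the top of the loop */
        for (int i = 0; i < REGS; i++)
            r[i] = scratchpad[(addr + i) % spWords];
        /* 2. a long chain of register-register instructions only */
        for (int i = 0; i < LOOP_INSTR; i++)
            r[i % REGS] += r[(i + 5) % REGS] ^ (UINT64_C(0x9E3779B97F4A7C15) * (uint64_t)(i + 1));
        /* 3. write registers back, then continue from the top */
        for (int i = 0; i < REGS; i++)
            scratchpad[(addr + i) % spWords] = r[i];
        addr = (addr + REGS) % spWords;
    }
}
```

The out-of-order window then spans many loop iterations of independent register chains, which is what lets the CPU approach 4 instructions per clock.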
Currently, the divisor used in the FDIV_M instruction only has a 30-bit mantissa. This means an ASIC could use a simpler divider for this instruction than if the full 52-bit precision were used.
Solution: Fill the lowest 22 bits of the mantissa with a program-specific bit pattern.