progpow's Issues

22 Keccak rounds instead of 24?

Unless I am missing something, keccak_f800 performs only 22 permutation rounds, whereas there should be 24. Round constants with indices 22 and 23 never get applied.
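
For reference, a minimal sketch of the permutation loop in question (function and table names assumed from the ProgPoW sources): with a 24-entry round-constant table, the last two constants are never reached.

    // keccak_f800 as described above: 22 applications of the round function,
    // so entries 22 and 23 of a 24-entry keccakf_rndc[] table go unused.
    void keccak_f800(uint32_t st[25])
    {
        for (int r = 0; r < 22; r++)
            keccak_f800_round(st, r);
    }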

Random math is not uniformly distributed

The math operation selection is based on r % 11, which is not uniformly distributed. Because (2^32 - 1) % 11 == 3, residues 0 through 3 each occur one extra time across the 2^32 possible values of r, so the first 4 math operations have a higher probability of being selected.

Solution: extend the operations to 16 or trim to 8.
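
As a rough illustration of the bias and of the proposed fix (a sketch, not the project's code): with 11 operations, r % 11 cannot be uniform over 2^32 inputs, while a power-of-two operation count reduces selection to an exactly uniform mask.

    uint32_t biased    = r % 11;  // residues 0..3 occur one extra time over 2^32 inputs
    uint32_t uniform8  = r & 7;   // 8 operations: exactly uniform
    uint32_t uniform16 = r & 15;  // 16 operations: exactly uniform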

ProgPow ASIC possibilities evaluated by 2 experts

SChernykh/CryptonightR#1 (comment)

In fact, a carefully designed ASIC could still outperform a GPU by spending more resources/area on the bottlenecks. The memory bandwidth can be greatly improved using more, smaller DRAM partitions and parallel memory controllers with address interleaving. The random math cannot utilize the GPU's floating-point ALUs, tensor cores, and certain on-chip memory, which occupy much more area than the tiny integer ALUs. An ASIC implementation could just build more simplified integer ALUs and multi-bank register files with a very simple decoder for better TLP. It is also possible to achieve chained operations with a reconfigurable ALU array.

What's IfDefElse's take on this?

ProgPoW's handling of Ethash's uneven DAG size is underspecified

Ethash DAG sizes are multiples of 128 bytes, but not of 256. ProgPoW's README.md says:

The DAG is generated exactly as in Ethash. All the parameters (epoch length, DAG size, etc.) are unchanged.

ProgPoW accesses the DAG in 256-byte chunks, which is a documented change from Ethash. However, it doesn't specify how the last element of Ethash's original DAG, which is only 128 bytes long, is handled. Is that last element perhaps/hopefully never accessed? This needs to be specified fully, and implementations checked for potential unintended behavior.
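
To make the concern concrete, a hedged sketch of the size arithmetic (ethash_get_datasize() as used elsewhere in this tree; the exact access behavior is the unspecified part):

    // Ethash DAG sizes are odd multiples of 128 bytes, so they are never
    // multiples of 256: a 256-byte-chunk view leaves a 128-byte tail.
    uint64_t dag_bytes  = ethash_get_datasize(block_number); // 128 * (odd number)
    uint64_t elems_128  = dag_bytes / 128; // Ethash elements
    uint64_t chunks_256 = dag_bytes / 256; // full ProgPoW chunks; dag_bytes % 256 == 128
    // The final 128-byte Ethash element is not covered by any full 256-byte
    // chunk; this is the case whose handling the issue asks to be specified.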

Also, ProgPoW's bundled implementation of Ethash's DAG initialization has ETHASH_MIX_BYTES changed from 128 to 256. Since the DAG is supposed to be unchanged from Ethash's, this change is probably unneeded (at least for the host-side DAG initialization code, which is also unused in the ProgPoW tree?), but it breaks the included (and also unused?) ethash_hash().

The ETHASH_MIX_BYTES macro is reused by the CUDA and OpenCL DAG initialization code, but I guess it's not supposed to make a difference there either? If so, should it perhaps be reverted to 128 to emphasize that the DAG is indeed unchanged from Ethash's? Or is the change to 256 needed, e.g., not to waste resources on computing the last (unused?) 128 bytes? I suggest adjusting the code or comments to address or avoid these questions right in there.

In the plain C implementation of ProgPoW I'm currently playing with, I left Ethash's DAG initialization as-is (with ETHASH_MIX_BYTES at 128), and it's producing the correct digest (matching your test-vectors.md) for block 30k.

Make cache content vary per-hash

Right now, ProgPoW's cache is the first 16 KiB of the DAG. This has some drawbacks:

  1. It's the same for all hashes being computed, including those concurrently computed on a GPU, which wastes GPU on-die memory on many copies of this data instead of using that memory for different data.

  2. Even inside each SM's shared memory or each CU's LDS (64 KiB) we may have a few copies of this data. We're giving an ASIC flexibility to provide a 16 KiB SRAM with more read ports and/or banks (or accept more bank conflict stalls) instead of the 64+ KiB SRAM that we have in the GPU.

  3. The very beginning of the DAG might be especially susceptible to TMTO attacks, even though those are probably impractical because 16 KiB SRAM is cheap enough as it is.

We might partially mitigate 3 by using a later portion of the DAG or something else, but addressing 1 and 2 is not trivial. Ideally, we'd use different cache content (such as different portions of the DAG as quickly determined by a fast hash of the block header) for each hash computation. However, our current random reads from the DAG are only of 16 KiB total per hash computed, so reading another 16 KiB of the cache from a random offset as well would cost us half of the global memory bandwidth, halving the bandwidth remaining for the tiny random DAG reads.

Maybe we should consider a ProgPoW revision/mode with much higher PROGPOW_CNT_DAG (number of loop iterations), so that it'd read a lot more data from the DAG per hash computed (and would have a lot lower nominal hashrate as a side-effect), and could then easily afford to also read the 16 KiB cache from a random DAG offset without much effect on memory bandwidth usage (and without much additional effect on the hashrate). This would result in slower PoW verification, but maybe even 100x slower is acceptable (so that reading the 16 KiB caches from random DAG offsets would cost only 1% of total bandwidth)? Of course, it'd also go against #36, but then at least having different hashrates from Ethash's would be justified by actual advantage rather than being arbitrary.

Or maybe we should add cache writes as well, so that the caches become at least somewhat different as ProgPoW runs. (This would also mitigate 3.) Right now, we read approximately 3x the cache size's worth of data from each cache, so perhaps we can afford to write 1x the cache size's worth of data as well (e.g., maybe read 2x the size and write 1x the size, keeping the total cache access count the same as we currently have)? This is probably more practical than my suggestion above, since it allows preserving fast verification and even implementation of #36, but it's a lower-level change. A rough sketch of the per-hash cache idea follows.
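
For concreteness (not a worked-out proposal; fast_hash and the offset bound are assumptions for illustration):

    // Pick a per-hash cache window in the DAG from a fast hash of the seed,
    // instead of always caching the first 16 KiB of the DAG.
    uint32_t base = fast_hash(seed) % (dag_words - PROGPOW_CACHE_WORDS);
    for (uint32_t i = 0; i < PROGPOW_CACHE_WORDS; i++)
        c_dag[i] = g_dag[base + i];  // one extra 16 KiB sequential read per hash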

I'd appreciate any comments.

Correctness of test vectors

I'm trying to implement ProgPoW 0.9.4 in Scala from the spec described here, but I have a problem matching the test vectors from the repo to my results.

The problem starts with obtaining hash_init, so I will focus only on this code:

  case class Hash32 (uint32s: Array[Long])

  def keccak_f800_progpow(
    st:   Array[Long]
   ): Hash32 = {
    for(r <- 0 until 22){ ProgPow.keccak_f800_round(st, r) }
    Hash32(st.take(8))
  }

  def keccak_f800_initial_pass(
    header: Hash32,
    nonce:   Array[Long],
  ): Hash32 = {

    val st = Array.fill[Long](25)(0L)

    for(i <- 0 until 8) { st(i) = header.uint32s(i) } // absorb the 256-bit header
    st(8) = nonce(0)                                  // absorb the 64-bit nonce
    st(9) = nonce(1)
    st(10) = keccakf_rndc(0)                          // padding constants, as I
    st(18) = keccakf_rndc(6)                          // understand the 0.9.4 spec

    keccak_f800_progpow(st)
  }

  def hash(
    prog_seed:  Array[Long],
    nonce:      Array[Long],
    header:     Hash32
  ): PoW = {

    val hash_init = ProgPow.keccak_f800_initial_pass(header, nonce)

    // Further code is irrelevant as long as hash_init is wrong
  }

The code is pretty straightforward, and I don't see where the mistake could occur.
The previous version returns values as expected (it passes all vectors from the repo).
ProgPow.keccak_f800_round passes the test vector (I am not sure why this vector was removed from your repo), and nothing changed here from the previous version.

My results of calling hash function are:

block_number: 30000
header: ffeeddccbbaa9988776655443322110000112233445566778899aabbccddeeff
nonce: 0x123456789abcdef0
 
hash_init: 820cbdad61f9121f6d516f1758b5d73994ac00854ff1ddeef0b7701df5caa0fe
hash_mix:  ca62b7dfb3b5e713ad23f5c70fea05a888f76b13af7998af1db374d5f6f9f77f
hash_final: 21fbd7b6c3069edebb514eb0ab76178a3b9a1d8ed9066af4f70e063670dec958


block_number: 0
header: ffeeddccbbaa9988776655443322110000112233445566778899aabbccddeeff
nonce: 0x123456789abcdef0
 
hash_init: 820cbdad61f9121f6d516f1758b5d73994ac00854ff1ddeef0b7701df5caa0fe
hash_mix: 649bf3c6c69d7f1d86d114341f39cc27f69d4b8f39fa92e30fbafba48562cbab
hash_final: 2ae94b9ded0f6bac6cdb4348a6481c3ca07899aef390585a19a38a554ac25228

Could you please tell me if there is something missing in my implementation (or maybe I misunderstood something)?

Build instructions

There are https://github.com/ethereum-mining/ethminer/blob/master/docs/BUILD.md and https://github.com/AndreaLanfranchi/ethminer/blob/master/docs/BUILD.md in those related repos, but apparently no equivalent in this one. Their instructions are almost usable for this repo, but not quite. Perhaps proper build instructions should be added right here and maintained here?

FWIW, here's what worked for me (on Scientific Linux 6.10 with CUDA 10.0 and AMDGPU-PRO 18.50, after scl enable devtoolset-6 bash to gain a newer gcc, and using cmake3; I omitted these two tricks from the commands below to make them readily reusable on more common and recent distros):

git clone https://github.com/ifdefelse/ProgPOW
cd ProgPOW
git submodule update --init --recursive
mkdir build
cd build
cmake .. -DETHASHCUDA=ON
make -sj8

This produced an OpenCL+CUDA build. (Why isn't CUDA enabled by default? It is in the default build of current ethminer on this same system.)

Then there was this runtime error:

Fatal GPU error: CUDA error in func ethash_generate_dag at line 173 calling cudaGetLastError() failed with error no kernel image is available for execution on the device

which I solved by adding builds for all of this machine's GPUs' compute capabilities to libethash-cuda/CMakeLists.txt:

diff --git a/libethash-cuda/CMakeLists.txt b/libethash-cuda/CMakeLists.txt
index 89fa461..ebd079b 100644
--- a/libethash-cuda/CMakeLists.txt
+++ b/libethash-cuda/CMakeLists.txt
@@ -31,6 +31,8 @@ else()
        set(
                CUDA_NVCC_FLAGS
                ${CUDA_NVCC_FLAGS}
+               "-gencode arch=compute_35,code=sm_35"
+               "-gencode arch=compute_52,code=sm_52"
                "-gencode arch=compute_61,code=sm_61"
                "-gencode arch=compute_75,code=sm_75"
        )

and re-running just the final make command. As I understand it, someone on CUDA older than 10.0 would also need to remove or comment out the line with sm_75 on it, or the build would fail. Ideally, all of this would be auto-detected and wouldn't require manual edits, but meanwhile it'd be helpful to include this in the proposed build instructions as well.

Until such instructions are added, perhaps keep this issue open so that at least what's included in this comment is easy to find by those who need it.

ProgPoW period_seed issue

This is a bug report.

since ProgPoW utilize block_number to get the prog_seed = block_number / PROGPOW_PERIOD(50),
we need to obtain block_number from the geth/pool node.

Now eth_getWork() supports block_number info as of the latest geth 1.8.18, so I've made a small fix to pass block_number to cuda_init() correctly:

@@ -319,6 +323,7 @@ bool CUDAMiner::s_noeval = false;

 bool CUDAMiner::cuda_init(
        size_t numDevices,
+       uint64_t block_number,
        ethash_light_t _light,
        uint8_t const* _lightData,
        uint64_t _lightBytes,
@@ -346,7 +351,8 @@ bool CUDAMiner::cuda_init(
                m_search_buf = new volatile search_results *[s_numStreams];
                m_streams = new cudaStream_t[s_numStreams];

-               uint64_t dagBytes = ethash_get_datasize(_light->block_number);
+               cnote << "block_number " << _light->block_number; // wrong value
+               uint64_t dagBytes = ethash_get_datasize(block_number);
                uint32_t dagElms   = (unsigned)(dagBytes / ETHASH_MIX_BYTES);
                uint32_t lightWords = (unsigned)(_lightBytes / sizeof(node));

@@ -376,7 +382,8 @@ bool CUDAMiner::cuda_init(
                hash64_t * dag = m_dag;
                hash64_t * light = m_light[m_device_num];

-               compileKernel(_light->block_number, dagElms); // wrong block_number
+               uint64_t prog_seed = block_number / PROGPOW_PERIOD;
+               compileKernel(prog_seed, dagElms);

                if(!light){
                        cudalog << "Allocating light with size: " << _lightBytes;

but the result is the following (screenshot omitted).


testnet genesis with progpowBlock: 0

Tested with ethereum/go-ethereum#17731 + the eth_getWork() fix ethereum/go-ethereum#18038:

{
  "config": {
    "chainId": 1357,
    "homesteadBlock":0,
    "eip155Block":0,
    "eip158Block":0,
    "progpowBlock":0,
    "ethash": { }
  },
  "difficulty": "0x2000",
  "timestamp": "0x5be6530c",
  "nonce": "0x0000000000000042",
  "mixHash": "0x0000000000000000000000000000000000000000000000000000000000000000",
  "parentHash": "0x0000000000000000000000000000000000000000000000000000000000000000",
  "extraData": "0x",
  "gasLimit": "3141592",
  "alloc": { }
}

(screenshots omitted)


Mining command: $ /share/ProgPOW/build/ethminer/ethminer -U --cuda-devices 0 -P http://127.0.0.1:8485

Some hunks of the quick-and-dirty patch:

diff --git a/libethcore/EthashAux.h b/libethcore/EthashAux.h
index e74c345..1ea9b9e 100644
--- a/libethcore/EthashAux.h
+++ b/libethcore/EthashAux.h
@@ -72,7 +72,8 @@ struct WorkPackage
     explicit WorkPackage(BlockHeader const& _bh)
       : boundary(_bh.boundary()),
         header(_bh.hashWithout()),
-        epoch(static_cast<int>(_bh.number()) / ETHASH_EPOCH_LENGTH)
+        epoch(static_cast<int>(_bh.number()) / ETHASH_EPOCH_LENGTH),
+        height(static_cast<uint64_t>(_bh.number()))
     {}
     explicit operator bool() const { return header != h256(); }

@@ -82,6 +83,7 @@ struct WorkPackage
     int epoch = -1;

     uint64_t startNonce = 0;
+    uint64_t height = 0;
     int exSizeBits = -1;
     int job_len = 8;
 };
diff --git a/libpoolprotocols/getwork/EthGetworkClient.cpp b/libpoolprotocols/getwork/EthGetworkClient.cpp
index 58968d4..a84c576 100644
--- a/libpoolprotocols/getwork/EthGetworkClient.cpp
+++ b/libpoolprotocols/getwork/EthGetworkClient.cpp
@@ -98,6 +98,7 @@ void EthGetworkClient::workLoop()
                                WorkPackage newWorkPackage;
                                newWorkPackage.header = h256(v[0].asString());
                                newWorkPackage.epoch = EthashAux::toEpoch(h256(v[1].asString()));
+                               newWorkPackage.height = strtoul(v[3].asString().c_str(), nullptr, 0);

                                // Since we do not have a real connected state with getwork, we just fake it.
                                // If getting work succeeds we know that the connection works
@@ -111,6 +112,7 @@ void EthGetworkClient::workLoop()
                                if (newWorkPackage.header != m_prevWorkPackage.header) {
                                        m_prevWorkPackage.header = newWorkPackage.header;
                                        m_prevWorkPackage.epoch = newWorkPackage.epoch;
+                                       m_prevWorkPackage.height = newWorkPackage.height;
                                        m_prevWorkPackage.boundary = h256(fromHex(v[2].asString()), h256::AlignRight);

                                        if (m_onWorkReceived) {
diff --git a/libprogpow/ProgPow.cpp b/libprogpow/ProgPow.cpp
index b55b14e..4a4a4b4 100644
--- a/libprogpow/ProgPow.cpp
+++ b/libprogpow/ProgPow.cpp
@@ -14,12 +14,10 @@ void swap(int &a, int &b)
     b = t;
 }

-std::string ProgPow::getKern(uint64_t block_number, kernel_t kern)
+std::string ProgPow::getKern(uint64_t prog_seed, kernel_t kern)
 {
     std::stringstream ret;

-    uint64_t prog_seed = block_number / PROGPOW_PERIOD;
-
     uint32_t seed0 = (uint32_t)prog_seed;
     uint32_t seed1 = prog_seed >> 32;
     uint32_t fnv_hash = 0x811c9dc5;

Possible degradation to 0 in random math

int src1 = kiss99(prog_rnd) % PROGPOW_REGS;
int src2 = kiss99(prog_rnd) % PROGPOW_REGS;

If src1 == src2 and we do XOR, the result will be 0. This 0 will most likely spread because:

0 * b = 0, a * 0 = 0
mul_hi(0, b) = 0, mul_hi(a, 0) = 0
ROTL32(0, b) = 0
ROTR32(0, b) = 0
0 & b = 0, a & 0 = 0
min(0, b) = 0, min(a, 0) = 0

The fix is to never perform math operations that cancel both arguments out to 0. As far as I can see, it's currently only XOR. An ASIC can add optimizations for the case when one of the numbers is 0.

Moreover, the case when src1 == src2 allows many other optimizations for an ASIC that OpenCL/GPU won't do. It can use a squarer instead of a full multiplier for multiplication (more energy efficient); MIN/AND/OR simply become NOPs; ADD becomes SHL by 1; CLZ/POPCOUNT become 2 times simpler/more energy efficient. The OpenCL compiler, on the other hand, is not guaranteed to take advantage of this. The compiler will be able to remove MIN/AND/OR from the generated code if src1 == src2, but it's unlikely to do more.

Again, the fix for all this is simple: never do math on the same register; always use two different registers:

int src_index = kiss99(prog_rnd) % (PROGPOW_REGS * (PROGPOW_REGS - 1));
int src1 = src_index % PROGPOW_REGS; // 0 <= src1 < PROGPOW_REGS
int src2 = src_index / PROGPOW_REGS; // 0 <= src2 < PROGPOW_REGS - 1

// src2 is not the final index yet
// Example: if we have 5 registers and src1 = 1, src2 = 3
// src1: 0 _1_ 2 3 4
// src2 = 3, but it's an index in the list of remaining registers: 0 2 3 _4_
// so the final index for src2 will be 4 = 3 + 1

if (src2 >= src1) ++src2; // 0 <= src2 < PROGPOW_REGS and src2 != src1

ASIC Resistance

While a custom ASIC to implement this algorithm is still possible, the efficiency gains available are minimal. The majority of a commodity GPU is required to support the above elements. The only optimizations available are:

* Remove the graphics pipeline (displays, geometry engines, texturing, etc.)
* Remove floating point math

These would result in minimal efficiency gains of roughly 1.1-1.2x. This is much less than the 2x for Ethash or 50x for Cryptonight.

I must disagree with these estimates on the ASIC efficiency gains and say they seem too low.

For one thing, isn't floating point a huge part of GPUs? If I understand the history correctly, integer math was emulated by floating-point units for quite some time!

But let's ignore floats. There's much more to chip design than shrinking die area, which is actually a very modest cost. Far more important is the power per hash. GPUs are designed to optimize framerate and don't care about power as long as they can dissipate the heat. Even if an ASIC had the exact same logic requirements and die area as a GPU, significant efficiency can be gained during the physical design process by choosing low power over speed. An ASIC might be slower per silicon area, but it will deliver better hash-per-watt. The extra capital cost for silicon area is small compared to the operational power-efficiency advantage.

It's similar to ARM vs Intel in the CPU world. If you want the absolute fastest chips, you buy Intel, but all cell phones run ARMs. GPUs are Intel and ASICs are ARM.

To be fair, I think ProgPoW is a good try at device-binding, but ultimately GPUs are ill-suited for mining, and the more you utilize the GPU, the greater the gap with ASICs. There's really nothing software writers can do about that, except to minimize the usage of GPU cores by saturating the bandwidth to commodity DRAMs.

Fuller writeups here (my first article is wrong in assuming KISS99 is a major factor; the second article emphasizes the power aspect):
https://medium.com/@timolsoncrypto/progpow-is-less-asic-resistant-than-ethash-6efd61d17cfa
https://medium.com/@timolsoncrypto/progpow-part-2-still-worse-than-ethash-2b31c5a260d2

Thanks for listening. We can just insta-close the issue if you want. Just thought I should record these comments somewhere in the project.

CUDA kernel produces 1.3~1.7% invalid shares on concurrent kernels

Test Environment
Linux Ubuntu 16.04
CUDA 10
NVIDIA Driver 410.79
6x Gtx 1050 Ti

Software
https://github.com/AndreaLanfranchi/ethminer
with 0.9.2 implemented

Running the GPUs on CUDA, I get 33 to 35.5 Mh/s (overall) but keep getting 1.3 to 1.7% invalid shares (2-hour test batch).
Running the same test with the same GPUs on OpenCL, I get 30 to 32 Mh/s but 0% invalid shares (same 2-hour batch).

For the sake of precision, I am mining on a private node linked to the gangnam network with a minimum diff of 430 Mh (nicehash index 0.1).

Am I missing something?

Make greater use of MADs

In #26, I obtained/confirmed some benchmark results for ProgPoW 0.9.2, including e.g. 22.7M at block 7M on a Vega 64 actually running at 1401 MHz during that benchmark.

Simulating this version with c-progpow, I get 36864 Merge and 20480 Math operations per hash computed. This gives 22.7M*(20480+36864) = ~1302G of those operations per second on Vega 64. While this is a sane number considering that many of those operations correspond to multiple instructions, let's recall that a Vega 64 at 1401 MHz is capable of computing 4096*1401M = ~5738G operations per second (or virtually ~11477G FP32 FLOPS due to how those are traditionally counted, with the MUL and ADD halves of a MAD as separate FLOPS).

Besides register file and local memory lookups, MULs or MADs are pretty much the only other operations we have on GPUs that consume relatively non-trivial resources in custom hardware. As once again confirmed in a recent comment in #24, all other operations of ProgPoW add relatively little.

I think ProgPoW would be greatly improved by cutting down on its limited programmability and instead making greater use of MADs. Even if we count a MAD as just one operation (unlike how it's historically done for FLOPS), we have up to a 4.4x potential improvement here (as ~1302G to ~5738G).

Further, Math only performs MULs 2/11 of the time, and Merge only performs a specialized multiplication by 33 (equivalent to shifting the number left by 5 bits and adding that to itself). With this, ProgPoW currently performs only 22.7M*20480*2/11 = ~84.5G arbitrary MULs per second. The potential improvement in terms of arbitrary MUL/s is thus up to 68x (~84.5G to ~5738G).
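
(For clarity, the "multiply by 33" in Merge is just a shift-and-add, which is why it doesn't count as an arbitrary MUL:)

    uint32_t mul33(uint32_t x) { return (x << 5) + x; }  // == x * 33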

This is much more important than the limited programmability that ProgPoW now has. That said, some programmability can be retained (thus justifying the name): it will remain in the routing of different registers as inputs and outputs of each MAD. Such routing, and not the choice of instruction (which is just a MUX or such), is the relatively heavier component.

I recognize that we might not be able to run a lengthy stream consisting exclusively of MADs without many intermediate values degrading to zeroes, etc. too often (although frequent register file and local memory lookups should help here - we have lots of internal state, so it won't degrade fully too quickly). So the actual improvement we can achieve will be somewhat less than those theoretical maximums I listed. But we can probably get close to them, certainly much closer than we're now.
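
To illustrate (my reading of the proposal, not a concrete spec): the instruction stream would be dominated by MADs, with the random per-period program reduced to choosing the register routing.

    // Each step is a fixed multiply-add; only the indices a0, b0, c0, d0, ...
    // vary per program period, so "programmability" survives in the routing.
    mix[d0] = mix[a0] * mix[b0] + mix[c0];
    mix[d1] = mix[a1] * mix[b1] + mix[c1];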

We might also want to revisit use of FP32 MADs in addition to integer ones. (IEEE might provide us sufficient guarantees.) This should double the throughput on NVIDIA Turing and Volta, but then we'd need to decide whether we'd make use of that throughput (maybe not fully?) not to slow down other GPUs too much. On NVIDIA's diagrams, the FP32 units are shown as occupying 2x the area of the INT32 ones; if that corresponds to actual die area, that's yet another reason to use them.

Edit: changed "M" (was an error) to "G" in op/s figures.

Profiling on 1080 Ti

There is an NSIGHT profiling result on the web:
https://medium.com/@ifdefelse/understanding-progpow-performance-and-tuning-d72713898db3

I have also tried profiling on a 1080 Ti with the same codebase from this GitHub repo, and I have some questions to ask.

The result shows that 'Issued Warp Per Scheduler' is only 0.77, which implies poor latency hiding; it seems too low compared to 0.94 on the 1060 and 0.88 on the 1070.

Also, the 'Warp State Statistics' result shows that the bottleneck is 'Stall Short Scoreboard', which is related to shared-memory operations.

Below is my shared-memory profiling:

    Instructions:   201326592
    Requests:       706282140
    %Peak:          76.35
    Bank Conflicts: 504955548

Compared with the 1060 and 1070, the instruction count is the same, but there are more requests and bank conflicts; I guess this might be the reason for the high latency in my experiment.

But I don't know why the requests and bank conflicts are about 257274 more than on the 1060/1070; could anyone help with that?

ProgPoW CPU validation is not implemented

As I understand it, this ProgPoW tree relies on never finding a solution below target, or else it'd try to validate the solution on CPU as Ethash, which would fail. The fork at https://github.com/BitcoinInterestOfficial/ProgPOW simply does:

    // ProgPoW CPU validation is not implemented, override
    s_noeval = true;

Does any other tree implement lightweight ProgPoW validation on CPU?

[My c-progpow, based off this ProgPoW tree's README.md, currently only implements (cached-)full-DAG-based computation on CPU. Not lightweight.]

rebase on top of the ethminer master branch

Experimental

This is my rebase work on top of the latest ethminer master branch.
The master branch of ethminer is a bit faster, and the latest ethash already supports ProgPoW v0.9.1.

Please see
https://github.com/EthersocialNetwork/ethminer-ProgPOW/commits/progpow-master-rebase

Only CUDA has been tested.

Screenshots (omitted): rebased ProgPoW ethminer + chfast/ethash master branch, with a small ProgPoW CPU verifier wrapper fix.

P106-6G (Samsung): ~10.4 MH/s (~24.0 MH/s for Ethash)

OpenCL Error: clEnqueueWriteBuffer: CL_INVALID_MEM_OBJECT (-38)

I have a problem running ProgPoW on my MacBook (I am doing this only for test purposes).

When I start benchmark mode, I get:

[~/open-source/ProgPOW/build]$ ./ethminer/ethminer -G -M 10000000 --opencl-devices 1 -v 9                                    [master]
  m  13:05:44|        |  ethminer version 0.15.0.dev0
  m  13:05:44|        |  Build: darwin / release +git. 66e6979
  ℹ  13:05:44|        |  Found suitable OpenCL device [ Intel(R) UHD Graphics 630 ] with 1610612736  bytes of GPU memory
Benchmarking on platform: CL
Preparing DAG for block #10000000
 cl  13:05:44|cl-0    |  No work. Pause for 3 s.
Warming up...
 cl  13:05:47|cl-0    |  New epoch 333 / period 200000
 cl  13:05:49|cl-0    |  Platform: Apple
 cl  13:05:49|cl-0    |  Device:   AMD Radeon Pro 560X Compute Engine  / OpenCL 1.2
  ✘  13:05:50|cl-0    |  Build info: <program source>:33:32: error: parameter may not be qualified with an address space
        __local const uint32_t c_dag[PROGPOW_CACHE_WORDS],
                               ^
<program source>:34:26: error: parameter may not be qualified with an address space
        __local uint64_t share[GROUP_SHARE],
                         ^
<program source>:203:6: warning: no previous prototype for function 'keccak_f800_round'
void keccak_f800_round(uint32_t st[25], const int r)
     ^
<program source>:250:10: warning: no previous prototype for function 'keccak_f800'
uint64_t keccak_f800(__constant hash32_t const* g_header, uint64_t seed, hash32_t digest)
         ^
<program source>:283:10: warning: no previous prototype for function 'kiss99'
uint32_t kiss99(kiss99_t *st)
         ^
<program source>:295:6: warning: no previous prototype for function 'fill_mix'
void fill_mix(uint64_t seed, uint32_t lane_id, uint32_t mix[PROGPOW_REGS])
     ^
<program source>:332:20: warning: unused variable 'nonce'
    uint64_t const nonce = start_nonce + gid;
                   ^

  ✘  13:05:50|cl-0    |  OpenCL Error: clEnqueueWriteBuffer: CL_INVALID_MEM_OBJECT (-38)
Trial 1...
0
Trial 2...
0
Trial 3...
0
Trial 4...
0
Trial 5...
0
min/mean/max: 0/0/0 H/s
inner mean: 0 H/s

I tried this on two machines, with GPUs like this:

FORMAT: [platformID] [deviceID] deviceName
[0] [0] Intel(R) UHD Graphics 630
    CL_DEVICE_TYPE: GPU
    CL_DEVICE_GLOBAL_MEM_SIZE: 1610612736
    CL_DEVICE_MAX_MEM_ALLOC_SIZE: 402653184
    CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
[0] [1] AMD Radeon Pro 560X Compute Engine
    CL_DEVICE_TYPE: GPU
    CL_DEVICE_GLOBAL_MEM_SIZE: 4294967296
    CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1073741824
    CL_DEVICE_MAX_WORK_GROUP_SIZE: 256

and this:

Listing OpenCL devices.
FORMAT: [platformID] [deviceID] deviceName
[0] [0] Iris Pro
	CL_DEVICE_TYPE: GPU
	CL_DEVICE_GLOBAL_MEM_SIZE: 1610612736
	CL_DEVICE_MAX_MEM_ALLOC_SIZE: 402653184
	CL_DEVICE_MAX_WORK_GROUP_SIZE: 512

In both cases, the results are the same.

I am building the project using:

cmake .. -DETHASHCUDA=OFF -DETHASHCL=ON

I've seen the related ethminer issue, but it didn't help me.

PROGPOW favours Nvidia GPUs

Hello,
after first tests, it is obvious that PROGPOW favours Nvidia GPU cards.
PROGPOW is power hungry, so for now it looks like it is not here to "level the game" but to give Nvidia a big boost and to push out not only ASICs, but also AMD miners.

Can AMD miners hope for some kind of Ethlargement Pill for AMD GPUs in the future?

Inefficient integer multiplications

An ASIC will be 4 times more efficient with these two operations, because a and b are 32-bit integers:

    case 1: return a * b;
    case 2: return mul_hi(a, b);

32-bit integer multiplications are inefficient on GPUs because GPUs only have a 24-bit wide data path for multiplication. A 32-bit MUL is 4 times slower than a 24-bit MUL. It's better to use mul24 here.

Side note: it's a shame that OpenCL still doesn't have mul24_hi, but CUDA has it.
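
For illustration, the 24-bit multiply semantics in portable form (a sketch; mul24()/__umul24 are the OpenCL/CUDA intrinsics, and PTX additionally exposes mul24.hi for the high part):

    // Only the low 24 bits of each operand participate in a mul24.
    uint32_t mul24_lo(uint32_t a, uint32_t b)
    {
        return (a & 0xFFFFFF) * (b & 0xFFFFFF);            // low 32 bits of the product
    }
    uint64_t mul24_full(uint32_t a, uint32_t b)
    {
        return (uint64_t)(a & 0xFFFFFF) * (b & 0xFFFFFF);  // full 48-bit product; its
    }                                                      // high 32 bits (product >> 16)
                                                           // are what a mul24_hi returns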

ProgPoW OpenCL kernel is usually built for wrong epoch

As it turns out, this ProgPoW tree builds the ProgPoW OpenCL kernels using a ProgPoW epoch derived back from the coarse Ethash DAG epochs. As a result, kernels are only built correctly, and the ProgPoW computation is only correct, very rarely: only when the block number is at or just after an Ethash DAG epoch boundary. So it is correct e.g. for blocks 30000 and 60000, but not for 10000, 29000, 31000, nor 7000000 or 10000000. (The last two of these are block numbers I've been using for benchmarks before, thinking the computation was hopefully correct. This news partially invalidates those benchmark results on AMD cards, and the CUDA vs. OpenCL comparisons on NVIDIA cards.)

The host-side code for CUDA doesn't have the same problem. I didn't check all the history, but my guess is that when ProgPoW periods shorter than Ethash DAG epochs were introduced, only the CUDA code was updated to reflect that. OpenCL was either forgotten or left for later, without this having been documented.

I'll probably send a PR fixing this in a few days, but I thought I'd bring this up in here first in case there are any comments.

Birthday attack on 64-bit seed

The 64-bit seed looked to me like it provided only a low safety margin during my ProgPoW review last year. I was going to revisit this and share some thoughts with the community, but in the end I ran out of time, and I felt that ProgPoW was non-final anyway (to my liking, at least) yet further tweaks were not encouraged. Now, reminded and inspired by @kik's #51 and by this community's prompt response to it and willingness to tweak ProgPoW to fix it, I present another related yet very different attack:

While mining ProgPoW with a large cluster, maintain a cluster-wide cache of mappings from 64-bit seed to 256-bit mix digest (the immediate result of the memory-hard computation). This cache can be emptied on every new period (10 blocks) and filled during the period, maybe up to a pre-defined maximum size (as memory permits) such as 2^32 entries (128 GiB).

Once the cache fill is above a threshold, each cluster node can reasonably start utilizing its attached Keccak ASICs to search the nonce space until a previously cached seed is seen. For example, with a cache fill of 2^32 entries, it'd take around 2^32 Keccak computations until a cache hit. With a large enough cache and with enough Keccak ASICs working in parallel, this might be cheaper than doing a mix digest computation for a previously unseen seed (although the node's GPUs would also continue working on new seeds in parallel).

Now, what cache size would make this attack worthwhile? We'd need to match (and then exceed) a GPU's hash rate with our rate of finding previously cached seeds. With a cache of 2^32 entries, and thus needing to do around 2^32 Keccak computations per hit, matching a GPU's e.g. 2^24 (16.8M) hashes/s means performing 2^56 Keccak computations per second. Can an ASIC with enough Keccak cores (perhaps across many chips) to accomplish this potentially consume less power than a GPU does? Probably not.
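
A back-of-the-envelope sketch of the arithmetic above (assuming uniform 64-bit seeds; a self-check, not part of any miner):

    #include <cmath>
    // Expected Keccak evaluations per cache hit, and the Keccak rate needed
    // to match a GPU, given the cache fill (number of cached seeds).
    double keccaks_per_hit(double fill)  { return std::pow(2.0, 64) / fill; }
    double keccak_rate_needed(double gpu_hps, double fill)
    {
        return gpu_hps * keccaks_per_hit(fill);  // e.g. 2^24 h/s * 2^32 = 2^56 Keccak/s
    }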

Can a cache much larger than 2^32 reasonably be maintained? Probably yes, distributed across the cluster nodes' RAM. Then fewer Keccak cores would be needed, and their energy efficiency vs. GPUs would be better.

Can a cluster node's Keccak ASICs quickly determine if a seed is (likely) cached? Probably yes, with a probabilistic data structure such as a Bloom filter in RAM closely attached to the ASICs. (They would not need to wait for this check result, but would proceed to test more nonces. There would also need to be locally stored queues of seed values to check.) This RAM could be many times smaller than the cache itself (perhaps 10 to 20 bits per seed, not 256), but it would nevertheless be the limiting factor on the total size of the cache.

Can the Bloom filter RAM have enough throughput to accommodate the many candidate seeds coming out of Keccak every second? That's probably the worst bottleneck. I guess it'd be tricky ASIC design with Keccak+SRAM cores (distributed RAM), NOC, and inter-chip mesh to implement that.

Overall, this doesn't look practical yet. But if we want to have a better safety margin and greater confidence with respect to attacks like this, we need to move to larger seeds.

My bigger concern isn't this attack per se, but rather that this line of thought could become the missing piece of the puzzle in making some other yet-undiscovered attack practical.

Performance analysis of DATASET_PARENTS=512

The ProgPoW software audit recommends increasing the DATASET_PARENTS Ethash cache parameter from 256 to 512. This has a direct impact on verification performance, as the time for a single verification doubles (while ProgPoW's verification slowdown over Ethash is otherwise only 30-50%).

The DATASET_PARENTS increase makes the verification "even more" memory-hard and lowers the instructions-per-cycle ratio to 1 (the max being 4).
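
For context, a sketch of why DATASET_PARENTS sits on the light-verification critical path (FNV-based parent selection as in Ethash; details abbreviated):

    // Each full-dataset item is derived from DATASET_PARENTS pseudo-random
    // cache reads, so doubling the parameter roughly doubles the work (and
    // the cache misses) per item verified.
    for (uint32_t p = 0; p < DATASET_PARENTS; p++) {
        uint32_t parent = fnv(item_index ^ p, mix[p % 16]) % cache_items;
        // ... FNV-mix cache[parent] into the item being computed ...
    }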

ProgPoW verification, DATASET_PARENTS = 256, epoch 0:

cset shield -- perf stat -B -e cache-references,cache-misses,cycles,instructions test/ethash-bench --benchmark_filter=progpow_hash/0
cset: **> 1 tasks are not movable, impossible to move
cset: --> last message, executed args into cpuset "/user", new pid is: 10825
2019-09-10 14:19:50
Running test/ethash-bench
Run on (8 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
progpow_hash/0       1960 us       1960 us        347

 Performance counter stats for 'test/ethash-bench --benchmark_filter=progpow_hash/0':

        65 642 783      cache-references                                            
        39 184 374      cache-misses              #   59,693 % of all cache refs    
     5 636 657 996      cycles                                                      
     7 104 679 821      instructions              #    1,26  insn per cycle         

       1,314309256 seconds time elapsed

       1,296116000 seconds user
       0,000000000 seconds sys

ProgPoW verification, DATASET_PARENTS = 512, epoch 0:

cset shield -- perf stat -B -e cache-references,cache-misses,cycles,instructions test/ethash-bench --benchmark_filter=progpow_hash/0
cset: **> 1 tasks are not movable, impossible to move
cset: --> last message, executed args into cpuset "/user", new pid is: 10697
2019-09-10 14:19:26
Running test/ethash-bench
Run on (8 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
progpow_hash/0       3695 us       3694 us        195

 Performance counter stats for 'test/ethash-bench --benchmark_filter=progpow_hash/0':

        87 073 601      cache-references                                            
        48 426 695      cache-misses              #   55,616 % of all cache refs    
     6 589 826 522      cycles                                                      
     6 898 095 482      instructions              #    1,05  insn per cycle         

       1,534862112 seconds time elapsed

       1,512262000 seconds user
       0,004011000 seconds sys

"CUDAMiner_kernel.h" file

EDIT: Found the answer in "CMakeLists.txt". Thanks.


Hi,

Is there supposed to be a "CUDAMiner_kernel.h" file in the libethash-cuda/ folder? It's being included by the CUDAMiner.cpp code.

I guess it contains the definition of the string "CUDAMiner_kernel" referenced by CUDAMiner.cpp in compileKernel()? I suppose that string would just contain the source code of "CUDAMiner_kernel.cu", since it's combined at runtime with the dynamic progPowLoop() code from ProgPow::getKern() and runtime-compiled into the actual CUDA kernel?

Tune ProgPoW for similar hashrate to Ethash's

We might want to either tune ProgPoW to produce a hashrate similar to what Ethash produces on the same currently relevant hardware at the same DAG size, or document a rationale for why we don't.

Right now, ProgPoW produces a hashrate that is numerically significantly below Ethash's. This may require hard-coding a difficulty scaling factor to apply on a switchover from Ethash to ProgPoW, and it has the psychological effect of making ProgPoW appear "worse".

Why don't we reduce PROGPOW_CNT_DAG to a level where the hashrates are similar, effectively hard-coding this scaling factor into ProgPoW itself? I understand they can't be exactly the same, since the current hashrate ratio between the two hashes varies across different GPUs, but at least we can reduce the difference between ProgPoW's and Ethash's numeric hashrates and eliminate the need for an external scaling factor.

To support the trick I described in #26 (comment), we might have some constraints on optimal values for PROGPOW_CNT_DAG (e.g., it'd need to be a multiple of 4 in order to avoid fetching a smaller last group of blocks in the example in that comment), but this still leaves us with a lot of freedom for adjusting the value.

Index cache with byte offsets

ProgPoW currently uses a source register's content modulo the cache size as the word index into the cache. This requires an (implicit) left shift by 2 in the GPU hardware to produce the byte offset. Such a left shift might or might not have a runtime performance cost, depending on the compiler and (micro-)architecture.

This (potential) runtime cost may be reliably avoided by applying a mask to the source register's content such that the byte offset is extracted right away, without needing a further shift. This will change the computed hash values, but not other properties of ProgPoW (those values are supposed to be random anyway). See the sketch below.
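
A minimal sketch of the two indexings (assuming PROGPOW_CACHE_WORDS is a power of two):

    uint32_t word_index  = src % PROGPOW_CACHE_WORDS;              // needs an implicit <<2
    uint32_t byte_offset = src & ((PROGPOW_CACHE_WORDS - 1) << 2); // aligned byte offset, no shift
    // The mask selects bits [2 .. log2(PROGPOW_CACHE_WORDS)+1] of src, so a
    // different (but equally random) word is read; computed hashes change.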

Here are the changes I tested on top of my current (revised in other ways) ProgPoW tree:

+++ b/libprogpow/ProgPow.cpp
@@ -113,9 +113,13 @@ std::string ProgPow::getKern(uint64_t block_number, kernel_t kern)
   ret << "uint32_t offset, data;\n";
 
   if (kern == KERNEL_CUDA)
+  {
+      ret << "const unsigned char *c_dag_uc = (const unsigned char *)c_dag;\n";
       ret << "const uint32_t lane_id = threadIdx.x & (PROGPOW_LANES-1);\n";
+  }
   else
   {
+      ret << "__local const unsigned char *c_dag_uc = (__local const unsigned char *)c_dag;\n";
       ret << "const uint32_t lane_id = get_local_id(0) & (PROGPOW_LANES-1);\n";
       ret << "const uint32_t group_id = get_local_id(0) / PROGPOW_LANES;\n";
   }
@@ -152,8 +157,15 @@ std::string ProgPow::getKern(uint64_t block_number, kernel_t kern)
           std::string dest = mix_dst();
           uint32_t    r = rnd();
           ret << "// cache load " << i << "\n";
-          ret << "offset = " << src << " % PROGPOW_CACHE_WORDS;\n";
-          ret << "data = c_dag[offset];\n";
+          ret << "offset = " << src << " & ((PROGPOW_CACHE_WORDS - 1) << 2);\n";
+          if (kern == KERNEL_CUDA)
+          {
+              ret << "data = *(const uint32_t *)&c_dag_uc[offset];\n";
+          }
+          else
+          {
+              ret << "data = *(__local const uint32_t *)&c_dag_uc[offset];\n";
+          }
           ret << merge(dest, "data", r);
       }
       if (i < PROGPOW_CNT_MATH)

For me, this improves the hashrate on Vega 64 and GTX 1080 by about 1%, and on GTX Titan X Maxwell by about 2%. Yes, this is in my Maxwell-friendly tree. Speedups on the newer GPUs need to be confirmed on the original Maxwell-unfriendly ProgPoW as well, which I haven't done yet (I am experimenting with more tweaks anyway), but I expect them to be about 1% as well (unless ProgPoW is fully memory-bound, in which case the would-be speedup can instead be spent performing more random math, etc.)

Another way to implement this change is:

+++ libprogpow/ProgPow.cpp  2019-05-06 14:31:44.081259833 +0000
@@ -153,8 +157,8 @@
           std::string dest = mix_dst();
           uint32_t    r = rnd();
           ret << "// cache load " << i << "\n";
-          ret << "offset = " << src << " % PROGPOW_CACHE_WORDS;\n";
-          ret << "data = c_dag[offset];\n";
+          ret << "offset = " << src << " & ((PROGPOW_CACHE_WORDS - 1) << 2);\n";
+          ret << "data = c_dag[offset >> 2];\n";
           ret << merge(dest, "data", r);
       }
       if (i < PROGPOW_CNT_MATH)

This is simpler in source code, but relies on the compiler figuring out that the explicit right shift by 2 cancels out with the array indexing's implicit left shift by 2. In my testing, this appears to provide a slightly smaller speedup than the lengthier patch above.

Edits: fixed bugs in the first version of the patch, which were really nasty yet didn't significantly affect the observed speedups.

KISS99 optimization

The KISS99 PRNG used in ProgPoW operates on 32-bit unsigned types, but in many cases we use only 5 bits of the result. Maybe it can be implemented with lower precision (e.g., 24-bit multiplication).
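
For reference, KISS99 as used here (Marsaglia's construction; note that the MWC halves already multiply only 16-bit quantities, so it's mainly the LCG and the consumers of the full 32-bit result that constrain precision):

    typedef struct { uint32_t z, w, jsr, jcong; } kiss99_t;

    uint32_t kiss99(kiss99_t *st)
    {
        st->z = 36969 * (st->z & 65535) + (st->z >> 16);  // 16-bit MWC half
        st->w = 18000 * (st->w & 65535) + (st->w >> 16);  // 16-bit MWC half
        uint32_t MWC = (st->z << 16) + st->w;
        st->jsr ^= (st->jsr << 17);                       // xorshift
        st->jsr ^= (st->jsr >> 13);
        st->jsr ^= (st->jsr << 5);
        st->jcong = 69069 * st->jcong + 1234567;          // 32-bit LCG: the one
        return ((MWC ^ st->jcong) + st->jsr);             // full-width multiply
    }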

Reproducing the testcase

Thanks for the test case introduced in response to #4. However, the ethminer.exe -U -M 30000 command in test/result.log doesn't currently produce the debugging output also shown in there, and there doesn't appear to be any code in the repository to produce that output (or is there?). Can that please be added, perhaps as a debugging mode enabled via a command-line option (which would be included in the command given in test/result.log, so it'd be clear how to reproduce the result just by looking at that file)?

Problem with build and compile

Good afternoon, I have a problem with building and compiling.
I downloaded "cable", copied it to the ethminer's "cmake" folder, and created a "build" directory.
But when I tried to build and compile:

mkdir build; cd build
cmake .. -DETHASHCL=OFF -DETHASHCUDA=ON
cmake --build . --config Release

I got an error:

CMake error at cmake/cable/CableBuildInfo.cmake:18 (message)
The PROJECT_NAME argument missing.

Please explain where I need to put the PROJECT_NAME argument, and please write clear instructions for building and compiling. I had to guess that I needed to download "cable".

overclock settings

What are the best settings for Nvidia graphics cards when mining the ProgPoW algo?

Win64 CUDA 9.0 build does not work correctly

I'm not sure why this happens.

The Linux binary works nicely,
but the Win x86_64 (Windows 10) + CUDA 9.0 + Visual Studio 2017 build (with the -Tv140 toolset) does not work correctly (GTX 1080 8GB),

so I dug into the source code and found that the c_dag generation part in progpow_search() seems odd.

diff --git a/libethash-cuda/CUDAMiner_kernel.cu b/libethash-cuda/CUDAMiner_kernel.cu
index e64e7ff..d13bf49 100644
--- a/libethash-cuda/CUDAMiner_kernel.cu
+++ b/libethash-cuda/CUDAMiner_kernel.cu
@@ -163,6 +163,10 @@ progpow_search(
             c_dag[word + i] =  load.s[i];
     }

+    __syncthreads();
+    if (lane_id == 0)
+        printf("c_dag[0] = %08x%08x%08x%08x%08x%08x%08x%08x\n", c_dag[0], c_dag[1], c_dag[2], c_dag[3], c_dag[4], c_dag[5], c_dag[6], c_dag[7]);
+
     hash32_t digest;
     for (int i = 0; i < 8; i++)
         digest.uint32s[i] = 0;

For the normal case under Linux, it prints
c_dag[0] = 2922db22466c51cc860021d27e41abf182c3d10b6acc5e7c3fa3d3f72b33ae8d - epoch 0

but in the Win x86_64 build case, it prints some random output such as
c_dag[0] = 277c371ef7acd87123f6476e5d3b88f9be8e87ce7e544ebd5479c6a1b9db7b5c - epoch 0

and this results in GPU mix != CPU mix (screenshot omitted).

Any suggestions?

Test vectors

Hi,
I'm thinking of porting the algo to Python, for possible use by Bismuth Crypto.

Where should I start? Are there any test vectors? I did not find any.

Benchmark results

It should be possible to sanity-check the performance of a ProgPoW build against its developers' expectations. For that, this repository should include benchmark results on a few system setups (which should also be documented, both hardware and software) for the latest and/or other specified versions of ProgPoW. So far, I only found outdated results in the "Testing" section of the first comment in ZcashFoundation/GrantProposals-2018Q2#15, and these don't include detail on the system setups (they only list GPU types) nor the block number.
