Comments (25)

Entropy-Enthalpy commented on August 29, 2024

A ton of benchmarks

Hardware details:
[image: Hardware_details]

OS details:
Ubuntu 22.04.3 LTS, Linux 6.2.0-26-generic x86_64, GNU 11.4.0

GPU driver & toolchain:
AMD GPU driver version 6.1.5.50600-1609671, ROCm 5.6.0;
NVIDIA GPU driver 535.86.05, CUDA Toolkit 11.8

OpenMM version:
OpenMM 8.0.0 with the OpenMM HIP Plugin (8 Mar 2023), built natively from source.

Results:

| Benchmark (ns/day) | RX 7900 XTX | RX 6900 XT | Radeon VII | RTX 4090 | RTX 4080 | RTX 4070 Ti | RTX 4070 | RTX 4060 Ti | RTX 4060 | RTX 3080 Ti OC |
|---|---|---|---|---|---|---|---|---|---|---|
| gbsa | 2336.967 | 1952.980 | 1214.820 | 3813.987 | 3671.753 | 3067.930 | 2744.943 | 2475.290 | 1973.870 | 2749.813 |
| rf | 1865.653 | 1611.080 | 998.185 | 2638.220 | 2525.517 | 2197.977 | 1958.677 | 1652.783 | 1241.327 | 1866.060 |
| pme | 1545.733 | 1406.137 | 859.448 | 2250.010 | 2104.827 | 1765.640 | 1514.337 | 1333.267 | 940.423 | 1583.760 |
| apoa1rf | 706.180 | 576.104 | 314.815 | 1179.653 | 950.574 | 768.977 | 607.768 | 486.256 | 328.822 | 666.743 |
| apoa1pme | 605.596 | 496.541 | 267.213 | 784.560 | 627.533 | 499.557 | 386.736 | 314.534 | 222.208 | 449.155 |
| apoa1ljpme | 503.829 | 405.404 | 217.646 | 620.874 | 496.335 | 395.952 | 306.482 | 251.178 | 177.799 | 360.385 |
| amoebapme | 26.350 | 17.990 | 9.799 | 44.315 | 37.470 | 28.894 | 23.807 | 20.475 | 14.337 | 28.481 |
| amber20-cellulose | 154.316 | 118.153 | 62.954 | 201.166 | 160.379 | 118.080 | 87.449 | 68.316 | 47.160 | 81.217 |
| amber20-stmv | 51.047 | 39.081 | 22.652 | 59.350 | 43.604 | 28.113 | 20.384 | 14.909 | 11.934 | 23.515 |
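To compare cards at a glance, the rows above can be normalized against the fastest card. A minimal Python sketch (values copied from the amber20-stmv row; `relative_perf` is a hypothetical helper, not part of benchmark.py):

```python
# Normalize ns/day throughput against a reference GPU (RTX 4090 here).
stmv_ns_day = {
    "RX7900XTX": 51.047, "RX6900XT": 39.081, "RadeonVII": 22.652,
    "RTX4090": 59.350, "RTX4080": 43.604, "RTX3080Ti": 23.515,
}

def relative_perf(results, reference):
    """Return each GPU's throughput as a fraction of the reference GPU's."""
    ref = results[reference]
    return {gpu: round(ns_day / ref, 3) for gpu, ns_day in results.items()}

print(relative_perf(stmv_ns_day, "RTX4090"))
# The 7900 XTX reaches ~86% of the 4090 on this benchmark.
```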

The performance of the AMD GPUs is impressive. However, the performance of NVIDIA Ampere consumer GPUs on larger systems is not satisfactory; the 3080 Ti/3090 is nearly matched by the AMD Radeon VII.

Compared to previous versions, ROCm 5.6.0 brings some performance improvements for smaller systems. The performance of the RDNA3 GPU (7900 XTX) is not very stable, however, and occasionally GPU scheduling is not active.

It is worth mentioning that tests by @muziqaz (openmm/openmm#3338 (comment), his updated tables) over the past few days have shown that for larger systems, the 7900 XTX may perform better under Windows (compared to my results under Linux).

BTW, I will soon be posting a series of "Switch to AMD" blog posts for MD simulation applications, covering four mainstream packages: GROMACS, Amber, LAMMPS, and OpenMM. The first forum I post to will probably be the Computational Chemistry Commune; I am also looking for other suitable forums. Does anyone have any suggestions?

from openmm-org.

peastman commented on August 29, 2024

A100:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| gbsa | 2002 | 2000.36 | 1999.22 | 2000.53 |
| rf | 1480.2 | 1476.6 | 1467.05 | 1474.62 |
| apoa1rf | 626.638 | 628.392 | 626.127 | 627.052 |
| apoa1ljpme | 352.423 | 351.813 | 352.966 | 352.401 |
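For anyone double-checking these tables, the Average column should simply be the arithmetic mean of the three runs. A quick sketch using the gbsa row above:

```python
from statistics import mean

# Three repeated gbsa runs on the A100, in ns/day.
gbsa_runs = [2002.0, 2000.36, 1999.22]

average = round(mean(gbsa_runs), 2)
print(average)  # 2000.53
```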

H100:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| gbsa | 2429.47 | 2432.72 | 2430.23 | 2430.8067 |
| rf | 1758.1 | 1757.46 | 1761.9 | 1759.1533 |
| apoa1rf | 799.627 | 800.045 | 798.96 | 799.5440 |
| apoa1ljpme | 488.972 | 488.76 | 488.668 | 488.8000 |

4080:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| gbsa | 3667.43 | 3667.09 | 3661.51 | 3665.3433 |
| rf | 2441.76 | 2421.83 | 2406.36 | 2423.3167 |
| apoa1rf | 913.869 | 906.575 | 907.264 | 909.2360 |
| apoa1ljpme | 480.685 | 478.477 | 479.307 | 479.4897 |

Entropy-Enthalpy commented on August 29, 2024

> It is interesting to see that OCL is stronger on Windows than on Linux.

But it's still much slower than HIP. Many vendors may be abandoning OpenCL; or is SYCL the more modern solution? GROMACS has switched to SYCL.

> Now that I have run all of my GPUs on Windows, I will try a few variations of the suggestions by @bdenhollander to maybe improve performance.

Doing so many benchmarks must be a lot of work; looking forward to your good news!

> @Entropy-Enthalpy do you mind if I add your results to my table?

Sure! Feel free to use them, just mention the source of the data.

> Finally, the 7900 XTX is running with the big dogs, though the 4090's performance on small systems is unbelievable.

I think the reason the 4090 works so well on small systems is its very high clock frequency. All RTX 40-series GPUs can reach ~2.8 GHz when running MD simulations at full load; for comparison, Ampere consumer GPUs run at ~1.9 GHz. In addition, RTX 40xx GPUs even have 100-200 MHz of overclocking headroom.

This may also be the reason for the unsatisfactory performance of the H100 PCIe.

Entropy-Enthalpy commented on August 29, 2024

I posted the aforementioned blog post just now: http://bbs.keinsci.com/thread-39269-1-1.html

peastman commented on August 29, 2024

Here are the numbers for A100.

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| pme | 1273.34 | 1281.89 | 1272.86 | 1276.03 |
| apoa1pme | 444.552 | 445.128 | 446.368 | 445.349 |
| amber20-cellulose | 126.658 | 126.716 | 126.689 | 126.688 |
| amber20-stmv | 38.5787 | 38.5436 | 38.4025 | 38.508 |
| amoebapme | 24.0079 | 23.9255 | 23.9638 | 23.966 |

I also ran the cellulose and STMV benchmarks on multiple devices. Using two A100s:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| amber20-cellulose | 196.166 | 193.278 | 195.035 | 194.826 |
| amber20-stmv | 63.7827 | 63.5919 | 63.6424 | 63.672 |

Using three:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| amber20-cellulose | 227.036 | 223.053 | 221.556 | 223.882 |
| amber20-stmv | 79.015 | 79.2197 | 78.8277 | 79.021 |

Using four:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| amber20-cellulose | 248.516 | 243.46 | 246.521 | 246.166 |
| amber20-stmv | 90.917 | 90.8288 | 90.1617 | 90.636 |
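The multi-GPU runs can be summarized as parallel efficiency, i.e. the measured speedup divided by the ideal n-fold speedup. A small sketch using the amber20-cellulose averages above (`efficiency` is an illustrative helper; the single-GPU figure is from the earlier A100 table):

```python
def efficiency(single_gpu, multi_gpu, n_gpus):
    """Parallel efficiency: measured speedup over the ideal n-fold speedup."""
    return round((multi_gpu / single_gpu) / n_gpus, 3)

single = 126.688  # 1x A100, amber20-cellulose (ns/day)
print(efficiency(single, 194.826, 2))  # 0.769
print(efficiency(single, 223.882, 3))  # 0.589
print(efficiency(single, 246.166, 4))  # 0.486
```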

peastman commented on August 29, 2024

If anyone else can run benchmarks on other GPUs, that would be great. It would be good to present an assortment of Turing and newer GPUs. If anyone has a recent high-end AMD GPU, that would also be good to include.

sef43 commented on August 29, 2024

This is for RTX 3090:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| pme | 1531.2 | 1521.54 | 1528.54 | 1527.09 |
| apoa1pme | 426.283 | 425.176 | 424.684 | 425.381 |
| amber20-cellulose | 79.1935 | 79.4315 | 79.1946 | 79.273 |
| amber20-stmv | 22.9114 | 22.8062 | 23.3675 | 23.028 |
| amoebapme | 27.5631 | 27.4345 | 27.4878 | 27.495 |

peastman commented on August 29, 2024

Here are times on H100:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| pme | 1499.74 | 1502.61 | 1502.4 | 1501.5833 |
| apoa1pme | 609.277 | 609.299 | 608.524 | 609.0333 |
| amber20-cellulose | 162.21 | 161.718 | 161.612 | 161.8467 |
| amber20-stmv | 47.8135 | 47.7666 | 47.8271 | 47.8024 |
| amoebapme | 32.1292 | 32.0891 | 32.1591 | 32.1258 |

It's significantly faster than the A100. Compared to the 4080, it's a little faster on the large systems but slower on the small ones. I also tested using multiple GPUs.

Two:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| amber20-cellulose | 229.752 | 229.28 | 232.941 | 230.6577 |
| amber20-stmv | 76.0804 | 76.1623 | 75.8575 | 76.0334 |

Three:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| amber20-cellulose | 242.069 | 241.271 | 240.868 | 241.4027 |
| amber20-stmv | 87.2753 | 86.7526 | 86.7953 | 86.9411 |

Four:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| amber20-cellulose | 228.432 | 226.936 | 229.353 | 228.2403 |
| amber20-stmv | 83.3789 | 83.4071 | 83.9036 | 83.5632 |

The scaling is worse than with the A100s, and it actually gets slower with four GPUs.

I'll aim to update the benchmarks on the website in the next day or two.

peastman commented on August 29, 2024

I missed a few benchmarks that we have on the website. I'll run them for the hardware I have access to. @sef43 could you run a few more benchmarks on the 3090?

python benchmark.py --platform=CUDA --style=table --test=gbsa,rf,apoa1rf,apoa1ljpme
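The ns/day figures these benchmarks print are just the integration timestep times the measured steps per second, scaled up to a day. A minimal sketch of the conversion (the 4 fs timestep here is only an illustrative assumption; the individual benchmarks use different timesteps):

```python
def ns_per_day(steps_per_second, timestep_fs):
    """Convert simulation speed to ns/day (1 ns = 1e6 fs, 86400 s per day)."""
    return round(steps_per_second * timestep_fs * 1e-6 * 86400, 6)

# e.g. 1000 steps/s at a hypothetical 4 fs timestep:
print(ns_per_day(1000, 4.0))  # 345.6
```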

jchodera commented on August 29, 2024

Tagging @hmacdope. We are working on benchmarks on consumer-grade GPUs via the FAH core version. It looks like our benchmarks project drifted out of sync with the OpenMM benchmarks, so I'll regenerate these and we can get extensive consumer-grade card statistics as well.

jchodera commented on August 29, 2024

@peastman: I'm updating the Folding@home benchmarks project so we can also collect statistics across all consumer-grade cards, and want to confirm that our "standard" benchmarks are intended to be the following:

python benchmark.py --platform=CUDA --style=table --test=gbsa,pme,rf,apoa1rf,apoa1pme,apoa1ljpme,amber20-cellulose,amber20-stmv,amoebapme

Is that correct?

Are we sure we do not want any NPT or mixed-precision benchmarks?

peastman commented on August 29, 2024

It depends what you mean by "standard". The ones you listed are the ones we post on the website. But if you're interested in tracking other things, that's fine too.

peastman commented on August 29, 2024

@sef43 could you run the extra benchmarks on the 3090? That's all I need before I can update the website.

sef43 commented on August 29, 2024

Extra benchmarks on 3090:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| gbsa | 2594.57 | 2578.53 | 2575.42 | 2582.84 |
| rf | 1734.88 | 1721.03 | 1727.47 | 1727.79 |
| apoa1rf | 609.178 | 608.667 | 608.651 | 608.832 |
| apoa1ljpme | 333.08 | 333.442 | 332.235 | 332.919 |

peastman commented on August 29, 2024

Thanks!

Entropy-Enthalpy commented on August 29, 2024

> Here are times on H100:
>
> | Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
> |---|---|---|---|---|
> | pme | 1499.74 | 1502.61 | 1502.4 | 1501.5833 |
> | apoa1pme | 609.277 | 609.299 | 608.524 | 609.0333 |
> | amber20-cellulose | 162.21 | 161.718 | 161.612 | 161.8467 |
> | amber20-stmv | 47.8135 | 47.7666 | 47.8271 | 47.8024 |
> | amoebapme | 32.1292 | 32.0891 | 32.1591 | 32.1258 |

I guess it's the H100 PCIe version, right? I recently ran these benchmarks on an H100 PCIe and got similar results.

The H100 comes in several SKUs, the most common being the H100 SXM5 and the H100 PCIe. According to the Hopper architecture whitepaper (page 18), there is a significant difference in chip specifications between these two SKUs, resulting in a notable difference in FP64/FP32 performance (page 20).

As such, it is important to specify the particular SKU on the webpage.

peastman commented on August 29, 2024

You're right. I thought they were the NVLink version, but it seems they're actually PCIe. That explains why the scaling with multiple GPUs was worse.

Entropy-Enthalpy commented on August 29, 2024

Here are RTX 4090 benchmarks.

Command:
python benchmark.py --platform=CUDA --style=table --test=gbsa,rf,pme,apoa1rf,apoa1pme,apoa1ljpme,amoebapme,amber20-cellulose,amber20-stmv

Results:

| RTX 4090 (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| gbsa | 3801.88 | 3841.71 | 3798.37 | 3813.987 |
| rf | 2720.25 | 2595.1 | 2599.31 | 2638.22 |
| pme | 2310.38 | 2220.51 | 2219.14 | 2250.01 |
| apoa1rf | 1178.92 | 1180.03 | 1180.01 | 1179.653 |
| apoa1pme | 784.449 | 783.864 | 785.366 | 784.5597 |
| apoa1ljpme | 620.933 | 620.417 | 621.272 | 620.874 |
| amoebapme | 44.443 | 44.2036 | 44.2992 | 44.31527 |
| amber20-cellulose | 201.51 | 200.882 | 201.107 | 201.1663 |
| amber20-stmv | 59.5814 | 59.2426 | 59.225 | 59.34967 |

Obviously, the RTX 4090 is currently the fastest single GPU on every system tested, and on the two large systems, amber20-cellulose and amber20-stmv, it is 20-25% faster than the H100 PCIe.
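The 20-25% figure can be checked directly from the averages in this thread. A quick sketch (`pct_faster` is an illustrative helper; the inputs are the amber20 averages for the RTX 4090 above and peastman's H100 PCIe numbers):

```python
def pct_faster(a, b):
    """How much faster a is than b, in percent."""
    return round((a / b - 1.0) * 100, 1)

# RTX 4090 vs H100 PCIe averages (ns/day):
print(pct_faster(201.166, 161.847))  # amber20-cellulose: 24.3
print(pct_faster(59.350, 47.802))    # amber20-stmv: 24.2
```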

BTW, I'll also benchmark some AMD GPUs using the latest OpenMM-HIP, should I post them here?

peastman commented on August 29, 2024

> BTW, I'll also benchmark some AMD GPUs using the latest OpenMM-HIP, should I post them here?

That would be great!

peastman commented on August 29, 2024

The updates are in #98.

muziqaz commented on August 29, 2024

Great stuff there. Thank you.
Yes, we have been toying with the Windows HIP SDK. There are a few bugs and issues, but nothing major or show-stopping. AMD seems to be aware of them.
It is interesting to see that OCL is stronger on Windows than on Linux.
Now that I have run all of my GPUs on Windows, I will try a few variations of the suggestions by @bdenhollander to maybe improve performance.
@Entropy-Enthalpy do you mind if I add your results to my table?
Finally, the 7900 XTX is running with the big dogs, though the 4090's performance on small systems is unbelievable.

muziqaz commented on August 29, 2024

OCL being stronger on Windows rather than Linux is just an in-joke, as nothing is ever faster on Windows compared to Linux 🤣
I believe FAH will still use OCL as a fallback, though with AMD removing more and more older GPUs from its drivers, there isn't much reason left to do so.
The 7900 XTX is running at 2.9-3 GHz too while HIPing ;)
I think the reason the 4090 is so good is the maturity of the CUDA compiler, and the architecture is probably better suited as well. On Windows it would struggle for sure.
It would be great for GROMACS to look into HIP too, maybe through SYCL, to get AMD GPUs working well. Their OpenCL implementation completely ignores AMD as a hardware target.

Entropy-Enthalpy commented on August 29, 2024

@muziqaz I think I've figured out why your results for the two larger systems (amber20-cellulose and amber20-stmv) are higher than mine. As you mentioned previously (amd/openmm-hip#5 (comment)), your card is a Sapphire Nitro+, which has a much higher TGP and clocks (420 W, 2680 MHz) than my MSI GAMING TRIO CLASSIC (355 W, 2500 MHz).

If you have time, you could try switching the VBIOS of your Sapphire Nitro+ to "Silent" mode (using the physical switch on the card) and running the benchmark again. The "Silent" mode has the same TGP and clocks as the MBA model and the MSI GAMING TRIO CLASSIC.

muziqaz commented on August 29, 2024

The difference between your Amber results and mine is within normal run-to-run variation.
On the rest of the tests, your card runs faster than mine.

peastman commented on August 29, 2024

Those are great numbers. I hope that eventually we'll be able to include the HIP platform as part of the main OpenMM package. The main barrier right now is that HIP isn't supported by conda-forge.
