Comments (25)

Entropy-Enthalpy commented on August 29, 2024

A ton of benchmarks

Hardware details:
[image: Hardware_details]

OS details:
Ubuntu 22.04.3 LTS, Linux 6.2.0-26-generic x86_64, GNU 11.4.0

GPU driver & toolchain:
AMD GPU driver version 6.1.5.50600-1609671, ROCm 5.6.0;
NVIDIA GPU driver 535.86.05, CUDA Toolkit 11.8

OpenMM version:
OpenMM 8.0.0 with the OpenMM HIP Plugin (8 Mar 2023), built natively from source.

Results:

| Benchmark (ns/day) | RX 7900 XTX | RX 6900 XT | Radeon VII | RTX 4090 | RTX 4080 | RTX 4070 Ti | RTX 4070 | RTX 4060 Ti | RTX 4060 | RTX 3080 Ti OC |
|---|---|---|---|---|---|---|---|---|---|---|
| gbsa | 2336.967 | 1952.980 | 1214.820 | 3813.987 | 3671.753 | 3067.930 | 2744.943 | 2475.290 | 1973.870 | 2749.813 |
| rf | 1865.653 | 1611.080 | 998.185 | 2638.220 | 2525.517 | 2197.977 | 1958.677 | 1652.783 | 1241.327 | 1866.060 |
| pme | 1545.733 | 1406.137 | 859.448 | 2250.010 | 2104.827 | 1765.640 | 1514.337 | 1333.267 | 940.423 | 1583.760 |
| apoa1rf | 706.180 | 576.104 | 314.815 | 1179.653 | 950.574 | 768.977 | 607.768 | 486.256 | 328.822 | 666.743 |
| apoa1pme | 605.596 | 496.541 | 267.213 | 784.560 | 627.533 | 499.557 | 386.736 | 314.534 | 222.208 | 449.155 |
| apoa1ljpme | 503.829 | 405.404 | 217.646 | 620.874 | 496.335 | 395.952 | 306.482 | 251.178 | 177.799 | 360.385 |
| amoebapme | 26.350 | 17.990 | 9.799 | 44.315 | 37.470 | 28.894 | 23.807 | 20.475 | 14.337 | 28.481 |
| amber20-cellulose | 154.316 | 118.153 | 62.954 | 201.166 | 160.379 | 118.080 | 87.449 | 68.316 | 47.160 | 81.217 |
| amber20-stmv | 51.047 | 39.081 | 22.652 | 59.350 | 43.604 | 28.113 | 20.384 | 14.909 | 11.934 | 23.515 |
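To compare cards at a glance, the rows above can be normalized against the fastest card. A minimal Python sketch (values copied from the amber20-stmv row; `relative_perf` is a hypothetical helper, not part of benchmark.py):

```python
# Normalize ns/day throughput against a reference GPU (RTX 4090 here).
stmv_ns_day = {
    "RX7900XTX": 51.047, "RX6900XT": 39.081, "RadeonVII": 22.652,
    "RTX4090": 59.350, "RTX4080": 43.604, "RTX3080Ti": 23.515,
}

def relative_perf(results, reference):
    """Return each GPU's throughput as a fraction of the reference GPU's."""
    ref = results[reference]
    return {gpu: round(ns_day / ref, 3) for gpu, ns_day in results.items()}

print(relative_perf(stmv_ns_day, "RTX4090"))
# The 7900 XTX reaches ~86% of the 4090 on this benchmark.
```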

The performance of the AMD GPUs is impressive. However, the performance of NVIDIA Ampere consumer GPUs on larger systems is not satisfactory; the 3080 Ti/3090 is nearly matched by the AMD Radeon VII.

Compared to previous versions, ROCm 5.6.0 brings some performance improvements for smaller systems. The performance of the RDNA3 GPU (7900 XTX) is not very stable, however, and occasionally GPU scheduling is not active.

It is worth mentioning that tests by @muziqaz (openmm/openmm#3338 (comment), his updated tables) over the past few days have shown that for larger systems, the 7900 XTX may perform better under Windows (compared to my results under Linux).

BTW, I will soon be posting a series of "Switch to AMD" blog posts for MD simulation applications, covering four mainstream packages: GROMACS, Amber, LAMMPS, and OpenMM. The first forum I post to will probably be the Computational Chemistry Commune; I am also looking for other suitable forums. Does anyone have any suggestions?

from openmm-org.

peastman commented on August 29, 2024

A100:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| gbsa | 2002 | 2000.36 | 1999.22 | 2000.53 |
| rf | 1480.2 | 1476.6 | 1467.05 | 1474.62 |
| apoa1rf | 626.638 | 628.392 | 626.127 | 627.052 |
| apoa1ljpme | 352.423 | 351.813 | 352.966 | 352.401 |
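For anyone double-checking these tables, the Average column should simply be the arithmetic mean of the three runs. A quick sketch using the gbsa row above:

```python
from statistics import mean

# Three repeated gbsa runs on the A100, in ns/day.
gbsa_runs = [2002.0, 2000.36, 1999.22]

average = round(mean(gbsa_runs), 2)
print(average)  # 2000.53
```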

H100:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| gbsa | 2429.47 | 2432.72 | 2430.23 | 2430.8067 |
| rf | 1758.1 | 1757.46 | 1761.9 | 1759.1533 |
| apoa1rf | 799.627 | 800.045 | 798.96 | 799.5440 |
| apoa1ljpme | 488.972 | 488.76 | 488.668 | 488.8000 |

4080:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| gbsa | 3667.43 | 3667.09 | 3661.51 | 3665.3433 |
| rf | 2441.76 | 2421.83 | 2406.36 | 2423.3167 |
| apoa1rf | 913.869 | 906.575 | 907.264 | 909.2360 |
| apoa1ljpme | 480.685 | 478.477 | 479.307 | 479.4897 |

Entropy-Enthalpy commented on August 29, 2024

> It is interesting to see that OCL is stronger on Windows than on Linux.

But it's still much slower than HIP. Many vendors may be abandoning OpenCL; or is SYCL the more modern solution? GROMACS has switched to SYCL.

> Now that I have run all of my GPUs on Windows, I will try a few variations of the suggestions by @bdenhollander to maybe improve performance.

Doing so many benchmarks must be a lot of work; looking forward to your good news!

> @Entropy-Enthalpy do you mind if I add your results to my table?

Sure! Feel free to use them, just mention the source of the data.

> Finally, the 7900 XTX is running with the big dogs, though the 4090's performance on small systems is unbelievable.

I think the reason the 4090 works so well on small systems is its very high clock frequency. All RTX 40-series GPUs can reach ~2.8 GHz when running MD simulations at full load; for comparison, Ampere consumer GPUs run at ~1.9 GHz. In addition, RTX 40xx GPUs even have 100-200 MHz of overclocking headroom.

This may also be the reason for the unsatisfactory performance of the H100 PCIe.

Entropy-Enthalpy commented on August 29, 2024

I posted the aforementioned blog post just now: http://bbs.keinsci.com/thread-39269-1-1.html

peastman commented on August 29, 2024

Here are the numbers for A100.

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| pme | 1273.34 | 1281.89 | 1272.86 | 1276.03 |
| apoa1pme | 444.552 | 445.128 | 446.368 | 445.349 |
| amber20-cellulose | 126.658 | 126.716 | 126.689 | 126.688 |
| amber20-stmv | 38.5787 | 38.5436 | 38.4025 | 38.508 |
| amoebapme | 24.0079 | 23.9255 | 23.9638 | 23.966 |

I also ran the cellulose and STMV benchmarks on multiple devices. Using two A100s:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| amber20-cellulose | 196.166 | 193.278 | 195.035 | 194.826 |
| amber20-stmv | 63.7827 | 63.5919 | 63.6424 | 63.672 |

Using three:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| amber20-cellulose | 227.036 | 223.053 | 221.556 | 223.882 |
| amber20-stmv | 79.015 | 79.2197 | 78.8277 | 79.021 |

Using four:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| amber20-cellulose | 248.516 | 243.46 | 246.521 | 246.166 |
| amber20-stmv | 90.917 | 90.8288 | 90.1617 | 90.636 |
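The multi-GPU runs can be summarized as parallel efficiency, i.e. the measured speedup divided by the ideal n-fold speedup. A small sketch using the amber20-cellulose averages above (`efficiency` is an illustrative helper; the single-GPU figure is from the earlier A100 table):

```python
def efficiency(single_gpu, multi_gpu, n_gpus):
    """Parallel efficiency: measured speedup over the ideal n-fold speedup."""
    return round((multi_gpu / single_gpu) / n_gpus, 3)

single = 126.688  # 1x A100, amber20-cellulose (ns/day)
print(efficiency(single, 194.826, 2))  # 0.769
print(efficiency(single, 223.882, 3))  # 0.589
print(efficiency(single, 246.166, 4))  # 0.486
```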

peastman commented on August 29, 2024

If anyone else can run benchmarks on other GPUs, that would be great. It would be good to present an assortment of Turing and newer GPUs. If anyone has a recent high-end AMD GPU, that would also be good to include.

sef43 commented on August 29, 2024

This is for RTX 3090:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| pme | 1531.2 | 1521.54 | 1528.54 | 1527.09 |
| apoa1pme | 426.283 | 425.176 | 424.684 | 425.381 |
| amber20-cellulose | 79.1935 | 79.4315 | 79.1946 | 79.273 |
| amber20-stmv | 22.9114 | 22.8062 | 23.3675 | 23.028 |
| amoebapme | 27.5631 | 27.4345 | 27.4878 | 27.495 |

peastman commented on August 29, 2024

Here are times on H100:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| pme | 1499.74 | 1502.61 | 1502.4 | 1501.5833 |
| apoa1pme | 609.277 | 609.299 | 608.524 | 609.0333 |
| amber20-cellulose | 162.21 | 161.718 | 161.612 | 161.8467 |
| amber20-stmv | 47.8135 | 47.7666 | 47.8271 | 47.8024 |
| amoebapme | 32.1292 | 32.0891 | 32.1591 | 32.1258 |

It's significantly faster than the A100. Compared to the 4080, it's a little faster on the large systems but slower on the small ones. I also tested using multiple GPUs.

Two:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| amber20-cellulose | 229.752 | 229.28 | 232.941 | 230.6577 |
| amber20-stmv | 76.0804 | 76.1623 | 75.8575 | 76.0334 |

Three:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| amber20-cellulose | 242.069 | 241.271 | 240.868 | 241.4027 |
| amber20-stmv | 87.2753 | 86.7526 | 86.7953 | 86.9411 |

Four:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| amber20-cellulose | 228.432 | 226.936 | 229.353 | 228.2403 |
| amber20-stmv | 83.3789 | 83.4071 | 83.9036 | 83.5632 |

The scaling is worse than with the A100s, and it actually gets slower with four GPUs.

I'll aim to update the benchmarks on the website in the next day or two.

peastman commented on August 29, 2024

I missed a few benchmarks that we have on the website. I'll run them for the hardware I have access to. @sef43 could you run a few more benchmarks on the 3090?

python benchmark.py --platform=CUDA --style=table --test=gbsa,rf,apoa1rf,apoa1ljpme
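The ns/day figures these benchmarks print are just the integration timestep times the measured steps per second, scaled up to a day. A minimal sketch of the conversion (the 4 fs timestep here is only an illustrative assumption; the individual benchmarks use different timesteps):

```python
def ns_per_day(steps_per_second, timestep_fs):
    """Convert simulation speed to ns/day (1 ns = 1e6 fs, 86400 s per day)."""
    return round(steps_per_second * timestep_fs * 1e-6 * 86400, 6)

# e.g. 1000 steps/s at a hypothetical 4 fs timestep:
print(ns_per_day(1000, 4.0))  # 345.6
```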

jchodera commented on August 29, 2024

Tagging @hmacdope. We are working on benchmarks on consumer-grade GPUs via the FAH core version. It looks like our benchmarks project drifted out of sync with the OpenMM benchmarks, so I'll regenerate these and we can get extensive consumer-grade card statistics as well.

jchodera commented on August 29, 2024

@peastman: I'm updating the Folding@home benchmarks project so we can also collect statistics across all consumer-grade cards, and want to confirm that our "standard" benchmarks are intended to be the following:

python benchmark.py --platform=CUDA --style=table --test=gbsa,pme,rf,apoa1rf,apoa1pme,apoa1ljpme,amber20-cellulose,amber20-stmv,amoebapme

Is that correct?

Are we sure we do not want any NPT or mixed-precision benchmarks?

peastman commented on August 29, 2024

It depends what you mean by "standard". The ones you listed are the ones we post on the website. But if you're interested in tracking other things, that's fine too.

peastman commented on August 29, 2024

@sef43 could you run the extra benchmarks on the 3090? That's all I need before I can update the website.

sef43 commented on August 29, 2024

Extra benchmarks on 3090:

| Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| gbsa | 2594.57 | 2578.53 | 2575.42 | 2582.84 |
| rf | 1734.88 | 1721.03 | 1727.47 | 1727.79 |
| apoa1rf | 609.178 | 608.667 | 608.651 | 608.832 |
| apoa1ljpme | 333.08 | 333.442 | 332.235 | 332.919 |

peastman commented on August 29, 2024

Thanks!

Entropy-Enthalpy commented on August 29, 2024

> Here are times on H100:
>
> | Benchmark (ns/day) | Run 1 | Run 2 | Run 3 | Average |
> |---|---|---|---|---|
> | pme | 1499.74 | 1502.61 | 1502.4 | 1501.5833 |
> | apoa1pme | 609.277 | 609.299 | 608.524 | 609.0333 |
> | amber20-cellulose | 162.21 | 161.718 | 161.612 | 161.8467 |
> | amber20-stmv | 47.8135 | 47.7666 | 47.8271 | 47.8024 |
> | amoebapme | 32.1292 | 32.0891 | 32.1591 | 32.1258 |

I guess it's the H100 PCIe version, right? I recently ran these benchmarks on an H100 PCIe and got similar results.

The H100 comes in several SKUs, the most common being the H100 SXM5 and the H100 PCIe. According to the Hopper architecture whitepaper (page 18), there is a significant difference in chip specifications between these two SKUs, resulting in a notable difference in FP64/FP32 performance (page 20).

As such, it is important to specify the particular SKU on the webpage.

peastman commented on August 29, 2024

You're right. I thought they were the NVLink version, but it seems they're actually PCIe. That explains why the scaling with multiple GPUs was worse.

Entropy-Enthalpy commented on August 29, 2024

Here are RTX 4090 benchmarks.

Command:
python benchmark.py --platform=CUDA --style=table --test=gbsa,rf,pme,apoa1rf,apoa1pme,apoa1ljpme,amoebapme,amber20-cellulose,amber20-stmv

Results:

| RTX 4090 (ns/day) | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| gbsa | 3801.88 | 3841.71 | 3798.37 | 3813.987 |
| rf | 2720.25 | 2595.1 | 2599.31 | 2638.22 |
| pme | 2310.38 | 2220.51 | 2219.14 | 2250.01 |
| apoa1rf | 1178.92 | 1180.03 | 1180.01 | 1179.653 |
| apoa1pme | 784.449 | 783.864 | 785.366 | 784.5597 |
| apoa1ljpme | 620.933 | 620.417 | 621.272 | 620.874 |
| amoebapme | 44.443 | 44.2036 | 44.2992 | 44.31527 |
| amber20-cellulose | 201.51 | 200.882 | 201.107 | 201.1663 |
| amber20-stmv | 59.5814 | 59.2426 | 59.225 | 59.34967 |

Obviously, the RTX 4090 is currently the fastest single GPU on every system tested, and on the two large systems, amber20-cellulose and amber20-stmv, it is 20-25% faster than the H100 PCIe.
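The 20-25% figure can be checked directly from the averages in this thread. A quick sketch (`pct_faster` is an illustrative helper; the inputs are the amber20 averages for the RTX 4090 above and peastman's H100 PCIe numbers):

```python
def pct_faster(a, b):
    """How much faster a is than b, in percent."""
    return round((a / b - 1.0) * 100, 1)

# RTX 4090 vs H100 PCIe averages (ns/day):
print(pct_faster(201.166, 161.847))  # amber20-cellulose: 24.3
print(pct_faster(59.350, 47.802))    # amber20-stmv: 24.2
```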

BTW, I'll also benchmark some AMD GPUs using the latest OpenMM-HIP, should I post them here?

peastman commented on August 29, 2024

> BTW, I'll also benchmark some AMD GPUs using the latest OpenMM-HIP, should I post them here?

That would be great!

peastman commented on August 29, 2024

The updates are in #98.

muziqaz commented on August 29, 2024

Great stuff there. Thank you.
Yes, we have been toying with the Windows HIP SDK. There are a few bugs and issues, but nothing major or show-stopping. AMD seems to be aware of them.
It is interesting to see that OCL is stronger on Windows than on Linux.
Now that I have run all of my GPUs on Windows, I will try a few variations of the suggestions by @bdenhollander to maybe improve performance.
@Entropy-Enthalpy do you mind if I add your results to my table?
Finally, the 7900 XTX is running with the big dogs, though the 4090's performance on small systems is unbelievable.

muziqaz commented on August 29, 2024

OCL being stronger on Windows rather than Linux is just an in-joke, as nothing is ever faster on Windows compared to Linux 🤣
I believe FAH will still use OCL as a fallback, though with AMD removing more and more older GPUs from its drivers, there isn't much reason left to do so.
The 7900 XTX is running at 2.9-3 GHz too while HIPing ;)
I think the reason the 4090 is so good is the maturity of the CUDA compiler, and the architecture is probably better suited as well. On Windows it would struggle for sure.
It would be great for GROMACS to look into HIP too, maybe through SYCL, to get AMD GPUs working well. Their OpenCL implementation completely ignores AMD as a hardware target.

Entropy-Enthalpy commented on August 29, 2024

@muziqaz I think I've figured out why your results for the two larger systems (amber20-cellulose and amber20-stmv) are higher than mine. As you mentioned previously (amd/openmm-hip#5 (comment)), your card is a Sapphire Nitro+, which has a much higher TGP and clocks (420 W, 2680 MHz) than my MSI GAMING TRIO CLASSIC (355 W, 2500 MHz).

If you have time, you could try switching the VBIOS of your Sapphire Nitro+ to "Silent" mode (using the physical switch on the card) and running the benchmark again. The "Silent" mode has the same TGP and clocks as the MBA model and the MSI GAMING TRIO CLASSIC.

muziqaz commented on August 29, 2024

The difference between your Amber results and mine is within normal run-to-run variation.
On the rest of the tests, your card runs faster than mine.

peastman commented on August 29, 2024

Those are great numbers. I hope that eventually we'll be able to include the HIP platform as part of the main OpenMM package. The main barrier right now is that HIP isn't supported by conda-forge.
