ccsb-scripps / autodock-gpu
AutoDock for GPUs and other accelerators
Home Page: https://ccsb.scripps.edu/autodock
License: GNU General Public License v2.0
FIRE: Fast Inertial Relaxation Engine
When DEBUG_INITIAL_2BRT is enabled, initialize the genotype using a single work-item. This was initially done using all work-items, but that is unnecessary.
It is an option used in *.dpf files: it runs a single energy calculation. Implementing it has the following requirements:
See pages 44 and 59 in the AutoDock User Guide.
The name of the option is "epdb", but something like "score_only" or "single_point" would not violate the principle of least astonishment.
Energy values reported by OCLADock (-6 kcal/mol) diverge far from those obtained with AutoDock4 (-14 kcal/mol) in these cases (2vaa and 2er7, both with 32 torsions).
It is defined here as __local float gradient_per_intracontributor[MAX_INTRAE_CONTRIBUTORS];, but this can be excessively large. Check here for its definition:
#define MAX_INTRAE_CONTRIBUTORS (MAX_NUM_OF_ATOMS * MAX_NUM_OF_ATOMS)
where MAX_NUM_OF_ATOMS = 100.
So this array consumes 4 bytes * 100 * 100 = 40000 bytes, although the maximum local memory size of the RX Vega 56 GPU is 32768 bytes (according to clinfo output). This is risky, especially for smaller GPU devices.
This might be related to issue #14.
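As a guard, the host could compare the kernel's largest __local allocation against the device's reported limit before launching. A minimal sketch, assuming the usual host-side setup (device_id already selected):

```c
#include <stdio.h>
#include <CL/cl.h>

// Sketch of a host-side guard: refuse to launch if the kernel's
// largest __local array cannot fit on the selected device.
static int local_mem_fits(cl_device_id device_id)
{
    cl_ulong local_mem = 0;
    clGetDeviceInfo(device_id, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);

    // gradient_per_intracontributor: 4 B * 100 * 100 = 40000 B
    cl_ulong required = sizeof(float) * 100 * 100;

    if (required > local_mem) {
        fprintf(stderr, "__local array needs %lu B, device offers %lu B\n",
                (unsigned long)required, (unsigned long)local_mem);
        return 0;
    }
    return 1;
}
```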
Changes to be applied on the fastergrad branch:
This will make the code more readable and ease maintenance.
Several code sections contain unclear, ambiguous, non-portable hardcoding. Replacing, for instance, locally-defined numeric variable sizes, loop upper bounds, etc., with #defines (in defines.h) would improve this. This issue covers all such cases that can be found in the host code.
We start the list with the one from issue #37: get_ligadata() can be improved as suggested there.
From @diogo.martins:
Using the x-ray ligand as input (as opposed to using a randomized conformation/orientation) results in better energies and a greater probability of finding the best solution. Therefore, docking performance is unpredictable: favorable inputs yield better performance, while unfavorable inputs yield worse performance. Docking performance should be independent of the conformation and orientation of the input ligand. Indeed, an AutoDock user complained about this problem in a former version of AutoDock, and it was subsequently fixed.
It is very likely (but not certain) that fixing the orientation bias in the initial population will solve this problem.
According to https://arxiv.org/pdf/physics/0506177.pdf:
If the orientation is given in axis-angle space, s, then the axis, s/|s|, should be chosen uniformly on S², and the rotation angle, |s|, should be sampled from [0, π] with probability (2/π) sin²(|s|/2).
The first part - choosing uniformly on S² - is equivalent to choosing a random point on a sphere surface; see http://mathworld.wolfram.com/SpherePointPicking.html (note that theta and phi have inverted meanings with respect to the ocladock notation).
The second part - sampling the angle with probability (2/π) sin²(|angle|/2) - is trickier. A look-up table might be the easiest way to implement it; as far as I know, there is no analytical function that provides such a probability distribution.
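One possible shape for that look-up table, as an illustrative sketch (not ocladock code; names and table size are made up): the CDF of (2/π) sin²(θ/2) on [0, π] is F(θ) = (θ − sin θ)/π, so the table stores a sampled inverse of F. The axis uses the standard trick of picking z uniformly in [−1, 1] and the azimuth uniformly in [0, 2π).

```c
#include <math.h>
#include <stdlib.h>

#define LUT_SIZE 1024
#define PI 3.14159265358979323846

// angle_lut[i] approximates F^{-1}(i/LUT_SIZE), F(theta) = (theta - sin(theta))/pi
static float angle_lut[LUT_SIZE + 1];

static void build_angle_lut(void)
{
    int i = 0;
    for (int j = 0; j <= 100000; j++) {           // scan F and record crossings
        double theta = PI * j / 100000.0;
        double cdf = (theta - sin(theta)) / PI;
        while (i <= LUT_SIZE && cdf >= (double)i / LUT_SIZE)
            angle_lut[i++] = (float)theta;
    }
    while (i <= LUT_SIZE)
        angle_lut[i++] = (float)PI;
}

static void random_rotation(float axis[3], float *angle)
{
    double u = rand() / (double)RAND_MAX;
    double v = rand() / (double)RAND_MAX;
    double w = rand() / (double)RAND_MAX;

    double z   = 2.0 * u - 1.0;                   // uniform on the sphere:
    double phi = 2.0 * PI * v;                    // z in [-1,1], phi in [0,2pi)
    double r   = sqrt(1.0 - z * z);
    axis[0] = (float)(r * cos(phi));
    axis[1] = (float)(r * sin(phi));
    axis[2] = (float)z;

    *angle = angle_lut[(int)(w * LUT_SIZE)];      // inverse-CDF lookup
}
```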
For completeness of the paper, let us include POCL experiment results.
Some preliminary info on possible evaluation platforms to use:
It is written: "Structual origins ... "
It should be: "Structural origins ..."
Name them consistently as in the joint paper.
Not consistently defined in debugfastergrad due to the preliminary calcgradient implementation.
In debugfastergrad:
#define MAX_INTRAE_CONTRIBUTORS 8192
In master, which is more consistent and scalable:
#define MAX_INTRAE_CONTRIBUTORS MAX_NUM_OF_ATOMS * MAX_NUM_OF_ATOMS
Reason: AutoDock is GPL code
Originally deleted – issue 13.
This error was pointed out by oclgrind in kernel4:
Invalid write to read-only buffer
Kernel: gpu_gen_and_eval_newpops
Entity: Group(47,0,0)
call spir_func void @_Z17wait_group_eventsiP9ocl_event(i32 1, %opencl.event_t** nonnull %ev171) #9, !dbg !449
At line 2097 (column 3) of input.cl:
wait_group_events(1,&ev);
So mem_dockpars_conformations_current must be configured as CL_MEM_READ_WRITE, and NOT just for reading as in master and in debugfastergrad.
Although the source code in kernel4 never updates this memory object explicitly, it is effectively updated in every other genetic iteration, because populations are updated by switching pointers, as in /host/src/performdocking.cpp.
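The corresponding host-side change is a one-flag fix at buffer creation. A sketch ("context", "size_populations", and the previous flag are assumptions based on the description above):

```c
// The populations buffer is written every other generation via the
// pointer swap in performdocking.cpp, so it cannot be read-only.
mem_dockpars_conformations_current =
        clCreateBuffer(context,
                       CL_MEM_READ_WRITE,   // previously a read-only flag
                       size_populations,
                       NULL, &err);
```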
Initially, only configurations of 16, 32, and 64 work-items passed the tests; the 128-wi configuration failed on the host side.
For the -lsrat argument, ocladock should accept 100, but getparameters.cpp tests for "< 100" instead of "<= 100".
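A sketch of the corrected check (the surrounding variable names in getparameters.cpp are assumptions):

```c
// The upper bound must be inclusive: previously the accept test was
// "< 100", which rejected exactly 100.
if ((lsrat < 0.0f) || (lsrat > 100.0f))
    printf("Warning: value of -lsrat ignored. It must be within [0, 100].\n");
else
    mypars->lsearch_rate = lsrat;
```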
Analysis made from commit b243639.
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
entity_id | int | 1 | 4 * 1 | 4 |
run_id | int | 1 | 4 * 1 | 4 |
energy | float | 1 | 4 * 1 | 4 |
genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Subtotal (Bytes) | 164 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
iteration_cnt | uint | 1 | 4 * 1 | 4 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
stepsize | float | 1 | 4 * 1 | 4 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
candidate_energy | float | 1 | 4 * 1 | 4 |
candidate_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Subtotal (Bytes) | 308 |
// Derived from autodockdev/maps.py
// Gradient of the intermolecular energy per each ligand atom
// Also used to store the accumulated gradient per each ligand atom
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient_inter_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_inter_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_inter_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
Subtotal (Bytes) | 3072 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient_intra_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_intra_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_intra_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
Subtotal (Bytes) | 3072 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
Subtotal (Bytes) | 3328 (GPU) or 3136 (CPU) |
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE[NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE[NUM_OF_THREADS_PER_BLOCK];
#endif
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
max_trans_grad | float | 1 | 4 * 1 | 4 |
max_rota_grad | float | 1 | 4 * 1 | 4 |
max_tors_grad | float | 1 | 4 * 1 | 4 |
max_trans_stepsize | float | 1 | 4 * 1 | 4 |
max_rota_stepsize | float | 1 | 4 * 1 | 4 |
max_tors_stepsize | float | 1 | 4 * 1 | 4 |
max_stepsize | float | 1 | 4 * 1 | 4 |
Subtotal (Bytes) | 28 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
torsions_gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
GPU: 164+4+4+308+3072+3072+3328+28+152 = 10132
CPU: 164+4+4+308+3072+3072+3136+28+152 = 9940
From: "How can I get my C code to automatically print out its Git version hash?" (see reference).
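The referenced approach boils down to having the Makefile inject the hash as a preprocessor define. A sketch (flag placement assumed):

```make
# Inject the current commit hash at build time.
GIT_VERSION := $(shell git describe --abbrev=7 --dirty --always --tags)
CFLAGS += -DGIT_VERSION=\"$(GIT_VERSION)\"
```

The host can then print GIT_VERSION at startup, with an #ifndef fallback such as "unknown" for builds made outside a git checkout.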
This instruction in the "debugfastergrad" branch copies GENOTYPE_LENGTH_IN_GLOBMEM elements from global memory into the local array genotype[ACTUAL_GENOTYPE_LENGTH]. However, that array is smaller than the number of elements copied into it. Therefore, the aforementioned instruction should copy ACTUAL_GENOTYPE_LENGTH elements instead of GENOTYPE_LENGTH_IN_GLOBMEM.
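A sketch of the fix (kernel side; the source pointer expression is an assumption, the point is the element count):

```c
// Copy only as many elements as the destination can hold.
__local float genotype[ACTUAL_GENOTYPE_LENGTH];

event_t ev = async_work_group_copy(
        genotype,
        dockpars_conformations_current + GENOTYPE_LENGTH_IN_GLOBMEM * entity_id,
        ACTUAL_GENOTYPE_LENGTH,   // was GENOTYPE_LENGTH_IN_GLOBMEM: overflows genotype[]
        0);
wait_group_events(1, &ev);
```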
The x-ray ligand (instead of the randomized one) should be used for the RMSD calculation.
The ligand_xray is the biological answer, so the RMSD between a pose and ligand_xray tells you whether you found the biological answer. We will use ligand_xray when we want to know if we found the biological solution. So far we have been looking only at energies.
Since -lfile is used as the reference, we may want to add a new option -reflig to provide the reference ligand for the RMSD calculation, and still be able to use a randomized ligand as the input file. The proposed -reflig would be equivalent to AutoDock's rmsref.
In AutoDock (and certainly in OCLADock as well) the input ligand is not used for clustering. The clustering algorithm is as follows: first, the best solution (out of all GA runs) is used as reference. Then, in order of increasing energy, the solutions of the other GA runs are assigned to the first cluster if their RMSD to the reference is below -rmstol. This process is repeated, as many times as necessary, for all GA solutions that did not fall below -rmstol. To conclude, the randomized ligand is not the cause of bad clustering.
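For reference, the pass described above as a sketch (rmsd() and the pose layout are hypothetical; poses must already be sorted by increasing energy). Note that the input ligand never enters this loop, which is the point above.

```c
extern float rmsd(const float *pose_a, const float *pose_b);

void cluster_poses(const float **poses, int n_runs, float rmstol, int *cluster_id)
{
    for (int i = 0; i < n_runs; i++)
        cluster_id[i] = -1;                  // -1 = not yet assigned

    int next_cluster = 0;
    for (int i = 0; i < n_runs; i++) {
        if (cluster_id[i] != -1)
            continue;
        cluster_id[i] = next_cluster;        // best unassigned pose seeds a cluster
        for (int j = i + 1; j < n_runs; j++)
            if (cluster_id[j] == -1 && rmsd(poses[i], poses[j]) < rmstol)
                cluster_id[j] = next_cluster;
        next_cluster++;
    }
}
```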
For large ligands, it is common to have as many clusters as GA runs, which means that no pair of poses falls below -rmstol. This happened a lot in "forli21" using AutoDock4.2, even with 25M evals (see the .dlg files in the shared repo).
Detected with oclgrind.
Large constant coefficients may be treated as double-typed by the compiler, and data derived from these coefficients can then be double too. If so, such data are not valid arguments for native_*() functions, as here. Therefore, such constant coefficients have to be expressed with fewer decimal digits so they are treated as float by any compiler.
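A hypothetical illustration (the coefficient is made up; an explicit f suffix achieves the same as trimming digits):

```c
// Without a suffix the literal is double, the product becomes double,
// and the float-only native_*() built-ins reject it.
float bad  = native_divide(qq, dist * 9876.54321098765);  // double-typed literal
float good = native_divide(qq, dist * 9876.5432f);        // float-typed literal
```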
Analysis made from commit b243639.
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
energy | float | 1 | 4 * 1 | 4 |
run_id | int | 1 | 4 * 1 | 4 |
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
Subtotal (Bytes) | 3488 (GPU) or 3296 (CPU) |
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE[NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE[NUM_OF_THREADS_PER_BLOCK];
#endif
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
partsum_evals | int | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
genotype_candidate | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
genotype_deviate | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
genotype_bias | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
rho | float | 1 | 4 * 1 | 4 |
cons_succ | int | 1 | 4 * 1 | 4 |
cons_fail | int | 1 | 4 * 1 | 4 |
iteration_cnt | int | 1 | 4 * 1 | 4 |
candidate_energy | float | 1 | 4 * 1 | 4 |
evaluation_cnt | int | 1 | 4 * 1 | 4 |
offspring_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
run_id | int | 1 | 4 * 1 | 4 |
entity_id | int | 1 | 4 * 1 | 4 |
offspring_energy | float | 1 | 4 * 1 | 4 |
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
Subtotal (Bytes) | 3972 (GPU) or 3780 (CPU) |
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE [NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE [NUM_OF_THREADS_PER_BLOCK];
#endif
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
offspring_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
parent_candidates | int | 4 | 4 * 4 | 16 |
candidate_energies | float | 4 | 4 * 4 | 16 |
parents | int | 2 | 4 * 2 | 8 |
run_id | int | 1 | 4 * 1 | 4 |
covr_point | int | 2 | 4 * 2 | 8 |
randnums | float | 10 | 4 * 10 | 40 |
energy | float | 1 | 4 * 1 | 4 |
best_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
best_IDs | int | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
best_ID | int | 1 | 4 * 1 | 4 |
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
Subtotal (Bytes) | 4092 (GPU) or 3516 (CPU) |
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE [NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE [NUM_OF_THREADS_PER_BLOCK];
#endif
GPU: 3488+256+3972+4092 = 11808
CPU: 3296+64+3780+3516 = 10656
Example: -lsit 0.
In the ligand .pdbqt file, BRANCH keywords define rotatable bonds, e.g.:
BRANCH 1 6
which means that the bond between atoms 1 and 6 is rotatable.
However, ocladock ignores the second atom index (6 in the example above) and assumes that the atom immediately following the BRANCH keyword is bonded to atom 1.
This example is from ligand "1n1m" from the Astex dataset:
ROOT
ATOM 1 C UNL 1 80.420 74.385 97.448 0.00 0.00 +0.016 C
ENDROOT
BRANCH 1 6
ATOM 2 C UNL 1 84.189 77.213 97.348 0.00 0.00 +0.020 C
ATOM 3 C UNL 1 84.632 76.660 98.776 0.00 0.00 +0.020 C
ATOM 4 C UNL 1 84.034 75.234 98.692 0.00 0.00 +0.108 C
ATOM 5 N UNL 1 82.662 75.516 98.209 0.00 0.00 -0.303 N
ATOM 6 C UNL 1 81.800 74.363 97.953 0.00 0.00 +0.201 C
ATOM 7 O UNL 1 82.322 73.115 98.201 0.00 0.00 -0.277 OA
ATOM 8 C UNL 1 82.854 76.381 97.032 0.00 0.00 +0.108 C
ENDBRANCH 1 6
BRANCH 1 9
ATOM 9 N UNL 1 80.215 73.353 96.379 0.00 0.00 +0.386 N
ATOM 10 H UNL 1 80.372 72.421 96.748 0.00 0.00 -0.089 HD
ATOM 11 H UNL 1 80.844 73.456 95.593 0.00 0.00 -0.089 HD
ATOM 12 H UNL 1 79.275 73.376 96.008 0.00 0.00 -0.089 HD
ENDBRANCH 1 9
BRANCH 1 13
ATOM 13 C UNL 1 79.296 74.230 98.432 0.00 0.00 -0.023 C
ATOM 14 C UNL 1 77.805 74.286 97.634 0.00 0.00 +0.006 C
ATOM 15 C UNL 1 79.221 75.282 99.458 0.00 0.00 +0.006 C
ENDBRANCH 1 13
TORSDOF 3
Notes:
This ligand was prepared with Open Babel. If the script prepare_ligand4.py is used instead, the order of the atoms is shuffled so that the atom immediately following the BRANCH keyword matches the second index.
I classify this issue as a bug because:
- ocladock silently deviates from what the BRANCH keyword declares, ignoring its second atom index
- reordering the atoms so that the atom immediately following the BRANCH keyword matches the second index often requires shuffling the order of the atoms, which is bad practice.

This error was found with oclgrind and points to the last asynchronous copies in kernel4 and kernel_gradient:
Work-item finished without waiting for events.
Other kernels might suffer from this too:
master -> kernel3 and kernel4.
debugfastergrad -> kernel3, kernel4, kernel_gradient, and kernel_fire.
Keep in mind the following information from the OpenCL v2.0 standard:
This function does not perform any implicit synchronization of source data such as using a barrier before performing the copy.
The kernel must wait for the completion of all async copies using the wait_group_events built-in function before exiting; otherwise the behavior is undefined.
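A sketch of the required pattern (toy kernel, not ocladock code): every async_work_group_copy is paired with a wait_group_events before the destination is used and before the work-group exits.

```c
__kernel void copy_example(__global const float *src, __global float *dst)
{
    __local float tile[64];                       // assumes work-group size 64

    event_t ev_in = async_work_group_copy(tile, src, 64, 0);
    wait_group_events(1, &ev_in);                 // required before reading tile

    dst[get_local_id(0)] = 2.0f * tile[get_local_id(0)];

    event_t ev_out = async_work_group_copy(dst + 64, tile, 64, 0);
    wait_group_events(1, &ev_out);                // required before the kernel exits
}
```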
Redefine MAX_NUM_OF_ATOMS, as this would support almost all possible ligand cases. Do so in both branches: master and grad.
This might be related to issue #12.
Detected with oclgrind.
According to the clEnqueueMapBuffer documentation:
If a memory object is currently mapped for reading, the application must ensure that the memory object is unmapped before any enqueued kernels or commands that write to this memory object or any of its associated memory objects (sub-buffer or 1D image buffer objects) or its parent object (if the memory object is a sub-buffer or 1D image buffer object) begin execution; otherwise the behavior is undefined.
The maps performed in the master and debugfastergrad branches are followed by kernel2 executions, as in master and debugfastergrad. That kernel2 writes to the mapped memory object. Therefore, the aforementioned map call should be followed by an unmap call before kernel2 (or any kernel that writes to the mapped memory object) is invoked.
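A sketch of the corrected ordering (host side; "queue", "mem_energies", "kernel2", and the sizes are assumed names):

```c
cl_int err;
float *mapped = (float *)clEnqueueMapBuffer(queue, mem_energies, CL_TRUE,
                                            CL_MAP_READ, 0, size_energies,
                                            0, NULL, NULL, &err);
// ... read the results through 'mapped' ...

err = clEnqueueUnmapMemObject(queue, mem_energies, mapped, 0, NULL, NULL);

// Only now is it safe to launch kernel2, which writes mem_energies.
err = clEnqueueNDRangeKernel(queue, kernel2, 1, NULL,
                             &global_size, &local_size, 0, NULL, NULL);
```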
Explanation:
Just checked the effect of the smooth parameter on the intra energy.
On a large ligand (2er7) in a fairly extended conformation (i.e. the
ligand is not folding onto itself), the intra energy with smooth=0.5
is -9 kcal. However, using smooth=0.0 the intra energy is -5
kcal. This explains the different results between autodock and
ocladock for ligands with many torsions.
Q:
Do you mind running some more tests on other bigger complexes too
(e.g. 2er7, 3er5, 4er4) to see if that happens as well? Let me know if
you can do it. Also, I assume you would need the atom-contributor
pairs for each complex, wouldn't you?
By the way, I have just checked (in the *.gpf and *.dpf files) that I created the grids with the default smooth parameter (= 0.5 Å), so I guess this would explain why we had no significant discrepancies on that.
A:
I don't need the pairs because I'm using the original AutoDock4.2 - I
just change the smooth parameter in the .dpf file. It only affects the
pairwise contributions, the grids remain unchanged.
Here are the intra energies for the complexes you asked about:
complex, smooth=0.5, smooth=0.0
2vaa, -7.64, -5.45
3er5, +12.66, +93.78
4er4, -9.84, -6.51
Q:
I see the difference. Did you implement this in autodockdevpy? If so, I could reuse it for ocladock... thanks!
A:
Yes, it's implemented, but it's disabled in branch "ocladockenergy". Look in branch "dev".
In files pairwise_energies.py and pairwise_derivatives.py, it's implemented in function "_calc_smooth()", which modifies the distance before evaluating 'vdw' and 'hb' energy contributions.
I don't know how important it is to implement this smooth parameter. I'll try to figure it out with Stefano and gather other opinions from other people in the lab.
Q:
I have just checked the Python code of "_calc_smooth()", and it doesn't seem complex to implement in OpenCL. The only doubt I have is the meaning of "r" and "rij" and their relationship.
Anyway, let me know what you guys think about including this function.
A1:
"r" is a variable: it's the current distance between two atoms during the docking.
"rij" is a parameter: it's the optimum distance for the pair (e.g.: "rij" for C - C is 4 angstroms).
A2:
I just talked with Stefano and David about the smooth parameter.
The smooth parameter is important for the grids, we know that for sure. However, it is unclear how important it is for the pairwise interactions. According to the user guide (*), it was only added to pairwise interactions in version 4.2.5.
However, for the sake of publication, it would be beneficial to have a direct comparison with the current AutoDock version, so we recommend its implementation. It would be even better if it could be a user-specified argument, either at run time or at compile time.
(*) http://autodock.scripps.edu/faqs-help/manual/autodock-4-2-user-guide/AutoDock4.2_UserGuide.pdf (see page 6)
Q:
Ok, then I will start implementing it.
A technical question: I assume "rij" (the optimum distance) depends on the atom types, doesn't it?
Do you know where, either in "autodockdevpy" or AD4, I can find the "rij" values?
A:
Yes, rij is the sum of vdW radii for the atom pair. It's calculated as 0.5 * rii + 0.5 * rjj, because rii and rjj are twice the vdW radii. The rij values must already be present in OCLADock, because they are needed to calculate C12 and C6 for vdW, and C12 and C10 for hydrogen bonds. It's probably a matter of storing them along with C12, C10 and C6 for use in the energy evaluation.
Q:
Regarding the smooth parameter to be specified as a user argument:
The AD4.2 documentation says the force field has been optimized for a smooth value of 0.5 Å, so I am setting this as the default smooth value.
But I was wondering if such a parameter has lower and upper bounds. Can you suggest these values? This would prevent any crazy smooth inputs...
A:
In theory there wouldn't be any limits, and the current AutoDock doesn't check the provided values. If you want to include one, I would say that 5.0 is a pretty high upper bound, while the minimum can't go lower than 0.0 (it's a distance).
Maybe single precision suffices? Otherwise, enable double-precision types and target NVIDIA Volta GPUs (e.g. AWS P3 instances).
Solved in commit 3086bd5.
The array __local int local_evals_of_new_entities[MAX_POPSIZE] used in master and debugfastergrad is excessively large: 2048 elements, per its definition in common/defines.h. This array should be removed, and the related accesses performed directly from global memory.
From @diogo.martins:
AutoDock4.2 writes the best pose of each cluster at the end of the .dlg file. These poses can be read by PyMOL. AutoDock-GPU should also write the top pose of each cluster to allow rapid visualization.
Recompiling and running from the master branch (commit dea8214):
../ocladock/bin/ocladock_gpu_64wi -ffile ./Data/forli21/7cpa/7cpa_protein.maps.fld -lfile ./Data/forli21/7cpa/ligand.pdbqt -nrun 100 -nev 100000000 -resnam x -lsrat 99 -lsit 100 -ngen 1
The energies are always the same, regardless of the local-search parameters. In other words, energy values don't improve with a larger number of local-search iterations (-lsit). What is going on?
Check that the local (work-group) size can be configured from one place only (ideally the Makefile), without manual switching in the host and device code. Add flags in the Makefile that enable/disable the #defines in wrapcl/inc/commonMacros.h.
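A sketch of how the Makefile could be the single source of truth (variable and flag names assumed):

```make
# Single knob for the work-group size.
NUMWI ?= 64

# The host sees it as a compile-time define ...
CFLAGS += -DNUM_OF_THREADS_PER_BLOCK=$(NUMWI)

# ... and the same define is handed to the OpenCL kernel compiler
# (appended to the options string passed to clBuildProgram).
KERNEL_OPTS += -DNUM_OF_THREADS_PER_BLOCK=$(NUMWI)
```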
Analysis made from commit b243639.
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
entity_id | int | 1 | 4 * 1 | 4 |
run_id | int | 1 | 4 * 1 | 4 |
energy | float | 1 | 4 * 1 | 4 |
genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Subtotal (Bytes) | 164 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
iteration_cnt | uint | 1 | 4 * 1 | 4 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
candidate_gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Subtotal (Bytes) | 304 |
// Then, we return the genotype corresponding
// to the best observed energy, i.e. "best_genotype"
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
best_energy | float | 1 | 4 * 1 | 4 |
best_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
candidate_energy | float | 1 | 4 * 1 | 4 |
candidate_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Subtotal (Bytes) | 312 |
// Also used to store the accumulated gradient per each ligand atom
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient_inter_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_inter_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_inter_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
Subtotal (Bytes) | 3072 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient_intra_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_intra_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_intra_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
Subtotal (Bytes) | 3072 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
Subtotal (Bytes) | 3328 (GPU) or 3136 (CPU) |
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE[NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE[NUM_OF_THREADS_PER_BLOCK];
#endif
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
velocity | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
alpha | float | 1 | 4 * 1 | 4 |
count_success | uint | 1 | 4 * 1 | 4 |
dt | float | 1 | 4 * 1 | 4 |
Subtotal (Bytes) | 164 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient_tmp | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
inv_gradient_norm | float | 1 | 4 * 1 | 4 |
velocity_tmp | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
velocity_norm | float | 1 | 4 * 1 | 4 |
velnorm_div_gradnorm | float | 1 | 4 * 1 | 4 |
Subtotal (Bytes) | 316 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
power_tmp | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
power | float | 1 | 4 * 1 | 4 |
Subtotal (Bytes) | 156 |
GPU: 164+4+304+312+3072+3072+3328+164+316+156 = 10892
CPU: 164+4+304+312+3072+3072+3136+164+316+156 = 10700
Caveats:
It is possible that the atom order in the reference ligand (-xraylfile) differs from the atom order in the input ligand (-lfile). If this is the case, and symmetry is turned off (-hsym 0), the RMSD will be calculated incorrectly. We may want to add this to the documentation.
Add -xraylfile information into README
The maximum allowed size for constant arguments (CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE) varies per device, e.g.:
The AMD Vega56 GPU allows 4.2 GB:
ULong attributes ...
1 CL_DEVICE_MAX_MEM_ALLOC_SIZE : 4244635648
1 CL_DEVICE_GLOBAL_MEM_CACHE_SIZE : 16384
1 CL_DEVICE_GLOBAL_MEM_SIZE : 8573157376
1 CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE : 4244635648
1 CL_DEVICE_LOCAL_MEM_SIZE : 32768
The NVIDIA M2000 GPU allows only 16.7 MB:
ULong attributes ...
1 CL_DEVICE_MAX_MEM_ALLOC_SIZE : 8589934592
1 CL_DEVICE_GLOBAL_MEM_CACHE_SIZE : 20971520
1 CL_DEVICE_GLOBAL_MEM_SIZE : 31497080832
1 CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE : 16777216
1 CL_DEVICE_LOCAL_MEM_SIZE : 16777216
As all constant kernel arguments were qualified as __constant, the program might be allocating more data than fits in the constant buffer. Perhaps this is handled automatically by the compiler, i.e., it might move the excess data to global memory. However, it is signalled as an error by oclgrind --check-api:
Oclgrind - OpenCL runtime error detected
Function: clEnqueueNDRangeKernel
Error: CL_OUT_OF_RESOURCES
total constant memory size (252528) exceeds device maximum of 65536
Error: clEnqueueNDRangeKernel() -5
Oclgrind - OpenCL runtime error detected
Function: clEnqueueNDRangeKernel
Error: CL_OUT_OF_RESOURCES
total constant memory size (297680) exceeds device maximum of 65536
Error: clEnqueueNDRangeKernel() -5
A solution is to evaluate the amount of data being passed to the kernel, and then re-qualify each argument as either __constant or __global const (see the sketch after the tables below).
The size calculation is as follows. Originally, each of these arrays is passed as a separate __constant argument from host to kernel. Here, they are listed in groups for convenient data-passing from host to device (see the kernel code in the commits below).
interintra
(subtotal size: 1280)
Constant array | Size definition | Size calculation | Size in Bytes |
---|---|---|---|
atom_charges | MAX_NUM_OF_ATOMS * sizeof(float) | 256 * 4 | 1024 |
atom_types | MAX_NUM_OF_ATOMS * sizeof(char) | 256 * 1 | 256 |
intracontrib
Constant array | Size definition | Size calculation | Size in Bytes |
---|---|---|---|
intraE_contributors | 3 * MAX_INTRAE_CONTRIBUTORS * sizeof(char) | 3 * 256 * 256 * 1 | 196608 |
intra
(subtotal size: 2032)
Constant array | Size definition | Size calculation | Size in Bytes |
---|---|---|---|
reqm | ATYPE_NUM * sizeof(float) | 22 * 4 | 88 |
reqm_hbond | ATYPE_NUM * sizeof(float) | 22 * 4 | 88 |
atom1_types_reqm | ATYPE_NUM * sizeof(unsigned int) | 22 * 4 | 88 |
atom2_types_reqm | ATYPE_NUM * sizeof(unsigned int) | 22 * 4 | 88 |
VWpars_AC | MAX_NUM_OF_ATYPES * MAX_NUM_OF_ATYPES * sizeof(float) | 14 * 14 * 4 | 784 |
VWpars_BD | MAX_NUM_OF_ATYPES * MAX_NUM_OF_ATYPES * sizeof(float) | 14 * 14 * 4 | 784 |
dspars_S | MAX_NUM_OF_ATYPES * sizeof(float) | 14 * 4 | 56 |
dspars_V | MAX_NUM_OF_ATYPES * sizeof(float) | 14 * 4 | 56 |
rotlist
Constant array | Size definition | Size calculation | Size in Bytes |
---|---|---|---|
rotlist | MAX_NUM_OF_ROTATIONS * sizeof(int) | 256 * 32 * 4 | 32768 |
conform
(subtotal size: 19840)
Constant array | Size definition | Size calculation | Size in Bytes |
---|---|---|---|
ref_coords_x | MAX_NUM_OF_ATOMS * sizeof(float) | 256 * 4 | 1024 |
ref_coords_y | MAX_NUM_OF_ATOMS * sizeof(float) | 256 * 4 | 1024 |
ref_coords_z | MAX_NUM_OF_ATOMS * sizeof(float) | 256 * 4 | 1024 |
rotbonds_moving_vectors | 3 * MAX_NUM_OF_ROTBONDS * sizeof(float) | 3 * 32 * 4 | 384 |
rotbonds_unit_vectors | 3 * MAX_NUM_OF_ROTBONDS * sizeof(float) | 3 * 32 * 4 | 384 |
ref_orientation_quats | 4 * MAX_NUM_OF_RUNS * sizeof(float) | 4 * 1000 * 4 | 16000 |
A total of 252528 Bytes is required for constant data: much smaller than the CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE of the GPUs above, but well above the 64 KB spec minimum that oclgrind enforces.
For debugfastergrad, we require the following arrays as well:
gradsrotbonds
Constant array | Size definition | Size calculation | Size in Bytes |
---|---|---|---|
rotbonds_atoms | MAX_NUM_OF_ATOMS * MAX_NUM_OF_ROTBONDS * sizeof(int) | 256 * 32 * 4 | 32768 |
grads
(subtotal size: 12384)
Constant array | Size definition | Size calculation | Size in Bytes |
---|---|---|---|
rotbonds | 2 * MAX_NUM_OF_ROTBONDS * sizeof(int) | 2 * 32 * 4 | 256 |
num_rotating_atoms_per_rotbond | MAX_NUM_OF_ROTBONDS * sizeof(int) | 32 * 4 | 128 |
angle | 1000*sizeof(float) | 1000 * 4 | 4000 |
dependence_on_theta | 1000*sizeof(float) | 1000 * 4 | 4000 |
dependence_on_rotangle | 1000*sizeof(float) | 1000 * 4 | 4000 |
A total of 297680 (252528 + 45152) Bytes is required for constant data: again much smaller than the limits of the GPUs above, but above the 64 KB spec minimum.
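A sketch of the proposed re-qualification (the argument grouping follows the tables above; the exact kernel signature is an assumption):

```c
// Small tables stay __constant; the large ones move to __global const,
// so the 64 KB spec minimum for constant memory is never exceeded.
__kernel void gpu_calc_initpop(
    __constant float     *atom_charges,          //   1024 B
    __constant char      *atom_types,            //    256 B
    __global const char  *intraE_contributors,   // 196608 B -> too big for __constant
    __global const int   *rotlist,               //  32768 B -> moved as well
    __constant float     *ref_coords_x           //   1024 B
    /* ... remaining arguments ... */
)
{
    /* kernel body unchanged */
}
```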
From @diogo.martins:
The fld file (input arg -ffile) is parsed to get the following data:
The fld file explicitly declares an association between atom types and map files; these associations are currently ignored by ocladock. It could be more intuitive if the -ffile argument were replaced by a -mapprefix argument. The grid center, size, and spacing would then be parsed from the map files.
Note: this is very low priority.
Analysis made from commit b243639.
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
entity_id | int | 1 | 4 * 1 | 4 |
run_id | int | 1 | 4 * 1 | 4 |
energy | float | 1 | 4 * 1 | 4 |
genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Subtotal (Bytes) | 164 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
iteration_cnt | uint | 1 | 4 * 1 | 4 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
// Then, we return the genotype corresponding
// to the best observed energy, i.e. "best_genotype"
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
best_energy | float | 1 | 4 * 1 | 4 |
best_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Subtotal (Bytes) | 156 |
// Also used to store the accumulated gradient per each ligand atom
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient_inter_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_inter_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_inter_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
Subtotal (Bytes) | 3072 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient_intra_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_intra_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_intra_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
Subtotal (Bytes) | 3072 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
Subtotal (Bytes) | 3328 (GPU) or 3136 (CPU) |
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE[NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE[NUM_OF_THREADS_PER_BLOCK];
#endif
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
square_gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
// It is added to the genotype to create the next genotype.
// E.g. in steepest descent "delta" is -1.0 * stepsize * gradient
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
delta | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
square_delta | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
GPU: 164+4+152+156+3072+3072+3328+152+152+152 = 10404
CPU: 164+4+152+156+3072+3072+3136+152+152+152 = 10212
Appropriate rules should be added to the Makefile. This should be reflected in the documentation as well.
This might require an additional script.
In the debugfastergrad branch, the Gradientparameters structure type should be removed.
The program currently accepts an lsrat equal to 0 and stores its value into dockpars.num_of_lsentities, which in turn defines the work-group size of the LS kernel. A work-group of size zero leads to OpenCL error code -54 (CL_INVALID_WORK_GROUP_SIZE). This affects the master and fastergrad branches.
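A sketch of a host-side guard (names follow the issue; the fallback value is hypothetical):

```c
// Keep dockpars.num_of_lsentities from becoming zero, which would
// request a zero-sized work-group and trigger error -54.
if (lsrat <= 0.0f) {
    printf("Warning: -lsrat must be greater than 0. Using a default value.\n");
    lsrat = 6.0f;  // hypothetical default local-search rate (%)
}
```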
It seems unnecessary to have <VENDOR>_INCLUDE_PATH and <VENDOR>_LIBRARY_PATH variables for INTEL, AMD, and NVIDIA: that would force us to maintain as many such variables as there are vendor devices OCLADock supports.
Adopt the following:
Device type | Include path | Library path |
---|---|---|
CPU | CPU_INCLUDE_PATH | CPU_LIBRARY_PATH |
GPU | GPU_INCLUDE_PATH | GPU_LIBRARY_PATH |
The user must define these include and library paths before compiling or executing OCLADock.
This would also make the Makefile cleaner and more general.
Gradient-based local-search kernels utilize both the energy and gradient functions, which share common subfunctions. By merging the energy and gradient functions, it would be possible to reuse those subfunctions and apply loop fusion, in order to accelerate overall execution. Such merging is possible only when both functions use the same input genotype. Currently, two of these kernels (fire & adadelta) meet that criterion.
Should this disable local search or be ignored? It is currently ignored in the master and fastergrad branches.