ccsb-scripps / autodock-gpu
AutoDock for GPUs and other accelerators
Home Page: https://ccsb.scripps.edu/autodock
License: GNU General Public License v2.0
FIRE: Fast Inertial Relaxation Engine
When DEBUG_INITIAL_2BRT is enabled, initialize the genotype using a single work-item. This was initially done using all work-items, but that is unnecessary.
It is an option used in *.dpf files: it runs a single energy calculation. Implementing it has the following requirements:
See pages 44 and 59 in the AutoDock User Guide.
The name of the option is "epdb", but something like "score_only" or "single_point" would not violate the principle of least astonishment.
Energy values reported by OCLADock (-6 kcal/mol) diverge far from those obtained with AutoDock4 (-14 kcal/mol) in these cases (2vaa and 2er7, both with 32 torsions).
It is defined here as __local float gradient_per_intracontributor[MAX_INTRAE_CONTRIBUTORS];, but this can be excessively large. Check here for its definition:
#define MAX_INTRAE_CONTRIBUTORS (MAX_NUM_OF_ATOMS * MAX_NUM_OF_ATOMS)
where MAX_NUM_OF_ATOMS = 100.
So this array consumes 4 bytes * 100 * 100 = 40000 bytes, although the maximum local memory size of the RX Vega 56 GPU is 32768 bytes (according to clinfo output). This is risky, especially for smaller GPU devices.
This might be related to issue #14.
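As a guard, the host could compare the kernel's largest __local allocation against the device's reported limit before launching. A minimal sketch, assuming the usual host-side setup (device_id already selected):

```c
#include <stdio.h>
#include <CL/cl.h>

// Sketch of a host-side guard: refuse to launch if the kernel's
// largest __local array cannot fit on the selected device.
static int local_mem_fits(cl_device_id device_id)
{
    cl_ulong local_mem = 0;
    clGetDeviceInfo(device_id, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);

    // gradient_per_intracontributor: 4 B * 100 * 100 = 40000 B
    cl_ulong required = sizeof(float) * 100 * 100;

    if (required > local_mem) {
        fprintf(stderr, "__local array needs %lu B, device offers %lu B\n",
                (unsigned long)required, (unsigned long)local_mem);
        return 0;
    }
    return 1;
}
```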
Changes to be applied on the fastergrad branch:
This will make the code more readable and ease maintenance.
Several code sections contain unclear, ambiguous, non-portable hardcoding. Replacing, for instance, locally-defined numeric variable sizes, loop upper bounds, etc., with #defines (in defines.h) would improve this. This issue covers all such cases that can be found in the host code.
We start the list with the one from issue #37: get_ligadata() can be improved as suggested there.
From @diogo.martins:
Using the x-ray ligand as input (as opposed to using a randomized conformation/orientation) results in better energies and a greater probability of finding the best solution. Therefore, docking performance is unpredictable: favorable inputs yield better performance, while unfavorable inputs yield worse performance. Docking performance should be independent of the conformation and orientation of the input ligand. Indeed, an AutoDock user complained about this problem in a former version of AutoDock, and it was subsequently fixed.
It is very likely (but not certain) that fixing the orientation bias in the initial population will solve this problem.
According to https://arxiv.org/pdf/physics/0506177.pdf:
If the orientation is given in axis-angle space, s, then the axis, s/|s|, should be chosen uniformly on S², and the rotation angle, |s|, should be sampled from [0, π] with probability (2/π) sin²(|s|/2).
The first part - choosing uniformly on S² - is equivalent to choosing a random point on a sphere surface; see http://mathworld.wolfram.com/SpherePointPicking.html (note that theta and phi have inverted meanings with respect to the ocladock notation).
The second part - sampling the angle with probability (2/π) sin²(|angle|/2) - is trickier. A look-up table might be the easiest way to implement it; as far as I know, there is no analytical function that provides such a probability distribution.
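One possible shape for that look-up table, as an illustrative sketch (not ocladock code; names and table size are made up): the CDF of (2/π) sin²(θ/2) on [0, π] is F(θ) = (θ − sin θ)/π, so the table stores a sampled inverse of F. The axis uses the standard trick of picking z uniformly in [−1, 1] and the azimuth uniformly in [0, 2π).

```c
#include <math.h>
#include <stdlib.h>

#define LUT_SIZE 1024
#define PI 3.14159265358979323846

// angle_lut[i] approximates F^{-1}(i/LUT_SIZE), F(theta) = (theta - sin(theta))/pi
static float angle_lut[LUT_SIZE + 1];

static void build_angle_lut(void)
{
    int i = 0;
    for (int j = 0; j <= 100000; j++) {           // scan F and record crossings
        double theta = PI * j / 100000.0;
        double cdf = (theta - sin(theta)) / PI;
        while (i <= LUT_SIZE && cdf >= (double)i / LUT_SIZE)
            angle_lut[i++] = (float)theta;
    }
    while (i <= LUT_SIZE)
        angle_lut[i++] = (float)PI;
}

static void random_rotation(float axis[3], float *angle)
{
    double u = rand() / (double)RAND_MAX;
    double v = rand() / (double)RAND_MAX;
    double w = rand() / (double)RAND_MAX;

    double z   = 2.0 * u - 1.0;                   // uniform on the sphere:
    double phi = 2.0 * PI * v;                    // z in [-1,1], phi in [0,2pi)
    double r   = sqrt(1.0 - z * z);
    axis[0] = (float)(r * cos(phi));
    axis[1] = (float)(r * sin(phi));
    axis[2] = (float)z;

    *angle = angle_lut[(int)(w * LUT_SIZE)];      // inverse-CDF lookup
}
```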
For completeness of the paper, let us include POCL experiment results.
Some preliminary info on possible evaluation platforms to use:
It is written: "Structual origins ... "
It should be: "Structural origins ..."
Name them consistently as in the joint paper.
Not consistently defined in debugfastergrad due to the preliminary calcgradient implementation.
In debugfastergrad:
#define MAX_INTRAE_CONTRIBUTORS 8192
In master, which is more consistent and scalable:
#define MAX_INTRAE_CONTRIBUTORS MAX_NUM_OF_ATOMS * MAX_NUM_OF_ATOMS
Reason: AutoDock is GPL code
Originally deleted – issue 13.
This error was pointed out by oclgrind in kernel4:
Invalid write to read-only buffer
Kernel: gpu_gen_and_eval_newpops
Entity: Group(47,0,0)
call spir_func void @_Z17wait_group_eventsiP9ocl_event(i32 1, %opencl.event_t** nonnull %ev171) #9, !dbg !449
At line 2097 (column 3) of input.cl:
wait_group_events(1,&ev);
So mem_dockpars_conformations_current must be configured as CL_MEM_READ_WRITE, and NOT just for reading as in master and in debugfastergrad.
Although the source code in kernel4 never updates this memory object explicitly, it is effectively updated in every other genetic iteration, because populations are updated by switching pointers, as in /host/src/performdocking.cpp.
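The corresponding host-side change is a one-flag fix at buffer creation. A sketch ("context", "size_populations", and the previous flag are assumptions based on the description above):

```c
// The populations buffer is written every other generation via the
// pointer swap in performdocking.cpp, so it cannot be read-only.
mem_dockpars_conformations_current =
        clCreateBuffer(context,
                       CL_MEM_READ_WRITE,   // previously a read-only flag
                       size_populations,
                       NULL, &err);
```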
Initially, only configurations of 16, 32, and 64 work-items passed the tests; the 128-wi configuration failed on the host side.
For the -lsrat argument, ocladock should accept 100, but getparameters.cpp tests for "< 100" instead of "<= 100".
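A sketch of the corrected check (the surrounding variable names in getparameters.cpp are assumptions):

```c
// The upper bound must be inclusive: previously the accept test was
// "< 100", which rejected exactly 100.
if ((lsrat < 0.0f) || (lsrat > 100.0f))
    printf("Warning: value of -lsrat ignored. It must be within [0, 100].\n");
else
    mypars->lsearch_rate = lsrat;
```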
Analysis made from commit b243639.
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
entity_id | int | 1 | 4 * 1 | 4 |
run_id | int | 1 | 4 * 1 | 4 |
energy | float | 1 | 4 * 1 | 4 |
genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Subtotal (Bytes) | 164 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
iteration_cnt | uint | 1 | 4 * 1 | 4 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
stepsize | float | 1 | 4 * 1 | 4 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
candidate_energy | float | 1 | 4 * 1 | 4 |
candidate_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Subtotal (Bytes) | 308 |
// Derived from autodockdev/maps.py
// Gradient of the intermolecular energy per each ligand atom
// Also used to store the accumulated gradient per each ligand atom
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient_inter_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_inter_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_inter_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
Subtotal (Bytes) | 3072 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient_intra_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_intra_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_intra_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
Subtotal (Bytes) | 3072 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
Subtotal (Bytes) | 3328 (GPU) or 3136 (CPU) |
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE[NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE[NUM_OF_THREADS_PER_BLOCK];
#endif
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
max_trans_grad | float | 1 | 4 * 1 | 4 |
max_rota_grad | float | 1 | 4 * 1 | 4 |
max_tors_grad | float | 1 | 4 * 1 | 4 |
max_trans_stepsize | float | 1 | 4 * 1 | 4 |
max_rota_stepsize | float | 1 | 4 * 1 | 4 |
max_tors_stepsize | float | 1 | 4 * 1 | 4 |
max_stepsize | float | 1 | 4 * 1 | 4 |
Subtotal (Bytes) | 28 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
torsions_gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
GPU: 164+4+4+308+3072+3072+3328+28+152 = 10132
CPU: 164+4+4+308+3072+3072+3136+28+152 = 9940
From: "How can I get my C code to automatically print out its Git version hash?" (see reference).
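The referenced approach boils down to having the Makefile inject the hash as a preprocessor define. A sketch (flag placement assumed):

```make
# Inject the current commit hash at build time.
GIT_VERSION := $(shell git describe --abbrev=7 --dirty --always --tags)
CFLAGS += -DGIT_VERSION=\"$(GIT_VERSION)\"
```

The host can then print GIT_VERSION at startup, with an #ifndef fallback such as "unknown" for builds made outside a git checkout.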
This instruction in the "debugfastergrad" branch copies GENOTYPE_LENGTH_IN_GLOBMEM elements from global memory into the local array genotype[ACTUAL_GENOTYPE_LENGTH]. However, that array is smaller than the number of elements copied into it. Therefore, the aforementioned instruction should copy ACTUAL_GENOTYPE_LENGTH elements instead of GENOTYPE_LENGTH_IN_GLOBMEM.
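A sketch of the fix (kernel side; the source pointer expression is an assumption, the point is the element count):

```c
// Copy only as many elements as the destination can hold.
__local float genotype[ACTUAL_GENOTYPE_LENGTH];

event_t ev = async_work_group_copy(
        genotype,
        dockpars_conformations_current + GENOTYPE_LENGTH_IN_GLOBMEM * entity_id,
        ACTUAL_GENOTYPE_LENGTH,   // was GENOTYPE_LENGTH_IN_GLOBMEM: overflows genotype[]
        0);
wait_group_events(1, &ev);
```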
The x-ray ligand (instead of the randomized one) should be used for the RMSD calculation.
The ligand_xray is the biological answer, so the RMSD between a pose and ligand_xray tells you whether you found the biological answer. We will use ligand_xray when we want to know if we found the biological solution. So far we have been looking only at energies.
Since -lfile is used as the reference, we may want to add a new option -reflig to provide the reference ligand for the RMSD calculation, and still be able to use a randomized ligand as the input file. The proposed -reflig would be equivalent to AutoDock's rmsref.
In AutoDock (and certainly in OCLADock as well) the input ligand is not used for clustering. The clustering algorithm is as follows: first, the best solution (out of all GA runs) is used as reference. Then, in order of increasing energy, the solutions of the other GA runs are assigned to the first cluster if their RMSD to the reference is below -rmstol. This process is repeated, as many times as necessary, for all GA solutions that did not fall below -rmstol. To conclude, the randomized ligand is not the cause of bad clustering.
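For reference, the pass described above as a sketch (rmsd() and the pose layout are hypothetical; poses must already be sorted by increasing energy). Note that the input ligand never enters this loop, which is the point above.

```c
extern float rmsd(const float *pose_a, const float *pose_b);

void cluster_poses(const float **poses, int n_runs, float rmstol, int *cluster_id)
{
    for (int i = 0; i < n_runs; i++)
        cluster_id[i] = -1;                  // -1 = not yet assigned

    int next_cluster = 0;
    for (int i = 0; i < n_runs; i++) {
        if (cluster_id[i] != -1)
            continue;
        cluster_id[i] = next_cluster;        // best unassigned pose seeds a cluster
        for (int j = i + 1; j < n_runs; j++)
            if (cluster_id[j] == -1 && rmsd(poses[i], poses[j]) < rmstol)
                cluster_id[j] = next_cluster;
        next_cluster++;
    }
}
```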
For large ligands, it is common to have as many clusters as GA runs, which means that no pair of poses falls below -rmstol. This happened a lot in "forli21" using AutoDock4.2, even with 25M evals (see the .dlg files in the shared repo).
Detected with oclgrind.
Large constant coefficients may be treated as double-typed by the compiler, and data derived from these coefficients can then be double too. If so, such data are not valid arguments for native_*() functions, as here. Therefore, such constant coefficients have to be expressed with fewer decimal digits so they are treated as float by any compiler.
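A hypothetical illustration (the coefficient is made up; an explicit f suffix achieves the same as trimming digits):

```c
// Without a suffix the literal is double, the product becomes double,
// and the float-only native_*() built-ins reject it.
float bad  = native_divide(qq, dist * 9876.54321098765);  // double-typed literal
float good = native_divide(qq, dist * 9876.5432f);        // float-typed literal
```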
Analysis made from commit b243639.
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
energy | float | 1 | 4 * 1 | 4 |
run_id | int | 1 | 4 * 1 | 4 |
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
Subtotal (Bytes) | 3488 (GPU) or 3296 (CPU) |
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE[NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE[NUM_OF_THREADS_PER_BLOCK];
#endif
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
partsum_evals | int | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
genotype_candidate | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
genotype_deviate | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
genotype_bias | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
rho | float | 1 | 4 * 1 | 4 |
cons_succ | int | 1 | 4 * 1 | 4 |
cons_fail | int | 1 | 4 * 1 | 4 |
iteration_cnt | int | 1 | 4 * 1 | 4 |
candidate_energy | float | 1 | 4 * 1 | 4 |
evaluation_cnt | int | 1 | 4 * 1 | 4 |
offspring_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
run_id | int | 1 | 4 * 1 | 4 |
entity_id | int | 1 | 4 * 1 | 4 |
offspring_energy | float | 1 | 4 * 1 | 4 |
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
Subtotal (Bytes) | 3972 (GPU) or 3780 (CPU) |
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE [NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE [NUM_OF_THREADS_PER_BLOCK];
#endif
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
offspring_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
parent_candidates | int | 4 | 4 * 4 | 16 |
candidate_energies | float | 4 | 4 * 4 | 16 |
parents | int | 2 | 4 * 2 | 8 |
run_id | int | 1 | 4 * 1 | 4 |
covr_point | int | 2 | 4 * 2 | 8 |
randnums | float | 10 | 4 * 10 | 40 |
energy | float | 1 | 4 * 1 | 4 |
best_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
best_IDs | int | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
best_ID | int | 1 | 4 * 1 | 4 |
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
Subtotal (Bytes) | 4092 (GPU) or 3516 (CPU) |
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE [NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE [NUM_OF_THREADS_PER_BLOCK];
#endif
GPU: 3488+256+3972+4092 = 11808
CPU: 3296+64+3780+3516 = 10656
Example: -lsit 0.
In the ligand .pdbqt file, BRANCH keywords define rotatable bonds, e.g.:
BRANCH 1 6
which means that the bond between atoms 1 and 6 is rotatable.
However, ocladock ignores the second atom index (6 in the example above) and assumes that the atom immediately following the BRANCH keyword is bonded to atom 1.
This example is from ligand "1n1m" from the Astex dataset:
ROOT
ATOM 1 C UNL 1 80.420 74.385 97.448 0.00 0.00 +0.016 C
ENDROOT
BRANCH 1 6
ATOM 2 C UNL 1 84.189 77.213 97.348 0.00 0.00 +0.020 C
ATOM 3 C UNL 1 84.632 76.660 98.776 0.00 0.00 +0.020 C
ATOM 4 C UNL 1 84.034 75.234 98.692 0.00 0.00 +0.108 C
ATOM 5 N UNL 1 82.662 75.516 98.209 0.00 0.00 -0.303 N
ATOM 6 C UNL 1 81.800 74.363 97.953 0.00 0.00 +0.201 C
ATOM 7 O UNL 1 82.322 73.115 98.201 0.00 0.00 -0.277 OA
ATOM 8 C UNL 1 82.854 76.381 97.032 0.00 0.00 +0.108 C
ENDBRANCH 1 6
BRANCH 1 9
ATOM 9 N UNL 1 80.215 73.353 96.379 0.00 0.00 +0.386 N
ATOM 10 H UNL 1 80.372 72.421 96.748 0.00 0.00 -0.089 HD
ATOM 11 H UNL 1 80.844 73.456 95.593 0.00 0.00 -0.089 HD
ATOM 12 H UNL 1 79.275 73.376 96.008 0.00 0.00 -0.089 HD
ENDBRANCH 1 9
BRANCH 1 13
ATOM 13 C UNL 1 79.296 74.230 98.432 0.00 0.00 -0.023 C
ATOM 14 C UNL 1 77.805 74.286 97.634 0.00 0.00 +0.006 C
ATOM 15 C UNL 1 79.221 75.282 99.458 0.00 0.00 +0.006 C
ENDBRANCH 1 13
TORSDOF 3
Notes:
This ligand was prepared with Open Babel. If the script prepare_ligand4.py is used instead, the order of the atoms is shuffled so that the atom immediately following the BRANCH keyword matches the second index.
I classify this issue as a bug because:
- ocladock silently deviates from what the BRANCH keyword declares, ignoring its second atom index
- reordering the atoms so that the atom immediately following the BRANCH keyword matches the second index often requires shuffling the order of the atoms, which is bad practice.

This error was found with oclgrind and points to the last asynchronous copies in kernel4 and kernel_gradient:
Work-item finished without waiting for events.
Other kernels might suffer from this too:
master -> kernel3 and kernel4.
debugfastergrad -> kernel3, kernel4, kernel_gradient, and kernel_fire.
Keep in mind the following information from the OpenCL v2.0 standard:
This function does not perform any implicit synchronization of source data such as using a barrier before performing the copy.
The kernel must wait for the completion of all async copies using the wait_group_events built-in function before exiting; otherwise the behavior is undefined.
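A sketch of the required pattern (toy kernel, not ocladock code): every async_work_group_copy is paired with a wait_group_events before the destination is used and before the work-group exits.

```c
__kernel void copy_example(__global const float *src, __global float *dst)
{
    __local float tile[64];                       // assumes work-group size 64

    event_t ev_in = async_work_group_copy(tile, src, 64, 0);
    wait_group_events(1, &ev_in);                 // required before reading tile

    dst[get_local_id(0)] = 2.0f * tile[get_local_id(0)];

    event_t ev_out = async_work_group_copy(dst + 64, tile, 64, 0);
    wait_group_events(1, &ev_out);                // required before the kernel exits
}
```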
Redefine MAX_NUM_OF_ATOMS, as this would support almost all possible ligand cases. Do so in both branches: master and grad.
This might be related to issue #12.
Detected with oclgrind.
According to the clEnqueueMapBuffer documentation:
If a memory object is currently mapped for reading, the application must ensure that the memory object is unmapped before any enqueued kernels or commands that write to this memory object or any of its associated memory objects (sub-buffer or 1D image buffer objects) or its parent object (if the memory object is a sub-buffer or 1D image buffer object) begin execution; otherwise the behavior is undefined.
The maps performed in the master and debugfastergrad branches are followed by kernel2 executions, as in master and debugfastergrad. That kernel2 writes to the mapped memory object. Therefore, the aforementioned map call should be followed by an unmap call before kernel2 (or any kernel that writes to the mapped memory object) is invoked.
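A sketch of the corrected ordering (host side; "queue", "mem_energies", "kernel2", and the sizes are assumed names):

```c
cl_int err;
float *mapped = (float *)clEnqueueMapBuffer(queue, mem_energies, CL_TRUE,
                                            CL_MAP_READ, 0, size_energies,
                                            0, NULL, NULL, &err);
// ... read the results through 'mapped' ...

err = clEnqueueUnmapMemObject(queue, mem_energies, mapped, 0, NULL, NULL);

// Only now is it safe to launch kernel2, which writes mem_energies.
err = clEnqueueNDRangeKernel(queue, kernel2, 1, NULL,
                             &global_size, &local_size, 0, NULL, NULL);
```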
Explanation:
Just checked the effect of the smooth parameter on the intra energy.
On a large ligand (2er7) in a fairly extended conformation (i.e. the
ligand is not folding onto itself), the intra energy with smooth=0.5
is -9 kcal. However, using smooth=0.0 the intra energy is -5
kcal. This explains the different results between autodock and
ocladock for ligands with many torsions.
Q:
Do you mind running some more tests on other bigger complexes too
(e.g. 2er7, 3er5, 4er4) to see if that happens as well? Let me know if
you can do it. Also, I assume you would need the atom-contributor
pairs for each complex, wouldn't you?
By the way, I have just checked (in the *.gpf and *.dpf files) that I created the grids with the default smooth parameter (= 0.5 Å), so I guess this would explain why we had no significant discrepancies on that.
A:
I don't need the pairs because I'm using the original AutoDock4.2 - I
just change the smooth parameter in the .dpf file. It only affects the
pairwise contributions, the grids remain unchanged.
Here are the intra energies for the complexes you asked about:
complex, smooth=0.5, smooth=0.0
2vaa, -7.64, -5.45
3er5, +12.66, +93.78
4er4, -9.84, -6.51
Q:
I see the difference. Did you implement this in autodockdevpy? If so, I could reuse it for ocladock... thanks!
A:
Yes, it's implemented, but it's disabled in branch "ocladockenergy". Look in branch "dev".
In files pairwise_energies.py and pairwise_derivatives.py, it's implemented in function "_calc_smooth()", which modifies the distance before evaluating 'vdw' and 'hb' energy contributions.
I don't know how important it is to implement this smooth parameter. I'll try to figure it out with Stefano and gather other opinions from other people in the lab.
Q:
I have just checked the Python code of "_calc_smooth()", and it doesn't seem complex to implement in OpenCL. The only doubt I have is the meaning of "r" and "rij" and their relationship.
Anyway, let me know what you guys think about including this function.
A1:
"r" is a variable: it's the current distance between two atoms during the docking.
"rij" is a parameter: it's the optimum distance for the pair (e.g.: "rij" for C - C is 4 angstroms).
A2:
I just talked with Stefano and David about the smooth parameter.
The smooth parameter is important for the grids, we know that for sure. However, it is unclear how important it is for the pairwise interactions. According to the user guide (*), it was only added to pairwise interactions in version 4.2.5.
However, for the sake of publication, it would be beneficial to have a direct comparison with the current AutoDock version, so we recommend its implementation. It would be even better if it could be a user-specified argument, either at run time or at compile time.
(*) http://autodock.scripps.edu/faqs-help/manual/autodock-4-2-user-guide/AutoDock4.2_UserGuide.pdf (see page 6)
Q:
Ok, then I will start implementing it.
A technical question: I assume "rij" (the optimum distance) depends on the atom types, doesn't it?
Do you know where, either in "autodockdevpy" or AD4, I can find the "rij" values?
A:
Yes, rij is the sum of vdW radii for the atom pair. It's calculated as 0.5 * rii + 0.5 * rjj, because rii and rjj are twice the vdW radii. The rij values must already be present in OCLADock, because they are needed to calculate C12 and C6 for vdW, and C12 and C10 for hydrogen bonds. It's probably a matter of storing them along with C12, C10 and C6 for use in the energy evaluation.
Q:
Regarding the smooth parameter to be specified as a user argument:
The AD4.2 documentation says the force field has been optimized for a smooth value of 0.5 Å, so I am setting this as the default smooth value.
But I was wondering if such a parameter has lower and upper bounds. Can you suggest these values? This would prevent any crazy smooth inputs...
A:
In theory there wouldn't be any limits, and the current AutoDock doesn't check the provided values. If you want to include one, I would say that 5.0 is a pretty high upper bound, while the minimum can't go lower than 0.0 (it's a distance).
Maybe single precision suffices? Otherwise, enable double-precision types and target NVIDIA Volta GPUs (e.g. AWS P3 instances).
Solved in commit 3086bd5.
The array __local int local_evals_of_new_entities[MAX_POPSIZE] used in master and debugfastergrad is excessively large: 2048 elements, per its definition in common/defines.h. This array should be removed, and the related accesses performed directly from global memory.
From @diogo.martins:
AutoDock4.2 writes the best pose of each cluster at the end of the .dlg file. These poses can be read by PyMOL. AutoDock-GPU should also write the top pose of each cluster to allow rapid visualization.
Recompiling and running from the master branch (commit dea8214):
../ocladock/bin/ocladock_gpu_64wi -ffile ./Data/forli21/7cpa/7cpa_protein.maps.fld -lfile ./Data/forli21/7cpa/ligand.pdbqt -nrun 100 -nev 100000000 -resnam x -lsrat 99 -lsit 100 -ngen 1
The energies are always the same, regardless of the local-search parameters. In other words, energy values don't improve with a larger number of local-search iterations (-lsit). What is going on?
Check that the local (work-group) size can be configured from one place only (ideally the Makefile), without manual switching in the host and device code. Add flags in the Makefile that enable/disable the #defines in wrapcl/inc/commonMacros.h.
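A sketch of how the Makefile could be the single source of truth (variable and flag names assumed):

```make
# Single knob for the work-group size.
NUMWI ?= 64

# The host sees it as a compile-time define ...
CFLAGS += -DNUM_OF_THREADS_PER_BLOCK=$(NUMWI)

# ... and the same define is handed to the OpenCL kernel compiler
# (appended to the options string passed to clBuildProgram).
KERNEL_OPTS += -DNUM_OF_THREADS_PER_BLOCK=$(NUMWI)
```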
Analysis made from commit b243639.
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
entity_id | int | 1 | 4 * 1 | 4 |
run_id | int | 1 | 4 * 1 | 4 |
energy | float | 1 | 4 * 1 | 4 |
genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Subtotal (Bytes) | 164 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
iteration_cnt | uint | 1 | 4 * 1 | 4 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
candidate_gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Subtotal (Bytes) | 304 |
// Then, we return the genotype corresponding
// to the best observed energy, i.e. "best_genotype"
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
best_energy | float | 1 | 4 * 1 | 4 |
best_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
candidate_energy | float | 1 | 4 * 1 | 4 |
candidate_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Subtotal (Bytes) | 312 |
// Also used to store the accumulated gradient per each ligand atom
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient_inter_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_inter_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_inter_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
Subtotal (Bytes) | 3072 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient_intra_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_intra_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_intra_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
Subtotal (Bytes) | 3072 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
Subtotal (Bytes) | 3328 (GPU) or 3136 (CPU) |
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE[NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE[NUM_OF_THREADS_PER_BLOCK];
#endif
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
velocity | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
alpha | float | 1 | 4 * 1 | 4 |
count_success | uint | 1 | 4 * 1 | 4 |
dt | float | 1 | 4 * 1 | 4 |
Subtotal (Bytes) | 164 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient_tmp | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
inv_gradient_norm | float | 1 | 4 * 1 | 4 |
velocity_tmp | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
velocity_norm | float | 1 | 4 * 1 | 4 |
velnorm_div_gradnorm | float | 1 | 4 * 1 | 4 |
Subtotal (Bytes) | 316 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
power_tmp | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
power | float | 1 | 4 * 1 | 4 |
Subtotal (Bytes) | 156 |
GPU: 164+4+304+312+3072+3072+3328+164+316+156 = 10892
CPU: 164+4+304+312+3072+3072+3136+164+316+156 = 10700
Caveats:
It is possible that the atom order in the reference ligand (-xraylfile) differs from the atom order in the input ligand (-lfile). If this is the case, and symmetry is turned off (-hsym 0), the RMSD will be calculated incorrectly. We may want to add this to the documentation.
Add -xraylfile information into README
The maximum allowed size for constant arguments (CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE) varies per device, e.g.:
The AMD Vega56 GPU allows 4.2 GB:
ULong attributes ...
1 CL_DEVICE_MAX_MEM_ALLOC_SIZE : 4244635648
1 CL_DEVICE_GLOBAL_MEM_CACHE_SIZE : 16384
1 CL_DEVICE_GLOBAL_MEM_SIZE : 8573157376
1 CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE : 4244635648
1 CL_DEVICE_LOCAL_MEM_SIZE : 32768
The NVIDIA M2000 GPU allows only 16.7 MB:
ULong attributes ...
1 CL_DEVICE_MAX_MEM_ALLOC_SIZE : 8589934592
1 CL_DEVICE_GLOBAL_MEM_CACHE_SIZE : 20971520
1 CL_DEVICE_GLOBAL_MEM_SIZE : 31497080832
1 CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE : 16777216
1 CL_DEVICE_LOCAL_MEM_SIZE : 16777216
As all constant kernel arguments were qualified as __constant, the program might be allocating more data than fits in the constant buffer. Perhaps this is handled automatically by the compiler, i.e., it might move the excess data to global memory. However, it is signalled as an error by oclgrind --check-api:
Oclgrind - OpenCL runtime error detected
Function: clEnqueueNDRangeKernel
Error: CL_OUT_OF_RESOURCES
total constant memory size (252528) exceeds device maximum of 65536
Error: clEnqueueNDRangeKernel() -5
Oclgrind - OpenCL runtime error detected
Function: clEnqueueNDRangeKernel
Error: CL_OUT_OF_RESOURCES
total constant memory size (297680) exceeds device maximum of 65536
Error: clEnqueueNDRangeKernel() -5
A solution is to evaluate the amount of data being passed to the kernel, and then re-qualify each argument as either __constant or __global const (see the sketch after the tables below).
The size calculation is as follows. Originally, each of these arrays is passed as a separate __constant argument from host to kernel. Here, they are listed in groups for convenient data-passing from host to device (see the kernel code in the commits below).
interintra
(subtotal size: 1280)
Constant array | Size definition | Size calculation | Size in Bytes |
---|---|---|---|
atom_charges | MAX_NUM_OF_ATOMS * sizeof(float) | 256 * 4 | 1024 |
atom_types | MAX_NUM_OF_ATOMS * sizeof(char) | 256 * 1 | 256 |
intracontrib
Constant array | Size definition | Size calculation | Size in Bytes |
---|---|---|---|
intraE_contributors | 3 * MAX_INTRAE_CONTRIBUTORS * sizeof(char) | 3 * 256 * 256 * 1 | 196608 |
intra
(subtotal size: 2032)
Constant array | Size definition | Size calculation | Size in Bytes |
---|---|---|---|
reqm | ATYPE_NUM * sizeof(float) | 22 * 4 | 88 |
reqm_hbond | ATYPE_NUM * sizeof(float) | 22 * 4 | 88 |
atom1_types_reqm | ATYPE_NUM * sizeof(unsigned int) | 22 * 4 | 88 |
atom2_types_reqm | ATYPE_NUM * sizeof(unsigned int) | 22 * 4 | 88 |
VWpars_AC | MAX_NUM_OF_ATYPES * MAX_NUM_OF_ATYPES * sizeof(float) | 14 * 14 * 4 | 784 |
VWpars_BD | MAX_NUM_OF_ATYPES * MAX_NUM_OF_ATYPES * sizeof(float) | 14 * 14 * 4 | 784 |
dspars_S | MAX_NUM_OF_ATYPES * sizeof(float) | 14 * 4 | 56 |
dspars_V | MAX_NUM_OF_ATYPES * sizeof(float) | 14 * 4 | 56 |
rotlist
Constant array | Size definition | Size calculation | Size in Bytes |
---|---|---|---|
rotlist | MAX_NUM_OF_ROTATIONS * sizeof(int) | 256 * 32 * 4 | 32768 |
conform
(subtotal size: 19840)
Constant array | Size definition | Size calculation | Size in Bytes |
---|---|---|---|
ref_coords_x | MAX_NUM_OF_ATOMS * sizeof(float) | 256 * 4 | 1024 |
ref_coords_y | MAX_NUM_OF_ATOMS * sizeof(float) | 256 * 4 | 1024 |
ref_coords_z | MAX_NUM_OF_ATOMS * sizeof(float) | 256 * 4 | 1024 |
rotbonds_moving_vectors | 3 * MAX_NUM_OF_ROTBONDS * sizeof(float) | 3 * 32 * 4 | 384 |
rotbonds_unit_vectors | 3 * MAX_NUM_OF_ROTBONDS * sizeof(float) | 3 * 32 * 4 | 384 |
ref_orientation_quats | 4 * MAX_NUM_OF_RUNS * sizeof(float) | 4 * 1000 * 4 | 16000 |
A total of 252528 Bytes is required for constant data: much smaller than the CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE of the GPUs above, but well above the 64 KB spec minimum that oclgrind enforces.
For debugfastergrad, we require the following arrays as well:
gradsrotbonds
Constant array | Size definition | Size calculation | Size in Bytes |
---|---|---|---|
rotbonds_atoms | MAX_NUM_OF_ATOMS * MAX_NUM_OF_ROTBONDS * sizeof(int) | 256 * 32 * 4 | 32768 |
grads
(subtotal size: 12384)
Constant array | Size definition | Size calculation | Size in Bytes |
---|---|---|---|
rotbonds | 2 * MAX_NUM_OF_ROTBONDS * sizeof(int) | 2 * 32 * 4 | 256 |
num_rotating_atoms_per_rotbond | MAX_NUM_OF_ROTBONDS * sizeof(int) | 32 * 4 | 128 |
angle | 1000*sizeof(float) | 1000 * 4 | 4000 |
dependence_on_theta | 1000*sizeof(float) | 1000 * 4 | 4000 |
dependence_on_rotangle | 1000*sizeof(float) | 1000 * 4 | 4000 |
A total of 297680 (252528 + 45152) Bytes is required for constant data: again much smaller than the limits of the GPUs above, but above the 64 KB spec minimum.
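A sketch of the proposed re-qualification (the argument grouping follows the tables above; the exact kernel signature is an assumption):

```c
// Small tables stay __constant; the large ones move to __global const,
// so the 64 KB spec minimum for constant memory is never exceeded.
__kernel void gpu_calc_initpop(
    __constant float     *atom_charges,          //   1024 B
    __constant char      *atom_types,            //    256 B
    __global const char  *intraE_contributors,   // 196608 B -> too big for __constant
    __global const int   *rotlist,               //  32768 B -> moved as well
    __constant float     *ref_coords_x           //   1024 B
    /* ... remaining arguments ... */
)
{
    /* kernel body unchanged */
}
```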
From @diogo.martins:
The fld file (input arg -ffile) is parsed to get the following data:
The fld file explicitly declares an association between atom types and map files; these associations are currently ignored by ocladock. It could be more intuitive if the -ffile argument were replaced by a -mapprefix argument. The grid center, size, and spacing would then be parsed from the map files.
Note: this is very low priority.
Analysis made from commit b243639.
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
entity_id | int | 1 | 4 * 1 | 4 |
run_id | int | 1 | 4 * 1 | 4 |
energy | float | 1 | 4 * 1 | 4 |
genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Subtotal (Bytes) | 164 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
iteration_cnt | uint | 1 | 4 * 1 | 4 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
// Then, we return the genotype corresponding
// to the best observed energy, i.e. "best_genotype"
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
best_energy | float | 1 | 4 * 1 | 4 |
best_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Subtotal (Bytes) | 156 |
// Also used to store the accumulated gradient per each ligand atom
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient_inter_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_inter_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_inter_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
Subtotal (Bytes) | 3072 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
gradient_intra_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_intra_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
gradient_intra_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
Subtotal (Bytes) | 3072 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024 |
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU) |
Subtotal (Bytes) | 3328 (GPU) or 3136 (CPU) |
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE[NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE[NUM_OF_THREADS_PER_BLOCK];
#endif
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
square_gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
// It is added to the genotype to create the next genotype.
// E.g. in steepest descent "delta" is -1.0 * stepsize * gradient
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
delta | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
Local memory | Element datatype | Size definition | Size calculation | Size in Bytes |
---|---|---|---|---|
square_delta | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152 |
GPU: 164+4+152+156+3072+3072+3328+152+152+152 = 10404
CPU: 164+4+152+156+3072+3072+3136+152+152+152 = 10212
Appropriate rules should be added to the Makefile. This should be reflected in the documentation as well.
This might require an additional script.
In the debugfastergrad branch, the Gradientparameters structure type should be removed.
The program currently accepts an lsrat equal to 0 and stores its value into dockpars.num_of_lsentities, which in turn defines the work-group size of the LS kernel. A work-group of size zero leads to OpenCL error code -54 (CL_INVALID_WORK_GROUP_SIZE). This affects the master and fastergrad branches.
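A sketch of a host-side guard (names follow the issue; the fallback value is hypothetical):

```c
// Keep dockpars.num_of_lsentities from becoming zero, which would
// request a zero-sized work-group and trigger error -54.
if (lsrat <= 0.0f) {
    printf("Warning: -lsrat must be greater than 0. Using a default value.\n");
    lsrat = 6.0f;  // hypothetical default local-search rate (%)
}
```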
It seems unnecessary to have <VENDOR>_INCLUDE_PATH and <VENDOR>_LIBRARY_PATH variables for INTEL, AMD, and NVIDIA: that would force us to maintain as many such variables as there are vendor devices OCLADock supports.
Adopt the following:
Device type | Include path | Library path |
---|---|---|
CPU | CPU_INCLUDE_PATH | CPU_LIBRARY_PATH |
GPU | GPU_INCLUDE_PATH | GPU_LIBRARY_PATH |
The user must define these include and library paths before compiling or executing OCLADock.
This would also make the Makefile cleaner and more general.
Gradient-based local-search kernels utilize both the energy and gradient functions, which share common subfunctions. By merging the energy and gradient functions, it would be possible to reuse those subfunctions and apply loop fusion, in order to accelerate overall execution. Such merging is possible only when both functions use the same input genotype. Currently, two of these kernels (fire & adadelta) meet that criterion.
Should this disable local search or be ignored? It is currently ignored in the master and fastergrad branches.