autodock-gpu's Issues

Add option equivalent to AutoDock epdb

It is an option used in *.dpf files: it requests a single energy evaluation of the input ligand.

Implementing this has the following requirements:

  • The ligand and the grids are needed
  • The ligand is scored in its input orientation

See pages 44 and 59 in the AutoDock User Guide.

The name of the option in AutoDock is "epdb", but a name like "score_only" or "single_point" would better follow the principle of least astonishment.
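
Whatever the name, a minimal host-side sketch of such a score-only mode could look like this (all names here - args.score_only, evaluate_energy(), grids, ligand, input_pose - are hypothetical, not taken from the code):

if (args.score_only) {
    /* Score the ligand once, in its input orientation, using the grids. */
    float energy = evaluate_energy(&grids, &ligand, &input_pose);
    printf("Single-point energy (epdb equivalent): %.3f kcal/mol\n", energy);
    return 0;  /* skip the genetic algorithm entirely */
}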

Redefine size of "gradient_per_contributor" __local array

It is defined here as __local float gradient_per_intracontributor[MAX_INTRAE_CONTRIBUTORS];, but this can be excessively large.

Check here for its definition:
#define MAX_INTRAE_CONTRIBUTORS (MAX_NUM_OF_ATOMS * MAX_NUM_OF_ATOMS), where MAX_NUM_OF_ATOMS = 100

So this array consumes 4 bytes * 100 * 100 = 40000 bytes, while the maximum local memory size of an RX-Vega56 GPU is 32768 bytes (according to clinfo output). This is risky, especially for smaller GPU devices.

This might be related to issue #14.
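
One possible mitigation - a sketch under my own assumptions, not the project's actual fix - is to process the contributor pairs in fixed-size chunks, so the __local scratch array stays well below the 32768-byte limit. calc_pair_gradient() and num_of_intraE_contributors are hypothetical names:

#define GRAD_CHUNK 2048  /* 2048 * 4 bytes = 8192 bytes of local memory */
__local float gradient_per_intracontributor[GRAD_CHUNK];

for (uint base = 0; base < num_of_intraE_contributors; base += GRAD_CHUNK) {
    uint chunk_len = min((uint)GRAD_CHUNK, num_of_intraE_contributors - base);

    /* Each work-item fills its share of the current chunk. */
    for (uint i = get_local_id(0); i < chunk_len; i += get_local_size(0))
        gradient_per_intracontributor[i] = calc_pair_gradient(base + i);

    barrier(CLK_LOCAL_MEM_FENCE);
    /* ... accumulate this chunk into the per-atom gradient arrays ... */
    barrier(CLK_LOCAL_MEM_FENCE);
}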

Correct / improve hardcoding using #defines

Several code sections contain unclear, ambiguous, or non-portable hardcoded values.

Replacing, for instance, locally-defined numeric variable sizes, loop upper bounds, etc., with #defines (in defines.h) would improve this. This issue covers all such cases that can be found in the host code.

We start with the case identified in issue #37: get_ligadata() can be improved as suggested here.
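
For illustration, a hypothetical before/after for one such case (the name MAX_LINE_LEN and the value are assumptions, not taken from the code):

/* Before: the magic number 256 is repeated at every use site. */
char line[256];
fgets(line, 256, fp);

/* After: one authoritative definition in defines.h. */
#define MAX_LINE_LEN 256

char line[MAX_LINE_LEN];
fgets(line, MAX_LINE_LEN, fp);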

Orientation bias initial population

From @diogo.martins:

Using the x-ray ligand as input (as opposed to using a randomized conformation/orientation) results in better energies and a greater probability of finding the best solution. Docking performance is therefore unpredictable: favorable inputs yield better results, while worse inputs yield worse results. Docking performance should be independent of the conformation and orientation of the input ligand. Indeed, an AutoDock user complained about this problem in a former version of AutoDock, and it was subsequently fixed.

It is very likely (but not certain) that fixing the orientation bias in the initial population will solve this problem.

According to https://arxiv.org/pdf/physics/0506177.pdf:

If the orientation is given in axis-angle space, s, then the axis, s/|s|, should be chosen uniformly on S², and the rotation angle, |s|, should be sampled from [0, π] with probability (2/π) sin²(|s|/2).

The first part - choosing uniformly on S² - is equivalent to choosing a random point on a sphere surface, see http://mathworld.wolfram.com/SpherePointPicking.html (note that theta and phi have inverted meanings with respect to ocladock notation).

The second part - sampling the angle with probability (2/π) sin²(|angle|/2) - is trickier. A look-up table might be the easiest way to implement it. As far as I know, there is no analytical function that provides such a probability distribution.
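
A C sketch of both parts follows. Since p(θ) = (2/π) sin²(θ/2) = (1 − cos θ)/π, its CDF is F(θ) = (θ − sin θ)/π, which indeed has no closed-form inverse; the table below inverts it numerically. The table size and the RNG are illustrative choices only:

#include <math.h>
#include <stdlib.h>

#define ANGLE_LUT_SIZE 1024
static float angle_lut[ANGLE_LUT_SIZE];

/* Tabulate the inverse CDF of p(theta) = (2/PI) * sin^2(theta/2) on [0, PI].
   CDF: F(theta) = (theta - sin(theta)) / PI. */
static void build_angle_lut(void)
{
    int i = 0;
    for (int k = 0; k <= 100000 && i < ANGLE_LUT_SIZE; k++) {
        double theta = M_PI * k / 100000.0;
        double cdf = (theta - sin(theta)) / M_PI;
        while (i < ANGLE_LUT_SIZE && cdf >= (double)i / (ANGLE_LUT_SIZE - 1))
            angle_lut[i++] = (float)theta;
    }
    while (i < ANGLE_LUT_SIZE)  /* guard against rounding at F = 1 */
        angle_lut[i++] = (float)M_PI;
}

/* Draw a uniformly distributed random rotation in axis-angle form. */
static void random_axis_angle(float axis[3], float *angle)
{
    /* Part 1: uniform point on S^2 (sphere point picking):
       u = cos(polar angle) uniform in [-1, 1], phi uniform in [0, 2*PI). */
    float u   = 2.0f * rand() / (float)RAND_MAX - 1.0f;
    float phi = 2.0f * (float)M_PI * rand() / (float)RAND_MAX;
    float r   = sqrtf(1.0f - u * u);
    axis[0] = r * cosf(phi);
    axis[1] = r * sinf(phi);
    axis[2] = u;

    /* Part 2: rotation angle via the pre-built inverse-CDF table. */
    *angle = angle_lut[rand() % ANGLE_LUT_SIZE];
}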

Define an appropriate size for MAX_INTRAE_CONTRIBUTORS

The size is not consistently defined in debugfastergrad, due to the preliminary calcgradient implementation.

In debugfastergrad:

#define MAX_INTRAE_CONTRIBUTORS 8192

In master, which is more consistent and scalable:

#define MAX_INTRAE_CONTRIBUTORS MAX_NUM_OF_ATOMS * MAX_NUM_OF_ATOMS
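
A side note from reviewing this (my observation, not part of the original issue): the macro body should be parenthesized so it expands safely inside larger expressions:

/* Without parentheses, x / MAX_INTRAE_CONTRIBUTORS expands to
   x / MAX_NUM_OF_ATOMS * MAX_NUM_OF_ATOMS, which is not x / (100 * 100). */
#define MAX_INTRAE_CONTRIBUTORS (MAX_NUM_OF_ATOMS * MAX_NUM_OF_ATOMS)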

Wrong read/write configuration of memory object

This error was pointed out by oclgrind in kernel4:

Invalid write to read-only buffer
	Kernel: gpu_gen_and_eval_newpops
	Entity: Group(47,0,0)
	  call spir_func void @_Z17wait_group_eventsiP9ocl_event(i32 1, %opencl.event_t** nonnull %ev171) #9, !dbg !449
	At line 2097 (column 3) of input.cl:
	  wait_group_events(1,&ev);

So mem_dockpars_conformations_current must be configured as CL_MEM_READ_WRITE,
and NOT as read-only as it is in master and in debugfastergrad.

Although the kernel4 source code itself never updates this memory object, the object is effectively written in every other genetic iteration, because the populations are updated by swapping pointers, as in /host/src/performdocking.cpp.
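
A host-side sketch of the corrected buffer creation (size_populations is a placeholder; error handling trimmed):

cl_int err;
cl_mem mem_dockpars_conformations_current =
    clCreateBuffer(context,
                   CL_MEM_READ_WRITE,  /* previously a read-only flag */
                   size_populations,
                   NULL, &err);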

Add support for 128 wi

Initially, only configurations of 16, 32, and 64 work-items passed the tests.

The 128-work-item configuration failed on the host side.

Verify local memory usage in SD kernel

Analysis made from commit b243639.

Determining entity, and its run, energy, and genotype

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
entity_id | int | 1 | 4 * 1 | 4
run_id | int | 1 | 4 * 1 | 4
energy | float | 1 | 4 * 1 | 4
genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
Subtotal (Bytes): 164

Iteration counter for the minimizer

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
iteration_cnt | uint | 1 | 4 * 1 | 4

Stepsize for the minimizer

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
stepsize | float | 1 | 4 * 1 | 4

Partial results of the gradient step

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
candidate_energy | float | 1 | 4 * 1 | 4
candidate_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
Subtotal (Bytes): 308

Calculate gradients (forces) for intermolecular energy

// Derived from autodockdev/maps.py

// Gradient of the intermolecular energy per each ligand atom

// Also used to store the accumulated gradient per each ligand atom

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
gradient_inter_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
gradient_inter_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
gradient_inter_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
Subtotal (Bytes): 3072

Gradient of the intramolecular energy per each ligand atom

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
gradient_intra_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
gradient_intra_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
gradient_intra_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
Subtotal (Bytes): 3072

Ligand-atom position and partial energies

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU)
Subtotal (Bytes): 3328 (GPU) or 3136 (CPU)
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE[NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE[NUM_OF_THREADS_PER_BLOCK];
#endif

Calculating maximum possible stepsize (alpha)

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
max_trans_grad | float | 1 | 4 * 1 | 4
max_rota_grad | float | 1 | 4 * 1 | 4
max_tors_grad | float | 1 | 4 * 1 | 4
max_trans_stepsize | float | 1 | 4 * 1 | 4
max_rota_stepsize | float | 1 | 4 * 1 | 4
max_tors_stepsize | float | 1 | 4 * 1 | 4
max_stepsize | float | 1 | 4 * 1 | 4
Subtotal (Bytes): 28

Storing torsion gradients here

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
torsions_gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152

Total

  • GPU: 164+4+4+308+3072+3072+3328+28+152 = 10132

  • CPU: 164+4+4+308+3072+3072+3136+28+152 = 9940

Wrong number of elements is copied with async_work_group_copy in kernel1

This instruction in the "debugfastergrad" branch copies GENOTYPE_LENGTH_IN_GLOBMEM elements from global memory into the local array genotype[ACTUAL_GENOTYPE_LENGTH].

However, that array is smaller than the number of elements copied into it. The instruction should therefore copy ACTUAL_GENOTYPE_LENGTH elements instead of GENOTYPE_LENGTH_IN_GLOBMEM.
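
A sketch of the corrected copy (OpenCL C; the source pointer arithmetic is an assumption based on the buffer layout described here):

__local float genotype[ACTUAL_GENOTYPE_LENGTH];

event_t ev = async_work_group_copy(
        genotype,
        dockpars_conformations_current + GENOTYPE_LENGTH_IN_GLOBMEM * entity_id,
        ACTUAL_GENOTYPE_LENGTH,   /* was GENOTYPE_LENGTH_IN_GLOBMEM */
        0);
wait_group_events(1, &ev);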

Correct RMSD

Main idea

The x-ray ligand (instead of the randomized one) should be used for the RMSD calculation.

Explanation

The ligand_xray is the biological answer, so the RMSD between a pose and ligand_xray tells us whether we found the biological answer. We will use ligand_xray when we want to know if we found the biological solution. So far, we have been looking only at energies.

Since -lfile is used as the reference, we may want to add a new option, -reflig, to provide the reference ligand for the RMSD calculation, while still being able to use a randomized ligand as the input file.

Update

The proposed -reflig would be equivalent to AutoDock's rmsref.

About clustering

In AutoDock (and certainly in OCLADock as well) the input ligand is not used for clustering. The clustering algorithm is as follows:

First, the best solution (out of all GA runs) is used as a reference. Then, in order of increasing energy, the solutions of the other GA runs are assigned to the first cluster if their RMSD is below -rmstol. This process is repeated, as many times as necessary, for all GA solutions that did not fall below -rmstol (see the sketch below). In conclusion, the randomized ligand is not the cause of bad clustering.

For large ligands, it is common to have as many clusters as GA runs, which means that no pair of poses is within -rmstol of each other. This happened a lot in "forli21" using AutoDock4.2, even with 25M evals (see the .dlg files in the shared repo).
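
For reference, a C sketch of the greedy clustering just described (Pose, MAX_CLUSTERS, and rmsd() are illustrative stand-ins, not OCLADock's actual types):

#define MAX_CLUSTERS 1000   /* at most one cluster per GA run */

typedef struct { float energy; float coords[3 * 100]; int cluster; } Pose;

float rmsd(const float *a, const float *b);  /* RMSD between two poses */

/* poses[] must be sorted by increasing energy (best solution first). */
void cluster_poses(Pose *poses, int num_poses, float rmstol)
{
    int seeds[MAX_CLUSTERS];  /* index of the best pose of each cluster */
    int num_clusters = 0;

    for (int i = 0; i < num_poses; i++) {
        poses[i].cluster = -1;
        for (int c = 0; c < num_clusters; c++) {
            if (rmsd(poses[i].coords, poses[seeds[c]].coords) < rmstol) {
                poses[i].cluster = c;   /* joins the first close cluster */
                break;
            }
        }
        if (poses[i].cluster < 0) {     /* no cluster within rmstol: */
            seeds[num_clusters] = i;    /* this pose seeds a new one */
            poses[i].cluster = num_clusters++;
        }
    }
}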

Ambiguous usage of native_exp()

Detected with oclgrind.

Large constant coefficients without an f suffix are double literals, so they might be treated as double by the compiler. Data derived from these coefficients can then be promoted to double as well. If so, such data is not a valid argument for the native_*() functions, which are defined for float only (as used here).

Therefore, such constant coefficients have to be expressed as float literals (e.g. with an f suffix, or with fewer decimal digits) so they are treated as float by any compiler.
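
A minimal illustration (the coefficient value and the surrounding expression are only examples):

float x = q1 * q2 * inv_distance;   /* placeholder expression */

/* Ambiguous: an unsuffixed literal is a double, so the argument may be
   promoted to double, which native_exp() is not defined for. */
float e_bad  = native_exp(332.06363  * x);

/* Unambiguous: the f suffix keeps the whole expression in float. */
float e_good = native_exp(332.06363f * x);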

Verify local memory usage on K1, K2, K3, K4 kernel

Analysis made from commit b243639.

K1

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
energy | float | 1 | 4 * 1 | 4
run_id | int | 1 | 4 * 1 | 4
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU)
Subtotal (Bytes): 3488 (GPU) or 3296 (CPU)
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE[NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE[NUM_OF_THREADS_PER_BLOCK];
#endif

K2

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
partsum_evals | int | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU)

K3 (Solis-Wets)

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
genotype_candidate | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
genotype_deviate | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
genotype_bias | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
rho | float | 1 | 4 * 1 | 4
cons_succ | int | 1 | 4 * 1 | 4
cons_fail | int | 1 | 4 * 1 | 4
iteration_cnt | int | 1 | 4 * 1 | 4
candidate_energy | float | 1 | 4 * 1 | 4
evaluation_cnt | int | 1 | 4 * 1 | 4
offspring_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
run_id | int | 1 | 4 * 1 | 4
entity_id | int | 1 | 4 * 1 | 4
offspring_energy | float | 1 | 4 * 1 | 4
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU)
Subtotal (Bytes): 3972 (GPU) or 3780 (CPU)
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE [NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE [NUM_OF_THREADS_PER_BLOCK];
#endif

K4 (GA)

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
offspring_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
parent_candidates | int | 4 | 4 * 4 | 16
candidate_energies | float | 4 | 4 * 4 | 16
parents | int | 2 | 4 * 2 | 8
run_id | int | 1 | 4 * 1 | 4
covr_point | int | 2 | 4 * 2 | 8
randnums | float | 10 | 4 * 10 | 40
energy | float | 1 | 4 * 1 | 4
best_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU)
best_IDs | int | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU)
best_ID | int | 1 | 4 * 1 | 4
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU)
Subtotal (Bytes): 4092 (GPU) or 3516 (CPU)
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE [NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE [NUM_OF_THREADS_PER_BLOCK];
#endif

Total

  • GPU: 3488+256+3972+4092 = 11808

  • CPU: 3296+64+3780+3516 = 10656

Ignored 2nd atom index in BRANCH keyword

In the ligand .pdbqt file, BRANCH keywords define rotatable bonds, e.g.:

BRANCH 1 6

which means that the bond between atoms 1 and 6 is rotatable.

However, ocladock ignores the second atom index (6 in the example above) and assumes that the atom immediately following the BRANCH keyword is bonded to atom 1.
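
A hypothetical parse sketch of the fix (add_rotatable_bond() is an invented helper, not an existing function):

int from_atom, to_atom;
if (sscanf(line, "BRANCH %d %d", &from_atom, &to_atom) == 2) {
    /* Record the bond between the two listed atom serials, instead of
       assuming the atom of the next ATOM record is bonded to from_atom. */
    add_rotatable_bond(ligand, from_atom, to_atom);
}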

This example is from ligand "1n1m" from the Astex dataset:

ROOT
ATOM      1  C   UNL     1      80.420  74.385  97.448  0.00  0.00    +0.016 C 
ENDROOT
BRANCH   1   6
ATOM      2  C   UNL     1      84.189  77.213  97.348  0.00  0.00    +0.020 C 
ATOM      3  C   UNL     1      84.632  76.660  98.776  0.00  0.00    +0.020 C 
ATOM      4  C   UNL     1      84.034  75.234  98.692  0.00  0.00    +0.108 C 
ATOM      5  N   UNL     1      82.662  75.516  98.209  0.00  0.00    -0.303 N 
ATOM      6  C   UNL     1      81.800  74.363  97.953  0.00  0.00    +0.201 C 
ATOM      7  O   UNL     1      82.322  73.115  98.201  0.00  0.00    -0.277 OA
ATOM      8  C   UNL     1      82.854  76.381  97.032  0.00  0.00    +0.108 C 
ENDBRANCH   1   6
BRANCH   1   9
ATOM      9  N   UNL     1      80.215  73.353  96.379  0.00  0.00    +0.386 N 
ATOM     10  H   UNL     1      80.372  72.421  96.748  0.00  0.00    -0.089 HD
ATOM     11  H   UNL     1      80.844  73.456  95.593  0.00  0.00    -0.089 HD
ATOM     12  H   UNL     1      79.275  73.376  96.008  0.00  0.00    -0.089 HD
ENDBRANCH   1   9
BRANCH   1  13
ATOM     13  C   UNL     1      79.296  74.230  98.432  0.00  0.00    -0.023 C 
ATOM     14  C   UNL     1      77.805  74.286  97.634  0.00  0.00    +0.006 C 
ATOM     15  C   UNL     1      79.221  75.282  99.458  0.00  0.00    +0.006 C 
ENDBRANCH   1  13
TORSDOF 3

Notes:
This ligand was prepared with openbabel. If the script prepare_ligand4.py is used instead, the order of the atoms is shuffled so that the atom immediately following the BRANCH keyword matches the second index.

I classify this issue as a bug because:

  • autodock4.2 correctly parses the second index of the BRANCH keyword
  • vina does as well
  • enforcing the atom immediately following the BRANCH keyword to match the second index often requires shuffling the order of the atoms, which is bad practice.

Missing synchronization after asynchronous copies?

This error was found with oclgrind; it points to the last asynchronous copies in kernel4 and kernel_gradient:

Work-item finished without waiting for events.

Other kernels might suffer from this too:

  • See master -> kernel3 and kernel4.
  • See debugfastergrad -> kernel3, kernel4, kernel_gradient, and kernel_fire.

Keep in mind the following information from the OpenCL 2.0 standard:

This function does not perform any implicit synchronization of source data such as using a barrier before performing the copy.

The kernel must wait for the completion of all async copies using the wait_group_events built-in function before exiting; otherwise the behavior is undefined.
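
The required pattern at the end of a kernel looks like this (OpenCL C; the buffer names are placeholders):

event_t ev = async_work_group_copy(
        dockpars_energies_next + pop_base_offset,  /* __global destination */
        loc_energies,                              /* __local source */
        NUM_OF_THREADS_PER_BLOCK,
        0);

/* Every work-item must wait for the copy before the kernel exits;
   otherwise the behavior is undefined. */
wait_group_events(1, &ev);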

Memory object mapped for reading should be unmapped before a kernel writes to it

Detected with oclgrind.

According to the clEnqueueMapBuffer documentation:

If a memory object is currently mapped for reading, the application must ensure that the memory object is unmapped before any enqueued kernels or commands that write to this memory object or any of its associated memory objects (sub-buffer or 1D image buffer objects) or its parent object (if the memory object is a sub-buffer or 1D image buffer object) begin execution; otherwise the behavior is undefined.

The maps performed in the master and debugfastergrad branches are followed by kernel2 executions, as in master and debugfastergrad. That kernel2 writes to the mapped memory object.

Therefore, the aforementioned map call should be followed by an unmap call before kernel2 (or any other kernel that writes to the mapped memory object) is invoked.
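
A host-side sketch of the required sequence (buffer and size names are placeholders; error checks trimmed):

cl_int err;
int *evals = (int *)clEnqueueMapBuffer(command_queue, mem_evals_of_new_entities,
                                       CL_TRUE, CL_MAP_READ, 0, size_evals,
                                       0, NULL, NULL, &err);
/* ... read the mapped data on the host ... */

err = clEnqueueUnmapMemObject(command_queue, mem_evals_of_new_entities,
                              evals, 0, NULL, NULL);

/* Only now is it safe to enqueue kernel2, which writes this buffer. */
err = clEnqueueNDRangeKernel(command_queue, kernel2, 1, NULL,
                             &global_size, &local_size, 0, NULL, NULL);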

Add smoothing to internal-energy potentials

Explanation:

Just checked the effect of the smooth parameter on the intra energy. On a large ligand (2er7) in a fairly extended conformation (i.e. the ligand is not folding onto itself), the intra energy with smooth=0.5 is -9 kcal. However, using smooth=0.0, the intra energy is -5 kcal. This explains the different results between autodock and ocladock for ligands with many torsions.


Q:

Do you mind running some more tests on other bigger complexes too (e.g. 2er7, 3er5, 4er4) to see if that happens as well? Let me know if you can do it. Also, I assume you would need the atom-contributor pairs for each complex, wouldn't you?

By the way, I have just checked (in the *.gpf files) that I created the grids with the default smooth parameter (= 0.5 Å), so I guess this would explain why we had no significant discrepancies there.

A:

I don't need the pairs because I'm using the original AutoDock4.2 - I just change the smooth parameter in the .dpf file. It only affects the pairwise contributions; the grids remain unchanged.

Here are the intra energies for the complexes you asked about:

complex | smooth=0.5 | smooth=0.0
2vaa | -7.64 | -5.45
3er5 | +12.66 | +93.78
4er4 | -9.84 | -6.51


Q:

I see the difference. Did you implement this in autodockdevpy? If so, I could reuse it for ocladock ... thanks!
could reuse for ocladock ... thanks!

A:

Yes, it's implemented, but it's disabled in branch "ocladockenergy". Look in branch "dev".

In the files pairwise_energies.py and pairwise_derivatives.py, it is implemented in the function "_calc_smooth()", which modifies the distance before evaluating the 'vdw' and 'hb' energy contributions.

I don't know how important it is to implement this smooth parameter. I'll try to figure it out with Stefano and gather other opinions from other people in the lab.


Q:

I have just checked the python code of "_calc_smooth()", and it doesn't seem complex to implement in OpenCL. The only doubt I have is the meaning of "r" and "rij" and their relationship.

Anyway, let me know what you guys think about including this function.

A1:

"r" is a variable: it's the current distance between two atoms during the docking.

"rij" is a parameter: it's the optimum distance for the pair (e.g.: "rij" for C - C is 4 angstroms).

A2:

I just talked with Stefano and David about the smooth parameter.

The smooth parameter is important for the grids; we know that for sure. However, it is unclear how important it is for the pairwise interactions. According to the user guide (*), it was only added to the pairwise interactions in version 4.2.5.

However, for the sake of publication, it would be beneficial to have a direct comparison with the current AutoDock version, so we recommend its implementation. It would be even better if it could be a user-specified argument, either at run time or at compile time.

(*) http://autodock.scripps.edu/faqs-help/manual/autodock-4-2-user-guide/AutoDock4.2_UserGuide.pdf (see page 6)


Q:

Ok, then I will start implementing it.

A technical question: I assume "rij" (the optimum distance) depends on the atom types, doesn't it?
Do you know where, either in "autodockdevpy" or AD4, I can find the "rij" values?

A:

Yes, rij is the sum of the vdW radii for the atom pair. It's calculated as 0.5 * rii + 0.5 * rjj, because rii and rjj are twice the vdW radii. The rij values must already be present in OCLADock, because they are needed to calculate C12 and C6 for vdW, and C12 and C10 for hydrogen bonds. It's probably a matter of storing them alongside C12, C10, and C6 for use in the energy evaluation.


Q:

Regarding the smooth parameter to be specified as a user-specified argument:

In the AD4.2 documentation, it says the force field has been optimized for a smooth value of 0.5 Å, so I am setting this as the default smooth value.

But I was wondering if such a parameter has lower and upper bounds. Can you suggest these values? This would prevent any crazy smooth inputs ...

A:

In theory there wouldn't be any limits, and the current AutoDock doesn't check the provided values. If you want to include a check, I would say that 5.0 is a pretty high upper bound, while the minimum can't go lower than 0.0 (it's a distance).

Excessively large array in kernel2

The array __local int local_evals_of_new_entities[MAX_POPSIZE] used in master and debugfastergrad is excessively large: MAX_POPSIZE is 2048 elements (see common/defines.h), i.e. 8192 bytes of local memory.

The array should be removed, and the related accesses should be performed directly on global memory.
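
A sketch of the suggested change (OpenCL C; the names follow the issue text and the K2 local-memory analysis elsewhere on this page, but the exact indexing is an assumption):

/* Before: stage per-entity counters in an oversized __local array.
   __local int local_evals_of_new_entities[MAX_POPSIZE]; */

/* After: each work-item accumulates directly from global memory. */
int partial = 0;
for (uint i = get_local_id(0); i < dockpars_pop_size; i += get_local_size(0))
    partial += dockpars_evals_of_new_entities[run_id * dockpars_pop_size + i];

partsum_evals[get_local_id(0)] = partial;
barrier(CLK_LOCAL_MEM_FENCE);
/* ... then reduce partsum_evals as before ... */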

Output .dlg does not open with pymol

From @diogo.martins:

AutoDock4.2 writes the best pose of each cluster at the end of the .dlg file. These poses can be read by PyMOL. AutoDock-GPU should also write the top pose of each cluster to allow rapid visualization.

Wrong behavior when "-ngen 1"

Recompiling and running from the master branch (commit dea8214):

../ocladock/bin/ocladock_gpu_64wi
    -ffile ./Data/forli21/7cpa/7cpa_protein.maps.fld
    -lfile ./Data/forli21/7cpa/ligand.pdbqt
    -nrun 100
    -nev 100000000
    -resnam x
    -lsrat 99
    -lsit 100
    -ngen 1

The energies are always the same, regardless of the local-search parameters.

In other words, energy values don't improve with a larger number of local-search iterations (-lsit).

What is going on?

Verify local memory usage in FIRE kernel

Analysis made from commit b243639.

Determining entity, and its run, energy, and genotype

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
entity_id | int | 1 | 4 * 1 | 4
run_id | int | 1 | 4 * 1 | 4
energy | float | 1 | 4 * 1 | 4
genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
Subtotal (Bytes): 164

Iteration counter for the minimizer

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
iteration_cnt | uint | 1 | 4 * 1 | 4

Partial results of the gradient step

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
candidate_gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
Subtotal (Bytes): 304

Energy may go up, so we keep track of the best energy ever calculated.

// Then, we return the genotype corresponding
// to the best observed energy, i.e. "best_genotype"

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
best_energy | float | 1 | 4 * 1 | 4
best_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
candidate_energy | float | 1 | 4 * 1 | 4
candidate_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
Subtotal (Bytes): 312

Gradient of the intermolecular energy per each ligand atom

// Also used to store the accumulated gradient per each ligand atom

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
gradient_inter_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
gradient_inter_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
gradient_inter_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
Subtotal (Bytes): 3072

Gradient of the intramolecular energy per each ligand atom

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
gradient_intra_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
gradient_intra_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
gradient_intra_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
Subtotal (Bytes): 3072

Ligand-atom position and partial energies

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU)
Subtotal (Bytes): 3328 (GPU) or 3136 (CPU)
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE[NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE[NUM_OF_THREADS_PER_BLOCK];
#endif

FIRE counters

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
velocity | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
alpha | float | 1 | 4 * 1 | 4
count_success | uint | 1 | 4 * 1 | 4
dt | float | 1 | 4 * 1 | 4
Subtotal (Bytes): 164

Calculating the gradient/velocity norm

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
gradient_tmp | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
inv_gradient_norm | float | 1 | 4 * 1 | 4
velocity_tmp | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
velocity_norm | float | 1 | 4 * 1 | 4
velnorm_div_gradnorm | float | 1 | 4 * 1 | 4
Subtotal (Bytes): 316

Defining FIRE power

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
power_tmp | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
power | float | 1 | 4 * 1 | 4
Subtotal (Bytes): 156
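
For orientation, here is a hedged C sketch of the FIRE step these buffers implement (velocity, alpha, dt, count_success, power, and the velocity_norm/gradient_norm ratio match the names above). The constants are the standard ones from Bitzek et al. 2006; whether the kernel uses exactly these values is an assumption, and F = -gradient:

#define F_INC       1.1f
#define F_DEC       0.5f
#define F_ALPHA     0.99f
#define ALPHA_START 0.1f
#define N_MIN       5
#define DT_MAX      10.0f   /* placeholder upper bound on the timestep */

float power = 0.0f, gradient_norm = 0.0f, velocity_norm = 0.0f;
for (int i = 0; i < genotype_length; i++) {
    power         += -gradient[i] * velocity[i];   /* P = F . v */
    gradient_norm += gradient[i] * gradient[i];
    velocity_norm += velocity[i] * velocity[i];
}
gradient_norm = sqrtf(gradient_norm);
velocity_norm = sqrtf(velocity_norm);

/* Mix the velocity toward the steepest-descent direction. */
for (int i = 0; i < genotype_length; i++)
    velocity[i] = (1.0f - alpha) * velocity[i]
                - alpha * (velocity_norm / (gradient_norm + 1e-12f)) * gradient[i];

if (power > 0.0f) {                      /* moving downhill */
    if (++count_success > N_MIN) {
        dt    = fminf(dt * F_INC, DT_MAX);
        alpha = alpha * F_ALPHA;
    }
} else {                                 /* uphill: freeze and restart */
    count_success = 0;
    dt    = dt * F_DEC;
    alpha = ALPHA_START;
    for (int i = 0; i < genotype_length; i++)
        velocity[i] = 0.0f;
}

/* Euler integration of the genotype. */
for (int i = 0; i < genotype_length; i++) {
    velocity[i] += dt * -gradient[i];    /* v += dt * F */
    genotype[i] += dt * velocity[i];     /* x += dt * v */
}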

Total

  • GPU: 164+4+304+312+3072+3072+3328+164+316+156 = 10892
  • CPU: 164+4+304+312+3072+3072+3136+164+316+156 = 10700

Add & improve documentation

Caveats:

  1. #20 (comment)

It is possible that the atom order in the reference ligand -xraylfile differs from the atom order in the input ligand -lfile. If this is the case, and symmetry is turned off (-hsym 0), the RMSD will be calculated incorrectly. We may want to add this to the documentation.

  2. #20 (comment)

Add -xraylfile information to the README.

Memory space of constant kernel arguments should be re-qualified

The maximum allowed size for constant arguments varies for each device, e.g.:

AMD Vega56 GPU has 4.2 GB:

  ULong attributes ...
   1 CL_DEVICE_MAX_MEM_ALLOC_SIZE                 : 4244635648 
   1 CL_DEVICE_GLOBAL_MEM_CACHE_SIZE              : 16384 
   1 CL_DEVICE_GLOBAL_MEM_SIZE                    : 8573157376 
   1 CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE           : 4244635648 
   1 CL_DEVICE_LOCAL_MEM_SIZE                     : 32768 

NVidia M2000 GPU has 16.7 MB:

 ULong attributes ...
  1 CL_DEVICE_MAX_MEM_ALLOC_SIZE                 : 8589934592  
  1 CL_DEVICE_GLOBAL_MEM_CACHE_SIZE              : 20971520  
  1 CL_DEVICE_GLOBAL_MEM_SIZE                    : 31497080832  
  1 CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE           : 16777216  
  1 CL_DEVICE_LOCAL_MEM_SIZE                     : 16777216  

As all constant kernel arguments were qualified as __constant, the program might be allocating more data than fits into the constant buffer. Perhaps this is handled automatically by the compiler, i.e., it might move the excess data to global memory.

However, this is signalled as an error by oclgrind --check-api:

Oclgrind - OpenCL runtime error detected
	Function: clEnqueueNDRangeKernel
	Error:    CL_OUT_OF_RESOURCES
	total constant memory size (252528) exceeds device maximum of 65536

Error: clEnqueueNDRangeKernel() -5

Oclgrind - OpenCL runtime error detected
	Function: clEnqueueNDRangeKernel
	Error:    CL_OUT_OF_RESOURCES
	total constant memory size (297680) exceeds device maximum of 65536

Error: clEnqueueNDRangeKernel() -5

A solution is to evaluate the amount of data being passed to each kernel, and then re-qualify the arguments as either __constant or __global const.

The size calculation is as follows. Originally, each of these arrays is passed as a separate __constant argument from host to kernel. Here, they are listed in groups for convenient data transfer from host to device (see the kernel code in the commits below).

interintra (subtotal size: 1280)

Constant array | Size definition | Size calculation | Size in Bytes
atom_charges | MAX_NUM_OF_ATOMS * sizeof(float) | 256 * 4 | 1024
atom_types | MAX_NUM_OF_ATOMS * sizeof(char) | 256 * 1 | 256

intracontrib

Constant array | Size definition | Size calculation | Size in Bytes
intraE_contributors | 3 * MAX_INTRAE_CONTRIBUTORS * sizeof(char) | 3 * 256 * 256 * 1 | 196608

intra (subtotal size: 2032)

Constant array | Size definition | Size calculation | Size in Bytes
reqm | ATYPE_NUM * sizeof(float) | 22 * 4 | 88
reqm_hbond | ATYPE_NUM * sizeof(float) | 22 * 4 | 88
atom1_types_reqm | ATYPE_NUM * sizeof(unsigned int) | 22 * 4 | 88
atom2_types_reqm | ATYPE_NUM * sizeof(unsigned int) | 22 * 4 | 88
VWpars_AC | MAX_NUM_OF_ATYPES * MAX_NUM_OF_ATYPES * sizeof(float) | 14 * 14 * 4 | 784
VWpars_BD | MAX_NUM_OF_ATYPES * MAX_NUM_OF_ATYPES * sizeof(float) | 14 * 14 * 4 | 784
dspars_S | MAX_NUM_OF_ATYPES * sizeof(float) | 14 * 4 | 56
dspars_V | MAX_NUM_OF_ATYPES * sizeof(float) | 14 * 4 | 56

rotlist

Constant array | Size definition | Size calculation | Size in Bytes
rotlist | MAX_NUM_OF_ROTATIONS * sizeof(int) | 256 * 32 * 4 | 32768

conform (subtotal size: 19840)

Constant array | Size definition | Size calculation | Size in Bytes
ref_coords_x | MAX_NUM_OF_ATOMS * sizeof(float) | 256 * 4 | 1024
ref_coords_y | MAX_NUM_OF_ATOMS * sizeof(float) | 256 * 4 | 1024
ref_coords_z | MAX_NUM_OF_ATOMS * sizeof(float) | 256 * 4 | 1024
rotbonds_moving_vectors | 3 * MAX_NUM_OF_ROTBONDS * sizeof(float) | 3 * 32 * 4 | 384
rotbonds_unit_vectors | 3 * MAX_NUM_OF_ROTBONDS * sizeof(float) | 3 * 32 * 4 | 384
ref_orientation_quats | 4 * MAX_NUM_OF_RUNS * sizeof(float) | 4 * 1000 * 4 | 16000

A total of 252528 bytes is required for constant data - far below the smallest CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE among the GPUs listed above, but well above the 65536-byte limit that oclgrind enforces (hence the errors above).

For debugfastergrad, we require the following arrays as well:

gradsrotbonds

Constant array | Size definition | Size calculation | Size in Bytes
rotbonds_atoms | MAX_NUM_OF_ATOMS * MAX_NUM_OF_ROTBONDS * sizeof(int) | 256 * 32 * 4 | 32768

grads (subtotal size: 12384)

Constant array | Size definition | Size calculation | Size in Bytes
rotbonds | 2 * MAX_NUM_OF_ROTBONDS * sizeof(int) | 2 * 32 * 4 | 256
num_rotating_atoms_per_rotbond | MAX_NUM_OF_ROTBONDS * sizeof(int) | 32 * 4 | 128
angle | 1000 * sizeof(float) | 1000 * 4 | 4000
dependence_on_theta | 1000 * sizeof(float) | 1000 * 4 | 4000
dependence_on_rotangle | 1000 * sizeof(float) | 1000 * 4 | 4000

A total of 297680 bytes (252528 + 45152) is required for constant data - again far below the smallest CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE among the available GPUs, yet above oclgrind's 65536-byte limit.
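
A sketch of the proposed re-qualification (OpenCL C). The particular split below is an assumption for illustration, chosen from the sizes tabulated above: small tables stay __constant, the large ones become __global const:

__kernel void gpu_calc_energy(
        __constant    float *atom_charges,        /*   1024 B: fits easily  */
        __constant    float *VWpars_AC,           /*    784 B               */
        __global const char *intraE_contributors, /* 196608 B: exceeds a
                                                     64 KB constant limit   */
        __global const int  *rotlist)             /*  32768 B               */
{
    /* Kernel body unchanged: reads work the same through either
       address space; only the caching behavior differs. */
}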

Parsing of fld file possibly pointless

From @diogo.martins:

The fld file (input arg -ffile) is parsed to get the following data:

  1. spacing, grid center, and grid size
  2. the prefix for map files

The fld file explicitly declares an association between atom types and map files. These associations are currently ignored by ocladock.

It could be more intuitive if the -ffile argument were replaced by a -mapprefix argument. The grid center, size, and spacing would then be parsed from the map files.

Note: this is very low priority.

Verify local memory usage in ADADELTA kernel

Analysis made from commit b243639.

Determining entity, and its run, energy, and genotype

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
entity_id | int | 1 | 4 * 1 | 4
run_id | int | 1 | 4 * 1 | 4
energy | float | 1 | 4 * 1 | 4
genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
Subtotal (Bytes): 164

Iteration counter for the minimizer

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
iteration_cnt | uint | 1 | 4 * 1 | 4

Partial results of the gradient step

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152

Energy may go up, so we keep track of the best energy ever calculated.

// Then, we return the genotype corresponding
// to the best observed energy, i.e. "best_genotype"

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
best_energy | float | 1 | 4 * 1 | 4
best_genotype | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
Subtotal (Bytes): 156

Gradient of the intermolecular energy per each ligand atom

// Also used to store the accumulated gradient per each ligand atom

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
gradient_inter_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
gradient_inter_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
gradient_inter_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
Subtotal (Bytes): 3072

Gradient of the intramolecular energy per each ligand atom

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
gradient_intra_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
gradient_intra_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
gradient_intra_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
Subtotal (Bytes): 3072

Ligand-atom position and partial energies

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
calc_coords_x | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
calc_coords_y | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
calc_coords_z | float | MAX_NUM_OF_ATOMS | 4 * 256 | 1024
partial_energies | float | NUM_OF_THREADS_PER_BLOCK | 4 * 64 (GPU) or 4 * 16 (CPU) | 256 (GPU) or 64 (CPU)
Subtotal (Bytes): 3328 (GPU) or 3136 (CPU)
#if defined (DEBUG_ENERGY_KERNEL)
__local float partial_interE[NUM_OF_THREADS_PER_BLOCK];
__local float partial_intraE[NUM_OF_THREADS_PER_BLOCK];
#endif

Vector for storing squared gradients E[g^2]

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
square_gradient | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152

Update vector, i.e., "delta".

// It is added to the genotype to create the next genotype.
// E.g. in steepest descent "delta" is -1.0 * stepsize * gradient

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
delta | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152

Squared updates E[dx^2]

Local memory | Element datatype | Size definition | Size calculation | Size in Bytes
square_delta | float | ACTUAL_GENOTYPE_LENGTH | 4 * 38 | 152
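
For reference, a hedged C sketch of the ADADELTA update these buffers implement (square_gradient = E[g^2], delta, square_delta = E[dx^2]), in the standard form from Zeiler 2012; the RHO and EPSILON values are assumptions, not read from the kernel:

#define RHO     0.8f
#define EPSILON 1e-6f

for (int i = 0; i < genotype_length; i++) {
    float g = gradient[i];

    /* Decaying average of squared gradients. */
    square_gradient[i] = RHO * square_gradient[i] + (1.0f - RHO) * g * g;

    /* Update step, scaled by the ratio of the two RMS values. */
    delta[i] = -g * sqrtf((square_delta[i]    + EPSILON) /
                          (square_gradient[i] + EPSILON));

    /* Decaying average of squared updates. */
    square_delta[i] = RHO * square_delta[i]
                    + (1.0f - RHO) * delta[i] * delta[i];

    genotype[i] += delta[i];
}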

Total

  • GPU: 164+4+152+156+3072+3072+3328+152+152+152 = 10404
  • CPU: 164+4+152+156+3072+3072+3136+152+152+152 = 10212

Using -lsrat 0 causes clEnqueueNDRangeKernel() -54

The program currently accepts an lsrat value equal to 0 and stores it in dockpars.num_of_lsentities, which in turn is used to define the work-group size of the LS kernel.

A work-group size of zero leads to OpenCL error code -54 (CL_INVALID_WORK_GROUP_SIZE). This affects both the master and fastergrad branches.
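
A host-side guard sketch (the field name follows the issue text; the exact reporting is illustrative): reject a zero work size instead of passing it to clEnqueueNDRangeKernel:

if (dockpars.num_of_lsentities < 1) {
    fprintf(stderr, "Error: -lsrat must yield at least one "
                    "local-search entity (got %u).\n",
            dockpars.num_of_lsentities);
    return 1;  /* or skip enqueueing the LS kernel entirely */
}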

Correct Makefile: INCLUDE_PATH and LIBRARY_PATH

It seems unnecessary to have separate <VENDOR>_INCLUDE_PATH and <VENDOR>_LIBRARY_PATH variables for INTEL, AMD, and NVIDIA.

The reason is that this approach would require as many such variables as there are vendors whose devices OCLADock supports.

Proposal

Adopt the following:

  1. Naming convention
  2. User requirements

Naming convention

Device type | Include path | Library path
CPU | CPU_INCLUDE_PATH | CPU_LIBRARY_PATH
GPU | GPU_INCLUDE_PATH | GPU_LIBRARY_PATH

User requirements

The user must define these include and library paths before either executing or compiling OCLADock.

Motivation

  1. Improve the Makefile, making it cleaner and more general
  2. Provide a single GPU executable (as suggested by Scripps)

Merging of energy and gradient calculation

Gradient-based local-search kernels use both the energy and the gradient functions, which share common:

  • Subfunctions: e.g., parallel rotation
  • Processing structures: loops over intermolecular and pairwise interactions

By merging the energy and gradient functions, it would be possible to reuse such subfunctions and to apply loop fusion, in order to accelerate overall execution.

Such merging is possible only when both functions use the same input genotypes. Currently, two of these kernels (fire & adadelta) meet that criterion.
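
A sketch of the fused loop (OpenCL C; grid_energy() and grid_gradient() are invented stand-ins for the shared intermolecular evaluation):

/* One pass over the ligand atoms accumulates both the energy and its
   gradient, instead of two separate loops in two separate functions. */
for (uint atom = get_local_id(0);
     atom < dockpars_num_of_atoms;
     atom += get_local_size(0)) {
    partial_energies[get_local_id(0)] += grid_energy(atom);

    float3 g = grid_gradient(atom);
    gradient_inter_x[atom] += g.x;
    gradient_inter_y[atom] += g.y;
    gradient_inter_z[atom] += g.z;
}
barrier(CLK_LOCAL_MEM_FENCE);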
