cmpa / ipic3d
Particle-in-Cell code using the implicit moment method
Mixing stdout with stderr causes jumbled output in a parallel environment.
Currently Particles3D::particle_repopulator contains code like this:

srand(vct->getCartesian_rank() + 1 + ns + (int(MPI_Wtime())) % 10000);
if (vct->getXleft_neighbor() == MPI_PROC_NULL && bcPfaceXleft == 2) {
  [...]
}
I do not see any need to seed the random number generator more than once in the program, and I would rather eliminate these calls. Seeding the random number generator destroys the ability to get exactly reproducible results (except when no data generated by calls to rand() is actually used), and there should be a single point in the program that selects whether the seed is deterministic or comes from a time value that differs with each invocation of the program.
Even if the srand calls are retained, I would change the code to:

if (vct->getXleft_neighbor() == MPI_PROC_NULL && bcPfaceXleft == 2) {
  srand(vct->getCartesian_rank() + 1 + ns + (int(MPI_Wtime())) % 10000);
  [...]
}
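A minimal sketch of the kind of single seeding point intended here (the function name and the configuration flag are assumptions, not from the code):

#include <mpi.h>     // MPI_Wtime()
#include <cstdlib>   // srand()

// Called exactly once at startup; 'deterministic' would come from a
// configuration option choosing reproducible vs. per-run seeding.
void seed_rand_once(int cartesian_rank, bool deterministic)
{
  if (deterministic)
    srand(cartesian_rank + 1);                             // reproducible runs
  else
    srand(cartesian_rank + 1 + int(MPI_Wtime()) % 10000);  // varies per run
}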
Issue #22 reported poor performance summing moments.
Issue #23 began to address the lack of OpenMP parallelism in summing moments.
Issue #33 reported continued poor performance summing moments.
Initial steps in addressing issue #33 have been to restore the grid coordinate accessors to be inline and, under issue #40, to use one-dimensional arrays rather than three-dimensional arrays to obtain grid coordinates. These changes alone, however, have been insufficient, and summing moments has remained slow. This suggests the need for a more comprehensive restructuring that improves computation speed by eliminating indirection.
It would be possible for a sophisticated compiler to optimize the code that sums moments in Particles3Dcomm::interpP2G by (1) unrolling the triply nested loops that sum from 0 to 1, (2) expanding and subsequently optimizing inline methods such as EMfields3D::addRho that also contain such loops, and (3) factoring out the repeated multiplication by the same unchanging scalars. But this is too much to expect. Summing moments is a small but performance-critical part of the code, and it therefore pays to avoid assuming that the compiler and optimizer are sophisticated enough to perform these optimizations. Thus, I have expanded the loops and factored out constant multiples from sums. Preliminary observations on my laptop indicate that this accelerates computing moments by a factor of roughly 2 or 3.
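As an illustration of the kind of transformation involved (a sketch with assumed variable names, not the actual iPic3D code), consider accumulating one particle's charge contribution, with weight factor q, to the eight nodes of its cell:

// Before: triply nested loop over the 8 nodes; the compiler must both
// unroll the loop and hoist the constant factor q out of it.
for (int i = 0; i <= 1; i++)
  for (int j = 0; j <= 1; j++)
    for (int k = 0; k <= 1; k++)
      rho[ix - i][iy - j][iz - k] += q * weight[i][j][k];

// After: hand-unrolled, with the constant factor q folded in once.
const double w000 = q * weight[0][0][0];
const double w001 = q * weight[0][0][1];
const double w010 = q * weight[0][1][0];
const double w011 = q * weight[0][1][1];
const double w100 = q * weight[1][0][0];
const double w101 = q * weight[1][0][1];
const double w110 = q * weight[1][1][0];
const double w111 = q * weight[1][1][1];
rho[ix  ][iy  ][iz  ] += w000;
rho[ix  ][iy  ][iz-1] += w001;
rho[ix  ][iy-1][iz  ] += w010;
rho[ix  ][iy-1][iz-1] += w011;
rho[ix-1][iy  ][iz  ] += w100;
rho[ix-1][iy  ][iz-1] += w101;
rho[ix-1][iy-1][iz  ] += w110;
rho[ix-1][iy-1][iz-1] += w111;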
In the process of making these changes, I have chosen to move Particles3Dcomm::interpP2G and rename it EMfields3D::sumMoments. With respect to encapsulation, I believe that this is a more appropriate location for this method, since it involves write access to moment data and merely read access to particle data. Later we may want to pull the moments out of the EMfields3D class and encapsulate them in a class that handles moments, at which point the sumMoments method could be moved into that class. It will be useful to have an encapsulated class of moments that can be passed from the accelerator (e.g. DEEP's "booster") to the computing unit responsible for the field solve (e.g. DEEP's "cluster").
I also feel that it is more natural to refer to the operation as summing moments than as interpolating particles to the grid. Interpolation is a way of viewing part of what we are doing, does not really encompass the whole process of summing moments, and is not necessarily involved in alternative ways of representing particles. I could have chosen to call the method sumParticleMoments, but since a collection of particles is the first argument to the method I felt that sumMoments was sufficient. In the future, sumMoments could also incorporate moments from a fluid representation.
In the process of moving this code, I noticed a number of unused methods, such as Particles3Dcomm::calculateWeights, Particles3D::interpP2G_onlyP, and interpP2G_notP, all of which would need to be moved and reimplemented if they were ever to be reinstated. I am therefore removing these methods.
I also observed that the abstract base class Particles has Particles3Dcomm as its only child and therefore serves no obvious purpose. While none of the virtual methods of Particles are called in performance-critical loops, I think that we might as well eliminate this class via

typedef Particles3Dcomm Particles;

and replace the content of Particles.h with the line

#include "Particles3Dcomm.h"
Does anyone have any input regarding the purpose for which this and other abstract base classes were created?
As discussed in issue #4, to support tracking more than INT_MAX (about 2 billion) particles, we need to use long long rather than int to index them. Currently the code uses unsigned long in some places and long long in others. On Xeon both of these are 8 bytes, but on my Mac unsigned long is only 4 bytes (the same as int), while long long is the full 8 bytes needed to index billions of particles. Therefore, to support billions of particles on all architectures, I am modifying the code to consistently use long long to index particles.

On the other hand, I see no need to support more than 2 billion particles in a single MPI process, and indexing with long integers could slow iterating through particles. I am therefore changing all array indices to use int rather than long long.
Put

#include <mpi.h>

at the top of particles/Particles3D.cpp and particles/Particles3Dcomm.cpp, before

#include <iostream>

Otherwise I get an error:

/opt/intel/impi/4.1.0.030/intel64/include/mpicxx.h(95): catastrophic error: #error directive: "SEEK_SET is #defined but must not be for the C++ binding of MPI. Include mpi.h before stdio.h"
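For concreteness, the first lines of each of those two files then read:

#include <mpi.h>      // must precede any header that may pull in stdio.h
#include <iostream>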
I want to check in the beginnings of an ipic command line system with a calling syntax analogous to git or hg. The initially supported commands are:
ipic ctags
ipic help
ipic help ctags
ipic help mic
To this end, I am creating a scripts subdirectory.
If this directory is placed in your path, e.g. via
export PATH="$HOME/ipic3d/scripts:$PATH"
then the commands above should work appropriately.
MPI should be initialized immediately on startup, before initializing the iPic3D library. This allows the iPic3D initialization to issue an error message and exit if appropriate without causing mpirun to complain that MPI_Init was never called.
The c_Solver class should be subdivided into two classes: a FieldSolver class and a ParticleSolver class.

This fundamental restructuring is being prompted by the port to the DEEP architecture, where we will solve fields on the cluster and particles on the booster. However, separating the two solvers will improve flexibility in ways that extend far beyond DEEP. The benefits include:
The data that needs to be exchanged between these two classes is:
the electric field: Ex, Ey, Ez.
While the magnetic field is also needed on the particle side to push the particles, the magnetic field is easily updated from the electric field, and so there is no need to transfer the magnetic field data except on the first iteration of the solver. It is conceivable that magnetic field data might need to be sent again because the updated fields on each side drift apart through accumulated machine arithmetic error, but such error should accumulate very slowly.
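For reference, the update in question is the discrete Faraday law (a sketch; the exact time-centering used in iPic3D may differ):

B^{n+1} = B^n - c Δt (∇ × E)

so once B is initialized on the particle side, the electric field communicated each cycle suffices to advance it.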
The arrays in the code that need to be sent are Jxh, Jyh, Jzh, and rhons.
Note that rhoc (used in the Poisson correction) is computed from rho, that rho is computed by summing over rhons, and that rhoh is easily computed from rhoc and Jxh.
Follow-up: communicate the magnetic field separately, in addition to the electric field. I have decided to modify the above proposal: calculate the smoothed versions of E and B needed to push the particles as soon as E or B is updated, send them immediately, and use this data on the Booster side to construct the SoA fields needed to push particles.
Justification. While it is technically possible to communicate only the electric field from the field solver to the particle solver, it will be simpler and almost certainly more efficient to communicate both fields. The reason is that the field used to push particles must be a modified version of the field evolved in the field solver. Smoothing needs to be applied to the field several (currently 3) times before it can be used to evolve particles, and each of these smoothings requires an exchange of boundary data. Moreover, updating the magnetic field from the electric field requires calculating a curl, which also requires an exchange of boundary data. Even a 10x10x10 mesh is half boundary cells, and unless the dimensions of the subgrid are unfeasibly large, a substantial fraction of the mesh cells are boundary cells. The transformations that prepare the field data for the particle solver thus involve successive synchronizations and are more communication-intensive than the mere task of transferring the magnetic field components. By directly transferring the magnetic field along with the electric field, we can do all transformations on the Cluster, where we expect them to be performed more efficiently, without repeating any of them.

We could even do on the Cluster the transposition of field data to the AoS format (see issue #52) preferred for cache performance in the particle mover. On the other hand, transposition is much less expensive than communication, and doing the transposition on the Booster allows us to send the magnetic field separately and avoids communicating the 33% of additional garbage data needed in the AoS format for alignment. In the current algorithm, the old field is used to push particles (for consistency with the implicit moment field update), so sending the magnetic field as soon as it is updated means that communication of the magnetic field can occupy an entire cycle of the algorithm before it needs to complete.
Communicating moment boundary data on the Cluster side.
After accumulating moments on the Booster, hatted moments need to be computed on the Booster and communicated to the field solver running on the Cluster. Computing the hatted current involves evaluating the divergence of pressure tensor quantities, which requires communicating boundary data. This boundary communication can be delayed and reduced, however, until the hatted moments have reached the Cluster, if one is willing to accept a loss of numerical precision from summing nearly canceling terms involving the divergence of the pressure tensor in boundary cells or nodes. (To avoid this cancellation error one must communicate boundary node data for each pressure tensor component of each species.) With this delayed communication, only five moments need boundary communication, but it would be necessary to communicate and sum over a widened halo that is three layers thick, including not just boundary nodes but ghost nodes and the first layer of interior nodes. Communicating node values of the 6+6=12 pressure tensor components avoids the loss of numerical precision and reduces the number of layers of nodes that must be communicated at the end from three to the single layer of shared nodes. Communicating the node values of the other moments, or of the divergence of the pressure tensor, can wait until after the hatted moments have been summed (reducing the number of communications needed by 6) and communicated to the Cluster (where it can be done faster).
I want to check in code to eliminate the use of exceptions in iPic3D.
Allowing exceptions has performance penalties. In particular, while developing an array class for iPic3D, I observed that looping through array elements took three times as long if the body of the destructor of the array class was nonempty. This performance penalty disappeared when I compiled with the -fno-exceptions flag.
I see no reason to retain exception handling in iPic3D. This is not an event-driven application. All current use of exceptions is to generate error messages, which can be handled adequately with simple calls to eprintf(). Therefore, I am going through the code and replacing each throwing of an exception, such as

PSK::OutputException e("make_dataset fails for " + tag, "HDF5OutputAdaptor::write(double* array)");
throw e;

with the printing of an error message such as

eprintf("make_dataset fails for %s", tag.c_str());

Note that the eprintf() macro automatically prints file, line number, and method name, so there is no need to add any of that information to the call to eprintf().
In my initial implementation of this change I simply comment out all occurrences of try, all catch blocks, and all calls to throw, replacing each call to throw with a call to eprintf. I leave it for later to decide whether to completely eliminate the Exception and OutputException classes in the PSK namespace or whether, for some reason, to revert to supporting the possibility of exceptions via preprocessor branching directives.
Issue #17 introduced the TimeTasks class to implement basic profiling.
A simpler and more general approach is possible based on macros and automatic destructors.
Create a set of time_task macros to be used as follows:
void mymethod()
{
  // record time from this point to the end of this scope
  timeTasks_set_main_task(Task::PARTICLES);
  ...
  // block that does communication
  {
    // mode is reset to communication until end of scope
    timeTasks_set_communicating();
    ...
  }
}
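A minimal sketch (assumed names, not the committed implementation) of how such macros can work via automatic destructors:

// Guard object: records the start time on construction and charges the
// elapsed time to the given task when it goes out of scope.
// 'timeTasks' is the global singleton; add_time() is a hypothetical
// accumulator method.
class TimeTasks_caller
{
  int task;
  double start;
public:
  TimeTasks_caller(int task_) : task(task_), start(MPI_Wtime()) {}
  ~TimeTasks_caller() { timeTasks.add_time(task, MPI_Wtime() - start); }
};

// The macro just declares a guard on the stack; timing ends automatically
// at the closing brace of the enclosing scope.
#define timeTasks_set_main_task(t) TimeTasks_caller _timeTasks_guard(t)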
Improvements: paired start() and end() methods are no longer needed; the guard's destructor ends timing automatically at the end of the scope, even on early return.
I want to check in code that changes the implementation of iPic3D arrays.
iPic3D currently uses chained pointers (e.g. double ***arr3;) rather than array classes. This has a number of deficiencies:
I have therefore implemented array classes according to the following specifications:
Among the specifications: bounds checking can be turned on (via -DCHECK_BOUNDS) with no performance penalty when turned off.
I have implemented extensive performance tests of these arrays and have compared the performance of native arrays, chained pointers, and calculated-index access for both g++ and icpc (Intel), on both my laptop and the Xeon processor. With icpc on the Xeon processor, the performance differences can be significant and hard to predict. With g++, performance differences are much smaller. Using the -O3 flag to turn on optimization, I have verified that there is no performance difference resulting from any of the choices of syntax considered in the following paragraphs.
To avoid the need for systematic modifications to the code, I have implemented access of elements via bracket syntax, e.g. intarr[i][j]=5;. If I were writing a new code, I would prefer parenthesis syntax, e.g. myarr(i,j)=5;. But requiring such syntax in iPic3D would entail systematically changing about 1000 lines of code, mostly in the fields classes. Given the number of forked versions of iPic3D that need to be merged, it is not desirable to impose such a change at this time. We could use a script to change to such notation in the future; making the switch would be most painless at the end of a merge point in the software development cycle. Given the nature and the pressures of academic work, realizing such a merge point may not be easy.

Ideally, I prefer syntax that distinguishes write access from read access. At a minimum, we should support read-only array access so that we can pass const array references to methods that use array information without modifying it. Therefore, I have implemented get methods, e.g. var = intarr.get(i,j);. For completeness, I have also implemented set methods, e.g. intarr.set(i,j, myvar);, allowing the user to clearly indicate write access.
I have also implemented a fetch() accessor, which allows both read and write access via:

myArr.fetch(i,j) = 5;

Of course we could replace fetch with operator(), allowing the usage

myArr(i,j) = 5;

as in the implementation of Vicenc Beltran and his collaborators, but I preferred being able to search for the string fetch in order to identify potential write access. I will consider feedback offered in response to this issue and am willing to change fetch to operator() if that is the consensus desideratum.
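To make the accessor scheme concrete, here is a minimal sketch of a 2D array class in this style (not the actual iPic3D implementation, which also supports bracket syntax and optional bounds checking):

// Minimal 2D array illustrating get/set/fetch access (sketch only;
// a real implementation would also disable copying or implement it).
class array2_double
{
  int nx, ny;
  double* ptr;
public:
  array2_double(int nx_, int ny_) : nx(nx_), ny(ny_), ptr(new double[nx_*ny_]) {}
  ~array2_double() { delete[] ptr; }
  // read-only access: usable through a const reference
  double get(int i, int j) const { return ptr[i*ny + j]; }
  // explicit write access
  void set(int i, int j, double value) { ptr[i*ny + j] = value; }
  // read-write access; greppable marker of potential writes
  double& fetch(int i, int j) { return ptr[i*ny + j]; }
};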
I am eliminating the following abstract base classes:

Grid
Field

and replacing them with typedefs that map each abstract base class to the unique class that currently inherits from it:

typedef Grid3DCU Grid;
typedef EMfields3D Field;
Virtual methods require run-time dereferencing of pointers, which for frequently called functions such as Grid3DCU::getXN() and EMfields3D::getEx() hurts performance. Furthermore, calls to such virtual functions prevent vectorization on the MIC.
I don't see the need for the CollectiveIO abstract base class, so I am replacing it with

typedef Collective CollectiveIO;

in Collective.h, and I am replacing the content of CollectiveIO.h with

#include "Collective.h"
Users should not have to upgrade hdf5 to a version that supports parallel hdf5 in order to compile and run iPic3D.
Particles are currently sorted by MPI subdomain. If one uses only MPI parallelization, this allows for efficient pushing of particles and summing of moments as long as 10*nxn*nyn*nzn*sizeof(double) fits in cache. But to support effective use of vector units (such as on Intel's Xeon and Xeon Phi processors) to push particles and sum moments, we need to sort particles at the level of individual mesh cells.
Sorting particles at the cell level will have benefits that extend beyond vectorization. Benefits of this fine-grain particle locality include:
If particles are sorted per mesh cell, then we can vectorize each iteration of pushing all of the particles positioned in a given cell, because pushing them uses the same field data. Note that with each iteration of the particle solver, the position (xp, yp, zp) used to interpolate from the field is updated, so the particles need to be re-sorted.
To sort particles at the cell level, we can sort by the Z-value of the cell in the mesh of each MPI process. The Z-value is computed by interleaving the bits of the x, y, and z array indices. We can adapt the multisort code that Maria Elena and I worked with at the workshop at Barcelona Supercomputing Centre in July in order to do fast, thread-safe sorting based on the Z-value. An alternative would be to simply walk through the particles and move them if necessary to the correct cell, but doing this in a thread-safe way would take some care.
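A sketch of the bit-interleaving computation (assuming each cell index fits in 10 bits, so the key fits in 32 bits):

// Compute the Z-value (Morton code) of a cell by interleaving the bits
// of its x, y, and z indices within the local subgrid.
inline unsigned int compute_z_value(unsigned int x, unsigned int y, unsigned int z)
{
  unsigned int key = 0;
  for (int b = 0; b < 10; b++)
  {
    key |= ((x >> b) & 1u) << (3*b);
    key |= ((y >> b) & 1u) << (3*b + 1);
    key |= ((z >> b) & 1u) << (3*b + 2);
  }
  return key;
}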
This specification was written in consultation with Jorge Amaya.
cmake supports the ability to compile in a separate directory without modifying the original source tree. By convention this directory is called build.
We envision that everything needed to run ("execute") iPic3D be placed in build/exec: the executable (iPic3D) and the configuration ".inp" file. (Formerly this directory was called "build/work", but this name seems less clear. We prefer not to call it "build/run", because one might consider "run" to refer to an individual simulation.) The executable will be called "build/exec/iPic3D". (Formerly it was called "iPIC3D", but we are trying to be consistent about capitalization.)
Currently many header files include system header files or include other header files unnecessarily.
The number of .h files included in a .h file should be minimized, for two reasons:
The result of including .h files in .h files is that (1) you have to recompile everything much more frequently and (2) each recompilation takes much longer. The effects compound.
To reduce these problems to a minimum, I recommend that we do as follows:
If you need to forward-declare iostreams, use #include <iosfwd>. If you need array templates, use #include "arraysfwd".
Note that there is no need to include headers of classes that you are merely referring to as arguments to a method or as members of a class. Just forward-declare the class and use pointers and references:
class MyClass;
void myfunction(MyClass& myClass);
class NewClass {
  MyClass *myClass;
};
To summarize, you only need to include a .h file in a .h file to access an interface.
Alec
I have modified Particles3D::mover_PC() for vectorization on the MIC. Changes:
I added #pragma omp parallel for and #pragma simd directives prior to the loop over particles.
Note that as yet vectorization does not accelerate performance of mover_PC() on the MIC Xeon Phi coprocessor, but the changes above prevent vectorization from hurting performance.
Under commit e3c5dca, methods named asgArr3 and asgArr4 were introduced and used in Particles3D::mover_PC. These methods do nothing and should be deleted. Lines such as
double ***Ex = asgArr3(double, grid->getNXN(), grid->getNYN(), grid->getNZN(), EMf->getEx());
can be replaced with the equivalent code
double ***Ex = EMf->getEx();
This issue is about code diagnostics fixes that I want merged into the main code.
The error, debug, and warning messages available by including debug.h, errors.h, and asserts.h were not implemented in an atomic manner. As a consequence, output from code diagnostic messages from multiple processes is often interleaved, making it difficult to determine which output comes from which process or thread.
I am fixing this by first writing the message to a string and then printing the string to standard output with a single call to fprintf.
Note that I have also changed the output format so that messages indicate process and thread number when appropriate.
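The fix follows this pattern (a sketch; the actual macros also prepend file, line, function, and process/thread number):

#include <cstdarg>
#include <cstdio>

// Format the full message into a buffer first, then emit it with a single
// fprintf call so that output from different processes is not interleaved.
void print_atomic(const char* format, ...)
{
  char buffer[1024];
  va_list args;
  va_start(args, format);
  vsnprintf(buffer, sizeof(buffer), format, args);
  va_end(args);
  fprintf(stdout, "%s", buffer);
  fflush(stdout);
}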
iPic3D is currently only first-order accurate in time. Second-order accuracy in time requires that particles be pushed with field values that approximate time-averaged values and that the implicit susceptibility be calculated with field values and densities that approximate time-averaged values.
To achieve this, I would introduce support for options something like:
PushWithBatTime = .5   # default: 0
PushWithEatTime = .5   # default: 1
ImplSusceptTime = .5   # default: 0
More details are posted here.
Stefano (@markidis) writes:
I have downloaded the iPIC3D version from github and compared it with my version (the one you started from), running with one core on my MacAir. I use the same input file and the same amount of IO. I use the non-optimized Alloc.h.

My version finishes the problem in 79.6 seconds, while the github version finishes in 113 seconds. My feeling is that this is in part due to the use of a new class for the moments that introduces an overhead. I am not sure about this, but we should investigate more. Would it be possible to have the old EMfields3D.h and compare its performance with the new one?

Maybe we should introduce a test regarding performance as well? While I think a 2-5% loss of performance might be acceptable in some cases, a 30% loss of performance is quite bad.

Thank you very much, stefano.
I'm using this issue to test the mailing list.
This is code that Maria Elena wrote that we want to check in.
On most high-performance computing platforms, iPic3D must be cross-compiled.
Therefore, we should support cross-compiling in cmake.
We are creating a cmake subdirectory for cmake configuration files.
Commit 7cbe28c introduced a memory leak in inputoutput/WriteOutputParallel.cpp. In WriteOutputParallel(), calls to EMf->getExc(grid) etc. allocate memory that is never deleted. This is a massive memory leak: the amount of memory leaked equals the amount of data written. A better solution would be to allocate a single buffer and reuse it for each variable written.
I am creating this issue as a header under which to check in future modifications for the purpose of accelerating compilation.
We can accelerate compilation and recompilation by observing the following principles.
Avoid including a .h file from a .h file. This hurts compilation speed in two ways.
First, recompilation speed is hurt by unnecessary header file inclusions, because cmake treats every header file inclusion as an actual code dependency.
Second, compilation speed is hurt when the included header files expand to a large number of lines, because in many compilers the preprocessor must process all the included files.
In particular, the C++ stream header files are notoriously bloated, typically expanding to 20000 lines that the preprocessor must parse for every .cpp file (object module) unless sophisticated techniques are used such as precompiled headers. It is better not to write code that assumes a sophisticated compiler.
This leads to the second principle: avoid creating a large number of .cpp (implementation) files that include large header files.
If we want to put every class implementation in its own implementation file, then we should avoid including the iostream header except in classes that really do file I/O and are not just printing error messages. One of the reasons that I implemented the functionality in #include "error.h", which defines the eprintf(), invalid_value_error(), and unsupported_value_error() methods (see issue #29), was to avoid the need to include iostream headers merely to print simple error messages.
In general, we can accelerate compilation by separating out declarations and headers and consolidating implementation files. For example, if class A needs to contain an instance of class B, and you make A contain a pointer to B, then you do not need to include B.h in A.h; you only need the forward declaration class B; in A.h. If a large number of small classes need the same large set of header files, it may be appropriate to implement them all in a common .cpp file.
I more than doubled the speed of compilation of another software package (DoGPack) by implementing changes along these lines.
I am changing EMfields3D to use separate arrays to solve the electromagnetic field and to represent the electromagnetic field used to push particles. I am declaring the new field representation as a four-dimensional array of type pfloat (float or double according to user configuration):

array4_pfloat EMfields3D::fieldForPcls(nxn, nyn, nzn, 6);
The following benefits are anticipated:
A single fetch

arr1_pfloat_get field000 = fieldForPcls[ix][iy][iz];

can be made, followed by six one-dimensional array accesses of the form

Bxl += weight000 * field000[0];
Byl += weight000 * field000[1];
Bzl += weight000 * field000[2];
Exl += weight000 * field000[3];
Eyl += weight000 * field000[4];
Ezl += weight000 * field000[5];

rather than having to recalculate a three-dimensional subscript six times:

Bxl += weight000 * Bx[ix][iy][iz];
Byl += weight000 * By[ix][iy][iz];
Bzl += weight000 * Bz[ix][iy][iz];
Exl += weight000 * Ex[ix][iy][iz];
Eyl += weight000 * Ey[ix][iy][iz];
Ezl += weight000 * Ez[ix][iy][iz];
I have implemented macros to enforce conditions:
assert_eq(a,b)
assert_ne(a,b)
assert_gt(a,b)
assert_lt(a,b)
assert_ge(a,b)
assert_le(a,b)
assert_divides(a,b)
where the boolean operator names follow Fortran/LaTeX conventions.
For example, one might use:
inline Grid3DCU::Grid3DCU(CollectiveIO * col, VirtualTopology3D * vct)
{
  assert_divides(vct->getXLEN(), col->getNxc());
  assert_divides(vct->getYLEN(), col->getNyc());
  assert_divides(vct->getZLEN(), col->getNzc());
  // add 2 for the guard cells
  nxc = (col->getNxc()) / (vct->getXLEN()) + 2;
  [...]
If the asserted condition does not hold, an error message is printed, e.g.:
ERROR in file grids/Grid3DCU.h, line 171, function Grid3DCU
assertion failed: vct->getXLEN()(divides)col->getNxc(), i.e., 2(divides)63
The assert_* macros are intended as an extension of the system assert() macro accessible via #include <assert.h>. I am making these macros available via #include "utility/asserts.h".
To conform with the behavioral description of assert available via man assert, if NDEBUG is defined then all these assert_* macros are compiled out, so that no check is made and there is no performance penalty.
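A sketch of how one of these macros might be defined (an assumed implementation; the committed version may differ in message formatting):

// Compiled out entirely when NDEBUG is defined, like the system assert().
// eprintf() (described elsewhere in this document) prints file, line, and
// function before the message and then exits.
#ifdef NDEBUG
#define assert_eq(a, b) ((void)0)
#else
#define assert_eq(a, b) \
  do { if (!((a) == (b))) \
    eprintf("assertion failed: %s == %s", #a, #b); } while (0)
#endif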
I have implemented macros such as

dprintf(format_string, ...)

which print the function, file, line, and message when debugging is turned on. The processor rank is prefixed to the message. In addition, I have implemented

dprint(intvar)
dprint(doublevar)

macros, which are equivalent to

dprintf("intvar==%d", intvar)
dprintf("doublevar==%24.16e", doublevar)
More details remain to be specified, such as restricting output to the lead MPI process (rank 0) and selecting a debug level.
I am creating utility/TimeTasks.h, implemented in utility/diagnostics.cpp, to perform coarse-level self-profiling.
The essential tasks to profile are:
The time spent in each task is subdivided into
This class is conceived of as a singleton.
Statistics are accumulated via calls to start() and end() methods at the beginning and end of major procedure calls and of communication routines such as those found in "communication/ComNodes3D.h". The start() and end() methods call MPI_Wtime().
Detailed behavior regarding accumulation or averaging of statistics across MPI processes and/or cycles of the algorithm is not yet specified.
In EMfields.cpp,

rhons = newArr4(double, ns, nxn, nyn, nzn);
rhocs = newArr4(double, ns, nxc, nyc, nzn);

This should be

rhons = newArr4(double, ns, nxn, nyn, nzn);
rhocs = newArr4(double, ns, nxc, nyc, nzc);
Hi,

I still have severe problems related to the interpolation step on my MacAir. One of my runs gives me this:
=== timing information for cycle===
| moms flds pcls Bfld cycl
| 11.19 0.25 2.28 0.00 13.72 (total time)
| 0.01 0.03 0.20 0.00 0.24 (communication)
| 11.19 0.22 2.08 0.00 13.49 (computation)
| MOMS comm FLDS comm PCLS comm CYCL comm
| 11.19 0.01 0.25 0.03 2.28 0.20 13.72 0.24
I doubt that this problem is related to the particle classes, since that issue should be fixed. I am starting to think that it is more related to the Moments class in EMfields3D.h.
Any idea how to fix this?
I will check tonight whether changing compiler flags can mitigate this problem. Thank you very much, stefano.
In the Grid3DCU class, the implementation

double &getXN(int X, int Y, int Z) { return (node_coordinate[X][Y][Z][0]); }
double &getYN(int X, int Y, int Z) { return (node_coordinate[X][Y][Z][1]); }
double &getZN(int X, int Y, int Z) { return (node_coordinate[X][Y][Z][2]); }
double &getXC(int X, int Y, int Z) { return (center_coordinate[X][Y][Z][0]); }
double &getYC(int X, int Y, int Z) { return (center_coordinate[X][Y][Z][1]); }
double &getZC(int X, int Y, int Z) { return (center_coordinate[X][Y][Z][2]); }
can be replaced with
const double &getXN(int X, int Y, int Z) { return node_xcoord[X]; }
const double &getYN(int X, int Y, int Z) { return node_ycoord[Y]; }
const double &getZN(int X, int Y, int Z) { return node_zcoord[Z]; }
const double &getXC(int X, int Y, int Z) { return center_xcoord[X]; }
const double &getYC(int X, int Y, int Z) { return center_ycoord[Y]; }
const double &getZC(int X, int Y, int Z) { return center_zcoord[Z]; }
if in the constructor we replace
node_coordinate = newArr4(double, nxn, nyn, nzn, 3); // 0 -> X, 1 -> Y, 2 -> Z
for (int i = 0; i < nxn; i++) {
  for (int j = 0; j < nyn; j++) {
    for (int k = 0; k < nzn; k++) {
      node_coordinate[i][j][k][0] = xStart + (i - 1) * dx;
      node_coordinate[i][j][k][1] = yStart + (j - 1) * dy;
      node_coordinate[i][j][k][2] = zStart + (k - 1) * dz;
    }
  }
}
and
center_coordinate = newArr4(double, nxc, nyc, nzc, 3);
for (int i = 0; i < nxc; i++) {
  for (int j = 0; j < nyc; j++) {
    for (int k = 0; k < nzc; k++) {
      center_coordinate[i][j][k][0] = .5 * (node_coordinate[i][j][k][0] + node_coordinate[i + 1][j][k][0]);
      center_coordinate[i][j][k][1] = .5 * (node_coordinate[i][j][k][1] + node_coordinate[i][j + 1][k][1]);
      center_coordinate[i][j][k][2] = .5 * (node_coordinate[i][j][k][2] + node_coordinate[i][j][k + 1][2]);
    }
  }
}
with
node_xcoord = new double[nxn];
node_ycoord = new double[nyn];
node_zcoord = new double[nzn];
for (int i = 0; i < nxn; i++) node_xcoord[i] = xStart + (i - 1) * dx;
for (int j = 0; j < nyn; j++) node_ycoord[j] = yStart + (j - 1) * dy;
for (int k = 0; k < nzn; k++) node_zcoord[k] = zStart + (k - 1) * dz;
// arrays allocation: cells ---> the first cell has index 1, the last has index ncn-2!
center_xcoord = new double[nxc];
center_ycoord = new double[nyc];
center_zcoord = new double[nzc];
for (int i = 0; i < nxc; i++) center_xcoord[i] = .5 * (node_xcoord[i] + node_xcoord[i + 1]);
for (int j = 0; j < nyc; j++) center_ycoord[j] = .5 * (node_ycoord[j] + node_ycoord[j + 1]);
for (int k = 0; k < nzc; k++) center_zcoord[k] = .5 * (node_zcoord[k] + node_zcoord[k + 1]);
This reduces memory use and the expense of obtaining coordinate positions.
I surmise that node_coordinate and center_coordinate were created in anticipation of generalizing to a deformed logically Cartesian mesh. The rest of the code, however, does not seem to have been consistently written with this in mind. I therefore do not see the value of retaining this implementation, and I would even consider using the alternative accessors

const double &getXN(int X) { return node_xcoord[X]; }
const double &getYN(int Y) { return node_ycoord[Y]; }
const double &getZN(int Z) { return node_zcoord[Z]; }
const double &getXC(int X) { return center_xcoord[X]; }
const double &getYC(int Y) { return center_ycoord[Y]; }
const double &getZC(int Z) { return center_zcoord[Z]; }

based in part on the philosophy that the easiest code to generalize is typically not code that attempts to anticipate generalization but rather code that is clean, simple, and readable. For now I will implement these accessors for use in performance-critical parts of the code (pushing particles and summing moments) unless I hear an objection.
I want to check in two groups of error diagnostics methods, to be accessed via
#include "errors.h"
Firstly, a generic error message utility:
eprintf(char*, varargs);
This prints file, line number, function, and the error message, and then exits. Similarly, a generic warning utility:
Wprintf(char*, varargs);
Secondly, a generic utility to use when no case is selected:
invalid_value_error(int);
invalid_value_error(char *);
invalid_value_error(double);
and
unsupported_value_error(int);
unsupported_value_error(char *);
unsupported_value_error(double);
The most typical use is in the default clause of a switch statement. For example, in a routine whose computation depends on the number of dimensions:
switch (ndims) {
  default:
    unsupported_value_error(ndims);
  case 2:
    b2 = 1 - mbc;
    s2 = S2 + 2 * mbc;
  case 1:
    b1 = 1 - mbc;
    s1 = S1 + 2 * mbc;
  case 0:
    ;
}
Currently the code gives a segmentation fault or faulty output if
XLEN (currently specified in processtopology/VCtopology3D.h)
does not divide nxc (specified in inputfiles/*.inp).
An error message should be issued.
Currently the natural point at which to insert such code is
at the beginning of the Grid3DCU constructor.
Code that I implemented to do this follows.
inline Grid3DCU::Grid3DCU(CollectiveIO * col, VirtualTopology3D * vct) {
  int get_rank();
  if (!get_rank()) {
    fflush(stdout);
    bool xerror = false;
    bool yerror = false;
    bool zerror = false;
    if ((col->getNxc()) % (vct->getXLEN())) xerror = true;
    if ((col->getNyc()) % (vct->getYLEN())) yerror = true;
    if ((col->getNzc()) % (vct->getZLEN())) zerror = true;
    if (xerror) printf("!!!ERROR: XLEN=%d does not divide nxc=%d\n", vct->getXLEN(), col->getNxc());
    if (yerror) printf("!!!ERROR: YLEN=%d does not divide nyc=%d\n", vct->getYLEN(), col->getNyc());
    if (zerror) printf("!!!ERROR: ZLEN=%d does not divide nzc=%d\n", vct->getZLEN(), col->getNzc());
    fflush(stdout);
    bool error = xerror || yerror || zerror;
    if (error) exit(1);
  }
  // add 2 for the guard cells
  nxc = (col->getNxc()) / (vct->getXLEN()) + 2;
  nyc = (col->getNyc()) / (vct->getYLEN()) + 2;
  nzc = (col->getNzc()) / (vct->getZLEN()) + 2;
  [...]
where get_rank() returns the MPI rank of the process.
Particles3Dcomm::getVelocityDistribution() sums particle velocity distributions using:

for (int i = 0; i < nBins; i++) {
  localN = f[i];
  MPI_Allreduce(&localN, &totalN, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
  f[i] = totalN;
}
This generates thousands of all-reduce calls, which is horribly inefficient and will be unacceptable on DEEP's Booster. These lines can be replaced with a single call:
MPI_Allreduce(MPI_IN_PLACE, f, nBins, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
With the currently implemented algorithm, iPic3D is unable to converge to the solution of the Klimontovich equation (molecular dynamics), and in order for the iPic3D solution to converge to the solution of the Vlasov equation, the number of particles per mesh cell must be taken to infinity.
Two simple changes to the algorithm would allow for convergence to the Klimontovich equation (for a fixed number of particles as the mesh resolution goes to infinity) and to the Vlasov equation (for a fixed number of particles per mesh cell as the mesh resolution goes to infinity):
Without this change, particle self-force is unable to become correct if the number of particles per mesh cell is bounded.
Evolving the unsmoothed field allows the self-force to become the correct self-force of a finite-width Gaussian-shaped particle as the mesh is refined and the number of smoothings is increased appropriately. If the unsmoothed field is overwritten with the smoothed field, then this property is lost.
For fuller justification, see this proposal.
This time creating a new issue from the CmPA Organization. Emails should be received by the team members, but I'm trying to send an email to a mailing list that I created:
If anyone receives an email, please add a comment below for testing... thanks!!
Maria and I want to check in code that modifies sumMoments (called by CalculateMoments()) so that the moments of all species are summed in a single OpenMP parallel construct, per the recommendation of Alejandro Duran. This reduces thread creation overhead and accelerates execution on Xeon by about 15%.
In the next version of the code I would like to see:
If you have more ideas, please add them below. We will decide which ones we are going to add to the next version.
We need to use long long for particle IDs to support more than 2 billion (2^31) particles. But there is no reason to support having so many particles in a single MPI process, and there is a potential cost to using long long in a particle loop. Therefore, I am changing to use int, preceded by an assertion that nop is no greater than INT_MAX, in the following two critical methods:
Particles3D::mover_PC()
Particles3Dcomm::interpP2G()
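Using the assert macros described elsewhere in this document, the guard looks roughly like this (a sketch; the placement at the top of each method is my assumption):

#include <climits>   // INT_MAX

// nop is the number of particles in this MPI process; assert that it
// fits in an int before indexing the particle loop with int.
assert_le(nop, (long long) INT_MAX);
for (int i = 0; i < nop; i++)
{
  // ... push particle i, or accumulate its moments ...
}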
There is no need for us to make MPI_Barrier() calls in ipic3d, and not doing so can help to hide communication latency.
Therefore, I want to check in code that uses the following preprocessor definitions:
#define MPI_Barrier(args...)
to remove calls to MPI_Barrier() from the preprocessed code.
To accomplish this, I am placing that definition in a file named ipicdefs.h and including it in all files that make calls to MPI_Barrier:

#include "ipicdefs.h"

Ultimately I have in mind for ipicdefs.h to be included in every ipic3d file, so that it can serve as a simple mechanism for making universal changes like this.
Currently changing the number of MPI processes requires editing VCtopology3D.h to change the following lines
// change these values to change the topology
XLEN = 2;
YLEN = 2;
ZLEN = 1;
nprocs = XLEN * YLEN * ZLEN;
// here you have to set the topology for the fields
PERIODICX = true;
PERIODICY = false;
PERIODICZ = true;
// here you have to set the topology for the Particles
PERIODICX_P = true;
PERIODICY_P = false;
PERIODICZ_P = true;
and then a recompile. cmake (mistakenly) thinks that it is necessary to recompile many files because of the include dependencies.
An initial step is to move the methods of the VCtopology3D class into a file named communication/VCtopology3D.cpp. That way, only a single file needs to be recompiled, followed by a relink.
A subsequent change should be to add a configuration variable to the input file for each of these variables.
Under commit 05082fc, EMfields::B_ext is used to add an external background magnetic field. The implementation needs to be corrected and improved. Lines in Particles3D::mover_PC() such as

Bxl += weight000 * Bx[ix][iy][iz] + Bx_ext[ix][iy][iz];

are incorrect; a correct version would be

Bxl += weight000 * (Bx[ix][iy][iz] + Bx_ext[ix][iy][iz]);

or, better,

Bxl += weight000 * Bx_tot[ix][iy][iz];

where Bx_tot = Bx + Bx_ext is computed at the top of Particles3D::mover_PC().

Also, B_ext and Bx_tot should not be allocated unless this feature is actually turned on; Bx_tot can simply be an alias pointer for B_ext in case this feature is turned off.
In the original commit to the github version of iPic3D, only one iteration was used to process the "rest" of the particles after processing particles P_SAME_TIME at a time for as long as possible. When I removed the P_SAME_TIME loop in commit 30a6ba5, all the particles were consequently advanced with only one iteration of the mover. This should be corrected so that the particle mover iterates NiterMover times.
A general principle of maintainable code is that error messages should be issued as soon as the violation of an assumed condition can be efficiently detected.
Currently configuration options are not checked for validity when they are read in, and only a small number of checks are made later on. Tightly and cleanly restricting configuration options allows us to guarantee backward compatibility in how the configuration file drives behavior. We should validate configuration options on startup and immediately exit with an informative error message upon detecting invalid options.
Note that appropriate resolution of this issue assumes that issue #64 is also resolved.
This is an enhancement code submission.
Particles3Dcomm::interpP2G() loops over all particles and sums the contributions of all particles to the 10 quadratic monomial velocity moments:
rho
Jx, Jy, Jz
pXX, pXY, pXZ, pYY, pYZ, pZZ
Therefore I have implemented a Moments class that can be used to sum these moments.
I have also implemented
EMfields3D::addToSpeciesMoments(const Moments& in, int species)
which adds a set of accumulated moments to the total for a given species.
I use the Moments class in the Particles3Dcomm::interpP2G() method in a block headed by a #pragma omp parallel directive; this block contains a for loop over all particles headed by a #pragma omp for directive. At the conclusion of the parallel block a #pragma omp critical directive is used to add the moments for the particles summed in the OMP thread to the moments for the species being summed. Finer-grain reduction could accelerate this last little step.
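The structure described above looks roughly like this (a sketch with a hypothetical accumulator method name; addToSpeciesMoments is the method introduced in this issue):

#pragma omp parallel
{
  Moments localMoments;            // per-thread accumulator
  #pragma omp for
  for (long long i = 0; i < nop; i++)
  {
    // sum this particle's contributions to the 10 velocity moments
    // (addParticle is a hypothetical name for the accumulation step)
    localMoments.addParticle(/* position, velocity, charge of particle i */);
  }
  // one thread at a time folds its partial sums into the species total
  #pragma omp critical
  EMf->addToSpeciesMoments(localMoments, ns);
}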
Have you ever noticed that for one simulation performed two times with different number of processors or a different cartesian domain decomposition the results are slightly different?
I want to check in some changes that make the MPIdata class a singleton. It is possible that in the future we will want to create multiple instances of the iPic3D solver class in a single process, whereas MPIdata needs to be a singleton by virtue of the fact that MPI_Init(), which is called in the MPIdata initializer, should be called only once. A crucial benefit of this change is that the MPI rank is made available to system-level diagnostics code (debug and error messages).