ipic3d's People

Contributors

alecjohnson, fabsilfab, hegodi, kulmari, murci3lag0, valsusa, volshevsky


ipic3d's Issues

eliminate unnecessary srand calls

Currently Particles3D::particle_repopulator contains code like this:

  srand (vct->getCartesian_rank()+1+ns+(int(MPI_Wtime()))%10000);
  if (vct->getXleft_neighbor() == MPI_PROC_NULL && bcPfaceXleft == 2)
  {
    [...]
  } 

I do not see any need to seed the random number generator more than once in the program, and I would rather eliminate these calls. Seeding with a time-based value destroys the ability to get exactly reproducible results (except in the case that no data generated by calls to rand() is actually used), and there should be a single point in the program that selects whether the seed is deterministic or comes from a time value that differs with each invocation of the program.

Even if the srand calls are retained, I would change the code to:

  if (vct->getXleft_neighbor() == MPI_PROC_NULL && bcPfaceXleft == 2)
  {
    srand (vct->getCartesian_rank()+1+ns+(int(MPI_Wtime()))%10000);
    [...]
  } 

Improve summation of particle moments.

Issue #22 reported poor performance summing moments.
Issue #23 began to address the lack of OpenMP parallelism in summing moments.
Issue #33 reported continued poor performance summing moments.

Initial steps in addressing issue #33 have been to restore the grid coordinate accessors to be inline and, under issue #40, to use one-dimensional arrays rather than three-dimensional arrays to obtain grid coordinates. These changes alone, however, have been insufficient, and summing moments has remained slow. This suggests the need for a more comprehensive restructuring that improves computation speed by eliminating indirection.

It would be possible for a sophisticated compiler to optimize the code that sums moments in Particles3Dcomm::interpP2G by (1) unrolling the triply nested loops that sum from 0 to 1, (2) expanding and subsequently optimizing the inline methods such as EMfields3D::addRho that also have such loops, and (3) factoring out the repeated multiplication by the same unchanging scalars. But this is too much to expect. Summing moments is a small but performance-critical part of the code, and it therefore pays to avoid assuming that the compiler and optimizer are sophisticated enough to perform these optimizations. Thus, I have expanded the loops and factored out constant multiples from sums. Preliminary observations on my laptop indicate that this accelerates computing moments by a factor of roughly 2 or 3.
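
To illustrate the kind of transformation involved, here is a minimal sketch (the function and variable names are illustrative, not the actual iPic3D code) of an unrolled accumulation with the constant multiple factored out:

  // Minimal sketch: accumulate one particle's contribution to the charge
  // density at the 8 surrounding nodes.  The triply nested loop over {0,1}^3
  // is unrolled, and the constant factor q*invVOL is computed once instead of
  // being reapplied inside a generic addRho()-style call for every node.
  inline void add_rho_for_particle(
      double* rho000, double* rho001, double* rho010, double* rho011,
      double* rho100, double* rho101, double* rho110, double* rho111,
      const double w[8], double q, double invVOL)
  {
    const double qv = q * invVOL;   // factored-out constant multiple
    *rho000 += qv * w[0];  *rho001 += qv * w[1];
    *rho010 += qv * w[2];  *rho011 += qv * w[3];
    *rho100 += qv * w[4];  *rho101 += qv * w[5];
    *rho110 += qv * w[6];  *rho111 += qv * w[7];
  }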

In the process of making these changes, I have chosen to move Particles3Dcomm::interpP2G and rename it EMfields3D::sumMoments. With respect to encapsulation, I believe that this is a more appropriate location for the method, since it requires write access to moment data but only read access to particle data. Later we may want to pull the moments out of the EMfields3D class and encapsulate them in a class that handles moments; at that point, the sumMoments method could be moved into that class. It will be useful to have an encapsulated class of moments that can be passed from the accelerator (e.g. DEEP's "booster") to the computing unit responsible for the field solve (e.g. DEEP's "cluster").

I also feel that it is more natural to refer to the operation as summing moments than as interpolating particles to the grid. Interpolation describes only part of what we are doing; it does not cover the whole process of summing moments and is not necessarily involved in alternative ways of representing particles. I could have called the method sumParticleMoments, but since a collection of particles is the first argument to the method, I felt that sumMoments was sufficient. In the future, sumMoments could also incorporate moments from a fluid representation.

In the process of moving this code, I noticed a number of unused methods, such as Particles3Dcomm::calculateWeights, Particles3D::interpP2G_onlyP, and interpP2G_notP, which would need to be moved and reimplemented if they were ever reinstated. I am therefore removing these methods.

I also observed that the abstract base class Particles has Particles3Dcomm as its only child and therefore serves no obvious purpose. While none of the virtual methods of Particles are called in performance-critical loops, I think that we might as well eliminate this class via

typedef Particles3Dcomm Particles;

and by replacing the content of Particles.h with the line

#include "Particles3Dcomm.h"

Does anyone have any input regarding the purpose for which this and other abstract base classes were created?

use long long for particle ID but not to index arrays

As discussed in issue #4, to support tracking more than INT_MAX (usually about 2 billion) particles, we need to use long long rather than int to index them.

Currently the code uses unsigned long in some places and long long in others. On Xeon both of these are 8 bytes, but on my mac unsigned long is only 4 bytes (the same as int), while long long is the full 8 bytes needed to identify billions of particles. Therefore, to support billions of particles on all architectures, I am modifying the code to consistently use long long for particle IDs.

On the other hand, I see no need to support having more than 2 billion particles in a single MPI process, and perhaps indexing with long integers could slow iterating through particles. I am therefore changing all array indices to use int rather than long long.

MIC requires that mpi.h be included before stdio.h

Put

#include <mpi.h>

at the top of

particles/Particles3D.cpp
particles/Particles3Dcomm.cpp

before

#include <iostream>

Otherwise I get an error:

/opt/intel/impi/4.1.0.030/intel64/include/mpicxx.h(95): catastrophic error: #error directive: "SEEK_SET is #defined but must not be for the C++ binding of MPI. Include mpi.h before stdio.h"
  #error "SEEK_SET is #defined but must not be for the C++ binding of MPI. Include mpi.h before stdio.h"

create ipic scripts system

I want to check in the beginnings of an ipic command line system with a calling syntax analogous to git or hg. The initially supported commands are:

  ipic ctags
  ipic help
  ipic help ctags
  ipic help mic

To this end, I am creating a scripts subdirectory.
If this directory is placed in your path, e.g. via

export PATH="$HOME/ipic3d/scripts:$PATH"

then the commands above should work appropriately.

initialize MPI immediately on startup

MPI should be initialized immediately on startup, before initializing the iPic3D library. This allows the iPic3D initialization to issue an error message and exit if appropriate without causing mpirun to complain that MPI_Init was never called.
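
A minimal sketch of the intended startup order (the initialization call shown is a placeholder, not the actual iPic3D API):

  #include <mpi.h>
  #include <cstdio>

  int main(int argc, char **argv)
  {
    MPI_Init(&argc, &argv);          // initialize MPI immediately on startup

    bool ok = true;                  // placeholder for the iPic3D library initialization
    if (!ok) {
      fprintf(stderr, "iPic3D initialization failed\n");
      MPI_Abort(MPI_COMM_WORLD, 1);  // clean, MPI-aware exit; mpirun does not complain
    }

    // ... run the solver ...

    MPI_Finalize();
    return 0;
  }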

create separate solver classes for fields and particles

The c_Solver class should be subdivided into two classes: a FieldSolver class and a ParticleSolver class.

This fundamental restructuring is prompted by the port to the DEEP architecture, where we will solve fields on the cluster and particles on the booster. However, separating the two solvers will bring flexibility benefits that extend far beyond DEEP. The benefits include:

  • ability to replace the particles solver with a fluid model. This is important for coupling iPic3D to fluid models, as SWIFF aims to do. The implicit moment method can be made asymptotic-preserving with respect to fluid models by relaxing the particle distribution to a Maxwellian distribution, e.g. by resampling particles in each mesh cell.
  • ability to replace the fields solver with another fields solver. For the purpose of coupling to MHD, it is important to be able to replace the field solver with an MHD field solver that calculates E from Ohm's law rather than evolving it (or smoothly transition to such a solver). It would also be possible to replace the field solver with an IMEX scheme. In both these cases, the field solve is completely local and involves no communication, so the field solve along with everything else could be kept on the booster side.
  • ability to use a separate set of MPI processes for fields and for particles, with a different number of processes for each, so that these two very different tasks can each be optimally tuned.

The data that needs to be exchanged between these two classes is:

  • FieldSolver → ParticleSolver:

    the electric field: Ex, Ey, Ez.

    While the magnetic field is also needed on the particle side to push the particles, it is easily updated from the electric field, so there is no need to transfer magnetic field data except on the first iteration of the solver. The only reason magnetic field data might need to be sent again would be divergence of the updated fields on the two sides due to accumulated machine arithmetic error, but such error should accumulate very slowly.

  • ParticleSolver → FieldSolver:
    • For IMM, the data that must be sent consists of the hatted moments Ĵx, Ĵy, and Ĵz and the density ρs of each species. The densities ρs are used on the cluster side to construct the "implicit susceptibility" tensor χ that characterizes the response of the average current to the average electric field. Note that the hatted moment ρ̂ can be constructed from Ĵ and ρs. In the case of a large number (> 8) of species (or of a dusty plasma), one would instead calculate the 9 components of χ on the booster side and send them.
      The arrays in the code that need to be sent are Jxh, Jyh, Jzh, and rhons.

      Note that rhoc (used in the Poisson correction) is computed from rho, that rho is computed by summing over rhons, and that rhoh is easily computed from rhoc and Jxh.
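
A hypothetical interface sketch (class and method names are illustrative, not the actual iPic3D classes) of the data exchange listed above:

  // FieldSolver lives on the Cluster; ParticleSolver lives on the Booster.
  class FieldSolver {
  public:
    // receives the hatted currents and per-species densities from the Booster
    void setMoments(const double* Jxh, const double* Jyh, const double* Jzh,
                    const double* rhons, int ns);
    // exposes the updated electric field for transfer to the Booster
    // (the magnetic field needs to be sent only on the first cycle)
    const double* getEx() const;
    const double* getEy() const;
    const double* getEz() const;
  };

  class ParticleSolver {
  public:
    // receives the electric field (and, on the first cycle, the magnetic field)
    void setFields(const double* Ex, const double* Ey, const double* Ez);
    // exposes the summed hatted moments and densities for the field solve
    void getMoments(double* Jxh, double* Jyh, double* Jzh, double* rhons) const;
  };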

Follow-up: communicate the magnetic field separately, in addition to the electric field. I have decided to modify the above proposal: the smoothed versions of E and B needed to push the particles are calculated and sent as soon as E or B is updated, and on the Booster side this data is used, upon receipt, to construct the SoA fields needed to push particles.

Justification. While it is technically possible to communicate only the electric field from the field solver to the particle solver, it will be simpler and almost certainly more efficient to communicate both fields. The reason is that the field used to push particles needs to be a modified version of the field evolved in the field solver. Smoothing must be applied to the field several (currently 3) times before it can be used to evolve particles, and each of these smoothings requires an exchange of boundary data. Moreover, updating the magnetic field from the electric field requires calculating a curl, which also requires an exchange of boundary data. Even a 10x10x10 mesh is half boundary cells, and unless the dimensions of the subgrid are unfeasibly large, a substantial fraction of the mesh cells are boundary cells. The transformations that prepare the field data for use by the particle solver involve successive synchronizations and are more communication-intensive than the mere task of transferring the magnetic field components. By transferring the magnetic field along with the electric field, we can do all these transformations on the Cluster, where we expect them to be performed more efficiently, without repeating any of them. We could even perform on the Cluster the transposition of field data to the AoS format (see issue #52) preferred for cache performance in the particle mover. On the other hand, transposition is much less expensive than communication, and doing the transposition on the Booster allows us to send the magnetic field separately and avoids communicating the 33% additional garbage data needed in the AoS format for alignment. In the current algorithm, the old field is used to push particles (for consistency with the implicit moment field update), so sending the magnetic field as soon as it is updated means that its communication can occupy an entire cycle of the algorithm before it needs to complete.

Communicating moment boundary data on the Cluster side.
After accumulating moments on the Booster, hatted moments need to be computed on the Booster and communicated to the field solver running on the Cluster. Computing the hatted current involves evaluating the divergence of pressure tensor quantities, which requires communicating boundary data. This boundary communication can be delayed and reduced until the hatted moments have reached the Cluster, however, if one is willing to accept a loss of numerical precision from summing nearly cancelling terms involving the divergence of the pressure tensor in boundary cells or nodes. (To avoid this cancellation error one must communicate boundary node data for each pressure tensor component of each species.) With delayed communication, only five moments need boundary communication, but it becomes necessary to communicate and sum over a widened halo that is three layers thick, including not just boundary nodes but also ghost nodes and the first layer of interior nodes. Communicating node values of the 6+6=12 pressure tensor components avoids the loss of numerical precision and reduces the number of layers of nodes that must be communicated at the end from three to the single layer of shared nodes. The node values of the other moments and of the divergence of the pressure tensor do not need to be communicated prior to summing the hatted moments (which reduces the number of communications needed by 6) and sending them to the Cluster (where this can be done faster).

Support -fno-exceptions

I want to check in code to eliminate the use of exceptions in iPic3D.

Allowing exceptions has performance penalties. In particular, while developing an array class for iPic3D, I observed that looping through array elements took three times as long if the body of the destructor of the array class was nonempty. This performance penalty disappeared when I compiled with the -fno-exceptions flag.

I see no reason to retain exception handling in iPic3D. This is not an event-driven application. All current uses of exceptions serve to generate error messages, which can be handled adequately with simple calls to eprintf(). Therefore, I am going through the code and replacing each throwing of an exception, such as

  PSK::OutputException e("make_dataset fails for " + tag, "HDF5OutputAdaptor::write(double* array)");
  throw e;

with printing of an error message such as

  eprintf("make_dataset fails for %s", tag.c_str());

Note that the eprintf() macro automatically prints file, line number, and method name, so there is no need to add any of that information to the call to eprintf().

In my initial implementation of this change I simply comment out all occurrences of try, all catch blocks, and all calls to throw, replacing each call to throw with a call to eprintf. I leave it for later to decide whether to completely eliminate the Exception and OutputException classes in the PSK namespace or whether, for some reason, to revert to supporting the possibility of exceptions via preprocessor branching directives.

Improve TimeTasks

Issue #17 introduced the TimeTasks class to implement basic profiling.

A simpler and more general approach is possible based on macros and automatic destructors.

Create a set of time_task macros to be used as follows:

  void mymethod()
  {
    // record time from this point to the end of this scope.
    timeTasks_set_main_task(Task::PARTICLES);
    ...
    // block that does communication
    {
        // mode is reset to communication until end of scope
        timeTasks_set_communicating();
        ...
    }
  }

Improvements:

  • use of destructors: the time_task macros construct a named (non-anonymous) local object whose destructor is called when it goes out of scope. This eliminates the need to make separate calls to start() and end() methods (see the sketch below).
  • defining subtasks is now supported.
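
A minimal sketch of the destructor-based mechanism the macros rely on (illustrative helper names, not the actual TimeTasks implementation):

  #include <mpi.h>

  class TimeTasks_scoped_timer           // hypothetical helper class
  {
  public:
    TimeTasks_scoped_timer(int task) : task_(task), start_(MPI_Wtime()) {}
    ~TimeTasks_scoped_timer()
    {
      double elapsed = MPI_Wtime() - start_;
      // timeTasks.add_time_for_task(task_, elapsed);  // accumulate in the singleton
      (void)elapsed;
    }
  private:
    int task_;
    double start_;
  };

  // The macro expands to a named local instance, so timing ends automatically
  // at the closing brace of the enclosing scope.
  #define timeTasks_set_main_task(task) \
    TimeTasks_scoped_timer timeTasks_scoped_timer_instance(task)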

Improve multidimensional arrays implementation

I want to check in code that changes the implementation of iPic3D arrays.

iPic3D currently uses chained pointers (e.g. double *** arr3;) rather than array classes. This has a number of deficiencies:

  • There is no way to turn on automatic bounds checking.
  • De-referencing chained pointers can be much less efficient than alternative multidimensional array implementations for some compilers and architectures.
  • Memory must be explicitly managed.
  • For reliably good performance, the programmer must explicitly inform the compiler that arrays are aligned at each point in the code where they are accessed.

I have therefore implemented array classes according to the following specifications:

  • The user should be able to choose via a compile option whether the underlying array class is one of the following:
    • chained pointers
    • calculated index of an underlying one-dimensional array
    • built-in C arrays using fixed dimensions defined at compile time.
  • Automatic bounds checking should be available via a compile option (e.g. -DCHECK_BOUNDS), with no performance penalty when turned off.
  • Automatic memory management should be available.
  • Memory allocation should be aligned, and the compiler should be informed of this alignment in the array accessor methods.

I have implemented extensive performance tests of these arrays, and have compared performance of native arrays, chained pointers, and calculated index access for both g++ and icpc (Intel) and for both my laptop and the Xeon processor. With icpc on the Xeon processor, the performance differences can be significant and hard to predict. With g++, performance differences are much smaller. Using the -O3 flag to turn on optimization, I have verified that there is no performance difference resulting from any of the choices of syntax considered in the following paragraphs.

To avoid the need for systematic modifications to the code, I have implemented access of elements via bracket syntax, e.g. intarr[i][j]=5;. If I were writing a new code, I would prefer to use parenthesis syntax, e.g. myarr(i,j)=5;. But requiring such syntax in iPic3D would entail systematically changing about 1000 lines of code, mostly in the fields classes. Given the number of forked versions of iPic3D that need to be merged, it is not desirable to impose such a change at this time. We could use a script to change to such notation in the future; making the switch would be most painless at the end of a merge point in the software development cycle. Given the nature and the pressures of academic work, realizing such a merge point may not be easy.

Ideally, I prefer syntax which distinguishes write access from read access. At a minimum, we should support read-only array access so that we can pass const array references to methods that use array information without modifying it. Therefore, I have implemented get methods, e.g. var = intarr.get(i,j);.

For completeness, I have also implemented set methods, e.g. intarr.set(i,j, myvar);, allowing the user to clearly indicate write-access.
I have also implemented a fetch() accessor, which allows both read and write access via:

  myArr.fetch(i,j) = 5;

Of course we could replace fetch with operator(), allowing the usage

  myArr(i,j) = 5;

as in the implementation of Vicenc Beltran and his collaborators, but I preferred being able to search for the string fetch in order to identify potential write-access. I will consider feedback offered in response to this issue and am willing to change fetch to operator() if this is the consensus desideratum.
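
A minimal two-dimensional sketch (illustrative, not the actual iPic3D array classes) of the accessor styles discussed above:

  #include <cstddef>

  class array2_double
  {
  public:
    array2_double(size_t nx, size_t ny)
      : nx_(nx), ny_(ny), data_(new double[nx*ny]()) {}
    ~array2_double() { delete [] data_; }

    // read-only access, usable through a const reference
    double get(size_t i, size_t j) const { return data_[i*ny_ + j]; }
    // explicit write access
    void set(size_t i, size_t j, double v) { data_[i*ny_ + j] = v; }
    // read/write access; "fetch" is greppable, unlike operator()
    double& fetch(size_t i, size_t j) { return data_[i*ny_ + j]; }
    // bracket syntax via a row pointer, so existing arr[i][j] code still works
    double* operator[](size_t i) { return data_ + i*ny_; }

  private:
    array2_double(const array2_double&);             // non-copyable in this sketch
    array2_double& operator=(const array2_double&);
    size_t nx_, ny_;
    double* data_;
  };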

eliminate abstract base classes to improve performance

I am eliminating the following abstract base classes

  Grid
  Field

and replacing them with typedefs that map each abstract base class to the unique class that currently inherits from it:

  typedef Grid3DCU Grid;
  typedef EMfields3D Field;

Virtual methods require run-time dereferencing of pointers, which for frequently called functions such as Grid3DCU::getXN() and EMfields3D::getEx() hurts performance. Furthermore, calls to such virtual functions prevent vectorization on the MIC.

sort particles per mesh cell

Particles are currently sorted by MPI subdomain. If one only uses MPI parallelization, then this allows for efficient pushing of particles and summing of moments as long as 10*nxn*nyn*nzn*sizeof(double) fits in cache. But to support effective use of vector units (such as on Intel's Xeon and Xeon Phi processors) to push particles and sum moments, we need to sort particles at the level of individual mesh cells.

Sorting particles at the cell level will have benefits that extend beyond vectorization. Benefits of this fine-grain particle locality include:

  • ability to relax the particle distribution to Maxwellian. This will allow for smooth coupling of iPic3D to fluid models.
  • ability to efficiently merge and appropriately split or resample particles. This can be used to maintain appropriate resolution of velocity space.
  • ability to collide particles. This would allow iPic3D to be extended to problems that involve interacting species.

If particles are sorted per mesh cell then we can vectorize the task of a single iteration of pushing all particles positioned in a given cell, because pushing them uses the same field data. Note that with each iteration of the particle solver, the position (xp, yp, zp) used to interpolate from the field is updated, and the particles need to be resorted.

To sort particles at the cell level, we can sort by the Z-value of the cell in the mesh of each MPI process. The Z-value is computed by interleaving the bits of the x, y, and z array indices. We can adapt the multisort code that Maria Elena and I worked with at the workshop at Barcelona Supercomputing Centre in July in order to do fast, thread-safe sorting based on the Z-value. An alternative would be to simply walk through the particles and move them if necessary to the correct cell, but doing this in a thread-safe way would take some care.
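
A minimal sketch of computing the Z-value by interleaving bits (this assumes each cell index fits in 10 bits, i.e. at most 1024 cells per dimension in an MPI subdomain):

  #include <cstdint>

  // spread the low 10 bits of v so that two zero bits separate consecutive bits
  inline uint32_t spread_bits(uint32_t v)
  {
    v &= 0x3ff;
    v = (v | (v << 16)) & 0x030000ff;
    v = (v | (v <<  8)) & 0x0300f00f;
    v = (v | (v <<  4)) & 0x030c30c3;
    v = (v | (v <<  2)) & 0x09249249;
    return v;
  }

  // Z-value (Morton index) of cell (ix, iy, iz) within the local mesh
  inline uint32_t z_value(uint32_t ix, uint32_t iy, uint32_t iz)
  {
    return spread_bits(ix) | (spread_bits(iy) << 1) | (spread_bits(iz) << 2);
  }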

build/exec should contain everything needed to run iPic3D

This specification was written in consultation with Jorge Amaya.

cmake supports the ability to compile in a separate directory without modifying the original source tree. By convention this directory is called build.

We envision that everything needed to run ("execute") iPic3D be placed in build/exec: the executable (iPic3D) and the configuration ".inp" file. (Formerly this directory was called "build/work", but this name seems less clear. We prefer not to call it "build/run", because one might consider "run" to refer to an individual simulation.) The executable will be called "build/exec/iPic3D". (Formerly it was called "iPIC3D", but we are trying to be consistent about capitalization.)

Avoid including .h files in .h files.

Currently many header files include system header files or include other header files unnecessarily.

The number of .h files included in a .h file should be minimized, for two reasons:

  1. The practice of including .h files in .h files can dramatically increase the time required to compile. In particular, including the .h files from the iostream libraries results in typically 20,000 lines of declarations that must be processed by the compiler for every .cpp file for which they get included. I reduced compile time of DOGPACK from 45 seconds to 15 seconds by eliminating such inclusion of the iostream library.
  2. The practice of including .h files in .h files creates potential code dependencies, which makes it much harder to sort out what really depends on what, both for the programmer and for code-analyzing software. Related to this, cmake analyzes the .h file include graph and generates a makefile that incorporates all dependencies so that you never have to type "make clean". Including .h files in .h files can greatly increase the number of files that cmake falsely thinks need to be recompiled when you type make. Taken to an extreme, this can make cmake essentially useless with respect to reducing recompilation.

The result of including .h files in .h files is that (2) you have to recompile everything much more frequently and (1) it takes much longer to do so. It's a compounding effect.

To reduce these problems to a minimum, I recommend that we do as follows:

  • Do not include a .h file in a .h file, with two exceptions:
    1. forward declaration headers.

      If you need to forward declare iostreams, use
      #include <iosfwd>. If you need array templates, use
      #include "arraysfwd".

    2. headers from classes from which you need to inherit.

      Note that there is no need to include headers of classes that you are merely referring to as arguments to a method or as members of a class. Just forward-declare the class and use pointers and references:

           class MyClass;   // forward declaration
           void myfunction(MyClass& myClass);
           class NewClass {
             MyClass *myClass;
           };
      

      To summarize, you only need to include a .h file in a .h file to access an interface.

Alec

vectorize particle pusher

I have modified Particles3D::mover_PC() for vectorization on the MIC. Changes:

  • The attempt to vectorize by hand has been removed. TimeTasks profiling indicated that hand vectorization was hurting rather than helping performance. The choice between OpenMP and SIMD parallelization should be left to the compiler and/or to high-level user configuration.
  • I have inserted the directives
    #pragma omp parallel for
    and
    #pragma simd
    prior to the loop over particles.
  • I have expanded loops which were preventing vectorization (I suppose because of some defect in the compiler).
  • I have changed multidimensional array access to use a calculated one-dimensional index rather than de-referencing a chain of pointers. This seems to double access speed in the vectorized code, although in the non-vectorized code it does not seem to make a difference.

Note that as yet vectorization does not accelerate performance of mover_PC() on the MIC Xeon Phi coprocessor, but the changes above prevent vectorization from hurting performance.
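
A minimal sketch of the loop structure described above (the function and arrays are placeholders, not the actual mover):

  // OpenMP parallelizes over particles; #pragma simd asks the compiler to
  // vectorize the loop body.  Hand-written vector code is avoided entirely.
  void push_positions(int nop, double* x, const double* u, double dt)
  {
    #pragma omp parallel for
    #pragma simd
    for (int p = 0; p < nop; p++) {
      // ... field interpolation and velocity update omitted ...
      x[p] += u[p] * dt;   // stand-in for the actual position update
    }
  }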

remove asgArr3 and asgArr4

Under commit e3c5dca, methods named asgArr3 and asgArr4 were introduced and used in Particles3D::mover_PC. These methods do nothing and should be deleted. Lines such as

double ***Ex = asgArr3(double, grid->getNXN(), grid->getNYN(), grid->getNZN(), EMf->getEx());

can be replaced with the equivalent code

double ***Ex = EMf->getEx();

code diagnostics should be atomic per thread

This is about code diagnostics fixes that I want to be merged into the main code.

The error, debug, and warning messages made available by including

  debug.h
  errors.h
  asserts.h

were not implemented in an atomic manner. As a consequence, output from code diagnostic messages from multiple processes is often interleaved, making it difficult to determine which output is coming from which thread.

I am fixing this by first writing the message to a string and then printing the string to standard output with a single call to fprintf.

Note that I have also changed the output format so that messages indicate process and thread number when appropriate.
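
A minimal sketch of the fix (illustrative names, not the actual debug.h implementation): the whole message is formatted into a local buffer first and then emitted with a single fprintf call, so lines from different processes and threads cannot be interleaved character by character.

  #include <cstdio>
  #include <cstdarg>

  void print_diagnostic(int rank, int thread, const char* fmt, ...)
  {
    char buf[1024];
    int n = snprintf(buf, sizeof(buf), "(%d.%d) ", rank, thread);

    va_list args;
    va_start(args, fmt);
    vsnprintf(buf + n, sizeof(buf) - n, fmt, args);
    va_end(args);

    fprintf(stdout, "%s\n", buf);   // single call keeps the message in one piece
    fflush(stdout);
  }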

support second order accuracy in time

iPic3D is currently only first-order accurate in time. Second-order accuracy in time requires that particles be pushed with field values that approximate time-averaged values and that the implicit susceptibility be calculated with field values and densities that approximate time-averaged values.

To achieve this, I would introduce support for options something like:

  PushWithBatTime = .5 # default: 0
  PushWithEatTime = .5 # default: 1
  ImplSusceptTime = .5 # default: 0

More details are posted here.

very bad performance of the interpolation step?

Stefano (@markidis) writes:

I have downloaded the iPIC3D version from github and compared it with my version (the one you started from), running with one core on my macair. I use the same inputfile and the same amount of IO. I use the not-optimized Alloc.h.

My version finishes the problem in 79.6 seconds, while the github version finishes in 113 seconds. My feeling is that this is in part due to the use of a new class for the moments that introduces an overhead. I am not sure about this, but we should investigate more. Would it be possible to have the old EMfields3D.h and compare its performance with the new one?

Maybe we should introduce a test regarding performance as well? While I think a 2-5% loss of performance might be acceptable in some cases, a 30% loss of performance is quite bad.

Thank you very much, stefano.

support cross-compiling in cmake

This is code that Maria Elena wrote that we want to check in.

On most high-performance computing platforms, iPic3D must be cross-compiled.
Therefore, we should support cross-compiling in cmake.

We are creating a cmake subdirectory for cmake configuration files.

WriteOutputParallel leaks memory

Commit 7cbe28c introduced a memory leak in inputoutput/WriteOutputParallel.cpp.

In WriteOutputParallel(), calls to EMf->getExc(grid)
etc. allocate memory that is not deleted.

This is a massive memory leak; the amount of memory leaked equals the amount of data written.
A better solution would be to allocate a single buffer and reuse it for each variable written.
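
A hypothetical sketch of the single-buffer approach (fill_buffer_with() and write_variable() are placeholders, not the actual iPic3D or HDF5 calls):

  // Hypothetical helpers: copy one field component into buf; write buf to file.
  void fill_buffer_with(const char* name, double* buf);
  void write_variable(const char* name, const double* buf);

  void write_all_fields(int nxn, int nyn, int nzn)
  {
    double* buffer = new double[nxn * nyn * nzn];   // allocated once
    const char* names[] = { "Ex", "Ey", "Ez", "Bx", "By", "Bz" };
    for (int v = 0; v < 6; v++) {
      fill_buffer_with(names[v], buffer);   // reuse the same buffer each time
      write_variable(names[v], buffer);
    }
    delete [] buffer;                       // freed exactly once
  }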

Code should be organized to accelerate compilation.

I am creating this issue as a header under which to check in future modifications for the purpose of accelerating compilation.

We can accelerate compilation and recompilation by observing the following principles.

Avoid including a .h file from a .h file. This hurts compilation speed in two ways.

First, recompilation speed is hurt by including unnecessary header files, because cmake treats every header file inclusion as an actual code dependency.

Second, compilation speed is hurt when the included header files expand to a large number of lines, because in many compilers the preprocessor must process all the included files.

In particular, the C++ stream header files are notoriously bloated, typically expanding to 20000 lines that the preprocessor must parse for every .cpp file (object module) unless sophisticated techniques are used such as precompiled headers. It is better not to write code that assumes a sophisticated compiler.

This leads to the second principle: Avoid creating a large number of .cpp (implementation) files that include large header files.

If we want to put every class implementation in its own implementation file, then we should avoid including the iostream class except in classes that are really doing file I/O and are not just printing error messages. One of the reasons that I implemented the functionality in #include "error.h", which defines eprintf(), invalid_value_error(), and unsupported_value_error() methods (see issue #29), was to avoid the need to include iostream classes merely to print simple error messages.

In general, we can accelerate compilation by separating out declarations and headers and by consolidating implementation files. For example, if class A needs to contain an instance of class B, and you make A contain a pointer to B, then you do not need to include B.h in A.h; you only need the forward declaration class B; in A.h. If a large number of small classes need the same large set of header files, it may be appropriate to implement them all in a common .cpp file.
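
A minimal sketch of this pattern (illustrative class names):

  // --- A.h ---
  class B;             // forward declaration instead of #include "B.h"

  class A {
  public:
    A();
    ~A();
  private:
    B* b;              // pointer member: the full definition of B is not needed here
  };

  // --- A.cpp ---
  // #include "A.h"
  // #include "B.h"    // the full definition of B is needed only here
  // A::A() : b(new B) {}
  // A::~A() { delete b; }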

I more than doubled the speed of compilation of another software package (DoGPack) by implementing changes along these lines.

Use separate representation of electromagnetic field to push particles

I am changing EMfields3D to use separate arrays to solve the electromagnetic field and to represent the electromagnetic field used to push particles. I am declaring the new field representation as a four-dimensional array of type pfloat (float or double according to user configuration):

  array4_pfloat EMfields3D::fieldForPcls(nxn,nyn,nzn,6)

The following benefits are anticipated:

  • One can add a background field, as was attempted under the commit discussed in issue #38.
  • This is precisely the data that needs to be transferred from the field solver to the particle pusher (e.g. from the cluster to the booster in the DEEP architecture). When single precision is used for particles (i.e. pfloat=float), only half as much data needs to be transferred from the cluster to the booster.
  • When pushing particles, array access is accelerated (by roughly 30% on Xeon), due to the fact that a single call of the form
          arr1_pfloat_get field000 = fieldForPcls[ix][iy][iz];
        
    can be made followed by six one-dimensional array accesses of the form
          Bxl += weight000 * field000[0];
          Byl += weight000 * field000[1];
          Bzl += weight000 * field000[2];
          Exl += weight000 * field000[3];
          Eyl += weight000 * field000[4];
          Ezl += weight000 * field000[5];
        
    rather than having to recalculate a 3-dimensional subscript six times:
          Bxl += weight000 * Bx[ix][iy][iz];
          Byl += weight000 * By[ix][iy][iz];
          Bzl += weight000 * Bz[ix][iy][iz];
          Exl += weight000 * Ex[ix][iy][iz];
          Eyl += weight000 * Ey[ix][iy][iz];
          Ezl += weight000 * Ez[ix][iy][iz];
        

create assert_ macros to succinctly enforce conditions

I have implemented macros to enforce conditions:

  assert_eq(a,b)
  assert_ne(a,b)
  assert_gt(a,b)
  assert_lt(a,b)
  assert_ge(a,b)
  assert_le(a,b)
  assert_divides(a,b)

where the boolean operator names follow fortran/latex conventions.

For example, one might use:

 inline Grid3DCU::Grid3DCU(CollectiveIO * col, VirtualTopology3D * vct) { 
    assert_divides(vct->getXLEN(),col->getNxc());
    assert_divides(vct->getYLEN(),col->getNyc());
    assert_divides(vct->getZLEN(),col->getNzc());
    // add 2 for the guard cells
    nxc = (col->getNxc()) / (vct->getXLEN()) + 2;
    [...]

If the asserted condition does not hold, an error message is printed, e.g.:

ERROR in file grids/Grid3DCU.h, line 171, function Grid3DCU
    assertion failed: vct->getXLEN()(divides)col->getNxc(), i.e., 2(divides)63

The assert_* macros are intended as an extension of the system assert()
macro accessible via

  #include <assert.h>

I am making these macros available e.g. via

  #include "utility/asserts.h"

To conform with the behavioral description of assert available via

  man assert

if NDEBUG is defined then all these assert_* macros are compiled out
so that there is no check made and no performance penalty.
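
A minimal sketch (illustrative, not the actual asserts.h) of how one of these macros can be built so that it reports file, line, function, and both values before aborting, and compiles out under NDEBUG:

  #include <cstdio>
  #include <cstdlib>

  #ifdef NDEBUG
    #define assert_divides(a, b) ((void)0)   // compiled out: no check, no cost
  #else
    #define assert_divides(a, b) \
      do { \
        if ((b) % (a) != 0) { \
          fprintf(stderr, \
            "ERROR in file %s, line %d, function %s\n" \
            "\tassertion failed: %s(divides)%s, i.e., %lld(divides)%lld\n", \
            __FILE__, __LINE__, __func__, #a, #b, \
            (long long)(a), (long long)(b)); \
          abort(); \
        } \
      } while (0)
  #endif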

create dprint* macros for efficient debug

I have implemented macros such as

  dprintf(format_string, ....)

which print the function, file, line, and message when debug is turned on.
Processor rank is prefixed to the message.

In addition, I have implemented

  dprint(intvar)
  dprint(doublevar)

macros which are equivalent to

  dprintf("intvar==%d", intvar)
  dprintf("doublevar==%24.16e", doublevar)

More details remain to be specified, such as restricting output to the lead MPI process (rank 0) and selecting a debug level.

Created TimeTasks class for coarse-level profiling

I am creating

  utility/TimeTasks.h

implemented in

  utility/diagnostics.cpp

to perform coarse-level self-profiling.

The essential tasks to profile are:

  • computing moments
  • computing fields
  • pushing particles

The time spent in each task is subdivided into

  • computation
  • communication

This class is conceived of as a singleton.

Statistics are accumulated via calls to start_() and end_() methods at the beginning and end of major procedure calls and at the beginning and end of communication routines such as those found in "communication/ComNodes3D.h". The start() and end() methods call MPI_Wtime().

Detailed behavior regarding accumulation or averaging of statistics across MPI processes and/or cycles of the algorithm is not yet specified.

rhocs last dimension is allocated incorrectly

In EMfields.cpp

    rhons = newArr4(double, ns, nxn, nyn, nzn);
    rhocs = newArr4(double, ns, nxc, nyc, nzn);

This should be

    rhons = newArr4(double, ns, nxn, nyn, nzn);
    rhocs = newArr4(double, ns, nxc, nyc, nzc);

still severe problem with interpolation step. Problem related to Moment class?

Hi,

I still have severe problems related to the interpolation step on my MacAir. One of my runs gives me this:

  === timing information for cycle ===
  |  moms   flds   pcls   Bfld   cycl
  | 11.19   0.25   2.28   0.00  13.72   (total time)
  |  0.01   0.03   0.20   0.00   0.24   (communication)
  | 11.19   0.22   2.08   0.00  13.49   (computation)
  |  MOMS   comm   FLDS   comm   PCLS   comm   CYCL   comm
  | 11.19   0.01   0.25   0.03   2.28   0.20  13.72   0.24

I doubt that this problem is related to the particle classes, since that issue should be fixed. I am starting to think that it is more related to the Moments class in EMfields3D.h.

Any idea how to fix this?

I will check if changing compiler flags can mitigate this problem tonight. Thank you very much, stefano.

replace node_coordinate and center_coordinate arrays

In the Grid3DCU class, the implementation

  double &getXN(int X, int Y, int Z) { return (node_coordinate[X][Y][Z][0]); }
  double &getYN(int X, int Y, int Z) { return (node_coordinate[X][Y][Z][1]); }
  double &getZN(int X, int Y, int Z) { return (node_coordinate[X][Y][Z][2]); }
  double &getXC(int X, int Y, int Z) { return (center_coordinate[X][Y][Z][0]); }
  double &getYC(int X, int Y, int Z) { return (center_coordinate[X][Y][Z][1]); }
  double &getZC(int X, int Y, int Z) { return (center_coordinate[X][Y][Z][2]); } 

can be replaced with

  const double &getXN(int X, int Y, int Z) { return node_xcoord[X];}
  const double &getYN(int X, int Y, int Z) { return node_ycoord[Y];}
  const double &getZN(int X, int Y, int Z) { return node_zcoord[Z];}
  const double &getXC(int X, int Y, int Z) { return center_xcoord[X];}
  const double &getYC(int X, int Y, int Z) { return center_ycoord[Y];}
  const double &getZC(int X, int Y, int Z) { return center_zcoord[Z];}  

if in the constructor we replace

node_coordinate = newArr4(double, nxn, nyn, nzn, 3);  // 0 -> X, 1 -> Y, 2-> Z
for (int i = 0; i < nxn; i++) {
  for (int j = 0; j < nyn; j++) {
    for (int k = 0; k < nzn; k++) {
      node_coordinate[i][j][k][0] = xStart + (i - 1) * dx;
      node_coordinate[i][j][k][1] = yStart + (j - 1) * dy;
      node_coordinate[i][j][k][2] = zStart + (k - 1) * dz;
    }
  }
}

and

center_coordinate = newArr4(double, nxc, nyc, nzc, 3);
for (int i = 0; i < nxc; i++) {
  for (int j = 0; j < nyc; j++) {
    for (int k = 0; k < nzc; k++) {
      center_coordinate[i][j][k][0] = .5 * (node_coordinate[i][j][k][0] + node_coordinate[i + 1][j][k][0]);
      center_coordinate[i][j][k][1] = .5 * (node_coordinate[i][j][k][1] + node_coordinate[i][j + 1][k][1]);
      center_coordinate[i][j][k][2] = .5 * (node_coordinate[i][j][k][2] + node_coordinate[i][j][k + 1][2]);
    }
  }
}

with

node_xcoord = new double[nxn];
node_ycoord = new double[nyn];
node_zcoord = new double[nzn];
for (int i=0; i<nxn; i++) node_xcoord[i] = xStart + (i - 1) * dx;
for (int j=0; j<nyn; j++) node_ycoord[j] = yStart + (j - 1) * dy;
for (int k=0; k<nzn; k++) node_zcoord[k] = zStart + (k - 1) * dz;
// arrays allocation: cells ---> the first cell has index 1, the last has index ncn-2!
center_xcoord = new double[nxc];
center_ycoord = new double[nyc];
center_zcoord = new double[nzc];
for(int i=0; i<nxc; i++) center_xcoord[i] = .5*(node_xcoord[i]+node_xcoord[i+1]);
for(int j=0; j<nyc; j++) center_ycoord[j] = .5*(node_ycoord[j]+node_ycoord[j+1]);
for(int k=0; k<nzc; k++) center_zcoord[k] = .5*(node_zcoord[k]+node_zcoord[k+1]); 

This reduces memory use and the expense of obtaining coordinate positions.

I surmise that node_coordinate and center_coordinate were created in anticipation of generalizing to a deformed logically cartesian mesh. The rest of the code, however, does not seem to have been consistently written with this in mind. I therefore do not see the value of retaining this implementation, and I would even consider using the alternative accessors

const double &getXN(int X) { return node_xcoord[X];}
const double &getYN(int Y) { return node_ycoord[Y];}
const double &getZN(int Z) { return node_zcoord[Z];}
const double &getXC(int X) { return center_xcoord[X];}
const double &getYC(int Y) { return center_ycoord[Y];}
const double &getZC(int Z) { return center_zcoord[Z];}

based in part on the philosophy that the easiest code to generalize is typically not code that attempts to anticipate generalization but rather code that is clean and simple and readable. For now I will implement these accessors for use in performance-critical parts of the code (pushing particles and summing moments) unless I hear an objection.

Create errors.h for error diagnostics

I want to check in two groups of error diagnostics methods, to be accessed via

  #include "errors.h"

Firstly, a generic error message utility:

  eprintf(char*, varargs);

This prints file, line number, function, and the error message, and then exits. Similarly, a generic warning utility:

  Wprintf(char*, varargs);

Secondly, a generic utility to use when no case is selected:

  invalid_value_error(int);
  invalid_value_error(char *);
  invalid_value_error(double);

and

  unsupported_value_error(int);
  unsupported_value_error(char *);
  unsupported_value_error(double);

The most typical use is in the default clause of a switch statement. For example, in a routine whose computation depends on the number of dimensions:

    switch(ndims)
    {
        default: unsupported_value_error(ndims);
        case 2: b2=1-mbc; s2=S2+2*mbc;
        case 1: b1=1-mbc; s1=S1+2*mbc;
        case 0: ;
    }

XLEN divides nxc should be enforced

Currently the code gives a segmentation fault or faulty output if
XLEN (currently specified in processtopology/VCtopology3D.h)
does not divide nxc (specified in inputfiles/*.inp).

An error message should be issued.
Currently the natural point at which to insert such code is
at the beginning of the Grid3DCU constructor.

Code that I implemented to do this follows.

  inline Grid3DCU::Grid3DCU(CollectiveIO * col, VirtualTopology3D * vct) {
    int get_rank();
    if (!get_rank())
    {
      fflush(stdout);
      bool xerror = false;
      bool yerror = false;
      bool zerror = false;
      if ((col->getNxc()) % (vct->getXLEN())) xerror = true;
      if ((col->getNyc()) % (vct->getYLEN())) yerror = true;
      if ((col->getNzc()) % (vct->getZLEN())) zerror = true;
      if (xerror) printf("!!!ERROR: XLEN=%d does not divide nxc=%d\n", vct->getXLEN(), col->getNxc());
      if (yerror) printf("!!!ERROR: YLEN=%d does not divide nyc=%d\n", vct->getYLEN(), col->getNyc());
      if (zerror) printf("!!!ERROR: ZLEN=%d does not divide nzc=%d\n", vct->getZLEN(), col->getNzc());
      fflush(stdout);
      bool error = xerror || yerror || zerror;
      if (error) exit(1);
    }
    // add 2 for the guard cells
    nxc = (col->getNxc()) / (vct->getXLEN()) + 2;
    nyc = (col->getNyc()) / (vct->getYLEN()) + 2;
    nzc = (col->getNzc()) / (vct->getZLEN()) + 2;
    [...]

where get_rank() returns the MPI rank of the process.

getVelocityDistribution should make one MPI_Allreduce call

Particles3Dcomm::getVelocityDistribution() sums particle velocity distributions using:
  for (int i = 0; i < nBins; i++) {
    localN = f[i];
    MPI_Allreduce(&localN, &totalN, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
    f[i] = totalN;
  } 

This generates thousands of all-reduce calls, which is horribly inefficient and will be unacceptable on DEEP's Booster. These lines can be replaced with a single call:

  MPI_Allreduce(MPI_IN_PLACE, f, nBins, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);

support convergence when number of particles per mesh cell is bounded

With the currently implemented algorithm, iPic3D is unable to converge to the solution of the Klimontovich equation (molecular dynamics), and in order for the iPic3D solution to converge to the solution of the Vlasov equation, the number of particles per mesh cell must be taken to infinity.

Two simple changes to the algorithm would allow for convergence to the Klimontovich equation (for a fixed number of particles as the mesh resolution goes to infinity) and to the Vlasov equation (for a fixed number of particles per mesh cell as the mesh resolution goes to infinity):

  1. Use a configurable number of smoothings. In the current code, the current is smoothed M=1 time before being used in the field solve, and the electric field is smoothed N=3 times after being solved. M and N should be made user-configurable numbers.

    Without this change, particle self-force is unable to become correct if the number of particles per mesh cell is bounded.

  2. For convergence to the Klimontovich equation, M should equal N and we should also allow the user to specify that the unsmoothed field be retained for subsequent field solves and that the smoothed field be used only to push particles. Currently in iPic3D the smoothing overwrites the unsmoothed field.

    Evolving the unsmoothed field allows the self-force to become the correct self-force of a finite-width Gaussian-shaped particle as the mesh is refined and the number of smoothings is increased appropriately. If the unsmoothed field is overwritten with the smoothed field, then this property is lost.

For fuller justification, see this proposal.

Testing again mailing list

This time creating a new issue from the CmPA Organization. Emails should be received by the team members, but I'm trying to send an email to a mailing list that I created:

[email protected]

If anyone receives an email, please add a comment below for testing.... thanks!!

Sum moments for all species in one OpenMP parallel clause

Maria and I want to check in code that modifies sumMoments (called by CalculateMoments()) so that the moments of all species are summed in a single OpenMP parallel clause, per the recommendation of Alejandro Duran. This reduces thread creation overhead and accelerates execution on Xeon by about 15%.
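
A minimal sketch of the restructuring (illustrative, not the actual sumMoments code): one parallel region encloses the loop over species, so the thread team is created and joined only once.

  void sum_moments_all_species(int ns, const int* nop_of_species)
  {
    #pragma omp parallel
    {
      for (int is = 0; is < ns; is++) {
        #pragma omp for
        for (int p = 0; p < nop_of_species[is]; p++) {
          // ... accumulate the 10 moments of particle p of species is ...
        }
        // implicit barrier here at the end of each species' worksharing loop
      }
    }   // threads are created and destroyed only once, not once per species
  }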

List of code enhancements for the next version

In the next version of the code I would like to see:

  1. Separation between .cpp and .h files.
  2. Elimination of "inline" functions, where they are not necessary.
  3. Elimination of all unused source code.
  4. Update of the pre and post processing tools: we should go Python!
  5. Parallel HDF5 files.
  6. Updating of the particle mover.
  7. Contiguous memory allocation of dynamic arrays.
  8. Elimination of all makefiles (use only CMake).
  9. Addition of pre-processing calls for MIC compilation.
  10. ...

If you have more ideas, please add them below. We will decide which ones we are going to add to the next version.

use int rather than long long in critical particle loops

We need to use long long for particle IDs to support more than 2 billion (2^31) particles. But there is no reason to support having so many particles in a single MPI process, and there is a potential cost to using long long in a particle loop. Therefore, I am changing these loops to use int, preceded by an assertion that nop is no greater than INT_MAX, in the following two critical methods:

  Particles3D::mover_PC()
  Particles3Dcomm::interpP2G()
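
A minimal sketch of the change (illustrative, not the actual mover code):

  #include <cassert>
  #include <climits>

  void push_all_particles(long long nop /* local particle count */)
  {
    assert(nop <= (long long)INT_MAX);   // never more than ~2 billion particles per process
    const int nop_int = (int)nop;
    for (int p = 0; p < nop_int; p++) {
      // ... push particle p ...
    }
  }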

Support making MPI_Barrier() a no-op.

There is no need for us to make MPI_Barrier() calls in ipic3d, and not doing so can help to hide communication latency.

Therefore, I want to check in code that uses the following preprocessor definitions:

 #define MPI_Barrier(args...)

to remove calls to MPI_Barrier() from the preprocessed code.

To accomplish this, I am placing that definition in a file named ipicdefs.h, and am including it in all files that make
calls to MPI_Barrier:

   #include "ipicdefs.h"

Ultimately I have in mind for ipicdefs.h to be included in every ipic3d file so that ipicdefs.h can be a simple mechanism to make universal changes like this.

changing processor topology should not require recompile

Currently changing the number of MPI processes requires editing VCtopology3D.h to change the following lines

  // change these values to change the topology
  XLEN = 2;
  YLEN = 2;
  ZLEN = 1;
  nprocs = XLEN * YLEN * ZLEN;
  // here you have to set the topology for the fields
  PERIODICX = true;
  PERIODICY = false;
  PERIODICZ = true;
  // here you have to set the topology for the Particles
  PERIODICX_P = true;
  PERIODICY_P = false;
  PERIODICZ_P = true;      

and then a recompile. cmake (mistakenly) thinks that it is necessary to recompile many files because of the include dependencies.

An initial step is to move the methods of the VCtopology3D class into a file named communication/VCtopology3D.cpp. That way, only a single file needs to be recompiled, followed by a relink.

A subsequent change should be to add a configuration variable to the input file for each of these variables.

Use of B_ext is incorrect.

Under commit 05082fc, EMfields::B_ext is used to add an external background magnetic field. The implementation needs to be corrected and improved:

  • Lines in Particles3D::mover_PC() such as
    Bxl += weight000 * Bx[ix][iy][iz] + Bx_ext[ix][iy][iz];
    are incorrect; a correct version would be
    Bxl += weight000 * (Bx[ix][iy][iz] + Bx_ext[ix][iy][iz]);
  • This, however, becomes very inefficient as the number of particles per mesh cell becomes large. A better solution would be to write this as
    Bxl += weight000 * Bx_tot[ix][iy][iz];
    where Bx_tot = Bx + Bx_ext is computed at the top of Particles3D::mover_PC().
  • Memory for B_ext and Bx_tot should not be allocated unless this feature is actually turned on. Bx_tot can simply be an alias pointer for Bx in case this feature is turned off.
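
A minimal sketch of the precomputation proposed above (illustrative), so that the per-particle loop only ever reads a single array:

  // computed once at the top of the mover, not once per particle
  void compute_Bx_tot(int nxn, int nyn, int nzn,
                      double*** Bx, double*** Bx_ext, double*** Bx_tot)
  {
    for (int i = 0; i < nxn; i++)
      for (int j = 0; j < nyn; j++)
        for (int k = 0; k < nzn; k++)
          Bx_tot[i][j][k] = Bx[i][j][k] + Bx_ext[i][j][k];
  }
  // in the particle loop:  Bxl += weight000 * Bx_tot[ix][iy][iz];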

particle pusher is iterating only once rather than NiterMover times

In the original commit to the github version of iPic3D, only one iteration was used to process the "rest" of the particles after processing as many particles as possible, P_SAME_TIME at a time. When I removed the P_SAME_TIME loop in commit 30a6ba5, all the particles were consequently advanced with only one iteration of the mover. This should be corrected so that the particle mover iterates NiterMover times.

support checking validity of configuration during initialization

A general principle of maintainable code is that error messages should be issued as soon as the violation of an assumed condition can be efficiently detected.

Currently configuration options are not checked for validity when they are read in, and only a small number of checks are made later on. Tightly and cleanly restricting configuration options allows us to guarantee backward compatibility in how the configuration file drives behavior. We should validate configuration options on startup and immediately exit with an informative error message upon detecting invalid options.

Note that appropriate resolution of this issue assumes that issue #64 is also resolved.

Use Moments class to sum moments in parallel with OpenMP

This is an enhancement code submission.

Particles3Dcomm::interpP2G() loops over all particles and sums the contributions of all particles to the 10 quadratic monomial velocity moments:

  rho
  Jx, Jy, Jz
  pXX, pXY, pXZ, pYY, pYZ, pZZ

Therefore I have implemented a Moments class that can be used to sum these moments.

I have also implemented
EMfields3D::addToSpeciesMoments(const Moments& in, int species)
which adds a set of accumulated moments to the total for a given species.

I use the Moments class in the Particles3Dcomm::interpP2G() method in a block headed by a #pragma omp parallel directive; this block contains a for loop over all particles headed by a #pragma omp for directive. At the conclusion of the parallel block a #pragma omp critical directive is used to add the moments for the particles summed in the OMP thread to the moments for the species being summed. Finer-grain reduction could accelerate this last little step.
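
A minimal sketch of the parallel structure just described (the single rho array stands in for the full set of 10 moments, and the names are illustrative, not the actual Moments class):

  #include <vector>

  void sum_rho_openmp(int nop, const int* cell_of, const double* q,
                      int ncells, double* rho_total)
  {
    #pragma omp parallel
    {
      // thread-private accumulator (stands in for a per-thread Moments object)
      std::vector<double> rho_local(ncells, 0.0);

      #pragma omp for
      for (int p = 0; p < nop; p++)
        rho_local[cell_of[p]] += q[p];

      // merge this thread's sums into the species total
      #pragma omp critical
      for (int c = 0; c < ncells; c++)
        rho_total[c] += rho_local[c];
    }
  }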

MPIdata should be a singleton

I want to check in some changes that make the MPIdata class a singleton. It is possible that in the future we will want to create multiple instances of the iPic3D solver class in a single process, whereas MPIdata needs to be a singleton by virtue of the fact that MPI_Init(), which is called in the MPIdata initializer, should be called only once. A crucial benefit of this change is that the MPI rank is made available to system-level diagnostics code (debug and error messages).
