
Comments (7)

 avatar commented on July 29, 2024

@dahongli I believe this issue is what you are working on

from mana.

xuyao0127 avatar xuyao0127 commented on July 29, 2024

Thanks for testing the problem. This could be related to some corner case of the hybrid 2pc algorithm. I can't work on it today, but I can take a look during the weekend.

Since there aren't many ranks, can you attach gdb to each of the ranks and print the seq_num and target_seq_num maps? I believe they are unordered maps, so you will probably need to define a print function for them and call that function from gdb.
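A minimal sketch of such a print helper, for illustration only. The map type (`seq_map_t`, with unsigned int keys and unsigned long values) and the function name `print_seq_map` are assumptions, not taken from the MANA sources; adjust them to match the real declarations in seq_num.cpp.

```cpp
#include <cstdio>
#include <unordered_map>

// Assumed map type: communicator id -> sequence number.
// The real key/value types in seq_num.cpp may differ.
using seq_map_t = std::unordered_map<unsigned int, unsigned long>;

// Hypothetical helper. It is a plain (non-template, non-inline) function with
// C linkage, so the symbol is emitted and gdb can call it by its simple name.
// Output goes to stderr so it is not held back by stdout buffering.
// Returns the number of entries printed.
extern "C" int print_seq_map(const seq_map_t *m) {
  int printed = 0;
  for (const auto &kv : *m) {
    std::fprintf(stderr, "comm 0x%x -> %lu\n", kv.first, kv.second);
    ++printed;
  }
  std::fprintf(stderr, "Map size = %zu\n", m->size());
  return printed;
}
```

Once the helper is compiled into the program with debug info, it could be invoked from an attached gdb session with something like `call print_seq_map(&seq_num)`, once per map and per rank.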


 avatar commented on July 29, 2024

Rank 0:

seq_num:
elem[0]->left: $1 = 1140850688
elem[0]->right: $2 = 4421956594040832
Map size = 1

target_start_triv_barrier:
Map size = 0

target_stop_triv_barrier:
elem[0]->left: $3 = 1140850688
elem[0]->right: $4 = 0
Map size = 1

Rank 1:

seq_num:
elem[0]->right: $2 = 4421952299073536
Map size = 1

target_start_triv_barrier:
Map size = 0

target_stop_triv_barrier:
elem[0]->left: $3 = 1140850688
elem[0]->right: $4 = 0
Map size = 1

Rank 2:

seq_num:
elem[0]->right: $2 = 4421943709138944
Map size = 1

target_start_triv_barrier:
Map size = 0

target_stop_triv_barrier:
elem[0]->left: $3 = 1140850688
elem[0]->right: $4 = 0
Map size = 1

Rank 3:

seq_num:
elem[0]->left: $1 = 1140850688
elem[0]->right: $2 = 4421939414171648
Map size = 1

target_start_triv_barrier:
Map size = 0

target_stop_triv_barrier:
elem[0]->left: $3 = 1140850688
elem[0]->right: $4 = 0
Map size = 1

This data was taken at the first preSuspendBarrier after the checkpoint command was sent.


Marc-Miranda avatar Marc-Miranda commented on July 29, 2024

Good afternoon!

Sorry for jumping in, but I was facing the same issue with iPIC3D. I have been reading the seq_num.cpp file, and it seems to me that check_seq_nums() should be:

int check_seq_nums() {
  unsigned int comm_id;
  unsigned int seq;
  int target_reached = 0;
  for (comm_seq_pair_t pair : seq_num) {
    comm_id = pair.first;
    seq = pair.second;
    if (target_start_triv_barrier[comm_id] < seq_num[comm_id]) {
      target_reached = 1;
      break;
    }
  }
  return target_reached;
}

Then the following code fragment works fine.

if (ckpt_pending && check_seq_nums()) {
      current_phase = STOP_BEFORE_CS;
      while (!freepass && ckpt_pending);
      freepass = false;
      current_phase = IN_CS;
}

If even a single communicator has reached its target, we stop before proceeding to the critical section (CS). It seems that ranks were being allowed to enter the CS when they were not meant to. I have been running iPIC3D with this modification for about an hour and no error has been raised. Before, the error appeared quite frequently.

Best,
Marc


 avatar commented on July 29, 2024

I have tested this change with the Allgather test case I identified above, and this does not appear to be a complete fix for the issue (I am still seeing the same problem). @xuyao0127 any input?


xuyao0127 avatar xuyao0127 commented on July 29, 2024

int check_seq_nums() {
  unsigned int comm_id;
  unsigned int seq;
  int target_reached = 0;
  for (comm_seq_pair_t pair : seq_num) {
    comm_id = pair.first;
    seq = pair.second;
    if (target_start_triv_barrier[comm_id] < seq_num[comm_id]) {
      target_reached = 1;
      break;
    }
  }
  return target_reached;
}

This change is incorrect because the function is used to check whether all communicators of a rank have reached their targets. The ckpt thread then shares this information with the other ranks to decide when to checkpoint. All communicators of all ranks need to reach their targets before a checkpoint is taken.

Calling check_seq_nums() before entering the STOP_BEFORE_CS loop is an optimization to reduce the number of free passes. It is not where check_seq_nums() is primarily used.
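To make the distinction concrete, here is a small sketch contrasting an "all communicators reached their target" check with the "any communicator reached its target" check from the proposed change. This is not MANA's code: the type `comm_map_t`, both function names, and the direction of the comparison are assumptions chosen for illustration.

```cpp
#include <unordered_map>

// Assumed map type: communicator id -> sequence number (hypothetical).
using comm_map_t = std::unordered_map<unsigned int, unsigned long>;

// "All" semantics: true only if every communicator in `current` has reached
// its target. A communicator with no recorded target is treated as having
// reached it (an assumption made for this sketch).
bool all_targets_reached(const comm_map_t &current, const comm_map_t &target) {
  for (const auto &kv : current) {
    auto it = target.find(kv.first);
    if (it != target.end() && kv.second < it->second) {
      return false;  // this communicator is still behind its target
    }
  }
  return true;
}

// "Any" semantics: true as soon as one communicator with a recorded target
// has reached it. This reports success while other communicators may still
// be behind, which is why it cannot stand in for the "all" check.
bool any_target_reached(const comm_map_t &current, const comm_map_t &target) {
  for (const auto &kv : current) {
    auto it = target.find(kv.first);
    if (it != target.end() && kv.second >= it->second) {
      return true;
    }
  }
  return false;
}
```

With two communicators where only one has reached its target, the "any" check already returns true while the "all" check returns false, so a checkpoint decision based on the former could be taken too early.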


 avatar commented on July 29, 2024

#233

