Bug: rendezvous causes stall in specific situation #6

@TizianoDeMatteis

Description

If a kernel opens two or more SMI channels toward the same destination and the message length is small (it fits in a single network packet), the rendezvous mechanism can cause a stall.

Example

 SMI_Channel chan_send1 = SMI_Open_send_channel(2, SMI_INT, my_rank+1, 0, comm);
 SMI_Channel chan_send2 = SMI_Open_send_channel(2, SMI_INT, my_rank+1, 1, comm);

 for (int i = 0; i < 2; i++)
     <push the data in the two channels>

On the receiver side, the symmetric operations are applied. The stall occurs as follows:

  • Sender: when i=1, we push the second data element into the first channel and the network packet is sent. The channel now has zero tokens, so the rendezvous mechanism waits for a message from my_rank+1. This prevents the push on the second channel from executing.
  • Receiver: the network packet arrives and we perform the first pop. However, tokens is not zero, so the rendezvous message is not sent. The second pop stalls because its data will never arrive.
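
The interleaving above can be reproduced with a small host-side toy model. This is only a sketch and not SMI code: the function `simulate`, the constants, and the arrays are hypothetical names, and the model assumes that each side starts with as many tokens as the message has elements, that a packet leaves the sender only once the whole (single-packet) message has been pushed, and that the receiver sends the rendezvous message only when its own token count reaches zero.

```c
#include <assert.h>

#define MAX_CH 4
#define COUNT  2   /* elements per message, as in the example above */

/* Toy model of the token handshake (assumptions listed in the text, not SMI code).
 * Sender and receiver both loop over i, touching every channel in order.
 * Returns 1 if both sides end up blocked (stall), 0 if both finish. */
int simulate(int n_channels)
{
    assert(n_channels >= 1 && n_channels <= MAX_CH);

    int s_tokens[MAX_CH], r_tokens[MAX_CH];
    int pushed[MAX_CH]    = {0};  /* elements pushed so far per channel        */
    int delivered[MAX_CH] = {0};  /* 1 once the channel's packet has been sent */
    int credit[MAX_CH]    = {0};  /* 1 once the receiver sent the rendezvous   */
    for (int c = 0; c < n_channels; c++)
        s_tokens[c] = r_tokens[c] = COUNT;   /* assumed initial token count */

    int total  = COUNT * n_channels;  /* pushes (and pops) to perform          */
    int s_pc   = 0, r_pc = 0;         /* next push / pop index                 */
    int s_wait = -1;                  /* channel the sender is blocked on      */

    for (;;) {
        int progress = 0;

        /* Sender: either blocked waiting for a rendezvous message, or pushing. */
        if (s_wait >= 0) {
            if (credit[s_wait]) {
                credit[s_wait] = 0;
                s_tokens[s_wait] = COUNT;
                s_wait = -1;
                progress = 1;
            }
        } else if (s_pc < total) {
            int c = s_pc % n_channels;
            s_pc++;
            s_tokens[c]--;
            if (++pushed[c] == COUNT) delivered[c] = 1;  /* packet is sent   */
            if (s_tokens[c] == 0) s_wait = c;            /* wait for message */
            progress = 1;
        }

        /* Receiver: can pop only if the channel's packet has arrived. */
        if (r_pc < total) {
            int c = r_pc % n_channels;
            if (delivered[c]) {
                r_pc++;
                if (--r_tokens[c] == 0) credit[c] = 1;   /* send rendezvous  */
                progress = 1;
            }
        }

        if (s_pc == total && r_pc == total && s_wait < 0) return 0; /* done  */
        if (!progress) return 1;                                    /* stall */
    }
}
```

Under these assumptions, `simulate(1)` completes while `simulate(2)` stalls: the sender blocks on the first channel with zero tokens, and the receiver blocks popping from the second channel, whose packet was never sent.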

Possible solution
Change tokens condition on push?
