[dcmf] Question about DCMF Library from a new user

Thu Feb 7 15:54:21 CST 2008

This is how I do it in the GA/ARMCI reference implementation (see: 
comm/lib/ga/toolkit/armci/src/x/dcmf/armcix_impl.c) --- I added some 
comments for the purpose of this email....

int ARMCIX_Init ()
{
  // This is the only DCMF call that can be invoked before
  // DCMF_Messager_initialize(). Interrupt processing is turned off
  // in a critical section. Doing this here prevents the kernel from
  // firing an interrupt before the local node has been initialized.
  DCMF_CriticalSection_enter(0);

  DCMF_Messager_initialize ();

  ...

  DCMF_Configure_t config;
  memset (&config, 0x00, sizeof(DCMF_Configure_t));
  config.interrupts =
    (interrupts==0) ? DCMF_INTERRUPTS_OFF : DCMF_INTERRUPTS_ON;

  // The first (input) parameter is the "requested" configuration; the
  // second (output) parameter is the actual configuration. This matters
  // because multiple independent software libraries can coexist on the
  // same dcmf runtime (for example, MPI and ARMCI). Interrupt mode in
  // particular can only be enabled, never disabled. Because the
  // application may invoke the init-type routines of those libraries in
  // any order, the first may request interrupts on and the second may
  // request interrupts off. In that case the second configure request
  // will not "degrade" the interrupt level, and interrupts will remain
  // enabled for the dcmf runtime.  Whew!
  DCMF_Messager_configure (&config, &config);

  ...

  // Ready to process interrupts now.
  DCMF_CriticalSection_exit(0);
}
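
To make the "can only be enabled, never disabled" rule concrete, here is a
minimal sketch in plain C. It does not call DCMF at all: model_config_t,
model_runtime and model_configure() are hypothetical names invented for this
illustration, and the logic follows the comment above rather than the
runtime's actual source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; NOT the DCMF types. */
typedef enum { MODEL_INTERRUPTS_OFF = 0, MODEL_INTERRUPTS_ON = 1 } model_interrupts_t;
typedef struct { model_interrupts_t interrupts; } model_config_t;

/* Runtime-wide state shared by every client library in the process. */
static model_config_t model_runtime = { MODEL_INTERRUPTS_OFF };

/* Apply a requested configuration and report the actual one.
 * Interrupts can be switched on but are never switched back off.
 * 'requested' and 'actual' may point at the same object. */
static void model_configure(const model_config_t *requested, model_config_t *actual)
{
    assert(actual != NULL);
    if (requested != NULL && requested->interrupts == MODEL_INTERRUPTS_ON)
        model_runtime.interrupts = MODEL_INTERRUPTS_ON;
    *actual = model_runtime;
}
```

In this model, a library that requests interrupts off after another library
has already requested them on gets back a configuration with interrupts still
on, which is why the output parameter is worth inspecting.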

I have to point out that, specifically on BG/P, interrupt mode may not be 
as useful as you may think. BG/P has a DMA engine that offloads 
communication from the cores which allows BG/P to achieve high overlap of 
computation and communication. In my experience with ARMCI - another 
one-sided API - the remote nodes end up polling "enough" that progress is 
always made. In fact, in most ARMCI and GA test cases the performance is 
actually lower in interrupt mode.

One thing to remember is that a single call to DCMF_Messager_advance() may 
invoke many different callbacks (local completion callbacks, recv notify 
callbacks, recv done callbacks, etc), so it's not like you have to call 
DCMF_Messager_advance() a gajillion times. Usually you are looping on a 
variable that will be changed by one of the callback functions. Other 
callbacks may be invoked and may be altering state on other variables, but 
the loop doesn't exit until the particular condition is met. Check out 
this snip (again from the ARMCI reference implementation) from the file: 
comm/lib/ga/toolkit/armci/src/x/dcmf/armcix_get.c :

int ARMCIX_Get(void * src, void * dst, int bytes, int proc)
{
  DCMF_CriticalSection_enter (0);

  volatile unsigned active = 1;
  DCMF_Callback_t cb_wait = { ARMCIX_DCMF_cb_decrement, (void *)&active };
  DCMF_Request_t request;

  DCMF_Get ( &__get_protocol,
             &request,
             cb_wait,
             DCMF_MATCH_CONSISTENCY,
             proc,
             bytes,
             (char *) dst,
             (char *) src);

  while (active) DCMF_Messager_advance ();

  DCMF_CriticalSection_exit  (0);

  return 0;
}

A couple of things to note here:
- The 'volatile' keyword is important; without it the compiler may
  optimize out the while loop.
- The code is wrapped in a critical section. There is no point in taking
  the overhead of an interrupt if you are actively polling!
- The "cb_wait" callback just decrements the "active" variable.
  Non-blocking operations may free the DCMF_Request_t that was used for
  the operation; since this operation is blocking, the DCMF_Request_t was
  simply placed on the stack.
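
The polling pattern can be tried out without a Blue Gene. The sketch below
is a toy model in plain C, not DCMF code: model_callback_t, model_post(),
model_advance(), model_cb_decrement() and model_blocking_wait() are all
hypothetical names, and the pending-callback array merely stands in for
whatever the runtime tracks internally. It illustrates two points from
above: one advance call may run several callbacks, and the waiter spins on
a volatile counter that a callback decrements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback pair: a function plus its client data. */
typedef struct {
    void (*function)(void *clientdata);
    void *clientdata;
} model_callback_t;

#define MODEL_MAX_PENDING 8
static model_callback_t model_pending[MODEL_MAX_PENDING];
static size_t model_npending = 0;

/* Queue a completion callback, as if an operation had been started. */
static void model_post(model_callback_t cb)
{
    assert(model_npending < MODEL_MAX_PENDING);
    model_pending[model_npending++] = cb;
}

/* One advance call runs every queued callback; returns how many ran. */
static size_t model_advance(void)
{
    size_t ran = model_npending;
    for (size_t i = 0; i < ran; i++)
        model_pending[i].function(model_pending[i].clientdata);
    model_npending = 0;
    return ran;
}

/* Decrement the counter that the waiter is spinning on. */
static void model_cb_decrement(void *clientdata)
{
    volatile unsigned *active = (volatile unsigned *)clientdata;
    (*active)--;
}

/* Start two operations, then poll until both have completed.
 * Returns the number of advance calls that were needed. */
static unsigned model_blocking_wait(void)
{
    volatile unsigned active = 2;
    model_callback_t cb = { model_cb_decrement, (void *)&active };
    unsigned polls = 0;

    model_post(cb);
    model_post(cb);

    while (active) {
        model_advance();
        polls++;
    }
    return polls;
}
```

Here both completions are delivered by a single model_advance() call, so
the loop body runs once even though two operations were outstanding.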

Michael Blocksome
Blue Gene Messaging Team Lead
Advanced Systems SW Development
blocksom at us.ibm.com

rajesh.nishtala at gmail.com wrote on 02/07/2008 03:18:24 PM:

> Thanks for the answers! They definitely help. In order to reconfigure
> the messaging layer to use polling versus interrupts, does one need
> to reinstall/reconfigure the software, or is there a way to do it
> within a particular installation? In either case, how would one enable
> or disable the 'interrupt mode'? Is this the 'DCMF_Interrupts' field
> in the DCMF_Configure_t type?
> 
> Thanks,
> -rajesh
> 
> On Feb 7, 2008 12:12 PM, Michael Blocksome <blocksom at us.ibm.com> wrote:
> >
> > Rajesh,
> >
> > These are excellent questions! We need to provide better documentation
> > on how the callback flow works ... barring that, I'll answer your
> > questions as they come and maybe we can pull together a document after
> > answering these questions on the mailing list.
> >
> >
> > There are two types of callbacks - think of them as "local completion"
> > and "remote notification" callbacks.
> >
> > The local completion callbacks are specified not at registration time,
> > but when an individual operation is started (DCMF_Send(), for example).
> > This callback is invoked by the dcmf runtime, when the local node calls
> > the DCMF_Messager_advance() function, after the source buffer has been
> > completely sent.  Once the local completion callback is invoked, all
> > buffers associated with the operation may be deallocated, etc.
> >
> > The callbacks that are registered for DCMF_Send (and DCMF_Control, etc.)
> > with the DCMF_Send_register() function are invoked by the dcmf runtime
> > on the remote node when that node calls the DCMF_Messager_advance()
> > function. Typically all nodes in the system will periodically poll with
> > DCMF_Messager_advance() to make progress; however, the BGP messager can
> > be configured to enable an interrupt to be fired when a core receives a
> > packet. In this interrupt mode active polling is not required -
> > although you do take a performance hit because of the overhead of
> > processing the interrupts.
> >
> > The remote callbacks for DCMF_Send are invoked before any data has been
> > written to the remote node.  There are two callback types and each has
> > a slightly different use by the application programmer.
> >
> > The DCMF_RecvSendShort ("short") callbacks are invoked when the entire
> > message has been received by the remote node into a temporary location
> > (on BGP this is a single packet of data that has been received by the
> > DMA into a memory fifo).  The application's responsibility is to copy
> > the data out of the temporary buffer and into the final destination
> > buffer. This callback type was created specifically to allow the dcmf
> > implementation to optimize the performance for small messages.
> >
> > The DCMF_RecvSend ("long" or "asynchronous") callbacks are invoked when
> > the control information has been received by the remote node into a
> > temporary location (on BGP the control information will be contained in
> > a single packet).  The application's responsibility is to allocate
> > memory (DCMF_Request_t) for the dcmf runtime to use to receive the rest
> > of the data, as well as to specify the destination buffer and length
> > and a ("recv_done") callback. This "recv_done" callback is invoked by
> > the dcmf runtime when the data has been completely received and written
> > to the destination buffer.  Typically applications will free/deallocate
> > the DCMF_Request_t memory that was allocated previously.
> >
> >
> >
> > DCMF_Put and memory regions (i.e., registration, pinning)
> >
> > The DCMF_Put in the library is just stubbed in, as we didn't have a need
> > for it in release 1 of the BG/P software.  However, we are actively
> > working on adding DCMF_Put() to the API, which will also require a
> > memory region API. The existing DCMF_Get API will be updated to use
> > these new memory region objects.  Perhaps we should go into more detail
> > on the memory region API in a separate email.
> >
> >
> > I hope this helps!
> >
> > Michael Blocksome
> >  Blue Gene Messaging Team Lead
> >  Advanced Systems SW Development
> >  blocksom at us.ibm.com
> >
> >
> > "Rajesh Nishtala" <rajeshn at eecs.berkeley.edu>
> > Sent by: dcmf-bounces at lists.anl-external.org
> > 02/07/2008 12:01 PM
> >
> > To: dcmf at lists.anl-external.org
> > cc: upc-devel at lbl.gov
> > Subject: [dcmf] Question about DCMF Library from a new user
> >
> > Hi,
> >  I am porting GASNet, our portable runtime layer for the Berkeley UPC
> >  compiler, to the BlueGene/P and I'm using DCMF as the lower level
> >  messaging layer. I have some high level questions regarding the
> >  library that will influence the design of our BlueGene/P port. The
> >  main difference between our library and MPI is that we focus on
> >  one-sided communication so my main questions are regarding these
> >  issues.
> >
> >  + When does the callback that gets registered with DCMF_Send() get
> >  called? Does it get called after the data has been committed to the
> >  memory on the remote node or does it simply imply that the data
> >  buffer is safe to reuse on the local node?
> >
> >  + I notice that when I do an nm on the dcmf libraries there is a
> >  DCMF_Put() function, however when I waded through the code a little
> >  bit more I noticed that the function simply called an abort which to
> >  me implies that it is not implemented. Is this why it doesn't show up
> >  in the dcmf.h header files?
> >
> >  + I have heard that the BlueGene/P supports RDMA operations. Are
> >  there any special considerations for managing the memory registration
> >  (i.e. memory pinning) to enable these operations or is this done
> >  automatically under the covers?
> >
> >
> >  Thanks in advance for any help!
> >
> >  Sincerely,
> >  Rajesh Nishtala
> >  _______________________________________________
> >  dcmf mailing list
> >  dcmf at lists.anl-external.org
> >  http://lists.anl-external.org/cgi-bin/mailman/listinfo/dcmf
> >  http://dcmf.anl-external.org/wiki
> >
> >