[dcmf] 32-bit ROMIO: What is the best solution?
Bob Cernohous
bobc at us.ibm.com
Fri Feb 8 14:41:40 CST 2008
I'm going to forward parts of a technical discussion about 32-bit
MPICH/ROMIO issues. This discussion has been going on privately for a
couple weeks. I just wanted to open it up to the list for discussion and
tracking.
I'm not going to include all the technical information that's been passed
on. If anyone wants to add more of the historical discussion, go for it.
We all realize there are issues with MPI_Aints in 32-bit implementations.
Using signed 32-bit addresses along with 64-bit offsets can result in
some pretty broken code. I've reproduced several problems on
BGL/BGP/Linux. The problems are most obvious with ROMIO files larger
than 2GB or virtual addresses above 2GB.
We've begun to consider making MPI_Aint a 64-bit value as a solution.
So, below are my selections of technical tidbits on this from Jeff Parker.
------------------------------------------
Your (Rob Ross) document describing the MPI-IO file size limitations on
32-bit platforms offered two solutions:
1. Change internal ROMIO code to use 64-bit variables. <snip. Rob Ross
can detail this if he'd like to discuss this further.>
2. Change the MPI_Aint data type to be 64-bit. I looked at this solution
and it requires a scrub of all places throughout ROMIO and MPICH where
MPI_Aint variables are used, ensuring calculations will have the correct
result. However, it does fix everything, including the external MPI
interfaces that return MPI_Aint values. Perhaps a variation of this is to
code MPICH and ROMIO to handle either a 32-bit or a 64-bit MPI_Aint based
on a compile flag, changing calculations to be correct for either size and
adding assertions when overflow occurs. We could have two MPICH
libraries, one with a 32-bit MPI_Aint and the other with a 64-bit MPI_Aint.
Applications that don't have large datatypes or large MPI-IO files can
continue to use the 32-bit library, eliminating the risk of performance
issues, while applications using large datatypes can use the 64-bit
version.
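To make the compile-flag idea concrete, here is a minimal sketch of how the
dual-size MPI_Aint might be wired up. The flag name MPICH_AINT_64 and the
overflow-check macro are hypothetical, not existing MPICH code:

    #include <limits.h>

    /* Hypothetical compile flag; not an actual MPICH option. */
    #ifdef MPICH_AINT_64
    typedef long long MPI_Aint;   /* 64-bit library build */
    #else
    typedef long MPI_Aint;        /* 32-bit library build (pointer-sized) */
    #endif

    /* Assertion helper for narrowing points: true when a 64-bit
     * intermediate result still fits in the configured MPI_Aint. */
    #define MPID_AINT_FITS(v) \
        (sizeof(MPI_Aint) == 8 || ((v) >= LONG_MIN && (v) <= LONG_MAX))

A build system could then produce the two libraries by compiling once with
and once without the flag.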
One could argue that these MPI interfaces that return MPI_Aint values are
not used by typical applications, or if they are used, they are only used
on datatypes that fit in 31-bit sizes (up to 2GB), so those applications
don't have a problem. However, if the 32-bit platform has more than 2GB
of memory, the MPI_Aint values can go negative, which could also result in
incorrect calculations or comparisons.
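As a concrete illustration of that last point, here is a small standalone
C program (not MPICH code) showing how a 32-bit signed MPI_Aint misorders
two addresses once one of them crosses 2GB:

    #include <stdio.h>

    typedef int MPI_Aint;   /* the current 32-bit signed MPI_Aint */

    int main(void)
    {
        /* Conversion of an out-of-range unsigned value is shown here for
         * illustration; on common 32-bit ABIs it wraps to a negative int. */
        MPI_Aint high = (MPI_Aint) 0x90000000u;  /* address above 2GB */
        MPI_Aint low  = (MPI_Aint) 0x10000000u;  /* address below 2GB */

        /* Numerically high > low, but the signed comparison disagrees. */
        if (high < low)
            printf("high (%d) compares below low (%d): broken ordering\n",
                   high, low);
        return 0;
    }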
Both solutions involve a significant amount of work. However, it appears
that only solution 2 fixes all of the issues. Other 64-bit platforms,
which have both a 64-bit MPI_Aint and 64-bit pointers, appear to work
correctly. It seems that we must change MPI_Aint to be 64-bit and fix the
calculations involving MPI_Aint and 32-bit addresses in order to fix the
whole problem.
------------------------------------------
Something else that comes to mind that may be related: even though a Blue
Gene/P compute node has 2GB of memory, virtual addresses may be arranged
by CNK such that they are larger than 2GB. In some cases, when signed
addresses are used with 64-bit offsets in calculations, this may produce
incorrect results. Any scrub of the code should account for this as well.
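A standalone sketch (not CNK or ROMIO code) of the failure mode described
here, where a signed 32-bit address is sign-extended into a 64-bit offset
calculation:

    #include <stdio.h>

    int main(void)
    {
        int       addr32  = (int) 0xA0000000u;  /* 32-bit address above 2GB */
        long long rel_off = 0x100000000LL;      /* 64-bit offset (4GB) */

        /* Sign extension makes the address contribute a negative value,
         * silently cancelling part of the 64-bit offset. */
        long long bad  = (long long) addr32 + rel_off;

        /* Converting through unsigned first preserves the address bits. */
        long long good = (long long) (unsigned int) addr32 + rel_off;

        printf("bad  = 0x%llx\ngood = 0x%llx\n",
               (unsigned long long) bad, (unsigned long long) good);
        return 0;
    }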
------------------------------------------
(Jeff's earlier investigation)
------------------------------------------
Regarding the MPI-IO issue with large files, exposed by the b_eff_io
test case...
I have done a brief investigation of solution 2, changing the MPI_Aint
data type from 32 bits to 64 bits. <snip>
Compile warnings appear in generic MPICH code, and a lot of work
(including code changes) would need to be done to address these issues.
It appears that the MPICH code expects MPI_Aint to be signed and to be the
same size as a pointer, and many sensitive code changes will be required
throughout MPICH to allow them to be different. There are 1474 lines of
code that explicitly reference MPI_Aint (typically declaring variables and
function parameters of this type), and each of these occurrences would
need to be examined in detail.
A detailed look through the ROMIO code appears to be needed to determine
which variables need to be 64 bits and to replace them with a new unsigned
64-bit typedef. While I haven't investigated solution 1 yet, it seems
likely that it too will require a similar scrub of at least the ROMIO code.
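As a rough sketch of what that might look like (the type and macro names
here are made up for illustration, not proposed ROMIO identifiers):

    #include <stdint.h>

    /* Hypothetical unsigned 64-bit type for ROMIO-internal offset math. */
    typedef uint64_t ADIOI_UINT64;

    /* Converting a 32-bit pointer via uintptr_t avoids sign extension. */
    #define ADIOI_PTR_TO_U64(p) ((ADIOI_UINT64) (uintptr_t) (p))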
Here is evidence of the problems with changing MPI_Aint to be 64 bits:
1. The mpi/mpich2/configure.in file sets up MPI_Aint to be the same size
as a pointer, and exports this size to other config files.
2. MPI_Aint is signed. Making it unsigned would help, but I don't think
that is allowed, correct?
3. If I force MPI_Aint to be 64 bits (signed) in this config file as
follows:

    MPI_SIZEOF_AINT=8
    export MPI_SIZEOF_AINT
    MPI_AINT="long long"

the MPICH compilation has warnings similar to the following:

    mpi/mpich2/src/mpid/common/datatype/mpid_segment.c:659: warning: cast from pointer to integer of different size
These places will be problematic because casting a 32-bit pointer to a
64-bit signed integer produces incorrect results when the high-order bit
of the pointer is set (the address is larger than 2GB): the high-order bit
is sign-extended, and the resulting integer is a large negative number.
Some code change will be required in each of these places to make this
work; a sketch of one possible fix appears after the examples below. Here
is one example:
mpi/mpich2/src/mpid/common/datatype/mpid_segment.c:659: warning: cast from pointer to integer of different size

mpid/common/datatype/mpid_dataloop.h:#define DLOOP_Offset MPI_Aint

    633 static int MPID_Segment_contig_flatten(DLOOP_Offset *blocks_p,
    634                                        DLOOP_Type el_type,
    635                                        DLOOP_Offset rel_off,
    636                                        void *bufp,
    637                                        void *v_paramp)
    638 {
    639     int index, el_size;
    640     DLOOP_Offset size;
    641     struct MPID_Segment_piece_params *paramp = v_paramp;
    642     MPIDI_STATE_DECL(MPID_STATE_MPID_SEGMENT_CONTIG_FLATTEN);
    643
    644     MPIDI_FUNC_ENTER(MPID_STATE_MPID_SEGMENT_CONTIG_FLATTEN);
    645
    646     el_size = MPID_Datatype_get_basic_size(el_type);
    647     size = *blocks_p * (DLOOP_Offset) el_size;
    648     index = paramp->u.flatten.index;
    649
    650 #ifdef MPID_SP_VERBOSE
    651     MPIU_dbg_printf("\t[contig flatten: index = %d, loc = (%x + %x) = %x, size = %d]\n",
    652                     index,
    653                     (unsigned) bufp,
    654                     (unsigned) rel_off,
    655                     (unsigned) bufp + rel_off,
    656                     (int) size);
    657 #endif
    658
    659     if (index > 0 && ((DLOOP_Offset) bufp + rel_off) ==
    660         ((paramp->u.flatten.offp[index - 1]) +
    661          (DLOOP_Offset) paramp->u.flatten.sizep[index - 1]))
    662     {
    663         /* add this size to the last vector rather than using up another one */
    664         paramp->u.flatten.sizep[index - 1] += size;
    665     }
4. Here is another example:

    #define MPID_IOV      struct iovec
    #define MPID_IOV_LEN  iov_len
    #define MPID_IOV_BUF  iov_base

    void *iov_base;    /* Pointer to data. */

In src/mpid/common/datatype/gen_type_struct.c:

    DLOOP_Offset *tmp_disps, bytes;
    MPID_IOV *iov_array;

    for (i=0; i < nr_blks; i++)
    {
        tmp_blklens[i] = iov_array[i].MPID_IOV_LEN;
        tmp_disps[i]   = (DLOOP_Offset) iov_array[i].MPID_IOV_BUF;
    }

The "tmp_disps[i] =" line casts a void* to an MPI_Aint, possibly causing
sign extension and resulting in an incorrect offset.
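As promised above, here is one possible shape of a fix for casts like the
two examples. It assumes a C99 uintptr_t is available and a 64-bit
DLOOP_Offset; the helper name is hypothetical, and this is a sketch rather
than a committed change:

    #include <stdint.h>

    typedef long long DLOOP_Offset;   /* stand-in for a 64-bit MPI_Aint */

    /* Route the pointer through an unsigned integer type so the widening
     * to 64 bits zero-extends instead of sign-extending. */
    static DLOOP_Offset DLOOP_Ptr_to_offset(void *p)
    {
        return (DLOOP_Offset) (uintptr_t) p;
    }

With a helper like this, line 659 of mpid_segment.c would compare
DLOOP_Ptr_to_offset(bufp) + rel_off, and the gen_type_struct.c loop would
assign tmp_disps[i] = DLOOP_Ptr_to_offset(iov_array[i].MPID_IOV_BUF),
neither of which can sign-extend a high address.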