[dcmf] 32-bit ROMIO: What is the best solution?

Fri Feb 8 15:06:29 CST 2008

One little comment: MPI_Aint must be signed. MPI explicitly allows  
negative extents (which are stored in MPI_Aint types).
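
To make that concrete, here is a minimal sketch in plain C (no MPI calls). The names aint_t, uaint_t and the two element_offset_* functions are made up for illustration and are not MPICH definitions. It shows what happens to a negative extent when it is kept in an unsigned 32-bit type and then widened into a 64-bit offset:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for a 32-bit MPI_Aint; not MPICH definitions. */
typedef int32_t  aint_t;   /* signed, as MPI requires */
typedef uint32_t uaint_t;  /* unsigned alternative, for comparison only */

/* Byte offset of element `index` when the extent is held in a signed type. */
static int64_t element_offset_signed(int64_t base, aint_t extent, int index)
{
    assert(index >= 0);
    return base + (int64_t) extent * index;
}

/* Same computation with the extent's bit pattern held in an unsigned type. */
static int64_t element_offset_unsigned(int64_t base, uaint_t extent, int index)
{
    assert(index >= 0);
    return base + (int64_t) extent * index;
}
```

With base 1000, extent -8 and index 3, the signed version gives 976, while the unsigned version lands roughly 12 GB past the base.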

Rob

On Feb 8, 2008, at 2:41 PM, Bob Cernohous wrote:

>
> I'm going to forward parts of a technical discussion about 32-bit  
> MPICH/ROMIO issues.    This discussion has been going on privately  
> for a couple weeks.  I just wanted to open it up to the list for  
> discussion and tracking.
>
> I'm not going to include all the technical information that's been  
> passed on.  If anyone wants to add more of the historical  
> discussion, go for it.
>
> We all realize there are issues with MPI_Aint's in 32 bit  
> implementations.  Using signed 32 bit addresses along with 64 bit  
> offsets can result in some pretty broken code.  I've reproduced  
> several problems on BGL/BGP/linux.   The problems are most obvious  
> with romio files > 2G or virtual addresses > 2G.
>
> We've begun to consider making MPI_Aint a 64 bit value as a solution.
>
> So, below are my selections of technical tidbits on this from Jeff  
> Parker.
>
> ------------------------------------------
> Your (Rob Ross) document describing the MPI-IO file size limitations  
> on 32-bit platforms offered two solutions:
>
> 1.  Change internal ROMIO code to use 64-bit variables.  <snip.  Rob  
> Ross can detail this if he'd like to discuss this further.>
>
> 2.  Change the MPI_Aint data type to be 64-bit.  I looked at this  
> solution and it requires a scrub of all places throughout ROMIO and  
> MPICH where MPI_Aint variables are used, ensuring calculations will  
> have the correct result.  However, it does fix everything, including  
> the external MPI interfaces that return MPI_Aint values.  Perhaps a  
> variation of this is to code MPICH and ROMIO to handle either a 32-bit
> or a 64-bit MPI_Aint based on a compile flag, changing calculations
> to be correct for either size, and adding assertions when overflow  
> occurs.  We could have two MPICH libraries, one with 32-bit MPI_Aint  
> and the other with 64-bit MPI_Aint.  Applications that don't have  
> large datatypes or large MPI-IO files can continue to use the 32-bit  
> library, eliminating the risk of performance issues, while  
> applications using large datatypes can use the 64-bit version.
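
A rough sketch of what the compile-flag variation might look like, assuming a build flag named AINT_IS_64BIT and helper names aint_add_fits/aint_add (all hypothetical, none of them exist in MPICH or ROMIO). The overflow check is written so that it is valid for either width:

```c
#include <assert.h>
#include <stdint.h>

/* AINT_IS_64BIT is a hypothetical build flag, not an existing MPICH option. */
#ifdef AINT_IS_64BIT
typedef int64_t aint_t;
#define AINT_MAX INT64_MAX
#define AINT_MIN INT64_MIN
#else
typedef int32_t aint_t;
#define AINT_MAX INT32_MAX
#define AINT_MIN INT32_MIN
#endif

/* Returns 1 if a + b fits in aint_t, 0 if the sum would overflow. */
static int aint_add_fits(aint_t a, aint_t b)
{
    if (b > 0)
        return a <= AINT_MAX - b;
    return a >= AINT_MIN - b;
}

/* Add two values, asserting instead of silently wrapping. */
static aint_t aint_add(aint_t a, aint_t b)
{
    assert(aint_add_fits(a, b));
    return a + b;
}
```

The same source builds both library flavors; only the typedef and the limits change, and the assertion fires in the 32-bit build when a datatype or file offset outgrows it.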
>
> One could argue that these MPI interfaces that return MPI_Aint  
> values are not used by typical applications, or if they are used,  
> they are only used on datatypes that fit in 31-bit sizes (up to  
> 2GB), so those applications don't have a problem.  However, if the  
> 32-bit platform has more than 2GB of memory, the MPI_Aint values can  
> go negative, which could also result in incorrect calculations or  
> comparisons.
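
The "values can go negative" case can be sketched as follows. This simulates 32-bit addresses with plain integers so it behaves the same on any host; aint32_t and the function names are made up for illustration. The unsigned-to-signed conversion is implementation-defined in C, but the usual two's-complement compilers wrap as shown:

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t aint32_t;  /* hypothetical stand-in for a 32-bit MPI_Aint */

/* Hold a 32-bit address in a signed 32-bit value, as a 32-bit MPI_Aint
 * would. Addresses at or above 2GB come out negative. */
static aint32_t addr_as_aint(uint32_t addr)
{
    return (aint32_t) addr;
}

/* "Is address a below address b?" decided on the signed values. */
static int below_signed(uint32_t a, uint32_t b)
{
    return addr_as_aint(a) < addr_as_aint(b);
}

/* Sanity check of the claim above; aborts if the platform differs. */
static void check_2gb_boundary(void)
{
    assert(addr_as_aint(0x7FFFFFFFu) > 0);
    assert(addr_as_aint(0x80000000u) < 0);
}
```

Here below_signed(0x7FFFFFF0, 0x80000010) is false even though the first address is the lower one, which is the kind of incorrect comparison described above.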
>
> Both solutions involve a significant amount of work.  However, it  
> appears that only solution 2 fixes all of the issues.  64-bit
> platforms, which have both a 64-bit MPI_Aint and 64-bit pointers, appear
> to work correctly.  It seems that we must change MPI_Aint to be 64-bit
> and fix the calculations involving MPI_Aint and 32-bit addresses in  
> order to fix the whole problem.
> ------------------------------------------
> Something else that comes to mind that may be related:  Even though  
> a Blue Gene/P compute node has 2GB of memory, virtual addresses may  
> be arranged by CNK such that they are larger than 2GB.  In some  
> cases, when signed addresses are used with 64 bit offsets in  
> calculations, this may produce incorrect results.  Any scrub of the  
> code should account for this as well.
> ------------------------------------------
> (Jeff's earlier investigation)
> ------------------------------------------
> Regarding the MPI-IO issue with large files, exposed by the b_eff_io  
> testcase...
>
> I have done a brief investigation of solution 2, changing the  
> MPI_Aint data type from 32 bits to 64 bits.  <snip>
> Compile warnings appear in generic MPICH code, and a lot of work  
> (including code changes) would need to be done to address these  
> issues.
>
> It appears that the MPICH code expects MPI_Aint to be signed and be  
> the same size as a pointer, and many sensitive code changes will be  
> required throughout MPICH to allow them to be different.  There are  
> 1474 lines of code that explicitly reference MPI_Aint (typically
> declaring variables and function parameters of this type), and each
> of these uses would need to be examined in detail.
>
> A detailed look through the ROMIO code appears to be needed to  
> determine which variables need to be 64 bits and replace them with a  
> new unsigned 64 bit typedef.  While I haven't investigated solution  
> 1 yet, it seems likely that it too will require a similar scrub of  
> at least the ROMIO code.
>
> Here is evidence of the problems with changing MPI_Aint to be 64 bits:
>
> 1.  The mpi/mpich2/configure.in file sets up MPI_Aint to be the same  
> size as a pointer, and exports this size to other config files.
>
> 2.  MPI_Aint is signed.  Making it unsigned would help, but I don't  
> think that is allowed, correct?
>
> 3.  If I force MPI_Aint to be 64 bits (signed) in this config file  
> as follows:
>
> MPI_SIZEOF_AINT=8
> export MPI_SIZEOF_AINT
> MPI_AINT="long long"
>
> the MPICH compilation has warnings similar to the following:
>
> mpi/mpich2/src/mpid/common/datatype/mpid_segment.c:659: warning:  
> cast from pointer to integer of different size
>
> These places will be problematic because casting a 32 bit pointer to
> a 64 bit signed integer produces incorrect results when the high-order
> bit of the pointer is set (the address is at or above 2GB): the
> high-order bit is sign-extended, and the resulting integer is a large
> negative number.  Some code change will be required in each of these
> places to make this work.  Here is one example:
>
> mpi/mpich2/src/mpid/common/datatype/mpid_segment.c:659: warning:  
> cast from pointer to integer of different size
>
> mpid/common/datatype/mpid_dataloop.h:#define DLOOP_Offset     MPI_Aint
>
>     633 static int MPID_Segment_contig_flatten(DLOOP_Offset *blocks_p,
>     634                                        DLOOP_Type el_type,
>     635                                        DLOOP_Offset rel_off,
>     636                                        void *bufp,
>     637                                        void *v_paramp)
>     638 {
>     639     int index, el_size;
>     640     DLOOP_Offset size;
>     641     struct MPID_Segment_piece_params *paramp = v_paramp;
>     642     MPIDI_STATE_DECL(MPID_STATE_MPID_SEGMENT_CONTIG_FLATTEN);
>     643
>     644     MPIDI_FUNC_ENTER(MPID_STATE_MPID_SEGMENT_CONTIG_FLATTEN);
>     645
>     646     el_size = MPID_Datatype_get_basic_size(el_type);
>     647     size = *blocks_p * (DLOOP_Offset) el_size;
>     648     index = paramp->u.flatten.index;
>     649
>     650 #ifdef MPID_SP_VERBOSE
>     651     MPIU_dbg_printf("\t[contig flatten: index = %d, loc = (%x + %x) = %x, size = %d]\n",
>     652                     index,
>     653                     (unsigned) bufp,
>     654                     (unsigned) rel_off,
>     655                     (unsigned) bufp + rel_off,
>     656                     (int) size);
>     657 #endif
>     658
>     659     if (index > 0 && ((DLOOP_Offset) bufp + rel_off) ==
>     660         ((paramp->u.flatten.offp[index - 1]) +
>     661          (DLOOP_Offset) paramp->u.flatten.sizep[index - 1]))
>     662     {
>     663         /* add this size to the last vector rather than using up another one */
>     664         paramp->u.flatten.sizep[index - 1] += size;
>     665     }
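
One way the comparison at line 659 could go wrong with a 64-bit MPI_Aint and 32-bit pointers, as a sketch only. Addresses are simulated with integers so it runs on any host, and offset64_t plus the function names are invented for this example; it is not the actual MPICH code path. The previous piece ends exactly at the 2GB boundary, and the new piece starts there:

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t offset64_t;  /* hypothetical 64-bit DLOOP_Offset */

/* What a cast from a 32-bit pointer to a signed 64-bit integer yields
 * when the compiler sign-extends (simulated here with integers). */
static offset64_t from_ptr32_sign_extended(uint32_t addr)
{
    return (offset64_t) (int32_t) addr;
}

/* Conversion through an unsigned type of pointer width: zero-extends. */
static offset64_t from_ptr32_zero_extended(uint32_t addr)
{
    return (offset64_t) addr;
}

/* Does a new piece starting at `start` directly follow the previous one? */
static int follows_previous(offset64_t start, offset64_t prev_off,
                            offset64_t prev_size)
{
    assert(prev_size >= 0);
    return start == prev_off + prev_size;
}
```

With prev_off = 0x7FFFFFF0 and prev_size = 0x10, the 64-bit sum is 2147483648. A zero-extended start address of 0x80000000 matches it, so the pieces are merged; the sign-extended value is -2147483648, so the check fails and an extra vector entry is used.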
>
> 4.  Here is another example:
>
> #define MPID_IOV         struct iovec
> #define MPID_IOV_LEN     iov_len
> #define MPID_IOV_BUF     iov_base
>
>     void *iov_base;     /* Pointer to data.  */
>
> In src/mpid/common/datatype/gen_type_struct.c:
>
>     DLOOP_Offset *tmp_disps, bytes;
>     MPID_IOV *iov_array;
>
>     for (i=0; i < nr_blks; i++)
>     {
>         tmp_blklens[i]  = iov_array[i].MPID_IOV_LEN;
>         tmp_disps[i] = (DLOOP_Offset) iov_array[i].MPID_IOV_BUF;
>     }
>
> The "tmp_disps[i] =" line casts a void* to an MPI_Aint, possibly
> causing sign extension and resulting in an incorrect offset.
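
One possible shape for the per-site fix, as a sketch: route every pointer-to-MPI_Aint conversion through uintptr_t so that a narrower pointer is zero-extended. PTR_TO_AINT, AINT_TO_PTR, aint64_t and fill_disps are hypothetical names used only for illustration, not existing MPICH macros or functions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t aint64_t;  /* hypothetical 64-bit MPI_Aint */

/* Hypothetical helpers: go through uintptr_t, an unsigned integer of
 * pointer width, so widening to 64 bits never sign-extends. */
#define PTR_TO_AINT(p)  ((aint64_t) (uintptr_t) (p))
#define AINT_TO_PTR(a)  ((void *) (uintptr_t) (a))

/* Fill displacements from an array of buffer pointers, in the spirit of
 * the tmp_disps loop quoted above. */
static void fill_disps(aint64_t *disps, void *const *bufs, int n)
{
    assert(disps != NULL && bufs != NULL);
    for (int i = 0; i < n; i++)
        disps[i] = PTR_TO_AINT(bufs[i]);
}
```

Keeping the conversion in one macro pair also gives a single place to audit, rather than a bare cast at each of the sites the compiler warns about.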