[dcmf] 32-bit ROMIO: What is the best solution?
Rob Ross
rross at mcs.anl.gov
Fri Feb 8 15:06:29 CST 2008
One little comment: MPI_Aint must be signed. MPI explicitly allows
negative extents (which are stored in MPI_Aint types).
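
For instance, a resized type with a negative extent (a quick sketch,
assuming 4-byte ints) only works because the MPI_Aint extent argument
can hold negative values:

    #include <mpi.h>
    /* Resize MPI_INT so that successive elements step backwards in
       memory; the extent argument is an MPI_Aint and here holds -4. */
    MPI_Datatype backwards_int;
    MPI_Type_create_resized(MPI_INT, 0, (MPI_Aint) -4, &backwards_int);
    MPI_Type_commit(&backwards_int);
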
Rob
On Feb 8, 2008, at 2:41 PM, Bob Cernohous wrote:
>
> I'm going to forward parts of a technical discussion about 32-bit
> MPICH/ROMIO issues. This discussion has been going on privately
> for a couple weeks. I just wanted to open it up to the list for
> discussion and tracking.
>
> I'm not going to include all the technical information that's been
> passed on. If anyone wants to add more of the historical
> discussion, go for it.
>
> We all realize there are issues with MPI_Aint values in 32-bit
> implementations. Using signed 32-bit addresses along with 64-bit
> offsets can result in some pretty broken code. I've reproduced
> several problems on BGL/BGP/Linux. The problems are most obvious
> with ROMIO files > 2GB or virtual addresses > 2GB.
>
> We've begun to consider making MPI_Aint a 64-bit value as a solution.
>
> So, below are my selections of technical tidbits on this from Jeff
> Parker.
>
> ------------------------------------------
> Your (Rob Ross) document describing the MPI-IO file size limitations
> on 32-bit platforms offered two solutions:
>
> 1. Change internal ROMIO code to use 64-bit variables. <snip. Rob
> Ross can detail this if he'd like to discuss this further.>
>
> 2. Change the MPI_Aint data type to be 64-bit. I looked at this
> solution and it requires a scrub of all places throughout ROMIO and
> MPICH where MPI_Aint variables are used, ensuring calculations will
> have the correct result. However, it does fix everything, including
> the external MPI interfaces that return MPI_Aint values. Perhaps a
> variation of this is to code MPICH and ROMIO to handle either a 32-bit
> or a 64-bit MPI_Aint based on a compile flag, changing calculations
> to be correct for either size, and adding assertions when overflow
> occurs. We could have two MPICH libraries, one with 32-bit MPI_Aint
> and the other with 64-bit MPI_Aint. Applications that don't have
> large datatypes or large MPI-IO files can continue to use the 32-bit
> library, eliminating the risk of performance issues, while
> applications using large datatypes can use the 64-bit version.
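>
> A rough sketch of how the compile-flag selection might look
> (hypothetical flag name and definitions, not the actual MPICH
> configure output):
>
>     /* Hypothetical build-time choice of the MPI_Aint width. */
>     #ifdef MPICH_USE_64BIT_AINT
>     typedef long long MPI_Aint;   /* 64-bit library for large datatypes/files */
>     #else
>     typedef int MPI_Aint;         /* traditional 32-bit library */
>     #endif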
>
> One could argue that these MPI interfaces that return MPI_Aint
> values are not used by typical applications, or if they are used,
> they are only used on datatypes that fit in 31-bit sizes (up to
> 2GB), so those applications don't have a problem. However, if the
> 32-bit platform has more than 2GB of memory, the MPI_Aint values can
> go negative, which could also result in incorrect calculations or
> comparisons.
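>
> A minimal illustration of that sign problem (a sketch, assuming a
> 32-bit signed MPI_Aint and an address above 2GB):
>
>     #include <stdio.h>
>     int main(void)
>     {
>         unsigned long addr = 0x90000000UL;  /* an address above 2GB */
>         int aint = (int) addr;              /* stand-in for a 32-bit MPI_Aint */
>         printf("%d\n", aint);               /* large negative value on a
>                                                typical 32-bit system */
>         return 0;
>     }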
>
> Both solutions involve a significant amount of work. However, it
> appears that only solution 2 fixes all of the issues. Other 64-bit
> platforms, which have both a 64-bit MPI_Aint and 64-bit pointers, appear to
> work correctly. It seems that we must change MPI_Aint to be 64-bit
> and fix the calculations involving MPI_Aint and 32-bit addresses in
> order to fix the whole problem.
> ------------------------------------------
> Something else that comes to mind that may be related: Even though
> a Blue Gene/P compute node has 2GB of memory, virtual addresses may
> be arranged by CNK such that they are larger than 2GB. In some
> cases, when signed addresses are used with 64-bit offsets in
> calculations, this may produce incorrect results. Any scrub of the
> code should account for this as well.
> ------------------------------------------
> (Jeff's earlier investigation)
> ------------------------------------------
> Regarding the MPI-IO issue with large files, exposed by the b_eff_io
> testcase...
>
> I have done a brief investigation of solution 2, changing the
> MPI_Aint data type from 32 bits to 64 bits. <snip>
> Compile warnings appear in generic MPICH code, and a lot of work
> (including code changes) would need to be done to address these
> issues.
>
> It appears that the MPICH code expects MPI_Aint to be signed and be
> the same size as a pointer, and many sensitive code changes will be
> required throughout MPICH to allow them to be different. There are
> 1474 lines of code that explicitly reference MPI_Aint (typically
> declaring variables and function parameters of this type), and each
> of these uses would need to be examined in detail.
>
> A detailed look through the ROMIO code appears to be needed to
> determine which variables need to be 64 bits and to replace them with
> a new unsigned 64-bit typedef. While I haven't investigated solution
> 1 yet, it seems likely that it too will require a similar scrub of
> at least the ROMIO code.
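>
> For example, ROMIO could introduce something along these lines
> (hypothetical name, just to illustrate the idea):
>
>     /* Candidate ROMIO-internal typedef for byte counts and offsets,
>        independent of the MPI_Aint width. */
>     typedef unsigned long long ADIO_Offset64;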
>
> Here is evidence of the problems with changing MPI_Aint to be 64 bits:
>
> 1. The mpi/mpich2/configure.in file sets up MPI_Aint to be the same
> size as a pointer, and exports this size to other config files.
>
> 2. MPI_Aint is signed. Making it unsigned would help, but I don't
> think that is allowed, correct?
>
> 3. If I force MPI_Aint to be 64 bits (signed) in this config file
> as follows:
>
> MPI_SIZEOF_AINT=8
> export MPI_SIZEOF_AINT
> MPI_AINT="long long"
>
> the MPICH compilation has warnings similar to the following:
>
> mpi/mpich2/src/mpid/common/datatype/mpid_segment.c:659: warning:
> cast from pointer to integer of different size
>
> These places will be problematic because casting a 32-bit pointer to
> a 64-bit signed integer produces incorrect results when the high-
> order bit of the pointer is set (the address is larger than
> 2GB): the high-order bit is sign-extended, and the resulting
> integer is a large negative number. Some code change will be
> required in each of these places to make this work. Here is one
> example:
>
> mpi/mpich2/src/mpid/common/datatype/mpid_segment.c:659: warning:
> cast from pointer to integer of different size
>
> mpid/common/datatype/mpid_dataloop.h:#define DLOOP_Offset MPI_Aint
>
> 633 static int MPID_Segment_contig_flatten(DLOOP_Offset *blocks_p,
> 634 DLOOP_Type el_type,
> 635 DLOOP_Offset rel_off,
> 636 void *bufp,
> 637 void *v_paramp)
> 638 {
> 639 int index, el_size;
> 640 DLOOP_Offset size;
> 641 struct MPID_Segment_piece_params *paramp = v_paramp;
> 642 MPIDI_STATE_DECL(MPID_STATE_MPID_SEGMENT_CONTIG_FLATTEN);
> 643
> 644 MPIDI_FUNC_ENTER(MPID_STATE_MPID_SEGMENT_CONTIG_FLATTEN);
> 645
> 646 el_size = MPID_Datatype_get_basic_size(el_type);
> 647 size = *blocks_p * (DLOOP_Offset) el_size;
> 648 index = paramp->u.flatten.index;
> 649
> 650 #ifdef MPID_SP_VERBOSE
> 651 MPIU_dbg_printf("\t[contig flatten: index = %d, loc = (%x + %x) = %x, size = %d]\n",
> 652 index,
> 653 (unsigned) bufp,
> 654 (unsigned) rel_off,
> 655 (unsigned) bufp + rel_off,
> 656 (int) size);
> 657 #endif
> 658
> 659 if (index > 0 && ((DLOOP_Offset) bufp + rel_off) ==
> 660 ((paramp->u.flatten.offp[index - 1]) +
> 661 (DLOOP_Offset) paramp->u.flatten.sizep[index - 1]))
> 662 {
> 663 /* add this size to the last vector rather than using up another one */
> 664 paramp->u.flatten.sizep[index - 1] += size;
> 665 }
>
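> One possible repair for casts like the one on line 659 (a sketch,
> assuming C99 <stdint.h> is available) is to go through an unsigned,
> pointer-sized integer first, so a 32-bit pointer is zero-extended
> rather than sign-extended when DLOOP_Offset is wider than a pointer:
>
>     #include <mpi.h>
>     #include <stdint.h>
>
>     /* Hypothetical helper: convert a pointer to an MPI_Aint (i.e.
>        DLOOP_Offset) without sign extension by casting through the
>        unsigned, pointer-sized uintptr_t first. */
>     static inline MPI_Aint ptr_to_aint(const void *p)
>     {
>         return (MPI_Aint) (uintptr_t) p;
>     }
>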
> 4. Here is another example:
>
> #define MPID_IOV struct iovec
> #define MPID_IOV_LEN iov_len
> #define MPID_IOV_BUF iov_base
>
> void *iov_base; /* Pointer to data. */
>
> In src/mpid/common/datatype/gen_type_struct.c:
>
> DLOOP_Offset *tmp_disps, bytes;
> MPID_IOV *iov_array;
>
> for (i=0; i < nr_blks; i++)
> {
> tmp_blklens[i] = iov_array[i].MPID_IOV_LEN;
> tmp_disps[i] = (DLOOP_Offset) iov_array[i].MPID_IOV_BUF;
> }
>
> The "tmp_disps[i] =" line casts a void* to an MPI_Aint (possibly
> causing sign extension, resulting in an incorrect offset.
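>
> The same zero-extension idea would apply here (a sketch, assuming a
> cast through uintptr_t as above):
>
>     for (i = 0; i < nr_blks; i++)
>     {
>         tmp_blklens[i] = iov_array[i].MPID_IOV_LEN;
>         /* cast through uintptr_t so the pointer value is not
>            sign-extended into the wider DLOOP_Offset */
>         tmp_disps[i] = (DLOOP_Offset) (uintptr_t) iov_array[i].MPID_IOV_BUF;
>     }
>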
> _______________________________________________
> dcmf mailing list
> dcmf at lists.anl-external.org
> http://lists.anl-external.org/cgi-bin/mailman/listinfo/dcmf
> http://dcmf.anl-external.org/wiki