<br><font size=2 face="sans-serif">I'm going to forward parts of a technical
discussion about 32-bit MPICH/ROMIO issues. This discussion
has been going on privately for a couple of weeks. I just wanted to
open it up to the list for discussion and tracking.</font>
<br>
<br><font size=2 face="sans-serif">I'm not going to include all the technical
information that's been passed on. If anyone wants to add more of
the historical discussion, go for it.</font>
<br><font size=2 face="sans-serif"><br>
We all realize there are issues with MPI_Aint in 32-bit implementations.
Using signed 32-bit addresses along with 64-bit offsets can result
in some pretty broken code. I've reproduced several problems on BGL, BGP, and Linux.
The problems are most obvious with ROMIO files larger than 2GB or virtual
addresses above 2GB.</font>
<br>
<br><font size=2 face="sans-serif">We've begun to consider making MPI_Aint
a 64 bit value as a solution.</font>
<br>
<br><font size=2 face="sans-serif">So, below are my selections of technical
tidbits on this from Jeff Parker.</font>
<br>
<br><font size=2 face="sans-serif">------------------------------------------</font>
<br><font size=2 face="sans-serif">Your (Rob Ross) document describing
the MPI-IO file size limitations on 32-bit platforms offered two solutions:</font>
<br>
<br><font size=2 face="sans-serif">1. Change internal ROMIO code
to use 64-bit variables. [snip; Rob Ross can detail this
if he'd like to discuss it further.]</font>
<br>
<br><font size=2 face="sans-serif">2. Change the MPI_Aint data type
to be 64-bit. I looked at this solution and it requires a scrub of
all places throughout ROMIO and MPICH where MPI_Aint variables are used,
ensuring calculations will have the correct result. However, it does
fix everything, including the external MPI interfaces that return MPI_Aint
values. Perhaps a variation of this is to code MPICH and ROMIO to
handle either a 32-bit or a 64-bit MPI_Aint based on a compile flag, changing
calculations to be correct for either size, and adding assertions that fire when
overflow would occur. We could have two MPICH libraries, one with 32-bit
MPI_Aint and the other with 64-bit MPI_Aint. Applications that don't
have large datatypes or large MPI-IO files can continue to use the 32-bit
library, eliminating the risk of performance issues, while applications
using large datatypes can use the 64-bit version.</font>
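<br>
<br><font size=2 face="sans-serif">A minimal sketch of that compile-flag variation, not taken from MPICH or ROMIO. DEMO_AINT_64, demo_aint, and the helper names are made up for illustration. The flag selects the width, and the addition asserts instead of silently overflowing:</font>
<br>
```c
#include <assert.h>
#include <stdint.h>

/* DEMO_AINT_64 is a hypothetical build flag, not an MPICH option. */
#ifdef DEMO_AINT_64
typedef int64_t demo_aint;
#define DEMO_AINT_MAX INT64_MAX
#define DEMO_AINT_MIN INT64_MIN
#else
typedef int32_t demo_aint;
#define DEMO_AINT_MAX INT32_MAX
#define DEMO_AINT_MIN INT32_MIN
#endif

/* Returns 1 if a + b is representable in demo_aint, 0 otherwise.
 * The check is arranged so that it cannot overflow itself. */
int demo_aint_add_fits(demo_aint a, demo_aint b)
{
    if (b > 0)
        return a <= DEMO_AINT_MAX - b;
    return a >= DEMO_AINT_MIN - b;
}

/* Addition that trips an assertion when the result would not fit. */
demo_aint demo_aint_add(demo_aint a, demo_aint b)
{
    assert(demo_aint_add_fits(a, b));
    return a + b;
}
```
<br><font size=2 face="sans-serif">Building the same source with and without -DDEMO_AINT_64 would give the two library flavors; the assertion only matters in the 32-bit flavor for the sizes discussed here.</font>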
<br>
<br><font size=2 face="sans-serif">One could argue that these MPI interfaces
that return MPI_Aint values are not used by typical applications, or if
they are used, they are only used on datatypes that fit in 31-bit sizes
(up to 2GB), so those applications don't have a problem. However,
if the 32-bit platform has more than 2GB of memory, the MPI_Aint values
can go negative, which could also result in incorrect calculations or comparisons.</font>
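<br>
<br><font size=2 face="sans-serif">To illustrate the comparison problem, here is a small sketch. It models addresses as 32-bit values rather than using real pointers, and all demo_ names are hypothetical. It asks whether one address lies below another, once on the signed view and once on the unsigned bits:</font>
<br>
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the 32 bits of an address as a signed 32-bit value,
 * which is how a signed 32-bit MPI_Aint would hold it. */
int32_t demo_bits_to_int32(uint32_t addr_bits)
{
    int32_t value;
    memcpy(&value, &addr_bits, sizeof value);
    return value;
}

/* "Is address a below address b?" evaluated on the signed view. */
int demo_below_signed(uint32_t a, uint32_t b)
{
    return demo_bits_to_int32(a) < demo_bits_to_int32(b);
}

/* The same question evaluated on the unsigned address bits. */
int demo_below_unsigned(uint32_t a, uint32_t b)
{
    return a < b;
}

/* Length in bytes of the address range [lo, hi). */
uint32_t demo_range_length(uint32_t lo, uint32_t hi)
{
    assert(!demo_below_unsigned(hi, lo));   /* requires lo <= hi */
    return hi - lo;
}
```
<br><font size=2 face="sans-serif">For a buffer that straddles the 2GB line, say from 0x7FFFFFF0 to 0x80000010, the signed comparison reports the end as lying below the start, while the unsigned one orders the two correctly.</font>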
<br>
<br><font size=2 face="sans-serif">Both solutions involve a significant
amount of work. However, it appears that only solution 2 fixes all
of the issues. Platforms that have both a 64-bit MPI_Aint
and 64-bit pointers appear to work correctly. It seems that we must
change MPI_Aint to be 64-bit and fix the calculations involving MPI_Aint
and 32-bit addresses in order to fix the whole problem.</font>
<br><font size=2 face="sans-serif">------------------------------------------</font>
<br><font size=2 face="sans-serif">Something else that comes to mind that
may be related: Even though a Blue Gene/P compute node has 2GB of
memory, virtual addresses may be arranged by CNK such that they are larger
than 2GB. When such addresses are held in signed variables and combined with
64-bit offsets, calculations may produce incorrect results. Any
scrub of the code should account for this as well.</font>
<br><font size=2 face="sans-serif">------------------------------------------</font>
<br><font size=2 face="sans-serif">(Jeff's earlier investigation)</font>
<br><font size=2 face="sans-serif">------------------------------------------</font>
<br><font size=2 face="sans-serif">Regarding the MPI-IO issue with large
files, exposed by the b_eff_io testcase...</font>
<br>
<br><font size=2 face="sans-serif">I have done a brief investigation of
solution 2, changing the MPI_Aint data type from 32 bits to 64 bits. [snip]</font>
<br><font size=2 face="sans-serif">Compile warnings appear in generic MPICH
code, and a lot of work (including code changes) would need to be done
to address these issues.</font>
<br>
<br><font size=2 face="sans-serif">It appears that the MPICH code expects
MPI_Aint to be signed and be the same size as a pointer, and many sensitive
code changes will be required throughout MPICH to allow them to be different.
There are 1474 lines of code that explicitly reference MPI_Aint (typically
declaring variables and function parameters of this type), and each of
these uses would need to be examined in detail.</font>
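<br>
<br><font size=2 face="sans-serif">One possible aid during such a scrub is to state the expectations as compile-time checks. This is only a sketch: demo_aint is a hypothetical stand-in for a 64-bit MPI_Aint, and static_assert assumes a C11 compiler.</font>
<br>
```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

typedef int64_t demo_aint;   /* hypothetical stand-in for a 64-bit MPI_Aint */

/* The type must be wide enough to hold any pointer value. */
static_assert(sizeof(demo_aint) >= sizeof(void *),
              "demo_aint must be able to hold a pointer value");

/* The surrounding code is assumed to rely on the type being signed. */
static_assert((demo_aint)-1 < 0, "demo_aint is expected to be signed");

/* Returns 1 when demo_aint is wider than a pointer. That is the
 * configuration in which pointer-to-integer casts need a closer look. */
int demo_aint_wider_than_pointer(void)
{
    return sizeof(demo_aint) > sizeof(void *);
}
```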
<br>
<br><font size=2 face="sans-serif">A detailed look through the ROMIO code
appears to be needed to determine which variables need to be 64 bits and
to convert them to a new unsigned 64-bit typedef. While I haven't
investigated solution 1 yet, it seems likely that it too will require a
similar scrub of at least the ROMIO code.</font>
<br>
<br><font size=2 face="sans-serif">Here is evidence of the problems with
changing MPI_Aint to be 64 bits:</font>
<br>
<br><font size=2 face="sans-serif">1. The mpi/mpich2/configure.in
file sets up MPI_Aint to be the same size as a pointer, and exports this
size to other config files.</font>
<br>
<br><font size=2 face="sans-serif">2. MPI_Aint is signed. Making
it unsigned would help, but I don't think that is allowed, correct?</font>
<br>
<br><font size=2 face="sans-serif">3. If I force MPI_Aint to be 64
bits (signed) in this config file as follows:</font>
<br>
<br><font size=2 face="sans-serif">MPI_SIZEOF_AINT=8</font>
<br><font size=2 face="sans-serif">export MPI_SIZEOF_AINT</font>
<br><font size=2 face="sans-serif">MPI_AINT="long long"</font>
<br>
<br><font size=2 face="sans-serif">the MPICH compilation has warnings similar
to the following:</font>
<br>
<br><font size=2 face="sans-serif">mpi/mpich2/src/mpid/common/datatype/mpid_segment.c:659:
warning: cast from pointer to integer of different size</font>
<br>
<br><font size=2 face="sans-serif">These places will be problematic because
casting a 32-bit pointer to a 64-bit signed integer produces incorrect
results when the high-order bit of the pointer is set (the address is 2GB
or above). The high-order bit is sign-extended, and the resulting integer
is a large negative number. Some code change will be required in
each of these places to make this work. Here is one example:</font>
<br>
<br><font size=2 face="sans-serif">mpi/mpich2/src/mpid/common/datatype/mpid_segment.c:659:
warning: cast from pointer to integer of different size</font>
<br>
<br><font size=2 face="sans-serif">mpid/common/datatype/mpid_dataloop.h:#define
DLOOP_Offset MPI_Aint</font>
<br>
<br><font size=2 face="sans-serif"> 633 static int MPID_Segment_contig_flatten(DLOOP_Offset
*blocks_p,</font>
<br><font size=2 face="sans-serif"> 634
DLOOP_Type el_type,</font>
<br><font size=2 face="sans-serif"> 635
DLOOP_Offset rel_off,</font>
<br><font size=2 face="sans-serif"> 636
void *bufp,</font>
<br><font size=2 face="sans-serif"> 637
void *v_paramp)</font>
<br><font size=2 face="sans-serif"> 638 {</font>
<br><font size=2 face="sans-serif"> 639 int
index, el_size;</font>
<br><font size=2 face="sans-serif"> 640 DLOOP_Offset
size;</font>
<br><font size=2 face="sans-serif"> 641 struct
MPID_Segment_piece_params *paramp = v_paramp;</font>
<br><font size=2 face="sans-serif"> 642 MPIDI_STATE_DECL(MPID_STATE_MPID_SEGMENT_CONTIG_FLATTEN);</font>
<br><font size=2 face="sans-serif"> 643 </font>
<br><font size=2 face="sans-serif"> 644 MPIDI_FUNC_ENTER(MPID_STATE_MPID_SEGMENT_CONTIG_FLATTEN);</font>
<br><font size=2 face="sans-serif"> 645 </font>
<br><font size=2 face="sans-serif"> 646 el_size
= MPID_Datatype_get_basic_size(el_type);</font>
<br><font size=2 face="sans-serif"> 647 size
= *blocks_p * (DLOOP_Offset) el_size;</font>
<br><font size=2 face="sans-serif"> 648 index
= paramp->u.flatten.index;</font>
<br><font size=2 face="sans-serif"> 649 </font>
<br><font size=2 face="sans-serif"> 650 #ifdef MPID_SP_VERBOSE</font>
<br><font size=2 face="sans-serif"> 651 MPIU_dbg_printf("\t[contig
flatten: index = %d, loc = (%x + %x) = %x, size = %d]\n",</font>
<br><font size=2 face="sans-serif"> 652
index,</font>
<br><font size=2 face="sans-serif"> 653
(unsigned) bufp,</font>
<br><font size=2 face="sans-serif"> 654
(unsigned) rel_off,</font>
<br><font size=2 face="sans-serif"> 655
(unsigned) bufp + rel_off,</font>
<br><font size=2 face="sans-serif"> 656
(int) size);</font>
<br><font size=2 face="sans-serif"> 657 #endif</font>
<br><font size=2 face="sans-serif"> 658 </font>
<br><font size=2 face="sans-serif"> 659 if (index
> 0 && ((DLOOP_Offset) bufp + rel_off) ==</font>
<br><font size=2 face="sans-serif"> 660
((paramp->u.flatten.offp[index - 1]) +</font>
<br><font size=2 face="sans-serif"> 661
(DLOOP_Offset) paramp->u.flatten.sizep[index - 1]))</font>
<br><font size=2 face="sans-serif"> 662 {</font>
<br><font size=2 face="sans-serif"> 663
/* add this size to the last vector rather than using up another
one */</font>
<br><font size=2 face="sans-serif"> 664
paramp->u.flatten.sizep[index - 1] += size;</font>
<br><font size=2 face="sans-serif"> 665 }</font>
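<br>
<br><font size=2 face="sans-serif">To make the effect on the "(DLOOP_Offset) bufp + rel_off" term at line 659 concrete, here is a small model. It is a sketch under assumptions: the 32-bit pointer is represented by its bits in a uint32_t rather than by a real pointer, demo_offset stands in for a 64-bit DLOOP_Offset, and all demo_ names are hypothetical.</font>
<br>
```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

typedef int64_t demo_offset;   /* stand-in for a 64-bit DLOOP_Offset */

static_assert(sizeof(demo_offset) == 8, "the model assumes a 64-bit offset");

/* Models a direct cast of a 32-bit pointer to a wider signed integer
 * when the compiler sign-extends: bit 31 is treated as the sign bit. */
demo_offset demo_cast_sign_extend(uint32_t ptr_bits)
{
    return (ptr_bits & UINT32_C(0x80000000))
               ? (demo_offset)ptr_bits - INT64_C(0x100000000)
               : (demo_offset)ptr_bits;
}

/* Models a cast that goes through an unsigned, pointer-sized integer
 * first, so the address value is preserved. */
demo_offset demo_cast_zero_extend(uint32_t ptr_bits)
{
    return (demo_offset)ptr_bits;
}

/* The location term: converted buffer address plus relative offset. */
demo_offset demo_location(demo_offset base, demo_offset rel_off)
{
    return base + rel_off;
}
```
<br><font size=2 face="sans-serif">With a buffer at 0x90000000 and a relative offset of 0x10, the sign-extended variant yields a negative location, while the zero-extended variant yields 0x90000010. Below 2GB the two agree.</font>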
<br>
<br><font size=2 face="sans-serif">4. Here is another example:</font>
<br>
<br><font size=2 face="sans-serif">#define MPID_IOV
struct iovec</font>
<br><font size=2 face="sans-serif">#define MPID_IOV_LEN iov_len</font>
<br><font size=2 face="sans-serif">#define MPID_IOV_BUF iov_base</font>
<br>
<br><font size=2 face="sans-serif"> void *iov_base;
/* Pointer to data. */</font>
<br>
<br><font size=2 face="sans-serif">In src/mpid/common/datatype/gen_type_struct.c:</font>
<br>
<br><font size=2 face="sans-serif"> DLOOP_Offset *tmp_disps,
bytes;</font>
<br><font size=2 face="sans-serif"> MPID_IOV *iov_array;</font>
<br>
<br><font size=2 face="sans-serif"> for (i=0; i < nr_blks;
i++)</font>
<br><font size=2 face="sans-serif"> {</font>
<br><font size=2 face="sans-serif"> tmp_blklens[i]
= iov_array[i].MPID_IOV_LEN;</font>
<br><font size=2 face="sans-serif"> tmp_disps[i]
= (DLOOP_Offset) iov_array[i].MPID_IOV_BUF;</font>
<br><font size=2 face="sans-serif"> } </font>
<br>
<br><font size=2 face="sans-serif">The "tmp_disps[i] =" line
casts a void* to an MPI_Aint, possibly causing sign extension and resulting
in an incorrect offset.</font>
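<br>
<br><font size=2 face="sans-serif">One way such a loop could avoid sign extension is to convert each pointer through uintptr_t before widening it. The following is a sketch, not the MPICH code: demo_offset stands in for a 64-bit DLOOP_Offset, the element type of tmp_blklens is an assumption, and the demo_ function names are hypothetical. It uses struct iovec from the POSIX header sys/uio.h and targets platforms with 32-bit pointers.</font>
<br>
```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <sys/uio.h>  /* struct iovec (POSIX) */

typedef int64_t demo_offset;   /* stand-in for a 64-bit DLOOP_Offset */

static_assert(sizeof(uintptr_t) <= sizeof(demo_offset),
              "demo_offset must be able to hold a pointer value");

/* Convert a pointer to a 64-bit offset. Going through uintptr_t, which
 * is unsigned, means a 32-bit address is zero-extended, not sign-extended. */
demo_offset demo_ptr_to_offset(const void *p)
{
    return (demo_offset)(uintptr_t)p;
}

/* Counterpart of the loop above: copy block lengths and base addresses
 * out of an iovec array into 64-bit arrays. */
void demo_fill_disps(const struct iovec *iov_array, int nr_blks,
                     demo_offset *tmp_blklens, demo_offset *tmp_disps)
{
    for (int i = 0; i < nr_blks; i++) {
        tmp_blklens[i] = (demo_offset)iov_array[i].iov_len;
        tmp_disps[i]   = demo_ptr_to_offset(iov_array[i].iov_base);
    }
}
```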
<br>