<br><font size=2 face="sans-serif">I'm going to forward parts of a technical

discussion about 32-bit MPICH/ROMIO issues. &nbsp; &nbsp;This discussion

has been going on privately for a couple weeks. &nbsp;I just wanted to

open it up to the list for discussion and tracking.</font>

<br>

<br><font size=2 face="sans-serif">I'm not going to include all the technical

information that's been passed on. &nbsp;If anyone wants to add more of

the historical discussion, go for it.</font>

<br><font size=2 face="sans-serif"><br>

We all realize there are issues with MPI_Aint in 32-bit implementations.

&nbsp;Using signed 32-bit addresses along with 64-bit offsets can result

in some pretty broken code. &nbsp;I've reproduced several problems on BGL/BGP/Linux.

&nbsp; The problems are most obvious with ROMIO files &gt; 2GB or virtual

addresses &gt; 2GB.</font>

<br>

<br><font size=2 face="sans-serif">We've begun to consider making MPI_Aint

a 64 bit value as a solution.</font>

<br>

<br><font size=2 face="sans-serif">So, below are my selections of technical

tidbits on this from Jeff Parker.</font>

<br>

<br><font size=2 face="sans-serif">------------------------------------------</font>

<br><font size=2 face="sans-serif">Your (Rob Ross) document describing

the MPI-IO file size limitations on 32-bit platforms offered two solutions:</font>

<br>

<br><font size=2 face="sans-serif">1. &nbsp;Change internal ROMIO code

to use 64-bit variables. &nbsp;&lt;snip. &nbsp;Rob Ross can detail this

if he'd like to discuss this further.&gt;</font>

<br>

<br><font size=2 face="sans-serif">2. &nbsp;Change the MPI_Aint data type

to be 64-bit. &nbsp;I looked at this solution and it requires a scrub of

all places throughout ROMIO and MPICH where MPI_Aint variables are used,

to ensure calculations will have the correct result. &nbsp;However, it does

fix everything, including the external MPI interfaces that return MPI_Aint

values. &nbsp;Perhaps a variation of this is to code MPICH and ROMIO to

handle either a 32-bit or a 64-bit MPI_Aint based on a compile flag, changing

calculations to be correct for either size, and adding assertions when

overflow occurs. &nbsp;We could have two MPICH libraries, one with 32-bit

MPI_Aint and the other with 64-bit MPI_Aint. &nbsp;Applications that don't

have large datatypes or large MPI-IO files can continue to use the 32-bit

library, eliminating the risk of performance issues, while applications

using large datatypes can use the 64-bit version.</font>

<br>
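
<br><font size=2 face="sans-serif">To make the compile-flag variation concrete, here is a minimal sketch, not MPICH code. &nbsp;The flag DEMO_AINT_64, the type demo_aint, and both helper functions are hypothetical names made up for illustration. &nbsp;It selects the integer width at build time and asserts before an addition that would overflow.</font>

```c
#include <assert.h>
#include <stdint.h>

/* DEMO_AINT_64 is a hypothetical build flag, not an actual MPICH option. */
#ifdef DEMO_AINT_64
typedef int64_t demo_aint;
#define DEMO_AINT_MAX INT64_MAX
#define DEMO_AINT_MIN INT64_MIN
#else
typedef int32_t demo_aint;
#define DEMO_AINT_MAX INT32_MAX
#define DEMO_AINT_MIN INT32_MIN
#endif

/* Returns 1 if a + b is representable in demo_aint, 0 otherwise.
 * The check is done without performing the overflowing addition. */
int demo_aint_add_fits(demo_aint a, demo_aint b)
{
    if (b > 0 && a > DEMO_AINT_MAX - b) return 0;
    if (b < 0 && a < DEMO_AINT_MIN - b) return 0;
    return 1;
}

/* Adds two values, asserting first that the sum does not overflow. */
demo_aint demo_aint_add(demo_aint a, demo_aint b)
{
    assert(demo_aint_add_fits(a, b));
    return a + b;
}
```

<br><font size=2 face="sans-serif">In this sketch, building with -DDEMO_AINT_64 selects the 64-bit variant. &nbsp;A sum such as 0x7FFFFFF0 + 32 would trip the assertion in the default 32-bit build but not in the 64-bit build.</font>

<br>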

<br><font size=2 face="sans-serif">One could argue that these MPI interfaces

that return MPI_Aint values are not used by typical applications, or if

they are used, they are only used on datatypes that fit in 31-bit sizes

(up to 2GB), so those applications don't have a problem. &nbsp;However,

if the 32-bit platform has more than 2GB of memory, the MPI_Aint values

can go negative, which could also result in incorrect calculations or comparisons.</font>
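
<br>

<br><font size=2 face="sans-serif">As a small illustration of the comparison problem (a sketch only; demo_aint32 and the helper names are hypothetical stand-ins, not MPICH identifiers), consider two addresses that straddle the 2GB boundary. &nbsp;Held in a signed 32-bit type, the higher address becomes negative and compares as lower.</font>

```c
#include <stdint.h>

/* Hypothetical stand-in for a signed 32-bit MPI_Aint. */
typedef int32_t demo_aint32;

/* Reinterprets a 32-bit address as a signed 32-bit value, as a
 * pointer-to-integer cast would on a 32-bit platform. The conversion
 * is implementation-defined for values above INT32_MAX; common
 * two's-complement compilers wrap to a negative number. */
demo_aint32 demo_addr_as_signed(uint32_t addr)
{
    return (demo_aint32) addr;
}

/* Returns 1 if lo compares below hi when both are held as signed values. */
int demo_signed_below(uint32_t lo, uint32_t hi)
{
    return demo_addr_as_signed(lo) < demo_addr_as_signed(hi);
}

/* Returns 1 if lo compares below hi when both stay unsigned. */
int demo_unsigned_below(uint32_t lo, uint32_t hi)
{
    return lo < hi;
}
```

<br><font size=2 face="sans-serif">With lo = 0x7FFFFFF0 and hi = 0x80000010, the unsigned comparison says lo is below hi, while the signed one says the opposite on typical compilers, because hi has become negative.</font>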

<br>

<br><font size=2 face="sans-serif">Both solutions involve a significant

amount of work. &nbsp;However, it appears that only solution 2 fixes all

of the issues. &nbsp;Other 64-bit platforms having both 64-bit MPI_Aint

and 64-bit pointers appear to work correctly. &nbsp;It seems that we must

change MPI_Aint to be 64-bit and fix the calculations involving MPI_Aint

and 32-bit addresses in order to fix the whole problem.</font>

<br><font size=2 face="sans-serif">------------------------------------------</font>

<br><font size=2 face="sans-serif">Something else that comes to mind that

may be related: &nbsp;Even though a Blue Gene/P compute node has 2GB of

memory, virtual addresses may be arranged by CNK such that they are larger

than 2GB. &nbsp;In some cases, when signed addresses are used with 64 bit

offsets in calculations, this may produce incorrect results. &nbsp;Any

scrub of the code should account for this as well.</font>

<br><font size=2 face="sans-serif">------------------------------------------</font>

<br><font size=2 face="sans-serif">(Jeff's earlier investigation)</font>

<br><font size=2 face="sans-serif">------------------------------------------</font>

<br><font size=2 face="sans-serif">Regarding the MPI-IO issue with large

files, exposed by the b_eff_io testcase...</font>

<br>

<br><font size=2 face="sans-serif">I have done a brief investigation of

solution 2, changing the MPI_Aint data type from 32 bits to 64 bits. &nbsp;&lt;snip&gt;</font>

<br><font size=2 face="sans-serif">Compile warnings appear in generic MPICH

code, and a lot of work (including code changes) would need to be done

to address these issues.</font>

<br>

<br><font size=2 face="sans-serif">It appears that the MPICH code expects

MPI_Aint to be signed and the same size as a pointer, and many sensitive

code changes will be required throughout MPICH to allow the two sizes to differ.

&nbsp;There are 1474 lines of code that explicitly reference MPI_Aint (typically

declaring variables and function parameters of this type), and each of

these uses would need to be examined in detail. &nbsp;</font>

<br>

<br><font size=2 face="sans-serif">A detailed pass through the ROMIO code

appears to be needed to determine which variables must be 64 bits wide and

to replace their types with a new unsigned 64-bit typedef. &nbsp;While I haven't

investigated solution 1 yet, it seems likely that it too will require a

similar scrub of at least the ROMIO code. &nbsp; </font>

<br>

<br><font size=2 face="sans-serif">Here is evidence of the problems with

changing MPI_Aint to be 64 bits:</font>

<br>

<br><font size=2 face="sans-serif">1. &nbsp;The mpi/mpich2/configure.in

file sets up MPI_Aint to be the same size as a pointer, and exports this

size to other config files.</font>

<br>

<br><font size=2 face="sans-serif">2. &nbsp;MPI_Aint is signed. &nbsp;Making

it unsigned would help, but I don't think that is allowed, correct?</font>

<br>

<br><font size=2 face="sans-serif">3. &nbsp;If I force MPI_Aint to be 64

bits (signed) in this config file as follows:</font>

<br>

<br><font size=2 face="sans-serif">MPI_SIZEOF_AINT=8</font>

<br><font size=2 face="sans-serif">export MPI_SIZEOF_AINT</font>

<br><font size=2 face="sans-serif">MPI_AINT=&quot;long long&quot;</font>

<br>

<br><font size=2 face="sans-serif">the MPICH compilation has warnings similar

to the following:</font>

<br>

<br><font size=2 face="sans-serif">mpi/mpich2/src/mpid/common/datatype/mpid_segment.c:659:

warning: cast from pointer to integer of different size</font>

<br>

<br><font size=2 face="sans-serif">These places will be problematic because

casting a 32 bit pointer to a 64 bit signed integer produces incorrect

results when the high-order bit of the pointer is set (the address is larger

than 2GB)...the high-order bit is sign-extended, and the resulting integer

is a large negative number. &nbsp;Some code change will be required in

each of these places to make this work. &nbsp;Here is one example:</font>

<br>

<br><font size=2 face="sans-serif">mpi/mpich2/src/mpid/common/datatype/mpid_segment.c:659:

warning: cast from pointer to integer of different size</font>

<br>

<br><font size=2 face="sans-serif">mpid/common/datatype/mpid_dataloop.h:#define

DLOOP_Offset &nbsp; &nbsp; MPI_Aint</font>

<br>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 633 static int MPID_Segment_contig_flatten(DLOOP_Offset

*blocks_p,</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 634 &nbsp; &nbsp; &nbsp;

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;DLOOP_Type el_type,</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 635 &nbsp; &nbsp; &nbsp;

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;DLOOP_Offset rel_off,</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 636 &nbsp; &nbsp; &nbsp;

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;void *bufp,</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 637 &nbsp; &nbsp; &nbsp;

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;void *v_paramp)</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 638 {</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 639 &nbsp; &nbsp; int

index, el_size;</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 640 &nbsp; &nbsp; DLOOP_Offset

size;</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 641 &nbsp; &nbsp; struct

MPID_Segment_piece_params *paramp = v_paramp;</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 642 &nbsp; &nbsp; MPIDI_STATE_DECL(MPID_STATE_MPID_SEGMENT_CONTIG_FLATTEN);</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 643 &nbsp; &nbsp; </font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 644 &nbsp; &nbsp; MPIDI_FUNC_ENTER(MPID_STATE_MPID_SEGMENT_CONTIG_FLATTEN);</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 645 </font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 646 &nbsp; &nbsp; el_size

= MPID_Datatype_get_basic_size(el_type);</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 647 &nbsp; &nbsp; size

= *blocks_p * (DLOOP_Offset) el_size;</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 648 &nbsp; &nbsp; index

= paramp-&gt;u.flatten.index;</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 649 </font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 650 #ifdef MPID_SP_VERBOSE</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 651 &nbsp; &nbsp; MPIU_dbg_printf(&quot;\t[contig

flatten: index = %d, loc = (%x + %x) = %x, size = %d]\n&quot;,</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 652 &nbsp; &nbsp; &nbsp;

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; index,</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 653 &nbsp; &nbsp; &nbsp;

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (unsigned) bufp,</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 654 &nbsp; &nbsp; &nbsp;

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (unsigned) rel_off,</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 655 &nbsp; &nbsp; &nbsp;

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (unsigned) bufp + rel_off,</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 656 &nbsp; &nbsp; &nbsp;

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (int) size);</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 657 #endif</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 658 </font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 659 &nbsp; &nbsp; if (index

&gt; 0 &amp;&amp; ((DLOOP_Offset) bufp + rel_off) ==</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 660 &nbsp; &nbsp; &nbsp;

&nbsp; ((paramp-&gt;u.flatten.offp[index - 1]) +</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 661 &nbsp; &nbsp; &nbsp;

&nbsp; &nbsp;(DLOOP_Offset) paramp-&gt;u.flatten.sizep[index - 1]))</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 662 &nbsp; &nbsp; {</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 663 &nbsp; &nbsp; &nbsp;

&nbsp; /* add this size to the last vector rather than using up another

one */</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 664 &nbsp; &nbsp; &nbsp;

&nbsp; paramp-&gt;u.flatten.sizep[index - 1] += size;</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; 665 &nbsp; &nbsp; }</font>

<br>

<br><font size=2 face="sans-serif">4. &nbsp;Here is another example:</font>

<br>

<br><font size=2 face="sans-serif">#define MPID_IOV &nbsp; &nbsp; &nbsp;

&nbsp; struct iovec</font>

<br><font size=2 face="sans-serif">#define MPID_IOV_LEN &nbsp; &nbsp; iov_len</font>

<br><font size=2 face="sans-serif">#define MPID_IOV_BUF &nbsp; &nbsp; iov_base</font>

<br>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; void *iov_base; &nbsp;

&nbsp; /* Pointer to data. &nbsp;*/</font>

<br>

<br><font size=2 face="sans-serif">In src/mpid/common/datatype/gen_type_struct.c:</font>

<br>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; DLOOP_Offset *tmp_disps,

bytes;</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; MPID_IOV *iov_array;</font>

<br>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; for (i=0; i &lt; nr_blks;

i++)</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; {</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; &nbsp; &nbsp; tmp_blklens[i]

&nbsp;= iov_array[i].MPID_IOV_LEN;</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; &nbsp; &nbsp; tmp_disps[i]

= (DLOOP_Offset) iov_array[i].MPID_IOV_BUF;</font>

<br><font size=2 face="sans-serif">&nbsp; &nbsp; } </font>

<br>

<br><font size=2 face="sans-serif">The &quot;tmp_disps[i] =&quot; line

casts a void* to an MPI_Aint, possibly causing sign extension and resulting

in an incorrect offset.</font>

<br>
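
<br><font size=2 face="sans-serif">For reference, here is a minimal sketch of the sign-extension effect described above and one possible way around it. &nbsp;This is not the MPICH fix. &nbsp;The type demo_aint64 and the three functions are hypothetical names, and the 32-bit pointer is modeled as a uint32_t so the sketch behaves the same on a 64-bit host. &nbsp;The assumption is that going through an unsigned type (uintptr_t for a real pointer) before widening zero-extends rather than sign-extends.</font>

```c
#include <stdint.h>

/* Hypothetical stand-in for a signed 64-bit MPI_Aint. */
typedef int64_t demo_aint64;

/* Models the problem: a 32-bit address is treated as signed and then
 * widened, so an address above 2GB is sign-extended to a negative
 * value. The uint32_t to int32_t step is implementation-defined for
 * such values; common two's-complement compilers wrap. */
demo_aint64 demo_widen_signed(uint32_t addr32)
{
    return (demo_aint64) (int32_t) addr32;
}

/* Widening from the unsigned value zero-extends, so the result stays
 * equal to the original address. */
demo_aint64 demo_widen_unsigned(uint32_t addr32)
{
    return (demo_aint64) addr32;
}

/* The same idea for a real pointer: convert through uintptr_t first.
 * On a 32-bit platform uintptr_t is 32 bits wide, so the second cast
 * zero-extends. */
demo_aint64 demo_ptr_to_aint(const void *p)
{
    return (demo_aint64) (uintptr_t) p;
}
```

<br><font size=2 face="sans-serif">For the address 0x80001000, the signed path yields a negative number on typical compilers, while the unsigned path yields 2147487744, the address itself. &nbsp;Whether a cast through uintptr_t is appropriate at each MPICH call site would still need to be checked case by case.</font>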