[dcmf] 32-bit ROMIO: What is the best solution?
Bob Cernohous
bobc at us.ibm.com
Fri Feb 8 14:41:40 CST 2008
I'm going to forward parts of a technical discussion about 32-bit
MPICH/ROMIO issues. This discussion has been going on privately for a
couple weeks. I just wanted to open it up to the list for discussion and
tracking.
I'm not going to include all the technical information that's been passed
on. If anyone wants to add more of the historical discussion, go for it.
We all realize there are issues with MPI_Aints in 32-bit implementations.
Using signed 32-bit addresses along with 64-bit offsets can result in
some pretty broken code. I've reproduced several problems on
BGL/BGP/Linux. The problems are most obvious with ROMIO files larger
than 2GB or virtual addresses above 2GB.
We've begun to consider making MPI_Aint a 64-bit value as a solution.
So, below are my selections of technical tidbits on this from Jeff Parker.
------------------------------------------
Your (Rob Ross) document describing the MPI-IO file size limitations on
32-bit platforms offered two solutions:
1. Change internal ROMIO code to use 64-bit variables. <snip. Rob Ross
can detail this if he'd like to discuss this further.>
2. Change the MPI_Aint data type to be 64-bit. I looked at this solution
and it requires a scrub of all places throughout ROMIO and MPICH where
MPI_Aint variables are used, ensuring calculations will have the correct
result. However, it does fix everything, including the external MPI
interfaces that return MPI_Aint values. Perhaps a variation of this is to
code MPICH and ROMIO to handle either a 32-bit or a 64-bit MPI_Aint based
on a compile flag, changing calculations to be correct for either size and
adding assertions when overflow occurs. We could have two MPICH
libraries, one with a 32-bit MPI_Aint and the other with a 64-bit MPI_Aint.
Applications that don't have large datatypes or large MPI-IO files can
continue to use the 32-bit library, eliminating the risk of performance
issues, while applications using large datatypes can use the 64-bit
version.
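To make the compile-flag idea concrete, here is a minimal sketch of how the
dual-size MPI_Aint might be wired up. The flag name MPICH_AINT_64 and the
overflow-check macro are hypothetical, not existing MPICH code:

    #include <limits.h>

    /* Hypothetical compile flag; not an actual MPICH option. */
    #ifdef MPICH_AINT_64
    typedef long long MPI_Aint;   /* 64-bit library build */
    #else
    typedef long MPI_Aint;        /* 32-bit library build (pointer-sized) */
    #endif

    /* Assertion helper for narrowing points: true when a 64-bit
     * intermediate result still fits in the configured MPI_Aint. */
    #define MPID_AINT_FITS(v) \
        (sizeof(MPI_Aint) == 8 || ((v) >= LONG_MIN && (v) <= LONG_MAX))

A build system could then produce the two libraries by compiling once with
and once without the flag.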
One could argue that these MPI interfaces that return MPI_Aint values are
not used by typical applications, or if they are used, they are only used
on datatypes that fit in 31-bit sizes (up to 2GB), so those applications
don't have a problem. However, if the 32-bit platform has more than 2GB
of memory, the MPI_Aint values can go negative, which could also result in
incorrect calculations or comparisons.
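As a concrete illustration of that last point, here is a small standalone
C program (not MPICH code) showing how a 32-bit signed MPI_Aint misorders
two addresses once one of them crosses 2GB:

    #include <stdio.h>

    typedef int MPI_Aint;   /* the current 32-bit signed MPI_Aint */

    int main(void)
    {
        /* Conversion of an out-of-range unsigned value is shown here for
         * illustration; on common 32-bit ABIs it wraps to a negative int. */
        MPI_Aint high = (MPI_Aint) 0x90000000u;  /* address above 2GB */
        MPI_Aint low  = (MPI_Aint) 0x10000000u;  /* address below 2GB */

        /* Numerically high > low, but the signed comparison disagrees. */
        if (high < low)
            printf("high (%d) compares below low (%d): broken ordering\n",
                   high, low);
        return 0;
    }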
Both solutions involve a significant amount of work. However, it appears
that only solution 2 fixes all of the issues. Other 64-bit platforms,
which have both a 64-bit MPI_Aint and 64-bit pointers, appear to work
correctly. It seems that we must change MPI_Aint to be 64-bit and fix the
calculations involving MPI_Aint and 32-bit addresses in order to fix the
whole problem.
------------------------------------------
Something else that comes to mind that may be related: even though a Blue
Gene/P compute node has 2GB of memory, virtual addresses may be arranged
by CNK such that they are larger than 2GB. In some cases, when signed
addresses are used with 64-bit offsets in calculations, this may produce
incorrect results. Any scrub of the code should account for this as well.
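A standalone sketch (not CNK or ROMIO code) of the failure mode described
here, where a signed 32-bit address is sign-extended into a 64-bit offset
calculation:

    #include <stdio.h>

    int main(void)
    {
        int       addr32  = (int) 0xA0000000u;  /* 32-bit address above 2GB */
        long long rel_off = 0x100000000LL;      /* 64-bit offset (4GB) */

        /* Sign extension makes the address contribute a negative value,
         * silently cancelling part of the 64-bit offset. */
        long long bad  = (long long) addr32 + rel_off;

        /* Converting through unsigned first preserves the address bits. */
        long long good = (long long) (unsigned int) addr32 + rel_off;

        printf("bad  = 0x%llx\ngood = 0x%llx\n",
               (unsigned long long) bad, (unsigned long long) good);
        return 0;
    }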
------------------------------------------
(Jeff's earlier investigation)
------------------------------------------
Regarding the MPI-IO issue with large files, exposed by the b_eff_io
test case...
I have done a brief investigation of solution 2, changing the MPI_Aint
data type from 32 bits to 64 bits. <snip>
Compile warnings appear in generic MPICH code, and a lot of work
(including code changes) would need to be done to address these issues.
It appears that the MPICH code expects MPI_Aint to be signed and to be the
same size as a pointer, and many sensitive code changes will be required
throughout MPICH to allow them to be different. There are 1474 lines of
code that explicitly reference MPI_Aint (typically declaring variables and
function parameters of this type), and each of these occurrences would
need to be examined in detail.
A detailed look through the ROMIO code appears to be needed to determine
which variables need to be 64 bits and to replace them with a new unsigned
64-bit typedef. While I haven't investigated solution 1 yet, it seems
likely that it too will require a similar scrub of at least the ROMIO code.
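As a rough sketch of what that might look like (the type and macro names
here are made up for illustration, not proposed ROMIO identifiers):

    #include <stdint.h>

    /* Hypothetical unsigned 64-bit type for ROMIO-internal offset math. */
    typedef uint64_t ADIOI_UINT64;

    /* Converting a 32-bit pointer via uintptr_t avoids sign extension. */
    #define ADIOI_PTR_TO_U64(p) ((ADIOI_UINT64) (uintptr_t) (p))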
Here is evidence of the problems with changing MPI_Aint to be 64 bits:
1. The mpi/mpich2/configure.in file sets up MPI_Aint to be the same size
as a pointer, and exports this size to other config files.
2. MPI_Aint is signed. Making it unsigned would help, but I don't think
that is allowed, correct?
3. If I force MPI_Aint to be 64 bits (signed) in this config file as
follows:

    MPI_SIZEOF_AINT=8
    export MPI_SIZEOF_AINT
    MPI_AINT="long long"

the MPICH compilation has warnings similar to the following:

    mpi/mpich2/src/mpid/common/datatype/mpid_segment.c:659: warning: cast from pointer to integer of different size
These places will be problematic because casting a 32-bit pointer to a
64-bit signed integer produces incorrect results when the high-order bit
of the pointer is set (the address is larger than 2GB): the high-order bit
is sign-extended, and the resulting integer is a large negative number.
Some code change will be required in each of these places to make this
work; a sketch of one possible fix appears after the examples below. Here
is one example:
mpi/mpich2/src/mpid/common/datatype/mpid_segment.c:659: warning: cast from pointer to integer of different size

mpid/common/datatype/mpid_dataloop.h:#define DLOOP_Offset MPI_Aint

    633 static int MPID_Segment_contig_flatten(DLOOP_Offset *blocks_p,
    634                                        DLOOP_Type el_type,
    635                                        DLOOP_Offset rel_off,
    636                                        void *bufp,
    637                                        void *v_paramp)
    638 {
    639     int index, el_size;
    640     DLOOP_Offset size;
    641     struct MPID_Segment_piece_params *paramp = v_paramp;
    642     MPIDI_STATE_DECL(MPID_STATE_MPID_SEGMENT_CONTIG_FLATTEN);
    643
    644     MPIDI_FUNC_ENTER(MPID_STATE_MPID_SEGMENT_CONTIG_FLATTEN);
    645
    646     el_size = MPID_Datatype_get_basic_size(el_type);
    647     size = *blocks_p * (DLOOP_Offset) el_size;
    648     index = paramp->u.flatten.index;
    649
    650 #ifdef MPID_SP_VERBOSE
    651     MPIU_dbg_printf("\t[contig flatten: index = %d, loc = (%x + %x) = %x, size = %d]\n",
    652                     index,
    653                     (unsigned) bufp,
    654                     (unsigned) rel_off,
    655                     (unsigned) bufp + rel_off,
    656                     (int) size);
    657 #endif
    658
    659     if (index > 0 && ((DLOOP_Offset) bufp + rel_off) ==
    660         ((paramp->u.flatten.offp[index - 1]) +
    661          (DLOOP_Offset) paramp->u.flatten.sizep[index - 1]))
    662     {
    663         /* add this size to the last vector rather than using up another one */
    664         paramp->u.flatten.sizep[index - 1] += size;
    665     }
4. Here is another example:

    #define MPID_IOV      struct iovec
    #define MPID_IOV_LEN  iov_len
    #define MPID_IOV_BUF  iov_base

    void *iov_base;    /* Pointer to data. */

In src/mpid/common/datatype/gen_type_struct.c:

    DLOOP_Offset *tmp_disps, bytes;
    MPID_IOV *iov_array;

    for (i=0; i < nr_blks; i++)
    {
        tmp_blklens[i] = iov_array[i].MPID_IOV_LEN;
        tmp_disps[i]   = (DLOOP_Offset) iov_array[i].MPID_IOV_BUF;
    }

The "tmp_disps[i] =" line casts a void* to an MPI_Aint, possibly causing
sign extension and resulting in an incorrect offset.
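As promised above, here is one possible shape of a fix for casts like the
two examples. It assumes a C99 uintptr_t is available and a 64-bit
DLOOP_Offset; the helper name is hypothetical, and this is a sketch rather
than a committed change:

    #include <stdint.h>

    typedef long long DLOOP_Offset;   /* stand-in for a 64-bit MPI_Aint */

    /* Route the pointer through an unsigned integer type so the widening
     * to 64 bits zero-extends instead of sign-extending. */
    static DLOOP_Offset DLOOP_Ptr_to_offset(void *p)
    {
        return (DLOOP_Offset) (uintptr_t) p;
    }

With a helper like this, line 659 of mpid_segment.c would compare
DLOOP_Ptr_to_offset(bufp) + rel_off, and the gen_type_struct.c loop would
assign tmp_disps[i] = DLOOP_Ptr_to_offset(iov_array[i].MPID_IOV_BUF),
neither of which can sign-extend a high address.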