[alcf-discuss] [Fwd: Re: [Fwd: Re: Parallel hdf5 I/O of strided data]]
John R. Cary
cary at txcorp.com
Mon May 11 13:42:04 CDT 2009
Just for the record (and as I think I mentioned before), we pass the
collective property to H5Dwrite per the documentation, but
we do so only for the strided data, which is what is giving us the
problem.
Our other data (in another dump file) is laid out per processor:
all proc 1 data contiguous
all proc 2 data contiguous
...
and we use independent I/O. This seems to work fine.
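
In outline, the strided write amounts to something like the following
(a simplified sketch rather than the actual code; dset, myRank,
numProcs, localCount, and localBuf are placeholders):

/* Each rank writes every numProcs-th element of an existing
   1-D dataset 'dset' (interleaved/strided layout). */
hid_t filespace = H5Dget_space(dset);
hsize_t start[1]  = { (hsize_t) myRank };
hsize_t stride[1] = { (hsize_t) numProcs };
hsize_t count[1]  = { localCount };
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, stride, count, NULL);

/* Memory dataspace matching the local contiguous buffer. */
hsize_t memDims[1] = { localCount };
hid_t memspace = H5Screate_simple(1, memDims, NULL);

/* Collective transfer property list, passed to H5Dwrite. */
hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, localBuf);
H5Pclose(dxpl);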
Thx....John Cary
Jason Kraftcheck wrote:
> Tim Tautges wrote:
>
>> ???
>>
>>
>
> a) No, the MOAB parallel writer does no collective IO right now.
>
> b) The discussion below seems to include the erroneous assumption that the
> independent/collective property is specified at file open.
>
>
>> - tim
>>
>> -------- Original Message --------
>> Subject: Re: [Fwd: Re: [alcf-discuss] Parallel hdf5 I/O of strided data]
>> Date: Wed, 06 May 2009 15:52:53 -0500
>> From: Jason Kraftcheck <kraftche at cae.wisc.edu>
>> To: Tim Tautges <tautges at mcs.anl.gov>
>> References: <4A01F6CB.3040107 at mcs.anl.gov>
>>
>> Tim Tautges wrote:
>>
>>> Are you using the collective property in the parallel hdf5 writer? (See
>>> below...)
>>>
>>>
>> The comment below seems to be inconsistent with the documentation. If you
>> follow the link below to the description of H5Pset_dxpl_mpio, note the part
>> where it says "The property list can then be used to control the I/O
>> transfer mode during data I/O operations." That seems to imply that the
>> property is to be passed to things like H5Dread and H5Dwrite (data I/O)
>> rather than H5Fcreate.
>>
>> The Parallel HDF5 documentation (what there is of it) also says that this
>> property gets passed to H5Dread and H5Dwrite
>> (http://www.hdfgroup.org/HDF5/Tutor/pcrtaccd.html). It never mentions the
>> property in connection with H5Fcreate or H5Fopen, nor can I find such a
>> usage in any of the examples.
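>>
>> For what it's worth, a minimal sketch of the usage as I read the
>> documentation (the dset, memspace, filespace, and buf handles are
>> assumed to exist already; names are placeholders):
>>
>> /* File-access property list: passed to H5Fcreate / H5Fopen. */
>> hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
>> H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
>> hid_t file = H5Fcreate("dump.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
>>
>> /* Data-transfer property list: passed to H5Dread / H5Dwrite. */
>> hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
>> H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
>> H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);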
>>
>>
>> - jason
>>
>>
>>
>>> - tim
>>>
>>> -------- Original Message --------
>>> Subject: Re: [alcf-discuss] Parallel hdf5 I/O of strided data
>>> Date: Wed, 6 May 2009 14:48:48 -0500
>>> From: Rob Latham <robl at mcs.anl.gov>
>>> To: John R. Cary <cary at txcorp.com>
>>> CC: discuss at lists.alcf.anl.gov
>>> References: <4A01BF23.3030500 at txcorp.com>
>>> <20090506181859.GD9315 at mcs.anl.gov> <4A01E33E.9070702 at txcorp.com>
>>>
>>> On Wed, May 06, 2009 at 12:21:34PM -0700, John R. Cary wrote:
>>>
>>>> First, thanks for setting up this list. I wonder whether it makes
>>>> sense to have Reply-To set to the list for it? I usually just hit
>>>> reply, and then the list does not get the message.
>>>>
>>>>
>>> There are a lot of Robs at argonne. I didn't set up this list:
>>> another Rob did :> I don't like reply-to-list myself, but like
>>> arguing about which is better even less! :>
>>>
>>>
>>>> /gpfs1/cary
>>>>
>>> Do you see a similar result if you write to /pvfs-surveyor?
>>>
>>>
>>>>> - what, if any, HDF5 property list tunables are you setting
>>>>>
>>>>>
>>>> On file open:
>>>>
>>>> // file-access property list selects the MPI-IO (parallel) driver
>>>> MPI_Comm comm = MPI_COMM_WORLD;
>>>> MPI_Info info = MPI_INFO_NULL;
>>>> hid_t plistId = H5Pcreate(H5P_FILE_ACCESS);
>>>> H5Pset_fapl_mpio(plistId, comm, info);
>>>> // create the file collectively across the communicator
>>>> hid_t res = H5Fcreate(filename.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT, plistId);
>>>>
>>> OK, so you have turned on parallel I/O, but not collective I/O. This
>>> is orthogonal to the correctness issue, but if you enable collective
>>> I/O, you'll get much, much better performance.
>>>
>>> http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetDxplMpio
>>>
>>> Set the H5FD_MPIO_COLLECTIVE flag and you'll exercise a different code
>>> path. Maybe it will make your problem go away, too?
>>>
>>> As to your correctness issue, independent I/O is a pretty simple
>>> workload. I'm surprised you're seeing errors with that.
>>>
>>> Do you have a small testcase for this, or do you have to run your
>>> entire application to see the corruption?
>>>
>>> ==rob
>>>
>>>
>>
>
>
>