[alcf-discuss] Parallel hdf5 I/O of strided data

Rob Latham robl at mcs.anl.gov
Wed May 6 14:48:48 CDT 2009


On Wed, May 06, 2009 at 12:21:34PM -0700, John R. Cary wrote:
> First, Thanks for setting up this list.  I wonder whether it
> makes sense to have Reply-To list turned
> on for this?  I usually just hit reply, and then the list does not get
> this.

There are a lot of Robs at Argonne.  I didn't set up this list:
another Rob did :>   I don't like reply-to-list myself, but I like
arguing about which is better even less! :>

> /gpfs1/cary

Do you see a similar result if you write to /pvfs-surveyor?

>> - what, if any, HDF5 property list tunables are you setting
>>   
> On file open:
>
>  MPI_Comm comm = MPI_COMM_WORLD;
>  MPI_Info info = MPI_INFO_NULL;
>  hid_t plistId = H5Pcreate(H5P_FILE_ACCESS);
>  H5Pset_fapl_mpio(plistId, comm, info);
>  hid_t res = H5Fcreate(filename.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT,
>        plistId);

OK, so you have turned on parallel I/O (the MPI-IO file driver), but
not collective I/O.  This is orthogonal to the correctness issue, but
if you enable collective I/O, you'll get much better performance.

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetDxplMpio

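Set the H5FD_MPIO_COLLECTIVE flag on a dataset transfer property list
(that is what H5Pset_dxpl_mpio does) and pass that list to your
H5Dwrite calls; you'll exercise a different code path.  Maybe it will
make your problem go away, too?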
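
For what it's worth, here is a rough, untested sketch of what
requesting collective transfers could look like.  The helper name
write_collective and the handles dsetId, memSpace, fileSpace and buf
are placeholders of mine, not anything from your code, and I'm
assuming double-precision data and a parallel (MPI-enabled) HDF5
build:

```c
#include <hdf5.h>
#include <mpi.h>

/* Hypothetical helper: write one dataset with collective MPI-IO.
 * dsetId, memSpace and fileSpace are assumed to be valid handles the
 * caller already created; buf is assumed to hold doubles. */
herr_t write_collective(hid_t dsetId, hid_t memSpace, hid_t fileSpace,
                        const double *buf)
{
    /* Collective vs. independent is chosen on a dataset TRANSFER
     * property list, not on the file access property list. */
    hid_t xferId = H5Pcreate(H5P_DATASET_XFER);
    if (xferId < 0)
        return -1;

    /* Ask for collective transfers. */
    herr_t status = H5Pset_dxpl_mpio(xferId, H5FD_MPIO_COLLECTIVE);

    /* Pass the transfer property list as the fifth argument. */
    if (status >= 0)
        status = H5Dwrite(dsetId, H5T_NATIVE_DOUBLE, memSpace, fileSpace,
                          xferId, buf);

    H5Pclose(xferId);
    return status;
}
```

With a collective transfer, every process in the communicator you
gave to H5Pset_fapl_mpio has to make the H5Dwrite call, even one that
has nothing to write (it can pass an empty selection).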

As to your correctness issue, independent I/O is a pretty simple
workload.  I'm surprised you're seeing errors with that.

Do you have a small testcase for this, or do you have to run your
entire application to see the corruption?

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

