
plfs-core's Introduction

###############################################################################
          PLFS: A Checkpoint Filesystem for Parallel Applications
###############################################################################

PLFS is a parallel virtual file system that rearranges random, concurrent
writes into sequential writes to unique files (i.e. N-1 into N-N). This
rearrangement allows write access patterns to be optimized for the underlying
file system.

Additional info:
http://institutes.lanl.gov/plfs. Available under LANL LA-CC-08-104.

Email [email protected] with any user questions.
Email [email protected] with any developer questions.


***************************************************************************
What's New
***************************************************************************

A list of new features and bug fixes for each release is maintained in
Changelog. Please refer to that file for more information on what changes in
PLFS between releases.


***************************************************************************
Installing PLFS
***************************************************************************

For building and installation instructions, please see README.install.


***************************************************************************
Using PLFS
***************************************************************************

There are three ways to use PLFS: through the FUSE system, through MPI/IO, and
through the PLFS API.


*** Mounting a PLFS file system using FUSE: ***

1) Build and install PLFS with FUSE support (please see README.install)
2) Make sure your plfsrc file is correct by using plfs_check_config. In order
   to quickly create the needed directories, use 'plfs_check_config -mkdir'.
3) Launch the PLFS daemon (which calls fusermount):
   plfs /path/to/mount
4) The PLFS daemon can be killed via fusermount:
   fusermount -u /path/to/mount

/path/to/mount can now be used as any POSIX file system. To check FUSE stats
and make sure everything is working, run the following:
cat /path/to/mount/.plfsdebug

NOTE: The plfs binary does not check the existence of backend directories;
only the existence of the mount point directory is checked. Therefore it is
possible to successfully run plfs even if plfs_check_config reports that there
are missing backends. If plfs is started on a mount point that is missing
backends, the mount point will not be usable. As soon as all of the backends
are available, the plfs mount point will be usable (no restart is necessary).

NOTE: Not checking for the existence of backend directories is
done to make mounting plfs scalable on a large cluster. If plfs is configured
to start at boot time and plfs checks backend directories when it is started,
this could result in thousands or millions (or more) of stat calls to the
underlying filesystem in a very short period of time.

NOTE on the above NOTES: Not checking backend directories only applies to
fuse's actual mount command; plfs will not check the existence of backend
directories during that call. However, it is possible for fuse to issue
additional calls on the mount point, such as statfs, when
'plfs <mount point>' is executed on the command line. These calls will fail
and it may not be possible to mount plfs without the backend directories
for those versions/implementations of fuse. To see what calls fuse makes when
running 'plfs <mount point>', use strace.

Several options can be passed to the plfs binary that are passed on to the FUSE
system. Use the plfs binary's -o command line parameter. Use 'plfs -h' to see
all options that can be passed. Some useful ones are:
allow_other   : if mounting as root, this allows others to use plfs
big_writes    : newer versions of fuse can break writes into 128K chunks 
direct_io     : older versions can too, but then mmap doesn't work
auto_cache    : if you are writing to PLFS through MPI-IO or PLFS-API and you want
                to read from a PLFS FUSE, then use this flag

Example:
plfs -o big_writes,allow_other /path/to/mount


*** Using PLFS as a ROMIO ADIO layer for MPI/IO ***

Please see mpi_adio/README for information on using PLFS as a ROMIO ADIO layer.


*** Using the PLFS API ***

1) Build and install the PLFS library. Verify your plfsrc file is correct by
   using plfs_check_config.
2) Look at plfs.h to become familiar with the PLFS API.
3) Modify application source to use the PLFS API. Basically, you just change
   opens to plfs_opens, etc.
4) Link the PLFS library to your application at compile time. Please see the
   linking requirements in using PLFS as a ROMIO ADIO layer for what the
   linking command should look like.

*** For an example of an application ported to use the PLFS native API, see
    the open source LANL fs_test at: https://github.com/fs-test/fs_test
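
Below is a minimal, hedged sketch of what such a port might look like; the
plfs_open/plfs_write/plfs_close argument lists shown here are assumptions made
for illustration, so consult plfs.h for the real prototypes and return
conventions.

/* sketch only: the plfs_* argument lists below are assumed, see plfs.h */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include "plfs.h"

int write_hello(const char *logical_path)   /* a path under a PLFS mount */
{
    Plfs_fd *fd = NULL;
    const char *msg = "hello\n";
    pid_t pid = getpid();

    /* instead of open(2) */
    if (plfs_open(&fd, logical_path, O_CREAT | O_WRONLY, pid, 0644, NULL) != 0)
        return -1;
    /* instead of write(2)/pwrite(2): buffer, length, offset, pid */
    plfs_write(fd, msg, strlen(msg), 0, pid);
    /* instead of close(2) */
    plfs_close(fd, pid, getuid(), O_CREAT | O_WRONLY, NULL);
    return 0;
}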


***************************************************************************
Other Info
***************************************************************************

For information on PLFS's POSIX-compliance, please see README.POSIX.

Information about the logging capabilities of PLFS is contained in
README.mlog. Please see that file for more information about getting log
information from PLFS.

Information for developers, such as rules for coding and working with the
source repository, is contained in README.developer.


***************************************************************************
Testing
***************************************************************************

Some functionality tests are provided in the 'tests' directory. Please see and
use tests/Tester.tcsh.

A regression suite is also available for PLFS, but is not included in the
normal source tree for PLFS. It is included in the PLFS project on github:

https://github.com/plfs/plfs-regression

The following command should be sufficient to check out the regression suite,
but if there are any problems, please look at the above URL and follow the
directions there for getting the regression suite:

git clone [email protected]:plfs/plfs-regression.git

The regression suite is kept up-to-date with PLFS's master branch. As working
with PLFS changes, the regression suite changes. This means that the regression
suite's master branch may not work with a particular PLFS release if there has
been significant divergence in source tree layout and PLFS configuration
capabilities between the release and master branch. As PLFS does a release, the
regression suite will be tagged with that release. Thus, it is possible to use
git to get the regression suite in to a state that should work with a
particular release if the master branch no longer works with that release.

The regression suite has its own set of documentation, so please refer to its
README for further information.

plfs-core's People

Contributors

atorrez, benmcclelland, brettkettering, bringhurst, chuckcranor, dshrader, hugegreenbug, johnbent, kellyzhang, thewacokid, zhang-jingwang, zhangzhenhua


plfs-core's Issues

EXASCALE: ReaddirOp might break at exascale

This is related to Trac Ticket about porting container.cpp to use FileOp and ReaddirOp classes. That one will be addressed at current scales.

However, this problem will be at least O( 1K ) worse at Exascale. So, this becomes a research topic to address at Exascale and beyond.

Automate the release process

1/31/2011:
There are several steps we have to do to roll a new release. We should automate this and put it into scripts/plfs_release VERSION_NUMBER (e.g. scripts/plfs_release 1.1.9). scripts/plfs_release will update the spec file, the VERSION file, redo the autogen, the configure, and the make dist, and then print a reminder about uploading to sourceforge and emailing Trent. Anything else?

Fsck needs work

  1. Create shadow container.
  2. Put metalink into canonical.

If the file is later removed, the canonical won't contain the metalink and the shadow container will leak.

The Fix: Put the metalink dropping in place before creating the shadow. On metalink resolution, ignore a missing shadow.

Also, if a shadow container leaks, then plfs_recover will think it's a user directory (i.e. not a user file) and will recover it as a directory, and then the user will see container structure. To fix, put an accessfile into each shadow. Then how to tell the difference between canonical and shadow? Does it matter?

opendir in ad_plfs_open

5/10/2011:
should be ported to readdirop. will need code moved into plfs lib. also, the bitmap stuff.

Incompatible ROMIOs in OpenMPI and MPICH

10/27/2010:
We have a problem where the way that Rob Ross and Rob Latham want us to do something isn't available in OpenMPI since it's using an old version of ROMIO. You guys can push a new OpenMPI with the updated ROMIO into openmpi-1.5?

I'd really rather not maintain a separate PLFS branch for both openmpi and mpich....

--- Begin forwarded message from Rob Latham ---
From: Rob Latham <robl@…>
To: Milo <milop@…>
Cc: Rob Ross <rross@…>, Garth Gibson <garth@…>, John Bent <johnbent@…>
Date: Wed, 27 Oct 2010 15:33:27 -0600
Subject: Re: ADIO converting open_write_only to open_readwrite ?

On Wed, Oct 27, 2010 at 05:29:25PM -0400, Milo wrote:

On Oct 27, 2010, at 5:09 PM, Rob Ross wrote:

    Why not just replace the function in the table as I suggested? Am I missing something? -- Rob


Ugh, no. We are. This is one of the differences between mpich2 and openmpi. I'll replace the function as suggested.

If you can test out Pascal Deveze's integration of new romio into
openmpi-1.5, and then tell the openmpi guys it works for you, that
would be great:

http://bitbucket.org/devezep/new-romio-for-openmpi/

They don't seem very excited about re-syncing, even though Pascal did all the work.

http://www.open-mpi.org/community/lists/devel/2010/10/8584.php

11/1/2010:
I'm not sure. If OMPI 1.5 doesn't have the updated ROMIO that you
need, I don't see that happening. We can't maintain LANL-specific
OMPIs - that would be too painful. I think that there is an update of ROMIO that is slated to go into the OMPI trunk sometime, but I don't know the ETA for that.

We are thinking about syncing the trunk and the 1.5 branch for 1.5.1, but that's also still up in the air. If the ROMIO updates go in before that AND we chose to sync for 1.5.1, then you can talk to David about providing a production build.

11/2/2010:
But could you guys lend your voices to try to push the new ROMIO into 1.5? I think that's what the Robs want us to do. And that's what PLFS wants, so we can apply the same patch to the ROMIO in mpich and openmpi.

11/2/2010:
I will, but ultimately it's not up to me when the ROMIO work gets
integrated/completed.

Update on OMPI trunk sync.

1.5.1 will just be a minor release with bug fixes. 1.5.2 may or may
not be the trunk/1.5 branch sync, so at least this buys us some
time :-).

Getting Unix FS semantics in PLFS API calls

Andy Nelson noted that checking for the existence of upper portions of a PLFS mount does not behave the same as a normal Unix file system would.

Here are some suggested fixes from Brett Kettering and Gary Grider.

Brett:
I ran into Andy at the pool tonight and we talked some more about his complaint regarding plfs_access. IMHO, his argument is kind of flimsy because it relies on the fact that / is always a mount point and contains the directory hierarchy down to other mount points. So, he may be going after a deeper hierarchical mount point, like /a/b/c, and ask if /a is there if /a/b/c and /a/b aren't, but he's already out of the true mount point he wanted and will likely lack the privileges to create b and c, but that doesn't seem to matter to him. He says if he gets an error that he can't mkdir b or c then he quits and errors out to the user.

So, I think we can provide the same semantic by:

  1. If a person does not provide a full file spec that defines a mount point in the plfsrc file(s), then plfs_access returns true or valid or success if a person specifies a partial spec that matches a valid mount point up to the point the person specifies. For example, suppose /a/b/c is a valid PLFS mount point and the person asks for /a/b, we return success (true or valid). But if they say /a/c we say failure (false or invalid) because /a/c is not a substring from the start of /a/b/c (neither is /b/c).

  2. If a person then tries to plfs_mkdir without specifying at least a full, valid mount point, we return failure. That is, we don't allow a person to create a directory that isn't fully inside a valid mount point. That would be a privileged operation.

That way, Andy gets his semantic. He can ask for plfs_access on a
partial PLFS mount point and get success, but he'll get failure if he tries to create the next directory down and then he'll tell his user, "Sorry, but I couldn't create the directories in your specification for you."

Gary:
We could do this sort of thing or we could have all plfs dir related ops do the plfs dir op if the path resolves to inside the plfs space and if not in the plfs space just call the appropriate posix dir call and return the answer, so plfsmkdir could do - if in plfs space call plfs_mkdir_fancy_hashed_thing() else call mkdir() and return what you get. On access (I guess this is equivalent to the posix access() call?) so that would be -- if in plfs space call plfs_access else call access() and return. This would have to be just for dir ops I guess because the open/read/write/close calls pass this huge struct back to the client in an fd like thing. This would allow people to ripple through and check/make directories all the way down into plfs space from above. Of course if fuse was there they wouldn't need to do this because this is pretty much exactly what the vfs does for you, if you are inside a mountpoint it calls the correct routine.

I can't say I love either option, but this seems like a surmountable issue. If you think about it, if we extend mpi-io and add a mpi_file_mkdir, the common one would resolve to just mkdir(), but the plfs adio adio_plfs_mkdir() would need to do what I am saying above: if within the plfs name space do the plfs_magic_mkdir() and if not just do the common mkdir(), because someone else could be doing this name space traversal crap in their parallel code. So for dir ops, it feels like we would need to have the fall back to posix if you are not in the plfs name space stuff eventually, if we ever do get an mpi_file_mkdir() etc.
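
As a rough illustration of the prefix check Brett describes in option 1, a
hypothetical helper (not part of the PLFS API) might look like the sketch
below: succeed only when the queried path matches a configured mount point up
to a path-component boundary.

#include <string>
#include <vector>

/* e.g. query "/a/b" matches mount point "/a/b/c"; query "/a/c" does not */
bool partial_mount_match(const std::string &query,
                         const std::vector<std::string> &mount_points)
{
    for (size_t i = 0; i < mount_points.size(); i++) {
        const std::string &mnt = mount_points[i];
        if (query.size() <= mnt.size() &&
            mnt.compare(0, query.size(), query) == 0 &&
            (query.size() == mnt.size() || mnt[query.size()] == '/'))
            return true;
    }
    return false;
}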

Rename is broken

4/11/2011:
When we added multiple backend support in 2.0, we broke rename. The
problem is that we were focused on ADIO to the detriment of FUSE.

Rename currently just finds the canonical location and changes that one thing. Instead it needs to do:

rename(to, from)
    foreach backend:
        phys_to   = expandPath(to,   NO_HASH, backend)
        phys_from = expandPath(from, NO_HASH, backend)
        rename(phys_to, phys_from)
    if ( is_directory(to) ) return  // all done if directory
    // if file, a bit more work
    plfs_recover(to)  // this does most of the work
    // but also need to overwrite all symlinks within the
    // container to reflect the name change

5/1/2011:
We're making good progress towards 2.0.2 which I'm thinking we'll name 2.1 after a thorough testing regimen.

However, I've realized that the symlinks that we use to link shadow
subdirs into the canonical containers make it very difficult to rename parent directories.

Renaming a container itself is also a bit of a pain and requires us to search through the container for symlinks and recreate them with the new path.

But a rename of a directory is even harder. It requires a full
traversal to find all descendants and fix all of their symlinks as well.

For 2.1, I'm tempted to just return ENOSYS ("is not implemented") if
someone tries to rename a directory.

Objections? Is this much worse than I'm thinking?

5/2/2011:
Currently, we distribute internal container files over multiple backends using shadow containers. We then create links to the subdirs within the shadows into the canonical container.

So say we have /mnt/plfs/johnbent/dir1/dir2/foo
and the canonical location is
/panfs1/.plfs_store/johnbent/dir1/dir2/foo/
and we have a shadow and it's hostdir at
/panfs2/.plfs_store/johnbent/dir1/dir2/foo/hostdir.2

We then create a symlink like this:

/panfs1/.plfs_store/johnbent/dir1/dir2/foo/hostdir.2@ ->
/panfs2/.plfs_store/johnbent/dir1/dir2/foo/hostdir.2

Now imagine that someone renames dir1 to dir3, then the symlink within the canonical container is dangling.

An easy fix: return ENOSYS on rename of directory

A bad fix: scan all backends to find all shadows. This is horrible for N-N.

A decent fix: instead of putting the full path in the symlink, just put enough metadata in to enable location.

hostdir.2@ -> backend2

The bitmaps that we use to distribute container creation work will still work. But instead of just doing a readdir on each hostdir, we'll have to treat them differently if they're local.

The right fix: actually the decent fix is the right fix. This is
because the next thing on my plate is doing a rename of a container
which will require going in and fixing symlinks. But if the symlinks
don't contain absolute paths, then doing a rename of a container is
exactly the same as doing a rename of a directory (iterate over
backends, rename each one). Oh, the rename of the container will also require moving the canonical stuff from the old canonical to the new canonical. But that's easy compared to what I was gonna do.

The tradeoff is that this does change how container index aggregation works, and it changes a bit how the dropping files get created.

5/5/2011:
By the way, I'm planning for version 2.1 to change the symlinks so they no longer contain absolute paths. This makes some code a bit more challenging but means the rename doesn't have to do anything more than change the name of the canonical and shadow containers. rename will also, however, have to call plfs_recover since the location of the canonical might change.

So, in version 2.1, a symlink inside canonical will just say something like '5.13' and that will expand to the 5th backend and subdir.13.

5/5/2011:
Actually the way to fix this is to make symlinks in canonical not point absolutely to the shadow hostdirs. Instead they just need to contain enough metadata to locate the hostdir. For example,

/back1/plfs/johnbent/foo/hostdir.13@ will contain just 5 which means the 5th backend. Then we find the shadow hostdir at:
/back1/plfs/johnbent/foo/hostdir.13

But this is a pain since it changes how the index aggregation works. But this means that rename on a container is easy:

Just rename each shadow and canonical, and then do plfs_recover to make sure canonical is in the right location.

Then rename on a directory is even easier: just rename each directory.
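
A small, hypothetical sketch of resolving the proposed metalink encoding (a
link body like '5.13' meaning the 5th backend and hostdir.13); the helper and
its arguments are illustrative, not actual PLFS code:

#include <cstdio>
#include <string>
#include <vector>

bool resolve_metalink(const std::string &link_body,              // e.g. "5.13"
                      const std::vector<std::string> &backends,  // configured backends
                      const std::string &container_path,         // container path relative to a backend
                      std::string &shadow_hostdir)               // out: physical shadow hostdir
{
    unsigned backend_id, subdir_id;
    if (sscanf(link_body.c_str(), "%u.%u", &backend_id, &subdir_id) != 2 ||
        backend_id >= backends.size())
        return false;
    char subdir[32];
    snprintf(subdir, sizeof(subdir), "/hostdir.%u", subdir_id);
    shadow_hostdir = backends[backend_id] + "/" + container_path + subdir;
    return true;
}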

Fix Performance Bug When Reading with O_RDWR

By design, reading from a PLFS file in O_RDWR mode gets horrible performance. The technical reason is that in order to read, we need to create an index structure by aggregating N physical index files. In RDONLY mode, we cache this thing. In RDWR mode, we blow it away and recreate it on every read operation in case there has been an interleaved write.

So, we have ideas to do better in future versions but not in 2.1, which is feature frozen. So, we need to make sure users are reading only in RDONLY mode. If users wonder why reads are incredibly slow, we need to remember to ask them whether they have opened the file in O_RDWR mode.

PLFS hashing across metadata server additions

2/16/2011:
The metadata server hashing code works pretty well. There are a couple of things we might still want to do:

  1. mkdir/rmdir/readdir/unlink all iterate over the backends. These
    should probably be threaded

  2. we didn't change any of the read code by creating symlinks in the
    container on the write phase. the symlinks might be a performance hit so maybe we should remove them

  3. or, instead of removing the symlinks, we might also just have 0
    resolve them and then pass that info to everyone else so only 0 hits the canonical location for the symlinks. As is, everyone resolves the symlinks on a single server. Still much better since the physical files are distributed but we can do better still....

Distributed metadata broke atomic rename/unlink

5/5/2011:
Now that we have shadow containers, we no longer have atomic
rename/unlink. When everything was contained in just one canonical
container, we could do rename with just one atomic rename system call. Unlink we did with an atomic rename to a hidden file and then did an unlink of that entire container.

But now with shadow containers, we lost atomicity for these operations.
Ugh.

Chuck's ideas about storing objects and having one object lookup
mechanism might solve this problem as well.

Implement an include directive in the plfsrc file

Fixed in 2.1

It was super easy to fix! However, it did add one new bug: the code was erroring out when any one plfsrc file did not define a mount point. This made sense when there was just a single plfsrc file. With multiple files, it is reasonable that some might define num_hostdirs and such without defining mount points, while others define the mount points. So the fix just required changing it so that at least one mount point must be defined across the set of all of them (instead of within each).

plfs_query -logical /mnt/plfs/file

If you pass -logical when you should pass -physical then plfs dumps core with a bus error. Can plfs_query be smart and not need the -logical or the -physical argument and just act accordingly?

Put IP address into droppings instead of string hostname

1/13/2011:
We're distributing work now where one rank does a readdir of the hostdir and sends filenames to the other ranks to parallelize the reading of the droppings. Sending the filenames as strings is a bit cumbersome and requires more bytes; we should change the droppings to have fixed length IP addresses instead of strings.

plfs_query needs to handle directories

plfs_query <file>:
all the physical stuff (containers, metalinks, subdirs, droppings)

plfs_query <directory>:
all the physical dirs across all backends (do a sanity check that they all exist)

plfs_query <mount_point>:
all the backends. This doesn't need explicit code; it should just be handled by the plfs_query <directory> case above.

plfs_query <some_bogus_path>:
ENOENT

build warnings: signed:unsigned comparisons

I'm not sure if there's a plfs mailing list, so I thought I'd just send
this to you (you seem to be the most recent committer) :-)

I turned on a few warnings (using clang 2.0/llvm 2.9 with -Wall -Wextra
-std=gnu99) and hit the following:

Index.cpp:540:16: warning: comparison of unsigned expression < 0 is always false [-Wtautological-compare]
    if ( quant < 0 ) return -EBADF;

Index.cpp:591:16: warning: comparison of unsigned expression < 0 is always false [-Wtautological-compare]
    if ( quant < 0 ) {

plfs.cpp:1351:35: warning: comparison of unsigned expression < 0 is always false [-Wtautological-compare]
    if (pconf->buffer_mbs <0) {

plfs.cpp:2346:29: warning: comparison of unsigned expression < 0 is always false [-Wtautological-compare]
    } else if ( writers < 0 ) {

rename.C:69:41: warning: comparison of unsigned expression < 0 is always false [-Wtautological-compare]
    if ( fread( buf, 4096, 1, cat ) < 0 ) {

rename.C:96:35: warning: comparison of unsigned expression < 0 is always false [-Wtautological-compare]
    if ( fread( buf, 100, 1, fp ) < 0 ) {

Most of them just seem to be sanity checks, but I thought I'd say
something just to make sure you guys are aware.

Problem running fs_test/plfs testing on a local RAMfs backend

2/7/2011:
I can run fs_test.x on a remote node.
1. Single node - OK
2. Using more than one node - NO
Is this because the backend is not a parallel file system? The fs_test.x processes are running on both nodes but never write to the file (N-to-1 or N-to-N).

I will turn on the plfs-debug option.

scr001
mount point: /tmp/plfs
backend: /tmp/plfslocal --> local ramfs using memory

scr002
mount point: /tmp/plfs
backend: /tmp/plfslocal --> local ramfs using memory

2/7/2011:
This isn't high priority but I'm curious about this. fs_test without
-shift should work against PLFS with local backends....

2/8/2011:
I thought you wouldn't be able to read. You are going to look for
offsets that are not contained in your local index.

PLFS with N-1 opens perform horribly unless DVS_MAXNODES=1

This is a problem we've noticed on Cray XE-6 systems with a lanes-based parallel file system by Panasas.

I used fs_test to model the 512 pe N-1 job Jeff was doing. I ran 512 pes where each wrote and read 16864 bytes per write/read 934 times. This results in a total file size of 8,064,499,712 bytes (7.51 GB).

Cielo, N-1, MPI/IO w/ PLFS via ADIO ("plfs:/"), DVS_MAXNODES=9:

12/21/11, 19:44:42
Write Eff Bandwidth: 21.9 MB/s
Write Raw Bandwidth: 3490 MB/s
Write Avg Open Time: 347 sec.
Write Avg Close Time: 0.692 sec.

Read Eff Bandwidth: 1840 MB/s
Read Raw Bandwidth: 4150 MB/s
Read Avg Open Time: 2.21 sec.
Read Avg Close Time: 0.0663 sec.

Cielo, N-1, MPI/IO (350 wide, 4 MB), DVS_MAXNODES=9:

12/21/11, 15:41:29
Write Eff Bandwidth: 240 MB/s
Write Raw Bandwidth: 245 MB/s
Write Avg Open Time: 0.381 sec.
Write Avg Close Time: 0.122 sec.

Read Eff Bandwidth: 1460 MB/s
Read Raw Bandwidth: 1650 MB/s
Read Avg Open Time: 0.548 sec.
Read Avg Close Time: 0.0519 sec.

Cielo, N-1, MPI/IO w/ PLFS via ADIO ("plfs:/"), DVS_MAXNODES=1:

12/22/11, 10:16:29
Write Eff Bandwidth: 1370 MB/s
Write Raw Bandwidth: 3720 MB/s
Write Avg Open Time: 2.73 sec.
Write Avg Close Time: 0.473 sec.

Read Eff Bandwidth: 2330 MB/s
Read Raw Bandwidth: 3980 MB/s
Read Avg Open Time: 1.28 sec.
Read Avg Close Time: 0.065 sec.

Observations:

  1. Using PLFS with DVS_MAXNODES=9 showed horrible write performance, primarily due to the extreme overhead doing all the opens, most of which are not necessary.

  2. Using PLFS and restricting DVS_MAXNODES to 1 made a huge difference in lowering the file open overhead and produced much better effective bandwidth. While there is still significant overhead due to the number of files being used versus the very small amount of data, it indicates that we might see an improvement on the real problem that would make the performance attractive to users.

  3. A straight N-1 MPI/IO job striped widely, as true N-1 jobs should be (note PLFS turns the N-1 into N-N so we don't want PLFS jobs striped widely), was still much lower bandwidth than Jeff's BulkIO.

Then, I wanted to see what would happen if we repeated this experiment with an N-N I/O model instead of N-1. Since PLFS converts N-1 to N-N we didn't expect to see any difference, but we did.

Cielo, N-N, MPI/IO w/ PLFS via ADIO ("plfs:/"), DVS_MAXNODES=9:

1/3/12
Write Eff Bandwidth: 2490 MB/s
Write Raw Bandwidth: 3730 MB/s
Write Avg Open Time: 0.425 sec.
Write Avg Close Time: 0.139 sec.

Read Eff Bandwidth: 3680 MB/s
Read Raw Bandwidth: 3950 MB/s
Read Avg Open Time: 0.0657 sec.
Read Avg Close Time: 0.0266 sec.

Cielo, N-N, MPI/IO w/ PLFS via ADIO ("plfs:/"), DVS_MAXNODES=1:

1/3/12
Write Eff Bandwidth: 2550 MB/s
Write Raw Bandwidth: 3770 MB/s
Write Avg Open Time: 0.41 sec.
Write Avg Close Time: 0.152 sec.

Read Eff Bandwidth: 3660 MB/s
Read Raw Bandwidth: 3920 MB/s
Read Avg Open Time: 0.0652 sec.
Read Avg Close Time: 0.0277 sec.

Cielo, N-N, MPI/IO (2 wide, 4 MB), DVS_MAXNODES=9:

1/3/12
Write Eff Bandwidth: 2310 MB/s
Write Raw Bandwidth: 3810 MB/s
Write Avg Open Time: 0.513 sec.
Write Avg Close Time: 0.159 sec.

Read Eff Bandwidth: 3310 MB/s
Read Raw Bandwidth: 3920 MB/s
Read Avg Open Time: 0.235 sec.
Read Avg Close Time: 0.004 sec.

Cielo, N-N, MPI/IO (2 wide, 4 MB), DVS_MAXNODES=1:

1/3/12
Write Eff Bandwidth: 2420 MB/s
Write Raw Bandwidth: 3740 MB/s
Write Avg Open Time: 0.469 sec.
Write Avg Close Time: 0.158 sec.

Read Eff Bandwidth: 3370 MB/s
Read Raw Bandwidth: 3980 MB/s
Read Avg Open Time: 0.225 sec.
Read Avg Close Time: 0.00405 sec.

Observations:

  1. Using DVS_MAXNODES=1 improved effective write bandwidth and open time, but the numbers are so close that it's hard to say much about it at this level. It didn't do anything for reads.

  2. PLFS improved the write and read performance by distributing the metadata load (presumably) and lowering the open overhead.

  3. There was no huge open time penalty, like there was with N-1.

So, the question is, what is going on in opens with PLFS and N-1 that is not going on with PLFS and N-N?

'No such file or directory' in "ls" output

119 ct-fe1 : pwd
/plfs/scratch3/afn
118 ct-fe1 : ls -als
ls: cannot access blah: No such file or directory
total 32
16 drwxr-xr-x 2 afn eapdev 4096 Sep 10 01:04 .
16 drwxr-xr-x 11 root root 4096 Sep 6 18:08 ..

? ?????????? ? ? ? ? ? blah

We sometimes see stuff like the above. I've never reproduced it in a session. I just occasionally stumble upon it when I look at old mount points. I assume it's a plfsrc misconfiguration and a file was written with one plfsrc and then read with a different one. Every time I've found this, I've dug in, and everything that I see fits the plfsrc misconfiguration theory.

It's possible there is something wrong inside the code and this isn't a plfsrc problem but if so we really need to find a reproducer. All of the below, however, assumes this is a problem with the plfsrc.

We just need a policy enforcing our plfsrc rules. We could alternatively do a lot of extra work on readdir/getattr/open to try to find stuff like this, but that'd be a decent performance hit, so I'd like to enforce this with policy if possible.

We haven't apparently done a good job of this so far. The policy needs to be that:

  1. any one backend path can only exist in one set of immutable backends.

I've added a bit more checking of the plfsrc to help catch violations of the policy. It just makes sure that any one backend can't exist in multiple backend sets for plfsrc's that define multiple mount points. The problem of course is that this just catches violations which are obvious within the context of a single plfsrc. I suspect the more difficult challenge is to avoid problems from one plfsrc to another both across space and time.

Remove FUSE 2.7 workarounds

10/12/2010:
Once all LANL systems are at FUSE 2.8 or above, remove the fuse_getgroups and the direct_io workarounds that we had in place for FUSE 2.7 limitations.

Is there a way to get some #define statements added from the configure
depending on the version of FUSE? Basically if it's 2.8 or above, then
we can define HAS_FUSE_GETGROUPS 1? If we could just assume it would
only run on 2.8+, then we could really cleanup some ugly workaround
code....

All file operations need to be atomic transactions

PLFS file operations involve several actual UNIX file operations because of the way files are represented in PLFS. Consequently, an interrupted PLFS file operation could leave the PLFS file system in a corrupt state with partial work done.

In general, this can be cleaned-up using plfs_recover. However, it is better to avoid circumstances where the file system may become corrupted because a file operation did not complete before an error occurred.

So, we will make all PLFS file operations atomic transactions. If an error occurs, PLFS will clean-up any partially done work.

Assess effort to convert PLFS to all C

There is thought that having PLFS in all C will be advantageous. We need to assess the effort level to make this conversion.

Reasons include:

  1. Using C++ requires a separate build for each compiler used, or including GNU's libstdc++ with each application built in addition to the specific compiler's libstdc++. This does a couple of things:

a) Creates potential for symbol conflicts, depending on how the various compilers do symbol name-mangling as compared to how GNU does it.

b) Forces applications to have two copies of the libstdc++ library in their image. The compiler-specific one is used by the application code and the GNU one is used by PLFS.

  2. It is felt that it will be easier to get MPI implementations to include support for PLFS file systems if PLFS is all in C and easy to integrate into the MPI code in C.

  3. It is felt that it will be easier to attract contributors and users by having it in C, a more familiar language to most developers, and easier to use C libraries with applications.

createPath(s1,s2) function needed

in the code we often do this:

path = component1 + "/" + component2;

this often results in paths that have multiple slashes in them (e.g.
/foo////bar). Instead we need a function that does:

string
makePath(const string &component1, const string &component2) {
    // append 2 to 1 and add a slash only if necessary
    if (component1.empty()) return component2;
    if (component1[component1.size()-1] == '/') return component1 + component2;
    return component1 + "/" + component2;
}

Memory corruption on Mac laptop

Saw this error on my mac laptop:

plfs(43997,0xb0081000) malloc: *** error for object 0x19c420:
incorrect checksum for freed object - object was probably modified
after being freed.
*** set a breakpoint in malloc_error_break to debug

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xcc0029c0
[Switching to process 43997 thread 0x313]
0x907e6589 in tiny_malloc_from_free_list ()
(gdb) bt

while doing an svn co

Here's the backtrace:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xcc0029c0
[Switching to process 43997 thread 0x313]
0x907e6589 in tiny_malloc_from_free_list ()
(gdb) bt
#0  0x907e6589 in tiny_malloc_from_free_list ()
#1  0x907df1ad in szone_malloc ()
#2  0x907df0b8 in malloc_zone_malloc ()
#3  0x907df04c in malloc ()
#4  0x94c72598 in operator new ()
#5  0x94c59fd2 in std::string::_Rep::_S_create ()
#6  0x94c5a656 in std::string::_S_construct<char*> ()
#7  0x94c5a6c0 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<char*> ()
#8  0x94c58015 in std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::str ()
#9  0x0005edea in LogMessage::flush (this=0xb0080b84) at LogMessage.cpp:53
#10 0x00003e06 in Plfs::f_flush (path=0x182580 "/plfs/doc/SC09/data/overhead/.svn/prop-base/gpfs.dat.svn-base", fi=0x73656363) at plfs_fuse.cpp:1048
#11 0x0002216d in fuse_fs_flush (fs=0x101b30, path=0x182580 "/plfs/doc/SC09/data/overhead/.svn/prop-base/gpfs.dat.svn-base", fi=0xb0080e70) at fuse.c:983
#12 0x00024bca in fuse_flush_common (f=0x1019f0, req=0x1862a0, ino=4635, path=0x182580 "/plfs/doc/SC09/data/overhead/.svn/prop-base/gpfs.dat.svn-base", fi=0xb0080e70) at fuse.c:3021
#13 0x00024d35 in fuse_lib_flush (req=0x1862a0, ino=4635, fi=0xb0080e70) at fuse.c:3081
#14 0x0002a713 in do_flush (req=0x1862a0, nodeid=4635, inarg=0x1019f0) at fuse_lowlevel.c:783
#15 0x00029b91 in fuse_ll_process (data=0x806a00, buf=0x300000 "@", len=64, ch=0x101950) at fuse_lowlevel.c:1327
#16 0x0002bff8 in fuse_session_process (se=0x1019d0, buf=0x300000 "@", len=64, ch=0x101950) at fuse_session.c:93
#17 0x00028bef in fuse_do_work (data=0x101720) at fuse_loop_mt.c:105
#18 0x9080f155 in _pthread_start ()
#19 0x9080f012 in thread_start ()

(gdb) frame 9
#9  0x0005edea in LogMessage::flush (this=0xb0080b84) at LogMessage.cpp:53
53      Util::Debug("%s", this->str().c_str() );
(gdb) p this
Cannot access memory at address 0x0
(gdb) p *this
Cannot access memory at address 0x0
(gdb) frame 10
#10 0x00003e06 in Plfs::f_flush (path=0x182580 "/plfs/doc/SC09/data/overhead/.svn/prop-base/gpfs.dat.svn-base", fi=0x73656363) at plfs_fuse.cpp:1048
1048    PLFS_ENTER; GET_OPEN_FILE;

On Wed, Sep 15, 2010 at 9:04 PM, John Bent <johnbent@…> wrote:

we should do some memory error checking at some point too. maybe
something like insure or valgrind.

Truncate of an open file

There's a bit of a concern that in an N-1 open through FUSE with O_TRUNC passed and no barrier after open, some procs will truncate data from other procs. This is app misbehavior so I don't know if we should care about this, but if we want to protect them, we can always check which nodes have the file open and not truncate their data.

EXASCALE: openhosts and metadata subdirs in container should be hierarchical

The ADIO portion of this problem is resolved, but the PLFS researcher can confirm this assumption.

The FUSE portion is an issue where at Exascale we could have at least O( 1K ) more files per directory.

Here is the original material from the problem we worked in 2010 and 2011:

8/2/2010:
A container looks like:

container:
hostdir1..hostdirN accessfile openhosts metadata

Then each proc puts its data chunk and index chunk in one of the hostdirs, stashes its metadata into a file name in the metadata directory on close, and stashes its node name into a file name on open so that any previous cached droppings in the metadata directory are invalidated.

We've been careful to use a set of hostdirs to reduce the number of
entries in a directory. But we don't do that for the openhosts and the metadata directories: each of those can have N entries where N is the number of procs. We need to make them do hierarchical subdirs just like the hostdirs.

9/15/2010:
We don't need to do this for V1.0. Deferred until V2.0. The scale of the systems for V1.0 don't necessitate this.

4/29/2011:
Adam added something in ADIO so there is just one open and meta file per job not per proc. For FUSE it is still a problem. I'm not even sure actually whether it is one per proc through FUSE or one per node. In any event this should maybe be fixed eventually for FUSE. FUSE at scale will put a lot of meta and open files in the same directory.

Well, we're not currently at large FUSE scale so I'm gonna continue to ignore this.

Parsing of include directive assumes a mount_point directive

There is a bug in plfs_check_config (and, I'm sure, the plfs parsing code) for plfsrc files that have "include" directives in them. The parser incorrectly assumes that every file that is included must contain, or be preceded by a file that has, a mount_point directive in it.

The parser should read-in all of the file, inlining any include directive material, and then check the whole resulting plfsrc file for correctness.

Fix for PLFS 2.2.
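
A rough sketch of the read-then-validate idea (assuming an "include <file>"
line syntax; this is not the actual PLFS parser, just an illustration of
inlining includes before any correctness checks run):

#include <fstream>
#include <sstream>
#include <string>

std::string inline_includes(const std::string &path)
{
    std::ifstream in(path.c_str());
    std::ostringstream out;
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, 8, "include ") == 0)
            out << inline_includes(line.substr(8));   // recurse into the included file
        else
            out << line << "\n";
    }
    return out.str();
}
// Only after the whole plfsrc has been assembled should checks such as
// "is a mount_point defined anywhere?" be applied.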

Specfile needs some work

4/28/2011:
The specfile needs:

  1. specifics for RHEL vs. SLES

  2. wait for modprobe fuse to succeed before mounting plfs

Misleading variable names in WriteFile

4/20/2011:
WriteFile has a variable called physical path. However, there is a time when WriteFile is called from fuse:f_rename() when the path is reset to be a logical path. This should be cleaned up (I don't know just how off the top of my head) so that it's not misleading.

[There was a similar problem with Index where it was labeling a variable as logical_path but it was always exclusively holding a physical path, so that was easily fixed by just renaming the Index::logical_path variable to Index::physical_path.]

Below is the email from Chuck discovering this problem. Thanks Chuck!

The leak described in his email has been patched as well in the trunk.

--- Begin forwarded message from Chuck Cranor ---
From: Chuck Cranor <chuck@…>
To: PLFS Development <plfs-devel@…>
Date: Wed, 20 Apr 2011 12:36:31 -0600
Subject: [Plfs-devel] leak + odd

leak?
Container::hostdir_index_read() calls opendir(), never calls closedir().

if((dirp=opendir(path)) == NULL) {

odd?

OpenFile.cpp says:
void Plfs_fd::setPath( string p ) {
    this->path = p;
    if ( writefile ) writefile->setPath( p );
    if ( index ) index->setPath( p );
}

WriteFile.cpp says:
void WriteFile::setPath ( string p ) {
    this->physical_path = p;
    this->has_been_renamed = true;
}

Index.cpp says:
void Index::setPath( string p ) {
    this->logical_path = p;
}

so is path a "physical_path" as WriteFile.cpp says, or is it a "logical_path" like Index.cpp says?

looking at it, setPath() functions only get called in 2 places:

  • f_rename in FUSE calls Plfs_fd::setPath()
  • plfs_parindex_read() calls Index::setPath()
    hmm.

chuck

plfs_check_config -mkdir

Whenever I set up a PLFS installation, I write the plfsrcs and then I do this:

plfs_check_config | & awk '/ENOENT/ {print $5}' | xargs mkdir

Can we add a -mkdir flag to plfs_check_config to do this?

Open-Write-Read-Close broken

3/28/2011:

MPI_File_open(MPI_COMM_SELF, filename, MPI_MODE_CREATE | MPI_MODE_RDWR,
MPI_INFO_NULL, &fh);

MPI_File_write(fh, buf, SIZE, MPI_DOUBLE, &status);

MPI_File_get_size(fh, &size);

MPI_File_seek(fh, 0, MPI_SEEK_SET);

MPI_File_read(fh, buf, SIZE, MPI_DOUBLE, &status);

MPI_File_close(&fh);

this results in
file read:: No such file or directory
for the reads.

Its ok if I do this:

MPI_File_open(MPI_COMM_SELF, filename, MPI_MODE_CREATE | MPI_MODE_RDWR,
MPI_INFO_NULL, &fh);

MPI_File_write(fh, buf, SIZE, MPI_DOUBLE, &status);

MPI_File_close(&fh);

MPI_File_open(MPI_COMM_SELF, filename, MPI_MODE_RDONLY,
MPI_INFO_NULL, &fh);

MPI_File_get_size(fh, &size);

MPI_File_seek(fh, 0, MPI_SEEK_SET);

MPI_File_read(fh, buf, SIZE, MPI_DOUBLE, &status);

MPI_File_close(&fh);

Do you expect coherent reads for MPI_MODE_RDWR?
This is for version 1.1.9 on cielito.

write-then-read without close

The attached file ("small_file.c") is a slightly modified version of the "large_file.c" test from the anl mpich2 repository.

When run with PLFS, it produces the following behavior:
file size = 1048576 bytes
error: buf 0 is -1, should be 0
error: buf 1 is -1, should be 1
...

This behavior occurs on different platforms.

This test uses only one process (and thus should work?). We expect problematic behavior with several processes, but it would be good if this behavior was addressed/specified.

A clue: when fuse is being used, the error only occurs w/ -o direct_io

Enhance PLFS logging capabilities

This is to document the work that Chuck has done with mlog.

Chuck made changes on the trunk, including:

  • added a README.mlog with some docs
  • updated plfsrc(5) man pages for the new mlog-related key values
  • added low-level environment variable config. Just take the key name from plfsrc, prefix it with "PLFS_" and put it in ALL CAPS. Example: plfsrc key "mlog_file" becomes "PLFS_MLOG_FILE" environment var
  • added high-level environment variable config based on John's PLFS_DEBUG_WHERE, PLFS_DEBUG_LEVEL, and PLFS_DEBUG_WHICH suggestion.
    PLFS_DEBUG_WHERE is a log file name (or could be /dev/stderr)
    PLFS_DEBUG_LEVEL is a priority level (crit, warn, info, debug, etc.)
    PLFS_DEBUG_WHICH is a set of comma sep'd subsystem names (index, container, fuse, etc.)

with LEVEL/WHICH there are 4 cases:
LEVEL=undefined, WHICH=undefined -- don't do anything.
LEVEL=undefined, WHICH=defined -- set the list in 'WHICH' to debug lvl
LEVEL=defined, WHICH=undefined -- set all subsystems log level to LEVEL
LEVEL=defined, WHICH=defined -- set the list in 'WHICH' to LEVEL
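
For example, to send debug-level messages from just the index and container
subsystems to a log file (the file name below is only illustrative), the
environment could be set to something like:

PLFS_DEBUG_WHERE=/tmp/plfs.log
PLFS_DEBUG_LEVEL=debug
PLFS_DEBUG_WHICH=index,container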

race condition in creating shadow container

Currently we:

  1. create shadow container
  2. create metalink in canonical

Imagine a crash between 1) and 2). The shadow container will be orphaned. Switch 1) and 2). Also, now imagine this:

rm logical_file
foreach backend ( remove pieces )

  • crash somewhere

Maybe canonical has been deleted but pieces haven't. Weird things might happen. plfs_recover can help, but if canonical is gone and just shadows remain, plfs_recover will think they are logical directories and recover them as such. Then the user will see weird subdirs and stuff.

So maybe each subdir should have a .plfsaccess file in it. Perhaps the .plfsaccess file should be the last thing deleted on an rm.

Disabling data_sieving for PLFS in ADIO

10/27/2010:
Can you pls edit the patches a bit? We need to patch adio/common ad_open.c to change line 234 to this:
if (access_mode & ADIO_WRONLY && fd->file_system != ADIO_PLFS ) {

The problem is that ADIO changes WRONLY to RDWR for the write open so that it can do data sieving which is this optimization that it does if lots of people are doing small writes with holes between them. ADIO does a read of the whole region, does the overwrites in memory, and then does a large write. But we don't want that in PLFS and the RDWR open is really slow in PLFS since we create the index structure. When I did this patch, I dropped write open times on turing with 768 procs from 30 seconds to 6.

Also, we need to figure out where to patch it so that it doesn't try
data sieving on PLFS.

By the way, it seems ugly putting PLFS specific code in common but we have to since they don't pass us the original open flag AND there is already filesystem specific code in there for PVFS2 and NFS and TEST.

12/16/2010:
Turns out there is a way to do this purely in ad_plfs in mpich and the problem is that openmpi is using an old version of ROMIO. blech. See email from Rob Ross to John Bent on Oct 27 2010.

I think we also need to do a similar fix for the Cray MPT because mpich I guess requires that we implement ADIOI_GEN_OpenColl
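
As an aside, from the application side ROMIO's data sieving can generally be
disabled through the standard romio_ds_read/romio_ds_write hints, independent
of the ad_open.c patch discussed above. A small sketch (the wrapper and its
path handling are illustrative only):

#include <mpi.h>

/* open a file for writing with ROMIO data sieving disabled via hints */
static int open_without_sieving(const char *path, MPI_File *fh)
{
    MPI_Info info;
    int rc;

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_write", "disable");  /* no write data sieving */
    MPI_Info_set(info, "romio_ds_read", "disable");   /* no read data sieving */
    rc = MPI_File_open(MPI_COMM_WORLD, (char *)path,
                       MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
    MPI_Info_free(&info);
    return rc;
}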

rewinddir generates an error

Here's a snippet of code that can be turned into a test case. After the rewinddir, the first call to readdir again returns NULL when it should not. However, if you replace the rewinddir with a closedir and then opendir again, it works.

dirp = opendir(dir);
if (dirp == NULL) ;

count = 0;
while ((dirent = readdir(dirp)) != NULL) {
    ++count;
}
printf( "Directory count is %d\n", count );

rewinddir(dirp);
/*
Replace rewinddir, above, with these and it will work.

closedir(dirp);
dirp = opendir(dir);
*/

count = 0;
while ((dirent = readdir(dirp)) != NULL) {
    ++count;
}
printf( "Directory count is %d\n", count );

Type incompatibility in ad_plfs_close.c

(from DK)

I noticed a type incompatibility in ad_plfs_close.c:

    close_opt.num_procs = &procs;

"num_procs" is an int. I don't think it breaks anything because as
far as I can tell, this member is never used. But someone should look
at it.

openhosts and metadata subdirs in container should be hierarchical

8/2/2010:
A container looks like:

container:
hostdir1..hostdirN accessfile openhosts metadata

Then each proc puts its data chunk and index chunk in one of the hostdirs, stashes its metadata into a file name in the metadata directory on close, and stashes its node name into a file name on open so that any previous cached droppings in the metadata directory are invalidated.

We've been careful to use a set of hostdirs to reduce the number of entries in a directory. But we don't do that for the openhosts and the metadata directories: each of those can have N entries where N is the number of procs. We need to make them do hierarchical subdirs just like the hostdirs.

9/15/2010:
We don't need to do this for V1.0. Deferred until V2.0. The scale of the systems for V1.0 don't necessitate this.

4/29/2011:
Adam added something in ADIO so there is just one open and meta file per job not per proc. For FUSE it is still a problem. I'm not even sure actually whether it is one per proc through FUSE or one per node. In any event this should maybe be fixed eventually for FUSE. FUSE at scale will put a lot of meta and open files in the same directory.

Well, we're not currently at large FUSE scale so I'm gonna continue to ignore this.

Race condition for N-1 I/O pattern through FUSE

fs_test doing an N-1 I/O pattern to a PLFS FUSE mount encounters errors with not being able to find the file. This appears to be a race condition. It may be related to Trac Items #33 and #38.

Here's the fs_test command:
aprun -n 1024 -N 16 /users/brettk/Testing/test_fs/src/fs_test.smog.x -strided 1 -sync -tmpdirname /users/brettk/tmp -io posix -size 48M -target /plfs/scratch2/brettk/n1-posix/out.%s -shift -experiment N-1_POSIX_PLFS_Testing.1322082066 -time 300 -barriers aopen -hints panfs_concurrent_write=1 -type 2

Here's the error output:
Wed Nov 23 22:55:12 2011: at operation bopen.
Wed Nov 23 22:55:12 2011: at operation aopen.
Rank 945 Host nid00086 FATAL ERROR 1322114112: Unable to open file /plfs/scratch2/brettk/n1-posix/out.1322114109 for write. (errno=No such file or directory)
Rank [945] DEBUG: Query in /users/brettk/smog_db_up needs to be uploaded
Rank 243 Host nid00028 FATAL ERROR 1322114112: Unable to open file /plfs/scratch2/brettk/n1-posix/out.1322114109 for write. (errno=No such file or directory)
Rank [243] DEBUG: Query in /users/brettk/smog_db_up needs to be uploaded

...and on and on for all the processes.

Gary Grider says: There is a race condition to create the directory for all the procs to write into with N to 1 that doesn’t exist for N to N plfs fuse, so the create part ends up being different because with N to N the race is with yourself. There is no atomic mkdir/file create so there is a tmp/rename thing that is done in the code I think. Anyway, there is for sure a difference in file create, not the code path but in the concurrence in a container directory for N to N vs N to 1 plfs fuse.

This needs to be fixed for PLFS 2.2.

.plfsdebug not refreshing on Mac

9/16/2010:
The first time I read .plfsdebug on a mac, it's correct. Every subsequent time, it's the same as the first and plfs itself doesn't seem to be queried for it. It must be cached. But what's weird is that it gets opened again the second time but not read from. There must be something in the fd that someone interprets to mean that cached data is OK. Here's what FUSE sees on the first read:

unique: 4, opcode: LOOKUP (1), nodeid: 1, insize: 51
LOOKUP /.plfsdebug
NODEID: 2
unique: 4, error: 0 (Unknown error: 0), outsize: 152
unique: 0, opcode: ACCESS (34), nodeid: 2, insize: 48
ACCESS /.plfsdebug 04
unique: 0, error: 0 (Unknown error: 0), outsize: 16
unique: 1, opcode: OPEN (14), nodeid: 2, insize: 48
unique: 1, error: 0 (Unknown error: 0), outsize: 32
OPEN[0] flags: 0x0 /.plfsdebug
unique: 2, opcode: READ (15), nodeid: 2, insize: 64
READ[0] 16384 bytes from 0
READ[0] 16384 bytes
unique: 2, error: 0 (Unknown error: 0), outsize: 16400
unique: 3, opcode: FLUSH (25), nodeid: 2, insize: 64
FLUSH[0]
unique: 3, error: 0 (Unknown error: 0), outsize: 16
unique: 4, opcode: RELEASE (18), nodeid: 2, insize: 64
RELEASE[0] flags: 0x0
unique: 4, error: 0 (Unknown error: 0), outsize: 16

Here's what it sees the second time (the same but no READ...):

unique: 0, opcode: ACCESS (34), nodeid: 2, insize: 48
ACCESS /.plfsdebug 04
unique: 0, error: 0 (Unknown error: 0), outsize: 16
unique: 1, opcode: OPEN (14), nodeid: 2, insize: 48
unique: 1, error: 0 (Unknown error: 0), outsize: 32
OPEN[0] flags: 0x0 /.plfsdebug
unique: 2, opcode: FLUSH (25), nodeid: 2, insize: 64
FLUSH[0]
unique: 2, error: 0 (Unknown error: 0), outsize: 16
unique: 3, opcode: RELEASE (18), nodeid: 2, insize: 64
RELEASE[0] flags: 0x0
unique: 3, error: 0 (Unknown error: 0), outsize: 16

EPERM in FUSE layer

5/5/2011:
There's a possibility that user A creates a file and gives write permissions to user B. User B writes to the file which creates some
physical droppings owned by B. Then later A tries to do something to the file (unlink, chown, chmod, or something) which fails with EPERM due to some of the internal physical files owned by B.

When Adam first joined the project, he added some complicated code to deal with this. However, when I just went through the code to fix chown/chmod, which weren't yet updated to work w/ shadow containers, I cleaned up a bunch of code. I'm really happy with how a lot of the operations work. I reduced the LOC a bunch. However, I didn't bother to maintain the complicated stuff that addresses this situation.

I think the easy way to deal with this is to make basically every function be a wrapper that responds to EPERM by checking if the caller is the original owner and if so doing a recursive chown over the canonical and shadow containers to restore all droppings to the ownership of the original owner and then reattempt the original operation. Blech.

This isn't specific to the FUSE layer.
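
A minimal sketch of that wrapper idea, with made-up helper names standing in
for the real container-walking code:

#include <cerrno>
#include <string>
#include <sys/types.h>

/* hypothetical helpers, not real PLFS functions */
static bool caller_is_original_owner(const std::string &, uid_t) { return true; }
static void chown_all_droppings(const std::string &, uid_t) { /* canonical + shadows */ }

/* on EPERM, restore ownership of every dropping to the original owner and
 * retry the operation once */
template <typename Op>
static int with_eperm_recovery(const std::string &logical, uid_t caller, Op op)
{
    int rc = op();
    if (rc == -EPERM && caller_is_original_owner(logical, caller)) {
        chown_all_droppings(logical, caller);
        rc = op();
    }
    return rc;
}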

Add a -version flag to the PLFS tools

In the tools directory there are tools: plfs_check_config, plfs_flatten_index, plfs_map, plfs_query, plfs_recover, and plfs_version.

Each of these tools needs a -version (or --version, depending on how flags are done with plfs tools now) flag that returns the version of the product. There's a utility plfs_version that calls plfs_version in plfs.C.

With this done, anyone that uses a tool can see what version of the product they are using. Also, any code can call plfs_version in plfs.C to find out what the version is.
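
A tiny sketch of the flag handling; the plfs_version() call is assumed here to
return a version string, so check the real prototype in plfs.h/plfs.C before
copying this:

#include <stdio.h>
#include <string.h>
#include "plfs.h"   /* assumed to declare plfs_version() */

int main(int argc, char **argv)
{
    if (argc > 1 && strcmp(argv[1], "-version") == 0) {
        printf("%s\n", plfs_version());   /* assumed: returns the version string */
        return 0;
    }
    /* ... normal tool behavior ... */
    return 0;
}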

expandPath inefficiencies in plfs.cpp

1/26/2011:
It's just a memory op but for many functions in plfs.cpp (non-critical ones like chmod), we expand the path twice. First when we enter the function and then again when we call is_plfs_file. We should add an optimized version of is_plfs_file that takes an already expanded path.

Implement ad_plfs_writestrided

4/5/2011:
Some applications would benefit from adding ad_plfs_writestrided.

No discussion of what this variable is, hint passed on open/create? Other?

Need PLFS build version display

When diagnosing a problem, the versions of PLFS libraries currently in use can be useful information. PLFS should provide a way for an application program to display the versions of all PLFS related libraries.

One possible implementation would be to have PLFS initialization look for an environment variable, such as PLFS_SHOW_VERSION, and if set, write version strings from the main library and from the adio code layer to stdout. Additional information such as mounted PLFS file systems and their container levels might also be displayed, perhaps conditionally depending on the value of the environment variable.

Out of memory error when reading

2/8/2011:
There was some exchange with Richard Hedges, of LLNL, regarding an out-of-memory error using IOR. It was unclear if the problem exhibited for write, read, or both. We definitely know there is a problem for read. This was the command-line to repeat it:

srun -N4 -n32 IOR -s100 (-i10 in case you need more than one iteration for the nodes to wedge).

3/11/2011:
Adam and John ran out of memory on a read. It was 10,240 processors, n-1 through plfs adio, write size 47001. It was to one metadata server and hashed to 27 directories in the container. The index files were 145k each.

Track this as part of the ListIO effort over using Collective Buffering?

plfs_check_config returns 32 for num_hostdirs

On Smog and Cielito where /etc/plfsrc lists num_hostdirs as 37, when one runs plfs_check_config it returns 32 for that value.

Strangely, on Cielo where /etc/plfsrc lists num_hostdirs as 379, plfs_check_config also returns 32 for that value.

MPI/IO write to FUSE mount, MPI/IO read from ADIO interface error

I used fs_test to write an N-1 file using MPI/IO to a Smog FUSE mount, /plfs/scratch2/brettk/n1-mpi-no-plfs-colon/, with the "touch" parameter set to 3 so that it wrote a certain value to every byte.

I then used fs_test to read that N-1 file using MPI/IO from the same spot on Smog, only with "plfs:" prepended to the path. I got these type of errors:

Rank 17 Host nid00003 WARNING ERROR 1324053580: 50331648 bad bytes at file offset 17767071744. Nothing but zeroes. (errno=No such file or directory)
Rank 16 Host nid00003 WARNING ERROR 1324053580: 50331648 bad bytes at file offset 16106127360. Nothing but zeroes.
Rank 0 Host nid00002 WARNING ERROR 1324053580: 50331648 bad bytes at file offset 15300820992. Nothing but zeroes.
Rank 0 Host nid00002 WARNING ERROR 1324053580: 50331648 bad bytes at file offset 16911433728. Nothing but zeroes.
Rank 0 Host nid00002 WARNING ERROR 1324053580: 50331648 bad bytes at file offset 18522046464. Nothing but zeroes.
Rank 0 Host nid00002 WARNING ERROR 1324053580: 50331648 bad bytes at file offset 20132659200. Nothing but zeroes.
Rank 0 Host nid00002 WARNING ERROR 1324053580: 50331648 bad bytes at file offset 21743271936. Nothing but zeroes.
Rank 0 Host nid00002 WARNING ERROR 1324053580: 50331648 bad bytes at file offset 23353884672. Nothing but zeroes.
Rank 4 Host nid00002 WARNING ERROR 1324053580: 50331648 bad bytes at file offset 17112760320. Nothing but zeroes. (errno=No such file or directory)
Rank 2 Host nid00002 WARNING ERROR 1324053580: 50331648 bad bytes at file offset 18622709760. Nothing but zeroes. (errno=No such file or directory)
