Dynamic link libraries and grids
From Ben's Writing
It seems natural to get an error at the OS level when a dynamic link library is missing. If the library is not on the machine, or not in the search paths, then of course the executable cannot be run. But this style of feedback does not scale well.
If, for instance, a user wanted to run a few thousand experiments, and was convinced that a grid system was the correct approach, then getting a few thousand error messages back might be somewhat surprising. As far as they were concerned, they only executed one process, not thousands. We might be inclined to argue that the means by which they requested the experiments implied that thousands of processes would be spawned, so the resulting errors should not be unexpected. But this misses the point. As far as the user should be concerned, the grid is just one big machine. They don't care if some of the machines are configured differently than others; what users care about is getting their work done. If the grid cannot process their request, then most, I believe, would prefer a single, clear reason why the work cannot be done.
Joel Spolsky, of Joel on Software fame, refers to this behaviour as a leaky abstraction (The Law of Leaky Abstractions). In Spolsky's words, an abstraction is a "simplification of something much more complicated that is going on under the covers." A leaky abstraction is one which allows some of the complicated details to escape from the simplification. In the case of a grid, this can be seen when a subset of the machines in it do not contain the same libraries. There are of course many other leaks in contemporary grid implementations, but we'll ignore them for now.
What is most frustrating about this type of leak is that it need not exist.
Take Condor as an example. It suffers from this issue. It is not the grid package's fault that the machines are not configured uniformly. One of Condor's great attributes is in fact that it can run across a heterogeneous set of systems; it is designed and intended to. The issue is that it does not supply the user with sufficient information about these differences. So, while it does allow a user some degree of liberty in ignoring some of the differences between the systems that are part of the grid, it also forces them to pay very close attention to the other ones.
There is a simple fix for this, however. The Condor system employs a very clever matchmaking system to connect jobs with computational resources. In this system the work, or job, that the user wants done has a set of attributes that describe aspects of the work, like the executable, the parameters, memory requirements, etc. Machines in the Condor Pool (read: grid) also have a set of attributes, like the number of cores they have, the amount of physical memory, and so on. The Condor "matchmaker" takes these attributes and matches jobs with machines that qualify. That is, if a machine has enough memory, and is not being used by another human, then it will run the job for the user. If no machine exists that can satisfy the needs of the job, then matchmaking fails, and the user can figure out how to make their job less picky.
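To make the matchmaking idea concrete, here is a minimal sketch in Python. It is not Condor's ad language or API; the attribute names (cores, memory_mb, owner_active), the machine names, and the match function are all made up for illustration.

```python
from typing import Any, Dict, List, Optional

# An "ad" here is just a dictionary of attributes (a simplification).
Ad = Dict[str, Any]


def match(job: Ad, machines: List[Ad]) -> Optional[Ad]:
    """Return the first machine that satisfies the job's requirements, or None."""
    requirements = job["requirements"]
    for machine in machines:
        if requirements(machine):
            return machine
    return None


# Hypothetical machine ads.
machines = [
    {"name": "node01", "cores": 4, "memory_mb": 2048, "owner_active": False},
    {"name": "node02", "cores": 8, "memory_mb": 8192, "owner_active": True},
]

# A picky job: needs 4 GB of memory on a machine nobody is sitting at.
picky_job = {
    "executable": "experiment",
    "requirements": lambda m: m["memory_mb"] >= 4096 and not m["owner_active"],
}

# The same job, made less picky about memory.
relaxed_job = {
    "executable": "experiment",
    "requirements": lambda m: m["memory_mb"] >= 1024 and not m["owner_active"],
}

picky_result = match(picky_job, machines)      # no machine qualifies
relaxed_result = match(relaxed_job, machines)  # node01 qualifies
```

The point of the sketch is only that a failed match is a single answer about the job as a whole, produced before anything is run.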
It seems to me that the same system could be used to match job requirements at a much finer level of detail, and that machines could advertise much more information than they already do.
Returning to the issue of thousands of user processes failing because of library dependencies: if the machines listed the libraries available in the default path, and the job listed the libraries it required, then the matchmaker could do all the work, and return one less surprising error: namely, "there exists no machine with the dynamic libraries you have requested." The user could then either statically link to the library they needed, or kindly request that the system administrator add the libraries to the grid machines.
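A rough sketch of what that could look like, again in plain Python rather than anything Condor actually provides. The library names, machine names, the NoMatchingMachine exception and the eligible_machines function are assumptions made for the example.

```python
from typing import Dict, List, Set


class NoMatchingMachine(Exception):
    """Raised once per submission when no machine can satisfy the job."""


def eligible_machines(required: Set[str], advertised: Dict[str, Set[str]]) -> List[str]:
    """Return the machines that advertise every library the job requires."""
    eligible = sorted(name for name, libs in advertised.items() if required <= libs)
    if not eligible:
        raise NoMatchingMachine(
            "there exists no machine with the dynamic libraries you have requested: "
            + ", ".join(sorted(required))
        )
    return eligible


# Hypothetical advertisements: machine name -> libraries in its default path.
advertised = {
    "node01": {"libm.so.6", "libz.so.1"},
    "node02": {"libm.so.6"},
}

# Hypothetical requirement shared by every process in the batch.
required = {"libm.so.6", "libexample.so.1"}

try:
    targets = eligible_machines(required, advertised)
except NoMatchingMachine as error:
    # One message for the whole batch, before any process is spawned.
    targets = []
    print(error)
```

Because the check runs once against the advertised sets, a batch of a few thousand processes produces one failure message instead of one per process.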
I think this would be a far more natural behaviour for modern grid systems. (I am aware that this is already done on a small scale: some machines have values inserted into their ads for this purpose. But there exists no general purpose system for exporting all libraries on a particular machine.)
Afterthought: It's not clear that each individual machine should be responsible for advertising its libraries, as there would be a great deal of duplicate information between any two machines. So maybe what is needed is a package manager system that knows enough about certain machine classes to be able to tell the matchmaker which machines have which libraries.
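One way to picture that afterthought, as a sketch only: libraries are recorded once per machine class, and each machine just names its class. The class names, machine names, library names and the machines_with function below are invented for illustration.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical registry: libraries are listed once per machine class.
CLASS_LIBRARIES: Dict[str, FrozenSet[str]] = {
    "lab-desktop": frozenset({"libm.so.6", "libexample.so.1"}),
    "cluster-node": frozenset({"libm.so.6"}),
}

# Each machine only records which class it belongs to.
MACHINE_CLASS: Dict[str, str] = {
    "node01": "cluster-node",
    "node02": "cluster-node",
    "desk07": "lab-desktop",
}


def machines_with(required: Set[str]) -> List[str]:
    """Return the machines whose class provides every required library."""
    ok_classes = {
        name for name, libs in CLASS_LIBRARIES.items() if required <= libs
    }
    return sorted(m for m, c in MACHINE_CLASS.items() if c in ok_classes)


only_desktops = machines_with({"libexample.so.1"})
everything = machines_with({"libm.so.6"})
```

The trade-off is that the registry has to be kept in step with what is really installed on each class of machine, which is the job I am imagining the package manager doing.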