This note is from Greg Louis, who did everyone the enormous service of
taking notes at the Enschede workshop.

----------------------------------------------------

Hi.  Here is a brief writeup from the notes I took at the workshop.  It
tries to document outcomes rather than covering discussion in detail.
Please make any use of it that you see fit.

========================================

Notes taken 2001.11.27-8 at the Linux Clustering workshop held in
Enschede.  This workshop was a followup to an introductory meeting held
in Ottawa in July.  In Ottawa it had been decided that the creation of
a framework for the development of Linux clustering would be a
worthwhile effort.

In Enschede, discussion began with a recap of the history and
motivation of the project.  Alan Robertson began by pointing out that
in the area of Linux clustering there is no one organization or group
that sets standards de facto, and there probably won't be any such.
Currently Linux clustering isn't moving forward as fast as it could,
because of fragmentation and duplication of effort.  The image of Linux
as a viable component in large-scale systems is suffering because we
don't have a clear path for implementors of clustering to follow.

Our goals, therefore, should be -- without depriving current customers
of support -- to focus the development of Linux clustering, to
eliminate duplication of effort, and to create a bandwagon effect that
will draw most or all Linux clustering developers together.  More
specifically, we need to create an accepted body of standards,
preferably in association with a standards organization like the IETF
or the Free Standards Group.  The former has a clustering SIG and its
web page refers to linux-ha, but the Free Standards Group has more of
an API orientation as well as being more directly involved with Linux.

Alan proposed that we should

- Standardize the external APIs
- Produce a reference implementation with components that have defined
  but not mandatory internal interfaces.

The idea here is that we do not dictate _how_ the standard APIs are
actually implemented, but we do offer the option of building plugins
that can replace components of the reference implementation, so that
developers can offer choices and cluster builders can make tradeoffs
(an illustrative sketch of such a plugin boundary appears below).

The API standards are to be royalty-free.  There is nontrivial legal
work involved in avoiding issues of patent infringement: organizations
need to commit that they know of no existing patents that would apply,
and that if, after a grace period, they find that they do hold such
patents, they will not enforce them against standard-compliant
implementors.  This legal work has already been addressed by the Free
Standards Group, which is another motive for us to work with them.  It
was pointed out that a clustering standard approved by the Free
Standards Group would carry significant weight and would help to create
the bandwagon effect mentioned above.

We need to define

- The APIs that will go into the standard
- The components needed to implement clustering
- The "plumbing," aka "infrastructure," such as the plugin loader, RPC
  and so on -- basically, the high-fan-in utilities.

It was suggested that

- No component should be larger than a master's thesis, i.e., a few
  months' work
- APIs should be agnostic, not dictating any implementation details
- APIs should have C as their native language, to facilitate
  interfacing

Some consideration was then given to the organization of the project.
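(Purely by way of illustration, and not anything that was agreed at the
workshop: a C-native, implementation-agnostic component boundary of the
kind proposed above might look roughly like the sketch below.  All of
the names -- the ops structure, the plugin filename, the exported
symbol -- are invented; the point is only that a component exports an
operations table and the plugin-loader "plumbing" binds it at run time,
here with plain dlopen()/dlsym().)

  /* Illustrative sketch only; every name here is invented.
   * Build with something like: cc -o demo demo.c -ldl            */
  #include <dlfcn.h>
  #include <stdio.h>

  /* Operations table a hypothetical membership plugin might export. */
  struct cl_membership_ops {
      int  (*init)(const char *cluster_name);
      int  (*node_count)(void);
      void (*fini)(void);
  };

  int main(void)
  {
      /* The plugin path and exported symbol name are made up. */
      void *handle = dlopen("./libcl_membership.so", RTLD_NOW);
      if (handle == NULL) {
          fprintf(stderr, "dlopen: %s\n", dlerror());
          return 1;
      }

      struct cl_membership_ops *ops = dlsym(handle, "cl_membership_ops");
      if (ops != NULL && ops->init("default") == 0) {
          printf("member nodes: %d\n", ops->node_count());
          ops->fini();
      }

      dlclose(handle);
      return 0;
  }

(The loader mechanics don't matter; the point is the shape of the
boundary.  As long as the ops table matches the defined -- but not
mandatory -- internal interface, the reference component and any
replacement plugin are interchangeable.)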
Consensus appeared to develop around the idea that there should be no
one team leader with final authority, but rather a core team assisted
by working groups.  The decision-making process would involve striving
for consensus, but when that could not be reached, a two-thirds
majority should suffice.  Critical issues, or issues where the core
team cannot agree, might be addressed by a committee of the whole, in
which a higher degree of unanimity (3/4 or 4/5) might be desirable.
With agreement in principle on this sort of organization, we agreed
that the details could be worked out and a charter drafted later, with
communication by email.

The goals were then set forth in a bit more detail:

- An accepted standard defining the APIs for clustering
  - The APIs, though suitable for Linux implementation, should not be
    applicable to Linux only; after some discussion, we found ourselves
    in agreement that
    - Linux implementation is the primary target
    - We don't want to compromise that in any way for the sake of
      supporting other OS's
    - Nevertheless we do want to define the APIs so as to make them as
      easy to implement on other OS's as is consistent with the first
      two points; it was mentioned that the APIs need to be
      OS-independent to be useful in HPC work, where applications are
      often written by people who don't own the cluster(s) they will
      run on, and are often run on more than one cluster
  - API specs should aspire to POSIX compliance
  - APIs should in no way dictate an "in-kernel" implementation
  - APIs must not preclude cross-platform work
- Reference implementation(s)
  - The build system should be chosen to build on various platforms
    (automake)
  - Components should be portable when possible
  - Component interfaces should be agnostic with regard to OS and to
    kernel vs. non-kernel implementation (most interfaces among
    components, it was suggested, might end up being external anyway)
- Bandwagon effect (aka mindshare)
  - A viable OSS project
- Timely results
  - It was felt that early release of something usable was important
    in preserving interest and building momentum
- Diverse solution spaces - many niches - broad coverage

Some strategies for achieving those goals were suggested:

- Incremental, iterative development
- Early implementation of basic functionality, with sophistication
  developing over time

A basic list of components was discussed, but as this was much refined
on the following day I won't reproduce it here.  Initial focus was
deemed likely to be on membership, communication and resource
management, with fencing and group services to be addressed in a
second phase.

Bruce Walker led a discussion of the membership component's
functionality.  It began with a definition of a cluster as a group of
peers sharing trust and residing in a common administrative domain;
computers in a client-server relationship were excluded from the
definition.

Some attention was then given to the topic "clusters of clusters,"
including hierarchies, overlap and group services.  Reasons for
configuring a hierarchy might include geographical separation, large
size, varying requirements in detection frequency, different network
topologies and different functions (e.g. core vs. task).  After some
discussion it seemed clear that the API needs to support membership in
more than one cluster.  Ted Ts'o suggested that a good way to do this
is to set a library context, with a default that takes care of the
large majority of cases, where there is only one cluster anyway.
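(Again by way of illustration only -- no API was agreed, and every name
below is invented: Ted's library-context idea might surface in C
roughly as follows.  Calls act on an implicit default cluster unless
the caller explicitly switches context, so the common single-cluster
case needs no extra code at all.)

  /* Illustrative sketch only; none of these names are agreed API. */
  #include <stdio.h>

  struct cl_context {
      char name[64];
      int  nodes;
  };

  /* The library keeps a current context, preset to the local cluster. */
  static struct cl_context default_ctx = { "default", 4 };
  static struct cl_context *current = &default_ctx;

  static void cl_set_context(struct cl_context *ctx)
  {
      current = ctx ? ctx : &default_ctx;  /* NULL means "back to default" */
  }

  static int cl_node_count(void)  /* acts on whatever context is set */
  {
      return current->nodes;
  }

  int main(void)
  {
      struct cl_context remote = { "backup-site", 8 };

      /* Single-cluster caller: never touches contexts at all. */
      printf("%s: %d nodes\n", current->name, cl_node_count());

      /* Multi-cluster caller: switches explicitly, then switches back. */
      cl_set_context(&remote);
      printf("%s: %d nodes\n", current->name, cl_node_count());
      cl_set_context(NULL);
      return 0;
  }

(Getting the default right matters because, as noted, the large
majority of installations have exactly one cluster.)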
The group reviewed the functions that had been implemented in CI.  As
these are available on the CI website (http://ci-linux.sf.net), they're
not repeated here.  During this discussion, Bruce developed a list of
functions not present in CI that will be wanted in the API set:

- list of potential nodes
- policy on formation
- multiple cluster membership
- set cluster context
- inclusion of "up but not joined" in the list of node states
- dealing with node numbers in a hierarchy

There was some discussion around this last point, Ted favouring the use
of a 128-bit uuid.  He pointed out as well that support for
hierarchical naming needs to be carefully specified in order not to
create an impractical naming system.

CI's clusternode_info() call provoked a brief discussion of the pros
and cons of passing structures vs. name-value pairs.  Consensus seemed
to favour structures for common information (in the interest of
efficiency) and separate support for name-value pairs for
vendor-specific information.

Static cluster configuration was discussed next.  This should include
the functions of

- list nodes
- add a node
- delete a node
- list node configuration
- edit node configuration

Dynamic information, it was agreed, should be managed separately.  This
wasn't discussed in any more detail.

Bruce described splitting off a detector module from membership, so
that detection algorithms could be easily swapped.  This concept found
favour with the group, and it was agreed that in addition to node
membership (NMS) and node communications (NCS) there should be a node
"liveness" component (NLV).

A discussion arose around what happens when a node is going down and a
second node starts going down as well.  One view is that the move from
initial to final state should be transactional: the intrusion of a
second event causes the transaction in progress to abort, and a new
transaction begins that covers both.  The other view is that the
individual state changes should execute in parallel, because by the
time any "abort" signal is received, some nodes could have completed
the first state change.  The two approaches could coexist if it were
agreed that such nodes should process the second change as a separate
transaction, but this would lead to individual nodes having different
views of the cluster history.  It wasn't decided how this issue should
be resolved, though the transactional approach appeared to be favoured
by most.

Messaging (communication) was considered next.  Currently, heartbeat
and CI both do guaranteed message delivery with no guarantee of
ordering.  The APIs should define addressing, deal in some way with the
question of ordering, and (it was decided after some discussion)
provide messaging 1-1 and 1-many, the latter of course including 1-all.
(The option of providing only 1-1 and 1-all, where unconcerned nodes
just drop the 1-all messages, was rejected as unlikely to scale well.)

The distributed lock manager (DLM) was considered to be, in effect, a
solved problem.  Our API should define one along the lines of existing
implementations.
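(One more invented illustration, not anything that was decided: the
1-1/1-many split might show up in the API as two calls, one addressing
a single node and one addressing a named group, with "all members"
simply a predefined group name.  The type and function names below are
made up; delivery is assumed guaranteed but, as with heartbeat and CI
today, ordering is not.)

  /* Illustrative sketch only; names and types are invented. */
  #include <stdio.h>

  typedef unsigned int cl_nodeid_t;  /* could equally be a 128-bit uuid */

  /* 1-1: deliver to exactly one peer (stubbed out here). */
  static int cl_msg_send(cl_nodeid_t dest, const void *buf, size_t len)
  {
      (void)buf;                     /* payload ignored in this stub */
      printf("1-1: %zu bytes to node %u\n", len, dest);
      return 0;                      /* 0 = accepted for delivery */
  }

  /* 1-many: deliver to every current member of a named group. */
  static int cl_msg_mcast(const char *group, const void *buf, size_t len)
  {
      (void)buf;                     /* payload ignored in this stub */
      printf("1-many: %zu bytes to group \"%s\"\n", len, group);
      return 0;
  }

  int main(void)
  {
      const char hello[] = "ping";

      cl_msg_send(3, hello, sizeof(hello));               /* one peer      */
      cl_msg_mcast("all-members", hello, sizeof(hello));  /* whole cluster */
      return 0;
  }

(With 1-many as a first-class call, 1-all falls out as just a
predefined group, rather than forcing unconcerned nodes to receive and
drop traffic.)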
The group then took a second cut at a component diagram (not yet
complete), which looked something like this (display with monospace
font):

  CONFIG
  init

  -------- Group services ---------     ----- Resource management -----
  GVS         GBS                       CM             RA
  voting      barriers                  cluster mgmt   agent (scripts)
  GMS         GCS        GTS            RIF            RWATCH
  membership  messaging  transactions   instantiation  monitoring
                                        RFS
                                        fencing
  ---------------------------------     -------------------------------

  -------- Node services ---------
  DLM        SNMP
  locking    agent
  NLV        NCS        NMS
  liveness   messaging  membership
  --------------------------------

Notes:

- The CM component implements policies
- Remote execution is handled by RIF
- RA can ask questions like "can we start this now?"

We talked a bit about resources.  Alan offered the definition "A
resource is something that can be started, stopped, checked for running
and checked for operating."  Resource instances have attributes, and it
must be possible to interrogate the resource as to what these are (this
being a possible application of name-value pairs).  The suggestion was
made that XML rich data types would serve well for this.  Alan also
suggested that resources should be able to report resource dependencies
(e.g. "I'm using /dev/sda4").

At that point lunch became a priority and we agreed to adjourn the
meeting and continue by email :)

===================================================

Regards...............
--
| G r e g  L o u i s         | gpg public key:     |
| http://www.bgl.nu/~glouis  | finger greg@bgl.nu  |