This note is from Greg Louis, who did everyone the enormous service of
taking notes at the Enschede workshop.

----------------------------------------------------

Hi.  Here is a brief writeup from the notes I took at the workshop.  It
tries to document outcomes rather than covering discussion in detail.
Please make any use of it that you see fit.

========================================

Notes taken 2001.11.27-8 at the Linux Clustering workshop held in
Enschede.  This workshop was a followup to an introductory meeting held
in Ottawa in July.  In Ottawa it had been decided that the creation of
a framework for the development of Linux clustering would be a
worthwhile effort.

In Enschede, discussion began with a recap of the history and
motivation of the project.  Alan Robertson began by pointing out that
in the area of Linux clustering there is no one organization or group
that sets standards de facto, and there probably won't be any such.
Currently Linux clustering isn't moving forward as fast as it could,
because of fragmentation and duplication of effort.  The image of Linux
as a viable component in large-scale systems is suffering because we
don't have a clear path for implementors of clustering to follow.

Our goals, therefore, should be -- without depriving current customers
of support -- to focus the development of Linux clustering, to
eliminate duplication of effort, and to create a bandwagon effect that
will draw most or all Linux clustering developers together.  More
specifically, we need to create an accepted body of standards,
preferably in association with a standards organization like the IETF
or the Free Standards Group.  The former has a clustering SIG and its
web page refers to linux-ha, but the Free Standards Group has more of
an API orientation as well as being more directly involved with Linux.

Alan proposed that we should

- Standardize the external APIs
- Produce a reference implementation with components that have defined
  but not mandatory internal interfaces.

The idea here is that we do not dictate _how_ the standard APIs are
actually implemented, but we do offer the option of building plugins
that can replace components of the reference implementation, so that
developers can offer choices and cluster builders can make tradeoffs
(an illustrative sketch of such a plugin boundary appears below).

The API standards are to be royalty-free.  There is nontrivial legal
work involved in avoiding issues of patent infringement: organizations
need to commit that they know of no existing patents that would apply,
and that if, after a grace period, they find that they do hold such
patents, they will not enforce them against standard-compliant
implementors.  This legal work has already been addressed by the Free
Standards Group, which is another motive for us to work with them.  It
was pointed out that a clustering standard approved by the Free
Standards Group would carry significant weight and would help to create
the bandwagon effect mentioned above.

We need to define

- The APIs that will go into the standard
- The components needed to implement clustering
- The "plumbing," aka "infrastructure," such as the plugin loader, RPC
  and so on -- basically, the high-fan-in utilities.

It was suggested that

- No component should be larger than a master's thesis, i.e., a few
  months' work
- APIs should be agnostic, not dictating any implementation details
- APIs should have C as their native language, to facilitate
  interfacing

Some consideration was then given to the organization of the project.
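(Purely by way of illustration, and not anything that was agreed at the
workshop: a C-native, implementation-agnostic component boundary of the
kind proposed above might look roughly like the sketch below.  All of
the names -- the ops structure, the plugin filename, the exported
symbol -- are invented; the point is only that a component exports an
operations table and the plugin-loader "plumbing" binds it at run time,
here with plain dlopen()/dlsym().)

  /* Illustrative sketch only; every name here is invented.
   * Build with something like: cc -o demo demo.c -ldl            */
  #include <dlfcn.h>
  #include <stdio.h>

  /* Operations table a hypothetical membership plugin might export. */
  struct cl_membership_ops {
      int  (*init)(const char *cluster_name);
      int  (*node_count)(void);
      void (*fini)(void);
  };

  int main(void)
  {
      /* The plugin path and exported symbol name are made up. */
      void *handle = dlopen("./libcl_membership.so", RTLD_NOW);
      if (handle == NULL) {
          fprintf(stderr, "dlopen: %s\n", dlerror());
          return 1;
      }

      struct cl_membership_ops *ops = dlsym(handle, "cl_membership_ops");
      if (ops != NULL && ops->init("default") == 0) {
          printf("member nodes: %d\n", ops->node_count());
          ops->fini();
      }

      dlclose(handle);
      return 0;
  }

(The loader mechanics don't matter; the point is the shape of the
boundary.  As long as the ops table matches the defined -- but not
mandatory -- internal interface, the reference component and any
replacement plugin are interchangeable.)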
Consensus appeared to develop around the idea that there should be no
one team leader with final authority, but rather a core team assisted
by working groups.  The decision-making process would involve striving
for consensus, but when that could not be reached, a two-thirds
majority should suffice.  Critical issues, or issues where the core
team cannot agree, might be addressed by a committee of the whole, in
which a higher degree of unanimity (3/4 or 4/5) might be desirable.
With agreement in principle on this sort of organization, we agreed
that the details could be worked out and a charter drafted later, with
communication by email.

The goals were then set forth in a bit more detail:

- An accepted standard defining the APIs for clustering
  - The APIs, though suitable for Linux implementation, should not be
    applicable to Linux only; after some discussion, we found ourselves
    in agreement that
    - Linux implementation is the primary target
    - We don't want to compromise that in any way for the sake of
      supporting other OS's
    - Nevertheless we do want to define the APIs so as to make them as
      easy to implement on other OS's as is consistent with the first
      two points; it was mentioned that the APIs need to be
      OS-independent to be useful in HPC work, where applications are
      often written by people who don't own the cluster(s) they will
      run on, and are often run on more than one cluster
  - API specs should aspire to POSIX compliance
  - APIs should in no way dictate an "in-kernel" implementation
  - APIs must not preclude cross-platform work
- Reference implementation(s)
  - The build system should be chosen to build on various platforms
    (automake)
  - Components should be portable when possible
  - Component interfaces should be agnostic with regard to OS and to
    kernel vs. non-kernel implementation (most interfaces among
    components, it was suggested, might end up being external anyway)
- Bandwagon effect (aka mindshare)
  - A viable OSS project
- Timely results
  - It was felt that early release of something usable was important
    in preserving interest and building momentum
- Diverse solution spaces - many niches - broad coverage

Some strategies for achieving those goals were suggested:

- Incremental, iterative development
- Early implementation of basic functionality, with sophistication
  developing over time

A basic list of components was discussed, but as this was much refined
on the following day I won't reproduce it here.  Initial focus was
deemed likely to be on membership, communication and resource
management, with fencing and group services to be addressed in a
second phase.

Bruce Walker led a discussion of the membership component's
functionality.  It began with a definition of a cluster as a group of
peers sharing trust and residing in a common administrative domain;
computers in a client-server relationship were excluded from the
definition.

Some attention was then given to the topic "clusters of clusters,"
including hierarchies, overlap and group services.  Reasons for
configuring a hierarchy might include geographical separation, large
size, varying requirements in detection frequency, different network
topologies and different functions (e.g. core vs. task).  After some
discussion it seemed clear that the API needs to support membership in
more than one cluster.  Ted Ts'o suggested that a good way to do this
is to set a library context, with a default that takes care of the
large majority of cases, where there is only one cluster anyway.
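(Again by way of illustration only -- no API was agreed, and every name
below is invented: Ted's library-context idea might surface in C
roughly as follows.  Calls act on an implicit default cluster unless
the caller explicitly switches context, so the common single-cluster
case needs no extra code at all.)

  /* Illustrative sketch only; none of these names are agreed API. */
  #include <stdio.h>

  struct cl_context {
      char name[64];
      int  nodes;
  };

  /* The library keeps a current context, preset to the local cluster. */
  static struct cl_context default_ctx = { "default", 4 };
  static struct cl_context *current = &default_ctx;

  static void cl_set_context(struct cl_context *ctx)
  {
      current = ctx ? ctx : &default_ctx;  /* NULL means "back to default" */
  }

  static int cl_node_count(void)  /* acts on whatever context is set */
  {
      return current->nodes;
  }

  int main(void)
  {
      struct cl_context remote = { "backup-site", 8 };

      /* Single-cluster caller: never touches contexts at all. */
      printf("%s: %d nodes\n", current->name, cl_node_count());

      /* Multi-cluster caller: switches explicitly, then switches back. */
      cl_set_context(&remote);
      printf("%s: %d nodes\n", current->name, cl_node_count());
      cl_set_context(NULL);
      return 0;
  }

(Getting the default right matters because, as noted, the large
majority of installations have exactly one cluster.)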
The group reviewed the functions that had been implemented in CI.  As
these are available on the CI website (http://ci-linux.sf.net), they're
not repeated here.  During this discussion, Bruce developed a list of
functions not present in CI that will be wanted in the API set:

- list of potential nodes
- policy on formation
- multiple cluster membership
- set cluster context
- inclusion of "up but not joined" in the list of node states
- dealing with node numbers in a hierarchy

There was some discussion around this last point, Ted favouring the use
of a 128-bit uuid.  He pointed out as well that support for
hierarchical naming needs to be carefully specified in order not to
create an impractical naming system.

CI's clusternode_info() call provoked a brief discussion of the pros
and cons of passing structures vs. name-value pairs.  Consensus seemed
to favour structures for common information (in the interest of
efficiency) and separate support for name-value pairs for
vendor-specific information.

Static cluster configuration was discussed next.  This should include
the functions of

- list nodes
- add a node
- delete a node
- list node configuration
- edit node configuration

Dynamic information, it was agreed, should be managed separately.  This
wasn't discussed in any more detail.

Bruce described splitting off a detector module from membership, so
that detection algorithms could be easily swapped.  This concept found
favour with the group, and it was agreed that in addition to node
membership (NMS) and node communications (NCS) there should be a node
"liveness" component (NLV).

A discussion arose around what happens when a node is going down and a
second node starts going down as well.  One view is that the move from
initial to final state should be transactional: the intrusion of a
second event causes the transaction in progress to abort, and a new
transaction begins that covers both.  The other view is that the
individual state changes should execute in parallel, because by the
time any "abort" signal is received, some nodes could have completed
the first state change.  The two approaches could coexist if it were
agreed that such nodes should process the second change as a separate
transaction, but this would lead to individual nodes having different
views of the cluster history.  It wasn't decided how this issue should
be resolved, though the transactional approach appeared to be favoured
by most.

Messaging (communication) was considered next.  Currently, heartbeat
and CI both do guaranteed message delivery with no guarantee of
ordering.  The APIs should define addressing, deal in some way with the
question of ordering, and (it was decided after some discussion)
provide messaging 1-1 and 1-many, the latter of course including 1-all.
(The option of providing only 1-1 and 1-all, where unconcerned nodes
just drop the 1-all messages, was rejected as unlikely to scale well.)

The distributed lock manager (DLM) was considered to be, in effect, a
solved problem.  Our API should define one along the lines of existing
implementations.
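(One more invented illustration, not anything that was decided: the
1-1/1-many split might show up in the API as two calls, one addressing
a single node and one addressing a named group, with "all members"
simply a predefined group name.  The type and function names below are
made up; delivery is assumed guaranteed but, as with heartbeat and CI
today, ordering is not.)

  /* Illustrative sketch only; names and types are invented. */
  #include <stdio.h>

  typedef unsigned int cl_nodeid_t;  /* could equally be a 128-bit uuid */

  /* 1-1: deliver to exactly one peer (stubbed out here). */
  static int cl_msg_send(cl_nodeid_t dest, const void *buf, size_t len)
  {
      (void)buf;                     /* payload ignored in this stub */
      printf("1-1: %zu bytes to node %u\n", len, dest);
      return 0;                      /* 0 = accepted for delivery */
  }

  /* 1-many: deliver to every current member of a named group. */
  static int cl_msg_mcast(const char *group, const void *buf, size_t len)
  {
      (void)buf;                     /* payload ignored in this stub */
      printf("1-many: %zu bytes to group \"%s\"\n", len, group);
      return 0;
  }

  int main(void)
  {
      const char hello[] = "ping";

      cl_msg_send(3, hello, sizeof(hello));               /* one peer      */
      cl_msg_mcast("all-members", hello, sizeof(hello));  /* whole cluster */
      return 0;
  }

(With 1-many as a first-class call, 1-all falls out as just a
predefined group, rather than forcing unconcerned nodes to receive and
drop traffic.)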
The group then took a second cut at a component diagram (not yet
complete), which looked something like this (display with monospace
font):

  CONFIG
  init

  -------- Group services ---------     ----- Resource management -----
  GVS         GBS                       CM             RA
  voting      barriers                  cluster mgmt   agent (scripts)
  GMS         GCS        GTS            RIF            RWATCH
  membership  messaging  transactions   instantiation  monitoring
                                        RFS
                                        fencing
  ---------------------------------     -------------------------------

  -------- Node services ---------
  DLM        SNMP
  locking    agent
  NLV        NCS        NMS
  liveness   messaging  membership
  --------------------------------

Notes:

- The CM component implements policies
- Remote execution is handled by RIF
- RA can ask questions like "can we start this now?"

We talked a bit about resources.  Alan offered the definition "A
resource is something that can be started, stopped, checked for running
and checked for operating."  Resource instances have attributes, and it
must be possible to interrogate the resource as to what these are (this
being a possible application of name-value pairs).  The suggestion was
made that XML rich data types would serve well for this.  Alan also
suggested that resources should be able to report resource dependencies
(e.g. "I'm using /dev/sda4").

At that point lunch became a priority and we agreed to adjourn the
meeting and continue by email :)

===================================================

Regards...............
--
| G r e g  L o u i s         | gpg public key:     |
| http://www.bgl.nu/~glouis  | finger greg@bgl.nu  |