dat-discussions · DAT Collaborative
Message #4139 of 4166
Re: [dat-discussions] Multicast RDMA proposal

Ok, having sparked the discussion, I'd better have some answers for
some of the more problematic questions - if not immediately, then in
the near future.

On Fri, 6 Apr 2007 09:53:48 -0700
"Caitlin Bestler" <caitlinb@...> wrote:

> dat-discussions@yahoogroups.com wrote:
> > Caitlin,
> > I had not looked at this partial multicast case.
> >
> > For Send case, what is the semantic when a buffer is not
> > posted by one of the receivers?
> > A group "fails" and needs to be recreated?
> > A receiver falls out of the group?
> >
>
> My hunch is that the receiver without a buffer would
> have to NACK the packet (if it anticipated that it
> would soon have a buffer) and drop out of the group
> if it was a persistent shortage.

Based on the work I've done so far, I'm going to say that your hunch is
correct. It's not actually a case I'd considered in any depth, but I
can't think of anything that would either be simpler or work better.

(If some nodes treated it as an expected message, others as an
unexpected one, and those with no spare memory at all as a failure, you
would produce an ugly mess that would cost more to manage than you'd
ever gain.)
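
To make the NACK-or-drop-out rule concrete, here is a minimal sketch of a receiver's decision when a multicast Send arrives. The class name, the `Action` values, and the "three consecutive shortages" threshold are all illustrative assumptions, not anything specified in this thread or in DAT.

```python
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"  # a posted buffer was available
    NACK = "nack"        # transient shortage: ask for a resend
    LEAVE = "leave"      # persistent shortage: drop out of the group


class Receiver:
    """Hypothetical group member that tracks its posted receive buffers."""

    def __init__(self, max_consecutive_shortages=3):
        self.posted_buffers = 0
        self.shortages = 0  # consecutive Sends that found no buffer
        # Assumed threshold for calling a shortage "persistent".
        self.max_consecutive_shortages = max_consecutive_shortages

    def post_buffer(self):
        self.posted_buffers += 1

    def on_multicast_send(self):
        """Decide what to do with one incoming multicast Send."""
        if self.posted_buffers > 0:
            self.posted_buffers -= 1
            self.shortages = 0
            return Action.DELIVER
        self.shortages += 1
        if self.shortages >= self.max_consecutive_shortages:
            return Action.LEAVE
        return Action.NACK
```

Every receiver applies the same rule, so the group never ends up with a mix of "expected", "unexpected" and "failed" outcomes for one message.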

> Going beyond multicast to also embrace partial reliability
> would further require unambiguous grouping of packets into
> "exchanges". With full reliability, a gap can simply block
> *all* later completions. With partial reliability you have
> to correctly assign the "data missing" warning to the correct
> message/completion. Theoretically doable, but not trivial.

Now, partial vs full reliability is a case I've looked at in some depth,
because networks are traditionally organized to minimize transmission
distances for very obvious reasons, and that determines what you can do
in the way of reliability and how.

It occurred to me very early on that if N nodes have a copy of a given
packet, then after a NACK has been multicast to the group, the nearest
neighbor with a copy of the NACKed packet need not be the same as the
originator.

(To use modern terminology, NACKs would be anycast, with all recipients
in the group doubling up as anycast servers.)

This requires that gaps be repaired at the time they occur, rather than
being collated, since there is no way of guaranteeing that a node that
has one packet that went missing elsewhere has ALL packets that went
missing elsewhere (and, by definition, never lost any packets itself).
Repairs would be multicast, so if a node discovered it had lost a
packet after the NACK was sent but before the resend arrived, it could
suppress its own NACK and simply process the resend when it got there.
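
A rough sketch of the per-member bookkeeping this implies, assuming packets carry a sequence number. The `Member` class and its method names are hypothetical, and the sketch ignores how the "nearest" holder is chosen, so every member holding a copy would offer a repair.

```python
class Member:
    """Hypothetical group member acting as both receiver and repair source."""

    def __init__(self):
        self.received = {}       # seq -> payload we hold a copy of
        self.nacks_seen = set()  # seqs someone in the group already NACKed

    def on_data(self, seq, payload):
        """Handle an original packet or a multicast repair."""
        self.received.setdefault(seq, payload)
        self.nacks_seen.discard(seq)

    def on_nack(self, seq):
        """A NACK for seq was multicast to the group.

        Returns the payload to re-multicast if we hold a copy, else None.
        """
        self.nacks_seen.add(seq)
        return self.received.get(seq)

    def on_gap_detected(self, seq):
        """Return True only if we should multicast our own NACK for seq."""
        if seq in self.received or seq in self.nacks_seen:
            return False  # repair already requested: just wait for it
        self.nacks_seen.add(seq)
        return True
```

The suppression is the `nacks_seen` check: a member that notices its own loss after someone else's NACK stays quiet and picks the packet up from the repair.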

In the end, I concluded that although this minimizes both wire
distances and the number of messages sent, managing an anycast-based
reliability mechanism adds a lot of complexity for insufficient gain in
the general case.

The alternative is to catalog all NACKs on the originator and then
re-*cast all NACKed packets.

In the multicast case, we can ignore the overheads of sending to
members of the group who already have the packet, as this will have
lower administrative and wire overheads than sending to each specific
target the packets that specific target failed to get. With
multicasting, we can also simply catalog the presence of a NACK; we
don't have to care who sent it, how many nodes NACKed it, etc. If we
resend once to the group, everyone who NACKed it will receive it.
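
As a sketch of the originator-side alternative: the catalog can be a plain set of sequence numbers, since neither the identity nor the number of NACKing nodes matters. The `Originator` class and the `multicast` callable are assumptions made for illustration, standing in for whatever the transport provides.

```python
class Originator:
    """Hypothetical sender that catalogs NACKs and repairs by re-multicast."""

    def __init__(self, multicast):
        self.multicast = multicast  # assumed callable(seq, payload)
        self.sent = {}              # seq -> payload kept for repair
        self.nacked = set()         # presence only: no sender, no count

    def send(self, seq, payload):
        self.sent[seq] = payload
        self.multicast(seq, payload)

    def on_nack(self, seq):
        # Duplicate NACKs from any number of nodes collapse to one entry.
        if seq in self.sent:
            self.nacked.add(seq)

    def repair(self):
        """Re-multicast each NACKed packet once; return the seqs resent."""
        resent = sorted(self.nacked)
        for seq in resent:
            self.multicast(seq, self.sent[seq])
        self.nacked.clear()
        return resent
```

Three NACKs for the same packet cost one resend, and members that already hold the packet simply discard the duplicate.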

This is not how NORM is specified (NORM repairs via PtP, not via
multicast), but I don't think PtP repair is so useful in HPC. If there's
heavy congestion in a fat tree, for example, you're likely to
temporarily lose a region, not a node. Aggravating the problem with
large numbers of resends may cause additional problems.

This raises the question of which reliable multicast protocol is best
suited to this. I'll need to look further into the research to see
whether the requirements call for full reliability or partial
reliability.

> > The meaning of the R-key for sender and receivers becomes strange.
> > Sender can create a key usable by all receivers. Model can be
> > extended to be restricted to a group.
> > But for receivers, unless there is a way for all of them to
> > generate the same key or sender lower layer impl somehow
> > create a table under the covers to convert a single key to
> > receiver specific key and present a single key to a sender...
> >
> Yes, that is *exactly* what it implies -- each receiver
> must be able to create an R-Key with the same number that
> maps to the local buffer that is logically the same from
> the viewpoint of the transmitter.
>
> This is easy to specify as a protocol. But it does not
> work with the very common implementation strategy where
> the R-Key is a direct index to an RDMA Device accessible
> resource. Indeed, *requiring* the RDMA Device to accept
> a user-supplied value for the R-Key would effectively
> require a layer of indirection that would cost on-chip
> resources and/or extra memory references.

Yes, indirection would be required. A direct reference can't work for
multiple devices in potentially very different states.
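
The indirection can be pictured as a per-receiver lookup table from the group-wide R-Key the sender uses to whatever key the local device actually issued. This is only a software sketch with hypothetical names (`RKeyTable`, `bind`, `resolve`). In a real RDMA device the lookup would sit on-chip or in device-accessible memory, which is where the cost Caitlin describes comes from.

```python
class RKeyTable:
    """Hypothetical per-receiver map: group-wide R-Key -> device-local R-Key."""

    def __init__(self):
        self._local_for_group = {}

    def bind(self, group_rkey, local_rkey):
        """Associate the agreed group key with this device's own key."""
        if group_rkey in self._local_for_group:
            raise ValueError(f"group R-Key {group_rkey:#x} already bound")
        self._local_for_group[group_rkey] = local_rkey

    def resolve(self, group_rkey):
        """The extra lookup paid on every incoming RDMA operation."""
        return self._local_for_group[group_rkey]


# Two receivers whose devices handed out different local keys for the
# buffer that is logically the same from the sender's viewpoint.
GROUP_RKEY = 0x1000  # illustrative value agreed by the group
receiver_a, receiver_b = RKeyTable(), RKeyTable()
receiver_a.bind(GROUP_RKEY, 0x2A)
receiver_b.bind(GROUP_RKEY, 0x7F)
```

The sender only ever places `GROUP_RKEY` on the wire, and each receiver resolves it to its own device key.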

> In any event, any attempt to develop "Multicast RDMA"
> would have to show patterns across multiple protocols.
> What portion of the usage of multicast is focused on
> multicast synchronization vs. multicast data distribution.
> What requirement best characterizes application expectations:
> reliable, partially reliable or unreliable? Are groups
> static, or can nodes drop out? What about dropping in?

I would agree with that. Some of this can be answered by using any of
the numerous network simulators out there, some would be more of an
analysis of HPC environments (I'm in discussions with a few labs that
can help me out with characterizations that I don't have the facilities
to reproduce directly), and some will require experiments with a
full-blown reference implementation.

Group characteristics are about the only thing I can be reasonably sure
of. Any use of multicast RDMA by something like MPI-2, OpenMOSIX,
Occam-Pi, or similar environments, will require the ability to join and
leave groups dynamically. I don't see any obvious way of avoiding that.

Nodes should be able to drop out. If a node leaves a group, any partial
receives would need to be discarded. This can be simplified by saying
that a leave can be scheduled during a receive but not executed until
all pending transactions are complete. Then we never get indeterminate
states.
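
A small sketch of the deferred leave, assuming each member counts its in-flight receives. The `Membership` class is hypothetical. Refusing new receives once a leave is scheduled is my own assumption, made so that the leave cannot be postponed indefinitely.

```python
class Membership:
    """Hypothetical per-node group membership with a deferred leave."""

    def __init__(self):
        self.member = True
        self.leave_requested = False
        self.pending = 0  # receives currently in progress

    def begin_receive(self):
        """Return True if a new receive may start on this node."""
        if not self.member or self.leave_requested:
            return False  # assumption: no new work once a leave is scheduled
        self.pending += 1
        return True

    def end_receive(self):
        self.pending -= 1
        self._maybe_leave()

    def request_leave(self):
        """Schedule the leave; it executes only in an inter-message gap."""
        self.leave_requested = True
        self._maybe_leave()

    def _maybe_leave(self):
        if self.leave_requested and self.pending == 0:
            self.member = False
```

A leave requested in the middle of a receive takes effect only when `end_receive` brings the pending count back to zero, so no partial receive is ever abandoned.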

Joins are also required, as there are many mechanisms that permit the
migration of processes between nodes. If a process migrates and the
process is a member of a group, then the new node must also become a
member of that group. Static groups with mobile processes just won't
work.

(Actually this is an added reason why leaves must occur in
inter-message gaps - there must never be a case where a process gets
migrated during a message, as that could lead to really nasty and
unpredictable results.)



Fri Apr 6, 2007 9:00 pm

jdaylightfleet

Messages in this thread:

Jonathan Day (jdaylightfleet), Apr 5, 2007 8:32 pm
This proposal extends the RDMA semantics to include delivery under a message-based reliable multicast protocol, such as NACK-Oriented Reliable Multicast...

Caitlin Bestler (caitlinbestler), Apr 5, 2007 9:17 pm
DAT is probably not the correct forum to discuss this, since I believe the implications of multicast RDMA would be neutral to the API. A reliable multicast...

Kanevsky, Arkady (arkadynetappcom), Apr 6, 2007 12:45 pm
Caitlin, I had not looked at this partial multicast case. For Send case, what is the semantic when a buffer is not posted by one of the receivers? A group...

Caitlin Bestler (caitlinbestler), Apr 6, 2007 4:54 pm
... My hunch is that the receiver without a buffer would have to NACK the packet (if it anticipated that it would soon have a buffer) and drop-out of the group...

Jonathan Day (jdaylightfleet), Apr 6, 2007 9:01 pm
Ok, having sparked the discussion, I'd better have some answers for some of the more problematic questions - if not immediately, then in the near future. On...