Notes on the BMI Portals implementation
=======================================

Copyright (C) 2007 Pete Wyckoff <pw@osc.edu>

Connection management
---------------------
Portals is (for the most part) connectionless.  To find each other requires
external support.  Three bits of information are required to
communicate:  nid, pid, portal index.  The BMI address will provide a
hostname and the pid.  We assume some external mechanism
to convert hostname to nid; for the utcp portals device this is the IPv4
address.  The pid is explicitly provided in the BMI address.  The portal
index for the server is "well known" and hard-coded in the program.

There is no connection structure in the code.  Duplicate request
detection is handled without sequence numbers, as discussed below.


Data motion
-----------
Portals has the same mismatches with BMI that IB has.  The crux is that BMI
permits sends to occur without pre-matching receives, but Portals would
drop these messages.  The sender always sends the entire message, and if
the receive happened to be preposted, everything is done then.  Portals
sends an ack back to the sender under the hood, indicating completion.  If
not preposted, the data goes into one of a few big buffers, but only the
first so much of the data is retained.  If the data fits completely, the
sender is done.  Later when the receive is posted, if the full message
happened to have been received, it is copied out of the unexpected area and
all is done.  If the message was truncated, the sender knows this and
arranges things so that the receiver can do a Get.  Events happen and all
completes.  The Portals atomic MDUpdate command is used to handle the usual
unexpected race condition.

Note the other usage of the word "unexpected" message in BMI is for the
method calls "post_sendunexpected" and "testunexpected".  These are
essentially new requests from clients, and never really have an explictly
pre-posted receive, but need to go somewhere that the server can find them.
We manage a queue of preposted buffers for all possible senders, not a
separate queue for each sender as in the IB case.  (IB was written before
SRQ implementations were available.)  There is no explicit flow control, so
if the clients manage to overwhelm this buffer, we rely on the upper layers
of PVFS to cancel the request and try again.  Adding resends in BMI would
lead to duplicate messages and require adding per-client state.  Limiting
the number of outstanding requests per client, and sizing the buffer on the
server to expect that many may guarantee that this never happens.

Put buffers and get destination buffers do not have to be explicitly
registered.  Building the Portals MD structures does that under the
hood, if necessary.


State paths
-----------
Below are descriptions of how the states progress for the sender and receiver
for the various possible message types.

BMI_post_send
    start
	do the put operation
    SQ_WAITING_ACK
	(ack arrives)
	mark completed, transition
    SQ_WAITING_ACK
	(timeout occurs)
	redo the put operation, stay in this state
    SQ_WAITING_USER_TEST
	(user tests)
	release sendq

BMI_post_recv
    start
	build md/me
    RQ_WAITING_INCOMING
	(event happens)
	mark completed, transition
    RQ_WAITING_USER_TEST
	(user tests)
	release recvq

BMI_post_sendunexpected
    (Same as BMI_post_send, but the put has different match bits
     and indicates receiver-managed.)


Event processing
----------------
All MDs are tied to a single EV.  We distinguish among them by some
sort of user pointer that Portals provides.


Match lists
-----------
Diagram of the various match lists used.  The two bars in each
of the match lists separate the 64 bits into:  2 bits, 30 bits, 32 bits,
in that order.

preposted receives

    match 0 | seqno | bmi_tag -> preposted buf

    match 0 | seqno | bmi_tag -> preposted buf

    ...

outgoing sends

    match 1 | seqno | bmi_tag <- preposted buf, respond to get

    ...

mark

    match none

nonpreopsted receive buffers

    match 0 | any | any -> nonprepost buffer1, max size

    match 0 | any | any -> nonprepost buffer2, max size

unexpected message buffers

    match 2 | any | any -> unexpected buffer1, max size

    match 2 | any | any -> unexpected buffer2, max size

zero

    match 0 | any | any -> no buffer, trunc, max size 0

Preposted receives must come first and be in order so that they match
for expected incoming messages.  The outgoing posted sends come next, and
can be mixed up with the receives since they have a unique bit 62 set.

The order of nonprepost and unexpected buffers doesn't matter, so we let them
mix up among themselves.  The mark entry is used to be able to find the point
between the prepost and other entries, otherwise we'd need lots of code to
track that by hand.

The nonprepost and unexpected buffers are managed as "circular" lists,
where one is filled up until it is unlinked, then it is reposted after
the other that has now started to fill up.  They are actually all mixed
up in the area between mark and zero, as it doesn't matter which order
they appear in.

Nonpreopst messages are kept in the buffers until the app posts a receive
that matches.  If they fill up, later messages fall off the bottom.
Working apps will pre-post their receives before the sender tries to send
to them.

Unexpected messages are a protocol feature of BMI.  A special high-bit 63
indicates this.  They are limited in size by the protocol (8k here), and
are always new requests from a client to a server.  As they arrive, they
are immediately copied into new buffers that are handed back to the server
through BMI_testunexpected.  So we don't need reference counting and the
circular buffer management is straightforward:  just repost immediately
when the unlink event happens.

Finally, for nonprepost messages that are too big, it lands into the zero
md at the end that just generates an event on the sender and receiver.  The
receiver does a get to read the data from the sender later when the app
finally posts the receive.

Note that BMI will post multiple sends to the same dest with the same
tag.  With this scheme, a race condition can develop where peer A sends tag
5 and tag 5, then peer B recvs and acks tag 5, then peer B recvs the second
tag 5 to its zero md, then peer B has an internal post_recv and does a get
to tag 5.  This get hits the _first_ tag 5 on peer A, as peer A has not
gotten around to processing the ack and unlinking that one.  To get around
this, we add a sequence number that sits in the 30 bits just above the 32
bits reserved for the tag.


TODO Notes
----------
Maybe have a separate completion queue distinct from sendq and recvq.
Some lookups are O(N), like for incoming RTS and receive matching.

% vi: set tw=75 :
