* sort/align endpoint eventq fields for better locality
* group similar fields in user structure for cache effects, cache-align some fields?
  + add a counter per list and group them as well to reduce the progression loop overhead

* idr for the peer table

* dynamically alloc the sendq_map index array out of the medium request?

* drop the mmapped jiffies
  + use clock_gettime(CLOCK_MONOTONIC_COARSE) when available,
    it has HZ resolution and costs about 10ns
  + use clock_gettime(CLOCK_MONOTONIC) otherwise,
    but it costs about 30ns
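
  A minimal sketch of the fallback described above. The helper name
  omx_example_now_ns() is hypothetical and not part of Open-MX; the only
  real API used is POSIX/Linux clock_gettime().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Hypothetical helper (not an existing Open-MX function): return a
 * monotonic timestamp in nanoseconds, preferring the coarse clock. */
static uint64_t omx_example_now_ns(void)
{
    struct timespec ts;

#ifdef CLOCK_MONOTONIC_COARSE
    /* Coarse clock: tick (HZ) resolution, cheapest to read. */
    if (clock_gettime(CLOCK_MONOTONIC_COARSE, &ts) == 0)
        return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#endif
    /* Fallback: finer resolution, but more expensive to read. */
    if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0)
        return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;

    return 0; /* no usable monotonic clock */
}
```

  The #ifdef only covers headers that lack the coarse clock; the runtime
  check covers kernels that reject it.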

* Symlinks for both the static and shared libraries are
  created even if --disable-shared and/or --disable-static
  is given
  + or drop libopen-mx.so and only ship libmyriexpress.so
    (and always enable abi compat)

* set the library ABI version in the library filename

* fix more icc warnings

* constify the open-mx interface ?

* add restrict when c99 is enabled?

* stop supporting mtu!=1500/9000 ?
  + go back to mtu/wire-compat indep library, with send
    function pointers initialized at connect
    - how to generate the multiple instances ?

* cleanup the skbfrags stuff:
  + is it the number of attached frags, or also the linear part?
  + enforce it or just try to optimize?
  + fix the way to compute frags in medium_sends
  + or just make it a binary?

* how to tune mtu/packet-ring-size since 8k-mediums do not always help?

* when invalidating a pinned region, change its status first, with a memory barrier
  and synchronize_rcu if possible so that people using it are gone and won't come back

* push-based model for large

* make omx_prepare_binding more flexible:
  + work on a single board, with --append or --clearfile
  + pass a modulo, or even a hash function
  + pass an irq-name/pattern
* add a tool to auto-tune the interface coalescing and binding?
  + do it within the startup script?
  + driver-specific ethtool configs

* when removing a peer (for instance a local iface), make the index available again
  in case of adding another peer/iface
  + need to add omx_peers_nr since we only have omx_peer_next_nr
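
  One possible shape for the index reuse above, as a userspace sketch.
  The struct, its fixed size and the ex_* names are assumptions made for
  illustration; only the split between a "number of peers" counter and a
  "next index" high-water mark comes from the item itself.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_PEER_MAX 8 /* hypothetical table size, for illustration only */

struct ex_peer_table {
    const void *slots[EX_PEER_MAX]; /* NULL means the index is free */
    uint32_t next_nr;               /* high-water mark: first never-used index */
    uint32_t nr;                    /* number of peers currently in the table */
};

/* Add a non-NULL peer, reusing a freed index below next_nr if there is one.
 * Returns the index, or -1 if the table is full. */
static int ex_peer_add(struct ex_peer_table *t, const void *peer)
{
    uint32_t i;

    if (t->nr < t->next_nr) {
        /* nr < next_nr means at least one hole exists below next_nr. */
        for (i = 0; i < t->next_nr; i++)
            if (!t->slots[i])
                goto found;
    }
    if (t->next_nr == EX_PEER_MAX)
        return -1;
    i = t->next_nr++;
found:
    t->slots[i] = peer;
    t->nr++;
    return (int)i;
}

/* Remove a peer and make its index available again. */
static void ex_peer_remove(struct ex_peer_table *t, uint32_t index)
{
    if (index < t->next_nr && t->slots[index]) {
        t->slots[index] = NULL;
        t->nr--;
    }
}
```

  Comparing nr with next_nr keeps the common no-hole case free of any
  scan, which is why both counters are kept.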

* single cmd to send the whole mediumsq message, with single done event ?
  + get_user_pages/dev_queue_xmit, put_pages in the last callback
  + less pipelining copy/queue_xmit

* pre-alloc many requests in the critical path to avoid ENOMEM in the background later ?
  + stop aborting on failure to alloc a fake recv notify when discarding an unexp rndv

* regcache
  + disable regcache in omx_rcache_test when the driver feature flag is missing
* if killed while registering, do we need to mark the region as failed?
* huge page pinning support
  + add a pagesize in the region segment structure and use either pagesize or hugepagesize when possible
* if failing to deregister region
  *** glibc detected *** tests/omx_pingpong: malloc(): memory corruption: 0x000000000064edd0 ***

* rework counters, distinguish packet/shared level transfer from
  commands and events ?

* add a flag in the endpoint desc for skbfrags
  + use mediumva by default if not skbfrags
  + add driver-specific skbfrags quirks?

* only poll what really needs to be polled in the progression loop
  + only poll the exp event queue if some events are expected
  + only if jiffies changed
    - check enough progression
    - resend queue
    - partners to ack
  + only process delayed requests if something happened in other loops
  + what about check endpoint desc ?
  + make request_alloc_check() called in debug only
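
  A rough sketch of the gating idea above. Everything here (the state
  struct, the counters, ex_progress()) is hypothetical and only models
  which branches would run; the real polling work is reduced to counters.

```c
#include <stdint.h>

/* Hypothetical per-endpoint progression state. */
struct ex_progress_state {
    uint64_t last_jiffies;    /* tick value seen by the previous pass */
    unsigned expected_events; /* events the library is still waiting for */
    unsigned exp_polls;       /* times the expected event queue was polled */
    unsigned timer_passes;    /* times the tick-driven work was run */
};

/* One pass of the progression loop. */
static void ex_progress(struct ex_progress_state *s, uint64_t now_jiffies)
{
    /* Only poll the expected event queue if some events are expected. */
    if (s->expected_events)
        s->exp_polls++;

    /* Tick-driven work only runs when the tick counter has changed:
     * check enough progression, resend queue, partners to ack. */
    if (now_jiffies != s->last_jiffies) {
        s->last_jiffies = now_jiffies;
        s->timer_passes++;
    }
}
```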

* merge the endpoint management stuff from the kernel-matching branch?
* endpoint_close cannot be called from interrupt context
  + we can destroy region at first closing too?
* cleanup endpoint acquiring/locking/closing again
  + synchronize_rcu before freeing after unlinking
  + rcu_dereference+acquire+check status during ioctl and network
  + stop allocating during the whole fd life
  + drop free/init status
  + drop the status_lock

* drop the iface endpoint array mutex and use the big ifaces mutex?

* cleanup the idr+BHlock for handles
  + sleep if not enough slots are available

* fix/improve ioat support
  + sleep a bit before polling on synchronous copies
    - add an empirical ioat speed, tested at startup, tuned at runtime, configurable with a module param
    - use the empirical ioat speed to sleep using msleep during large
      synchronous copy instead of polling
    - if copy completed on first poll, increase the speed by 12.5% or so
    - no need to reduce the speed ever
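
  The speed heuristic above could look roughly like this. The names and
  the bytes-per-millisecond unit are assumptions; in the driver the
  computed delay would feed msleep() before the first poll.

```c
#include <stdint.h>

/* Hypothetical empirical I/OAT speed, in bytes per millisecond. */
struct ex_ioat_speed {
    uint64_t bytes_per_ms;
};

/* How long to sleep before the first poll of a synchronous copy. */
static unsigned ex_ioat_sleep_ms(const struct ex_ioat_speed *s, uint64_t len)
{
    if (!s->bytes_per_ms)
        return 0; /* no estimate yet: poll immediately */
    return (unsigned)(len / s->bytes_per_ms);
}

/* Feed back the result of the first poll after sleeping. */
static void ex_ioat_first_poll(struct ex_ioat_speed *s, int completed)
{
    /* Already complete on first poll: we may have slept too long, so
     * raise the estimate by 12.5% (speed/8). Never reduce it; a copy
     * that is not complete yet is simply polled until it finishes. */
    if (completed)
        s->bytes_per_ms += s->bytes_per_ms >> 3;
}
```

  Using a shift keeps the update integer-only, at the cost of no growth
  once the estimate is below 8.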

* support non-ethernet networks by providing ways to extract the mac address differently
  + use eth_hdr(), dev->header_ops->create() and ->parse() to clarify access
    to the eth headers of packets and make it less ethernet specific

* affinity
  + add OMX_BH_PREWARM_CACHE= to use non-temporal only if not on the process core
    - store the process core binding when opening the endpoint
      and assume the binding won't change if the env var was set

* error handler when completing zombie send/recv only if ep totally open, with special message

* drop no-progression-during-one-second warning if no requests are pending?
  what's a pending request? anything but a posted recv?

* resource allocation cleanups
  + add a counter of failure to alloc small buffer with ENOMEM and abort if too many?
  + add a counter of failure to pull with ENOMEM and abort if too many?

* 64bit internals by default
* 64bit API by default, unless OMX_32BITS_LENGTH is defined
  + how to force the #define OMX_32BITS_LENGTH once compiled and installed?

* look at vringfd (http://lwn.net/Articles/276364/) to replace event ring

* endpoint parameters
  + unexp queue length

* don't free partner on disconnect since we can't change the session anymore?
  + or randomize the initial session?

* thread safety
  + filter wakeup on recv/send/connect/...
  + filter on specific request id too
  + progress thread only woken up if nobody else
  + split the progression timer out of the timeout timer and make it global
    and wakeup a single process
  + change the wakeup_jiffies mapped value into an ioctl parameter?
  + add a last_poll_jiffies so that the driver knows if the progress thread is needed after a timeout

* skb_clone and alloc_skb_fclone for pull and pull replies?

* wire compat
  + fix lib ack contents?

* use large region id in the lib, only allocate uint8_t id for the wire when
  posting the rndv message or the pull request
* export rdma_get and rdma window management functions
* rdma_put
* write parameter in ioctl to register a region, check it when reading/writing from/to the region
  + different rdmawin id for sender/receiver
    - no need to check for deadlock if too many sender's rdmawin registered
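
  For the "large region id in the lib" item above, a sketch of allocating
  the 8-bit wire id only at posting time. The map layout and the ex_*
  names are hypothetical, not taken from the Open-MX sources.

```c
#include <stdint.h>

#define EX_WIRE_IDS  256        /* a uint8_t wire id has 256 values */
#define EX_NO_REGION UINT32_MAX /* marks a free wire id */

/* Hypothetical map from the 8-bit wire id to the large library region id. */
struct ex_wire_map {
    uint32_t region[EX_WIRE_IDS];
};

static void ex_wire_map_init(struct ex_wire_map *m)
{
    for (int i = 0; i < EX_WIRE_IDS; i++)
        m->region[i] = EX_NO_REGION;
}

/* Called when posting the rndv message or the pull request.
 * Returns the wire id, or -1 if all 256 are in use. */
static int ex_wire_id_get(struct ex_wire_map *m, uint32_t lib_region_id)
{
    for (int i = 0; i < EX_WIRE_IDS; i++) {
        if (m->region[i] == EX_NO_REGION) {
            m->region[i] = lib_region_id;
            return i;
        }
    }
    return -1;
}

/* Called once the transfer is done, making the wire id reusable. */
static void ex_wire_id_put(struct ex_wire_map *m, uint8_t wire_id)
{
    m->region[wire_id] = EX_NO_REGION;
}
```

  With this split, the 256-id limit would only bound regions with a
  transfer in flight, not the number of registered regions.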

* move recv lib in the kernel
  + early ioatcopy expected medium, move the other in the recv ring, drop when no space anymore
    - keep unexpected data in the ring forever
    - when unexpected is posted, notify the kernel that we acquired the ring slots and let it
      finish receiving in the target buffer
  + limit unexpq size to avoid memory starvation
    - or allocate default unexp buffers up to 32kB
    - use a usual recv queue for unexp?
  + need to keep partner's recv_seqnum in the kernel
    - and synchronize with user-space which takes care of acking
    - share mapping of peer-index based array of recv seqnums between kernel and user-space?
  + use ioat to offload the whole recv medium copy and wait in the last frag
  + move the whole lib in the driver and only keep basic request status in userspace?

* add warning about mtu in nic/switches when truncated packet arrives, with printk_rate

=== obsolete ideas
* why shared memory from/to same send and recv buffer is very bad from 64k+ 128k ?
* change --iface into ---iface, and use --iface as a "close when no more endpoint"
* make NEED_SEQNUM a resource and use DELAYED for throttling as well ?
* what's in src_generation ? unset/unchecked in OMX
* wire-compat at runtime (8kB frame needs 17bits for medium length)
* req->recv.specific.medium.frags_received_mask and .accumulated_length are redundant
  + the mask is needed to detect duplicates, and the length prevents from precomputing the missing mask
* make shared comms runtime-configurable in the driver
* use mediumva for shared medium? seems slower
  + and mediums are not that useful for shared anyway (lower rndv threshold)
* fix make install -j
  + just support make -j followed by make install -j, otherwise it's a useless mess
* regcache malloc hook (if no driver feature flag)
* merge small/tiny and change the packet type in the driver to TINY only in wire-compat mode?
  + putting the data inside the request/event really helps performance, cannot remove this path
* improve event delivery
  + per cpu rings? not scalable with number of cpus
  + prefetch non-temporal from the event ring to prevent cache line bounces?
  + or write event with non-temporal? hard in the kernel
