Name

    NV_parameter_buffer_object2

Name Strings

    GL_NV_parameter_buffer_object2

Contact

    Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com)

Status

    Shipping (July 2009, Release 190)

Version

    Last Modified Date:         09/09/09
    NVIDIA Revision:            2

Number

    378

Dependencies

    OpenGL 2.0 is required.

    NV_gpu_program4 is required.

    NV_parameter_buffer_object is required.

    This extension is written against the NV_gpu_program4 specification.

    NV_shader_buffer_load trivially affects the definition of this extension.

Overview

    This extension builds on the NV_parameter_buffer_object extension to
    provide additional flexibility in sourcing data from buffer objects.  

    The original NV_parameter_buffer_object (PaBO) extension provided the
    ability to bind buffer objects to a set of numbered binding points and
    access them in assembly programs as though they were arrays of 32-bit
    scalars (via the BUFFER variable type) or arrays of four-component vectors
    with 32-bit scalar components (via the BUFFER4 variable type).  However,
    the functionality it provided had some significant limits on flexibility.
    Since any given buffer binding point could be used either as a BUFFER or
    BUFFER4, but not both, programs couldn't do both 32- and 128-bit fetches
    from a single binding point.  Additionally, No support was provided for
    8-, 16-, or 64-bit fetches, though they could be emulated using a larger
    loads, with bitfield operations and/or write masking to put components in
    the right places.  Indexing was supported, but strides were limited to 4-
    and 16-byte multiples, depending on whether BUFFER or BUFFER4 is used.

    This new extension provides the buffer variable declaration type CBUFFER
    to specify a buffer that is treated as an array of bytes, rather than an
    array of words or vectors.  The LDC instruction allows programs to extract
    a vector of data from a CBUFFER variable, using a size and component count
    specified in the opcode modifier.  1-, 2-, and 4-component fetches are
    supported.  The LDC instruction supports byte offsets using normal array
    indexing mechanisms; both run-time and immediate offsets are supported.
    Offsets used for a buffer object fetch are required to be aligned to the
    size of the fetch (1, 2, 4, 8, or 16 bytes).

New Procedures and Functions

    None.

New Tokens

    None.

Additions to Chapter 2 of the OpenGL 3.0 Specification (OpenGL Operation)

    (All modifications are relative to Section 2.X, GPU Programs, from the
     NV_gpu_program4 specification.)

    Modify Section 2.X.2, Program Grammar

    (add after the long list of grammar rules) If a program specifies the
    NV_parameter_buffer_object2 program option, the following rules are added
    to the NV_gpu_program4 base program grammar:

    <VECTORop>              ::= "LDC"

    <opModifier>            ::= "F32";
                              | "F32X2";
                              | "F32X4";
                              | "S8";
                              | "S16";
                              | "S32";
                              | "S32X2";
                              | "S32X4";
                              | "U8";
                              | "U16";
                              | "U32";
                              | "U32X2";
                              | "U32X4";

    <bufferDeclType>        ::= "CBUFFER"


    Modify Section 2.X.3.6, Program Parameter Buffers

    (modify the paragraph describing the different type of parameter buffer
    variable declarations to include support for "CBUFFER".)

    Program parameter buffer variables are treated as an array of
    single-component words if the <bufferDeclType> grammar rule matches
    "BUFFER" or as an array of four-component vectors if it matches "BUFFER4".
    Program parameter buffers may also be declared as an array of basic
    machine units from which data can be extracted using the LDC (load
    constant) instruction, if <bufferDeclType> matches "CBUFFER".  Parameter
    buffer variables declared using "CBUFFER" may not be used as an operand in
    any instruction other than LDC, while "BUFFER" and "BUFFER4" variables may
    not be used with LDC.  A program will fail to load if a variable declared
    as "BUFFER" and another variable declared as "BUFFER4" use the same buffer
    binding point.  There is no limitation on the use of "CBUFFER" variables
    in conjunction with "BUFFER" or "BUFFER4" variables using the same buffer
    binding point.

    (modify/restructure the paragraph describing basic program parameter
     bindings to handle the byte bindings provided by "CBUFFER" variables)

    If a program parameter buffer binding matches "program.buffer[a][b]", the
    program parameter variable corresponds to element <b> of the buffer object
    bound to binding point <a>.  Each element of the bound buffer object is
    treated as:

      * a single basic machine unit of data, if the variable is declared using
        "CBUFFER";

      * a single word of data that can hold an integer or floating-point
        value, if the variable is declared as "BUFFER"; or

      * four words of data that can hold integer or floating-point values, if
        the variable is declared as "BUFFER4".

    When a binding corresponding to a "BUFFER" variable is used as an operand,
    the selected word is broadcast to all four components of the variable.
    When a binding corresponding to a "BUFFER4" variable is used as an
    operand, the four components of the selected buffer element are loaded
    into the variable.  A binding corresponding to a "CBUFFER" variable may be
    used only in the LDC instruction, and will be used there as a pointer to
    extract operand values from buffer memory.  If no buffer object is bound
    to binding point <a>, or the bound buffer object is not large enough to
    hold element <b>, the values used are undefined.  The binding point <a>
    must be a nonnegative integer constant.


    Modify Section 2.X.4, Program Execution Environment

    (Add to the set of opcodes in Table X.13)

                  Modifiers 
      Instruction F I C S H D  Out Inputs    Description
      ----------- - - - - - -  --- --------  --------------------------------
      LDC         X X X X - F  v   v         load from constant buffer


    Modify Section 2.X.4.1, Program Instruction Modifiers

    (Add to Table X.14, Instruction Modifiers, and to the corresponding
    description following the table)

      Modifier  Description
      --------  -----------------------------------------------
      F32       Access one 32-bit floating-point value
      F32X2     Access two 32-bit floating-point values
      F32X4     Access four 32-bit floating-point values
      S8        Access one 8-bit signed integer value
      S16       Access one 16-bit signed integer value
      S32       Access one 32-bit signed integer value
      S32X2     Access two 32-bit signed integer values
      S32X4     Access four 32-bit signed integer values
      U8        Access one 8-bit unsigned integer value
      U16       Access one 16-bit unsigned integer value
      U32       Access one 32-bit unsigned integer value
      U32X2     Access two 32-bit unsigned integer values
      U32X4     Access four 32-bit unsigned integer values

    For memory load operations, the "F32", "F32X2", "F32X4", "S8", "S16",
    "S32", "S32X2", "S32X4", "U8", "U16", "U32", "U32X2", and "U32X4" storage
    modifiers control how data are loaded from memory.  Storage modifiers are
    supported by the LDC and LOAD instructions and are covered in more detail
    in the descriptions of these instructions.  These instructions must
    specify exactly one of these modifiers, and may not specify any of the
    base data type modifiers (F,U,S) described above.  The base data type of
    the result vector of a LOAD or LDC instruction is trivially derived from
    the storage modifier.


    Add New Section 2.X.4.5, Program Memory Access

    Programs may load from buffer object memory via the LDC (load constant)
    and LOAD (global load) instructions.

    Load instructions read 8, 16, 32, 64, or 128 bits of data from a source
    address to produce a four-component vector, according to the storage
    modifier specified with the instruction.  The storage modifier has three
    parts:

      - a base data type, "F", "S", or "U", specifying that the instruction
        fetches floating-point, signed integer, or unsigned integer values,
        respectively;

      - a component size, specifying that the components fetched by the
        instruction have 8, 16, or 32 bits; and

      - an optional component count, where "X2" and "X4" indicate that two or
        four components be fetched, and no count indicates a single component
        fetch.

    When the storage modifier specifies that fewer than four components should
    be fetched, remaining components are filled with zeroes.  When performing
    a global load (LOAD), the GPU address is specified as an instruction
    operand.  When performing a constant buffer load (LDC), the GPU address is
    derived by adding the base address of the bound buffer object to an offset
    specified as an instruction operand.  Given a GPU address <address> and a
    storage modifier <modifier>, the memory load can be described by the
    following code:

      result_t_vec BufferMemoryLoad(char *address, OpModifier modifier)
      {
        result_t_vec result = { 0, 0, 0, 0 };
        switch (modifier) {
        case F32:
            result.x = ((float32_t *)address)[0];
            break;
        case F32X2:
            result.x = ((float32_t *)address)[0];
            result.y = ((float32_t *)address)[1];
            break;
        case F32X4:
            result.x = ((float32_t *)address)[0];
            result.y = ((float32_t *)address)[1];
            result.z = ((float32_t *)address)[2];
            result.w = ((float32_t *)address)[3];
            break;
        case S8:
            result.x = ((int8_t *)address)[0];
            break;
        case S16:
            result.x = ((int16_t *)address)[0];
            break;
        case S32:
            result.x = ((int32_t *)address)[0];
            break;
        case S32X2:
            result.x = ((int32_t *)address)[0];
            result.y = ((int32_t *)address)[1];
            break;
        case S32X4:
            result.x = ((int32_t *)address)[0];
            result.y = ((int32_t *)address)[1];
            result.z = ((int32_t *)address)[2];
            result.w = ((int32_t *)address)[3];
            break;
        case U8:
            result.x = ((uint8_t *)address)[0];
            break;
        case U16:
            result.x = ((uint16_t *)address)[0];
            break;
        case U32:
            result.x = ((uint32_t *)address)[0];
            break;
        case U32X2:
            result.x = ((uint32_t *)address)[0];
            result.y = ((uint32_t *)address)[1];
            break;
        case U32X4:
            result.x = ((uint32_t *)address)[0];
            result.y = ((uint32_t *)address)[1];
            result.z = ((uint32_t *)address)[2];
            result.w = ((uint32_t *)address)[3];
            break;
        }
        return result;
      }

    The offset used for the constant buffer loads must be aligned to the fetch
    size corresponding to the storage opcode modifier.  For S8 and U8, the
    offset has no alignment requirements.  For S16 and U16, the offset must be
    a multiple of two basic machine units.  For F32, S32, and U32, the offset
    must be a multiple of four.  For F32X2, S32X2, and U32X2, the offset must
    be a multiple of eight.  For F32X4, S32X4, and U32X4, the offset must be a
    multiple of sixteen.  If an offset is not correctly aligned, the values
    returned by a constant buffer load will be undefined.


    Modify Section 2.X.6, Program Options

    + Extended Parameter Buffer Object Support (NV_parameter_buffer_object2)

    If a program specifies the "NV_parameter_buffer_object2" option, it may
    use the CBUFFER statement to declare program parameter buffer variables
    and the LDC instruction to load data from parameter buffer variables using
    arbitrary offsets.


    Modify Section 2.X.8, Program Instruction Set

    Section 2.X.8.Z, LDC:  Load from Constant Buffer

    The LDC instruction loads a vector operand from a buffer object to yield a
    result vector.  The operand used for the LDC instruction must correspond
    to a parameter buffer variable declared using the "CBUFFER" statement; a
    program will fail to load if any other type of operand is used in an LDC
    instruction.

      result = BufferMemoryLoad(&op0, storageModifier);

    A base operand vector is fetched from memory as described in Section
    2.X.4.5, with the GPU address derived from the binding corresponding to
    the operand.  A final operand vector is derived from the base operand
    vector by applying swizzle, negation, and absolute value operand modifiers
    as described in Section 2.X.4.2.

    The amount of memory in any given buffer object binding accessible by the
    LDC instruction may be limited.  If any component fetched by the LDC
    instruction extends 4*<n> or more basic machine units from the beginning
    of the buffer object binding, where <n> is the implementation-dependent
    constant MAX_PROGRAM_PARAMETER_BUFFER_SIZE_NV, the value fetched for that
    component will be undefined.

    LDC supports no base data type modifiers, but requires exactly one storage
    modifier.  The base data types of the operand and result vectors are
    derived from the storage modifier.


Additions to Chapter 3 of the OpenGL 3.0 Specification (Rasterization)

    None.

Additions to Chapter 4 of the OpenGL 3.0 Specification (Per-Fragment
Operations and the Frame Buffer)

    None.

Additions to Chapter 5 of the OpenGL 3.0 Specification (Special Functions)

    None.

Additions to Chapter 6 of the OpenGL 3.0 Specification (State and
State Requests)

    None.

Additions to Appendix A of the OpenGL 3.0 Specification (Invariance)

    None.

Additions to the AGL/GLX/WGL Specifications

    None.

Errors

    No new errors.

Dependencies on NV_shader_buffer_load

    If NV_shader_buffer_load (or equivalent functionality) is not supported,
    references to the "LOAD" opcode in the description of the opcode modifiers
    for "LDC" should be removed.

New State

    None.

New Implementation Dependent State

    None.

Issues

    (1) What sort of alignment requirements, if any, should be imposed on the
        operand provided to the LDC instruction?

      RESOLVED:  The offset of the operand must be aligned according to the
      size of the fetch.  For 1-, 2-, and 4-component fetches, the offset must
      be a multiple of <N>, 2*<N>, and 4*<N>, where <N> is the size in bytes
      of the components being fetched.

    (2) NV_parameter_buffer_object provides an implementation-dependent limit
        on the portion of a buffer object that may be fetched via BUFFER and
        BUFFER4 variables?  Should the same limits apply to the LDC
        instruction?

      RESOLVED:  Yes.  On currently shipping NVIDIA GPUs, the maximum program
      parameter buffer size is 16384 32-bit words, or 64KB.  Buffers larger
      than 64KB may be used, but any fetches accessing memory beyond the first
      64KB of a buffer binding will return undefined values.

    (3) Should we support fetches of 3-component vectors?  If so, what should
    be the minimum alignment for the specified offset?

      RESOLVED:  No, we'll leave 3-component vectors out of this extension.
      This limitation can be worked around by either by doing three separate
      single-component fetches or a four-component fetch with an appropriate
      write mask.  The former approach supports indexing in a tightly packed
      array of 3-component vectors; the latter would require that array
      elements be padded to four components.

    (4) Should we support fetches of 8- and 16-bit components?

      RESOLVED:  Yes, we will support fetches of 8- and 16-bit signed and
      unsigned integers.

      Fetches of vectors of 8- and 16-bit integers are not supported but may
      be emulated by performing shift/mask operations on the results of 32-bit
      fetches.

      Fetches of 16-bit floating-point values, or floating-point vectors
      thereof, are not supported.  A single fp16 fetch may be emulated using a
      16-bit unsigned integer fetch and the UP2H instruction to convert the 16
      LSBs of the fetch to a floating-point value.  The encoding of 16-bit
      floating-point values is described in section 2.1.2 of the OpenGL 3.0
      specification.

    (5) Should we support fetches of 64-bit components?

      RESOLVED:  No; the instruction set provided by NV_gpu_program4 does not
      support 64-bit components anywhere.  If future instructions support
      64-bit components, this restriction should be removed.

    (6) How should the operands of the LDC instruction should be specified?

      RESOLVED:  We will create a new type of buffer variable ("CBUFFER"),
      which defines an array of bytes to be fetched form.  The type of fetch
      to perform is specified by a storage modifier (as in
      NV_shader_buffer_load).  An offset relative to the buffer binding (in
      bytes) may be specified using normal array indexing syntax, and an index
      computed at run-time is supported.

      Some examples:

        CBUFFER buffer[] = { program.buffer[0] };
        TEMP      i;
        MOV.S     i, 32;                  # computed offset of 32B
        LDC.F32   result, buffer[12];     # (x,0,0,0) from bytes 12..15
        LDC.F32X4 result, buffer[16];     # (x,y,z,w) from bytes 16..31
        LDC.U8    result, buffer[i.x+3];  # (x,0,0,0) from byte 35
        LDC.S32   result, buffer[i.x+12]; # (x,0,0,0) from bytes 44..47
        LDC.U32X2 result, buffer[i.x+8];  # (x,y,0,0) from bytes 40..47
        LDC.S16   result, buffer[i.x+2];  # (x,0,0,0) from bytes 34..35

      We chose to provide the new buffer variable type (CBUFFER) rather than
      reusing BUFFER or BUFFER4.  For CBUFFER variables, "buffer[12]"
      unambiguously specifies a 12-byte offset.  For BUFFER or BUFFER4
      variables, an operand of "buffer[12]" already has an existing meaning,
      implying an offset of 12 words or vectors, which would be 48 or 192
      bytes, respectively.  Because we want to be able to fetch 8-, and 16-bit
      units, having an offset multiplied by four doesn't make sense.  We could
      have had LDC simply ignore the type of binding and always interpret an
      index as a byte offset, but chose the new declaration type to avoid
      confusion.
        
      We also considered an approach where the buffer and offset were
      specified in separate operands.  That would be similar to texture, where
      the coordinates and texture are specified separately.  The first operand
      would have been interpreted as a unsigned scalar specifying a byte
      offset, the second operand would have specified a buffer variable
      binding, and a pointer would be obtained by adding the two
      operands. This would have looked something like:

        BUFFER buffer[] = { program.buffer[0] };
        LDC.S32X2 result, offset.x, buffer;

      We chose not to implement this approach mainly because this syntax would
      require specifying a new type of instruction; the syntax we adopted
      simply reuses existing vector operand and indexing mechanisms.
      Additionally, the syntax in this extension provides immediate offsets
      for "free", which the operand-buffer syntax would not support directly
      without additional new syntax.  For example, to load a structure with a
      pair of two-component vectors using offset-buffer syntax, you would have
      to do something like:

        BUFFER buffer[] = { program.buffer[0] };
        TEMP offset;
        LDC.S32X2 result1, offset.x, buffer;
        ADD.U offset.x, offset.x, 8;            # bump offset to second vector
        LDC.S32X2 result2, offset.x, buffer;

    (7) How should the fetches in the LDC instruction interact with other
        operand modifiers (swizzle, absolute value, negation)?  With result
        modifiers (condition codes, saturation)?

      RESOLVED:  These features will be orthogonal.  When any of these
      modifiers are specified, the base data type to which they apply come
      from the storage modifier of the LDC instruction.

      The LDC instruction is defined to produce a "base operand vector" from a
      memory fetch.  This isn't particularly different from normal operands,
      where a base operand vector is derived from the binding corresponding to
      the operand.  In both cases, the components of this vector are swizzled
      and have optional absolute value and negation operations performed to
      produce a final vector operand, as is the case with other vector
      operands.  

      If condition code operations or saturation are specified for the result
      vector, these operations are performed using the appropriate data types.

    (8) What happens if a non-zero base offset is specified for a CBUFFER
        variable?

      RESOLVED:  A subset of the bytes in a buffer object can be specified
      using range syntax like the following:

        CBUFFER buffer[] = { program.buffer[0][16..31] };

      The sub-range need not start at the beginning of the buffer object; in
      the example above, it starts 16 bytes into the buffer.  When accessing a
      parameter buffer variable corresponding to such a sub-range, an array
      index is relative to the base of the sub-range.  So the offset of the
      sub-range is effectively added to the index used for the LDC operand:

        LDC.F32   result, buffer[12];     # (x,0,0,0) from bytes 28..31

    (9) What happens if a non-array CBUFFER variable is used?

      RESOLVED:  A non-array variable may be used with LDC.  However, array
      indexing isn't supported with non-array variables, so all LDC loads
      using that variable will fetch using the same base address.

        CBUFFER bufferElement = program.buffer[0][32];
        LDC.U8    result, buffer;     # (x,0,0,0) from byte 32
        LDC.S16   result, buffer;     # (x,0,0,0) from bytes 32..33
        LDC.F32   result, buffer;     # (x,0,0,0) from bytes 32..35
        LDC.F32X4 result, buffer;     # (x,y,z,w) from bytes 32..47

    (10) Should single-component fetches from LDC smear their results across
         all four components of the result vector, to allow packing multiple
         non-vectors into a single vector?

      RESOLVED:  No.  However, swizzle suffixes on the operand will provide
      this capability for free.  For example, let's say you wanted to fetch
      four scalars from a buffer and pack the results into a single temporary
      vector.  The swizzle syntax lets you do this by smearing the real
      component (always fetched in "x") into the other components:

        CBUFFER buffer[] = { program.buffer[0] };
        LDC.F32 temp.x, buffer[16];
        LDC.F32 temp.y, buffer[28].x;
        LDC.F32 temp.z, buffer[32].x;
        LDC.F32 temp.w, buffer[40].x;
        

Revision History

    Rev.    Date    Author    Changes
    ----  --------  --------  -----------------------------------------
     1              pbrown    Internal revisions.
     2    09/09/09  mjk       Assigned number
