2 - GPU INSTANCE RAM (RAMIN) ============================== A GPU contains a block called "XVE" that manages the interface with PCI, a block called "Host" that fetches graphics instructions, blocks called "engines" that execute graphics instructions, and blocks that manage the interface with memory. .-----. .------. | |<------------------>| | | | | | | | .---------. | | | |<--->| Engine1 |<---| | | | `---------' | | .---------. | | | | | GPU | | | .---------. | Host | | Local |<-->| FB |<--->| Engine2 |<---| | | Memory | | MMU | `---------' | | `---------' | Hub | ... | | .--------. | | .---------. | | | System | | |<--->| EngineN |<---| | | Memory | | | `---------' `------' `--------' | | ^ ^ | | | | .---------. | | .--V--. PCI .--V--. .-----. | Display |<-->| |<------------------>| XVE |<--->| NB |<--->| CPU | `---------' `-----' `-----' `-----' `-----' A GPU context is a virtualization of the GPU for a particular software application. A GPU instance block is a block of memory that contains the state for a GPU context. A GPU context's instance block consists of Host state, pointers to each engine's state, and memory management state. A GPU instance block also contains a pointer to a block of memory that contains that part of a GPU context's state that a user-level driver may access. A GPU instance block fits within a single 4K-byte page of memory. Run List Channel-Map RAM .----------. Ch Id .----------------. | RL Entry0 |----. |Ch0 Inst Blk Ptr| | RL Entry1 | | |Ch1 Inst Blk Ptr| | RL Entry2 | | | ... | | ... | `--->|ChI Inst Blk Ptr|----. | RL EntryN | | ... | | `-----------' |ChN Inst Blk Ptr| | `----------------' | | .-----------------------------------------------' | | GPU Instance Block GPFIFO `-->.-----------------. GP_GET .--------. PB Seg | |------------------------------>|GP Entry| .--------. | Host State | |GP Entry|--->|PB Entry| | (RAMFC) | User-Driver State | | |PB Entry| | | .-------. |GP Entry| | ... 
| | |------------->|(USERD)| GP_PUT |GP Entry| |PB Entry| | | | |------->`--------' `--------' | | | | +-----------------+ | | | Memory | `-------' | Management |----------. Page Directory Page Table | State | | .-------. .-------. +-----------------+ `-->| PDE | | PTE | | Pointer to | | PDE |------->| PTE | | Engine0 |--------. | ... | | ... | | State | | | PDE | | PTE | +-----------------+ | `-------' `-------' | Pointer to | | | Engine1 |-----. | Engine0 State | State | | | .-------. +-----------------+ | `---->| | ... | `-------' +-----------------+ | | Pointer to | | Engine1 State | EngineN |--. | .-------. | State | | `------->| | `-----------------' | `-------' | ... | | EngineN State | .-------. `---------->| | `-------' The GPU context's Host state occupies the first 128 double words of an instance block. A GPU context's Host state is called "RAMFC". Please see the NV_RAMFC section below for a description of Host state. The GPU context's memory-management state defines the virtual address space that the GPU context uses. Memory management state consists of page and directory tables (that specify the mapping between virtual addresses and physical addresses, and the attributes of memory pages), and the limit of the virtual address space. The NV_RAMIN_PAGE_DIR_BASE entry contains the address of base of the GPU context's page directory table (PDB). NV_RAMIN_PAGE_DIR_BASE is 4K-byte aligned. The NV_RAMIN_ENG*_WFI_PTR entry contains the address of a block of memory for storing an engine's context state. Blocks of memory that contain engine state are 4K-byte aligned. Only one engine context is supported per instance block. The NV_RAMIN_ENG*_CS field is deprecated, it was used to indicate whether GPU state should be restored from the FGCS pointer or from the WFI CS pointer. Engines only need/support one CTXSW pointer and all state is stored there whether a WFI CS or other form of preemption was performed. 
This field must always be set to WFI for legacy reasons, and will eventually be deleted.


#define NV_RAMIN                                         /* ----G */

// The instance block must be 4k-aligned.
#define NV_RAMIN_BASE_SHIFT                                       12 /*       */

// The instance block size fits within a single 4k block.
#define NV_RAMIN_ALLOC_SIZE                                     4096 /*       */

// Host State
#define NV_RAMIN_RAMFC                          (127*32+31):(0*32+0) /* RWXUF */

// Memory-Management State

The following fields are used for non-VEID engines. The NV_RAMIN_SC_* fields described later are used for VEID engines.

NV_RAMIN_PAGE_DIR_BASE_TARGET determines whether the top level of the page tables is in video memory or system memory (peer is not allowed), and the CPU cache coherency for system memory. Using INVALID unbinds the selected engine.

#define NV_RAMIN_PAGE_DIR_BASE_TARGET                   (128*32+1):(128*32+0) /* RWXUF */
#define NV_RAMIN_PAGE_DIR_BASE_TARGET_VID_MEM                      0x00000000 /* RW--V */
#define NV_RAMIN_PAGE_DIR_BASE_TARGET_INVALID                      0x00000001 /* RW--V */
#define NV_RAMIN_PAGE_DIR_BASE_TARGET_SYS_MEM_COHERENT             0x00000002 /* RW--V */
#define NV_RAMIN_PAGE_DIR_BASE_TARGET_SYS_MEM_NONCOHERENT          0x00000003 /* RW--V */

NV_RAMIN_PAGE_DIR_BASE_VOL identifies the volatile behavior of the top level of the page table (whether local L2 can cache it or not).

#define NV_RAMIN_PAGE_DIR_BASE_VOL                      (128*32+2):(128*32+2) /* RWXUF */
#define NV_RAMIN_PAGE_DIR_BASE_VOL_TRUE                            0x00000001 /* RW--V */
#define NV_RAMIN_PAGE_DIR_BASE_VOL_FALSE                           0x00000000 /* RW--V */

These bits specify whether the MMU will treat faults as replayable or not. The engine will send these bits to the MMU as part of the instance bind.
#define NV_RAMIN_PAGE_DIR_BASE_FAULT_REPLAY_TEX           (128*32+4):(128*32+4) /* RWXUF */
#define NV_RAMIN_PAGE_DIR_BASE_FAULT_REPLAY_TEX_DISABLED             0x00000000 /* RW--V */
#define NV_RAMIN_PAGE_DIR_BASE_FAULT_REPLAY_TEX_ENABLED              0x00000001 /* RW--V */
#define NV_RAMIN_PAGE_DIR_BASE_FAULT_REPLAY_GCC           (128*32+5):(128*32+5) /* RWXUF */
#define NV_RAMIN_PAGE_DIR_BASE_FAULT_REPLAY_GCC_DISABLED             0x00000000 /* RW--V */
#define NV_RAMIN_PAGE_DIR_BASE_FAULT_REPLAY_GCC_ENABLED              0x00000001 /* RW--V */

NV_RAMIN_USE_VER2_PT_FORMAT determines which page table format to use. When NV_RAMIN_USE_VER2_PT_FORMAT is false, the page table uses the old format. When NV_RAMIN_USE_VER2_PT_FORMAT is true, the page table uses the new format. Volta only supports the new format. Selecting the old format results in an UNBOUND_INSTANCE fault.

#define NV_RAMIN_USE_VER2_PT_FORMAT                     (128*32+10):(128*32+10) /*       */
#define NV_RAMIN_USE_VER2_PT_FORMAT_FALSE                            0x00000000 /*       */
#define NV_RAMIN_USE_VER2_PT_FORMAT_TRUE                             0x00000001 /*       */

When NV_PFB_PRI_MMU_CTRL_USE_PDB_BIG_PAGE_SIZE is TRUE, NV_RAMIN_BIG_PAGE_SIZE selects the big page size. When NV_PFB_PRI_MMU_CTRL_USE_PDB_BIG_PAGE_SIZE is FALSE, NV_PFB_PRI_MMU_CTRL_VM_PG_SIZE selects the big page size. Volta only supports 64KB for big pages. Selecting 128KB for big pages results in an UNBOUND_INSTANCE fault.

#define NV_RAMIN_BIG_PAGE_SIZE                          (128*32+11):(128*32+11) /* RWXUF */
#define NV_RAMIN_BIG_PAGE_SIZE_128KB                                 0x00000000 /* RW--V */
#define NV_RAMIN_BIG_PAGE_SIZE_64KB                                  0x00000001 /* RW--V */

NV_RAMIN_PAGE_DIR_BASE_LO and NV_RAMIN_PAGE_DIR_BASE_HI identify the page directory base (start of the page table) location for this context.
#define NV_RAMIN_PAGE_DIR_BASE_LO (128*32+31):(128*32+12) /* RWXUF */ #define NV_RAMIN_PAGE_DIR_BASE_HI (129*32+31):(129*32+0) /* RWXUF */ // Single engine pointer channels cannot support multiple // engines with CTXSW pointers #define NV_RAMIN_ENGINE_CS (132*32+3):(132*32+3) /* */ #define NV_RAMIN_ENGINE_CS_WFI 0x00000000 /* */ #define NV_RAMIN_ENGINE_CS_FG 0x00000001 /* */ #define NV_RAMIN_ENGINE_WFI_TARGET (132*32+1):(132*32+0) /* */ #define NV_RAMIN_ENGINE_WFI_TARGET_LOCAL_MEM 0x00000000 /* */ #define NV_RAMIN_ENGINE_WFI_TARGET_SYS_MEM_COHERENT 0x00000002 /* */ #define NV_RAMIN_ENGINE_WFI_TARGET_SYS_MEM_NONCOHERENT 0x00000003 /* */ #define NV_RAMIN_ENGINE_WFI_MODE (132*32+2):(132*32+2) /* */ #define NV_RAMIN_ENGINE_WFI_MODE_PHYSICAL 0x00000000 /* */ #define NV_RAMIN_ENGINE_WFI_MODE_VIRTUAL 0x00000001 /* */ #define NV_RAMIN_ENGINE_WFI_PTR_LO (132*32+31):(132*32+12) /* */ #define NV_RAMIN_ENGINE_WFI_PTR_HI (133*32+7):(133*32+0) /* */ #define NV_RAMIN_ENGINE_WFI_VEID (134*32+(6-1)):(134*32+0) /* */ #define NV_RAMIN_ENABLE_ATS (135*32+31):(135*32+31) /* RWXUF */ #define NV_RAMIN_ENABLE_ATS_TRUE 0x00000001 /* RW--V */ #define NV_RAMIN_ENABLE_ATS_FALSE 0x00000000 /* RW--V */ #define NV_RAMIN_PASID (135*32+(20-1)):(135*32+0) /* RWXUF */ Pointer to a method buffer in BAR2 memory where a faulted engine can save out methods. BAR2 accesses are assumed to be virtual, so the address saved here is a virtual address. #define NV_RAMIN_ENG_METHOD_BUFFER_ADDR_LO (136*32+31):(136*32+0) /* RWXUF */ #define NV_RAMIN_ENG_METHOD_BUFFER_ADDR_HI (137*32+(((49-1)-32))):(137*32+0) /* RWXUF */ These entries are used to inform FECS which of the below array of PDBs are valid/filled in and need to subsequently be bound. This needs to reserve at least NV_LITTER_NUM_SUBCTX entries. Currently there is enough space reserved for 64 subcontexts. 
#define NV_RAMIN_SC_PDB_VALID(i)                      (166*32+i):(166*32+i) /* RWXUF */
#define NV_RAMIN_SC_PDB_VALID__SIZE_1                                    64 /*       */
#define NV_RAMIN_SC_PDB_VALID_FALSE                              0x00000000 /* RW--V */
#define NV_RAMIN_SC_PDB_VALID_TRUE                               0x00000001 /* RW--V */

// Memory-Management VEID array

The NV_RAMIN_SC_PAGE_DIR_BASE_* entries are an array of page table settings for each subcontext. When a context supports subcontexts, the page table information for a given VEID/subcontext must be filled in, or else page faults will result on access. These page table properties must be filled in for all channels sharing the same context, because any channel's NV_RAMIN may be used to load the context.

The non-subcontext page table information, such as NV_RAMIN_PAGE_DIR_BASE*, is used by non-subcontext engines and clients such as Host, CE, or the video engines.

NV_RAMIN_SC_PAGE_DIR_BASE_TARGET(i) determines whether the top level of the page tables is in video memory or system memory (peer is not allowed), and the CPU cache coherency for system memory. Using INVALID unbinds the selected subcontext.

#define NV_RAMIN_SC_PAGE_DIR_BASE_TARGET(i)   ((168+(i)*4)*32+1):((168+(i)*4)*32+0) /* RWXUF */
#define NV_RAMIN_SC_PAGE_DIR_BASE_TARGET__SIZE_1                                 64 /*       */
#define NV_RAMIN_SC_PAGE_DIR_BASE_TARGET_VID_MEM                         0x00000000 /* RW--V */
#define NV_RAMIN_SC_PAGE_DIR_BASE_TARGET_INVALID                         0x00000001 /* RW--V */ // Note: INVALID should match PEER
#define NV_RAMIN_SC_PAGE_DIR_BASE_TARGET_SYS_MEM_COHERENT                0x00000002 /* RW--V */
#define NV_RAMIN_SC_PAGE_DIR_BASE_TARGET_SYS_MEM_NONCOHERENT             0x00000003 /* RW--V */

NV_RAMIN_SC_PAGE_DIR_BASE_VOL(i) identifies the volatile behavior of the top level of the page table (whether local L2 can cache it or not).
#define NV_RAMIN_SC_PAGE_DIR_BASE_VOL(i)      ((168+(i)*4)*32+2):((168+(i)*4)*32+2) /* RWXUF */
#define NV_RAMIN_SC_PAGE_DIR_BASE_VOL__SIZE_1                                    64 /*       */
#define NV_RAMIN_SC_PAGE_DIR_BASE_VOL_TRUE                               0x00000001 /* RW--V */
#define NV_RAMIN_SC_PAGE_DIR_BASE_VOL_FALSE                              0x00000000 /* RW--V */

The NV_RAMIN_SC_PAGE_DIR_BASE_FAULT_REPLAY_TEX(i) and NV_RAMIN_SC_PAGE_DIR_BASE_FAULT_REPLAY_GCC(i) bits specify whether the MMU treats faults from TEX and GCC as replayable or not. Based on these bits, fault packets are written into the replayable fault buffer (or not) and faulting requests are put into the replay request buffer (or not). The last bind that does not unbind a sub-context determines REPLAY_TEX and REPLAY_GCC for all sub-contexts.

#define NV_RAMIN_SC_PAGE_DIR_BASE_FAULT_REPLAY_TEX(i)  ((168+(i)*4)*32+4):((168+(i)*4)*32+4) /* RWXUF */
#define NV_RAMIN_SC_PAGE_DIR_BASE_FAULT_REPLAY_TEX__SIZE_1                                64 /*       */
#define NV_RAMIN_SC_PAGE_DIR_BASE_FAULT_REPLAY_TEX_DISABLED                       0x00000000 /* RW--V */
#define NV_RAMIN_SC_PAGE_DIR_BASE_FAULT_REPLAY_TEX_ENABLED                        0x00000001 /* RW--V */
#define NV_RAMIN_SC_PAGE_DIR_BASE_FAULT_REPLAY_GCC(i)  ((168+(i)*4)*32+5):((168+(i)*4)*32+5) /* RWXUF */
#define NV_RAMIN_SC_PAGE_DIR_BASE_FAULT_REPLAY_GCC__SIZE_1                                64 /*       */
#define NV_RAMIN_SC_PAGE_DIR_BASE_FAULT_REPLAY_GCC_DISABLED                       0x00000000 /* RW--V */
#define NV_RAMIN_SC_PAGE_DIR_BASE_FAULT_REPLAY_GCC_ENABLED                        0x00000001 /* RW--V */

NV_RAMIN_SC_USE_VER2_PT_FORMAT determines which page table format to use. When NV_RAMIN_SC_USE_VER2_PT_FORMAT is false, the page table uses the old format (2-level page table). When NV_RAMIN_SC_USE_VER2_PT_FORMAT is true, the page table uses the new format (5-level 49-bit VA format). The last bind that does not unbind a sub-context determines the page table format for all sub-contexts. Volta only supports the new format. Selecting the old format results in an UNBOUND_INSTANCE fault.
#define NV_RAMIN_SC_USE_VER2_PT_FORMAT(i)   ((168+(i)*4)*32+10):((168+(i)*4)*32+10) /* RWXUF */
#define NV_RAMIN_SC_USE_VER2_PT_FORMAT__SIZE_1                                   64 /*       */
#define NV_RAMIN_SC_USE_VER2_PT_FORMAT_FALSE                             0x00000000 /* RW--V */
#define NV_RAMIN_SC_USE_VER2_PT_FORMAT_TRUE                              0x00000001 /* RW--V */

The last bind that does not unbind a sub-context determines the big page size for all sub-contexts. Volta only supports 64KB for big pages.

#define NV_RAMIN_SC_BIG_PAGE_SIZE(i)        ((168+(i)*4)*32+11):((168+(i)*4)*32+11) /* RWXUF */
#define NV_RAMIN_SC_BIG_PAGE_SIZE__SIZE_1                                        64 /*       */
#define NV_RAMIN_SC_BIG_PAGE_SIZE_64KB                                   0x00000001 /* RW--V */

NV_RAMIN_SC_PAGE_DIR_BASE_LO(i) and NV_RAMIN_SC_PAGE_DIR_BASE_HI(i) identify the page directory base (start of the page table) location for subcontext i.

#define NV_RAMIN_SC_PAGE_DIR_BASE_LO(i)     ((168+(i)*4)*32+31):((168+(i)*4)*32+12) /* RWXUF */
#define NV_RAMIN_SC_PAGE_DIR_BASE_LO__SIZE_1                                     64 /*       */
#define NV_RAMIN_SC_PAGE_DIR_BASE_HI(i)      ((169+(i)*4)*32+31):((169+(i)*4)*32+0) /* RWXUF */
#define NV_RAMIN_SC_PAGE_DIR_BASE_HI__SIZE_1                                     64 /*       */

NV_RAMIN_SC_ENABLE_ATS(i) indicates whether subcontext i is ATS enabled. When it is set to TRUE, the GMMU looks up VA->PA translations in both the GMMU and ATS page tables. ATS can be enabled or disabled per subcontext.

#define NV_RAMIN_SC_ENABLE_ATS(i)           ((170+(i)*4)*32+31):((170+(i)*4)*32+31) /* RWXUF */

NV_RAMIN_SC_PASID(i) identifies the PASID (process address space ID) on the CPU for subcontext i. The PASID is used to obtain an ATS translation when an ATS page table lookup is needed. During an ATS TLB shootdown, the PASID is also matched against the one carried by the shootdown request.

#define NV_RAMIN_SC_PASID(i)             ((170+(i)*4)*32+(20-1)):((170+(i)*4)*32+0) /* RWXUF */


3 - FIFO CONTEXT RAM (RAMFC)
==============================

The NV_RAMFC part of a GPU-instance block contains Host's part of a virtual GPU's state. Host is referred to as "FIFO". "FC" stands for FIFO Context.
When Host switches from serving one GPU context to serving a second, Host saves state for the first GPU context to the first GPU context's RAMFC area, and loads state for the second GPU context from the second GPU context's RAMFC area. RAMFC is located at NV_RAMIN_RAMFC within the GPU instance block. In Kepler, this is at the start of the block. RAMFC is 4KB aligned. Every Host word entry in RAMFC directly corresponds to a PRI-accessible register. For a description of the contents of a RAMFC entry, please see the description of the corresponding register in "manuals/dev_pbdma.ref". The offsets of the fields within each entry in RAMFC match those of the corresponding register in the associated PBDMA unit's PRI space. RAMFC Entry PBDMA Register ------------------------------- ---------------------------------- NV_RAMFC_SIGNATURE NV_PPBDMA_SIGNATURE(i) NV_RAMFC_GP_BASE NV_PPBDMA_GP_BASE(i) NV_RAMFC_GP_BASE_HI NV_PPBDMA_GP_BASE_HI(i) NV_RAMFC_GP_FETCH NV_PPBDMA_GP_FETCH(i) NV_RAMFC_GP_GET NV_PPBDMA_GP_GET(i) NV_RAMFC_GP_PUT NV_PPBDMA_GP_PUT(i) NV_RAMFC_PB_FETCH NV_PPBDMA_PB_FETCH(i) NV_RAMFC_PB_FETCH_HI NV_PPBDMA_PB_FETCH_HI(i) NV_RAMFC_PB_GET NV_PPBDMA_GET(i) NV_RAMFC_PB_GET_HI NV_PPBDMA_GET_HI(i) NV_RAMFC_PB_PUT NV_PPBDMA_PUT(i) NV_RAMFC_PB_PUT_HI NV_PPBDMA_PUT_HI(i) NV_RAMFC_PB_TOP_LEVEL_GET NV_PPBDMA_TOP_LEVEL_GET(i) NV_RAMFC_PB_TOP_LEVEL_GET_HI NV_PPBDMA_TOP_LEVEL_GET_HI(i) NV_RAMFC_GP_CRC NV_PPBDMA_GP_CRC(i) NV_RAMFC_PB_HEADER NV_PPBDMA_PB_HEADER(i) NV_RAMFC_PB_COUNT NV_PPBDMA_PB_COUNT(i) NV_RAMFC_PB_CRC NV_PPBDMA_PB_CRC(i) NV_RAMFC_SUBDEVICE NV_PPBDMA_SUBDEVICE(i) NV_RAMFC_METHOD0 NV_PPBDMA_METHOD0(i) NV_RAMFC_METHOD1 NV_PPBDMA_METHOD1(i) NV_RAMFC_METHOD2 NV_PPBDMA_METHOD2(i) NV_RAMFC_METHOD3 NV_PPBDMA_METHOD3(i) NV_RAMFC_DATA0 NV_PPBDMA_DATA0(i) NV_RAMFC_DATA1 NV_PPBDMA_DATA1(i) NV_RAMFC_DATA2 NV_PPBDMA_DATA2(i) NV_RAMFC_DATA3 NV_PPBDMA_DATA3(i) NV_RAMFC_TARGET NV_PPBDMA_TARGET(i) NV_RAMFC_METHOD_CRC NV_PPBDMA_METHOD_CRC(i) NV_RAMFC_REF NV_PPBDMA_REF(i) 
NV_RAMFC_RUNTIME NV_PPBDMA_RUNTIME(i) NV_RAMFC_SEM_ADDR_LO NV_PPBDMA_SEM_ADDR_LO(i) NV_RAMFC_SEM_ADDR_HI NV_PPBDMA_SEM_ADDR_HI(i) NV_RAMFC_SEM_PAYLOAD_LO NV_PPBDMA_SEM_PAYLOAD_LO(i) NV_RAMFC_SEM_PAYLOAD_HI NV_PPBDMA_SEM_PAYLOAD_HI(i) NV_RAMFC_SEM_EXECUTE NV_PPBDMA_SEM_EXECUTE(i) NV_RAMFC_ACQUIRE_DEADLINE NV_PPBDMA_ACQUIRE_DEADLINE(i) NV_RAMFC_ACQUIRE NV_PPBDMA_ACQUIRE(i) NV_RAMFC_MEM_OP_A NV_PPBDMA_MEM_OP_A(i) NV_RAMFC_MEM_OP_B NV_PPBDMA_MEM_OP_B(i) NV_RAMFC_MEM_OP_C NV_PPBDMA_MEM_OP_C(i) NV_RAMFC_USERD NV_PPBDMA_USERD(i) NV_RAMFC_USERD_HI NV_PPBDMA_USERD_HI(i) NV_RAMFC_HCE_CTRL NV_PPBDMA_HCE_CTRL(i) NV_RAMFC_CONFIG NV_PPBDMA_CONFIG(i) NV_RAMFC_SET_CHANNEL_INFO NV_PPBDMA_SET_CHANNEL_INFO(i) ------------------------------- ---------------------------------- #define NV_RAMFC /* ----G */ #define NV_RAMFC_GP_PUT (0*32+31):(0*32+0) /* RWXUF */ #define NV_RAMFC_MEM_OP_A (1*32+31):(1*32+0) /* RWXUF */ #define NV_RAMFC_USERD (2*32+31):(2*32+0) /* RWXUF */ #define NV_RAMFC_USERD_HI (3*32+31):(3*32+0) /* RWXUF */ #define NV_RAMFC_SIGNATURE (4*32+31):(4*32+0) /* RWXUF */ #define NV_RAMFC_GP_GET (5*32+31):(5*32+0) /* RWXUF */ #define NV_RAMFC_PB_GET (6*32+31):(6*32+0) /* RWXUF */ #define NV_RAMFC_PB_GET_HI (7*32+31):(7*32+0) /* RWXUF */ #define NV_RAMFC_PB_TOP_LEVEL_GET (8*32+31):(8*32+0) /* RWXUF */ #define NV_RAMFC_PB_TOP_LEVEL_GET_HI (9*32+31):(9*32+0) /* RWXUF */ #define NV_RAMFC_REF (10*32+31):(10*32+0) /* RWXUF */ #define NV_RAMFC_RUNTIME (11*32+31):(11*32+0) /* RWXUF */ #define NV_RAMFC_ACQUIRE (12*32+31):(12*32+0) /* RWXUF */ #define NV_RAMFC_ACQUIRE_DEADLINE (13*32+31):(13*32+0) /* RWXUF */ #define NV_RAMFC_SEM_ADDR_HI (14*32+31):(14*32+0) /* RWXUF */ #define NV_RAMFC_SEM_ADDR_LO (15*32+31):(15*32+0) /* RWXUF */ #define NV_RAMFC_SEM_PAYLOAD_LO (16*32+31):(16*32+0) /* RWXUF */ #define NV_RAMFC_SEM_EXECUTE (17*32+31):(17*32+0) /* RWXUF */ #define NV_RAMFC_GP_BASE (18*32+31):(18*32+0) /* RWXUF */ #define NV_RAMFC_GP_BASE_HI (19*32+31):(19*32+0) /* RWXUF */ #define 
NV_RAMFC_GP_FETCH (20*32+31):(20*32+0) /* RWXUF */ #define NV_RAMFC_PB_FETCH (21*32+31):(21*32+0) /* RWXUF */ #define NV_RAMFC_PB_FETCH_HI (22*32+31):(22*32+0) /* RWXUF */ #define NV_RAMFC_PB_PUT (23*32+31):(23*32+0) /* RWXUF */ #define NV_RAMFC_PB_PUT_HI (24*32+31):(24*32+0) /* RWXUF */ #define NV_RAMFC_MEM_OP_B (25*32+31):(25*32+0) /* RWXUF */ #define NV_RAMFC_RESERVED26 (26*32+31):(26*32+0) /* RWXUF */ #define NV_RAMFC_RESERVED27 (27*32+31):(27*32+0) /* RWXUF */ #define NV_RAMFC_RESERVED28 (28*32+31):(28*32+0) /* RWXUF */ #define NV_RAMFC_GP_CRC (29*32+31):(29*32+0) /* RWXUF */ #define NV_RAMFC_PB_HEADER (33*32+31):(33*32+0) /* RWXUF */ #define NV_RAMFC_PB_COUNT (34*32+31):(34*32+0) /* RWXUF */ #define NV_RAMFC_SUBDEVICE (37*32+31):(37*32+0) /* RWXUF */ #define NV_RAMFC_PB_CRC (38*32+31):(38*32+0) /* RWXUF */ #define NV_RAMFC_SEM_PAYLOAD_HI (39*32+31):(39*32+0) /* RWXUF */ #define NV_RAMFC_MEM_OP_C (40*32+31):(40*32+0) /* RWXUF */ #define NV_RAMFC_RESERVED20 (41*32+31):(41*32+0) /* RWXUF */ #define NV_RAMFC_RESERVED21 (42*32+31):(42*32+0) /* RWXUF */ #define NV_RAMFC_TARGET (43*32+31):(43*32+0) /* RWXUF */ #define NV_RAMFC_METHOD_CRC (44*32+31):(44*32+0) /* RWXUF */ #define NV_RAMFC_METHOD0 (48*32+31):(48*32+0) /* RWXUF */ #define NV_RAMFC_DATA0 (49*32+31):(49*32+0) /* RWXUF */ #define NV_RAMFC_METHOD1 (50*32+31):(50*32+0) /* RWXUF */ #define NV_RAMFC_DATA1 (51*32+31):(51*32+0) /* RWXUF */ #define NV_RAMFC_METHOD2 (52*32+31):(52*32+0) /* RWXUF */ #define NV_RAMFC_DATA2 (53*32+31):(53*32+0) /* RWXUF */ #define NV_RAMFC_METHOD3 (54*32+31):(54*32+0) /* RWXUF */ #define NV_RAMFC_DATA3 (55*32+31):(55*32+0) /* RWXUF */ #define NV_RAMFC_HCE_CTRL (57*32+31):(57*32+0) /* RWXUF */ #define NV_RAMFC_CONFIG (61*32+31):(61*32+0) /* RWXUF */ #define NV_RAMFC_SET_CHANNEL_INFO (63*32+31):(63*32+0) /* RWXUF */ #define NV_RAMFC_BASE_SHIFT 12 /* */ Size of the full range of RAMFC in bytes. 
#define NV_RAMFC_SIZE_VAL                      0x00000200 /* ----C */


4 - USER-DRIVER ACCESSIBLE RAM (RAMUSERD)
=========================================

A user-level driver is allowed to access only a small portion of a GPU context's state. The portion of a GPU context's state that a user-level driver can access is stored in a block of memory called NV_RAMUSERD. NV_RAMUSERD is a user-level driver's window into NV_RAMFC.

The NV_RAMUSERD state for each GPU context is stored in an aligned NV_RAMUSERD_CHAN_SIZE-byte block of memory.

To submit more methods, a user driver writes a PB segment to memory, writes a GP entry that points to the PB segment, updates GP_PUT in RAMUSERD, and writes the channel's handle to the NV_USERMODE_NOTIFY_CHANNEL_PENDING register (see dev_usermode.ref).

The RAMUSERD data structure is updated at regular intervals as controlled by the NV_PFIFO_USERD_WRITEBACK setting (see dev_fifo.ref). For a particular channel, RAMUSERD writeback can be disabled, and it is recommended that SW track pushbuffer and channel progress via Host WFI_DIS semaphores rather than by reading the RAMUSERD data structure.

When write-back is enabled, a user driver can check the GPU's progress in executing a channel's PB segments. The driver can use:

  * GP_GET to monitor the index of the next GP entry the GPU will process
  * PB_GET to monitor the address of the next PB entry the GPU will process
  * TOP_LEVEL_GET (see NV_PPBDMA_TOP_LEVEL_GET) to monitor the address of the next "top-level" (non-SUBROUTINE) PB entry the GPU will process
  * REF to monitor the current "reference count" value (see NV_PPBDMA_REF)

Each entry in RAMUSERD corresponds to a PRI-accessible PBDMA register in Host. For a description of the behavior and contents of a RAMUSERD entry, please see the description of the corresponding register in "manuals/dev_pbdma.ref".

    RAMUSERD Entry                   PBDMA Register                 Access
    -------------------------------  -----------------------------  ----------
    NV_RAMUSERD_GP_PUT               NV_PPBDMA_GP_PUT(i)            Read/Write
    NV_RAMUSERD_GP_GET               NV_PPBDMA_GP_GET(i)            Read-only
    NV_RAMUSERD_GET                  NV_PPBDMA_GET(i)               Read-only
    NV_RAMUSERD_GET_HI               NV_PPBDMA_GET_HI(i)            Read-only
    NV_RAMUSERD_PUT                  NV_PPBDMA_PUT(i)               Read-only
    NV_RAMUSERD_PUT_HI               NV_PPBDMA_PUT_HI(i)            Read-only
    NV_RAMUSERD_TOP_LEVEL_GET        NV_PPBDMA_TOP_LEVEL_GET(i)     Read-only
    NV_RAMUSERD_TOP_LEVEL_GET_HI     NV_PPBDMA_TOP_LEVEL_GET_HI(i)  Read-only
    NV_RAMUSERD_REF                  NV_PPBDMA_REF(i)               Read-only
    -------------------------------  -----------------------------  ----------

A user driver may write to NV_RAMUSERD_GP_PUT to kick off more work in a channel. Although writes to the other, read-only, entries can alter memory, writes to those entries will not affect the operation of the GPU, and can be overwritten by the GPU.

When Host loads its part of a GPU context's state from RAMFC memory, it may not immediately read RAMUSERD_GP_PUT. Host can use the GP_PUT value from RAMFC directly while waiting for RAMUSERD_GP_PUT to synchronize. Because reads of RAMUSERD_GP_PUT can be delayed, the value in NV_PPBDMA_GP_PUT can be older than the value in NV_RAMUSERD_GP_PUT.

When Host saves a GPU context's state to NV_RAMFC, it also writes to NV_RAMUSERD the values of the entries other than GP_PUT. Because Host does not continuously write these read-only entries to memory, the read-only values in USERD memory can be older than the values in the Host PBDMA unit.
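
As a minimal sketch of the access rules above, the following C fragment treats one channel's RAMUSERD block as an array of 32-bit words and uses the word indices implied by the NV_RAMUSERD_* bit ranges (GP_GET at word 34, GP_PUT at word 35). The type userd_block_t and both helper names are hypothetical and are not part of any driver interface. Mapping the block, memory barriers, and the write to NV_USERMODE_NOTIFY_CHANNEL_PENDING are assumed to happen elsewhere and are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Word indices taken from the NV_RAMUSERD_* bit ranges (bit offset / 32). */
#define RAMUSERD_WORD_GP_GET  34u   /* NV_RAMUSERD_GP_GET: (34*32+31):(34*32+0) */
#define RAMUSERD_WORD_GP_PUT  35u   /* NV_RAMUSERD_GP_PUT: (35*32+31):(35*32+0) */
#define RAMUSERD_CHAN_SIZE    512u  /* NV_RAMUSERD_CHAN_SIZE, in bytes */

/* Hypothetical view of one channel's RAMUSERD block as 32-bit words. */
typedef struct {
    volatile uint32_t word[RAMUSERD_CHAN_SIZE / sizeof(uint32_t)];
} userd_block_t;

/* GP_PUT is the only entry a user driver is expected to write. */
static inline void userd_write_gp_put(userd_block_t *u, uint32_t gp_put)
{
    assert(u != NULL);
    u->word[RAMUSERD_WORD_GP_PUT] = gp_put;
}

/* GP_GET is read-only for the driver; the value in memory may be older
 * than the value held in the PBDMA register. */
static inline uint32_t userd_read_gp_get(const userd_block_t *u)
{
    assert(u != NULL);
    return u->word[RAMUSERD_WORD_GP_GET];
}
```

Because the read-only entries are written back only at intervals, a caller of userd_read_gp_get() should treat the result as a lower bound on progress rather than as the current hardware state.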

#define NV_RAMUSERD                                      /* ----G */
#define NV_RAMUSERD_PUT                   (16*32+31):(16*32+0) /* RWXUF */
#define NV_RAMUSERD_GET                   (17*32+31):(17*32+0) /* RWXUF */
#define NV_RAMUSERD_REF                   (18*32+31):(18*32+0) /* RWXUF */
#define NV_RAMUSERD_PUT_HI                (19*32+31):(19*32+0) /* RWXUF */
#define NV_RAMUSERD_TOP_LEVEL_GET         (22*32+31):(22*32+0) /* RWXUF */
#define NV_RAMUSERD_TOP_LEVEL_GET_HI      (23*32+31):(23*32+0) /* RWXUF */
#define NV_RAMUSERD_GET_HI                (24*32+31):(24*32+0) /* RWXUF */
#define NV_RAMUSERD_GP_GET                (34*32+31):(34*32+0) /* RWXUF */
#define NV_RAMUSERD_GP_PUT                (35*32+31):(35*32+0) /* RWXUF */
#define NV_RAMUSERD_BASE_SHIFT                               9 /*       */
#define NV_RAMUSERD_CHAN_SIZE                              512 /*       */


5 - RUN-LIST RAM (RAMRL)
========================

Software specifies the GPU contexts that hardware should "run" by writing a list of entries (known as a "runlist") to a 4k-aligned area of memory (beginning at NV_PFIFO_RUNLIST_BASE), and by notifying Host that a new list is available (by writing to NV_PFIFO_RUNLIST).

Submission of a new runlist causes Host to expire the timeslice of all work scheduled by the previous runlist, allowing it to schedule the channels present in the new runlist once they are fetched. SW can check the status of the runlist by polling NV_PFIFO_ENG_RUNLIST_PENDING (see dev_fifo.ref NV_PFIFO_RUNLIST for a full description of the runlist submit mechanism).

Runlists can be stored in system memory or video memory (as specified by NV_PFIFO_RUNLIST_BASE_TARGET). If a runlist is stored in video memory, software must execute a flush, or read back the last entry written, before submitting the runlist to Host to guarantee coherency.

The size of a runlist entry data structure is 16 bytes. Each entry specifies either a channel entry or a TSG header; the type is determined by NV_RAMRL_ENTRY_TYPE.

Runlist Channel Entry Type:

A runlist entry of type NV_RAMRL_ENTRY_TYPE_CHAN specifies a channel to run.
All such entries must occur within the span of some TSG as specified by the NV_RAMRL_ENTRY_TYPE_TSG described below. If a channel entry is encountered outside a TSG, Host will raise the NV_PFIFO_INTR_SCHED_ERROR_CODE_BAD_TSG interrupt.

The fields available in a channel runlist entry are as follows (Fig 5.1):

    ENTRY_TYPE (T)        : type of this entry: ENTRY_TYPE_CHAN
    CHID (ID)             : identifier of the channel to run (overlays ENTRY_ID)
    RUNQUEUE_SELECTOR (Q) : selects which PBDMA should run this channel if more than one PBDMA is supported by the runlist
    INST_PTR_LO           : lower 20 bits of the 4k-aligned instance block pointer
    INST_PTR_HI           : upper 32 bits of the instance block pointer
    INST_TARGET (TGI)     : aperture of the instance block
    USERD_PTR_LO          : upper 24 bits of the low 32 bits of the 512-byte-aligned USERD pointer
    USERD_PTR_HI          : upper 32 bits of the USERD pointer
    USERD_TARGET (TGU)    : aperture of the USERD data structure

CHID is a channel identifier that uniquely specifies the channel described by this runlist entry to the scheduling hardware and is reported in various status registers.

RUNQUEUE_SELECTOR determines to which runqueue the channel belongs, and thereby which PBDMA will run the channel. Increasing values select increasingly numbered PBDMA IDs serving the runlist. If the selector value exceeds the number of PBDMAs on the runlist, the hardware will silently reassign the channel to run on the first PBDMA as though RUNQUEUE_SELECTOR had been set to 0. (In current hardware, this is used by SCG on the graphics runlist only to determine which FE pipe should service a given channel. A value of 0 targets the first FE pipe, which can process all FE driven engines: Graphics, Compute, Inline2Memory, and TwoD. A value of 1 targets the second FE pipe, which can only process Compute work. Note that GRCE work is allowed on either runqueue.)

The INST fields specify the physical address of the channel's instance block, the in-memory data structure that stores the context state.
The target aperture of the instance block is given by INST_TARGET, and the byte offset within that aperture is calculated as

    (INST_PTR_HI << 32) | (INST_PTR_LO << NV_RAMRL_ENTRY_CHAN_INST_PTR_ALIGN_SHIFT)

This address should match the one specified in the channel RAM's NV_PCCSR_CHANNEL_INST register; see NV_RAMIN and NV_RAMFC for the format of the instance block. The hardware ignores the RAMRL INST fields, but in future chips the instance pointer may be removed from the channel RAM and the RAMRL INST fields used instead, resulting in smaller hardware.

The USERD fields specify the physical address of the USERD memory region used by software to submit additional work to the channel. The target aperture of the USERD region is given by USERD_TARGET, and the byte offset within that aperture is calculated as

    (USERD_PTR_HI << 32) | (USERD_PTR_LO << NV_RAMRL_ENTRY_CHAN_USERD_PTR_ALIGN_SHIFT)

SW uses the NV_RAMUSERD_CHAN_SIZE define to allocate and align a channel's RAMUSERD data structure. See the documentation for NV_RAMUSERD for a description of the use of USERD and its format. This address and its alignment must match the one specified in the RAMFC's NV_RAMFC_USERD and NV_RAMFC_USERD_HI fields, which are backed by NV_PPBDMA_USERD in dev_pbdma.ref. The hardware ignores the RAMRL USERD fields, but in future chips the USERD pointer may be read from these fields in the runlist entry instead of the RAMFC, to avoid the extra level of indirection in fetching the USERD data that currently results in a dependent read.

Runlist TSG Entry Type:

The other type of runlist entry is the Timeslice Group (TSG) header entry (Fig 5.2). This type of entry is specified by NV_RAMRL_ENTRY_TYPE_TSG. A TSG entry describes a collection of channels, all of which share the same context and are scheduled as a single unit by Host. All runlists support this type of entry.
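
The two byte-offset calculations for the INST and USERD pointers can be sketched in C as follows. The function names are hypothetical; only the shift amounts (12 and 8) come from NV_RAMRL_ENTRY_CHAN_INST_PTR_ALIGN_SHIFT and NV_RAMRL_ENTRY_CHAN_USERD_PTR_ALIGN_SHIFT. The sketch assumes the PTR_LO and PTR_HI values have already been extracted from the runlist entry as right-justified field values.

```c
#include <stdint.h>

/* Values of NV_RAMRL_ENTRY_CHAN_INST_PTR_ALIGN_SHIFT and
 * NV_RAMRL_ENTRY_CHAN_USERD_PTR_ALIGN_SHIFT. */
#define RAMRL_INST_PTR_ALIGN_SHIFT   12u
#define RAMRL_USERD_PTR_ALIGN_SHIFT   8u

/* Byte offset of the instance block within the INST_TARGET aperture. */
static uint64_t ramrl_inst_offset(uint32_t inst_ptr_hi, uint32_t inst_ptr_lo)
{
    return ((uint64_t)inst_ptr_hi << 32) |
           ((uint64_t)inst_ptr_lo << RAMRL_INST_PTR_ALIGN_SHIFT);
}

/* Byte offset of the USERD region within the USERD_TARGET aperture. */
static uint64_t ramrl_userd_offset(uint32_t userd_ptr_hi, uint32_t userd_ptr_lo)
{
    return ((uint64_t)userd_ptr_hi << 32) |
           ((uint64_t)userd_ptr_lo << RAMRL_USERD_PTR_ALIGN_SHIFT);
}

/* Inverse direction: split a 4k-aligned instance block offset into the
 * 20-bit INST_PTR_LO field value and the 32-bit INST_PTR_HI field value. */
static void ramrl_split_inst_offset(uint64_t offset,
                                    uint32_t *inst_ptr_hi,
                                    uint32_t *inst_ptr_lo)
{
    *inst_ptr_hi = (uint32_t)(offset >> 32);
    *inst_ptr_lo = (uint32_t)((offset >> RAMRL_INST_PTR_ALIGN_SHIFT) & 0xFFFFFu);
}
```

The casts to uint64_t before shifting matter: shifting a 32-bit INST_PTR_LO left by 12 in 32-bit arithmetic would silently drop the upper bits of the low word.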

The fields available in a TSG header runlist entry are as follows (Fig 5.2):

    ENTRY_TYPE (T)    : type of this entry: ENTRY_TYPE_TSG
    TSGID             : identifier of the timeslice group (overlays ENTRY_ID)
    TSG_LENGTH        : number of channels that are part of this timeslice group
    TIMESLICE_SCALE   : scale factor for the TSG's timeslice
    TIMESLICE_TIMEOUT : timeout amount for the TSG's timeslice

A timeslice group entry consists of an integer identifier along with a length which specifies the number of channels in the TSG. After a TSG header runlist entry, the next TSG_LENGTH runlist entries are considered to be part of the timeslice group. Note that a TSG must contain at least one entry.

All channels in a TSG share the same runlist timeslice, which specifies how long a single context runs on an engine or PBDMA before being swapped for a different context. The timeslice period is set in the TSG header by specifying TSG_TIMESLICE_TIMEOUT and TSG_TIMESLICE_SCALE. The TSG timeslice period is calculated as follows:

    timeslice = (TSG_TIMESLICE_TIMEOUT << TSG_TIMESLICE_SCALE) * 1024 nanoseconds

The timeslice period should normally not be set to zero. A timeslice of zero will be treated as a timeslice period of one. The runlist timeslice period begins after the context has been loaded on a PBDMA but is paused while the channel has an outstanding context load to an engine. Time spent switching a context into an engine is not part of the runlist timeslice.

If Host reaches the end of the runlist or receives another entry of type NV_RAMRL_ENTRY_TYPE_TSG before processing TSG_LENGTH additional runlist entries, or if it encounters a TSG of length 0, a SCHED_ERROR interrupt will be generated with ERROR_CODE_BAD_TSG.

Host Scheduling Memory Layout:

Example of graphics runlist entry to GPU context mapping via channel id:

                      .------Inst_ptr--------.
                      |                      |
Graphics Runlist      |   Channel-Map RAM    |         GPU Instance Block
.-------------.       |  .----------------.  |        .-------------------.
| TSG Hdr L=m |--.----'  |Ch0 Inst Blk Ptr|--'------->| Host State        |
| RL Entry T1 |  |       |Ch1 Inst Blk Ptr|    .------| Memory State      |
| RL Entry T2 |  |       |      ...       |    |      | Engine0 State Ptr |
|     ...     |  |-chid->|ChI Inst Blk Ptr|    |      | Engine1 State Ptr |
| RL Entry Tm |  |       |      ...       |    |      |        ...        |
| TSG Hdr L=n |  |       |ChN Inst Blk Ptr|    |    .-| EngineN State Ptr |
| RL Entry T1 |  |       `----------------'    |    | `-------------------'
| RL Entry T2 |userd_ptr                       |    |
|     ...     |  |        .--------------.     |    |   .--------------.
| RL Entry Tn |  |        |    USERD     |     |    |   |  Engine Ctx  |
|             |  '------->|              |<----'    '-->|   State N    |
`-------------'           |              |              |              |
                          `--------------'              `--------------'

Runlist Diagram Description:

Here we have (m+n) channel-type (ENTRY_TYPE_CHAN) runlist entries grouped together within two TSGs. The first entry in the runlist is a TSG header entry (ENTRY_TYPE_TSG) that describes the first TSG. The TSG header specifies m as the length of the TSG. The header also contains the timeslice information for the TSG (SCALE/TIMEOUT), as well as the TSG id specified in the TSGID field. Because the length here is m, the runlist *must* contain m additional runlist entries of type ENTRY_TYPE_CHAN that will be part of this TSG. Similarly, the next (n+1) entries, a TSG header entry followed by n regular channel entries, correspond to the second TSG.


#define NV_RAMRL_ENTRY                             /* ----G */
#define NV_RAMRL_ENTRY_RANGE            0xF:0x00000000 /* RW--M */
#define NV_RAMRL_ENTRY_SIZE                         16 /*       */

// Runlist base must be 4k-aligned.
#define NV_RAMRL_ENTRY_BASE_SHIFT                                 12                   /*       */

#define NV_RAMRL_ENTRY_TYPE                                       (0+0*32):(0+0*32)    /* RWXUF */
#define NV_RAMRL_ENTRY_TYPE_CHAN                                  0x00000000           /* RW--V */
#define NV_RAMRL_ENTRY_TYPE_TSG                                   0x00000001           /* RW--V */

#define NV_RAMRL_ENTRY_ID                                         (11+2*32):(0+2*32)   /* RWXUF */
#define NV_RAMRL_ENTRY_ID_HW                                      11:0                 /* RWXUF */
#define NV_RAMRL_ENTRY_ID_MAX                                     (4096-1)             /* RW--V */

#define NV_RAMRL_ENTRY_CHAN_RUNQUEUE_SELECTOR                     (1+0*32):(1+0*32)    /* RWXUF */

#define NV_RAMRL_ENTRY_CHAN_INST_TARGET                           (5+0*32):(4+0*32)    /* RWXUF */
#define NV_RAMRL_ENTRY_CHAN_INST_TARGET_VID_MEM                   0x00000000           /* RW--V */
#define NV_RAMRL_ENTRY_CHAN_INST_TARGET_SYS_MEM_COHERENT          0x00000002           /* RW--V */
#define NV_RAMRL_ENTRY_CHAN_INST_TARGET_SYS_MEM_NONCOHERENT       0x00000003           /* RW--V */

#define NV_RAMRL_ENTRY_CHAN_USERD_TARGET                          (7+0*32):(6+0*32)    /* RWXUF */
#define NV_RAMRL_ENTRY_CHAN_USERD_TARGET_VID_MEM                  0x00000000           /* RW--V */
#define NV_RAMRL_ENTRY_CHAN_USERD_TARGET_VID_MEM_NVLINK_COHERENT  0x00000001           /* RW--V */
#define NV_RAMRL_ENTRY_CHAN_USERD_TARGET_SYS_MEM_COHERENT         0x00000002           /* RW--V */
#define NV_RAMRL_ENTRY_CHAN_USERD_TARGET_SYS_MEM_NONCOHERENT      0x00000003           /* RW--V */

#define NV_RAMRL_ENTRY_CHAN_USERD_PTR_LO                          (31+0*32):(8+0*32)   /* RWXUF */
#define NV_RAMRL_ENTRY_CHAN_USERD_PTR_HI                          (31+1*32):(0+1*32)   /* RWXUF */

#define NV_RAMRL_ENTRY_CHAN_CHID                                  (11+2*32):(0+2*32)   /* RWXUF */

#define NV_RAMRL_ENTRY_CHAN_INST_PTR_LO                           (31+2*32):(12+2*32)  /* RWXUF */
#define NV_RAMRL_ENTRY_CHAN_INST_PTR_HI                           (31+3*32):(0+3*32)   /* RWXUF */

// Macros for shifting out low bits of INST_PTR and USERD_PTR.
#define NV_RAMRL_ENTRY_CHAN_INST_PTR_ALIGN_SHIFT                  12                   /* ----C */
#define NV_RAMRL_ENTRY_CHAN_USERD_PTR_ALIGN_SHIFT                 8                    /* ----C */

#define NV_RAMRL_ENTRY_TSG_TIMESLICE_SCALE                        (19+0*32):(16+0*32)  /* RWXUF */
#define NV_RAMRL_ENTRY_TSG_TIMESLICE_SCALE_3                      0x00000003           /* RWI-V */
#define NV_RAMRL_ENTRY_TSG_TIMESLICE_TIMEOUT                      (31+0*32):(24+0*32)  /* RWXUF */
#define NV_RAMRL_ENTRY_TSG_TIMESLICE_TIMEOUT_128                  0x00000080           /* RWI-V */
#define NV_RAMRL_ENTRY_TSG_TIMESLICE_TIMEOUT_1US                  0x00000000           /*       */

#define NV_RAMRL_ENTRY_TSG_LENGTH                                 (7+1*32):(0+1*32)    /* RWXUF */
#define NV_RAMRL_ENTRY_TSG_LENGTH_INIT                            0x00000000           /* RW--V */
#define NV_RAMRL_ENTRY_TSG_LENGTH_MIN                             0x00000001           /* RW--V */
#define NV_RAMRL_ENTRY_TSG_LENGTH_MAX                             0x00000080           /* RW--V */

#define NV_RAMRL_ENTRY_TSG_TSGID                                  (11+2*32):(0+2*32)   /* RWXUF */


6  -  Host Pushbuffer Format (FIFO_DMA)
=======================================

"FIFO" refers to Host.  "FIFO_DMA" means data that Host reads from memory:
the pushbuffer.  Host autonomously reads pushbuffer data from memory and
generates method address/data pairs from the data.

Pushbuffer terminology:

  - A channel is the logical sequence of instructions associated with a GPU
    context.

  - The pushbuffer is a stream of data in memory containing the
    specifications of the operations that a channel is to perform for a
    particular client.  Pushbuffer data consists of pushbuffer entries.

  - A pushbuffer entry (PB entry) is a 32-bit (doubleword) sized unit of
    pushbuffer data.  This is the smallest granularity at which Host consumes
    pushbuffer data.  A PB entry is either a PB instruction (which is either
    a PB control entry or a PB method header), or a method data entry.

  - A pushbuffer segment (PB segment) is a contiguous block of memory
    containing pushbuffer entries.  The location and size of a pushbuffer
    segment is defined by its respective GP entry in the GPFIFO.
  - A pushbuffer control entry (PB control entry) is a single PB entry of
    type SET_SUBDEVICE_MASK, STORE_SUBDEVICE_MASK, USE_SUBDEVICE_MASK,
    END_PB_SEGMENT, or a universal NOP (NV_FIFO_DMA_NOP).

  - A pushbuffer compressed method sequence is a sequence of pushbuffer
    entries starting with a method header and a variable-length sequence of
    method data entries (the length being defined by the method header).  A
    single PB compressed method sequence expands into one or more methods.
    This may also be known as a "pushbuffer method" (PB method), but that
    terminology is ambiguous and not preferred.

  - A pushbuffer method header (PB method header) is the first PB entry found
    in a PB compressed method sequence.  A PB method header is a PB
    instruction performed on method data entries.

  - A pushbuffer instruction (PB instruction) is a PB entry that is not a PB
    method data entry.  A PB instruction is either a PB control entry or a PB
    method header.

  - A method is an address/data pair representing an operation to perform.

  - A method data entry is the 32-bit operand for its corresponding method.

#define NV_FIFO_PB_ENTRY_SIZE                           4                     /*       */

Some engines such as Graphics internally support a double-wide method FIFO;
these are known as "data-hi" methods.  It is Host that performs the packing of
two methods into one double-wide entry.  Host will only generate data-hi
methods if the following conditions are satisfied:

  1. The two methods come from the same PB method (in other words they share
     the same method header).

  2. The method header specifies a non-incrementing method, an incrementing
     method, or an increment-once method.

  3. The paired methods either have the same method address, or the first
     method has an even NV_FIFO_DMA_METHOD_ADDRESS field and the second
     (data-hi) method is the increment of the first.  (That is, the
     left-shifted method address as listed in the class files must be
     divisible by 8 for this condition to hold.)

  4. The second method is available at the time of pushing the first one into
     the engine's method FIFO.  In other words, Host will not wait to pack
     methods.  Note that if the engine's method FIFO is full, the
     back-pressure will in itself create a "wait time".

The first three conditions are under SW's control.

Only the graphics engine supports data-hi methods.

Types of PB Entries

PB entries can be classified into three types: PB method headers, PB control
entries, and PB method data.  Different types of PB entries have different
formats.  Because PB compressed method sequences are of variable length, it is
impossible to determine the type of a PB entry without tracking the pushbuffer
from the beginning or from the location of a PB entry that is known to not be
a PB method data entry.

A PB method data entry is always found in a method data sequence immediately
following a PB method header in the logical stream of PB entries.  The PB
method header contains a NV_FIFO_DMA_METHOD_COUNT field, the value of which is
equal to the length of the method data sequence.  Note a PB method header does
not necessarily come with PB method data entries (see details below about
immediate-data method headers and method headers for which COUNT is zero).
Also note the PB method data entries may be located in a PB segment separate
from their corresponding method header.  The format of any given PB method
data entry is defined in the "NV_UDMA" section of dev_pbdma.ref.

A PB entry that is either a PB method header or PB control entry is known as
a PB instruction.  The type of a PB instruction is specified by the
NV_FIFO_DMA_SEC_OP field and the NV_FIFO_DMA_TERT_OP field.
    secondary  tertiary
     opcode     opcode    entry type
    ---------  --------   --------------------------------
       000        01      SET_SUBDEVICE_MASK
       000        10      STORE_SUBDEVICE_MASK
       000        11      USE_SUBDEVICE_MASK
       001        xx      incrementing method header
       011        xx      non-incrementing method header
       100        xx      immediate-data method header
       101        xx      increment-once method header
       111        xx      END_PB_SEGMENT
    ---------  --------   --------------------------------

Types of methods:

  - A Host method is a method whose address is defined in the NV_UDMA device
    range.

  - A Host-only method is any Host method excluding SetObject (also known as
    NV_UDMA_OBJECT).

  - An engine method is a method whose address is not defined within the
    NV_UDMA device range.  There are multiple engines designated by a
    subchannel ID.  Software methods are included in this category.

  - A software method (SW method) is a method which causes an interrupt for
    the express purpose of being handled by software.  For details see the
    section on software methods below.

For more information about types of methods see "HOST METHODS" and "RESERVED
METHOD ADDRESSES" in dev_pbdma.ref.

The method address in a PB method header (stored in the
NV_FIFO_DMA_METHOD_ADDRESS field) is a dword-address, not a byte-address.  In
other words the least significant two bits of the address are not stored
because the byte-address is dword-aligned (thus the least significant two bits
are always zero).

The subchannel in a PB method header (stored in the
NV_FIFO_DMA_*_SUBCHANNEL field) determines the engine to which a method will
be sent if the method is SetObject or an engine method (otherwise, the
SUBCHANNEL field is ignored).  SetObject enables SW to request HW to check the
expectation that a given subchannel serves the specified class ID; see the
description of "NV_UDMA_OBJECT" in dev_pbdma.ref.  The mapping between
subchannels and engines is fixed.  A subchannel is bound to a given class
according to the runlist.
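
The opcode table and the dword-address rule above can be sketched in C as
follows.  This is only an illustration under stated assumptions: the enum and
function names (pb_kind, pb_bits, pb_classify, pb_method_byte_address) are
hypothetical and do not come from any NVIDIA header.  The bit positions are
taken from the NV_FIFO_DMA_SEC_OP (31:29), NV_FIFO_DMA_TERT_OP (17:16) and
NV_FIFO_DMA_METHOD_ADDRESS (11:0) definitions in this section.  Encodings the
table does not list are reported as PB_UNKNOWN rather than guessed.

```c
#include <assert.h>
#include <stdint.h>

/* Entry kinds from the SEC_OP/TERT_OP table (names are hypothetical). */
typedef enum {
    PB_NOP, PB_SET_SUBDEVICE_MASK, PB_STORE_SUBDEVICE_MASK,
    PB_USE_SUBDEVICE_MASK, PB_INC_METHOD, PB_NON_INC_METHOD,
    PB_IMMD_DATA_METHOD, PB_ONE_INC_METHOD, PB_END_PB_SEGMENT, PB_UNKNOWN
} pb_kind;

/* Extract bits hi:lo (inclusive) from a 32-bit PB entry. */
uint32_t pb_bits(uint32_t entry, unsigned hi, unsigned lo)
{
    assert(hi < 32 && lo <= hi);
    unsigned width = hi - lo + 1;
    uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (entry >> lo) & mask;
}

/* Classify a PB instruction.  The caller must already know that the entry
 * is not a method data entry; that cannot be decided from the value alone. */
pb_kind pb_classify(uint32_t entry)
{
    uint32_t sec  = pb_bits(entry, 31, 29);   /* NV_FIFO_DMA_SEC_OP  */
    uint32_t tert = pb_bits(entry, 17, 16);   /* NV_FIFO_DMA_TERT_OP */

    if (entry == 0x00000000u) return PB_NOP;  /* NV_FIFO_DMA_NOP */
    if (sec == 0 && tert == 1) return PB_SET_SUBDEVICE_MASK;
    if (sec == 0 && tert == 2) return PB_STORE_SUBDEVICE_MASK;
    if (sec == 0 && tert == 3) return PB_USE_SUBDEVICE_MASK;
    if (sec == 1) return PB_INC_METHOD;
    if (sec == 3) return PB_NON_INC_METHOD;
    if (sec == 4) return PB_IMMD_DATA_METHOD;
    if (sec == 5) return PB_ONE_INC_METHOD;
    if (sec == 7) return PB_END_PB_SEGMENT;
    return PB_UNKNOWN;                        /* not listed in the table */
}

/* The header stores a dword-address; shift left by 2 for the byte-address. */
uint32_t pb_method_byte_address(uint32_t header)
{
    return pb_bits(header, 11, 0) << 2;
}
```

For example, the entry 0x20010100 classifies as an incrementing method header
whose first method has byte-address 0x400.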
Each engine method is applied to an "object," which itself is an instance of
an NV class as defined by the master MFS class files.  Each object belongs to
an engine.  For SetObject and engine methods, the engine is determined
entirely by the SUBCHANNEL field of the method's header via a fixed mapping
that depends on the runlist on which the method arrives.

Methods on subchannels 0-4 are handled by the primary engine served by the
runlist, except that subchannel 4 targets GRCOPY0 and GRCOPY1 on the graphics
runlist.  For Graphics/Compute, SetObject associates subchannels 0, 1, 2, and
3 with class identifiers for 3D, compute, I2M, and 2D respectively.  On other
runlists, the subchannel is ignored, and Host does not send the subchannel ID
to the engine.  It is recommended that SW only use subchannel 4 on the
dedicated copy engines for consistency with GRCOPY usage.

Subchannels 5-7 are for software methods.  Any methods on these subchannels
(including SetObject methods) are kicked back to software for handling via the
SW method dispatch mechanism using the NV_PPBDMA_INTR_*_DEVICE interrupt.  SW
may choose to send a SetObject method to each engine subchannel before sending
any methods on that particular subchannel in order to support multiple
software classes.

If a method stream subchannel-switches from targeting graphics/compute to a
copy engine or vice-versa, that is, to or from subchannel 4 on GR, Host will:

  1. Wait until the first engine has completed all its methods,
  2. Wait until that engine indicates that it is idle (WFI), and
  3. Send a sysmem barrier flush and wait until it completes.

Only then will Host send methods to the newly targeted engine.  Note that this
WFI will not occur for sending Host-only methods on the new subchannel, since
Host-only methods ignore the subchannel field.  Additionally, when switching
from CE to graphics/compute, Host forces FE to perform a cache invalidate.
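
The subchannel rules for the graphics runlist described above can be modelled
with a few small predicates.  This is a hedged sketch, not Host's actual
logic: the type and function names (subch_target, gr_runlist_target,
switch_needs_wfi, switch_needs_fe_invalidate) are hypothetical, and the sketch
assumes the caller passes only the subchannels of consecutive SetObject or
engine methods (subchannels 0-4), since Host-only methods ignore the
subchannel field and software methods are dispatched to SW.

```c
#include <assert.h>
#include <stdbool.h>

/* Targets reachable on the graphics runlist (names are hypothetical). */
typedef enum { SUBCH_GR, SUBCH_GRCOPY, SUBCH_SW } subch_target;

/* Map a 3-bit subchannel to its target on the graphics runlist. */
subch_target gr_runlist_target(unsigned subch)
{
    assert(subch <= 7);                   /* SUBCHANNEL is bits 15:13 */
    if (subch <= 3) return SUBCH_GR;      /* 3D, compute, I2M, 2D     */
    if (subch == 4) return SUBCH_GRCOPY;  /* GRCOPY0 / GRCOPY1        */
    return SUBCH_SW;                      /* 5-7: software methods    */
}

/* True when moving between graphics/compute and the copy engine, i.e. to or
 * from subchannel 4, which is when Host drains the first engine, waits for
 * idle, and issues a sysmem barrier flush. */
bool switch_needs_wfi(unsigned prev_subch, unsigned next_subch)
{
    assert(prev_subch <= 4 && next_subch <= 4);  /* engine subchannels only */
    return (prev_subch == 4) != (next_subch == 4);
}

/* True for the CE -> graphics/compute direction, where Host additionally
 * forces FE to perform a cache invalidate. */
bool switch_needs_fe_invalidate(unsigned prev_subch, unsigned next_subch)
{
    assert(prev_subch <= 4 && next_subch <= 4);
    return prev_subch == 4 && next_subch != 4;
}
```

Switching between subchannels 0 and 3 stays within graphics/compute, so
neither predicate fires; switching from 4 to 1 triggers both.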
Other subchannel switch semantics may be provided by the engines themselves,
such as switching between subchannels 0-3 within FE.

#define NV_FIFO_DMA                                                           /* ----G */
#define NV_FIFO_DMA_METHOD_ADDRESS_OLD                  12:2                  /* RWXUF */
#define NV_FIFO_DMA_METHOD_ADDRESS                      11:0                  /* RWXUF */

#define NV_FIFO_DMA_SUBDEVICE_MASK                      15:4                  /* RWXUF */

#define NV_FIFO_DMA_METHOD_SUBCHANNEL                   15:13                 /* RWXUF */

#define NV_FIFO_DMA_TERT_OP                             17:16                 /* RWXUF */
#define NV_FIFO_DMA_TERT_OP_GRP0_SET_SUB_DEV_MASK       0x00000001            /* RW--V */
#define NV_FIFO_DMA_TERT_OP_GRP0_STORE_SUB_DEV_MASK     0x00000002            /* RW--V */
#define NV_FIFO_DMA_TERT_OP_GRP0_USE_SUB_DEV_MASK       0x00000003            /* RW--V */

#define NV_FIFO_DMA_METHOD_COUNT_OLD                    28:18                 /* RWXUF */
#define NV_FIFO_DMA_METHOD_COUNT                        28:16                 /* RWXUF */
#define NV_FIFO_DMA_IMMD_DATA                           28:16                 /* RWXUF */

#define NV_FIFO_DMA_SEC_OP                              31:29                 /* RWXUF */
#define NV_FIFO_DMA_SEC_OP_GRP0_USE_TERT                0x00000000            /* RW--V */
#define NV_FIFO_DMA_SEC_OP_INC_METHOD                   0x00000001            /* RW--V */
#define NV_FIFO_DMA_SEC_OP_NON_INC_METHOD               0x00000003            /* RW--V */
#define NV_FIFO_DMA_SEC_OP_IMMD_DATA_METHOD             0x00000004            /* RW--V */
#define NV_FIFO_DMA_SEC_OP_ONE_INC                      0x00000005            /* RW--V */
#define NV_FIFO_DMA_SEC_OP_RESERVED6                    0x00000006            /* RW--V */
#define NV_FIFO_DMA_SEC_OP_END_PB_SEGMENT               0x00000007            /* RW--V */


Incrementing PB Method Header Format

An incrementing PB method header specifies that Host generate a sequence of
methods.  The length of the sequence is defined by the method header.  The
method data for each method in this sequence is found in a sequence of PB
entries immediately following the method header.

The dword-address of the first method is specified by the method header, and
the dword-address of each subsequent method is equal to the dword-address of
the previous method plus one.  Or in other words, the byte-address of each
subsequent method is equal to the byte-address of the previous method plus
four.
Example sequence of methods generated from an incrementing method header:

    addr     data0
    addr+1   data1
    addr+2   data2
    addr+3   data3
    ...      ...

The NV_FIFO_DMA_INCR_COUNT field contains the number of methods in the
generated sequence.  This is the same as the number of method data entries
that follow the method header.  If the COUNT field is zero, the other fields
are ignored, and the PB method effectively becomes a no-op with no method data
entries following it.

The NV_FIFO_DMA_INCR_SUBCHANNEL field contains the subchannel to use for the
methods generated from the method header.  See the documentation above for
NV_FIFO_DMA_*_SUBCHANNEL.

The NV_FIFO_DMA_INCR_ADDRESS field contains the method address for the first
method in the generated sequence.  The dword-address of the method is
incremented by one each time a method is generated.  A method address
specifies an operation to be performed.  Note that because the ADDRESS is a
dword-address and not a byte-address, the two least significant bits of the
method's byte-address are not stored.

The NV_FIFO_DMA_INCR_DATA fields contain the method data for the methods in
the generated sequence.  The number of method data entries is defined by the
COUNT field.  A method data entry contains an operand for its respective
method.

Bit 12 is reserved for the future expansion of either the subchannel or the
address fields.

#define NV_FIFO_DMA_INCR                                                      /* ----G */
#define NV_FIFO_DMA_INCR_OPCODE                         (0*32+31):(0*32+29)   /* RWXUF */
#define NV_FIFO_DMA_INCR_OPCODE_VALUE                   0x00000001            /* ----V */
#define NV_FIFO_DMA_INCR_COUNT                          (0*32+28):(0*32+16)   /* RWXUF */
#define NV_FIFO_DMA_INCR_SUBCHANNEL                     (0*32+15):(0*32+13)   /* RWXUF */
#define NV_FIFO_DMA_INCR_ADDRESS                        (0*32+11):(0*32+0)    /* RWXUF */
#define NV_FIFO_DMA_INCR_DATA                           (1*32+31):(1*32+0)    /* RWXUF */


Non-Incrementing PB Method Header Format

A non-incrementing PB method header specifies that Host generate a sequence
of methods.  The length of the sequence is defined by the method header.  The
method data for each method in this sequence is contained within the PB
entries immediately following the method header.

Unlike with the incrementing PB method header, the sequence of methods
generated all have the same method address.  The dword-address of every method
in this sequence is specified by the method header.  Although the methods all
have the same address, the method data entries may be different.

Example sequence of methods generated from a non-incrementing method header:

    addr     data0
    addr     data1
    addr     data2
    addr     data3
    ...      ...

The NV_FIFO_DMA_NONINCR_COUNT field contains the number of methods in the
generated sequence.  This is the same as the number of method data entries
that follow the method header.  If the COUNT field is zero, the other fields
are ignored, and the PB method effectively becomes a no-op with no method data
entries following it.

The NV_FIFO_DMA_NONINCR_SUBCHANNEL field contains the subchannel to use for
the methods generated from the method header.  See the documentation above for
NV_FIFO_DMA_*_SUBCHANNEL.

The NV_FIFO_DMA_NONINCR_ADDRESS field contains the method address for every
method in the generated sequence.  A method address specifies an operation to
be performed.  Note that because the ADDRESS field is a dword-address and not
a byte-address, the two least significant bits of the method's byte-address
are not stored.

The NV_FIFO_DMA_NONINCR_DATA fields contain the method data for the methods
in the generated sequence.  The number of method data entries is defined by
the COUNT field.  A method data entry contains an operand for its respective
method.

Bit 12 is reserved for the future expansion of either the subchannel or the
address fields.
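
As an illustration of the header layout shared by the incrementing and
non-incrementing formats, the following C sketch packs a method header and
writes it together with its method data entries.  The names pb_make_header
and pb_emit_methods are hypothetical; the field positions (opcode 31:29,
COUNT 28:16, SUBCHANNEL 15:13, ADDRESS 11:0) and opcode values come from the
definitions in this section.  The sketch assumes the caller supplies a
byte-address as listed in the class files and an output buffer with room for
n+1 entries; bit 12 is left as zero because it is reserved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcode values from NV_FIFO_DMA_SEC_OP_INC_METHOD / _NON_INC_METHOD. */
enum { PB_SEC_OP_INC = 1, PB_SEC_OP_NON_INC = 3 };

/* Pack one PB method header.  byte_addr is converted to a dword-address. */
uint32_t pb_make_header(uint32_t sec_op, uint32_t count,
                        uint32_t subch, uint32_t byte_addr)
{
    assert(sec_op <= 7);                 /* 3-bit opcode, bits 31:29      */
    assert(count <= 0x1FFF);             /* 13-bit COUNT, bits 28:16      */
    assert(subch <= 7);                  /* 3-bit SUBCHANNEL, bits 15:13  */
    assert((byte_addr & 3u) == 0);       /* methods are dword-aligned     */
    assert((byte_addr >> 2) <= 0xFFF);   /* 12-bit ADDRESS, bits 11:0     */
    return (sec_op << 29) | (count << 16) | (subch << 13) | (byte_addr >> 2);
}

/* Write a header followed by n method data entries into out[].
 * Returns the number of PB entries written (n + 1). */
size_t pb_emit_methods(uint32_t *out, uint32_t sec_op, uint32_t subch,
                       uint32_t byte_addr, const uint32_t *data, uint32_t n)
{
    out[0] = pb_make_header(sec_op, n, subch, byte_addr);
    for (uint32_t i = 0; i < n; i++)
        out[1 + i] = data[i];            /* one data entry per method */
    return (size_t)n + 1;
}
```

With PB_SEC_OP_NON_INC, subchannel 1, byte-address 0x400 and two data entries,
the header value is 0x60022100 and both generated methods share the same
address; with PB_SEC_OP_INC the second method would land at 0x404.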
#define NV_FIFO_DMA_NONINCR                                                   /* ----G */
#define NV_FIFO_DMA_NONINCR_OPCODE                      (0*32+31):(0*32+29)   /* RWXUF */
#define NV_FIFO_DMA_NONINCR_OPCODE_VALUE                0x00000003            /* ----V */
#define NV_FIFO_DMA_NONINCR_COUNT                       (0*32+28):(0*32+16)   /* RWXUF */
#define NV_FIFO_DMA_NONINCR_SUBCHANNEL                  (0*32+15):(0*32+13)   /* RWXUF */
#define NV_FIFO_DMA_NONINCR_ADDRESS                     (0*32+11):(0*32+0)    /* RWXUF */
#define NV_FIFO_DMA_NONINCR_DATA                        (1*32+31):(1*32+0)    /* RWXUF */


Increment-Once PB Method Header Format

An increment-once PB method header specifies that Host generate a sequence of
methods.  The length of the sequence is defined by the method header.  The
method data for each method in this sequence is found in a sequence of PB
entries immediately following the method header.

The dword-address of the first method is specified by the method header.  The
address of the second and all following methods is equal to the dword-address
of the first method plus one.  In other words, the byte-address of the second
and all following methods is equal to the byte-address of the first method
plus four.

Example sequence of methods generated from an increment-once method header:

    addr     data0
    addr+1   data1
    addr+1   data2
    addr+1   data3
    ...      ...

The NV_FIFO_DMA_ONEINCR_COUNT field contains the number of methods in the
generated sequence.  This is the same as the number of method data entries
that follow the method header.  If the COUNT field is zero, the other fields
are ignored, and the PB method effectively becomes a no-op with no method data
entries following it.

The NV_FIFO_DMA_ONEINCR_SUBCHANNEL field contains the subchannel to use for
the methods generated from the method header.  See the documentation above for
NV_FIFO_DMA_*_SUBCHANNEL.

The NV_FIFO_DMA_ONEINCR_ADDRESS field contains the method address for the
first method in the generated sequence.  A method address specifies an
operation to be performed.  Note that because the ADDRESS is a dword-address
and not a byte-address, the two least significant bits of the method's
byte-address are not stored.

The NV_FIFO_DMA_ONEINCR_DATA fields contain the method data for the methods
in the generated sequence.  The number of method data entries is defined by
the COUNT field.  A method data entry contains an operand for its respective
method.

Bit 12 is reserved for the future expansion of either the subchannel or the
address fields.

#define NV_FIFO_DMA_ONEINCR                                                   /* ----G */
#define NV_FIFO_DMA_ONEINCR_OPCODE                      (0*32+31):(0*32+29)   /* RWXUF */
#define NV_FIFO_DMA_ONEINCR_OPCODE_VALUE                0x00000005            /* ----V */
#define NV_FIFO_DMA_ONEINCR_COUNT                       (0*32+28):(0*32+16)   /* RWXUF */
#define NV_FIFO_DMA_ONEINCR_SUBCHANNEL                  (0*32+15):(0*32+13)   /* RWXUF */
#define NV_FIFO_DMA_ONEINCR_ADDRESS                     (0*32+11):(0*32+0)    /* RWXUF */
#define NV_FIFO_DMA_ONEINCR_DATA                        (1*32+31):(1*32+0)    /* RWXUF */


No-Operation PB Instruction Formats

The method header for a no-op PB method may be specified in multiple ways,
but the preferred way is to set the PB instruction to NV_FIFO_DMA_NOP.  In any
case NV_FIFO_DMA_NOP is a universal NOP entry that bypasses any method header
format check, and is not considered a method header.

#define NV_FIFO_DMA_NOP                                 0x00000000            /* ----C */


Immediate-Data PB Method Header Format

If a method's operand fits within 13 bits, a PB method may be specified in a
single PB entry, using the immediate-data PB method header format.  Exactly
one method is generated from this method header.

The NV_FIFO_DMA_IMMD_SUBCHANNEL field contains the subchannel to use for the
method generated from the method header.  See the documentation above for
NV_FIFO_DMA_*_SUBCHANNEL.

The NV_FIFO_DMA_IMMD_ADDRESS field contains the method address for the single
generated method.  A method address specifies an operation to be performed.
Note that because the ADDRESS is a dword-address and not a byte-address, the
two least significant bits of the method's byte-address are not stored.
The single NV_FIFO_DMA_IMMD_DATA field contains the method data for the
generated method.  This method data contains an operand for the generated
method.

#define NV_FIFO_DMA_IMMD                                                      /* ----G */
#define NV_FIFO_DMA_IMMD_ADDRESS                        11:0                  /* RWXUF */
#define NV_FIFO_DMA_IMMD_SUBCHANNEL                     15:13                 /* RWXUF */
#define NV_FIFO_DMA_IMMD_DATA                           28:16                 /* RWXUF */
#define NV_FIFO_DMA_IMMD_OPCODE                         31:29                 /* RWXUF */
#define NV_FIFO_DMA_IMMD_OPCODE_VALUE                   0x00000004            /* ----V */


Set Sub-Device Mask PB Control Entry Format

The SET_SUBDEVICE_MASK (SSDM) PB control entry is used when multiple GPU
contexts are using the same pushbuffer (for example, for SLI or for stereo
rendering) and there is data in the pushbuffer that is for only a subset of
the GPU contexts.  This instruction allows the pushbuffer to tell a specific
GPU context to use or ignore methods following the SET_SUBDEVICE_MASK.  While
the logical-AND of NV_FIFO_DMA_SET_SUBDEVICE_MASK_VALUE and the GPU context's
NV_PPBDMA_SUBDEVICE_ID value is zero, methods are ignored.  Pushbuffer control
entries (like SET_SUBDEVICE_MASK) are not ignored.

********************************************************************************
Warning: When using subdevice masking, one must take care to synchronize
properly with any later GP entries marked FETCH_CONDITIONAL.  If GP fetching
gets too far ahead of PB processing, it is possible for a later conditional PB
segment to be discarded prior to reaching an SSDM command that sets
SUBDEVICE_STATUS to ACTIVE.  This would cause Host to execute garbage data.
One way to avoid this would be to set the SYNC_WAIT flag on any
FETCH_CONDITIONAL segments following a subdevice reenable.
********************************************************************************

#define NV_FIFO_DMA_SET_SUBDEVICE_MASK                                        /* ----G */
#define NV_FIFO_DMA_SET_SUBDEVICE_MASK_VALUE            15:4                  /* RWXUF */
#define NV_FIFO_DMA_SET_SUBDEVICE_MASK_OPCODE           31:16                 /* RWXUF */
#define NV_FIFO_DMA_SET_SUBDEVICE_MASK_OPCODE_VALUE     0x00000001            /* ----V */


Store Sub-Device Mask PB Control Entry Format

The STORE_SUBDEVICE_MASK PB control entry is used to save a subdevice mask
value to be used later by a USE_SUBDEVICE_MASK PB instruction.

#define NV_FIFO_DMA_STORE_SUBDEVICE_MASK                                      /* ----G */
#define NV_FIFO_DMA_STORE_SUBDEVICE_MASK_VALUE          15:4                  /* RWXUF */
#define NV_FIFO_DMA_STORE_SUBDEVICE_MASK_OPCODE         31:16                 /* RWXUF */
#define NV_FIFO_DMA_STORE_SUBDEVICE_MASK_OPCODE_VALUE   0x00000002            /* ----V */


Use Sub-Device Mask PB Control Entry Format

The USE_SUBDEVICE_MASK PB control entry is used to apply the subdevice mask
value saved by a STORE_SUBDEVICE_MASK PB instruction.  The effect of the mask
is the same as for a SET_SUBDEVICE_MASK PB instruction.

#define NV_FIFO_DMA_USE_SUBDEVICE_MASK                                        /* ----G */
#define NV_FIFO_DMA_USE_SUBDEVICE_MASK_OPCODE           31:16                 /* RWXUF */
#define NV_FIFO_DMA_USE_SUBDEVICE_MASK_OPCODE_VALUE     0x00000003            /* ----V */


End-PB-Segment PB Control Entry Format

Engines may write PB segments themselves, but they cannot write GP entries.
Because they cannot write GP entries, they cannot alter the size of a PB
segment.  If an engine is writing a PB segment, and if it does not need to
fill the entire PB segment it was allocated, instead of filling the remainder
of the PB segment with no-op PB instructions, it may write a single
End-PB-Segment control entry to indicate that the pushbuffer data contains no
further valid data.  No further PB entries from that PB segment will be
decoded or processed.  Host may have already issued requests to fetch the
remainder of the PB segment before an End-PB-Segment PB instruction is
processed.  Host may or may not fetch the remainder of the PB segment.  Also
note that the result of a PB CRC check on this segment via
NV_PPBDMA_GP_ENTRY1_OPCODE_PB_CRC will be indeterminate.

#define NV_FIFO_DMA_ENDSEG_OPCODE                       31:29                 /* RWXUF */
#define NV_FIFO_DMA_ENDSEG_OPCODE_VALUE                 0x00000007            /* ----V */
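
To tie the header formats in this section together, the following C sketch
expands the PB entries of a single PB segment into method address/data pairs.
It is a simplified software model under stated assumptions, not a description
of Host's implementation: the names pb_method and pb_expand are hypothetical;
subdevice masking is not modelled (the three subdevice-mask control entries
are simply skipped); reserved opcodes are skipped rather than reported; method
data is assumed to sit in the same segment as its header; and the method
address is not wrapped or range-checked after incrementing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One generated method (hypothetical struct for this sketch). */
typedef struct { uint32_t subch, byte_addr, data; } pb_method;

/* Expand n PB entries into methods.  At most max_out methods are stored;
 * the return value is the number of methods stored in out[]. */
size_t pb_expand(const uint32_t *pb, size_t n, pb_method *out, size_t max_out)
{
    size_t i = 0, m = 0;
    assert(pb != NULL && out != NULL);

    while (i < n) {
        uint32_t h     = pb[i++];
        uint32_t op    = h >> 29;              /* SEC_OP, bits 31:29        */
        uint32_t count = (h >> 16) & 0x1FFF;   /* COUNT / IMMD_DATA, 28:16  */
        uint32_t subch = (h >> 13) & 0x7;      /* SUBCHANNEL, bits 15:13    */
        uint32_t addr  = h & 0xFFF;            /* dword ADDRESS, bits 11:0  */

        if (op == 7)                           /* END_PB_SEGMENT            */
            break;                             /* nothing further decoded   */
        if (op == 4) {                         /* immediate-data: 1 method  */
            if (m < max_out)
                out[m++] = (pb_method){ subch, addr << 2, count };
            continue;
        }
        if (op != 1 && op != 3 && op != 5)     /* NOP, control, reserved    */
            continue;                          /* skipped in this sketch    */

        /* COUNT == 0 falls straight through: the header is a no-op. */
        for (uint32_t k = 0; k < count && i < n; k++, i++) {
            uint32_t step = 0;                 /* non-incrementing          */
            if (op == 1) step = k;             /* incrementing              */
            if (op == 5 && k > 0) step = 1;    /* increment-once            */
            if (m < max_out)
                out[m++] = (pb_method){ subch, (addr + step) << 2, pb[i] };
        }
    }
    return m;
}
```

Feeding it an incrementing header with two data entries, an increment-once
header with three, an immediate-data header, NV_FIFO_DMA_NOP, and then an
End-PB-Segment entry yields six methods; any entries after the End-PB-Segment
entry are not decoded.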