Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
drivers/gpu/drm/i915/intel_lrc.c at v4.14 · 2134 lines · 66 kB
1/* 2 * Copyright © 2014 Intel Corporation 3 * 4 * Permission is hereby granted, free of charge, to any person obtaining a 5 * copy of this software and associated documentation files (the "Software"), 6 * to deal in the Software without restriction, including without limitation 7 * the rights to use, copy, modify, merge, publish, distribute, sublicense, 8 * and/or sell copies of the Software, and to permit persons to whom the 9 * Software is furnished to do so, subject to the following conditions: 10 * 11 * The above copyright notice and this permission notice (including the next 12 * paragraph) shall be included in all copies or substantial portions of the 13 * Software. 14 * 15 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL 18 * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 20 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS 21 * IN THE SOFTWARE. 22 * 23 * Authors: 24 * Ben Widawsky <ben@bwidawsk.net> 25 * Michel Thierry <michel.thierry@intel.com> 26 * Thomas Daniel <thomas.daniel@intel.com> 27 * Oscar Mateo <oscar.mateo@intel.com> 28 * 29 */ 30 31/** 32 * DOC: Logical Rings, Logical Ring Contexts and Execlists 33 * 34 * Motivation: 35 * GEN8 brings an expansion of the HW contexts: "Logical Ring Contexts". 36 * These expanded contexts enable a number of new abilities, especially 37 * "Execlists" (also implemented in this file). 38 * 39 * One of the main differences with the legacy HW contexts is that logical 40 * ring contexts incorporate many more things to the context's state, like 41 * PDPs or ringbuffer control registers: 42 * 43 * The reason why PDPs are included in the context is straightforward: as 44 * PPGTTs (per-process GTTs) are actually per-context, having the PDPs 45 * contained there mean you don't need to do a ppgtt->switch_mm yourself, 46 * instead, the GPU will do it for you on the context switch. 47 * 48 * But, what about the ringbuffer control registers (head, tail, etc..)? 49 * shouldn't we just need a set of those per engine command streamer? This is 50 * where the name "Logical Rings" starts to make sense: by virtualizing the 51 * rings, the engine cs shifts to a new "ring buffer" with every context 52 * switch. When you want to submit a workload to the GPU you: A) choose your 53 * context, B) find its appropriate virtualized ring, C) write commands to it 54 * and then, finally, D) tell the GPU to switch to that context. 55 * 56 * Instead of the legacy MI_SET_CONTEXT, the way you tell the GPU to switch 57 * to a contexts is via a context execution list, ergo "Execlists". 58 * 59 * LRC implementation: 60 * Regarding the creation of contexts, we have: 61 * 62 * - One global default context. 63 * - One local default context for each opened fd. 64 * - One local extra context for each context create ioctl call. 65 * 66 * Now that ringbuffers belong per-context (and not per-engine, like before) 67 * and that contexts are uniquely tied to a given engine (and not reusable, 68 * like before) we need: 69 * 70 * - One ringbuffer per-engine inside each context. 71 * - One backing object per-engine inside each context. 72 * 73 * The global default context starts its life with these new objects fully 74 * allocated and populated. 
The local default context for each opened fd is 75 * more complex, because we don't know at creation time which engine is going 76 * to use them. To handle this, we have implemented a deferred creation of LR 77 * contexts: 78 * 79 * The local context starts its life as a hollow or blank holder, that only 80 * gets populated for a given engine once we receive an execbuffer. If later 81 * on we receive another execbuffer ioctl for the same context but a different 82 * engine, we allocate/populate a new ringbuffer and context backing object and 83 * so on. 84 * 85 * Finally, regarding local contexts created using the ioctl call: as they are 86 * only allowed with the render ring, we can allocate & populate them right 87 * away (no need to defer anything, at least for now). 88 * 89 * Execlists implementation: 90 * Execlists are the new method by which, on gen8+ hardware, workloads are 91 * submitted for execution (as opposed to the legacy, ringbuffer-based, method). 92 * This method works as follows: 93 * 94 * When a request is committed, its commands (the BB start and any leading or 95 * trailing commands, like the seqno breadcrumbs) are placed in the ringbuffer 96 * for the appropriate context. The tail pointer in the hardware context is not 97 * updated at this time, but instead, kept by the driver in the ringbuffer 98 * structure. A structure representing this request is added to a request queue 99 * for the appropriate engine: this structure contains a copy of the context's 100 * tail after the request was written to the ring buffer and a pointer to the 101 * context itself. 102 * 103 * If the engine's request queue was empty before the request was added, the 104 * queue is processed immediately. Otherwise the queue will be processed during 105 * a context switch interrupt. In any case, elements on the queue will get sent 106 * (in pairs) to the GPU's ExecLists Submit Port (ELSP, for short) with a 107 * globally unique 20-bits submission ID. 108 * 109 * When execution of a request completes, the GPU updates the context status 110 * buffer with a context complete event and generates a context switch interrupt. 111 * During the interrupt handling, the driver examines the events in the buffer: 112 * for each context complete event, if the announced ID matches that on the head 113 * of the request queue, then that request is retired and removed from the queue. 114 * 115 * After processing, if any requests were retired and the queue is not empty 116 * then a new execution list can be submitted. The two requests at the front of 117 * the queue are next to be submitted but since a context may not occur twice in 118 * an execution list, if subsequent requests have the same ID as the first then 119 * the two requests must be combined. This is done simply by discarding requests 120 * at the head of the queue until either only one requests is left (in which case 121 * we use a NULL second context) or the first two requests have unique IDs. 122 * 123 * By always executing the first two requests in the queue the driver ensures 124 * that the GPU is kept as busy as possible. In the case where a single context 125 * completes but a second context is still executing, the request for this second 126 * context will be at the head of the queue when we remove the first one. 
This 127 * request will then be resubmitted along with a new request for a different context, 128 * which will cause the hardware to continue executing the second request and queue 129 * the new request (the GPU detects the condition of a context getting preempted 130 * with the same context and optimizes the context switch flow by not doing 131 * preemption, but just sampling the new tail pointer). 132 * 133 */ 134#include <linux/interrupt.h> 135 136#include <drm/drmP.h> 137#include <drm/i915_drm.h> 138#include "i915_drv.h" 139#include "intel_mocs.h" 140 141#define RING_EXECLIST_QFULL (1 << 0x2) 142#define RING_EXECLIST1_VALID (1 << 0x3) 143#define RING_EXECLIST0_VALID (1 << 0x4) 144#define RING_EXECLIST_ACTIVE_STATUS (3 << 0xE) 145#define RING_EXECLIST1_ACTIVE (1 << 0x11) 146#define RING_EXECLIST0_ACTIVE (1 << 0x12) 147 148#define GEN8_CTX_STATUS_IDLE_ACTIVE (1 << 0) 149#define GEN8_CTX_STATUS_PREEMPTED (1 << 1) 150#define GEN8_CTX_STATUS_ELEMENT_SWITCH (1 << 2) 151#define GEN8_CTX_STATUS_ACTIVE_IDLE (1 << 3) 152#define GEN8_CTX_STATUS_COMPLETE (1 << 4) 153#define GEN8_CTX_STATUS_LITE_RESTORE (1 << 15) 154 155#define GEN8_CTX_STATUS_COMPLETED_MASK \ 156 (GEN8_CTX_STATUS_ACTIVE_IDLE | \ 157 GEN8_CTX_STATUS_PREEMPTED | \ 158 GEN8_CTX_STATUS_ELEMENT_SWITCH) 159 160#define CTX_LRI_HEADER_0 0x01 161#define CTX_CONTEXT_CONTROL 0x02 162#define CTX_RING_HEAD 0x04 163#define CTX_RING_TAIL 0x06 164#define CTX_RING_BUFFER_START 0x08 165#define CTX_RING_BUFFER_CONTROL 0x0a 166#define CTX_BB_HEAD_U 0x0c 167#define CTX_BB_HEAD_L 0x0e 168#define CTX_BB_STATE 0x10 169#define CTX_SECOND_BB_HEAD_U 0x12 170#define CTX_SECOND_BB_HEAD_L 0x14 171#define CTX_SECOND_BB_STATE 0x16 172#define CTX_BB_PER_CTX_PTR 0x18 173#define CTX_RCS_INDIRECT_CTX 0x1a 174#define CTX_RCS_INDIRECT_CTX_OFFSET 0x1c 175#define CTX_LRI_HEADER_1 0x21 176#define CTX_CTX_TIMESTAMP 0x22 177#define CTX_PDP3_UDW 0x24 178#define CTX_PDP3_LDW 0x26 179#define CTX_PDP2_UDW 0x28 180#define CTX_PDP2_LDW 0x2a 181#define CTX_PDP1_UDW 0x2c 182#define CTX_PDP1_LDW 0x2e 183#define CTX_PDP0_UDW 0x30 184#define CTX_PDP0_LDW 0x32 185#define CTX_LRI_HEADER_2 0x41 186#define CTX_R_PWR_CLK_STATE 0x42 187#define CTX_GPGPU_CSR_BASE_ADDRESS 0x44 188 189#define CTX_REG(reg_state, pos, reg, val) do { \ 190 (reg_state)[(pos)+0] = i915_mmio_reg_offset(reg); \ 191 (reg_state)[(pos)+1] = (val); \ 192} while (0) 193 194#define ASSIGN_CTX_PDP(ppgtt, reg_state, n) do { \ 195 const u64 _addr = i915_page_dir_dma_addr((ppgtt), (n)); \ 196 reg_state[CTX_PDP ## n ## _UDW+1] = upper_32_bits(_addr); \ 197 reg_state[CTX_PDP ## n ## _LDW+1] = lower_32_bits(_addr); \ 198} while (0) 199 200#define ASSIGN_CTX_PML4(ppgtt, reg_state) do { \ 201 reg_state[CTX_PDP0_UDW + 1] = upper_32_bits(px_dma(&ppgtt->pml4)); \ 202 reg_state[CTX_PDP0_LDW + 1] = lower_32_bits(px_dma(&ppgtt->pml4)); \ 203} while (0) 204 205#define GEN8_CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT 0x17 206#define GEN9_CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT 0x26 207#define GEN10_CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT 0x19 208 209/* Typical size of the average request (2 pipecontrols and a MI_BB) */ 210#define EXECLISTS_REQUEST_SIZE 64 /* bytes */ 211 212#define WA_TAIL_DWORDS 2 213 214static int execlists_context_deferred_alloc(struct i915_gem_context *ctx, 215 struct intel_engine_cs *engine); 216static void execlists_init_reg_state(u32 *reg_state, 217 struct i915_gem_context *ctx, 218 struct intel_engine_cs *engine, 219 struct intel_ring *ring); 220 221/** 222 * intel_sanitize_enable_execlists() - sanitize i915.enable_execlists 223 
* @dev_priv: i915 device private 224 * @enable_execlists: value of i915.enable_execlists module parameter. 225 * 226 * Only certain platforms support Execlists (the prerequisites being 227 * support for Logical Ring Contexts and Aliasing PPGTT or better). 228 * 229 * Return: 1 if Execlists is supported and has to be enabled. 230 */ 231int intel_sanitize_enable_execlists(struct drm_i915_private *dev_priv, int enable_execlists) 232{ 233 /* On platforms with execlist available, vGPU will only 234 * support execlist mode, no ring buffer mode. 235 */ 236 if (HAS_LOGICAL_RING_CONTEXTS(dev_priv) && intel_vgpu_active(dev_priv)) 237 return 1; 238 239 if (INTEL_GEN(dev_priv) >= 9) 240 return 1; 241 242 if (enable_execlists == 0) 243 return 0; 244 245 if (HAS_LOGICAL_RING_CONTEXTS(dev_priv) && 246 USES_PPGTT(dev_priv) && 247 i915.use_mmio_flip >= 0) 248 return 1; 249 250 return 0; 251} 252 253/** 254 * intel_lr_context_descriptor_update() - calculate & cache the descriptor 255 * descriptor for a pinned context 256 * @ctx: Context to work on 257 * @engine: Engine the descriptor will be used with 258 * 259 * The context descriptor encodes various attributes of a context, 260 * including its GTT address and some flags. Because it's fairly 261 * expensive to calculate, we'll just do it once and cache the result, 262 * which remains valid until the context is unpinned. 263 * 264 * This is what a descriptor looks like, from LSB to MSB:: 265 * 266 * bits 0-11: flags, GEN8_CTX_* (cached in ctx->desc_template) 267 * bits 12-31: LRCA, GTT address of (the HWSP of) this context 268 * bits 32-52: ctx ID, a globally unique tag 269 * bits 53-54: mbz, reserved for use by hardware 270 * bits 55-63: group ID, currently unused and set to 0 271 */ 272static void 273intel_lr_context_descriptor_update(struct i915_gem_context *ctx, 274 struct intel_engine_cs *engine) 275{ 276 struct intel_context *ce = &ctx->engine[engine->id]; 277 u64 desc; 278 279 BUILD_BUG_ON(MAX_CONTEXT_HW_ID > (1<<GEN8_CTX_ID_WIDTH)); 280 281 desc = ctx->desc_template; /* bits 0-11 */ 282 desc |= i915_ggtt_offset(ce->state) + LRC_PPHWSP_PN * PAGE_SIZE; 283 /* bits 12-31 */ 284 desc |= (u64)ctx->hw_id << GEN8_CTX_ID_SHIFT; /* bits 32-52 */ 285 286 ce->lrc_desc = desc; 287} 288 289uint64_t intel_lr_context_descriptor(struct i915_gem_context *ctx, 290 struct intel_engine_cs *engine) 291{ 292 return ctx->engine[engine->id].lrc_desc; 293} 294 295static inline void 296execlists_context_status_change(struct drm_i915_gem_request *rq, 297 unsigned long status) 298{ 299 /* 300 * Only used when GVT-g is enabled now. When GVT-g is disabled, 301 * The compiler should eliminate this function as dead-code. 
302 */ 303 if (!IS_ENABLED(CONFIG_DRM_I915_GVT)) 304 return; 305 306 atomic_notifier_call_chain(&rq->engine->context_status_notifier, 307 status, rq); 308} 309 310static void 311execlists_update_context_pdps(struct i915_hw_ppgtt *ppgtt, u32 *reg_state) 312{ 313 ASSIGN_CTX_PDP(ppgtt, reg_state, 3); 314 ASSIGN_CTX_PDP(ppgtt, reg_state, 2); 315 ASSIGN_CTX_PDP(ppgtt, reg_state, 1); 316 ASSIGN_CTX_PDP(ppgtt, reg_state, 0); 317} 318 319static u64 execlists_update_context(struct drm_i915_gem_request *rq) 320{ 321 struct intel_context *ce = &rq->ctx->engine[rq->engine->id]; 322 struct i915_hw_ppgtt *ppgtt = 323 rq->ctx->ppgtt ?: rq->i915->mm.aliasing_ppgtt; 324 u32 *reg_state = ce->lrc_reg_state; 325 326 reg_state[CTX_RING_TAIL+1] = intel_ring_set_tail(rq->ring, rq->tail); 327 328 /* True 32b PPGTT with dynamic page allocation: update PDP 329 * registers and point the unallocated PDPs to scratch page. 330 * PML4 is allocated during ppgtt init, so this is not needed 331 * in 48-bit mode. 332 */ 333 if (ppgtt && !i915_vm_is_48bit(&ppgtt->base)) 334 execlists_update_context_pdps(ppgtt, reg_state); 335 336 return ce->lrc_desc; 337} 338 339static void execlists_submit_ports(struct intel_engine_cs *engine) 340{ 341 struct execlist_port *port = engine->execlist_port; 342 u32 __iomem *elsp = 343 engine->i915->regs + i915_mmio_reg_offset(RING_ELSP(engine)); 344 unsigned int n; 345 346 for (n = ARRAY_SIZE(engine->execlist_port); n--; ) { 347 struct drm_i915_gem_request *rq; 348 unsigned int count; 349 u64 desc; 350 351 rq = port_unpack(&port[n], &count); 352 if (rq) { 353 GEM_BUG_ON(count > !n); 354 if (!count++) 355 execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN); 356 port_set(&port[n], port_pack(rq, count)); 357 desc = execlists_update_context(rq); 358 GEM_DEBUG_EXEC(port[n].context_id = upper_32_bits(desc)); 359 } else { 360 GEM_BUG_ON(!n); 361 desc = 0; 362 } 363 364 writel(upper_32_bits(desc), elsp); 365 writel(lower_32_bits(desc), elsp); 366 } 367} 368 369static bool ctx_single_port_submission(const struct i915_gem_context *ctx) 370{ 371 return (IS_ENABLED(CONFIG_DRM_I915_GVT) && 372 i915_gem_context_force_single_submission(ctx)); 373} 374 375static bool can_merge_ctx(const struct i915_gem_context *prev, 376 const struct i915_gem_context *next) 377{ 378 if (prev != next) 379 return false; 380 381 if (ctx_single_port_submission(prev)) 382 return false; 383 384 return true; 385} 386 387static void port_assign(struct execlist_port *port, 388 struct drm_i915_gem_request *rq) 389{ 390 GEM_BUG_ON(rq == port_request(port)); 391 392 if (port_isset(port)) 393 i915_gem_request_put(port_request(port)); 394 395 port_set(port, port_pack(i915_gem_request_get(rq), port_count(port))); 396} 397 398static void execlists_dequeue(struct intel_engine_cs *engine) 399{ 400 struct drm_i915_gem_request *last; 401 struct execlist_port *port = engine->execlist_port; 402 struct rb_node *rb; 403 bool submit = false; 404 405 last = port_request(port); 406 if (last) 407 /* WaIdleLiteRestore:bdw,skl 408 * Apply the wa NOOPs to prevent ring:HEAD == req:TAIL 409 * as we resubmit the request. See gen8_emit_breadcrumb() 410 * for where we prepare the padding after the end of the 411 * request. 412 */ 413 last->tail = last->wa_tail; 414 415 GEM_BUG_ON(port_isset(&port[1])); 416 417 /* Hardware submission is through 2 ports. Conceptually each port 418 * has a (RING_START, RING_HEAD, RING_TAIL) tuple. 
RING_START is 419 * static for a context, and unique to each, so we only execute 420 * requests belonging to a single context from each ring. RING_HEAD 421 * is maintained by the CS in the context image, it marks the place 422 * where it got up to last time, and through RING_TAIL we tell the CS 423 * where we want to execute up to this time. 424 * 425 * In this list the requests are in order of execution. Consecutive 426 * requests from the same context are adjacent in the ringbuffer. We 427 * can combine these requests into a single RING_TAIL update: 428 * 429 * RING_HEAD...req1...req2 430 * ^- RING_TAIL 431 * since to execute req2 the CS must first execute req1. 432 * 433 * Our goal then is to point each port to the end of a consecutive 434 * sequence of requests as being the most optimal (fewest wake ups 435 * and context switches) submission. 436 */ 437 438 spin_lock_irq(&engine->timeline->lock); 439 rb = engine->execlist_first; 440 GEM_BUG_ON(rb_first(&engine->execlist_queue) != rb); 441 while (rb) { 442 struct i915_priolist *p = rb_entry(rb, typeof(*p), node); 443 struct drm_i915_gem_request *rq, *rn; 444 445 list_for_each_entry_safe(rq, rn, &p->requests, priotree.link) { 446 /* 447 * Can we combine this request with the current port? 448 * It has to be the same context/ringbuffer and not 449 * have any exceptions (e.g. GVT saying never to 450 * combine contexts). 451 * 452 * If we can combine the requests, we can execute both 453 * by updating the RING_TAIL to point to the end of the 454 * second request, and so we never need to tell the 455 * hardware about the first. 456 */ 457 if (last && !can_merge_ctx(rq->ctx, last->ctx)) { 458 /* 459 * If we are on the second port and cannot 460 * combine this request with the last, then we 461 * are done. 462 */ 463 if (port != engine->execlist_port) { 464 __list_del_many(&p->requests, 465 &rq->priotree.link); 466 goto done; 467 } 468 469 /* 470 * If GVT overrides us we only ever submit 471 * port[0], leaving port[1] empty. Note that we 472 * also have to be careful that we don't queue 473 * the same context (even though a different 474 * request) to the second port. 475 */ 476 if (ctx_single_port_submission(last->ctx) || 477 ctx_single_port_submission(rq->ctx)) { 478 __list_del_many(&p->requests, 479 &rq->priotree.link); 480 goto done; 481 } 482 483 GEM_BUG_ON(last->ctx == rq->ctx); 484 485 if (submit) 486 port_assign(port, last); 487 port++; 488 } 489 490 INIT_LIST_HEAD(&rq->priotree.link); 491 rq->priotree.priority = INT_MAX; 492 493 __i915_gem_request_submit(rq); 494 trace_i915_gem_request_in(rq, port_index(port, engine)); 495 last = rq; 496 submit = true; 497 } 498 499 rb = rb_next(rb); 500 rb_erase(&p->node, &engine->execlist_queue); 501 INIT_LIST_HEAD(&p->requests); 502 if (p->priority != I915_PRIORITY_NORMAL) 503 kmem_cache_free(engine->i915->priorities, p); 504 } 505done: 506 engine->execlist_first = rb; 507 if (submit) 508 port_assign(port, last); 509 spin_unlock_irq(&engine->timeline->lock); 510 511 if (submit) 512 execlists_submit_ports(engine); 513} 514 515static bool execlists_elsp_ready(const struct intel_engine_cs *engine) 516{ 517 const struct execlist_port *port = engine->execlist_port; 518 519 return port_count(&port[0]) + port_count(&port[1]) < 2; 520} 521 522/* 523 * Check the unread Context Status Buffers and manage the submission of new 524 * contexts to the ELSP accordingly. 
525 */ 526static void intel_lrc_irq_handler(unsigned long data) 527{ 528 struct intel_engine_cs *engine = (struct intel_engine_cs *)data; 529 struct execlist_port *port = engine->execlist_port; 530 struct drm_i915_private *dev_priv = engine->i915; 531 532 /* We can skip acquiring intel_runtime_pm_get() here as it was taken 533 * on our behalf by the request (see i915_gem_mark_busy()) and it will 534 * not be relinquished until the device is idle (see 535 * i915_gem_idle_work_handler()). As a precaution, we make sure 536 * that all ELSP are drained i.e. we have processed the CSB, 537 * before allowing ourselves to idle and calling intel_runtime_pm_put(). 538 */ 539 GEM_BUG_ON(!dev_priv->gt.awake); 540 541 intel_uncore_forcewake_get(dev_priv, engine->fw_domains); 542 543 /* Prefer doing test_and_clear_bit() as a two stage operation to avoid 544 * imposing the cost of a locked atomic transaction when submitting a 545 * new request (outside of the context-switch interrupt). 546 */ 547 while (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted)) { 548 u32 __iomem *csb_mmio = 549 dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)); 550 u32 __iomem *buf = 551 dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_BUF_LO(engine, 0)); 552 unsigned int head, tail; 553 554 /* The write will be ordered by the uncached read (itself 555 * a memory barrier), so we do not need another in the form 556 * of a locked instruction. The race between the interrupt 557 * handler and the split test/clear is harmless as we order 558 * our clear before the CSB read. If the interrupt arrived 559 * first between the test and the clear, we read the updated 560 * CSB and clear the bit. If the interrupt arrives as we read 561 * the CSB or later (i.e. after we had cleared the bit) the bit 562 * is set and we do a new loop. 563 */ 564 __clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted); 565 head = readl(csb_mmio); 566 tail = GEN8_CSB_WRITE_PTR(head); 567 head = GEN8_CSB_READ_PTR(head); 568 while (head != tail) { 569 struct drm_i915_gem_request *rq; 570 unsigned int status; 571 unsigned int count; 572 573 if (++head == GEN8_CSB_ENTRIES) 574 head = 0; 575 576 /* We are flying near dragons again. 577 * 578 * We hold a reference to the request in execlist_port[] 579 * but no more than that. We are operating in softirq 580 * context and so cannot hold any mutex or sleep. That 581 * prevents us stopping the requests we are processing 582 * in port[] from being retired simultaneously (the 583 * breadcrumb will be complete before we see the 584 * context-switch). As we only hold the reference to the 585 * request, any pointer chasing underneath the request 586 * is subject to a potential use-after-free. Thus we 587 * store all of the bookkeeping within port[] as 588 * required, and avoid using unguarded pointers beneath 589 * request itself. The same applies to the atomic 590 * status notifier. 
591 */ 592 593 status = readl(buf + 2 * head); 594 if (!(status & GEN8_CTX_STATUS_COMPLETED_MASK)) 595 continue; 596 597 /* Check the context/desc id for this event matches */ 598 GEM_DEBUG_BUG_ON(readl(buf + 2 * head + 1) != 599 port->context_id); 600 601 rq = port_unpack(port, &count); 602 GEM_BUG_ON(count == 0); 603 if (--count == 0) { 604 GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED); 605 GEM_BUG_ON(!i915_gem_request_completed(rq)); 606 execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_OUT); 607 608 trace_i915_gem_request_out(rq); 609 i915_gem_request_put(rq); 610 611 port[0] = port[1]; 612 memset(&port[1], 0, sizeof(port[1])); 613 } else { 614 port_set(port, port_pack(rq, count)); 615 } 616 617 /* After the final element, the hw should be idle */ 618 GEM_BUG_ON(port_count(port) == 0 && 619 !(status & GEN8_CTX_STATUS_ACTIVE_IDLE)); 620 } 621 622 writel(_MASKED_FIELD(GEN8_CSB_READ_PTR_MASK, head << 8), 623 csb_mmio); 624 } 625 626 if (execlists_elsp_ready(engine)) 627 execlists_dequeue(engine); 628 629 intel_uncore_forcewake_put(dev_priv, engine->fw_domains); 630} 631 632static bool 633insert_request(struct intel_engine_cs *engine, 634 struct i915_priotree *pt, 635 int prio) 636{ 637 struct i915_priolist *p; 638 struct rb_node **parent, *rb; 639 bool first = true; 640 641 if (unlikely(engine->no_priolist)) 642 prio = I915_PRIORITY_NORMAL; 643 644find_priolist: 645 /* most positive priority is scheduled first, equal priorities fifo */ 646 rb = NULL; 647 parent = &engine->execlist_queue.rb_node; 648 while (*parent) { 649 rb = *parent; 650 p = rb_entry(rb, typeof(*p), node); 651 if (prio > p->priority) { 652 parent = &rb->rb_left; 653 } else if (prio < p->priority) { 654 parent = &rb->rb_right; 655 first = false; 656 } else { 657 list_add_tail(&pt->link, &p->requests); 658 return false; 659 } 660 } 661 662 if (prio == I915_PRIORITY_NORMAL) { 663 p = &engine->default_priolist; 664 } else { 665 p = kmem_cache_alloc(engine->i915->priorities, GFP_ATOMIC); 666 /* Convert an allocation failure to a priority bump */ 667 if (unlikely(!p)) { 668 prio = I915_PRIORITY_NORMAL; /* recurses just once */ 669 670 /* To maintain ordering with all rendering, after an 671 * allocation failure we have to disable all scheduling. 672 * Requests will then be executed in fifo, and schedule 673 * will ensure that dependencies are emitted in fifo. 674 * There will be still some reordering with existing 675 * requests, so if userspace lied about their 676 * dependencies that reordering may be visible. 677 */ 678 engine->no_priolist = true; 679 goto find_priolist; 680 } 681 } 682 683 p->priority = prio; 684 rb_link_node(&p->node, rb, parent); 685 rb_insert_color(&p->node, &engine->execlist_queue); 686 687 INIT_LIST_HEAD(&p->requests); 688 list_add_tail(&pt->link, &p->requests); 689 690 if (first) 691 engine->execlist_first = &p->node; 692 693 return first; 694} 695 696static void execlists_submit_request(struct drm_i915_gem_request *request) 697{ 698 struct intel_engine_cs *engine = request->engine; 699 unsigned long flags; 700 701 /* Will be called from irq-context when using foreign fences. 
*/ 702 spin_lock_irqsave(&engine->timeline->lock, flags); 703 704 if (insert_request(engine, 705 &request->priotree, 706 request->priotree.priority)) { 707 if (execlists_elsp_ready(engine)) 708 tasklet_hi_schedule(&engine->irq_tasklet); 709 } 710 711 GEM_BUG_ON(!engine->execlist_first); 712 GEM_BUG_ON(list_empty(&request->priotree.link)); 713 714 spin_unlock_irqrestore(&engine->timeline->lock, flags); 715} 716 717static struct intel_engine_cs * 718pt_lock_engine(struct i915_priotree *pt, struct intel_engine_cs *locked) 719{ 720 struct intel_engine_cs *engine = 721 container_of(pt, struct drm_i915_gem_request, priotree)->engine; 722 723 GEM_BUG_ON(!locked); 724 725 if (engine != locked) { 726 spin_unlock(&locked->timeline->lock); 727 spin_lock(&engine->timeline->lock); 728 } 729 730 return engine; 731} 732 733static void execlists_schedule(struct drm_i915_gem_request *request, int prio) 734{ 735 struct intel_engine_cs *engine; 736 struct i915_dependency *dep, *p; 737 struct i915_dependency stack; 738 LIST_HEAD(dfs); 739 740 if (prio <= READ_ONCE(request->priotree.priority)) 741 return; 742 743 /* Need BKL in order to use the temporary link inside i915_dependency */ 744 lockdep_assert_held(&request->i915->drm.struct_mutex); 745 746 stack.signaler = &request->priotree; 747 list_add(&stack.dfs_link, &dfs); 748 749 /* Recursively bump all dependent priorities to match the new request. 750 * 751 * A naive approach would be to use recursion: 752 * static void update_priorities(struct i915_priotree *pt, prio) { 753 * list_for_each_entry(dep, &pt->signalers_list, signal_link) 754 * update_priorities(dep->signal, prio) 755 * insert_request(pt); 756 * } 757 * but that may have unlimited recursion depth and so runs a very 758 * real risk of overunning the kernel stack. Instead, we build 759 * a flat list of all dependencies starting with the current request. 760 * As we walk the list of dependencies, we add all of its dependencies 761 * to the end of the list (this may include an already visited 762 * request) and continue to walk onwards onto the new dependencies. The 763 * end result is a topological list of requests in reverse order, the 764 * last element in the list is the request we must execute first. 765 */ 766 list_for_each_entry_safe(dep, p, &dfs, dfs_link) { 767 struct i915_priotree *pt = dep->signaler; 768 769 /* Within an engine, there can be no cycle, but we may 770 * refer to the same dependency chain multiple times 771 * (redundant dependencies are not eliminated) and across 772 * engines. 773 */ 774 list_for_each_entry(p, &pt->signalers_list, signal_link) { 775 GEM_BUG_ON(p->signaler->priority < pt->priority); 776 if (prio > READ_ONCE(p->signaler->priority)) 777 list_move_tail(&p->dfs_link, &dfs); 778 } 779 780 list_safe_reset_next(dep, p, dfs_link); 781 } 782 783 /* If we didn't need to bump any existing priorities, and we haven't 784 * yet submitted this request (i.e. there is no potential race with 785 * execlists_submit_request()), we can set our own priority and skip 786 * acquiring the engine locks. 
787 */ 788 if (request->priotree.priority == INT_MIN) { 789 GEM_BUG_ON(!list_empty(&request->priotree.link)); 790 request->priotree.priority = prio; 791 if (stack.dfs_link.next == stack.dfs_link.prev) 792 return; 793 __list_del_entry(&stack.dfs_link); 794 } 795 796 engine = request->engine; 797 spin_lock_irq(&engine->timeline->lock); 798 799 /* Fifo and depth-first replacement ensure our deps execute before us */ 800 list_for_each_entry_safe_reverse(dep, p, &dfs, dfs_link) { 801 struct i915_priotree *pt = dep->signaler; 802 803 INIT_LIST_HEAD(&dep->dfs_link); 804 805 engine = pt_lock_engine(pt, engine); 806 807 if (prio <= pt->priority) 808 continue; 809 810 pt->priority = prio; 811 if (!list_empty(&pt->link)) { 812 __list_del_entry(&pt->link); 813 insert_request(engine, pt, prio); 814 } 815 } 816 817 spin_unlock_irq(&engine->timeline->lock); 818 819 /* XXX Do we need to preempt to make room for us and our deps? */ 820} 821 822static struct intel_ring * 823execlists_context_pin(struct intel_engine_cs *engine, 824 struct i915_gem_context *ctx) 825{ 826 struct intel_context *ce = &ctx->engine[engine->id]; 827 unsigned int flags; 828 void *vaddr; 829 int ret; 830 831 lockdep_assert_held(&ctx->i915->drm.struct_mutex); 832 833 if (likely(ce->pin_count++)) 834 goto out; 835 GEM_BUG_ON(!ce->pin_count); /* no overflow please! */ 836 837 if (!ce->state) { 838 ret = execlists_context_deferred_alloc(ctx, engine); 839 if (ret) 840 goto err; 841 } 842 GEM_BUG_ON(!ce->state); 843 844 flags = PIN_GLOBAL | PIN_HIGH; 845 if (ctx->ggtt_offset_bias) 846 flags |= PIN_OFFSET_BIAS | ctx->ggtt_offset_bias; 847 848 ret = i915_vma_pin(ce->state, 0, GEN8_LR_CONTEXT_ALIGN, flags); 849 if (ret) 850 goto err; 851 852 vaddr = i915_gem_object_pin_map(ce->state->obj, I915_MAP_WB); 853 if (IS_ERR(vaddr)) { 854 ret = PTR_ERR(vaddr); 855 goto unpin_vma; 856 } 857 858 ret = intel_ring_pin(ce->ring, ctx->i915, ctx->ggtt_offset_bias); 859 if (ret) 860 goto unpin_map; 861 862 intel_lr_context_descriptor_update(ctx, engine); 863 864 ce->lrc_reg_state = vaddr + LRC_STATE_PN * PAGE_SIZE; 865 ce->lrc_reg_state[CTX_RING_BUFFER_START+1] = 866 i915_ggtt_offset(ce->ring->vma); 867 868 ce->state->obj->mm.dirty = true; 869 870 i915_gem_context_get(ctx); 871out: 872 return ce->ring; 873 874unpin_map: 875 i915_gem_object_unpin_map(ce->state->obj); 876unpin_vma: 877 __i915_vma_unpin(ce->state); 878err: 879 ce->pin_count = 0; 880 return ERR_PTR(ret); 881} 882 883static void execlists_context_unpin(struct intel_engine_cs *engine, 884 struct i915_gem_context *ctx) 885{ 886 struct intel_context *ce = &ctx->engine[engine->id]; 887 888 lockdep_assert_held(&ctx->i915->drm.struct_mutex); 889 GEM_BUG_ON(ce->pin_count == 0); 890 891 if (--ce->pin_count) 892 return; 893 894 intel_ring_unpin(ce->ring); 895 896 i915_gem_object_unpin_map(ce->state->obj); 897 i915_vma_unpin(ce->state); 898 899 i915_gem_context_put(ctx); 900} 901 902static int execlists_request_alloc(struct drm_i915_gem_request *request) 903{ 904 struct intel_engine_cs *engine = request->engine; 905 struct intel_context *ce = &request->ctx->engine[engine->id]; 906 u32 *cs; 907 int ret; 908 909 GEM_BUG_ON(!ce->pin_count); 910 911 /* Flush enough space to reduce the likelihood of waiting after 912 * we start building the request - in which case we will just 913 * have to repeat work. 
914 */ 915 request->reserved_space += EXECLISTS_REQUEST_SIZE; 916 917 if (i915.enable_guc_submission) { 918 /* 919 * Check that the GuC has space for the request before 920 * going any further, as the i915_add_request() call 921 * later on mustn't fail ... 922 */ 923 ret = i915_guc_wq_reserve(request); 924 if (ret) 925 goto err; 926 } 927 928 cs = intel_ring_begin(request, 0); 929 if (IS_ERR(cs)) { 930 ret = PTR_ERR(cs); 931 goto err_unreserve; 932 } 933 934 if (!ce->initialised) { 935 ret = engine->init_context(request); 936 if (ret) 937 goto err_unreserve; 938 939 ce->initialised = true; 940 } 941 942 /* Note that after this point, we have committed to using 943 * this request as it is being used to both track the 944 * state of engine initialisation and liveness of the 945 * golden renderstate above. Think twice before you try 946 * to cancel/unwind this request now. 947 */ 948 949 request->reserved_space -= EXECLISTS_REQUEST_SIZE; 950 return 0; 951 952err_unreserve: 953 if (i915.enable_guc_submission) 954 i915_guc_wq_unreserve(request); 955err: 956 return ret; 957} 958 959/* 960 * In this WA we need to set GEN8_L3SQCREG4[21:21] and reset it after 961 * PIPE_CONTROL instruction. This is required for the flush to happen correctly 962 * but there is a slight complication as this is applied in WA batch where the 963 * values are only initialized once so we cannot take register value at the 964 * beginning and reuse it further; hence we save its value to memory, upload a 965 * constant value with bit21 set and then we restore it back with the saved value. 966 * To simplify the WA, a constant value is formed by using the default value 967 * of this register. This shouldn't be a problem because we are only modifying 968 * it for a short period and this batch in non-premptible. We can ofcourse 969 * use additional instructions that read the actual value of the register 970 * at that time and set our bit of interest but it makes the WA complicated. 971 * 972 * This WA is also required for Gen9 so extracting as a function avoids 973 * code duplication. 974 */ 975static u32 * 976gen8_emit_flush_coherentl3_wa(struct intel_engine_cs *engine, u32 *batch) 977{ 978 *batch++ = MI_STORE_REGISTER_MEM_GEN8 | MI_SRM_LRM_GLOBAL_GTT; 979 *batch++ = i915_mmio_reg_offset(GEN8_L3SQCREG4); 980 *batch++ = i915_ggtt_offset(engine->scratch) + 256; 981 *batch++ = 0; 982 983 *batch++ = MI_LOAD_REGISTER_IMM(1); 984 *batch++ = i915_mmio_reg_offset(GEN8_L3SQCREG4); 985 *batch++ = 0x40400000 | GEN8_LQSC_FLUSH_COHERENT_LINES; 986 987 batch = gen8_emit_pipe_control(batch, 988 PIPE_CONTROL_CS_STALL | 989 PIPE_CONTROL_DC_FLUSH_ENABLE, 990 0); 991 992 *batch++ = MI_LOAD_REGISTER_MEM_GEN8 | MI_SRM_LRM_GLOBAL_GTT; 993 *batch++ = i915_mmio_reg_offset(GEN8_L3SQCREG4); 994 *batch++ = i915_ggtt_offset(engine->scratch) + 256; 995 *batch++ = 0; 996 997 return batch; 998} 999 1000/* 1001 * Typically we only have one indirect_ctx and per_ctx batch buffer which are 1002 * initialized at the beginning and shared across all contexts but this field 1003 * helps us to have multiple batches at different offsets and select them based 1004 * on a criteria. At the moment this batch always start at the beginning of the page 1005 * and at this point we don't have multiple wa_ctx batch buffers. 1006 * 1007 * The number of WA applied are not known at the beginning; we use this field 1008 * to return the no of DWORDS written. 
1009 * 1010 * It is to be noted that this batch does not contain MI_BATCH_BUFFER_END 1011 * so it adds NOOPs as padding to make it cacheline aligned. 1012 * MI_BATCH_BUFFER_END will be added to perctx batch and both of them together 1013 * makes a complete batch buffer. 1014 */ 1015static u32 *gen8_init_indirectctx_bb(struct intel_engine_cs *engine, u32 *batch) 1016{ 1017 /* WaDisableCtxRestoreArbitration:bdw,chv */ 1018 *batch++ = MI_ARB_ON_OFF | MI_ARB_DISABLE; 1019 1020 /* WaFlushCoherentL3CacheLinesAtContextSwitch:bdw */ 1021 if (IS_BROADWELL(engine->i915)) 1022 batch = gen8_emit_flush_coherentl3_wa(engine, batch); 1023 1024 /* WaClearSlmSpaceAtContextSwitch:bdw,chv */ 1025 /* Actual scratch location is at 128 bytes offset */ 1026 batch = gen8_emit_pipe_control(batch, 1027 PIPE_CONTROL_FLUSH_L3 | 1028 PIPE_CONTROL_GLOBAL_GTT_IVB | 1029 PIPE_CONTROL_CS_STALL | 1030 PIPE_CONTROL_QW_WRITE, 1031 i915_ggtt_offset(engine->scratch) + 1032 2 * CACHELINE_BYTES); 1033 1034 /* Pad to end of cacheline */ 1035 while ((unsigned long)batch % CACHELINE_BYTES) 1036 *batch++ = MI_NOOP; 1037 1038 /* 1039 * MI_BATCH_BUFFER_END is not required in Indirect ctx BB because 1040 * execution depends on the length specified in terms of cache lines 1041 * in the register CTX_RCS_INDIRECT_CTX 1042 */ 1043 1044 return batch; 1045} 1046 1047/* 1048 * This batch is started immediately after indirect_ctx batch. Since we ensure 1049 * that indirect_ctx ends on a cacheline this batch is aligned automatically. 1050 * 1051 * The number of DWORDS written are returned using this field. 1052 * 1053 * This batch is terminated with MI_BATCH_BUFFER_END and so we need not add padding 1054 * to align it with cacheline as padding after MI_BATCH_BUFFER_END is redundant. 1055 */ 1056static u32 *gen8_init_perctx_bb(struct intel_engine_cs *engine, u32 *batch) 1057{ 1058 /* WaDisableCtxRestoreArbitration:bdw,chv */ 1059 *batch++ = MI_ARB_ON_OFF | MI_ARB_ENABLE; 1060 *batch++ = MI_BATCH_BUFFER_END; 1061 1062 return batch; 1063} 1064 1065static u32 *gen9_init_indirectctx_bb(struct intel_engine_cs *engine, u32 *batch) 1066{ 1067 /* WaFlushCoherentL3CacheLinesAtContextSwitch:skl,bxt,glk */ 1068 batch = gen8_emit_flush_coherentl3_wa(engine, batch); 1069 1070 /* WaDisableGatherAtSetShaderCommonSlice:skl,bxt,kbl,glk */ 1071 *batch++ = MI_LOAD_REGISTER_IMM(1); 1072 *batch++ = i915_mmio_reg_offset(COMMON_SLICE_CHICKEN2); 1073 *batch++ = _MASKED_BIT_DISABLE( 1074 GEN9_DISABLE_GATHER_AT_SET_SHADER_COMMON_SLICE); 1075 *batch++ = MI_NOOP; 1076 1077 /* WaClearSlmSpaceAtContextSwitch:kbl */ 1078 /* Actual scratch location is at 128 bytes offset */ 1079 if (IS_KBL_REVID(engine->i915, 0, KBL_REVID_A0)) { 1080 batch = gen8_emit_pipe_control(batch, 1081 PIPE_CONTROL_FLUSH_L3 | 1082 PIPE_CONTROL_GLOBAL_GTT_IVB | 1083 PIPE_CONTROL_CS_STALL | 1084 PIPE_CONTROL_QW_WRITE, 1085 i915_ggtt_offset(engine->scratch) 1086 + 2 * CACHELINE_BYTES); 1087 } 1088 1089 /* WaMediaPoolStateCmdInWABB:bxt,glk */ 1090 if (HAS_POOLED_EU(engine->i915)) { 1091 /* 1092 * EU pool configuration is setup along with golden context 1093 * during context initialization. This value depends on 1094 * device type (2x6 or 3x6) and needs to be updated based 1095 * on which subslice is disabled especially for 2x6 1096 * devices, however it is safe to load default 1097 * configuration of 3x6 device instead of masking off 1098 * corresponding bits because HW ignores bits of a disabled 1099 * subslice and drops down to appropriate config. 
Please 1100 * see render_state_setup() in i915_gem_render_state.c for 1101 * possible configurations, to avoid duplication they are 1102 * not shown here again. 1103 */ 1104 *batch++ = GEN9_MEDIA_POOL_STATE; 1105 *batch++ = GEN9_MEDIA_POOL_ENABLE; 1106 *batch++ = 0x00777000; 1107 *batch++ = 0; 1108 *batch++ = 0; 1109 *batch++ = 0; 1110 } 1111 1112 /* Pad to end of cacheline */ 1113 while ((unsigned long)batch % CACHELINE_BYTES) 1114 *batch++ = MI_NOOP; 1115 1116 return batch; 1117} 1118 1119static u32 *gen9_init_perctx_bb(struct intel_engine_cs *engine, u32 *batch) 1120{ 1121 *batch++ = MI_BATCH_BUFFER_END; 1122 1123 return batch; 1124} 1125 1126#define CTX_WA_BB_OBJ_SIZE (PAGE_SIZE) 1127 1128static int lrc_setup_wa_ctx(struct intel_engine_cs *engine) 1129{ 1130 struct drm_i915_gem_object *obj; 1131 struct i915_vma *vma; 1132 int err; 1133 1134 obj = i915_gem_object_create(engine->i915, CTX_WA_BB_OBJ_SIZE); 1135 if (IS_ERR(obj)) 1136 return PTR_ERR(obj); 1137 1138 vma = i915_vma_instance(obj, &engine->i915->ggtt.base, NULL); 1139 if (IS_ERR(vma)) { 1140 err = PTR_ERR(vma); 1141 goto err; 1142 } 1143 1144 err = i915_vma_pin(vma, 0, PAGE_SIZE, PIN_GLOBAL | PIN_HIGH); 1145 if (err) 1146 goto err; 1147 1148 engine->wa_ctx.vma = vma; 1149 return 0; 1150 1151err: 1152 i915_gem_object_put(obj); 1153 return err; 1154} 1155 1156static void lrc_destroy_wa_ctx(struct intel_engine_cs *engine) 1157{ 1158 i915_vma_unpin_and_release(&engine->wa_ctx.vma); 1159} 1160 1161typedef u32 *(*wa_bb_func_t)(struct intel_engine_cs *engine, u32 *batch); 1162 1163static int intel_init_workaround_bb(struct intel_engine_cs *engine) 1164{ 1165 struct i915_ctx_workarounds *wa_ctx = &engine->wa_ctx; 1166 struct i915_wa_ctx_bb *wa_bb[2] = { &wa_ctx->indirect_ctx, 1167 &wa_ctx->per_ctx }; 1168 wa_bb_func_t wa_bb_fn[2]; 1169 struct page *page; 1170 void *batch, *batch_ptr; 1171 unsigned int i; 1172 int ret; 1173 1174 if (WARN_ON(engine->id != RCS || !engine->scratch)) 1175 return -EINVAL; 1176 1177 switch (INTEL_GEN(engine->i915)) { 1178 case 9: 1179 wa_bb_fn[0] = gen9_init_indirectctx_bb; 1180 wa_bb_fn[1] = gen9_init_perctx_bb; 1181 break; 1182 case 8: 1183 wa_bb_fn[0] = gen8_init_indirectctx_bb; 1184 wa_bb_fn[1] = gen8_init_perctx_bb; 1185 break; 1186 default: 1187 MISSING_CASE(INTEL_GEN(engine->i915)); 1188 return 0; 1189 } 1190 1191 ret = lrc_setup_wa_ctx(engine); 1192 if (ret) { 1193 DRM_DEBUG_DRIVER("Failed to setup context WA page: %d\n", ret); 1194 return ret; 1195 } 1196 1197 page = i915_gem_object_get_dirty_page(wa_ctx->vma->obj, 0); 1198 batch = batch_ptr = kmap_atomic(page); 1199 1200 /* 1201 * Emit the two workaround batch buffers, recording the offset from the 1202 * start of the workaround batch buffer object for each and their 1203 * respective sizes. 
1204 */ 1205 for (i = 0; i < ARRAY_SIZE(wa_bb_fn); i++) { 1206 wa_bb[i]->offset = batch_ptr - batch; 1207 if (WARN_ON(!IS_ALIGNED(wa_bb[i]->offset, CACHELINE_BYTES))) { 1208 ret = -EINVAL; 1209 break; 1210 } 1211 batch_ptr = wa_bb_fn[i](engine, batch_ptr); 1212 wa_bb[i]->size = batch_ptr - (batch + wa_bb[i]->offset); 1213 } 1214 1215 BUG_ON(batch_ptr - batch > CTX_WA_BB_OBJ_SIZE); 1216 1217 kunmap_atomic(batch); 1218 if (ret) 1219 lrc_destroy_wa_ctx(engine); 1220 1221 return ret; 1222} 1223 1224static u8 gtiir[] = { 1225 [RCS] = 0, 1226 [BCS] = 0, 1227 [VCS] = 1, 1228 [VCS2] = 1, 1229 [VECS] = 3, 1230}; 1231 1232static int gen8_init_common_ring(struct intel_engine_cs *engine) 1233{ 1234 struct drm_i915_private *dev_priv = engine->i915; 1235 struct execlist_port *port = engine->execlist_port; 1236 unsigned int n; 1237 bool submit; 1238 int ret; 1239 1240 ret = intel_mocs_init_engine(engine); 1241 if (ret) 1242 return ret; 1243 1244 intel_engine_reset_breadcrumbs(engine); 1245 intel_engine_init_hangcheck(engine); 1246 1247 I915_WRITE(RING_HWSTAM(engine->mmio_base), 0xffffffff); 1248 I915_WRITE(RING_MODE_GEN7(engine), 1249 _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE)); 1250 I915_WRITE(RING_HWS_PGA(engine->mmio_base), 1251 engine->status_page.ggtt_offset); 1252 POSTING_READ(RING_HWS_PGA(engine->mmio_base)); 1253 1254 DRM_DEBUG_DRIVER("Execlists enabled for %s\n", engine->name); 1255 1256 GEM_BUG_ON(engine->id >= ARRAY_SIZE(gtiir)); 1257 1258 /* 1259 * Clear any pending interrupt state. 1260 * 1261 * We do it twice out of paranoia that some of the IIR are double 1262 * buffered, and if we only reset it once there may still be 1263 * an interrupt pending. 1264 */ 1265 I915_WRITE(GEN8_GT_IIR(gtiir[engine->id]), 1266 GT_CONTEXT_SWITCH_INTERRUPT << engine->irq_shift); 1267 I915_WRITE(GEN8_GT_IIR(gtiir[engine->id]), 1268 GT_CONTEXT_SWITCH_INTERRUPT << engine->irq_shift); 1269 clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted); 1270 1271 /* After a GPU reset, we may have requests to replay */ 1272 submit = false; 1273 for (n = 0; n < ARRAY_SIZE(engine->execlist_port); n++) { 1274 if (!port_isset(&port[n])) 1275 break; 1276 1277 DRM_DEBUG_DRIVER("Restarting %s:%d from 0x%x\n", 1278 engine->name, n, 1279 port_request(&port[n])->global_seqno); 1280 1281 /* Discard the current inflight count */ 1282 port_set(&port[n], port_request(&port[n])); 1283 submit = true; 1284 } 1285 1286 if (submit && !i915.enable_guc_submission) 1287 execlists_submit_ports(engine); 1288 1289 return 0; 1290} 1291 1292static int gen8_init_render_ring(struct intel_engine_cs *engine) 1293{ 1294 struct drm_i915_private *dev_priv = engine->i915; 1295 int ret; 1296 1297 ret = gen8_init_common_ring(engine); 1298 if (ret) 1299 return ret; 1300 1301 /* We need to disable the AsyncFlip performance optimisations in order 1302 * to use MI_WAIT_FOR_EVENT within the CS. It should already be 1303 * programmed to '1' on all products. 
1304 * 1305 * WaDisableAsyncFlipPerfMode:snb,ivb,hsw,vlv,bdw,chv 1306 */ 1307 I915_WRITE(MI_MODE, _MASKED_BIT_ENABLE(ASYNC_FLIP_PERF_DISABLE)); 1308 1309 I915_WRITE(INSTPM, _MASKED_BIT_ENABLE(INSTPM_FORCE_ORDERING)); 1310 1311 return init_workarounds_ring(engine); 1312} 1313 1314static int gen9_init_render_ring(struct intel_engine_cs *engine) 1315{ 1316 int ret; 1317 1318 ret = gen8_init_common_ring(engine); 1319 if (ret) 1320 return ret; 1321 1322 return init_workarounds_ring(engine); 1323} 1324 1325static void reset_common_ring(struct intel_engine_cs *engine, 1326 struct drm_i915_gem_request *request) 1327{ 1328 struct execlist_port *port = engine->execlist_port; 1329 struct intel_context *ce; 1330 unsigned int n; 1331 1332 /* 1333 * Catch up with any missed context-switch interrupts. 1334 * 1335 * Ideally we would just read the remaining CSB entries now that we 1336 * know the gpu is idle. However, the CSB registers are sometimes^W 1337 * often trashed across a GPU reset! Instead we have to rely on 1338 * guessing the missed context-switch events by looking at what 1339 * requests were completed. 1340 */ 1341 if (!request) { 1342 for (n = 0; n < ARRAY_SIZE(engine->execlist_port); n++) 1343 i915_gem_request_put(port_request(&port[n])); 1344 memset(engine->execlist_port, 0, sizeof(engine->execlist_port)); 1345 return; 1346 } 1347 1348 if (request->ctx != port_request(port)->ctx) { 1349 i915_gem_request_put(port_request(port)); 1350 port[0] = port[1]; 1351 memset(&port[1], 0, sizeof(port[1])); 1352 } 1353 1354 GEM_BUG_ON(request->ctx != port_request(port)->ctx); 1355 1356 /* If the request was innocent, we leave the request in the ELSP 1357 * and will try to replay it on restarting. The context image may 1358 * have been corrupted by the reset, in which case we may have 1359 * to service a new GPU hang, but more likely we can continue on 1360 * without impact. 1361 * 1362 * If the request was guilty, we presume the context is corrupt 1363 * and have to at least restore the RING register in the context 1364 * image back to the expected values to skip over the guilty request. 1365 */ 1366 if (request->fence.error != -EIO) 1367 return; 1368 1369 /* We want a simple context + ring to execute the breadcrumb update. 1370 * We cannot rely on the context being intact across the GPU hang, 1371 * so clear it and rebuild just what we need for the breadcrumb. 1372 * All pending requests for this context will be zapped, and any 1373 * future request will be after userspace has had the opportunity 1374 * to recreate its own state. 
1375 */ 1376 ce = &request->ctx->engine[engine->id]; 1377 execlists_init_reg_state(ce->lrc_reg_state, 1378 request->ctx, engine, ce->ring); 1379 1380 /* Move the RING_HEAD onto the breadcrumb, past the hanging batch */ 1381 ce->lrc_reg_state[CTX_RING_BUFFER_START+1] = 1382 i915_ggtt_offset(ce->ring->vma); 1383 ce->lrc_reg_state[CTX_RING_HEAD+1] = request->postfix; 1384 1385 request->ring->head = request->postfix; 1386 intel_ring_update_space(request->ring); 1387 1388 /* Reset WaIdleLiteRestore:bdw,skl as well */ 1389 request->tail = 1390 intel_ring_wrap(request->ring, 1391 request->wa_tail - WA_TAIL_DWORDS*sizeof(u32)); 1392 assert_ring_tail_valid(request->ring, request->tail); 1393} 1394 1395static int intel_logical_ring_emit_pdps(struct drm_i915_gem_request *req) 1396{ 1397 struct i915_hw_ppgtt *ppgtt = req->ctx->ppgtt; 1398 struct intel_engine_cs *engine = req->engine; 1399 const int num_lri_cmds = GEN8_3LVL_PDPES * 2; 1400 u32 *cs; 1401 int i; 1402 1403 cs = intel_ring_begin(req, num_lri_cmds * 2 + 2); 1404 if (IS_ERR(cs)) 1405 return PTR_ERR(cs); 1406 1407 *cs++ = MI_LOAD_REGISTER_IMM(num_lri_cmds); 1408 for (i = GEN8_3LVL_PDPES - 1; i >= 0; i--) { 1409 const dma_addr_t pd_daddr = i915_page_dir_dma_addr(ppgtt, i); 1410 1411 *cs++ = i915_mmio_reg_offset(GEN8_RING_PDP_UDW(engine, i)); 1412 *cs++ = upper_32_bits(pd_daddr); 1413 *cs++ = i915_mmio_reg_offset(GEN8_RING_PDP_LDW(engine, i)); 1414 *cs++ = lower_32_bits(pd_daddr); 1415 } 1416 1417 *cs++ = MI_NOOP; 1418 intel_ring_advance(req, cs); 1419 1420 return 0; 1421} 1422 1423static int gen8_emit_bb_start(struct drm_i915_gem_request *req, 1424 u64 offset, u32 len, 1425 const unsigned int flags) 1426{ 1427 u32 *cs; 1428 int ret; 1429 1430 /* Don't rely in hw updating PDPs, specially in lite-restore. 1431 * Ideally, we should set Force PD Restore in ctx descriptor, 1432 * but we can't. Force Restore would be a second option, but 1433 * it is unsafe in case of lite-restore (because the ctx is 1434 * not idle). PML4 is allocated during ppgtt init so this is 1435 * not needed in 48-bit.*/ 1436 if (req->ctx->ppgtt && 1437 (intel_engine_flag(req->engine) & req->ctx->ppgtt->pd_dirty_rings) && 1438 !i915_vm_is_48bit(&req->ctx->ppgtt->base) && 1439 !intel_vgpu_active(req->i915)) { 1440 ret = intel_logical_ring_emit_pdps(req); 1441 if (ret) 1442 return ret; 1443 1444 req->ctx->ppgtt->pd_dirty_rings &= ~intel_engine_flag(req->engine); 1445 } 1446 1447 cs = intel_ring_begin(req, 4); 1448 if (IS_ERR(cs)) 1449 return PTR_ERR(cs); 1450 1451 /* FIXME(BDW): Address space and security selectors. */ 1452 *cs++ = MI_BATCH_BUFFER_START_GEN8 | 1453 (flags & I915_DISPATCH_SECURE ? 0 : BIT(8)) | 1454 (flags & I915_DISPATCH_RS ? 
MI_BATCH_RESOURCE_STREAMER : 0); 1455 *cs++ = lower_32_bits(offset); 1456 *cs++ = upper_32_bits(offset); 1457 *cs++ = MI_NOOP; 1458 intel_ring_advance(req, cs); 1459 1460 return 0; 1461} 1462 1463static void gen8_logical_ring_enable_irq(struct intel_engine_cs *engine) 1464{ 1465 struct drm_i915_private *dev_priv = engine->i915; 1466 I915_WRITE_IMR(engine, 1467 ~(engine->irq_enable_mask | engine->irq_keep_mask)); 1468 POSTING_READ_FW(RING_IMR(engine->mmio_base)); 1469} 1470 1471static void gen8_logical_ring_disable_irq(struct intel_engine_cs *engine) 1472{ 1473 struct drm_i915_private *dev_priv = engine->i915; 1474 I915_WRITE_IMR(engine, ~engine->irq_keep_mask); 1475} 1476 1477static int gen8_emit_flush(struct drm_i915_gem_request *request, u32 mode) 1478{ 1479 u32 cmd, *cs; 1480 1481 cs = intel_ring_begin(request, 4); 1482 if (IS_ERR(cs)) 1483 return PTR_ERR(cs); 1484 1485 cmd = MI_FLUSH_DW + 1; 1486 1487 /* We always require a command barrier so that subsequent 1488 * commands, such as breadcrumb interrupts, are strictly ordered 1489 * wrt the contents of the write cache being flushed to memory 1490 * (and thus being coherent from the CPU). 1491 */ 1492 cmd |= MI_FLUSH_DW_STORE_INDEX | MI_FLUSH_DW_OP_STOREDW; 1493 1494 if (mode & EMIT_INVALIDATE) { 1495 cmd |= MI_INVALIDATE_TLB; 1496 if (request->engine->id == VCS) 1497 cmd |= MI_INVALIDATE_BSD; 1498 } 1499 1500 *cs++ = cmd; 1501 *cs++ = I915_GEM_HWS_SCRATCH_ADDR | MI_FLUSH_DW_USE_GTT; 1502 *cs++ = 0; /* upper addr */ 1503 *cs++ = 0; /* value */ 1504 intel_ring_advance(request, cs); 1505 1506 return 0; 1507} 1508 1509static int gen8_emit_flush_render(struct drm_i915_gem_request *request, 1510 u32 mode) 1511{ 1512 struct intel_engine_cs *engine = request->engine; 1513 u32 scratch_addr = 1514 i915_ggtt_offset(engine->scratch) + 2 * CACHELINE_BYTES; 1515 bool vf_flush_wa = false, dc_flush_wa = false; 1516 u32 *cs, flags = 0; 1517 int len; 1518 1519 flags |= PIPE_CONTROL_CS_STALL; 1520 1521 if (mode & EMIT_FLUSH) { 1522 flags |= PIPE_CONTROL_RENDER_TARGET_CACHE_FLUSH; 1523 flags |= PIPE_CONTROL_DEPTH_CACHE_FLUSH; 1524 flags |= PIPE_CONTROL_DC_FLUSH_ENABLE; 1525 flags |= PIPE_CONTROL_FLUSH_ENABLE; 1526 } 1527 1528 if (mode & EMIT_INVALIDATE) { 1529 flags |= PIPE_CONTROL_TLB_INVALIDATE; 1530 flags |= PIPE_CONTROL_INSTRUCTION_CACHE_INVALIDATE; 1531 flags |= PIPE_CONTROL_TEXTURE_CACHE_INVALIDATE; 1532 flags |= PIPE_CONTROL_VF_CACHE_INVALIDATE; 1533 flags |= PIPE_CONTROL_CONST_CACHE_INVALIDATE; 1534 flags |= PIPE_CONTROL_STATE_CACHE_INVALIDATE; 1535 flags |= PIPE_CONTROL_QW_WRITE; 1536 flags |= PIPE_CONTROL_GLOBAL_GTT_IVB; 1537 1538 /* 1539 * On GEN9: before VF_CACHE_INVALIDATE we need to emit a NULL 1540 * pipe control. 
1541 */ 1542 if (IS_GEN9(request->i915)) 1543 vf_flush_wa = true; 1544 1545 /* WaForGAMHang:kbl */ 1546 if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0)) 1547 dc_flush_wa = true; 1548 } 1549 1550 len = 6; 1551 1552 if (vf_flush_wa) 1553 len += 6; 1554 1555 if (dc_flush_wa) 1556 len += 12; 1557 1558 cs = intel_ring_begin(request, len); 1559 if (IS_ERR(cs)) 1560 return PTR_ERR(cs); 1561 1562 if (vf_flush_wa) 1563 cs = gen8_emit_pipe_control(cs, 0, 0); 1564 1565 if (dc_flush_wa) 1566 cs = gen8_emit_pipe_control(cs, PIPE_CONTROL_DC_FLUSH_ENABLE, 1567 0); 1568 1569 cs = gen8_emit_pipe_control(cs, flags, scratch_addr); 1570 1571 if (dc_flush_wa) 1572 cs = gen8_emit_pipe_control(cs, PIPE_CONTROL_CS_STALL, 0); 1573 1574 intel_ring_advance(request, cs); 1575 1576 return 0; 1577} 1578 1579/* 1580 * Reserve space for 2 NOOPs at the end of each request to be 1581 * used as a workaround for not being allowed to do lite 1582 * restore with HEAD==TAIL (WaIdleLiteRestore). 1583 */ 1584static void gen8_emit_wa_tail(struct drm_i915_gem_request *request, u32 *cs) 1585{ 1586 *cs++ = MI_NOOP; 1587 *cs++ = MI_NOOP; 1588 request->wa_tail = intel_ring_offset(request, cs); 1589} 1590 1591static void gen8_emit_breadcrumb(struct drm_i915_gem_request *request, u32 *cs) 1592{ 1593 /* w/a: bit 5 needs to be zero for MI_FLUSH_DW address. */ 1594 BUILD_BUG_ON(I915_GEM_HWS_INDEX_ADDR & (1 << 5)); 1595 1596 *cs++ = (MI_FLUSH_DW + 1) | MI_FLUSH_DW_OP_STOREDW; 1597 *cs++ = intel_hws_seqno_address(request->engine) | MI_FLUSH_DW_USE_GTT; 1598 *cs++ = 0; 1599 *cs++ = request->global_seqno; 1600 *cs++ = MI_USER_INTERRUPT; 1601 *cs++ = MI_NOOP; 1602 request->tail = intel_ring_offset(request, cs); 1603 assert_ring_tail_valid(request->ring, request->tail); 1604 1605 gen8_emit_wa_tail(request, cs); 1606} 1607 1608static const int gen8_emit_breadcrumb_sz = 6 + WA_TAIL_DWORDS; 1609 1610static void gen8_emit_breadcrumb_render(struct drm_i915_gem_request *request, 1611 u32 *cs) 1612{ 1613 /* We're using qword write, seqno should be aligned to 8 bytes. */ 1614 BUILD_BUG_ON(I915_GEM_HWS_INDEX & 1); 1615 1616 /* w/a for post sync ops following a GPGPU operation we 1617 * need a prior CS_STALL, which is emitted by the flush 1618 * following the batch. 1619 */ 1620 *cs++ = GFX_OP_PIPE_CONTROL(6); 1621 *cs++ = PIPE_CONTROL_GLOBAL_GTT_IVB | PIPE_CONTROL_CS_STALL | 1622 PIPE_CONTROL_QW_WRITE; 1623 *cs++ = intel_hws_seqno_address(request->engine); 1624 *cs++ = 0; 1625 *cs++ = request->global_seqno; 1626 /* We're thrashing one dword of HWS. */ 1627 *cs++ = 0; 1628 *cs++ = MI_USER_INTERRUPT; 1629 *cs++ = MI_NOOP; 1630 request->tail = intel_ring_offset(request, cs); 1631 assert_ring_tail_valid(request->ring, request->tail); 1632 1633 gen8_emit_wa_tail(request, cs); 1634} 1635 1636static const int gen8_emit_breadcrumb_render_sz = 8 + WA_TAIL_DWORDS; 1637 1638static int gen8_init_rcs_context(struct drm_i915_gem_request *req) 1639{ 1640 int ret; 1641 1642 ret = intel_ring_workarounds_emit(req); 1643 if (ret) 1644 return ret; 1645 1646 ret = intel_rcs_context_init_mocs(req); 1647 /* 1648 * Failing to program the MOCS is non-fatal.The system will not 1649 * run at peak performance. So generate an error and carry on. 1650 */ 1651 if (ret) 1652 DRM_ERROR("MOCS failed to program: expect performance issues.\n"); 1653 1654 return i915_gem_render_state_emit(req); 1655} 1656 1657/** 1658 * intel_logical_ring_cleanup() - deallocate the Engine Command Streamer 1659 * @engine: Engine Command Streamer. 
1660 */ 1661void intel_logical_ring_cleanup(struct intel_engine_cs *engine) 1662{ 1663 struct drm_i915_private *dev_priv; 1664 1665 /* 1666 * Tasklet cannot be active at this point due intel_mark_active/idle 1667 * so this is just for documentation. 1668 */ 1669 if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &engine->irq_tasklet.state))) 1670 tasklet_kill(&engine->irq_tasklet); 1671 1672 dev_priv = engine->i915; 1673 1674 if (engine->buffer) { 1675 WARN_ON((I915_READ_MODE(engine) & MODE_IDLE) == 0); 1676 } 1677 1678 if (engine->cleanup) 1679 engine->cleanup(engine); 1680 1681 if (engine->status_page.vma) { 1682 i915_gem_object_unpin_map(engine->status_page.vma->obj); 1683 engine->status_page.vma = NULL; 1684 } 1685 1686 intel_engine_cleanup_common(engine); 1687 1688 lrc_destroy_wa_ctx(engine); 1689 engine->i915 = NULL; 1690 dev_priv->engine[engine->id] = NULL; 1691 kfree(engine); 1692} 1693 1694static void execlists_set_default_submission(struct intel_engine_cs *engine) 1695{ 1696 engine->submit_request = execlists_submit_request; 1697 engine->schedule = execlists_schedule; 1698 engine->irq_tasklet.func = intel_lrc_irq_handler; 1699} 1700 1701static void 1702logical_ring_default_vfuncs(struct intel_engine_cs *engine) 1703{ 1704 /* Default vfuncs which can be overriden by each engine. */ 1705 engine->init_hw = gen8_init_common_ring; 1706 engine->reset_hw = reset_common_ring; 1707 1708 engine->context_pin = execlists_context_pin; 1709 engine->context_unpin = execlists_context_unpin; 1710 1711 engine->request_alloc = execlists_request_alloc; 1712 1713 engine->emit_flush = gen8_emit_flush; 1714 engine->emit_breadcrumb = gen8_emit_breadcrumb; 1715 engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_sz; 1716 1717 engine->set_default_submission = execlists_set_default_submission; 1718 1719 engine->irq_enable = gen8_logical_ring_enable_irq; 1720 engine->irq_disable = gen8_logical_ring_disable_irq; 1721 engine->emit_bb_start = gen8_emit_bb_start; 1722} 1723 1724static inline void 1725logical_ring_default_irqs(struct intel_engine_cs *engine) 1726{ 1727 unsigned shift = engine->irq_shift; 1728 engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift; 1729 engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift; 1730} 1731 1732static int 1733lrc_setup_hws(struct intel_engine_cs *engine, struct i915_vma *vma) 1734{ 1735 const int hws_offset = LRC_PPHWSP_PN * PAGE_SIZE; 1736 void *hws; 1737 1738 /* The HWSP is part of the default context object in LRC mode. */ 1739 hws = i915_gem_object_pin_map(vma->obj, I915_MAP_WB); 1740 if (IS_ERR(hws)) 1741 return PTR_ERR(hws); 1742 1743 engine->status_page.page_addr = hws + hws_offset; 1744 engine->status_page.ggtt_offset = i915_ggtt_offset(vma) + hws_offset; 1745 engine->status_page.vma = vma; 1746 1747 return 0; 1748} 1749 1750static void 1751logical_ring_setup(struct intel_engine_cs *engine) 1752{ 1753 struct drm_i915_private *dev_priv = engine->i915; 1754 enum forcewake_domains fw_domains; 1755 1756 intel_engine_setup_common(engine); 1757 1758 /* Intentionally left blank. 

static void
logical_ring_setup(struct intel_engine_cs *engine)
{
	struct drm_i915_private *dev_priv = engine->i915;
	enum forcewake_domains fw_domains;

	intel_engine_setup_common(engine);

	/* Intentionally left blank. */
	engine->buffer = NULL;

	fw_domains = intel_uncore_forcewake_for_reg(dev_priv,
						    RING_ELSP(engine),
						    FW_REG_WRITE);

	fw_domains |= intel_uncore_forcewake_for_reg(dev_priv,
						     RING_CONTEXT_STATUS_PTR(engine),
						     FW_REG_READ | FW_REG_WRITE);

	fw_domains |= intel_uncore_forcewake_for_reg(dev_priv,
						     RING_CONTEXT_STATUS_BUF_BASE(engine),
						     FW_REG_READ);

	engine->fw_domains = fw_domains;

	tasklet_init(&engine->irq_tasklet,
		     intel_lrc_irq_handler, (unsigned long)engine);

	logical_ring_default_vfuncs(engine);
	logical_ring_default_irqs(engine);
}

static int
logical_ring_init(struct intel_engine_cs *engine)
{
	struct i915_gem_context *dctx = engine->i915->kernel_context;
	int ret;

	ret = intel_engine_init_common(engine);
	if (ret)
		goto error;

	/* And set up the hardware status page. */
	ret = lrc_setup_hws(engine, dctx->engine[engine->id].state);
	if (ret) {
		DRM_ERROR("Failed to set up hws %s: %d\n", engine->name, ret);
		goto error;
	}

	return 0;

error:
	intel_logical_ring_cleanup(engine);
	return ret;
}

int logical_render_ring_init(struct intel_engine_cs *engine)
{
	struct drm_i915_private *dev_priv = engine->i915;
	int ret;

	logical_ring_setup(engine);

	if (HAS_L3_DPF(dev_priv))
		engine->irq_keep_mask |= GT_RENDER_L3_PARITY_ERROR_INTERRUPT;

	/* Override some for render ring. */
	if (INTEL_GEN(dev_priv) >= 9)
		engine->init_hw = gen9_init_render_ring;
	else
		engine->init_hw = gen8_init_render_ring;
	engine->init_context = gen8_init_rcs_context;
	engine->emit_flush = gen8_emit_flush_render;
	engine->emit_breadcrumb = gen8_emit_breadcrumb_render;
	engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_render_sz;

	ret = intel_engine_create_scratch(engine, PAGE_SIZE);
	if (ret)
		return ret;

	ret = intel_init_workaround_bb(engine);
	if (ret) {
		/*
		 * We continue even if we fail to initialize the WA batch
		 * because we only expect rare glitches, but nothing
		 * critical enough to prevent us from using the GPU.
		 */
		DRM_ERROR("WA batch buffer initialization failed: %d\n",
			  ret);
	}

	return logical_ring_init(engine);
}

int logical_xcs_ring_init(struct intel_engine_cs *engine)
{
	logical_ring_setup(engine);

	return logical_ring_init(engine);
}
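
/*
 * Editor's note (illustrative sketch, not part of the driver):
 * logical_render_ring_init() and logical_xcs_ring_init() are the two
 * execlists entry points handed to the engine setup code; only the
 * render engine gets the RCS-specific overrides (context init, render
 * flush/breadcrumb, scratch page, WA batch). The dispatch below is a
 * simplification of how a caller would pick between them; I believe the
 * real per-engine table lives in intel_engine_cs.c.
 */
#if 0 /* standalone sketch, not compiled into the driver */
static int init_execlists_engine_sketch(struct intel_engine_cs *engine)
{
	return engine->id == RCS ? logical_render_ring_init(engine)
				 : logical_xcs_ring_init(engine);
}
#endif
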

static u32
make_rpcs(struct drm_i915_private *dev_priv)
{
	u32 rpcs = 0;

	/*
	 * No explicit RPCS request is needed to ensure full
	 * slice/subslice/EU enablement prior to Gen9.
	 */
	if (INTEL_GEN(dev_priv) < 9)
		return 0;

	/*
	 * Starting in Gen9, render power gating can leave
	 * slice/subslice/EU in a partially enabled state. We
	 * must make an explicit request through RPCS for full
	 * enablement.
	 */
	if (INTEL_INFO(dev_priv)->sseu.has_slice_pg) {
		rpcs |= GEN8_RPCS_S_CNT_ENABLE;
		rpcs |= hweight8(INTEL_INFO(dev_priv)->sseu.slice_mask) <<
			GEN8_RPCS_S_CNT_SHIFT;
		rpcs |= GEN8_RPCS_ENABLE;
	}

	if (INTEL_INFO(dev_priv)->sseu.has_subslice_pg) {
		rpcs |= GEN8_RPCS_SS_CNT_ENABLE;
		rpcs |= hweight8(INTEL_INFO(dev_priv)->sseu.subslice_mask) <<
			GEN8_RPCS_SS_CNT_SHIFT;
		rpcs |= GEN8_RPCS_ENABLE;
	}

	if (INTEL_INFO(dev_priv)->sseu.has_eu_pg) {
		rpcs |= INTEL_INFO(dev_priv)->sseu.eu_per_subslice <<
			GEN8_RPCS_EU_MIN_SHIFT;
		rpcs |= INTEL_INFO(dev_priv)->sseu.eu_per_subslice <<
			GEN8_RPCS_EU_MAX_SHIFT;
		rpcs |= GEN8_RPCS_ENABLE;
	}

	return rpcs;
}
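
/*
 * Editor's note (illustrative sketch, not part of the driver): make_rpcs()
 * packs *counts*, not masks, into R_PWR_CLK_STATE, which is why
 * hweight8() (a population count) is applied to the slice/subslice masks
 * first. A worked example with a made-up mask: slice_mask == 0x3 has two
 * bits set, so the field placed at GEN8_RPCS_S_CNT_SHIFT is 2, requesting
 * both slices.
 */
#if 0 /* standalone sketch, not compiled into the driver */
static u32 rpcs_slice_count_sketch(u8 slice_mask)
{
	/* hweight8(0x3) == 2: request both of the two populated slices. */
	return GEN8_RPCS_ENABLE | GEN8_RPCS_S_CNT_ENABLE |
	       (hweight8(slice_mask) << GEN8_RPCS_S_CNT_SHIFT);
}
#endif
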

static u32 intel_lr_indirect_ctx_offset(struct intel_engine_cs *engine)
{
	u32 indirect_ctx_offset;

	switch (INTEL_GEN(engine->i915)) {
	default:
		MISSING_CASE(INTEL_GEN(engine->i915));
		/* fall through */
	case 10:
		indirect_ctx_offset =
			GEN10_CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT;
		break;
	case 9:
		indirect_ctx_offset =
			GEN9_CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT;
		break;
	case 8:
		indirect_ctx_offset =
			GEN8_CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT;
		break;
	}

	return indirect_ctx_offset;
}

static void execlists_init_reg_state(u32 *regs,
				     struct i915_gem_context *ctx,
				     struct intel_engine_cs *engine,
				     struct intel_ring *ring)
{
	struct drm_i915_private *dev_priv = engine->i915;
	struct i915_hw_ppgtt *ppgtt = ctx->ppgtt ?: dev_priv->mm.aliasing_ppgtt;
	u32 base = engine->mmio_base;
	bool rcs = engine->id == RCS;

	/* A context is actually a big batch buffer with several
	 * MI_LOAD_REGISTER_IMM commands followed by (reg, value) pairs. The
	 * values we are setting here are only for the first context restore:
	 * on a subsequent save, the GPU will recreate this batchbuffer with new
	 * values (including all the missing MI_LOAD_REGISTER_IMM commands that
	 * we are not initializing here).
	 */
	regs[CTX_LRI_HEADER_0] = MI_LOAD_REGISTER_IMM(rcs ? 14 : 11) |
				 MI_LRI_FORCE_POSTED;

	CTX_REG(regs, CTX_CONTEXT_CONTROL, RING_CONTEXT_CONTROL(engine),
		_MASKED_BIT_ENABLE(CTX_CTRL_INHIBIT_SYN_CTX_SWITCH |
				   CTX_CTRL_ENGINE_CTX_RESTORE_INHIBIT |
				   (HAS_RESOURCE_STREAMER(dev_priv) ?
				    CTX_CTRL_RS_CTX_ENABLE : 0)));
	CTX_REG(regs, CTX_RING_HEAD, RING_HEAD(base), 0);
	CTX_REG(regs, CTX_RING_TAIL, RING_TAIL(base), 0);
	CTX_REG(regs, CTX_RING_BUFFER_START, RING_START(base), 0);
	CTX_REG(regs, CTX_RING_BUFFER_CONTROL, RING_CTL(base),
		RING_CTL_SIZE(ring->size) | RING_VALID);
	CTX_REG(regs, CTX_BB_HEAD_U, RING_BBADDR_UDW(base), 0);
	CTX_REG(regs, CTX_BB_HEAD_L, RING_BBADDR(base), 0);
	CTX_REG(regs, CTX_BB_STATE, RING_BBSTATE(base), RING_BB_PPGTT);
	CTX_REG(regs, CTX_SECOND_BB_HEAD_U, RING_SBBADDR_UDW(base), 0);
	CTX_REG(regs, CTX_SECOND_BB_HEAD_L, RING_SBBADDR(base), 0);
	CTX_REG(regs, CTX_SECOND_BB_STATE, RING_SBBSTATE(base), 0);
	if (rcs) {
		CTX_REG(regs, CTX_BB_PER_CTX_PTR, RING_BB_PER_CTX_PTR(base), 0);
		CTX_REG(regs, CTX_RCS_INDIRECT_CTX, RING_INDIRECT_CTX(base), 0);
		CTX_REG(regs, CTX_RCS_INDIRECT_CTX_OFFSET,
			RING_INDIRECT_CTX_OFFSET(base), 0);

		if (engine->wa_ctx.vma) {
			struct i915_ctx_workarounds *wa_ctx = &engine->wa_ctx;
			u32 ggtt_offset = i915_ggtt_offset(wa_ctx->vma);

			regs[CTX_RCS_INDIRECT_CTX + 1] =
				(ggtt_offset + wa_ctx->indirect_ctx.offset) |
				(wa_ctx->indirect_ctx.size / CACHELINE_BYTES);

			regs[CTX_RCS_INDIRECT_CTX_OFFSET + 1] =
				intel_lr_indirect_ctx_offset(engine) << 6;

			regs[CTX_BB_PER_CTX_PTR + 1] =
				(ggtt_offset + wa_ctx->per_ctx.offset) | 0x01;
		}
	}

	regs[CTX_LRI_HEADER_1] = MI_LOAD_REGISTER_IMM(9) | MI_LRI_FORCE_POSTED;

	CTX_REG(regs, CTX_CTX_TIMESTAMP, RING_CTX_TIMESTAMP(base), 0);
	/* PDP values will be assigned later if needed. */
	CTX_REG(regs, CTX_PDP3_UDW, GEN8_RING_PDP_UDW(engine, 3), 0);
	CTX_REG(regs, CTX_PDP3_LDW, GEN8_RING_PDP_LDW(engine, 3), 0);
	CTX_REG(regs, CTX_PDP2_UDW, GEN8_RING_PDP_UDW(engine, 2), 0);
	CTX_REG(regs, CTX_PDP2_LDW, GEN8_RING_PDP_LDW(engine, 2), 0);
	CTX_REG(regs, CTX_PDP1_UDW, GEN8_RING_PDP_UDW(engine, 1), 0);
	CTX_REG(regs, CTX_PDP1_LDW, GEN8_RING_PDP_LDW(engine, 1), 0);
	CTX_REG(regs, CTX_PDP0_UDW, GEN8_RING_PDP_UDW(engine, 0), 0);
	CTX_REG(regs, CTX_PDP0_LDW, GEN8_RING_PDP_LDW(engine, 0), 0);

	if (ppgtt && i915_vm_is_48bit(&ppgtt->base)) {
		/* 64b PPGTT (48bit canonical):
		 * PDP0_DESCRIPTOR contains the base address to PML4 and
		 * other PDP Descriptors are ignored.
		 */
		ASSIGN_CTX_PML4(ppgtt, regs);
	}

	if (rcs) {
		regs[CTX_LRI_HEADER_2] = MI_LOAD_REGISTER_IMM(1);
		CTX_REG(regs, CTX_R_PWR_CLK_STATE, GEN8_R_PWR_CLK_STATE,
			make_rpcs(dev_priv));

		i915_oa_init_reg_state(engine, ctx, regs);
	}
}
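
/*
 * Editor's note (illustrative sketch, not part of the driver): the
 * register state written above is a sequence of MI_LOAD_REGISTER_IMM
 * headers followed by (mmio offset, value) pairs. As I read CTX_REG in
 * intel_lrc.h, each CTX_* index names the offset slot of a pair and
 * "index + 1" is the value slot, which is why code elsewhere (e.g.
 * intel_lr_context_resume() below) pokes regs[CTX_RING_HEAD + 1] to
 * change the value restored into RING_HEAD. The sketch shows one such
 * group in isolation.
 */
#if 0 /* standalone sketch, not compiled into the driver */
static void lri_pair_layout_sketch(u32 *regs, u32 base)
{
	/* One LRI group: a header dword, then (mmio offset, value) pairs. */
	regs[CTX_LRI_HEADER_0] = MI_LOAD_REGISTER_IMM(1) | MI_LRI_FORCE_POSTED;
	regs[CTX_RING_HEAD + 0] = i915_mmio_reg_offset(RING_HEAD(base));
	regs[CTX_RING_HEAD + 1] = 0;	/* value restored into RING_HEAD */
}
#endif
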

static int
populate_lr_context(struct i915_gem_context *ctx,
		    struct drm_i915_gem_object *ctx_obj,
		    struct intel_engine_cs *engine,
		    struct intel_ring *ring)
{
	void *vaddr;
	int ret;

	ret = i915_gem_object_set_to_cpu_domain(ctx_obj, true);
	if (ret) {
		DRM_DEBUG_DRIVER("Could not set to CPU domain\n");
		return ret;
	}

	vaddr = i915_gem_object_pin_map(ctx_obj, I915_MAP_WB);
	if (IS_ERR(vaddr)) {
		ret = PTR_ERR(vaddr);
		DRM_DEBUG_DRIVER("Could not map object pages! (%d)\n", ret);
		return ret;
	}
	ctx_obj->mm.dirty = true;

	/* The second page of the context object contains some fields which
	 * must be set up prior to the first execution.
	 */
	execlists_init_reg_state(vaddr + LRC_STATE_PN * PAGE_SIZE,
				 ctx, engine, ring);

	i915_gem_object_unpin_map(ctx_obj);

	return 0;
}

static int execlists_context_deferred_alloc(struct i915_gem_context *ctx,
					    struct intel_engine_cs *engine)
{
	struct drm_i915_gem_object *ctx_obj;
	struct intel_context *ce = &ctx->engine[engine->id];
	struct i915_vma *vma;
	uint32_t context_size;
	struct intel_ring *ring;
	int ret;

	WARN_ON(ce->state);

	context_size = round_up(engine->context_size, I915_GTT_PAGE_SIZE);

	/* One extra page as the sharing data between driver and GuC */
	context_size += PAGE_SIZE * LRC_PPHWSP_PN;

	ctx_obj = i915_gem_object_create(ctx->i915, context_size);
	if (IS_ERR(ctx_obj)) {
		DRM_DEBUG_DRIVER("Alloc LRC backing obj failed.\n");
		return PTR_ERR(ctx_obj);
	}

	vma = i915_vma_instance(ctx_obj, &ctx->i915->ggtt.base, NULL);
	if (IS_ERR(vma)) {
		ret = PTR_ERR(vma);
		goto error_deref_obj;
	}

	ring = intel_engine_create_ring(engine, ctx->ring_size);
	if (IS_ERR(ring)) {
		ret = PTR_ERR(ring);
		goto error_deref_obj;
	}

	ret = populate_lr_context(ctx, ctx_obj, engine, ring);
	if (ret) {
		DRM_DEBUG_DRIVER("Failed to populate LRC: %d\n", ret);
		goto error_ring_free;
	}

	ce->ring = ring;
	ce->state = vma;
	ce->initialised |= engine->init_context == NULL;

	return 0;

error_ring_free:
	intel_ring_free(ring);
error_deref_obj:
	i915_gem_object_put(ctx_obj);
	return ret;
}

void intel_lr_context_resume(struct drm_i915_private *dev_priv)
{
	struct intel_engine_cs *engine;
	struct i915_gem_context *ctx;
	enum intel_engine_id id;

	/* Because we emit WA_TAIL_DWORDS there may be a disparity
	 * between our bookkeeping in ce->ring->head and ce->ring->tail and
	 * that stored in the context. As we only write new commands from
	 * ce->ring->tail onwards, everything before that is junk. If the GPU
	 * starts reading from its RING_HEAD from the context, it may try to
	 * execute that junk and die.
	 *
	 * So to avoid that we reset the context images upon resume. For
	 * simplicity, we just zero everything out.
	 */
	list_for_each_entry(ctx, &dev_priv->contexts.list, link) {
		for_each_engine(engine, dev_priv, id) {
			struct intel_context *ce = &ctx->engine[engine->id];
			u32 *reg;

			if (!ce->state)
				continue;

			reg = i915_gem_object_pin_map(ce->state->obj,
						      I915_MAP_WB);
			if (WARN_ON(IS_ERR(reg)))
				continue;

			reg += LRC_STATE_PN * PAGE_SIZE / sizeof(*reg);
			reg[CTX_RING_HEAD + 1] = 0;
			reg[CTX_RING_TAIL + 1] = 0;

			ce->state->obj->mm.dirty = true;
			i915_gem_object_unpin_map(ce->state->obj);

			intel_ring_reset(ce->ring, 0);
		}
	}
}
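
/*
 * Editor's note (illustrative sketch, not part of the driver): the
 * pointer arithmetic in intel_lr_context_resume() above advances the CPU
 * mapping to the start of the register state page before poking the
 * head/tail values. With the usual 4 KiB pages on the platforms this
 * driver targets, that is LRC_STATE_PN * 4096 / 4 == LRC_STATE_PN * 1024
 * dwords. The sketch below spells out the same step with an explicit
 * intermediate value.
 */
#if 0 /* standalone sketch, not compiled into the driver */
static u32 *lrc_state_dwords_sketch(void *vaddr)
{
	/* 1024 dwords per page when PAGE_SIZE is 4096 and a dword is 4 bytes. */
	const unsigned int dwords_per_page = PAGE_SIZE / sizeof(u32);

	/* regs[CTX_RING_HEAD + 1] is then the value restored into RING_HEAD. */
	return (u32 *)vaddr + LRC_STATE_PN * dwords_per_page;
}
#endif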