Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs

* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: Ensure inode allocation buffers are fully replayed
xfs: enable background pushing of the CIL
xfs: forced unmounts need to push the CIL
xfs: Introduce delayed logging core code
xfs: Delayed logging design documentation
xfs: Improve scalability of busy extent tracking
xfs: make the log ticket ID available outside the log infrastructure
xfs: clean up log ticket overrun debug output
xfs: Clean up XFS_BLI_* flag namespace
xfs: modify buffer item reference counting
xfs: allow log ticket allocation to take allocation flags
xfs: Don't reuse the same transaction ID for duplicated transactions.

+2410 -541
+816
Documentation/filesystems/xfs-delayed-logging-design.txt
XFS Delayed Logging Design
--------------------------

Introduction to Re-logging in XFS
---------------------------------

XFS logging is a combination of logical and physical logging. Some objects,
such as inodes and dquots, are logged in logical format where the details
logged are made up of the changes to in-core structures rather than on-disk
structures. Other objects - typically buffers - have their physical changes
logged. The reason for these differences is to reduce the amount of log space
required for objects that are frequently logged. Some parts of inodes are more
frequently logged than others, and inodes are typically more frequently logged
than any other object (except maybe the superblock buffer), so keeping the
amount of metadata logged low is of prime importance.

The reason that this is such a concern is that XFS allows multiple separate
modifications to a single object to be carried in the log at any given time.
This allows the log to avoid needing to flush each change to disk before
recording a new change to the object. XFS does this via a method called
"re-logging". Conceptually, this is quite simple - all it requires is that any
new change to the object is recorded with a *new copy* of all the existing
changes in the new transaction that is written to the log.
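The re-logging rule can be sketched in a few lines. This is an illustrative
model only (the list-based log and the helper name are assumptions of the
sketch, not XFS code): every new change to an object is recorded together with
a fresh copy of all the changes still held only in the log.

```python
# Toy model of XFS re-logging: each transaction for an object carries
# the aggregate of its prior in-log changes, tagged with a rising LSN.

def relog(log, in_log_changes, change, lsn):
    """Record a new change plus a copy of all prior in-log changes."""
    in_log_changes.append(change)
    log.append(("+".join(in_log_changes), lsn))

log = []        # list of (contents, LSN) transactions, oldest first
in_log = []     # changes not yet written back to disk

lsn = 0
for change in "ABCD":
    lsn += 1
    relog(log, in_log, change, lsn)

# <object written to disk>: earlier changes need no longer be carried
in_log.clear()

for change in "EF":
    lsn += 1
    relog(log, in_log, change, lsn)

assert log[3] == ("A+B+C+D", 4)   # D's transaction aggregates A..D
assert log[5] == ("E+F", 6)       # F's transaction carries only E+F
```

This reproduces the shape of the transaction table that follows: transaction D
contains A+B+C+D, while after the object is written back, F contains only E+F.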

That is, if we have a sequence of changes A through to F, and the object was
written to disk after change D, we would see in the log the following series
of transactions, their contents and the log sequence number (LSN) of the
transaction:

	Transaction	Contents	LSN
	   A		   A		   X
	   B		  A+B		  X+n
	   C		 A+B+C		 X+n+m
	   D		A+B+C+D		X+n+m+o
	    <object written to disk>
	   E		   E		   Y (> X+n+m+o)
	   F		  E+F		  Y+p

In other words, each time an object is relogged, the new transaction contains
the aggregation of all the previous changes currently held only in the log.

This relogging technique also allows objects to be moved forward in the log so
that an object being relogged does not prevent the tail of the log from ever
moving forward. This can be seen in the table above by the changing
(increasing) LSN of each subsequent transaction - the LSN is effectively a
direct encoding of the location in the log of the transaction.

This relogging is also used to implement long-running, multiple-commit
transactions. These transactions are known as rolling transactions, and require
a special log reservation known as a permanent transaction reservation. A
typical example of a rolling transaction is the removal of extents from an
inode, which can only be done at a rate of two extents per transaction because
of reservation size limitations. Hence a rolling extent removal transaction
keeps relogging the inode and btree buffers as they get modified in each
removal operation. This keeps them moving forward in the log as the operation
progresses, ensuring that the current operation never gets blocked by itself if
the log wraps around.

Hence it can be seen that the relogging operation is fundamental to the correct
working of the XFS journalling subsystem.
From the above description, most
people should be able to see why XFS metadata operations write so much to
the log - repeated operations on the same objects write the same changes to
the log over and over again. Worse is the fact that objects tend to get
dirtier as they get relogged, so each subsequent transaction is writing more
metadata into the log.

Another feature of the XFS transaction subsystem is that most transactions are
asynchronous. That is, they don't commit to disk until either a log buffer is
filled (a log buffer can hold multiple transactions) or a synchronous operation
forces the log buffers holding the transactions to disk. This means that XFS is
doing aggregation of transactions in memory - batching them, if you like - to
minimise the impact of the log IO on transaction throughput.

The limitation on asynchronous transaction throughput is the number and size of
log buffers made available by the log manager. By default there are 8 log
buffers available and the size of each is 32kB - the size can be increased up
to 256kB by use of a mount option.

Effectively, this gives us the maximum bound of outstanding metadata changes
that can be made to the filesystem at any point in time - if all the log
buffers are full and under IO, then no more transactions can be committed until
the current batch completes. It is now common for a single core of a current
CPU to be able to issue enough transactions to keep the log buffers full and
under IO permanently. Hence the XFS journalling subsystem can be considered to
be IO bound.

Delayed Logging: Concepts
-------------------------

The key thing to note about the asynchronous logging combined with the
relogging technique XFS uses is that we can be relogging changed objects
multiple times before they are committed to disk in the log buffers.
If we
return to the previous relogging example, it is entirely possible that
transactions A through D are committed to disk in the same log buffer.

That is, a single log buffer may contain multiple copies of the same object,
but only one of those copies needs to be there - the last one, "D", as it
contains all the previous changes. In other words, we have one
necessary copy in the log buffer, and three stale copies that are simply
wasting space. When we are doing repeated operations on the same set of
objects, these "stale objects" can be over 90% of the space used in the log
buffers. It is clear that reducing the number of stale objects written to the
log would greatly reduce the amount of metadata we write to the log, and this
is the fundamental goal of delayed logging.

From a conceptual point of view, XFS is already doing relogging in memory
(where memory == log buffer), only it is doing it extremely inefficiently. It
is using logical to physical formatting to do the relogging because there is no
infrastructure to keep track of logical changes in memory prior to physically
formatting the changes in a transaction to the log buffer. Hence we cannot
avoid accumulating stale objects in the log buffers.

Delayed logging is the name we've given to keeping and tracking transactional
changes to objects in memory outside the log buffer infrastructure. Because of
the relogging concept fundamental to the XFS journalling subsystem, this is
actually relatively easy to do - all the changes to logged items are already
tracked in the current infrastructure. The big problem is how to accumulate
them and get them to the log in a consistent, recoverable manner.
Describing the problems and how they have been solved is the focus of this
document.
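The space saving that delayed logging targets can be seen with a tiny model.
This sketch (names assumed for illustration, not XFS code) keeps only the most
recent copy of each object, as delayed logging effectively does in memory:

```python
# A log buffer holding four relogged copies of the same inode: only the
# last copy "A+B+C+D" is needed; the other three copies are stale.

def live_copies(buf):
    """Return only the last (most recent) copy of each object."""
    latest = {}
    for obj, changes in buf:
        latest[obj] = changes          # later copies supersede earlier
    return latest

buffer = [("inode 100", "A"), ("inode 100", "A+B"),
          ("inode 100", "A+B+C"), ("inode 100", "A+B+C+D")]

assert live_copies(buffer) == {"inode 100": "A+B+C+D"}

stale = len(buffer) - len(live_copies(buffer))
assert stale == 3                      # 75% of the copies were wasted
```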

One of the key changes that delayed logging makes to the operation of the
journalling subsystem is that it disassociates the amount of outstanding
metadata changes from the size and number of log buffers available. In other
words, instead of there only being a maximum of 2MB of transaction changes not
written to the log at any point in time, there may be a much greater amount
being accumulated in memory. Hence the potential for loss of metadata on a
crash is much greater than for the existing logging mechanism.

It should be noted that this does not change the guarantee that log recovery
will result in a consistent filesystem. What it does mean is that as far as the
recovered filesystem is concerned, there may be many thousands of transactions
that simply did not occur as a result of the crash. This makes it even more
important that applications that care about their data use fsync() where they
need to ensure application level data integrity is maintained.

It should be noted that delayed logging is not an innovative new concept that
warrants rigorous proofs to determine whether it is correct or not. The method
of accumulating changes in memory for some period before writing them to the
log is used effectively in many filesystems including ext3 and ext4. Hence
no time is spent in this document trying to convince the reader that the
concept is sound. Instead it is simply considered a "solved problem" and as
such implementing it in XFS is purely an exercise in software engineering.

The fundamental requirements for delayed logging in XFS are simple:

	1. Reduce the amount of metadata written to the log by at least
	   an order of magnitude.
	2. Supply sufficient statistics to validate Requirement #1.
	3. Supply sufficient new tracing infrastructure to be able to debug
	   problems with the new code.
	4. No on-disk format change (metadata or log format).
	5. Enable and disable with a mount option.
	6. No performance regressions for synchronous transaction workloads.

Delayed Logging: Design
-----------------------

Storing Changes

The problem with accumulating changes at a logical level (i.e. just using the
existing log item dirty region tracking) is that when it comes to writing the
changes to the log buffers, we need to ensure that the object we are formatting
is not changing while we do this. This requires locking the object to prevent
concurrent modification. Hence flushing the logical changes to the log would
require us to lock every object, format it, and then unlock it again.

This introduces lots of scope for deadlocks with transactions that are already
running. For example, a transaction has object A locked and modified, but needs
the delayed logging tracking lock to commit the transaction. However, the
flushing thread has the delayed logging tracking lock already held, and is
trying to get the lock on object A to flush it to the log buffer. This appears
to be an unsolvable deadlock condition, and it was solving this problem that
was the barrier to implementing delayed logging for so long.

The solution is relatively simple - it just took a long time to recognise it.
Put simply, the current logging code formats the changes to each item into a
vector array that points to the changed regions in the item. The log write code
simply copies the memory these vectors point to into the log buffer during
transaction commit while the item is locked in the transaction. Instead of
using the log buffer as the destination of the formatting code, we can use an
allocated memory buffer big enough to fit the formatted vector.
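One way to picture this formatting step is the following sketch (plain Python
with byte offsets standing in for pointers; the function name and structures
are assumptions, not the kernel's): the changed regions described by the
vector array are copied into one allocated buffer, and the vectors are
rewritten to reference the copy.

```python
def format_item(obj: bytes, vectors):
    """Copy each (offset, length) region of obj into one memory buffer
    and return the buffer plus vectors rewritten as buffer offsets."""
    buf = bytearray()
    new_vectors = []
    for off, length in vectors:
        new_vectors.append((len(buf), length))   # offset into buf now
        buf += obj[off:off + length]             # copy while item locked
    return bytes(buf), new_vectors

obj = b"aaaaBBccDDDDDDee"       # 'B' and 'D' runs are the dirty regions
vectors = [(4, 2), (8, 6)]      # two changed regions in the object

buf, new_vecs = format_item(obj, vectors)

assert buf == b"BBDDDDDD"            # regions packed V1 then V2
assert new_vecs == [(0, 2), (2, 6)]  # vectors now describe the copy
# The object can change or be unlocked without affecting buf.
```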

If we then copy the vector into the memory buffer and rewrite the vector to
point to the memory buffer rather than the object itself, we now have a copy of
the changes in a format that is compatible with the log buffer writing code
and that does not require us to lock the item to access. This formatting and
rewriting can all be done while the object is locked during transaction commit,
resulting in a vector that is transactionally consistent and can be accessed
without needing to lock the owning item.

Hence we avoid the need to lock items when we need to flush outstanding
asynchronous transactions to the log. The differences between the existing
formatting method and the delayed logging formatting can be seen in the
diagram below.

Current format log vector:

	Object    +---------------------------------------------+
	Vector 1      +----+
	Vector 2                    +----+
	Vector 3                                 +----------+

After formatting:

	Log Buffer                +-V1-+-V2-+----V3----+

Delayed logging vector:

	Object    +---------------------------------------------+
	Vector 1      +----+
	Vector 2                    +----+
	Vector 3                                 +----------+

After formatting:

	Memory Buffer             +-V1-+-V2-+----V3----+
	Vector 1                  +----+
	Vector 2                       +----+
	Vector 3                            +----------+

The memory buffer and associated vector need to be passed as a single object,
but still need to be associated with the parent object so that if the object is
relogged we can replace the current memory buffer with a new memory buffer that
contains the latest changes.

The reason for keeping the vector around after we've formatted the memory
buffer is to support splitting vectors across log buffer boundaries correctly.
If we don't keep the vector around, we do not know where the region boundaries
are in the item, so we'd need a new encapsulation method for regions in the log
buffer writing (i.e. double encapsulation). This would be an on-disk format
change and as such is not desirable. It also means we'd have to write the log
region headers in the formatting stage, which is problematic as there is
per-region state that needs to be placed into the headers during the log write.

Hence we need to keep the vector, but by attaching the memory buffer to it and
rewriting the vector addresses to point at the memory buffer we end up with a
self-describing object that can be passed to the log buffer write code to be
handled in exactly the same manner as the existing log vectors are handled.
Hence we avoid needing a new on-disk format to handle items that have been
relogged in memory.


Tracking Changes

Now that we can record transactional changes in memory in a form that allows
them to be used without limitations, we need to be able to track and accumulate
them so that they can be written to the log at some later point in time. The
log item is the natural place to store this vector and buffer, and it also
makes sense for it to be the object used to track committed objects, as it will
always exist once the object has been included in a transaction.

The log item is already used to track the log items that have been written to
the log but not yet written to disk. Such log items are considered "active"
and as such are stored in the Active Item List (AIL), which is an LSN-ordered
doubly linked list. Items are inserted into this list during log buffer IO
completion, after which they are unpinned and can be written to disk.
An object
that is in the AIL can be relogged, which causes the object to be pinned again
and then moved forward in the AIL when the log buffer IO completes for that
transaction.

Essentially, this shows that an item that is in the AIL can still be modified
and relogged, so any tracking must be separate to the AIL infrastructure. As
such, we cannot reuse the AIL list pointers for tracking committed items, nor
can we store state in any field that is protected by the AIL lock. Hence the
committed item tracking needs its own locks, lists and state fields in the log
item.

Similar to the AIL, tracking of committed items is done through a new list
called the Committed Item List (CIL). The list tracks log items that have been
committed and have formatted memory buffers attached to them. It tracks objects
in transaction commit order, so when an object is relogged it is removed from
its place in the list and re-inserted at the tail. This is entirely arbitrary
and done to make debugging easy - the last items in the list are the
ones that are most recently modified. Ordering of the CIL is not necessary for
transactional integrity (as discussed in the next section), so the ordering is
done for the convenience/sanity of the developers.


Delayed Logging: Checkpoints

When we have a log synchronisation event, commonly known as a "log force",
all the items in the CIL must be written into the log via the log buffers.
We need to write these items in the order that they exist in the CIL, and they
need to be written as an atomic transaction. The need for all the objects to be
written as an atomic transaction comes from the requirements of relogging and
log replay - all the changes in all the objects in a given transaction must
either be completely replayed during log recovery, or not replayed at all.
If 289 + a transaction is not replayed because it is not complete in the log, then 290 + no later transactions should be replayed, either. 291 + 292 + To fulfill this requirement, we need to write the entire CIL in a single log 293 + transaction. Fortunately, the XFS log code has no fixed limit on the size of a 294 + transaction, nor does the log replay code. The only fundamental limit is that 295 + the transaction cannot be larger than just under half the size of the log. The 296 + reason for this limit is that to find the head and tail of the log, there must 297 + be at least one complete transaction in the log at any given time. If a 298 + transaction is larger than half the log, then there is the possibility that a 299 + crash during the write of a such a transaction could partially overwrite the 300 + only complete previous transaction in the log. This will result in a recovery 301 + failure and an inconsistent filesystem and hence we must enforce the maximum 302 + size of a checkpoint to be slightly less than a half the log. 303 + 304 + Apart from this size requirement, a checkpoint transaction looks no different 305 + to any other transaction - it contains a transaction header, a series of 306 + formatted log items and a commit record at the tail. From a recovery 307 + perspective, the checkpoint transaction is also no different - just a lot 308 + bigger with a lot more items in it. The worst case effect of this is that we 309 + might need to tune the recovery transaction object hash size. 310 + 311 + Because the checkpoint is just another transaction and all the changes to log 312 + items are stored as log vectors, we can use the existing log buffer writing 313 + code to write the changes into the log. To do this efficiently, we need to 314 + minimise the time we hold the CIL locked while writing the checkpoint 315 + transaction. 
The current log write code enables us to do this easily with the
way it separates the writing of the transaction contents (the log vectors) from
the transaction commit record, but tracking this requires us to have a
per-checkpoint context that travels through the log write process through to
checkpoint completion.

Hence a checkpoint has a context that tracks the state of the current
checkpoint from initiation to checkpoint completion. A new context is initiated
at the same time a checkpoint transaction is started. That is, when we remove
all the current items from the CIL during a checkpoint operation, we move all
those changes into the current checkpoint context. We then initialise a new
context and attach that to the CIL for aggregation of new transactions.

This allows us to unlock the CIL immediately after transfer of all the
committed items and effectively allows new transactions to be issued while we
are formatting the checkpoint into the log. It also allows concurrent
checkpoints to be written into the log buffers in the case of log force heavy
workloads, just like the existing transaction commit code does. This, however,
requires that we strictly order the commit records in the log so that
checkpoint sequence order is maintained during log replay.

To ensure that we can be writing an item into a checkpoint transaction at
the same time another transaction modifies the item and inserts the log item
into the new CIL, the checkpoint transaction commit code cannot use log items
to store the list of log vectors that need to be written into the transaction.
Hence log vectors need to be able to be chained together to allow them to be
detached from the log items.
That is, when the CIL is flushed, the memory
buffer and log vector attached to each log item need to be attached to the
checkpoint context so that the log item can be released. In diagrammatic form,
the CIL would look like this before the flush:

	CIL Head
	   |
	   V
	Log Item <-> log vector 1	-> memory buffer
	   |				-> vector array
	   V
	Log Item <-> log vector 2	-> memory buffer
	   |				-> vector array
	   V
	......
	   |
	   V
	Log Item <-> log vector N-1	-> memory buffer
	   |				-> vector array
	   V
	Log Item <-> log vector N	-> memory buffer
					-> vector array

And after the flush the CIL head is empty, and the checkpoint context log
vector list would look like:

	Checkpoint Context
	   |
	   V
	log vector 1		-> memory buffer
	   |			-> vector array
	   |			-> Log Item
	   V
	log vector 2		-> memory buffer
	   |			-> vector array
	   |			-> Log Item
	   V
	......
	   |
	   V
	log vector N-1		-> memory buffer
	   |			-> vector array
	   |			-> Log Item
	   V
	log vector N		-> memory buffer
				-> vector array
				-> Log Item

Once this transfer is done, the CIL can be unlocked and new transactions can
start, while the checkpoint flush code works over the log vector chain to
commit the checkpoint.

Once the checkpoint is written into the log buffers, the checkpoint context is
attached to the log buffer that the commit record was written to, along with a
completion callback. Log IO completion will call that callback, which can then
run transaction committed processing for the log items (i.e. insert into the
AIL and unpin) in the log vector chain and then free the log vector chain and
checkpoint context.
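The CIL-to-context handoff shown in the diagrams can be modelled briefly. In
this sketch (the structures and names are illustrative assumptions, not the
kernel implementation), flushing moves each item's vector onto the checkpoint
context in commit order, breaks the item-to-vector link and empties the CIL:

```python
class LogItem:
    def __init__(self, name, vector):
        self.name, self.vector = name, vector

def flush_cil(cil, sequence):
    """Move all vectors from the CIL onto a new checkpoint context."""
    ctx = {"sequence": sequence, "vectors": []}
    for item in cil:
        ctx["vectors"].append(item.vector)   # chain vector onto context
        item.vector = None                   # break item -> vector link
    cil.clear()                              # CIL empty; items released
    return ctx

a = LogItem("inode", "vector 1")
b = LogItem("btree buffer", "vector 2")
cil = [a, b]

ctx = flush_cil(cil, sequence=1)

assert ctx["vectors"] == ["vector 1", "vector 2"]  # commit order kept
assert cil == [] and a.vector is None   # items free to be relogged
```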

Discussion Point: I am uncertain as to whether the log item is the most
efficient way to track vectors, even though it seems like the natural way to do
it. The fact that we walk the log items (in the CIL) just to chain the log
vectors and break the link between the log item and the log vector means that
we take a cache line hit for the log item list modification, then another for
the log vector chaining. If we track by the log vectors, then we only need to
break the link between the log item and the log vector, which means we should
dirty only the log item cachelines. Normally I wouldn't be concerned about one
vs two dirty cachelines except for the fact I've seen upwards of 80,000 log
vectors in one checkpoint transaction. I'd guess this is a "measure and
compare" situation that can be done after a working and reviewed implementation
is in the dev tree....

Delayed Logging: Checkpoint Sequencing

One of the key aspects of the XFS transaction subsystem is that it tags
committed transactions with the log sequence number of the transaction commit.
This allows transactions to be issued asynchronously even though there may be
future operations that cannot be completed until that transaction is fully
committed to the log. In the rare case that a dependent operation occurs (e.g.
re-using a freed metadata extent for a data extent), a special, optimised log
force can be issued to force the dependent transaction to disk immediately.

To do this, transactions need to record the LSN of the commit record of the
transaction. This LSN comes directly from the log buffer the transaction is
written into. While this works just fine for the existing transaction
mechanism, it does not work for delayed logging because transactions are not
written directly into the log buffers.
Hence some other method of sequencing
transactions is required.

As discussed in the checkpoint section, delayed logging uses per-checkpoint
contexts, and as such it is simple to assign a sequence number to each
checkpoint. Because the switching of checkpoint contexts must be done
atomically, it is simple to ensure that each new context has a monotonically
increasing sequence number assigned to it without the need for an external
atomic counter - we can just take the current context sequence number and add
one to it for the new context.

Then, instead of assigning a log buffer LSN to the transaction commit LSN
during the commit, we can assign the current checkpoint sequence. This allows
operations that track transactions that have not yet completed to know what
checkpoint sequence needs to be committed before they can continue. As a
result, the code that forces the log to a specific LSN now needs to ensure that
the log forces to a specific checkpoint.

To ensure that we can do this, we need to track all the checkpoint contexts
that are currently committing to the log. When we flush a checkpoint, the
context gets added to a "committing" list which can be searched. When a
checkpoint commit completes, it is removed from the committing list. Because
the checkpoint context records the LSN of the commit record for the checkpoint,
we can also wait on the log buffer that contains the commit record, thereby
using the existing log force mechanisms to execute synchronous forces.

It should be noted that the synchronous forces may need to be extended with
mitigation algorithms similar to the current log buffer code to allow
aggregation of multiple synchronous transactions if there are already
synchronous transactions being flushed.
Investigation of the performance of the
current design is needed before making any decisions here.

The main concern with log forces is to ensure that all the previous checkpoints
are also committed to disk before the one we need to wait for. Therefore we
need to check that all the prior contexts in the committing list are also
complete before waiting on the one we need to complete. We do this
synchronisation in the log force code so that we don't need to wait anywhere
else for such serialisation - it only matters when we do a log force.

The only remaining complexity is that a log force now also has to handle the
case where the forcing sequence number is the same as the current context. That
is, we need to flush the CIL and potentially wait for it to complete. This is a
simple addition to the existing log forcing code to check the sequence numbers
and push if required. Indeed, placing the current sequence checkpoint flush in
the log force code enables the current mechanism for issuing synchronous
transactions to remain untouched (i.e. commit an asynchronous transaction, then
force the log at the LSN of that transaction) and so the higher level code
behaves the same regardless of whether delayed logging is being used or not.

Delayed Logging: Checkpoint Log Space Accounting

The big issue for a checkpoint transaction is the log space reservation for the
transaction. We don't know ahead of time how big a checkpoint transaction is
going to be, nor how many log buffers it will take to write out, nor the number
of split log vector regions that are going to be used. We can track the amount
of log space required as we add items to the commit item list, but we still
need to reserve the space in the log for the checkpoint.

A typical transaction reserves enough space in the log for the worst case space
usage of the transaction. The reservation accounts for log record headers,
transaction and region headers, headers for split regions, buffer tail padding,
etc. as well as the actual space for all the changed metadata in the
transaction. While some of this is fixed overhead, much of it is dependent on
the size of the transaction and the number of regions being logged (the number
of log vectors in the transaction).

An example of the differences would be logging directory changes versus logging
inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then
there are lots of transactions that only contain an inode core and an inode log
format structure. That is, two vectors totalling roughly 150 bytes. If we
modify 10,000 inodes, we have about 1.5MB of metadata to write in 20,000
vectors. Each vector is 12 bytes, so the total to be logged is approximately
1.75MB. In comparison, if we are logging full directory buffers, they are
typically 4KB each, so in 1.5MB of directory buffers we'd have roughly 400
buffers and a buffer format structure for each buffer - roughly 800 vectors or
1.51MB total space. From this, it should be obvious that a static log space
reservation is not particularly flexible and that it is difficult to select the
"optimal value" for all workloads.

Further, if we are going to use a static reservation, which bit of the entire
reservation does it cover? We account for space used by the transaction
reservation by tracking the space currently used by the object in the CIL and
then calculating the increase or decrease in space used as the object is
relogged. This allows for a checkpoint reservation to only have to account for
log buffer metadata used, such as log header records.
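This delta-based accounting can be sketched as follows (the variable and
function names are assumptions for illustration, not the kernel's): each object
tracks the space it currently consumes in the CIL, and only the change in that
space is charged to the committing transaction.

```python
cil_space = {"total": 0}   # running log space used by all CIL items
item_space = {}            # space currently accounted per object

def account(obj, new_size, tx_reservation):
    """Charge the change in an object's formatted size to the
    committing transaction; return its remaining reservation."""
    delta = new_size - item_space.get(obj, 0)
    item_space[obj] = new_size
    cil_space["total"] += delta
    return tx_reservation - delta

# First commit of the inode: the full 150 bytes are charged.
res = account("inode 100", 150, tx_reservation=1000)
assert res == 850 and cil_space["total"] == 150

# Relogging the now-dirtier inode charges only the 50-byte growth,
# since the earlier 150 bytes are already accounted for in the CIL.
res = account("inode 100", 200, tx_reservation=1000)
assert res == 950 and cil_space["total"] == 200
```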

However, even using a static reservation for just the log metadata is
problematic. Typically log record headers use at least 16KB of log space per
1MB of log space consumed (512 bytes per 32KB), and the reservation needs to be
large enough to handle arbitrarily sized checkpoint transactions. This
reservation needs to be made before the checkpoint is started, and we need to
be able to reserve the space without sleeping. For an 8MB checkpoint, we need a
reservation of around 150KB, which is a non-trivial amount of space.

A static reservation needs to manipulate the log grant counters - we can take a
permanent reservation on the space, but we still need to make sure we refresh
the write reservation (the actual space available to the transaction) after
every checkpoint transaction completion. Unfortunately, if this space is not
available when required, then the regrant code will sleep waiting for it.

The problem with this is that it can lead to deadlocks as we may need to commit
checkpoints to be able to free up log space (refer back to the description of
rolling transactions for an example of this). Hence we *must* always have
space available in the log if we are to use static reservations, and that is
very difficult and complex to arrange. It is possible to do, but there is a
simpler way.

The simpler way of doing this is tracking the entire log space used by the
items in the CIL and using this to dynamically calculate the amount of log
space required by the log metadata. If this log metadata space changes as a
result of a transaction commit inserting a new memory buffer into the CIL, then
the difference in space required is removed from the transaction that causes
the change.
Transactions at this level will *always* have enough space 540 + available in their reservation for this as they have already reserved the 541 + maximal amount of log metadata space they require, and such a delta reservation 542 + will always be less than or equal to the maximal amount in the reservation. 543 + 544 + Hence we can grow the checkpoint transaction reservation dynamically as items 545 + are added to the CIL and avoid the need for reserving and regranting log space 546 + up front. This avoids deadlocks and removes a blocking point from the 547 + checkpoint flush code. 548 + 549 + As mentioned earlier, transactions can't grow to more than half the size of the 550 + log. Hence, as part of growing the reservation, we also need to check the size 551 + of the reservation against the maximum allowed transaction size. If we reach 552 + the maximum threshold, we need to push the CIL to the log. This is effectively 553 + a "background flush" and is done on demand. This is identical to 554 + a CIL push triggered by a log force, except that there is no waiting for the 555 + checkpoint commit to complete. This background push is checked and executed by 556 + the transaction commit code. 557 + 558 + If the transaction subsystem goes idle while we still have items in the CIL, 559 + they will be flushed by the periodic log force issued by the xfssyncd. This log 560 + force will push the CIL to disk, and if the transaction subsystem stays idle, 561 + allow the idle log to be covered (effectively marked clean) in exactly the same 562 + manner that is done for the existing logging method. A discussion point is 563 + whether this log force needs to be done more frequently than the current rate, 564 + which is once every 30s. 565 + 566 + 567 + Delayed Logging: Log Item Pinning 568 + 569 + Currently log items are pinned during transaction commit while the items are 570 + still locked.
This happens just after the items are formatted, though it could 571 + be done any time before the items are unlocked. The result of this mechanism is 572 + that items get pinned once for every transaction that is committed to the log 573 + buffers. Hence items that are relogged in the log buffers will have a pin count 574 + for every outstanding transaction they were dirtied in. When each of these 575 + transactions is completed, it will unpin the item once. As a result, the item 576 + only becomes unpinned when all the transactions complete and there are no 577 + pending transactions. Thus the pinning and unpinning of a log item is symmetric 578 + as there is a 1:1 relationship between transaction commit and log item completion. 579 + 580 + For delayed logging, however, we have an asymmetric transaction commit to 581 + completion relationship. Every time an object is relogged in the CIL it goes 582 + through the commit process without a corresponding completion being registered. 583 + That is, we now have a many-to-one relationship between transaction commit and 584 + log item completion. The result of this is that pinning and unpinning of the 585 + log items becomes unbalanced if we retain the "pin on transaction commit, unpin 586 + on transaction completion" model. 587 + 588 + To keep pin/unpin symmetry, the algorithm needs to change to a "pin on 589 + insertion into the CIL, unpin on checkpoint completion". In other words, the 590 + pinning and unpinning becomes symmetric around a checkpoint context. We have to 591 + pin the object the first time it is inserted into the CIL - if it is already in 592 + the CIL during a transaction commit, then we do not pin it again. Because there 593 + can be multiple outstanding checkpoint contexts, we can still see elevated pin 594 + counts, but as each checkpoint completes the pin count will retain the correct 595 + value according to its context.
596 + 597 + Just to make matters slightly more complex, this checkpoint-level context 598 + for the pin count means that the pinning of an item must take place under the 599 + CIL commit/flush lock. If we pin the object outside this lock, we cannot 600 + guarantee which context the pin count is associated with. This is because 601 + pinning the item is dependent on whether the item is present in the 602 + current CIL or not. If we don't lock the CIL first before we check and pin the 603 + object, we have a race with the CIL being flushed between the check and the pin 604 + (or not pinning, as the case may be). Hence we must hold the CIL flush/commit 605 + lock to guarantee that we pin the items correctly. 606 + 607 + Delayed Logging: Concurrent Scalability 608 + 609 + A fundamental requirement for the CIL is that accesses through transaction 610 + commits must scale to many concurrent commits. The current transaction commit 611 + code does not break down even when there are transactions coming from 2048 612 + processors at once. The current transaction code does not go any faster than if 613 + there were only one CPU using it, but it does not slow down either. 614 + 615 + As a result, the delayed logging transaction commit code needs to be designed 616 + for concurrency from the ground up. It is obvious that there are serialisation 617 + points in the design - the three important ones are: 618 + 619 + 1. Locking out new transaction commits while flushing the CIL 620 + 2. Adding items to the CIL and updating item space accounting 621 + 3. Checkpoint commit ordering 622 + 623 + Looking at the transaction commit and CIL flushing interactions, it is clear 624 + that we have a many-to-one interaction here. That is, the only restriction on 625 + the number of concurrent transactions that can be trying to commit at once is 626 + the amount of space available in the log for their reservations.
The practical 627 + limit here is in the order of several hundred concurrent transactions for a 628 + 128MB log, which means that it is generally one per CPU in a machine. 629 + 630 + A transaction commit needs to hold out a flush for a relatively 631 + long period of time - the pinning of log items needs to be done 632 + while we are holding out a CIL flush, so at the moment that means it is held 633 + across the formatting of the objects into memory buffers (i.e. while memcpy()s 634 + are in progress). Ultimately a two-pass algorithm where the formatting is done 635 + separately to the pinning of objects could be used to reduce the hold time of 636 + the transaction commit side. 637 + 638 + Because of the number of potential transaction commit side holders, the lock 639 + really needs to be a sleeping lock - if the CIL flush takes the lock, we do not 640 + want every other CPU in the machine spinning on the CIL lock. Given that 641 + flushing the CIL could involve walking a list of tens of thousands of log 642 + items, it will get held for a significant time and so spin contention is a 643 + real concern. Preventing lots of CPUs spinning doing nothing is the 644 + main reason for choosing a sleeping lock even though nothing in either the 645 + transaction commit or CIL flush side sleeps with the lock held. 646 + 647 + It should also be noted that CIL flushing is a relatively rare operation 648 + compared to transaction commit for asynchronous transaction workloads - only 649 + time will tell if using a read-write semaphore for exclusion will limit 650 + transaction commit concurrency due to cache line bouncing of the lock on the 651 + read side. 652 + 653 + The second serialisation point is on the transaction commit side where items 654 + are inserted into the CIL. Because transactions can enter this code 655 + concurrently, the CIL needs to be protected separately from the above 656 + commit/flush exclusion.
It also needs to be an exclusive lock but it is only 657 + held for a very short time and so a spin lock is appropriate here. It is 658 + possible that this lock will become a contention point, but given the short 659 + hold time once per transaction, I think that contention is unlikely. 660 + 661 + The final serialisation point is the checkpoint commit record ordering code 662 + that is run as part of the checkpoint commit and log force sequencing. The code 663 + path that triggers a CIL flush (i.e. whatever triggers the log force) will enter 664 + an ordering loop after writing all the log vectors into the log buffers but 665 + before writing the commit record. This loop walks the list of committing 666 + checkpoints and needs to block waiting for checkpoints to complete their commit 667 + record write. As a result it needs a lock and a wait variable. Log force 668 + sequencing also requires the same lock, list walk, and blocking mechanism to 669 + ensure completion of checkpoints. 670 + 671 + These two sequencing operations can use the same mechanism even though the 672 + events they are waiting for are different. The checkpoint commit record 673 + sequencing needs to wait until checkpoint contexts contain a commit LSN 674 + (obtained through completion of a commit record write) while log force 675 + sequencing needs to wait until previous checkpoint contexts are removed from 676 + the committing list (i.e. they've completed). A simple wait variable and 677 + broadcast wakeups (thundering herds) have been used to implement these two 678 + serialisation queues. They use the same lock as the CIL, too. If we see too 679 + much contention on the CIL lock, or too many context switches as a result of 680 + the broadcast wakeups, these operations can be put under a new spinlock and 681 + given separate wait lists to reduce lock contention and the number of processes 682 + woken by the wrong event.
683 + 684 + 685 + Lifecycle Changes 686 + 687 + The existing log item life cycle is as follows: 688 + 689 + 1. Transaction allocate 690 + 2. Transaction reserve 691 + 3. Lock item 692 + 4. Join item to transaction 693 + If not already attached, 694 + Allocate log item 695 + Attach log item to owner item 696 + Attach log item to transaction 697 + 5. Modify item 698 + Record modifications in log item 699 + 6. Transaction commit 700 + Pin item in memory 701 + Format item into log buffer 702 + Write commit LSN into transaction 703 + Unlock item 704 + Attach transaction to log buffer 705 + 706 + <log buffer IO dispatched> 707 + <log buffer IO completes> 708 + 709 + 7. Transaction completion 710 + Mark log item committed 711 + Insert log item into AIL 712 + Write commit LSN into log item 713 + Unpin log item 714 + 8. AIL traversal 715 + Lock item 716 + Mark log item clean 717 + Flush item to disk 718 + 719 + <item IO completion> 720 + 721 + 9. Log item removed from AIL 722 + Moves log tail 723 + Item unlocked 724 + 725 + Essentially, steps 1-6 operate independently from step 7, which is also 726 + independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9 727 + at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur 728 + at the same time. If the log item is in the AIL or between steps 6 and 7 729 + and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 730 + are entered and completed is the object considered clean. 731 + 732 + With delayed logging, there are new steps inserted into the life cycle: 733 + 734 + 1. Transaction allocate 735 + 2. Transaction reserve 736 + 3. Lock item 737 + 4. Join item to transaction 738 + If not already attached, 739 + Allocate log item 740 + Attach log item to owner item 741 + Attach log item to transaction 742 + 5. Modify item 743 + Record modifications in log item 744 + 6. 
Transaction commit 745 + Pin item in memory if not pinned in CIL 746 + Format item into log vector + buffer 747 + Attach log vector and buffer to log item 748 + Insert log item into CIL 749 + Write CIL context sequence into transaction 750 + Unlock item 751 + 752 + <next log force> 753 + 754 + 7. CIL push 755 + lock CIL flush 756 + Chain log vectors and buffers together 757 + Remove items from CIL 758 + unlock CIL flush 759 + write log vectors into log 760 + sequence commit records 761 + attach checkpoint context to log buffer 762 + 763 + <log buffer IO dispatched> 764 + <log buffer IO completes> 765 + 766 + 8. Checkpoint completion 767 + Mark log item committed 768 + Insert item into AIL 769 + Write commit LSN into log item 770 + Unpin log item 771 + 9. AIL traversal 772 + Lock item 773 + Mark log item clean 774 + Flush item to disk 775 + <item IO completion> 776 + 10. Log item removed from AIL 777 + Moves log tail 778 + Item unlocked 779 + 780 + From this, it can be seen that the only life cycle differences between the two 781 + logging methods are in the middle of the life cycle - they still have the same 782 + beginning and end and execution constraints. The differences are in the 783 + committing of the log items to the log itself and in the completion processing. 784 + Hence delayed logging should not introduce any constraints on log item 785 + behaviour, allocation or freeing that don't already exist. 786 + 787 + As a result of this zero-impact "insertion" of delayed logging infrastructure 788 + and the design of the internal structures to avoid on-disk format changes, we 789 + can basically switch between delayed logging and the existing mechanism with a 790 + mount option. Fundamentally, there is no reason why the log manager would not 791 + be able to swap methods automatically and transparently depending on load 792 + characteristics, but this should not be necessary if delayed logging works as 793 + designed.
794 + 795 + Roadmap: 796 + 797 + 2.6.35 Inclusion in mainline as an experimental mount option 798 + => approximately 2-3 months to merge window 799 + => needs to be in xfs-dev tree in 4-6 weeks 800 + => code is nearing readiness for review 801 + 802 + 2.6.37 Remove experimental tag from mount option 803 + => should be roughly 6 months after initial merge 804 + => enough time to: 805 + => gain confidence and fix problems reported by early 806 + adopters (a.k.a. guinea pigs) 807 + => address worst performance regressions and undesired 808 + behaviours 809 + => start tuning/optimising code for parallelism 810 + => start tuning/optimising algorithms consuming 811 + excessive CPU time 812 + 813 + 2.6.39 Switch default mount option to use delayed logging 814 + => should be roughly 12 months after initial merge 815 + => enough time to shake out remaining problems before next round of 816 + enterprise distro kernel rebases
+1
fs/xfs/Makefile
··· 77 77 xfs_itable.o \ 78 78 xfs_dfrag.o \ 79 79 xfs_log.o \ 80 + xfs_log_cil.o \ 80 81 xfs_log_recover.o \ 81 82 xfs_mount.o \ 82 83 xfs_mru_cache.o \
+9
fs/xfs/linux-2.6/xfs_buf.c
··· 37 37 38 38 #include "xfs_sb.h" 39 39 #include "xfs_inum.h" 40 + #include "xfs_log.h" 40 41 #include "xfs_ag.h" 41 42 #include "xfs_dmapi.h" 42 43 #include "xfs_mount.h" ··· 851 850 * Note that this in no way locks the underlying pages, so it is only 852 851 * useful for synchronizing concurrent use of buffer objects, not for 853 852 * synchronizing independent access to the underlying pages. 853 + * 854 + * If we come across a stale, pinned, locked buffer, we know that we 855 + * are being asked to lock a buffer that has been reallocated. Because 856 + * it is pinned, we know that the log has not been pushed to disk and 857 + * hence it will still be locked. Rather than sleeping until someone 858 + * else pushes the log, push it ourselves before trying to get the lock. 854 859 */ 855 860 void 856 861 xfs_buf_lock( ··· 864 857 { 865 858 trace_xfs_buf_lock(bp, _RET_IP_); 866 859 860 + if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE)) 861 + xfs_log_force(bp->b_mount, 0); 867 862 if (atomic_read(&bp->b_io_remaining)) 868 863 blk_run_address_space(bp->b_target->bt_mapping); 869 864 down(&bp->b_sema);
+1
fs/xfs/linux-2.6/xfs_quotaops.c
··· 19 19 #include "xfs_dmapi.h" 20 20 #include "xfs_sb.h" 21 21 #include "xfs_inum.h" 22 + #include "xfs_log.h" 22 23 #include "xfs_ag.h" 23 24 #include "xfs_mount.h" 24 25 #include "xfs_quota.h"
+11 -1
fs/xfs/linux-2.6/xfs_super.c
··· 119 119 #define MNTOPT_DMAPI "dmapi" /* DMI enabled (DMAPI / XDSM) */ 120 120 #define MNTOPT_XDSM "xdsm" /* DMI enabled (DMAPI / XDSM) */ 121 121 #define MNTOPT_DMI "dmi" /* DMI enabled (DMAPI / XDSM) */ 122 + #define MNTOPT_DELAYLOG "delaylog" /* Delayed logging enabled */ 123 + #define MNTOPT_NODELAYLOG "nodelaylog" /* Delayed logging disabled */ 122 124 123 125 /* 124 126 * Table driven mount option parser. ··· 376 374 mp->m_flags |= XFS_MOUNT_DMAPI; 377 375 } else if (!strcmp(this_char, MNTOPT_DMI)) { 378 376 mp->m_flags |= XFS_MOUNT_DMAPI; 377 + } else if (!strcmp(this_char, MNTOPT_DELAYLOG)) { 378 + mp->m_flags |= XFS_MOUNT_DELAYLOG; 379 + cmn_err(CE_WARN, 380 + "Enabling EXPERIMENTAL delayed logging feature " 381 + "- use at your own risk.\n"); 382 + } else if (!strcmp(this_char, MNTOPT_NODELAYLOG)) { 383 + mp->m_flags &= ~XFS_MOUNT_DELAYLOG; 379 384 } else if (!strcmp(this_char, "ihashsize")) { 380 385 cmn_err(CE_WARN, 381 386 "XFS: ihashsize no longer used, option is deprecated."); ··· 544 535 { XFS_MOUNT_FILESTREAMS, "," MNTOPT_FILESTREAM }, 545 536 { XFS_MOUNT_DMAPI, "," MNTOPT_DMAPI }, 546 537 { XFS_MOUNT_GRPID, "," MNTOPT_GRPID }, 538 + { XFS_MOUNT_DELAYLOG, "," MNTOPT_DELAYLOG }, 547 539 { 0, NULL } 548 540 }; 549 541 static struct proc_xfs_info xfs_info_unset[] = { ··· 1765 1755 * but it is much faster. 1766 1756 */ 1767 1757 xfs_buf_item_zone = kmem_zone_init((sizeof(xfs_buf_log_item_t) + 1768 - (((XFS_MAX_BLOCKSIZE / XFS_BLI_CHUNK) / 1758 + (((XFS_MAX_BLOCKSIZE / XFS_BLF_CHUNK) / 1769 1759 NBWORD) * sizeof(int))), "xfs_buf_item"); 1770 1760 if (!xfs_buf_item_zone) 1771 1761 goto out_destroy_trans_zone;
+70 -41
fs/xfs/linux-2.6/xfs_trace.h
··· 1059 1059 1060 1060 ); 1061 1061 1062 + #define XFS_BUSY_SYNC \ 1063 + { 0, "async" }, \ 1064 + { 1, "sync" } 1065 + 1062 1066 TRACE_EVENT(xfs_alloc_busy, 1063 - TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno, 1064 - xfs_extlen_t len, int slot), 1065 - TP_ARGS(mp, agno, agbno, len, slot), 1067 + TP_PROTO(struct xfs_trans *trans, xfs_agnumber_t agno, 1068 + xfs_agblock_t agbno, xfs_extlen_t len, int sync), 1069 + TP_ARGS(trans, agno, agbno, len, sync), 1070 + TP_STRUCT__entry( 1071 + __field(dev_t, dev) 1072 + __field(struct xfs_trans *, tp) 1073 + __field(int, tid) 1074 + __field(xfs_agnumber_t, agno) 1075 + __field(xfs_agblock_t, agbno) 1076 + __field(xfs_extlen_t, len) 1077 + __field(int, sync) 1078 + ), 1079 + TP_fast_assign( 1080 + __entry->dev = trans->t_mountp->m_super->s_dev; 1081 + __entry->tp = trans; 1082 + __entry->tid = trans->t_ticket->t_tid; 1083 + __entry->agno = agno; 1084 + __entry->agbno = agbno; 1085 + __entry->len = len; 1086 + __entry->sync = sync; 1087 + ), 1088 + TP_printk("dev %d:%d trans 0x%p tid 0x%x agno %u agbno %u len %u %s", 1089 + MAJOR(__entry->dev), MINOR(__entry->dev), 1090 + __entry->tp, 1091 + __entry->tid, 1092 + __entry->agno, 1093 + __entry->agbno, 1094 + __entry->len, 1095 + __print_symbolic(__entry->sync, XFS_BUSY_SYNC)) 1096 + 1097 + ); 1098 + 1099 + TRACE_EVENT(xfs_alloc_unbusy, 1100 + TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, 1101 + xfs_agblock_t agbno, xfs_extlen_t len), 1102 + TP_ARGS(mp, agno, agbno, len), 1066 1103 TP_STRUCT__entry( 1067 1104 __field(dev_t, dev) 1068 1105 __field(xfs_agnumber_t, agno) 1069 1106 __field(xfs_agblock_t, agbno) 1070 1107 __field(xfs_extlen_t, len) 1071 - __field(int, slot) 1072 1108 ), 1073 1109 TP_fast_assign( 1074 1110 __entry->dev = mp->m_super->s_dev; 1075 1111 __entry->agno = agno; 1076 1112 __entry->agbno = agbno; 1077 1113 __entry->len = len; 1078 - __entry->slot = slot; 1079 1114 ), 1080 - TP_printk("dev %d:%d agno %u agbno %u len %u slot 
%d", 1115 + TP_printk("dev %d:%d agno %u agbno %u len %u", 1081 1116 MAJOR(__entry->dev), MINOR(__entry->dev), 1082 1117 __entry->agno, 1083 1118 __entry->agbno, 1084 - __entry->len, 1085 - __entry->slot) 1086 - 1119 + __entry->len) 1087 1120 ); 1088 1121 1089 1122 #define XFS_BUSY_STATES \ 1090 - { 0, "found" }, \ 1091 - { 1, "missing" } 1123 + { 0, "missing" }, \ 1124 + { 1, "found" } 1092 1125 1093 - TRACE_EVENT(xfs_alloc_unbusy, 1126 + TRACE_EVENT(xfs_alloc_busysearch, 1094 1127 TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, 1095 - int slot, int found), 1096 - TP_ARGS(mp, agno, slot, found), 1128 + xfs_agblock_t agbno, xfs_extlen_t len, int found), 1129 + TP_ARGS(mp, agno, agbno, len, found), 1097 1130 TP_STRUCT__entry( 1098 1131 __field(dev_t, dev) 1099 1132 __field(xfs_agnumber_t, agno) 1100 - __field(int, slot) 1133 + __field(xfs_agblock_t, agbno) 1134 + __field(xfs_extlen_t, len) 1101 1135 __field(int, found) 1102 1136 ), 1103 1137 TP_fast_assign( 1104 1138 __entry->dev = mp->m_super->s_dev; 1105 1139 __entry->agno = agno; 1106 - __entry->slot = slot; 1107 - __entry->found = found; 1108 - ), 1109 - TP_printk("dev %d:%d agno %u slot %d %s", 1110 - MAJOR(__entry->dev), MINOR(__entry->dev), 1111 - __entry->agno, 1112 - __entry->slot, 1113 - __print_symbolic(__entry->found, XFS_BUSY_STATES)) 1114 - ); 1115 - 1116 - TRACE_EVENT(xfs_alloc_busysearch, 1117 - TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno, 1118 - xfs_extlen_t len, xfs_lsn_t lsn), 1119 - TP_ARGS(mp, agno, agbno, len, lsn), 1120 - TP_STRUCT__entry( 1121 - __field(dev_t, dev) 1122 - __field(xfs_agnumber_t, agno) 1123 - __field(xfs_agblock_t, agbno) 1124 - __field(xfs_extlen_t, len) 1125 - __field(xfs_lsn_t, lsn) 1126 - ), 1127 - TP_fast_assign( 1128 - __entry->dev = mp->m_super->s_dev; 1129 - __entry->agno = agno; 1130 1140 __entry->agbno = agbno; 1131 1141 __entry->len = len; 1132 - __entry->lsn = lsn; 1142 + __entry->found = found; 1133 1143 ), 1134 - TP_printk("dev 
%d:%d agno %u agbno %u len %u force lsn 0x%llx", 1144 + TP_printk("dev %d:%d agno %u agbno %u len %u %s", 1135 1145 MAJOR(__entry->dev), MINOR(__entry->dev), 1136 1146 __entry->agno, 1137 1147 __entry->agbno, 1138 1148 __entry->len, 1149 + __print_symbolic(__entry->found, XFS_BUSY_STATES)) 1150 + ); 1151 + 1152 + TRACE_EVENT(xfs_trans_commit_lsn, 1153 + TP_PROTO(struct xfs_trans *trans), 1154 + TP_ARGS(trans), 1155 + TP_STRUCT__entry( 1156 + __field(dev_t, dev) 1157 + __field(struct xfs_trans *, tp) 1158 + __field(xfs_lsn_t, lsn) 1159 + ), 1160 + TP_fast_assign( 1161 + __entry->dev = trans->t_mountp->m_super->s_dev; 1162 + __entry->tp = trans; 1163 + __entry->lsn = trans->t_commit_lsn; 1164 + ), 1165 + TP_printk("dev %d:%d trans 0x%p commit_lsn 0x%llx", 1166 + MAJOR(__entry->dev), MINOR(__entry->dev), 1167 + __entry->tp, 1139 1168 __entry->lsn) 1140 1169 ); 1141 1170
+3 -3
fs/xfs/quota/xfs_dquot.c
··· 344 344 for (i = 0; i < q->qi_dqperchunk; i++, d++, curid++) 345 345 xfs_qm_dqinit_core(curid, type, d); 346 346 xfs_trans_dquot_buf(tp, bp, 347 - (type & XFS_DQ_USER ? XFS_BLI_UDQUOT_BUF : 348 - ((type & XFS_DQ_PROJ) ? XFS_BLI_PDQUOT_BUF : 349 - XFS_BLI_GDQUOT_BUF))); 347 + (type & XFS_DQ_USER ? XFS_BLF_UDQUOT_BUF : 348 + ((type & XFS_DQ_PROJ) ? XFS_BLF_PDQUOT_BUF : 349 + XFS_BLF_GDQUOT_BUF))); 350 350 xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1); 351 351 } 352 352
+15 -9
fs/xfs/xfs_ag.h
··· 175 175 } xfs_agfl_t; 176 176 177 177 /* 178 - * Busy block/extent entry. Used in perag to mark blocks that have been freed 179 - * but whose transactions aren't committed to disk yet. 178 + * Busy block/extent entry. Indexed by a rbtree in perag to mark blocks that 179 + * have been freed but whose transactions aren't committed to disk yet. 180 + * 181 + * Note that we use the transaction ID to record the transaction, not the 182 + * transaction structure itself. See xfs_alloc_busy_insert() for details. 180 183 */ 181 - typedef struct xfs_perag_busy { 182 - xfs_agblock_t busy_start; 183 - xfs_extlen_t busy_length; 184 - struct xfs_trans *busy_tp; /* transaction that did the free */ 185 - } xfs_perag_busy_t; 184 + struct xfs_busy_extent { 185 + struct rb_node rb_node; /* ag by-bno indexed search tree */ 186 + struct list_head list; /* transaction busy extent list */ 187 + xfs_agnumber_t agno; 188 + xfs_agblock_t bno; 189 + xfs_extlen_t length; 190 + xlog_tid_t tid; /* transaction that created this */ 191 + }; 186 192 187 193 /* 188 194 * Per-ag incore structure, copies of information in agf and agi, ··· 222 216 xfs_agino_t pagl_leftrec; 223 217 xfs_agino_t pagl_rightrec; 224 218 #ifdef __KERNEL__ 225 - spinlock_t pagb_lock; /* lock for pagb_list */ 219 + spinlock_t pagb_lock; /* lock for pagb_tree */ 220 + struct rb_root pagb_tree; /* ordered tree of busy extents */ 226 221 227 222 atomic_t pagf_fstrms; /* # of filestreams active in this AG */ 228 223 ··· 233 226 int pag_ici_reclaimable; /* reclaimable inodes */ 234 227 #endif 235 228 int pagb_count; /* pagb slots in use */ 236 - xfs_perag_busy_t pagb_list[XFS_PAGB_NUM_SLOTS]; /* unstable blocks */ 237 229 } xfs_perag_t; 238 230 239 231 /*
+263 -116
fs/xfs/xfs_alloc.c
··· 46 46 #define XFSA_FIXUP_BNO_OK 1 47 47 #define XFSA_FIXUP_CNT_OK 2 48 48 49 - STATIC void 50 - xfs_alloc_search_busy(xfs_trans_t *tp, 51 - xfs_agnumber_t agno, 52 - xfs_agblock_t bno, 53 - xfs_extlen_t len); 49 + static int 50 + xfs_alloc_busy_search(struct xfs_mount *mp, xfs_agnumber_t agno, 51 + xfs_agblock_t bno, xfs_extlen_t len); 54 52 55 53 /* 56 54 * Prototypes for per-ag allocation routines ··· 538 540 be32_to_cpu(agf->agf_length)); 539 541 xfs_alloc_log_agf(args->tp, args->agbp, 540 542 XFS_AGF_FREEBLKS); 541 - /* search the busylist for these blocks */ 542 - xfs_alloc_search_busy(args->tp, args->agno, 543 - args->agbno, args->len); 543 + /* 544 + * Search the busylist for these blocks and mark the 545 + * transaction as synchronous if blocks are found. This 546 + * avoids the need to block due to a synchronous log 547 + * force to ensure correct ordering as the synchronous 548 + * transaction will guarantee that for us. 549 + */ 550 + if (xfs_alloc_busy_search(args->mp, args->agno, 551 + args->agbno, args->len)) 552 + xfs_trans_set_sync(args->tp); 544 553 } 545 554 if (!args->isfl) 546 555 xfs_trans_mod_sb(args->tp, ··· 1698 1693 * when the iclog commits to disk. If a busy block is allocated, 1699 1694 * the iclog is pushed up to the LSN that freed the block. 1700 1695 */ 1701 - xfs_alloc_mark_busy(tp, agno, bno, len); 1696 + xfs_alloc_busy_insert(tp, agno, bno, len); 1702 1697 return 0; 1703 1698 1704 1699 error0: ··· 1994 1989 *bnop = bno; 1995 1990 1996 1991 /* 1997 - * As blocks are freed, they are added to the per-ag busy list 1998 - * and remain there until the freeing transaction is committed to 1999 - * disk. Now that we have allocated blocks, this list must be 2000 - * searched to see if a block is being reused. If one is, then 2001 - * the freeing transaction must be pushed to disk NOW by forcing 2002 - * to disk all iclogs up that transaction's LSN. 
1992 + * As blocks are freed, they are added to the per-ag busy list and 1993 + * remain there until the freeing transaction is committed to disk. 1994 + * Now that we have allocated blocks, this list must be searched to see 1995 + * if a block is being reused. If one is, then the freeing transaction 1996 + * must be pushed to disk before this transaction. 1997 + * 1998 + * We do this by setting the current transaction to a sync transaction 1999 + * which guarantees that the freeing transaction is on disk before this 2000 + * transaction. This is done instead of a synchronous log force here so 2001 + * that we don't sit and wait with the AGF locked in the transaction 2002 + * during the log force. 2003 2003 */ 2004 - xfs_alloc_search_busy(tp, be32_to_cpu(agf->agf_seqno), bno, 1); 2004 + if (xfs_alloc_busy_search(mp, be32_to_cpu(agf->agf_seqno), bno, 1)) 2005 + xfs_trans_set_sync(tp); 2005 2006 return 0; 2006 2007 } 2007 2008 ··· 2212 2201 be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]); 2213 2202 spin_lock_init(&pag->pagb_lock); 2214 2203 pag->pagb_count = 0; 2215 - memset(pag->pagb_list, 0, sizeof(pag->pagb_list)); 2204 + pag->pagb_tree = RB_ROOT; 2216 2205 pag->pagf_init = 1; 2217 2206 } 2218 2207 #ifdef DEBUG ··· 2490 2479 * list is reused, the transaction that freed it must be forced to disk 2491 2480 * before continuing to use the block. 2492 2481 * 2493 - * xfs_alloc_mark_busy - add to the per-ag busy list 2494 - * xfs_alloc_clear_busy - remove an item from the per-ag busy list 2482 + * xfs_alloc_busy_insert - add to the per-ag busy list 2483 + * xfs_alloc_busy_clear - remove an item from the per-ag busy list 2484 + * xfs_alloc_busy_search - search for a busy extent 2485 + */ 2486 + 2487 + /* 2488 + * Insert a new extent into the busy tree. 2489 + * 2490 + * The busy extent tree is indexed by the start block of the busy extent. 2491 + * there can be multiple overlapping ranges in the busy extent tree but only 2492 + * ever one entry at a given start block. 
The reason for this is that 2493 + * multi-block extents can be freed, then smaller chunks of that extent 2494 + * allocated and freed again before the first transaction commit is on disk. 2495 + * If the exact same start block is freed a second time, we have to wait for 2496 + * that busy extent to pass out of the tree before the new extent is inserted. 2497 + * There are two main cases we have to handle here. 2498 + * 2499 + * The first case is a transaction that triggers a "free - allocate - free" 2500 + * cycle. This can occur during btree manipulations as a btree block is freed 2501 + * to the freelist, then allocated from the free list, then freed again. In 2502 + * this case, the second extent free is what triggers the duplicate and as 2503 + * such the transaction IDs should match. Because the extent was allocated in 2504 + * this transaction, the transaction must be marked as synchronous. This is 2505 + * true for all cases where the free/alloc/free occurs in the one transaction, 2506 + * hence the addition of the ASSERT(tp->t_flags & XFS_TRANS_SYNC) to this case. 2507 + * This serves to catch violations of the second case quite effectively. 2508 + * 2509 + * The second case is where the free/alloc/free occur in different 2510 + * transactions. In this case, the thread freeing the extent the second time 2511 + * can't mark the extent busy immediately because it is already tracked in a 2512 + * transaction that may be committing. When the log commit for the existing 2513 + * busy extent completes, the busy extent will be removed from the tree. If we 2514 + * allow the second busy insert to continue using that busy extent structure, 2515 + * it can be freed before this transaction is safely in the log. Hence our 2516 + * only option in this case is to force the log to remove the existing busy 2517 + * extent from the list before we insert the new one with the current 2518 + * transaction ID.
2519 + * 2520 + * The problem we are trying to avoid in the free-alloc-free in separate 2521 + * transactions is most easily described with a timeline: 2522 + * 2523 + * Thread 1 Thread 2 Thread 3 xfslogd 2524 + * xact alloc 2525 + * free X 2526 + * mark busy 2527 + * commit xact 2528 + * free xact 2529 + * xact alloc 2530 + * alloc X 2531 + * busy search 2532 + * mark xact sync 2533 + * commit xact 2534 + * free xact 2535 + * force log 2536 + * checkpoint starts 2537 + * .... 2538 + * xact alloc 2539 + * free X 2540 + * mark busy 2541 + * finds match 2542 + * *** KABOOM! *** 2543 + * .... 2544 + * log IO completes 2545 + * unbusy X 2546 + * checkpoint completes 2547 + * 2548 + * By issuing a log force in thread 3 @ "KABOOM", the thread will block until 2549 + * the checkpoint completes, and the busy extent it matched will have been 2550 + * removed from the tree when it is woken. Hence it can then continue safely. 2551 + * 2552 + * However, to ensure this matching process is robust, we need to use the 2553 + * transaction ID for identifying transaction, as delayed logging results in 2554 + * the busy extent and transaction lifecycles being different. i.e. the busy 2555 + * extent is active for a lot longer than the transaction. Hence the 2556 + * transaction structure can be freed and reallocated, then mark the same 2557 + * extent busy again in the new transaction. In this case the new transaction 2558 + * will have a different tid but can have the same address, and hence we need 2559 + * to check against the tid. 2560 + * 2561 + * Future: for delayed logging, we could avoid the log force if the extent was 2562 + * first freed in the current checkpoint sequence. This, however, requires the 2563 + * ability to pin the current checkpoint in memory until this transaction 2564 + * commits to ensure that both the original free and the current one combine 2565 + * logically into the one checkpoint. 
If the checkpoint sequences are 2566 + * different, however, we still need to wait on a log force. 2495 2567 */ 2496 2568 void 2497 - xfs_alloc_mark_busy(xfs_trans_t *tp, 2498 - xfs_agnumber_t agno, 2499 - xfs_agblock_t bno, 2500 - xfs_extlen_t len) 2569 + xfs_alloc_busy_insert( 2570 + struct xfs_trans *tp, 2571 + xfs_agnumber_t agno, 2572 + xfs_agblock_t bno, 2573 + xfs_extlen_t len) 2501 2574 { 2502 - xfs_perag_busy_t *bsy; 2575 + struct xfs_busy_extent *new; 2576 + struct xfs_busy_extent *busyp; 2503 2577 struct xfs_perag *pag; 2504 - int n; 2578 + struct rb_node **rbp; 2579 + struct rb_node *parent; 2580 + int match; 2505 2581 2506 - pag = xfs_perag_get(tp->t_mountp, agno); 2582 + 2583 + new = kmem_zalloc(sizeof(struct xfs_busy_extent), KM_MAYFAIL); 2584 + if (!new) { 2585 + /* 2586 + * No Memory! Since it is now not possible to track the free 2587 + * block, make this a synchronous transaction to insure that 2588 + * the block is not reused before this transaction commits. 2589 + */ 2590 + trace_xfs_alloc_busy(tp, agno, bno, len, 1); 2591 + xfs_trans_set_sync(tp); 2592 + return; 2593 + } 2594 + 2595 + new->agno = agno; 2596 + new->bno = bno; 2597 + new->length = len; 2598 + new->tid = xfs_log_get_trans_ident(tp); 2599 + 2600 + INIT_LIST_HEAD(&new->list); 2601 + 2602 + /* trace before insert to be able to see failed inserts */ 2603 + trace_xfs_alloc_busy(tp, agno, bno, len, 0); 2604 + 2605 + pag = xfs_perag_get(tp->t_mountp, new->agno); 2606 + restart: 2507 2607 spin_lock(&pag->pagb_lock); 2608 + rbp = &pag->pagb_tree.rb_node; 2609 + parent = NULL; 2610 + busyp = NULL; 2611 + match = 0; 2612 + while (*rbp && match >= 0) { 2613 + parent = *rbp; 2614 + busyp = rb_entry(parent, struct xfs_busy_extent, rb_node); 2508 2615 2509 - /* search pagb_list for an open slot */ 2510 - for (bsy = pag->pagb_list, n = 0; 2511 - n < XFS_PAGB_NUM_SLOTS; 2512 - bsy++, n++) { 2513 - if (bsy->busy_tp == NULL) { 2616 + if (new->bno < busyp->bno) { 2617 + /* may overlap, but exact 
start block is lower */ 2618 + rbp = &(*rbp)->rb_left; 2619 + if (new->bno + new->length > busyp->bno) 2620 + match = busyp->tid == new->tid ? 1 : -1; 2621 + } else if (new->bno > busyp->bno) { 2622 + /* may overlap, but exact start block is higher */ 2623 + rbp = &(*rbp)->rb_right; 2624 + if (bno < busyp->bno + busyp->length) 2625 + match = busyp->tid == new->tid ? 1 : -1; 2626 + } else { 2627 + match = busyp->tid == new->tid ? 1 : -1; 2514 2628 break; 2515 2629 } 2516 2630 } 2517 - 2518 - trace_xfs_alloc_busy(tp->t_mountp, agno, bno, len, n); 2519 - 2520 - if (n < XFS_PAGB_NUM_SLOTS) { 2521 - bsy = &pag->pagb_list[n]; 2522 - pag->pagb_count++; 2523 - bsy->busy_start = bno; 2524 - bsy->busy_length = len; 2525 - bsy->busy_tp = tp; 2526 - xfs_trans_add_busy(tp, agno, n); 2527 - } else { 2528 - /* 2529 - * The busy list is full! Since it is now not possible to 2530 - * track the free block, make this a synchronous transaction 2531 - * to insure that the block is not reused before this 2532 - * transaction commits. 2533 - */ 2534 - xfs_trans_set_sync(tp); 2631 + if (match < 0) { 2632 + /* overlap marked busy in different transaction */ 2633 + spin_unlock(&pag->pagb_lock); 2634 + xfs_log_force(tp->t_mountp, XFS_LOG_SYNC); 2635 + goto restart; 2535 2636 } 2637 + if (match > 0) { 2638 + /* 2639 + * overlap marked busy in same transaction. Update if exact 2640 + * start block match, otherwise combine the busy extents into 2641 + * a single range. 
2642 + */ 2643 + if (busyp->bno == new->bno) { 2644 + busyp->length = max(busyp->length, new->length); 2645 + spin_unlock(&pag->pagb_lock); 2646 + ASSERT(tp->t_flags & XFS_TRANS_SYNC); 2647 + xfs_perag_put(pag); 2648 + kmem_free(new); 2649 + return; 2650 + } 2651 + rb_erase(&busyp->rb_node, &pag->pagb_tree); 2652 + new->length = max(busyp->bno + busyp->length, 2653 + new->bno + new->length) - 2654 + min(busyp->bno, new->bno); 2655 + new->bno = min(busyp->bno, new->bno); 2656 + } else 2657 + busyp = NULL; 2536 2658 2659 + rb_link_node(&new->rb_node, parent, rbp); 2660 + rb_insert_color(&new->rb_node, &pag->pagb_tree); 2661 + 2662 + list_add(&new->list, &tp->t_busy); 2537 2663 spin_unlock(&pag->pagb_lock); 2538 2664 xfs_perag_put(pag); 2665 + kmem_free(busyp); 2666 + } 2667 + 2668 + /* 2669 + * Search for a busy extent within the range of the extent we are about to 2670 + * allocate. You need to be holding the busy extent tree lock when calling 2671 + * xfs_alloc_busy_search(). This function returns 0 for no overlapping busy 2672 + * extent, -1 for an overlapping but not exact busy extent, and 1 for an exact 2673 + * match. This is done so that a non-zero return indicates an overlap that 2674 + * will require a synchronous transaction, but it can still be 2675 + * used to distinguish between a partial or exact match. 
2676 + */ 2677 + static int 2678 + xfs_alloc_busy_search( 2679 + struct xfs_mount *mp, 2680 + xfs_agnumber_t agno, 2681 + xfs_agblock_t bno, 2682 + xfs_extlen_t len) 2683 + { 2684 + struct xfs_perag *pag; 2685 + struct rb_node *rbp; 2686 + struct xfs_busy_extent *busyp; 2687 + int match = 0; 2688 + 2689 + pag = xfs_perag_get(mp, agno); 2690 + spin_lock(&pag->pagb_lock); 2691 + 2692 + rbp = pag->pagb_tree.rb_node; 2693 + 2694 + /* find closest start bno overlap */ 2695 + while (rbp) { 2696 + busyp = rb_entry(rbp, struct xfs_busy_extent, rb_node); 2697 + if (bno < busyp->bno) { 2698 + /* may overlap, but exact start block is lower */ 2699 + if (bno + len > busyp->bno) 2700 + match = -1; 2701 + rbp = rbp->rb_left; 2702 + } else if (bno > busyp->bno) { 2703 + /* may overlap, but exact start block is higher */ 2704 + if (bno < busyp->bno + busyp->length) 2705 + match = -1; 2706 + rbp = rbp->rb_right; 2707 + } else { 2708 + /* bno matches busyp, length determines exact match */ 2709 + match = (busyp->length == len) ? 
1 : -1; 2710 + break; 2711 + } 2712 + } 2713 + spin_unlock(&pag->pagb_lock); 2714 + trace_xfs_alloc_busysearch(mp, agno, bno, len, !!match); 2715 + xfs_perag_put(pag); 2716 + return match; 2539 2717 } 2540 2718 2541 2719 void 2542 - xfs_alloc_clear_busy(xfs_trans_t *tp, 2543 - xfs_agnumber_t agno, 2544 - int idx) 2720 + xfs_alloc_busy_clear( 2721 + struct xfs_mount *mp, 2722 + struct xfs_busy_extent *busyp) 2545 2723 { 2546 2724 struct xfs_perag *pag; 2547 - xfs_perag_busy_t *list; 2548 2725 2549 - ASSERT(idx < XFS_PAGB_NUM_SLOTS); 2550 - pag = xfs_perag_get(tp->t_mountp, agno); 2726 + trace_xfs_alloc_unbusy(mp, busyp->agno, busyp->bno, 2727 + busyp->length); 2728 + 2729 + ASSERT(xfs_alloc_busy_search(mp, busyp->agno, busyp->bno, 2730 + busyp->length) == 1); 2731 + 2732 + list_del_init(&busyp->list); 2733 + 2734 + pag = xfs_perag_get(mp, busyp->agno); 2551 2735 spin_lock(&pag->pagb_lock); 2552 - list = pag->pagb_list; 2553 - 2554 - trace_xfs_alloc_unbusy(tp->t_mountp, agno, idx, list[idx].busy_tp == tp); 2555 - 2556 - if (list[idx].busy_tp == tp) { 2557 - list[idx].busy_tp = NULL; 2558 - pag->pagb_count--; 2559 - } 2560 - 2736 + rb_erase(&busyp->rb_node, &pag->pagb_tree); 2561 2737 spin_unlock(&pag->pagb_lock); 2562 2738 xfs_perag_put(pag); 2563 - } 2564 2739 2565 - 2566 - /* 2567 - * If we find the extent in the busy list, force the log out to get the 2568 - * extent out of the busy list so the caller can use it straight away. 2569 - */ 2570 - STATIC void 2571 - xfs_alloc_search_busy(xfs_trans_t *tp, 2572 - xfs_agnumber_t agno, 2573 - xfs_agblock_t bno, 2574 - xfs_extlen_t len) 2575 - { 2576 - struct xfs_perag *pag; 2577 - xfs_perag_busy_t *bsy; 2578 - xfs_agblock_t uend, bend; 2579 - xfs_lsn_t lsn = 0; 2580 - int cnt; 2581 - 2582 - pag = xfs_perag_get(tp->t_mountp, agno); 2583 - spin_lock(&pag->pagb_lock); 2584 - cnt = pag->pagb_count; 2585 - 2586 - /* 2587 - * search pagb_list for this slot, skipping open slots. 
We have to 2588 - * search the entire array as there may be multiple overlaps and 2589 - * we have to get the most recent LSN for the log force to push out 2590 - * all the transactions that span the range. 2591 - */ 2592 - uend = bno + len - 1; 2593 - for (cnt = 0; cnt < pag->pagb_count; cnt++) { 2594 - bsy = &pag->pagb_list[cnt]; 2595 - if (!bsy->busy_tp) 2596 - continue; 2597 - 2598 - bend = bsy->busy_start + bsy->busy_length - 1; 2599 - if (bno > bend || uend < bsy->busy_start) 2600 - continue; 2601 - 2602 - /* (start1,length1) within (start2, length2) */ 2603 - if (XFS_LSN_CMP(bsy->busy_tp->t_commit_lsn, lsn) > 0) 2604 - lsn = bsy->busy_tp->t_commit_lsn; 2605 - } 2606 - spin_unlock(&pag->pagb_lock); 2607 - xfs_perag_put(pag); 2608 - trace_xfs_alloc_busysearch(tp->t_mountp, agno, bno, len, lsn); 2609 - 2610 - /* 2611 - * If a block was found, force the log through the LSN of the 2612 - * transaction that freed the block 2613 - */ 2614 - if (lsn) 2615 - xfs_log_force_lsn(tp->t_mountp, lsn, XFS_LOG_SYNC); 2740 + kmem_free(busyp); 2616 2741 }
+3 -4
fs/xfs/xfs_alloc.h
··· 22 22 struct xfs_mount; 23 23 struct xfs_perag; 24 24 struct xfs_trans; 25 + struct xfs_busy_extent; 25 26 26 27 /* 27 28 * Freespace allocation types. Argument to xfs_alloc_[v]extent. ··· 120 119 #ifdef __KERNEL__ 121 120 122 121 void 123 - xfs_alloc_mark_busy(xfs_trans_t *tp, 122 + xfs_alloc_busy_insert(xfs_trans_t *tp, 124 123 xfs_agnumber_t agno, 125 124 xfs_agblock_t bno, 126 125 xfs_extlen_t len); 127 126 128 127 void 129 - xfs_alloc_clear_busy(xfs_trans_t *tp, 130 - xfs_agnumber_t ag, 131 - int idx); 128 + xfs_alloc_busy_clear(struct xfs_mount *mp, struct xfs_busy_extent *busyp); 132 129 133 130 #endif /* __KERNEL__ */ 134 131
+1 -1
fs/xfs/xfs_alloc_btree.c
··· 134 134 * disk. If a busy block is allocated, the iclog is pushed up to the 135 135 * LSN that freed the block. 136 136 */ 137 - xfs_alloc_mark_busy(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1); 137 + xfs_alloc_busy_insert(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1); 138 138 xfs_trans_agbtree_delta(cur->bc_tp, -1); 139 139 return 0; 140 140 }
+87 -83
fs/xfs/xfs_buf_item.c
··· 64 64 nbytes = last - first + 1; 65 65 bfset(bip->bli_logged, first, nbytes); 66 66 for (x = 0; x < nbytes; x++) { 67 - chunk_num = byte >> XFS_BLI_SHIFT; 67 + chunk_num = byte >> XFS_BLF_SHIFT; 68 68 word_num = chunk_num >> BIT_TO_WORD_SHIFT; 69 69 bit_num = chunk_num & (NBWORD - 1); 70 70 wordp = &(bip->bli_format.blf_data_map[word_num]); ··· 166 166 * cancel flag in it. 167 167 */ 168 168 trace_xfs_buf_item_size_stale(bip); 169 - ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); 169 + ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); 170 170 return 1; 171 171 } 172 172 ··· 197 197 } else if (next_bit != last_bit + 1) { 198 198 last_bit = next_bit; 199 199 nvecs++; 200 - } else if (xfs_buf_offset(bp, next_bit * XFS_BLI_CHUNK) != 201 - (xfs_buf_offset(bp, last_bit * XFS_BLI_CHUNK) + 202 - XFS_BLI_CHUNK)) { 200 + } else if (xfs_buf_offset(bp, next_bit * XFS_BLF_CHUNK) != 201 + (xfs_buf_offset(bp, last_bit * XFS_BLF_CHUNK) + 202 + XFS_BLF_CHUNK)) { 203 203 last_bit = next_bit; 204 204 nvecs++; 205 205 } else { ··· 254 254 vecp++; 255 255 nvecs = 1; 256 256 257 + /* 258 + * If it is an inode buffer, transfer the in-memory state to the 259 + * format flags and clear the in-memory state. We do not transfer 260 + * this state if the inode buffer allocation has not yet been committed 261 + * to the log as setting the XFS_BLI_INODE_BUF flag will prevent 262 + * correct replay of the inode allocation. 263 + */ 264 + if (bip->bli_flags & XFS_BLI_INODE_BUF) { 265 + if (!((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) && 266 + xfs_log_item_in_current_chkpt(&bip->bli_item))) 267 + bip->bli_format.blf_flags |= XFS_BLF_INODE_BUF; 268 + bip->bli_flags &= ~XFS_BLI_INODE_BUF; 269 + } 270 + 257 271 if (bip->bli_flags & XFS_BLI_STALE) { 258 272 /* 259 273 * The buffer is stale, so all we need to log ··· 275 261 * cancel flag in it. 
276 262 */ 277 263 trace_xfs_buf_item_format_stale(bip); 278 - ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); 264 + ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); 279 265 bip->bli_format.blf_size = nvecs; 280 266 return; 281 267 } ··· 308 294 * keep counting and scanning. 309 295 */ 310 296 if (next_bit == -1) { 311 - buffer_offset = first_bit * XFS_BLI_CHUNK; 297 + buffer_offset = first_bit * XFS_BLF_CHUNK; 312 298 vecp->i_addr = xfs_buf_offset(bp, buffer_offset); 313 - vecp->i_len = nbits * XFS_BLI_CHUNK; 299 + vecp->i_len = nbits * XFS_BLF_CHUNK; 314 300 vecp->i_type = XLOG_REG_TYPE_BCHUNK; 315 301 nvecs++; 316 302 break; 317 303 } else if (next_bit != last_bit + 1) { 318 - buffer_offset = first_bit * XFS_BLI_CHUNK; 304 + buffer_offset = first_bit * XFS_BLF_CHUNK; 319 305 vecp->i_addr = xfs_buf_offset(bp, buffer_offset); 320 - vecp->i_len = nbits * XFS_BLI_CHUNK; 306 + vecp->i_len = nbits * XFS_BLF_CHUNK; 321 307 vecp->i_type = XLOG_REG_TYPE_BCHUNK; 322 308 nvecs++; 323 309 vecp++; 324 310 first_bit = next_bit; 325 311 last_bit = next_bit; 326 312 nbits = 1; 327 - } else if (xfs_buf_offset(bp, next_bit << XFS_BLI_SHIFT) != 328 - (xfs_buf_offset(bp, last_bit << XFS_BLI_SHIFT) + 329 - XFS_BLI_CHUNK)) { 330 - buffer_offset = first_bit * XFS_BLI_CHUNK; 313 + } else if (xfs_buf_offset(bp, next_bit << XFS_BLF_SHIFT) != 314 + (xfs_buf_offset(bp, last_bit << XFS_BLF_SHIFT) + 315 + XFS_BLF_CHUNK)) { 316 + buffer_offset = first_bit * XFS_BLF_CHUNK; 331 317 vecp->i_addr = xfs_buf_offset(bp, buffer_offset); 332 - vecp->i_len = nbits * XFS_BLI_CHUNK; 318 + vecp->i_len = nbits * XFS_BLF_CHUNK; 333 319 vecp->i_type = XLOG_REG_TYPE_BCHUNK; 334 320 /* You would think we need to bump the nvecs here too, but we do not 335 321 * this number is used by recovery, and it gets confused by the boundary ··· 355 341 } 356 342 357 343 /* 358 - * This is called to pin the buffer associated with the buf log 359 - * item in memory so it cannot be written out. 
Simply call bpin() 360 - * on the buffer to do this. 344 + * This is called to pin the buffer associated with the buf log item in memory 345 + * so it cannot be written out. Simply call bpin() on the buffer to do this. 346 + * 347 + * We also always take a reference to the buffer log item here so that the bli 348 + * is held while the item is pinned in memory. This means that we can 349 + * unconditionally drop the reference count a transaction holds when the 350 + * transaction is completed. 361 351 */ 352 + 362 353 STATIC void 363 354 xfs_buf_item_pin( 364 355 xfs_buf_log_item_t *bip) ··· 375 356 ASSERT(atomic_read(&bip->bli_refcount) > 0); 376 357 ASSERT((bip->bli_flags & XFS_BLI_LOGGED) || 377 358 (bip->bli_flags & XFS_BLI_STALE)); 359 + atomic_inc(&bip->bli_refcount); 378 360 trace_xfs_buf_item_pin(bip); 379 361 xfs_bpin(bp); 380 362 } ··· 413 393 ASSERT(XFS_BUF_VALUSEMA(bp) <= 0); 414 394 ASSERT(!(XFS_BUF_ISDELAYWRITE(bp))); 415 395 ASSERT(XFS_BUF_ISSTALE(bp)); 416 - ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); 396 + ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); 417 397 trace_xfs_buf_item_unpin_stale(bip); 418 398 419 399 /* ··· 509 489 } 510 490 511 491 /* 512 - * Release the buffer associated with the buf log item. 513 - * If there is no dirty logged data associated with the 514 - * buffer recorded in the buf log item, then free the 515 - * buf log item and remove the reference to it in the 516 - * buffer. 492 + * Release the buffer associated with the buf log item. If there is no dirty 493 + * logged data associated with the buffer recorded in the buf log item, then 494 + * free the buf log item and remove the reference to it in the buffer. 517 495 * 518 - * This call ignores the recursion count. It is only called 519 - * when the buffer should REALLY be unlocked, regardless 520 - * of the recursion count. 496 + * This call ignores the recursion count. 
It is only called when the buffer 497 + * should REALLY be unlocked, regardless of the recursion count. 521 498 * 522 - * If the XFS_BLI_HOLD flag is set in the buf log item, then 523 - * free the log item if necessary but do not unlock the buffer. 524 - * This is for support of xfs_trans_bhold(). Make sure the 525 - * XFS_BLI_HOLD field is cleared if we don't free the item. 499 + * We unconditionally drop the transaction's reference to the log item. If the 500 + * item was logged, then another reference was taken when it was pinned, so we 501 + * can safely drop the transaction reference now. This also allows us to avoid 502 + * potential races with the unpin code freeing the bli by not referencing the 503 + * bli after we've dropped the reference count. 504 + * 505 + * If the XFS_BLI_HOLD flag is set in the buf log item, then free the log item 506 + * if necessary but do not unlock the buffer. This is for support of 507 + * xfs_trans_bhold(). Make sure the XFS_BLI_HOLD field is cleared if we don't 508 + * free the item. 526 509 */ 527 510 STATIC void 528 511 xfs_buf_item_unlock( ··· 537 514 538 515 bp = bip->bli_buf; 539 516 540 - /* 541 - * Clear the buffer's association with this transaction. 542 - */ 517 + /* Clear the buffer's association with this transaction. */ 543 518 XFS_BUF_SET_FSPRIVATE2(bp, NULL); 544 519 545 520 /* 546 - * If this is a transaction abort, don't return early. 547 - * Instead, allow the brelse to happen. 548 - * Normally it would be done for stale (cancelled) buffers 549 - * at unpin time, but we'll never go through the pin/unpin 550 - * cycle if we abort inside commit. 521 + * If this is a transaction abort, don't return early. Instead, allow 522 + * the brelse to happen. Normally it would be done for stale 523 + * (cancelled) buffers at unpin time, but we'll never go through the 524 + * pin/unpin cycle if we abort inside commit. 
551 525 */ 552 526 aborted = (bip->bli_item.li_flags & XFS_LI_ABORTED) != 0; 553 - 554 - /* 555 - * If the buf item is marked stale, then don't do anything. 556 - * We'll unlock the buffer and free the buf item when the 557 - * buffer is unpinned for the last time. 558 - */ 559 - if (bip->bli_flags & XFS_BLI_STALE) { 560 - bip->bli_flags &= ~XFS_BLI_LOGGED; 561 - trace_xfs_buf_item_unlock_stale(bip); 562 - ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); 563 - if (!aborted) 564 - return; 565 - } 566 - 567 - /* 568 - * Drop the transaction's reference to the log item if 569 - * it was not logged as part of the transaction. Otherwise 570 - * we'll drop the reference in xfs_buf_item_unpin() when 571 - * the transaction is really through with the buffer. 572 - */ 573 - if (!(bip->bli_flags & XFS_BLI_LOGGED)) { 574 - atomic_dec(&bip->bli_refcount); 575 - } else { 576 - /* 577 - * Clear the logged flag since this is per 578 - * transaction state. 579 - */ 580 - bip->bli_flags &= ~XFS_BLI_LOGGED; 581 - } 582 527 583 528 /* 584 529 * Before possibly freeing the buf item, determine if we should 585 530 * release the buffer at the end of this routine. 586 531 */ 587 532 hold = bip->bli_flags & XFS_BLI_HOLD; 533 + 534 + /* Clear the per transaction state. */ 535 + bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_HOLD); 536 + 537 + /* 538 + * If the buf item is marked stale, then don't do anything. We'll 539 + * unlock the buffer and free the buf item when the buffer is unpinned 540 + * for the last time. 541 + */ 542 + if (bip->bli_flags & XFS_BLI_STALE) { 543 + trace_xfs_buf_item_unlock_stale(bip); 544 + ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); 545 + if (!aborted) { 546 + atomic_dec(&bip->bli_refcount); 547 + return; 548 + } 549 + } 550 + 588 551 trace_xfs_buf_item_unlock(bip); 589 552 590 553 /* 591 - * If the buf item isn't tracking any data, free it. 592 - * Otherwise, if XFS_BLI_HOLD is set clear it. 
554 + * If the buf item isn't tracking any data, free it, otherwise drop the 555 + * reference we hold to it. 593 556 */ 594 557 if (xfs_bitmap_empty(bip->bli_format.blf_data_map, 595 - bip->bli_format.blf_map_size)) { 558 + bip->bli_format.blf_map_size)) 596 559 xfs_buf_item_relse(bp); 597 - } else if (hold) { 598 - bip->bli_flags &= ~XFS_BLI_HOLD; 599 - } 560 + else 561 + atomic_dec(&bip->bli_refcount); 600 562 601 - /* 602 - * Release the buffer if XFS_BLI_HOLD was not set. 603 - */ 604 - if (!hold) { 563 + if (!hold) 605 564 xfs_buf_relse(bp); 606 - } 607 565 } 608 566 609 567 /* ··· 721 717 } 722 718 723 719 /* 724 - * chunks is the number of XFS_BLI_CHUNK size pieces 720 + * chunks is the number of XFS_BLF_CHUNK size pieces 725 721 * the buffer can be divided into. Make sure not to 726 722 * truncate any pieces. map_size is the size of the 727 723 * bitmap needed to describe the chunks of the buffer. 728 724 */ 729 - chunks = (int)((XFS_BUF_COUNT(bp) + (XFS_BLI_CHUNK - 1)) >> XFS_BLI_SHIFT); 725 + chunks = (int)((XFS_BUF_COUNT(bp) + (XFS_BLF_CHUNK - 1)) >> XFS_BLF_SHIFT); 730 726 map_size = (int)((chunks + NBWORD) >> BIT_TO_WORD_SHIFT); 731 727 732 728 bip = (xfs_buf_log_item_t*)kmem_zone_zalloc(xfs_buf_item_zone, ··· 794 790 /* 795 791 * Convert byte offsets to bit numbers. 796 792 */ 797 - first_bit = first >> XFS_BLI_SHIFT; 798 - last_bit = last >> XFS_BLI_SHIFT; 793 + first_bit = first >> XFS_BLF_SHIFT; 794 + last_bit = last >> XFS_BLF_SHIFT; 799 795 800 796 /* 801 797 * Calculate the total number of bits to be set.
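The chunk arithmetic that the XFS_BLI_* to XFS_BLF_* rename touches above maps dirty byte ranges onto bitmap bits. The following is a hedged user-space sketch of that mapping, with the helper names blf_chunks/blf_bits_for_range invented for illustration; the constants and shift expressions mirror the patch.

```c
#define XFS_BLF_CHUNK	128
#define XFS_BLF_SHIFT	7	/* log2(XFS_BLF_CHUNK) */

/* Number of chunk bits needed to describe a buffer of buf_count bytes,
 * rounding up so no partial chunk is truncated - as in
 * xfs_buf_item_init() above. */
int blf_chunks(int buf_count)
{
	return (buf_count + (XFS_BLF_CHUNK - 1)) >> XFS_BLF_SHIFT;
}

/* Number of chunk bits covering the inclusive byte range [first, last],
 * using the same byte-offset to bit-number conversion as
 * xfs_buf_item_log(). */
int blf_bits_for_range(int first, int last)
{
	int first_bit = first >> XFS_BLF_SHIFT;
	int last_bit = last >> XFS_BLF_SHIFT;

	return last_bit - first_bit + 1;
}
```

For example, logging bytes 100..300 of a buffer dirties three 128-byte chunks, even though the range itself is only 201 bytes.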
+10 -8
fs/xfs/xfs_buf_item.h
··· 41 41 * This flag indicates that the buffer contains on disk inodes 42 42 * and requires special recovery handling. 43 43 */ 44 - #define XFS_BLI_INODE_BUF 0x1 44 + #define XFS_BLF_INODE_BUF 0x1 45 45 /* 46 46 * This flag indicates that the buffer should not be replayed 47 47 * during recovery because its blocks are being freed. 48 48 */ 49 - #define XFS_BLI_CANCEL 0x2 49 + #define XFS_BLF_CANCEL 0x2 50 50 /* 51 51 * This flag indicates that the buffer contains on disk 52 52 * user or group dquots and may require special recovery handling. 53 53 */ 54 - #define XFS_BLI_UDQUOT_BUF 0x4 55 - #define XFS_BLI_PDQUOT_BUF 0x8 56 - #define XFS_BLI_GDQUOT_BUF 0x10 54 + #define XFS_BLF_UDQUOT_BUF 0x4 55 + #define XFS_BLF_PDQUOT_BUF 0x8 56 + #define XFS_BLF_GDQUOT_BUF 0x10 57 57 58 - #define XFS_BLI_CHUNK 128 59 - #define XFS_BLI_SHIFT 7 58 + #define XFS_BLF_CHUNK 128 59 + #define XFS_BLF_SHIFT 7 60 60 #define BIT_TO_WORD_SHIFT 5 61 61 #define NBWORD (NBBY * sizeof(unsigned int)) 62 62 ··· 69 69 #define XFS_BLI_LOGGED 0x08 70 70 #define XFS_BLI_INODE_ALLOC_BUF 0x10 71 71 #define XFS_BLI_STALE_INODE 0x20 72 + #define XFS_BLI_INODE_BUF 0x40 72 73 73 74 #define XFS_BLI_FLAGS \ 74 75 { XFS_BLI_HOLD, "HOLD" }, \ ··· 77 76 { XFS_BLI_STALE, "STALE" }, \ 78 77 { XFS_BLI_LOGGED, "LOGGED" }, \ 79 78 { XFS_BLI_INODE_ALLOC_BUF, "INODE_ALLOC" }, \ 80 - { XFS_BLI_STALE_INODE, "STALE_INODE" } 79 + { XFS_BLI_STALE_INODE, "STALE_INODE" }, \ 80 + { XFS_BLI_INODE_BUF, "INODE_BUF" } 81 81 82 82 83 83 #ifdef __KERNEL__
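The header change above separates on-disk log format flags (XFS_BLF_*, in blf_flags) from in-memory buf log item state (XFS_BLI_*, in bli_flags), which is why the new in-memory XFS_BLI_INODE_BUF can use 0x40 without colliding with the on-disk 0x1 value. A minimal sketch of the transfer done at format time, simplified by omitting the checkpoint check the real xfs_buf_item_format() performs; the struct and function name are hypothetical:

```c
#define XFS_BLF_INODE_BUF 0x1   /* on-disk log format flag */
#define XFS_BLI_INODE_BUF 0x40  /* in-memory buf log item flag */

/* Simplified stand-in for the relevant xfs_buf_log_item fields. */
struct bli_state {
	unsigned int bli_flags;  /* in-memory state */
	unsigned int blf_flags;  /* logged format flags */
};

/* Transfer the in-memory inode-buffer state to the on-disk format
 * flags, then clear it - the in-memory bit is per-life-of-item state,
 * the format bit is what recovery sees. */
void bli_format_inode_buf(struct bli_state *bip)
{
	if (bip->bli_flags & XFS_BLI_INODE_BUF) {
		bip->blf_flags |= XFS_BLF_INODE_BUF;
		bip->bli_flags &= ~XFS_BLI_INODE_BUF;
	}
}
```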
+1 -1
fs/xfs/xfs_error.c
··· 170 170 va_list ap; 171 171 172 172 #ifdef DEBUG 173 - xfs_panic_mask |= XFS_PTAG_SHUTDOWN_CORRUPT; 173 + xfs_panic_mask |= (XFS_PTAG_SHUTDOWN_CORRUPT | XFS_PTAG_LOGRES); 174 174 #endif 175 175 176 176 if (xfs_panic_mask && (xfs_panic_mask & panic_tag)
+89 -33
fs/xfs/xfs_log.c
··· 54 54 STATIC int xlog_space_left(xlog_t *log, int cycle, int bytes); 55 55 STATIC int xlog_sync(xlog_t *log, xlog_in_core_t *iclog); 56 56 STATIC void xlog_dealloc_log(xlog_t *log); 57 - STATIC int xlog_write(struct log *log, struct xfs_log_vec *log_vector, 58 - struct xlog_ticket *tic, xfs_lsn_t *start_lsn, 59 - xlog_in_core_t **commit_iclog, uint flags); 60 57 61 58 /* local state machine functions */ 62 59 STATIC void xlog_state_done_syncing(xlog_in_core_t *iclog, int); ··· 82 85 xlog_ticket_t *ticket); 83 86 STATIC void xlog_ungrant_log_space(xlog_t *log, 84 87 xlog_ticket_t *ticket); 85 - 86 - 87 - /* local ticket functions */ 88 - STATIC xlog_ticket_t *xlog_ticket_alloc(xlog_t *log, 89 - int unit_bytes, 90 - int count, 91 - char clientid, 92 - uint flags); 93 88 94 89 #if defined(DEBUG) 95 90 STATIC void xlog_verify_dest_ptr(xlog_t *log, char *ptr); ··· 349 360 ASSERT(flags & XFS_LOG_PERM_RESERV); 350 361 internal_ticket = *ticket; 351 362 363 + /* 364 + * this is a new transaction on the ticket, so we need to 365 + * change the transaction ID so that the next transaction has a 366 + * different TID in the log. Just add one to the existing tid 367 + * so that we can see chains of rolling transactions in the log 368 + * easily. 
369 + */ 370 + internal_ticket->t_tid++; 371 + 352 372 trace_xfs_log_reserve(log, internal_ticket); 353 373 354 374 xlog_grant_push_ail(mp, internal_ticket->t_unit_res); ··· 365 367 } else { 366 368 /* may sleep if need to allocate more tickets */ 367 369 internal_ticket = xlog_ticket_alloc(log, unit_bytes, cnt, 368 - client, flags); 370 + client, flags, 371 + KM_SLEEP|KM_MAYFAIL); 369 372 if (!internal_ticket) 370 373 return XFS_ERROR(ENOMEM); 371 374 internal_ticket->t_trans_type = t_type; ··· 450 451 451 452 /* Normal transactions can now occur */ 452 453 mp->m_log->l_flags &= ~XLOG_ACTIVE_RECOVERY; 454 + 455 + /* 456 + * Now the log has been fully initialised and we know were our 457 + * space grant counters are, we can initialise the permanent ticket 458 + * needed for delayed logging to work. 459 + */ 460 + xlog_cil_init_post_recovery(mp->m_log); 453 461 454 462 return 0; 455 463 ··· 664 658 item->li_ailp = mp->m_ail; 665 659 item->li_type = type; 666 660 item->li_ops = ops; 661 + item->li_lv = NULL; 662 + 663 + INIT_LIST_HEAD(&item->li_ail); 664 + INIT_LIST_HEAD(&item->li_cil); 667 665 } 668 666 669 667 /* ··· 1178 1168 *iclogp = log->l_iclog; /* complete ring */ 1179 1169 log->l_iclog->ic_prev = prev_iclog; /* re-write 1st prev ptr */ 1180 1170 1171 + error = xlog_cil_init(log); 1172 + if (error) 1173 + goto out_free_iclog; 1181 1174 return log; 1182 1175 1183 1176 out_free_iclog: ··· 1507 1494 xlog_in_core_t *iclog, *next_iclog; 1508 1495 int i; 1509 1496 1497 + xlog_cil_destroy(log); 1498 + 1510 1499 iclog = log->l_iclog; 1511 1500 for (i=0; i<log->l_iclog_bufs; i++) { 1512 1501 sv_destroy(&iclog->ic_force_wait); ··· 1551 1536 * print out info relating to regions written which consume 1552 1537 * the reservation 1553 1538 */ 1554 - STATIC void 1555 - xlog_print_tic_res(xfs_mount_t *mp, xlog_ticket_t *ticket) 1539 + void 1540 + xlog_print_tic_res( 1541 + struct xfs_mount *mp, 1542 + struct xlog_ticket *ticket) 1556 1543 { 1557 1544 uint i; 1558 1545 uint 
ophdr_spc = ticket->t_res_num_ophdrs * (uint)sizeof(xlog_op_header_t); ··· 1654 1637 "bad-rtype" : res_type_str[r_type-1]), 1655 1638 ticket->t_res_arr[i].r_len); 1656 1639 } 1640 + 1641 + xfs_cmn_err(XFS_PTAG_LOGRES, CE_ALERT, mp, 1642 + "xfs_log_write: reservation ran out. Need to up reservation"); 1643 + xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); 1657 1644 } 1658 1645 1659 1646 /* ··· 1886 1865 * we don't update ic_offset until the end when we know exactly how many 1887 1866 * bytes have been written out. 1888 1867 */ 1889 - STATIC int 1868 + int 1890 1869 xlog_write( 1891 1870 struct log *log, 1892 1871 struct xfs_log_vec *log_vector, ··· 1910 1889 *start_lsn = 0; 1911 1890 1912 1891 len = xlog_write_calc_vec_length(ticket, log_vector); 1913 - if (ticket->t_curr_res < len) { 1892 + if (log->l_cilp) { 1893 + /* 1894 + * Region headers and bytes are already accounted for. 1895 + * We only need to take into account start records and 1896 + * split regions in this function. 1897 + */ 1898 + if (ticket->t_flags & XLOG_TIC_INITED) 1899 + ticket->t_curr_res -= sizeof(xlog_op_header_t); 1900 + 1901 + /* 1902 + * Commit record headers need to be accounted for. These 1903 + * come in as separate writes so are easy to detect. 1904 + */ 1905 + if (flags & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) 1906 + ticket->t_curr_res -= sizeof(xlog_op_header_t); 1907 + } else 1908 + ticket->t_curr_res -= len; 1909 + 1910 + if (ticket->t_curr_res < 0) 1914 1911 xlog_print_tic_res(log->l_mp, ticket); 1915 - #ifdef DEBUG 1916 - xlog_panic( 1917 - "xfs_log_write: reservation ran out. Need to up reservation"); 1918 - #else 1919 - /* Customer configurable panic */ 1920 - xfs_cmn_err(XFS_PTAG_LOGRES, CE_ALERT, log->l_mp, 1921 - "xfs_log_write: reservation ran out. 
Need to up reservation"); 1922 - 1923 - /* If we did not panic, shutdown the filesystem */ 1924 - xfs_force_shutdown(log->l_mp, SHUTDOWN_CORRUPT_INCORE); 1925 - #endif 1926 - } 1927 - 1928 - ticket->t_curr_res -= len; 1929 1912 1930 1913 index = 0; 1931 1914 lv = log_vector; ··· 3025 3000 3026 3001 XFS_STATS_INC(xs_log_force); 3027 3002 3003 + xlog_cil_push(log, 1); 3004 + 3028 3005 spin_lock(&log->l_icloglock); 3029 3006 3030 3007 iclog = log->l_iclog; ··· 3175 3148 ASSERT(lsn != 0); 3176 3149 3177 3150 XFS_STATS_INC(xs_log_force); 3151 + 3152 + if (log->l_cilp) { 3153 + lsn = xlog_cil_push_lsn(log, lsn); 3154 + if (lsn == NULLCOMMITLSN) 3155 + return 0; 3156 + } 3178 3157 3179 3158 try_again: 3180 3159 spin_lock(&log->l_icloglock); ··· 3346 3313 return ticket; 3347 3314 } 3348 3315 3316 + xlog_tid_t 3317 + xfs_log_get_trans_ident( 3318 + struct xfs_trans *tp) 3319 + { 3320 + return tp->t_ticket->t_tid; 3321 + } 3322 + 3349 3323 /* 3350 3324 * Allocate and initialise a new log ticket. 3351 3325 */ 3352 - STATIC xlog_ticket_t * 3326 + xlog_ticket_t * 3353 3327 xlog_ticket_alloc( 3354 3328 struct log *log, 3355 3329 int unit_bytes, 3356 3330 int cnt, 3357 3331 char client, 3358 - uint xflags) 3332 + uint xflags, 3333 + int alloc_flags) 3359 3334 { 3360 3335 struct xlog_ticket *tic; 3361 3336 uint num_headers; 3362 3337 int iclog_space; 3363 3338 3364 - tic = kmem_zone_zalloc(xfs_log_ticket_zone, KM_SLEEP|KM_MAYFAIL); 3339 + tic = kmem_zone_zalloc(xfs_log_ticket_zone, alloc_flags); 3365 3340 if (!tic) 3366 3341 return NULL; 3367 3342 ··· 3688 3647 * c. nothing new gets queued up after (a) and (b) are done. 3689 3648 * d. if !logerror, flush the iclogs to disk, then seal them off 3690 3649 * for business. 3650 + * 3651 + * Note: for delayed logging the !logerror case needs to flush the regions 3652 + * held in memory out to the iclogs before flushing them to disk. 
This needs 3653 + * to be done before the log is marked as shutdown, otherwise the flush to the 3654 + * iclogs will fail. 3691 3655 */ 3692 3656 int 3693 3657 xfs_log_force_umount( ··· 3726 3680 return 1; 3727 3681 } 3728 3682 retval = 0; 3683 + 3684 + /* 3685 + * Flush the in memory commit item list before marking the log as 3686 + * being shut down. We need to do it in this order to ensure all the 3687 + * completed transactions are flushed to disk with the xfs_log_force() 3688 + * call below. 3689 + */ 3690 + if (!logerror && (mp->m_flags & XFS_MOUNT_DELAYLOG)) 3691 + xlog_cil_push(log, 1); 3692 + 3729 3693 /* 3730 3694 * We must hold both the GRANT lock and the LOG lock, 3731 3695 * before we mark the filesystem SHUTDOWN and wake
+12 -2
fs/xfs/xfs_log.h
··· 19 19 #define __XFS_LOG_H__ 20 20 21 21 /* get lsn fields */ 22 - 23 22 #define CYCLE_LSN(lsn) ((uint)((lsn)>>32)) 24 23 #define BLOCK_LSN(lsn) ((uint)(lsn)) 25 24 ··· 113 114 struct xfs_log_vec *lv_next; /* next lv in build list */ 114 115 int lv_niovecs; /* number of iovecs in lv */ 115 116 struct xfs_log_iovec *lv_iovecp; /* iovec array */ 117 + struct xfs_log_item *lv_item; /* owner */ 118 + char *lv_buf; /* formatted buffer */ 119 + int lv_buf_len; /* size of formatted buffer */ 116 120 }; 117 121 118 122 /* ··· 136 134 struct xlog_ticket; 137 135 struct xfs_log_item; 138 136 struct xfs_item_ops; 137 + struct xfs_trans; 139 138 140 139 void xfs_log_item_init(struct xfs_mount *mp, 141 140 struct xfs_log_item *item, ··· 190 187 191 188 void xlog_iodone(struct xfs_buf *); 192 189 193 - struct xlog_ticket * xfs_log_ticket_get(struct xlog_ticket *ticket); 190 + struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket); 194 191 void xfs_log_ticket_put(struct xlog_ticket *ticket); 192 + 193 + xlog_tid_t xfs_log_get_trans_ident(struct xfs_trans *tp); 194 + 195 + int xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp, 196 + struct xfs_log_vec *log_vector, 197 + xfs_lsn_t *commit_lsn, int flags); 198 + bool xfs_log_item_in_current_chkpt(struct xfs_log_item *lip); 195 199 196 200 #endif 197 201
+725
fs/xfs/xfs_log_cil.c
··· 1 + /* 2 + * Copyright (c) 2010 Red Hat, Inc. All Rights Reserved. 3 + * 4 + * This program is free software; you can redistribute it and/or 5 + * modify it under the terms of the GNU General Public License as 6 + * published by the Free Software Foundation. 7 + * 8 + * This program is distributed in the hope that it would be useful, 9 + * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 + * GNU General Public License for more details. 12 + * 13 + * You should have received a copy of the GNU General Public License 14 + * along with this program; if not, write the Free Software Foundation, 15 + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA 16 + */ 17 + 18 + #include "xfs.h" 19 + #include "xfs_fs.h" 20 + #include "xfs_types.h" 21 + #include "xfs_bit.h" 22 + #include "xfs_log.h" 23 + #include "xfs_inum.h" 24 + #include "xfs_trans.h" 25 + #include "xfs_trans_priv.h" 26 + #include "xfs_log_priv.h" 27 + #include "xfs_sb.h" 28 + #include "xfs_ag.h" 29 + #include "xfs_dir2.h" 30 + #include "xfs_dmapi.h" 31 + #include "xfs_mount.h" 32 + #include "xfs_error.h" 33 + #include "xfs_alloc.h" 34 + 35 + /* 36 + * Perform initial CIL structure initialisation. If the CIL is not 37 + * enabled in this filesystem, ensure the log->l_cilp is null so 38 + * we can check this conditional to determine if we are doing delayed 39 + * logging or not. 
40 + */ 41 + int 42 + xlog_cil_init( 43 + struct log *log) 44 + { 45 + struct xfs_cil *cil; 46 + struct xfs_cil_ctx *ctx; 47 + 48 + log->l_cilp = NULL; 49 + if (!(log->l_mp->m_flags & XFS_MOUNT_DELAYLOG)) 50 + return 0; 51 + 52 + cil = kmem_zalloc(sizeof(*cil), KM_SLEEP|KM_MAYFAIL); 53 + if (!cil) 54 + return ENOMEM; 55 + 56 + ctx = kmem_zalloc(sizeof(*ctx), KM_SLEEP|KM_MAYFAIL); 57 + if (!ctx) { 58 + kmem_free(cil); 59 + return ENOMEM; 60 + } 61 + 62 + INIT_LIST_HEAD(&cil->xc_cil); 63 + INIT_LIST_HEAD(&cil->xc_committing); 64 + spin_lock_init(&cil->xc_cil_lock); 65 + init_rwsem(&cil->xc_ctx_lock); 66 + sv_init(&cil->xc_commit_wait, SV_DEFAULT, "cilwait"); 67 + 68 + INIT_LIST_HEAD(&ctx->committing); 69 + INIT_LIST_HEAD(&ctx->busy_extents); 70 + ctx->sequence = 1; 71 + ctx->cil = cil; 72 + cil->xc_ctx = ctx; 73 + 74 + cil->xc_log = log; 75 + log->l_cilp = cil; 76 + return 0; 77 + } 78 + 79 + void 80 + xlog_cil_destroy( 81 + struct log *log) 82 + { 83 + if (!log->l_cilp) 84 + return; 85 + 86 + if (log->l_cilp->xc_ctx) { 87 + if (log->l_cilp->xc_ctx->ticket) 88 + xfs_log_ticket_put(log->l_cilp->xc_ctx->ticket); 89 + kmem_free(log->l_cilp->xc_ctx); 90 + } 91 + 92 + ASSERT(list_empty(&log->l_cilp->xc_cil)); 93 + kmem_free(log->l_cilp); 94 + } 95 + 96 + /* 97 + * Allocate a new ticket. Failing to get a new ticket makes it really hard to 98 + * recover, so we don't allow failure here. Also, we allocate in a context that 99 + * we don't want to be issuing transactions from, so we need to tell the 100 + * allocation code this as well. 101 + * 102 + * We don't reserve any space for the ticket - we are going to steal whatever 103 + * space we require from transactions as they commit. To ensure we reserve all 104 + * the space required, we need to set the current reservation of the ticket to 105 + * zero so that we know to steal the initial transaction overhead from the 106 + * first transaction commit. 
107 + */ 108 + static struct xlog_ticket * 109 + xlog_cil_ticket_alloc( 110 + struct log *log) 111 + { 112 + struct xlog_ticket *tic; 113 + 114 + tic = xlog_ticket_alloc(log, 0, 1, XFS_TRANSACTION, 0, 115 + KM_SLEEP|KM_NOFS); 116 + tic->t_trans_type = XFS_TRANS_CHECKPOINT; 117 + 118 + /* 119 + * set the current reservation to zero so we know to steal the basic 120 + * transaction overhead reservation from the first transaction commit. 121 + */ 122 + tic->t_curr_res = 0; 123 + return tic; 124 + } 125 + 126 + /* 127 + * After the first stage of log recovery is done, we know where the head and 128 + * tail of the log are. We need this log initialisation done before we can 129 + * initialise the first CIL checkpoint context. 130 + * 131 + * Here we allocate a log ticket to track space usage during a CIL push. This 132 + * ticket is passed to xlog_write() directly so that we don't slowly leak log 133 + * space by failing to account for space used by log headers and additional 134 + * region headers for split regions. 135 + */ 136 + void 137 + xlog_cil_init_post_recovery( 138 + struct log *log) 139 + { 140 + if (!log->l_cilp) 141 + return; 142 + 143 + log->l_cilp->xc_ctx->ticket = xlog_cil_ticket_alloc(log); 144 + log->l_cilp->xc_ctx->sequence = 1; 145 + log->l_cilp->xc_ctx->commit_lsn = xlog_assign_lsn(log->l_curr_cycle, 146 + log->l_curr_block); 147 + } 148 + 149 + /* 150 + * Insert the log item into the CIL and calculate the difference in space 151 + * consumed by the item. Add the space to the checkpoint ticket and calculate 152 + * if the change requires additional log metadata. If it does, take that space 153 + * as well. Remove the amount of space we added to the checkpoint ticket from 154 + * the current transaction ticket so that the accounting works out correctly.
155 + * 156 + * If this is the first time the item is being placed into the CIL in this 157 + * context, pin it so it can't be written to disk until the CIL is flushed to 158 + * the iclog and the iclog written to disk. 159 + */ 160 + static void 161 + xlog_cil_insert( 162 + struct log *log, 163 + struct xlog_ticket *ticket, 164 + struct xfs_log_item *item, 165 + struct xfs_log_vec *lv) 166 + { 167 + struct xfs_cil *cil = log->l_cilp; 168 + struct xfs_log_vec *old = lv->lv_item->li_lv; 169 + struct xfs_cil_ctx *ctx = cil->xc_ctx; 170 + int len; 171 + int diff_iovecs; 172 + int iclog_space; 173 + 174 + if (old) { 175 + /* existing lv on log item, space used is a delta */ 176 + ASSERT(!list_empty(&item->li_cil)); 177 + ASSERT(old->lv_buf && old->lv_buf_len && old->lv_niovecs); 178 + 179 + len = lv->lv_buf_len - old->lv_buf_len; 180 + diff_iovecs = lv->lv_niovecs - old->lv_niovecs; 181 + kmem_free(old->lv_buf); 182 + kmem_free(old); 183 + } else { 184 + /* new lv, must pin the log item */ 185 + ASSERT(!lv->lv_item->li_lv); 186 + ASSERT(list_empty(&item->li_cil)); 187 + 188 + len = lv->lv_buf_len; 189 + diff_iovecs = lv->lv_niovecs; 190 + IOP_PIN(lv->lv_item); 191 + 192 + } 193 + len += diff_iovecs * sizeof(xlog_op_header_t); 194 + 195 + /* attach new log vector to log item */ 196 + lv->lv_item->li_lv = lv; 197 + 198 + spin_lock(&cil->xc_cil_lock); 199 + list_move_tail(&item->li_cil, &cil->xc_cil); 200 + ctx->nvecs += diff_iovecs; 201 + 202 + /* 203 + * If this is the first time the item is being committed to the CIL, 204 + * store the sequence number on the log item so we can tell 205 + * in future commits whether this is the first checkpoint the item is 206 + * being committed into. 207 + */ 208 + if (!item->li_seq) 209 + item->li_seq = ctx->sequence; 210 + 211 + /* 212 + * Now transfer enough transaction reservation to the context ticket 213 + * for the checkpoint. 
The context ticket is special - the unit 214 + * reservation has to grow as well as the current reservation as we 215 + * steal from tickets so we can correctly determine the space used 216 + * during the transaction commit. 217 + */ 218 + if (ctx->ticket->t_curr_res == 0) { 219 + /* first commit in checkpoint, steal the header reservation */ 220 + ASSERT(ticket->t_curr_res >= ctx->ticket->t_unit_res + len); 221 + ctx->ticket->t_curr_res = ctx->ticket->t_unit_res; 222 + ticket->t_curr_res -= ctx->ticket->t_unit_res; 223 + } 224 + 225 + /* do we need space for more log record headers? */ 226 + iclog_space = log->l_iclog_size - log->l_iclog_hsize; 227 + if (len > 0 && (ctx->space_used / iclog_space != 228 + (ctx->space_used + len) / iclog_space)) { 229 + int hdrs; 230 + 231 + hdrs = (len + iclog_space - 1) / iclog_space; 232 + /* need to take into account split region headers, too */ 233 + hdrs *= log->l_iclog_hsize + sizeof(struct xlog_op_header); 234 + ctx->ticket->t_unit_res += hdrs; 235 + ctx->ticket->t_curr_res += hdrs; 236 + ticket->t_curr_res -= hdrs; 237 + ASSERT(ticket->t_curr_res >= len); 238 + } 239 + ticket->t_curr_res -= len; 240 + ctx->space_used += len; 241 + 242 + spin_unlock(&cil->xc_cil_lock); 243 + } 244 + 245 + /* 246 + * Format log items into flat buffers 247 + * 248 + * For delayed logging, we need to hold a formatted buffer containing all the 249 + * changes on the log item. This enables us to relog the item in memory and 250 + * write it out asynchronously without needing to relock the object that was 251 + * modified at the time it gets written into the iclog. 252 + * 253 + * This function builds a vector for the changes in each log item in the 254 + * transaction. It then works out the length of the buffer needed for each log 255 + * item, allocates them and formats the vector for the item into the buffer.
The buffer is then attached to the log item, which is then inserted into the 257 + * Committed Item List for tracking until the next checkpoint is written out. 258 + * 259 + * We don't set up region headers during this process; we simply copy the 260 + * regions into the flat buffer. We can do this because we still have to do a 261 + * formatting step to write the regions into the iclog buffer. Writing the 262 + * ophdrs during the iclog write means that we can support splitting large 263 + * regions across iclog boundaries without needing a change in the format of the 264 + * item/region encapsulation. 265 + * 266 + * Hence what we need to do now is rewrite the vector array to point 267 + * to the copied region inside the buffer we just allocated. This allows us to 268 + * format the regions into the iclog as though they are being formatted 269 + * directly out of the objects themselves. 270 + */ 271 + static void 272 + xlog_cil_format_items( 273 + struct log *log, 274 + struct xfs_log_vec *log_vector, 275 + struct xlog_ticket *ticket, 276 + xfs_lsn_t *start_lsn) 277 + { 278 + struct xfs_log_vec *lv; 279 + 280 + if (start_lsn) 281 + *start_lsn = log->l_cilp->xc_ctx->sequence; 282 + 283 + ASSERT(log_vector); 284 + for (lv = log_vector; lv; lv = lv->lv_next) { 285 + void *ptr; 286 + int index; 287 + int len = 0; 288 + 289 + /* build the vector array and calculate its length */ 290 + IOP_FORMAT(lv->lv_item, lv->lv_iovecp); 291 + for (index = 0; index < lv->lv_niovecs; index++) 292 + len += lv->lv_iovecp[index].i_len; 293 + 294 + lv->lv_buf_len = len; 295 + lv->lv_buf = kmem_zalloc(lv->lv_buf_len, KM_SLEEP|KM_NOFS); 296 + ptr = lv->lv_buf; 297 + 298 + for (index = 0; index < lv->lv_niovecs; index++) { 299 + struct xfs_log_iovec *vec = &lv->lv_iovecp[index]; 300 + 301 + memcpy(ptr, vec->i_addr, vec->i_len); 302 + vec->i_addr = ptr; 303 + ptr += vec->i_len; 304 + } 305 + ASSERT(ptr == lv->lv_buf + lv->lv_buf_len); 306 + 307 + xlog_cil_insert(log, ticket,
lv->lv_item, lv); 308 + } 309 + } 310 + 311 + static void 312 + xlog_cil_free_logvec( 313 + struct xfs_log_vec *log_vector) 314 + { 315 + struct xfs_log_vec *lv; 316 + 317 + for (lv = log_vector; lv; ) { 318 + struct xfs_log_vec *next = lv->lv_next; 319 + kmem_free(lv->lv_buf); 320 + kmem_free(lv); 321 + lv = next; 322 + } 323 + } 324 + 325 + /* 326 + * Commit a transaction with the given vector to the Committed Item List. 327 + * 328 + * To do this, we need to format the item, pin it in memory if required and 329 + * account for the space used by the transaction. Once we have done that we 330 + * need to release the unused reservation for the transaction, attach the 331 + * transaction to the checkpoint context so we carry the busy extents through 332 + * to checkpoint completion, and then unlock all the items in the transaction. 333 + * 334 + * For more specific information about the order of operations in 335 + * xfs_log_commit_cil() please refer to the comments in 336 + * xfs_trans_commit_iclog(). 337 + * 338 + * Called with the context lock already held in read mode to lock out 339 + * background commit, returns without it held once background commits are 340 + * allowed again. 
341 + */ 342 + int 343 + xfs_log_commit_cil( 344 + struct xfs_mount *mp, 345 + struct xfs_trans *tp, 346 + struct xfs_log_vec *log_vector, 347 + xfs_lsn_t *commit_lsn, 348 + int flags) 349 + { 350 + struct log *log = mp->m_log; 351 + int log_flags = 0; 352 + int push = 0; 353 + 354 + if (flags & XFS_TRANS_RELEASE_LOG_RES) 355 + log_flags = XFS_LOG_REL_PERM_RESERV; 356 + 357 + if (XLOG_FORCED_SHUTDOWN(log)) { 358 + xlog_cil_free_logvec(log_vector); 359 + return XFS_ERROR(EIO); 360 + } 361 + 362 + /* lock out background commit */ 363 + down_read(&log->l_cilp->xc_ctx_lock); 364 + xlog_cil_format_items(log, log_vector, tp->t_ticket, commit_lsn); 365 + 366 + /* check we didn't blow the reservation */ 367 + if (tp->t_ticket->t_curr_res < 0) 368 + xlog_print_tic_res(log->l_mp, tp->t_ticket); 369 + 370 + /* attach the transaction to the CIL if it has any busy extents */ 371 + if (!list_empty(&tp->t_busy)) { 372 + spin_lock(&log->l_cilp->xc_cil_lock); 373 + list_splice_init(&tp->t_busy, 374 + &log->l_cilp->xc_ctx->busy_extents); 375 + spin_unlock(&log->l_cilp->xc_cil_lock); 376 + } 377 + 378 + tp->t_commit_lsn = *commit_lsn; 379 + xfs_log_done(mp, tp->t_ticket, NULL, log_flags); 380 + xfs_trans_unreserve_and_mod_sb(tp); 381 + 382 + /* check for background commit before unlock */ 383 + if (log->l_cilp->xc_ctx->space_used > XLOG_CIL_SPACE_LIMIT(log)) 384 + push = 1; 385 + up_read(&log->l_cilp->xc_ctx_lock); 386 + 387 + /* 388 + * We need to push the CIL every so often so we don't cache more than we 389 + * can fit in the log. The limit really is that a checkpoint can't be 390 + * more than half the log (the current checkpoint is not allowed to 391 + * overwrite the previous checkpoint), but commit latency and memory 392 + * usage limit this to a smaller size in most cases. 393 + */ 394 + if (push) 395 + xlog_cil_push(log, 0); 396 + return 0; 397 + } 398 + 399 + /* 400 + * Mark all items committed and clear busy extents.
We free the log vector 401 + * chains in a separate pass so that we unpin the log items as quickly as 402 + * possible. 403 + */ 404 + static void 405 + xlog_cil_committed( 406 + void *args, 407 + int abort) 408 + { 409 + struct xfs_cil_ctx *ctx = args; 410 + struct xfs_log_vec *lv; 411 + int abortflag = abort ? XFS_LI_ABORTED : 0; 412 + struct xfs_busy_extent *busyp, *n; 413 + 414 + /* unpin all the log items */ 415 + for (lv = ctx->lv_chain; lv; lv = lv->lv_next) { 416 + xfs_trans_item_committed(lv->lv_item, ctx->start_lsn, 417 + abortflag); 418 + } 419 + 420 + list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list) 421 + xfs_alloc_busy_clear(ctx->cil->xc_log->l_mp, busyp); 422 + 423 + spin_lock(&ctx->cil->xc_cil_lock); 424 + list_del(&ctx->committing); 425 + spin_unlock(&ctx->cil->xc_cil_lock); 426 + 427 + xlog_cil_free_logvec(ctx->lv_chain); 428 + kmem_free(ctx); 429 + } 430 + 431 + /* 432 + * Push the Committed Item List to the log. If the push_now flag is not set, 433 + * then it is a background flush and so we can choose to ignore it.
434 + */ 435 + int 436 + xlog_cil_push( 437 + struct log *log, 438 + int push_now) 439 + { 440 + struct xfs_cil *cil = log->l_cilp; 441 + struct xfs_log_vec *lv; 442 + struct xfs_cil_ctx *ctx; 443 + struct xfs_cil_ctx *new_ctx; 444 + struct xlog_in_core *commit_iclog; 445 + struct xlog_ticket *tic; 446 + int num_lv; 447 + int num_iovecs; 448 + int len; 449 + int error = 0; 450 + struct xfs_trans_header thdr; 451 + struct xfs_log_iovec lhdr; 452 + struct xfs_log_vec lvhdr = { NULL }; 453 + xfs_lsn_t commit_lsn; 454 + 455 + if (!cil) 456 + return 0; 457 + 458 + new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_SLEEP|KM_NOFS); 459 + new_ctx->ticket = xlog_cil_ticket_alloc(log); 460 + 461 + /* lock out transaction commit, but don't block on background push */ 462 + if (!down_write_trylock(&cil->xc_ctx_lock)) { 463 + if (!push_now) 464 + goto out_free_ticket; 465 + down_write(&cil->xc_ctx_lock); 466 + } 467 + ctx = cil->xc_ctx; 468 + 469 + /* check if we've anything to push */ 470 + if (list_empty(&cil->xc_cil)) 471 + goto out_skip; 472 + 473 + /* check for spurious background flush */ 474 + if (!push_now && cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) 475 + goto out_skip; 476 + 477 + /* 478 + * pull all the log vectors off the items in the CIL, and 479 + * remove the items from the CIL. We don't need the CIL lock 480 + * here because it's only needed on the transaction commit 481 + * side which is currently locked out by the flush lock. 
482 + */ 483 + lv = NULL; 484 + num_lv = 0; 485 + num_iovecs = 0; 486 + len = 0; 487 + while (!list_empty(&cil->xc_cil)) { 488 + struct xfs_log_item *item; 489 + int i; 490 + 491 + item = list_first_entry(&cil->xc_cil, 492 + struct xfs_log_item, li_cil); 493 + list_del_init(&item->li_cil); 494 + if (!ctx->lv_chain) 495 + ctx->lv_chain = item->li_lv; 496 + else 497 + lv->lv_next = item->li_lv; 498 + lv = item->li_lv; 499 + item->li_lv = NULL; 500 + 501 + num_lv++; 502 + num_iovecs += lv->lv_niovecs; 503 + for (i = 0; i < lv->lv_niovecs; i++) 504 + len += lv->lv_iovecp[i].i_len; 505 + } 506 + 507 + /* 508 + * initialise the new context and attach it to the CIL. Then attach 509 + * the current context to the CIL committing list so it can be found 510 + * during log forces to extract the commit lsn of the sequence that 511 + * needs to be forced. 512 + */ 513 + INIT_LIST_HEAD(&new_ctx->committing); 514 + INIT_LIST_HEAD(&new_ctx->busy_extents); 515 + new_ctx->sequence = ctx->sequence + 1; 516 + new_ctx->cil = cil; 517 + cil->xc_ctx = new_ctx; 518 + 519 + /* 520 + * The switch is now done, so we can drop the context lock and move out 521 + * of a shared context. We can't just go straight to the commit record, 522 + * though - we need to synchronise with previous and future commits so 523 + * that the commit records are correctly ordered in the log to ensure 524 + * that we process items during log IO completion in the correct order. 525 + * 526 + * For example, if we get an EFI in one checkpoint and the EFD in the 527 + * next (e.g. due to log forces), we do not want the checkpoint with 528 + * the EFD to be committed before the checkpoint with the EFI. Hence 529 + * we must strictly order the commit records of the checkpoints so 530 + * that: a) the checkpoint callbacks are attached to the iclogs in the 531 + * correct order; and b) the checkpoints are replayed in correct order 532 + * in log recovery.
533 + * 534 + * Hence we need to add this context to the committing context list so 535 + * that higher sequences will wait for us to write out a commit record 536 + * before they do. 537 + */ 538 + spin_lock(&cil->xc_cil_lock); 539 + list_add(&ctx->committing, &cil->xc_committing); 540 + spin_unlock(&cil->xc_cil_lock); 541 + up_write(&cil->xc_ctx_lock); 542 + 543 + /* 544 + * Build a checkpoint transaction header and write it to the log to 545 + * begin the transaction. We need to account for the space used by the 546 + * transaction header here as it is not accounted for in xlog_write(). 547 + * 548 + * The LSN we need to pass to the log items on transaction commit is 549 + * the LSN reported by the first log vector write. If we use the commit 550 + * record lsn then we can move the tail beyond the grant write head. 551 + */ 552 + tic = ctx->ticket; 553 + thdr.th_magic = XFS_TRANS_HEADER_MAGIC; 554 + thdr.th_type = XFS_TRANS_CHECKPOINT; 555 + thdr.th_tid = tic->t_tid; 556 + thdr.th_num_items = num_iovecs; 557 + lhdr.i_addr = (xfs_caddr_t)&thdr; 558 + lhdr.i_len = sizeof(xfs_trans_header_t); 559 + lhdr.i_type = XLOG_REG_TYPE_TRANSHDR; 560 + tic->t_curr_res -= lhdr.i_len + sizeof(xlog_op_header_t); 561 + 562 + lvhdr.lv_niovecs = 1; 563 + lvhdr.lv_iovecp = &lhdr; 564 + lvhdr.lv_next = ctx->lv_chain; 565 + 566 + error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0); 567 + if (error) 568 + goto out_abort; 569 + 570 + /* 571 + * now that we've written the checkpoint into the log, strictly 572 + * order the commit records so replay will get them in the right order. 573 + */ 574 + restart: 575 + spin_lock(&cil->xc_cil_lock); 576 + list_for_each_entry(new_ctx, &cil->xc_committing, committing) { 577 + /* 578 + * Higher sequences will wait for this one so skip them. 579 + * Don't wait for our own sequence, either. 580 + */ 581 + if (new_ctx->sequence >= ctx->sequence) 582 + continue; 583 + if (!new_ctx->commit_lsn) { 584 + /* 585 + * It is still being pushed!
Wait for the push to 586 + * complete, then start again from the beginning. 587 + */ 588 + sv_wait(&cil->xc_commit_wait, 0, &cil->xc_cil_lock, 0); 589 + goto restart; 590 + } 591 + } 592 + spin_unlock(&cil->xc_cil_lock); 593 + 594 + commit_lsn = xfs_log_done(log->l_mp, tic, &commit_iclog, 0); 595 + if (error || commit_lsn == -1) 596 + goto out_abort; 597 + 598 + /* attach all the transactions w/ busy extents to iclog */ 599 + ctx->log_cb.cb_func = xlog_cil_committed; 600 + ctx->log_cb.cb_arg = ctx; 601 + error = xfs_log_notify(log->l_mp, commit_iclog, &ctx->log_cb); 602 + if (error) 603 + goto out_abort; 604 + 605 + /* 606 + * now the checkpoint commit is complete and we've attached the 607 + * callbacks to the iclog we can assign the commit LSN to the context 608 + * and wake up anyone who is waiting for the commit to complete. 609 + */ 610 + spin_lock(&cil->xc_cil_lock); 611 + ctx->commit_lsn = commit_lsn; 612 + sv_broadcast(&cil->xc_commit_wait); 613 + spin_unlock(&cil->xc_cil_lock); 614 + 615 + /* release the hounds! */ 616 + return xfs_log_release_iclog(log->l_mp, commit_iclog); 617 + 618 + out_skip: 619 + up_write(&cil->xc_ctx_lock); 620 + out_free_ticket: 621 + xfs_log_ticket_put(new_ctx->ticket); 622 + kmem_free(new_ctx); 623 + return 0; 624 + 625 + out_abort: 626 + xlog_cil_committed(ctx, XFS_LI_ABORTED); 627 + return XFS_ERROR(EIO); 628 + } 629 + 630 + /* 631 + * Conditionally push the CIL based on the sequence passed in. 632 + * 633 + * We only need to push if we haven't already pushed the sequence 634 + * number given. Hence the only time we will trigger a push here is 635 + * if the push sequence is the same as the current context. 636 + * 637 + * We return the current commit lsn to allow the callers to determine if an 638 + * iclog flush is necessary following this call. 639 + * 640 + * XXX: Initially, just push the CIL unconditionally and return whatever 641 + * commit lsn is there. It'll be empty, so this is broken for now.
642 + */ 643 + xfs_lsn_t 644 + xlog_cil_push_lsn( 645 + struct log *log, 646 + xfs_lsn_t push_seq) 647 + { 648 + struct xfs_cil *cil = log->l_cilp; 649 + struct xfs_cil_ctx *ctx; 650 + xfs_lsn_t commit_lsn = NULLCOMMITLSN; 651 + 652 + restart: 653 + down_write(&cil->xc_ctx_lock); 654 + ASSERT(push_seq <= cil->xc_ctx->sequence); 655 + 656 + /* check to see if we need to force out the current context */ 657 + if (push_seq == cil->xc_ctx->sequence) { 658 + up_write(&cil->xc_ctx_lock); 659 + xlog_cil_push(log, 1); 660 + goto restart; 661 + } 662 + 663 + /* 664 + * See if we can find a previous sequence still committing. 665 + * We can drop the flush lock as soon as we have the cil lock 666 + * because we are now only comparing contexts protected by 667 + * the cil lock. 668 + * 669 + * We need to wait for all previous sequence commits to complete 670 + * before allowing the force of push_seq to go ahead. Hence block 671 + * on commits for those as well. 672 + */ 673 + spin_lock(&cil->xc_cil_lock); 674 + up_write(&cil->xc_ctx_lock); 675 + list_for_each_entry(ctx, &cil->xc_committing, committing) { 676 + if (ctx->sequence > push_seq) 677 + continue; 678 + if (!ctx->commit_lsn) { 679 + /* 680 + * It is still being pushed! Wait for the push to 681 + * complete, then start again from the beginning. 682 + */ 683 + sv_wait(&cil->xc_commit_wait, 0, &cil->xc_cil_lock, 0); 684 + goto restart; 685 + } 686 + if (ctx->sequence != push_seq) 687 + continue; 688 + /* found it! */ 689 + commit_lsn = ctx->commit_lsn; 690 + } 691 + spin_unlock(&cil->xc_cil_lock); 692 + return commit_lsn; 693 + } 694 + 695 + /* 696 + * Check if the current log item was first committed in this sequence. 697 + * We can't rely on just the log item being in the CIL, we have to check 698 + * the recorded commit sequence number. 699 + * 700 + * Note: for this to be used in a non-racy manner, it has to be called with 701 + * CIL flushing locked out. 
As a result, it should only be used during the 702 + * transaction commit process when deciding what to format into the item. 703 + */ 704 + bool 705 + xfs_log_item_in_current_chkpt( 706 + struct xfs_log_item *lip) 707 + { 708 + struct xfs_cil_ctx *ctx; 709 + 710 + if (!(lip->li_mountp->m_flags & XFS_MOUNT_DELAYLOG)) 711 + return false; 712 + if (list_empty(&lip->li_cil)) 713 + return false; 714 + 715 + ctx = lip->li_mountp->m_log->l_cilp->xc_ctx; 716 + 717 + /* 718 + * li_seq is written on the first commit of a log item to record the 719 + * first checkpoint it is written to. Hence if it is different to the 720 + * current sequence, we're in a new checkpoint. 721 + */ 722 + if (XFS_LSN_CMP(lip->li_seq, ctx->sequence) != 0) 723 + return false; 724 + return true; 725 + }
+114 -4
fs/xfs/xfs_log_priv.h
··· 152 152 #define XLOG_RECOVERY_NEEDED 0x4 /* log was recovered */ 153 153 #define XLOG_IO_ERROR 0x8 /* log hit an I/O error, and being 154 154 shutdown */ 155 - typedef __uint32_t xlog_tid_t; 156 - 157 155 158 156 #ifdef __KERNEL__ 159 157 /* ··· 377 379 } xlog_in_core_t; 378 380 379 381 /* 382 + * The CIL context is used to aggregate per-transaction details as well as to be 383 + * passed to the iclog for checkpoint post-commit processing. After being 384 + * passed to the iclog, another context needs to be allocated for tracking the 385 + * next set of transactions to be aggregated into a checkpoint. 386 + */ 387 + struct xfs_cil; 388 + 389 + struct xfs_cil_ctx { 390 + struct xfs_cil *cil; 391 + xfs_lsn_t sequence; /* chkpt sequence # */ 392 + xfs_lsn_t start_lsn; /* first LSN of chkpt commit */ 393 + xfs_lsn_t commit_lsn; /* chkpt commit record lsn */ 394 + struct xlog_ticket *ticket; /* chkpt ticket */ 395 + int nvecs; /* number of regions */ 396 + int space_used; /* aggregate size of regions */ 397 + struct list_head busy_extents; /* busy extents in chkpt */ 398 + struct xfs_log_vec *lv_chain; /* logvecs being pushed */ 399 + xfs_log_callback_t log_cb; /* completion callback hook. */ 400 + struct list_head committing; /* ctx committing list */ 401 + }; 402 + 403 + /* 404 + * Committed Item List structure 405 + * 406 + * This structure is used to track log items that have been committed but not 407 + * yet written into the log. It is used only when the delayed logging mount 408 + * option is enabled. 409 + * 410 + * This structure tracks the list of committing checkpoint contexts so 411 + * we can avoid the problem of having to hold out new transactions during a 412 + * flush until we have the commit record LSN of the checkpoint. We can 413 + * traverse the list of committing contexts in xlog_cil_push_lsn() to find a 414 + * sequence match and extract the commit LSN directly from there.
If the 415 + * checkpoint is still in the process of committing, we can block waiting for 416 + * the commit LSN to be determined as well. This should make synchronous 417 + * operations almost as efficient as the old logging methods. 418 + */ 419 + struct xfs_cil { 420 + struct log *xc_log; 421 + struct list_head xc_cil; 422 + spinlock_t xc_cil_lock; 423 + struct xfs_cil_ctx *xc_ctx; 424 + struct rw_semaphore xc_ctx_lock; 425 + struct list_head xc_committing; 426 + sv_t xc_commit_wait; 427 + }; 428 + 429 + /* 430 + * The amount of log space we should allow the CIL to aggregate is difficult 431 + * to size. Whatever we choose, we have to make sure we can get a reservation 432 + * for the log space effectively, that it is large enough to capture sufficient 433 + * relogging to reduce log buffer IO significantly, but that it is not too 434 + * large for the log and does not induce too much latency when writing out 435 + * through the iclogs. We track both space consumed and the number of vectors 436 + * in the checkpoint context, so we 437 + * need to decide which to use for limiting. 438 + * 439 + * Every log buffer we write out during a push needs a header reserved, which 440 + * is at least one sector and more for v2 logs. Hence we need a reservation of 441 + * at least 512 bytes per 32k of log space just for the LR headers. That means 442 + * 16KB of reservation per megabyte of delayed logging space we will consume, 443 + * plus various headers. The number of headers will vary based on the number 444 + * of I/O vectors, so limiting on a specific number of vectors is going to 445 + * result in transactions of varying size. IOWs, it is more consistent to 446 + * track and limit space consumed in the log rather than by the number of 447 + * objects being logged in order to prevent checkpoint ticket overruns. 448 + * 449 + * Further, use of static reservations through the log grant mechanism is 450 + * problematic. It introduces a lot of complexity (e.g.
reserve grant vs write 450 + * grant) and a significant deadlock potential because regranting write space 451 + * can block on log pushes. Hence if we have to regrant log space during a log 452 + * push, we can deadlock. 453 + * 454 + * However, we can avoid this by use of a dynamic "reservation stealing" 455 + * technique during transaction commit whereby unused reservation space in the 456 + * transaction ticket is transferred to the CIL ctx commit ticket to cover the 457 + * space needed by the checkpoint transaction. This means that we never need to 458 + * specifically reserve space for the CIL checkpoint transaction, nor do we 459 + * need to regrant space once the checkpoint completes. This also means the 460 + * checkpoint transaction ticket is specific to the checkpoint context, rather 461 + * than the CIL itself. 462 + * 463 + * With dynamic reservations, we can basically make up arbitrary limits for the 464 + * checkpoint size so long as they don't violate any other size rules. Hence 465 + * the initial maximum size for the checkpoint transaction will be set to a 466 + * quarter of the log or 8MB, whichever is smaller. 8MB is an arbitrary limit 467 + * right now based on the latency of writing out a large amount of data through 468 + * the circular iclog buffers. 469 + */ 470 + 471 + #define XLOG_CIL_SPACE_LIMIT(log) \ 472 + (min((log->l_logsize >> 2), (8 * 1024 * 1024))) 473 + 474 + /* 380 475 * The reservation head lsn is not made up of a cycle number and block number. 381 476 * Instead, it uses a cycle number and byte number.
Logs don't expect to 382 477 * overflow 31 bits worth of byte offset, so using a byte number will mean ··· 479 388 /* The following fields don't need locking */ 480 389 struct xfs_mount *l_mp; /* mount point */ 481 390 struct xfs_ail *l_ailp; /* AIL log is working with */ 391 + struct xfs_cil *l_cilp; /* CIL log is working with */ 482 392 struct xfs_buf *l_xbuf; /* extra buffer for log 483 393 * wrapping */ 484 394 struct xfs_buftarg *l_targ; /* buftarg of log */ ··· 530 438 531 439 #define XLOG_FORCED_SHUTDOWN(log) ((log)->l_flags & XLOG_IO_ERROR) 532 440 533 - 534 441 /* common routines */ 535 442 extern xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp); 536 443 extern int xlog_recover(xlog_t *log); 537 444 extern int xlog_recover_finish(xlog_t *log); 538 445 extern void xlog_pack_data(xlog_t *log, xlog_in_core_t *iclog, int); 539 446 540 - extern kmem_zone_t *xfs_log_ticket_zone; 447 + extern kmem_zone_t *xfs_log_ticket_zone; 448 + struct xlog_ticket *xlog_ticket_alloc(struct log *log, int unit_bytes, 449 + int count, char client, uint xflags, 450 + int alloc_flags); 451 + 541 452 542 453 static inline void 543 454 xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes) ··· 549 454 *len -= bytes; 550 455 *off += bytes; 551 456 } 457 + 458 + void xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket); 459 + int xlog_write(struct log *log, struct xfs_log_vec *log_vector, 460 + struct xlog_ticket *tic, xfs_lsn_t *start_lsn, 461 + xlog_in_core_t **commit_iclog, uint flags); 462 + 463 + /* 464 + * Committed Item List interfaces 465 + */ 466 + int xlog_cil_init(struct log *log); 467 + void xlog_cil_init_post_recovery(struct log *log); 468 + void xlog_cil_destroy(struct log *log); 469 + 470 + int xlog_cil_push(struct log *log, int push_now); 471 + xfs_lsn_t xlog_cil_push_lsn(struct log *log, xfs_lsn_t push_sequence); 552 472 553 473 /* 554 474 * Unmount record type is used as a pseudo transaction type for the ticket.
+23 -23
fs/xfs/xfs_log_recover.c
··· 1576 1576 1577 1577 switch (ITEM_TYPE(item)) { 1578 1578 case XFS_LI_BUF: 1579 - if (!(buf_f->blf_flags & XFS_BLI_CANCEL)) { 1579 + if (!(buf_f->blf_flags & XFS_BLF_CANCEL)) { 1580 1580 trace_xfs_log_recover_item_reorder_head(log, 1581 1581 trans, item, pass); 1582 1582 list_move(&item->ri_list, &trans->r_itemq); ··· 1638 1638 /* 1639 1639 * If this isn't a cancel buffer item, then just return. 1640 1640 */ 1641 - if (!(flags & XFS_BLI_CANCEL)) { 1641 + if (!(flags & XFS_BLF_CANCEL)) { 1642 1642 trace_xfs_log_recover_buf_not_cancel(log, buf_f); 1643 1643 return; 1644 1644 } ··· 1696 1696 * Check to see whether the buffer being recovered has a corresponding 1697 1697 * entry in the buffer cancel record table. If it does then return 1 1698 1698 * so that it will be cancelled, otherwise return 0. If the buffer is 1699 - * actually a buffer cancel item (XFS_BLI_CANCEL is set), then decrement 1699 + * actually a buffer cancel item (XFS_BLF_CANCEL is set), then decrement 1700 1700 * the refcount on the entry in the table and remove it from the table 1701 1701 * if this is the last reference. 1702 1702 * ··· 1721 1721 * There is nothing in the table built in pass one, 1722 1722 * so this buffer must not be cancelled. 1723 1723 */ 1724 - ASSERT(!(flags & XFS_BLI_CANCEL)); 1724 + ASSERT(!(flags & XFS_BLF_CANCEL)); 1725 1725 return 0; 1726 1726 } 1727 1727 ··· 1733 1733 * There is no corresponding entry in the table built 1734 1734 * in pass one, so this buffer has not been cancelled. 1735 1735 */ 1736 - ASSERT(!(flags & XFS_BLI_CANCEL)); 1736 + ASSERT(!(flags & XFS_BLF_CANCEL)); 1737 1737 return 0; 1738 1738 } 1739 1739 ··· 1752 1752 * one in the table and remove it if this is the 1753 1753 * last reference. 
1754 1754 */ 1755 - if (flags & XFS_BLI_CANCEL) { 1755 + if (flags & XFS_BLF_CANCEL) { 1756 1756 bcp->bc_refcount--; 1757 1757 if (bcp->bc_refcount == 0) { 1758 1758 if (prevp == NULL) { ··· 1772 1772 * We didn't find a corresponding entry in the table, so 1773 1773 * return 0 so that the buffer is NOT cancelled. 1774 1774 */ 1775 - ASSERT(!(flags & XFS_BLI_CANCEL)); 1775 + ASSERT(!(flags & XFS_BLF_CANCEL)); 1776 1776 return 0; 1777 1777 } 1778 1778 ··· 1874 1874 nbits = xfs_contig_bits(data_map, map_size, 1875 1875 bit); 1876 1876 ASSERT(nbits > 0); 1877 - reg_buf_offset = bit << XFS_BLI_SHIFT; 1878 - reg_buf_bytes = nbits << XFS_BLI_SHIFT; 1877 + reg_buf_offset = bit << XFS_BLF_SHIFT; 1878 + reg_buf_bytes = nbits << XFS_BLF_SHIFT; 1879 1879 item_index++; 1880 1880 } 1881 1881 ··· 1889 1889 } 1890 1890 1891 1891 ASSERT(item->ri_buf[item_index].i_addr != NULL); 1892 - ASSERT((item->ri_buf[item_index].i_len % XFS_BLI_CHUNK) == 0); 1892 + ASSERT((item->ri_buf[item_index].i_len % XFS_BLF_CHUNK) == 0); 1893 1893 ASSERT((reg_buf_offset + reg_buf_bytes) <= XFS_BUF_COUNT(bp)); 1894 1894 1895 1895 /* ··· 1955 1955 nbits = xfs_contig_bits(data_map, map_size, bit); 1956 1956 ASSERT(nbits > 0); 1957 1957 ASSERT(item->ri_buf[i].i_addr != NULL); 1958 - ASSERT(item->ri_buf[i].i_len % XFS_BLI_CHUNK == 0); 1958 + ASSERT(item->ri_buf[i].i_len % XFS_BLF_CHUNK == 0); 1959 1959 ASSERT(XFS_BUF_COUNT(bp) >= 1960 - ((uint)bit << XFS_BLI_SHIFT)+(nbits<<XFS_BLI_SHIFT)); 1960 + ((uint)bit << XFS_BLF_SHIFT)+(nbits<<XFS_BLF_SHIFT)); 1961 1961 1962 1962 /* 1963 1963 * Do a sanity check if this is a dquot buffer. 
Just checking ··· 1966 1966 */ 1967 1967 error = 0; 1968 1968 if (buf_f->blf_flags & 1969 - (XFS_BLI_UDQUOT_BUF|XFS_BLI_PDQUOT_BUF|XFS_BLI_GDQUOT_BUF)) { 1969 + (XFS_BLF_UDQUOT_BUF|XFS_BLF_PDQUOT_BUF|XFS_BLF_GDQUOT_BUF)) { 1970 1970 if (item->ri_buf[i].i_addr == NULL) { 1971 1971 cmn_err(CE_ALERT, 1972 1972 "XFS: NULL dquot in %s.", __func__); ··· 1987 1987 } 1988 1988 1989 1989 memcpy(xfs_buf_offset(bp, 1990 - (uint)bit << XFS_BLI_SHIFT), /* dest */ 1990 + (uint)bit << XFS_BLF_SHIFT), /* dest */ 1991 1991 item->ri_buf[i].i_addr, /* source */ 1992 - nbits<<XFS_BLI_SHIFT); /* length */ 1992 + nbits<<XFS_BLF_SHIFT); /* length */ 1993 1993 next: 1994 1994 i++; 1995 1995 bit += nbits; ··· 2148 2148 } 2149 2149 2150 2150 type = 0; 2151 - if (buf_f->blf_flags & XFS_BLI_UDQUOT_BUF) 2151 + if (buf_f->blf_flags & XFS_BLF_UDQUOT_BUF) 2152 2152 type |= XFS_DQ_USER; 2153 - if (buf_f->blf_flags & XFS_BLI_PDQUOT_BUF) 2153 + if (buf_f->blf_flags & XFS_BLF_PDQUOT_BUF) 2154 2154 type |= XFS_DQ_PROJ; 2155 - if (buf_f->blf_flags & XFS_BLI_GDQUOT_BUF) 2155 + if (buf_f->blf_flags & XFS_BLF_GDQUOT_BUF) 2156 2156 type |= XFS_DQ_GROUP; 2157 2157 /* 2158 2158 * This type of quotas was turned off, so ignore this buffer ··· 2173 2173 * here which overlaps that may be stale. 2174 2174 * 2175 2175 * When meta-data buffers are freed at run time we log a buffer item 2176 - * with the XFS_BLI_CANCEL bit set to indicate that previous copies 2176 + * with the XFS_BLF_CANCEL bit set to indicate that previous copies 2177 2177 * of the buffer in the log should not be replayed at recovery time. 2178 2178 * This is so that if the blocks covered by the buffer are reused for 2179 2179 * file data before we crash we don't end up replaying old, freed ··· 2207 2207 if (pass == XLOG_RECOVER_PASS1) { 2208 2208 /* 2209 2209 * In this pass we're only looking for buf items 2210 - * with the XFS_BLI_CANCEL bit set. 2210 + * with the XFS_BLF_CANCEL bit set. 
2211 2211 */ 2212 2212 xlog_recover_do_buffer_pass1(log, buf_f); 2213 2213 return 0; ··· 2244 2244 2245 2245 mp = log->l_mp; 2246 2246 buf_flags = XBF_LOCK; 2247 - if (!(flags & XFS_BLI_INODE_BUF)) 2247 + if (!(flags & XFS_BLF_INODE_BUF)) 2248 2248 buf_flags |= XBF_MAPPED; 2249 2249 2250 2250 bp = xfs_buf_read(mp->m_ddev_targp, blkno, len, buf_flags); ··· 2257 2257 } 2258 2258 2259 2259 error = 0; 2260 - if (flags & XFS_BLI_INODE_BUF) { 2260 + if (flags & XFS_BLF_INODE_BUF) { 2261 2261 error = xlog_recover_do_inode_buffer(mp, item, bp, buf_f); 2262 2262 } else if (flags & 2263 - (XFS_BLI_UDQUOT_BUF|XFS_BLI_PDQUOT_BUF|XFS_BLI_GDQUOT_BUF)) { 2263 + (XFS_BLF_UDQUOT_BUF|XFS_BLF_PDQUOT_BUF|XFS_BLF_GDQUOT_BUF)) { 2264 2264 xlog_recover_do_dquot_buffer(mp, log, item, bp, buf_f); 2265 2265 } else { 2266 2266 xlog_recover_do_reg_buffer(mp, item, bp, buf_f);
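The recovery hunks above repeatedly turn bitmap positions into byte ranges with `bit << XFS_BLF_SHIFT`: each set bit in the buf log format's data map marks one dirty 128-byte chunk of the buffer. A minimal sketch of that bitmap-to-region mapping follows; `BLF_SHIFT` of 7 (chunk size 128) mirrors the kernel's `XFS_BLF_SHIFT`, while `next_bit`, `contig_bits`, and `first_dirty_region` are illustrative stand-ins for `xfs_next_bit()`/`xfs_contig_bits()` and the loop in `xlog_recover_do_reg_buffer()`:

```c
#include <stddef.h>

/* One bitmap bit covers a 128-byte chunk (assumed XFS_BLF_SHIFT == 7). */
#define BLF_SHIFT 7
#define BLF_CHUNK (1 << BLF_SHIFT)

/* Index of the next set bit at or after 'start' in a 32-bit map, -1 if none. */
static int next_bit(unsigned int map, int start)
{
    for (int i = start; i < 32; i++)
        if (map & (1u << i))
            return i;
    return -1;
}

/* Length of the contiguous run of set bits starting at 'bit'. */
static int contig_bits(unsigned int map, int bit)
{
    int n = 0;
    while (bit + n < 32 && (map & (1u << (bit + n))))
        n++;
    return n;
}

/*
 * Translate the first dirty run of the chunk bitmap into a byte offset
 * and length, as the recovery code does before memcpy()ing the logged
 * region back into the buffer. Returns 0 when the map is clean.
 */
static int first_dirty_region(unsigned int map, size_t *offset, size_t *len)
{
    int bit = next_bit(map, 0);

    if (bit < 0)
        return 0;
    *offset = (size_t)bit << BLF_SHIFT;
    *len = (size_t)contig_bits(map, bit) << BLF_SHIFT;
    return 1;
}
```

With bits 2 and 3 set, the region starts at byte 256 and spans 256 bytes, matching the `(uint)bit << XFS_BLF_SHIFT` / `nbits << XFS_BLF_SHIFT` arithmetic in the hunk.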
+1 -1
fs/xfs/xfs_log_recover.h
··· 28 28 #define XLOG_RHASH(tid) \ 29 29 ((((__uint32_t)tid)>>XLOG_RHASH_SHIFT) & (XLOG_RHASH_SIZE-1)) 30 30 31 - #define XLOG_MAX_REGIONS_IN_ITEM (XFS_MAX_BLOCKSIZE / XFS_BLI_CHUNK / 2 + 1) 31 + #define XLOG_MAX_REGIONS_IN_ITEM (XFS_MAX_BLOCKSIZE / XFS_BLF_CHUNK / 2 + 1) 32 32 33 33 34 34 /*
+1
fs/xfs/xfs_mount.h
··· 268 268 #define XFS_MOUNT_WSYNC (1ULL << 0) /* for nfs - all metadata ops 269 269 must be synchronous except 270 270 for space allocations */ 271 + #define XFS_MOUNT_DELAYLOG (1ULL << 1) /* delayed logging is enabled */ 271 272 #define XFS_MOUNT_DMAPI (1ULL << 2) /* dmapi is enabled */ 272 273 #define XFS_MOUNT_WAS_CLEAN (1ULL << 3) 273 274 #define XFS_MOUNT_FS_SHUTDOWN (1ULL << 4) /* atomic stop of all filesystem
+108 -36
fs/xfs/xfs_trans.c
··· 44 44 #include "xfs_trans_priv.h" 45 45 #include "xfs_trans_space.h" 46 46 #include "xfs_inode_item.h" 47 + #include "xfs_trace.h" 47 48 48 49 kmem_zone_t *xfs_trans_zone; 49 50 ··· 244 243 tp->t_type = type; 245 244 tp->t_mountp = mp; 246 245 tp->t_items_free = XFS_LIC_NUM_SLOTS; 247 - tp->t_busy_free = XFS_LBC_NUM_SLOTS; 248 246 xfs_lic_init(&(tp->t_items)); 249 - XFS_LBC_INIT(&(tp->t_busy)); 247 + INIT_LIST_HEAD(&tp->t_busy); 250 248 return tp; 251 249 } 252 250 ··· 255 255 */ 256 256 STATIC void 257 257 xfs_trans_free( 258 - xfs_trans_t *tp) 258 + struct xfs_trans *tp) 259 259 { 260 + struct xfs_busy_extent *busyp, *n; 261 + 262 + list_for_each_entry_safe(busyp, n, &tp->t_busy, list) 263 + xfs_alloc_busy_clear(tp->t_mountp, busyp); 264 + 260 265 atomic_dec(&tp->t_mountp->m_active_trans); 261 266 xfs_trans_free_dqinfo(tp); 262 267 kmem_zone_free(xfs_trans_zone, tp); ··· 290 285 ntp->t_type = tp->t_type; 291 286 ntp->t_mountp = tp->t_mountp; 292 287 ntp->t_items_free = XFS_LIC_NUM_SLOTS; 293 - ntp->t_busy_free = XFS_LBC_NUM_SLOTS; 294 288 xfs_lic_init(&(ntp->t_items)); 295 - XFS_LBC_INIT(&(ntp->t_busy)); 289 + INIT_LIST_HEAD(&ntp->t_busy); 296 290 297 291 ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); 298 292 ASSERT(tp->t_ticket != NULL); ··· 426 422 427 423 return error; 428 424 } 429 - 430 425 431 426 /* 432 427 * Record the indicated change to the given field for application ··· 655 652 * XFS_TRANS_SB_DIRTY will not be set when the transaction is updated but we 656 653 * still need to update the incore superblock with the changes. 657 654 */ 658 - STATIC void 655 + void 659 656 xfs_trans_unreserve_and_mod_sb( 660 657 xfs_trans_t *tp) 661 658 { ··· 883 880 * they could be immediately flushed and we'd have to race with the flusher 884 881 * trying to pull the item from the AIL as we add it. 
885 882 */ 886 - static void 883 + void 887 884 xfs_trans_item_committed( 888 885 struct xfs_log_item *lip, 889 886 xfs_lsn_t commit_lsn, ··· 933 930 IOP_UNPIN(lip); 934 931 } 935 932 936 - /* Clear all the per-AG busy list items listed in this transaction */ 937 - static void 938 - xfs_trans_clear_busy_extents( 939 - struct xfs_trans *tp) 940 - { 941 - xfs_log_busy_chunk_t *lbcp; 942 - xfs_log_busy_slot_t *lbsp; 943 - int i; 944 - 945 - for (lbcp = &tp->t_busy; lbcp != NULL; lbcp = lbcp->lbc_next) { 946 - i = 0; 947 - for (lbsp = lbcp->lbc_busy; i < lbcp->lbc_unused; i++, lbsp++) { 948 - if (XFS_LBC_ISFREE(lbcp, i)) 949 - continue; 950 - xfs_alloc_clear_busy(tp, lbsp->lbc_ag, lbsp->lbc_idx); 951 - } 952 - } 953 - xfs_trans_free_busy(tp); 954 - } 955 - 956 933 /* 957 934 * This is typically called by the LM when a transaction has been fully 958 935 * committed to disk. It needs to unpin the items which have ··· 967 984 kmem_free(licp); 968 985 } 969 986 970 - xfs_trans_clear_busy_extents(tp); 971 987 xfs_trans_free(tp); 972 988 } 973 989 ··· 994 1012 xfs_trans_unreserve_and_mod_sb(tp); 995 1013 xfs_trans_unreserve_and_mod_dquots(tp); 996 1014 997 - xfs_trans_free_items(tp, flags); 998 - xfs_trans_free_busy(tp); 1015 + xfs_trans_free_items(tp, NULLCOMMITLSN, flags); 999 1016 xfs_trans_free(tp); 1000 1017 } 1001 1018 ··· 1056 1075 *commit_lsn = xfs_log_done(mp, tp->t_ticket, &commit_iclog, log_flags); 1057 1076 1058 1077 tp->t_commit_lsn = *commit_lsn; 1078 + trace_xfs_trans_commit_lsn(tp); 1079 + 1059 1080 if (nvec > XFS_TRANS_LOGVEC_COUNT) 1060 1081 kmem_free(log_vector); 1061 1082 ··· 1144 1161 return xfs_log_release_iclog(mp, commit_iclog); 1145 1162 } 1146 1163 1164 + /* 1165 + * Walk the log items and allocate log vector structures for 1166 + * each item large enough to fit all the vectors they require. 1167 + * Note that this format differs from the old log vector format in 1168 + * that there is no transaction header in these log vectors. 
1169 + */ 1170 + STATIC struct xfs_log_vec * 1171 + xfs_trans_alloc_log_vecs( 1172 + xfs_trans_t *tp) 1173 + { 1174 + xfs_log_item_desc_t *lidp; 1175 + struct xfs_log_vec *lv = NULL; 1176 + struct xfs_log_vec *ret_lv = NULL; 1177 + 1178 + lidp = xfs_trans_first_item(tp); 1179 + 1180 + /* Bail out if we didn't find a log item. */ 1181 + if (!lidp) { 1182 + ASSERT(0); 1183 + return NULL; 1184 + } 1185 + 1186 + while (lidp != NULL) { 1187 + struct xfs_log_vec *new_lv; 1188 + 1189 + /* Skip items which aren't dirty in this transaction. */ 1190 + if (!(lidp->lid_flags & XFS_LID_DIRTY)) { 1191 + lidp = xfs_trans_next_item(tp, lidp); 1192 + continue; 1193 + } 1194 + 1195 + /* Skip items that do not have any vectors for writing */ 1196 + lidp->lid_size = IOP_SIZE(lidp->lid_item); 1197 + if (!lidp->lid_size) { 1198 + lidp = xfs_trans_next_item(tp, lidp); 1199 + continue; 1200 + } 1201 + 1202 + new_lv = kmem_zalloc(sizeof(*new_lv) + 1203 + lidp->lid_size * sizeof(struct xfs_log_iovec), 1204 + KM_SLEEP); 1205 + 1206 + /* The allocated iovec region lies beyond the log vector. */ 1207 + new_lv->lv_iovecp = (struct xfs_log_iovec *)&new_lv[1]; 1208 + new_lv->lv_niovecs = lidp->lid_size; 1209 + new_lv->lv_item = lidp->lid_item; 1210 + if (!ret_lv) 1211 + ret_lv = new_lv; 1212 + else 1213 + lv->lv_next = new_lv; 1214 + lv = new_lv; 1215 + lidp = xfs_trans_next_item(tp, lidp); 1216 + } 1217 + 1218 + return ret_lv; 1219 + } 1220 + 1221 + static int 1222 + xfs_trans_commit_cil( 1223 + struct xfs_mount *mp, 1224 + struct xfs_trans *tp, 1225 + xfs_lsn_t *commit_lsn, 1226 + int flags) 1227 + { 1228 + struct xfs_log_vec *log_vector; 1229 + int error; 1230 + 1231 + /* 1232 + * Get each log item to allocate a vector structure for 1233 + * the log item to to pass to the log write code. The 1234 + * CIL commit code will format the vector and save it away. 
1235 + */ 1236 + log_vector = xfs_trans_alloc_log_vecs(tp); 1237 + if (!log_vector) 1238 + return ENOMEM; 1239 + 1240 + error = xfs_log_commit_cil(mp, tp, log_vector, commit_lsn, flags); 1241 + if (error) 1242 + return error; 1243 + 1244 + current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS); 1245 + 1246 + /* xfs_trans_free_items() unlocks them first */ 1247 + xfs_trans_free_items(tp, *commit_lsn, 0); 1248 + xfs_trans_free(tp); 1249 + return 0; 1250 + } 1147 1251 1148 1252 /* 1149 1253 * xfs_trans_commit ··· 1291 1221 xfs_trans_apply_sb_deltas(tp); 1292 1222 xfs_trans_apply_dquot_deltas(tp); 1293 1223 1294 - error = xfs_trans_commit_iclog(mp, tp, &commit_lsn, flags); 1224 + if (mp->m_flags & XFS_MOUNT_DELAYLOG) 1225 + error = xfs_trans_commit_cil(mp, tp, &commit_lsn, flags); 1226 + else 1227 + error = xfs_trans_commit_iclog(mp, tp, &commit_lsn, flags); 1228 + 1295 1229 if (error == ENOMEM) { 1296 1230 xfs_force_shutdown(mp, SHUTDOWN_LOG_IO_ERROR); 1297 1231 error = XFS_ERROR(EIO); ··· 1333 1259 error = XFS_ERROR(EIO); 1334 1260 } 1335 1261 current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS); 1336 - xfs_trans_free_items(tp, error ? XFS_TRANS_ABORT : 0); 1337 - xfs_trans_free_busy(tp); 1262 + xfs_trans_free_items(tp, NULLCOMMITLSN, error ? XFS_TRANS_ABORT : 0); 1338 1263 xfs_trans_free(tp); 1339 1264 1340 1265 XFS_STATS_INC(xs_trans_empty); ··· 1411 1338 /* mark this thread as no longer being in a transaction */ 1412 1339 current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS); 1413 1340 1414 - xfs_trans_free_items(tp, flags); 1415 - xfs_trans_free_busy(tp); 1341 + xfs_trans_free_items(tp, NULLCOMMITLSN, flags); 1416 1342 xfs_trans_free(tp); 1417 1343 } 1418 1344
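The new `xfs_trans_alloc_log_vecs()` above uses a single-allocation layout: one `kmem_zalloc()` covers both the `xfs_log_vec` header and its iovec array, and `lv_iovecp` is pointed at `&new_lv[1]`, the first byte past the header. A sketch of that trailing-array pattern with simplified stand-in types (not the real `xfs_log_vec`/`xfs_log_iovec` definitions):

```c
#include <stdlib.h>

/* Simplified stand-ins for xfs_log_iovec / xfs_log_vec. */
struct iovec_s {
    void   *addr;
    size_t  len;
};

struct log_vec {
    struct log_vec *next;
    struct iovec_s *iovecp;   /* points just past this struct */
    int             niovecs;
};

/*
 * One allocation holds the header and its iovec array; the array lies
 * immediately beyond the header, mirroring the &new_lv[1] trick in
 * xfs_trans_alloc_log_vecs(). A single free() releases both.
 */
static struct log_vec *log_vec_alloc(int niovecs)
{
    struct log_vec *lv;

    lv = calloc(1, sizeof(*lv) + (size_t)niovecs * sizeof(struct iovec_s));
    if (!lv)
        return NULL;
    lv->iovecp = (struct iovec_s *)&lv[1];
    lv->niovecs = niovecs;
    return lv;
}
```

The payoff is fewer allocator calls and trivially correct teardown: when the CIL is done with a vector chain, each node and its iovecs go away together.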
+10 -34
fs/xfs/xfs_trans.h
··· 106 106 #define XFS_TRANS_GROWFSRT_FREE 39 107 107 #define XFS_TRANS_SWAPEXT 40 108 108 #define XFS_TRANS_SB_COUNT 41 109 - #define XFS_TRANS_TYPE_MAX 41 109 + #define XFS_TRANS_CHECKPOINT 42 110 + #define XFS_TRANS_TYPE_MAX 42 110 111 /* new transaction types need to be reflected in xfs_logprint(8) */ 111 112 112 113 #define XFS_TRANS_TYPES \ ··· 149 148 { XFS_TRANS_GROWFSRT_FREE, "GROWFSRT_FREE" }, \ 150 149 { XFS_TRANS_SWAPEXT, "SWAPEXT" }, \ 151 150 { XFS_TRANS_SB_COUNT, "SB_COUNT" }, \ 151 + { XFS_TRANS_CHECKPOINT, "CHECKPOINT" }, \ 152 152 { XFS_TRANS_DUMMY1, "DUMMY1" }, \ 153 153 { XFS_TRANS_DUMMY2, "DUMMY2" }, \ 154 154 { XLOG_UNMOUNT_REC_TYPE, "UNMOUNT" } ··· 815 813 struct xfs_mount; 816 814 struct xfs_trans; 817 815 struct xfs_dquot_acct; 816 + struct xfs_busy_extent; 818 817 819 818 typedef struct xfs_log_item { 820 819 struct list_head li_ail; /* AIL pointers */ ··· 831 828 /* buffer item iodone */ 832 829 /* callback func */ 833 830 struct xfs_item_ops *li_ops; /* function list */ 831 + 832 + /* delayed logging */ 833 + struct list_head li_cil; /* CIL pointers */ 834 + struct xfs_log_vec *li_lv; /* active log vector */ 835 + xfs_lsn_t li_seq; /* CIL commit seq */ 834 836 } xfs_log_item_t; 835 837 836 838 #define XFS_LI_IN_AIL 0x1 ··· 878 870 #define XFS_ITEM_PINNED 1 879 871 #define XFS_ITEM_LOCKED 2 880 872 #define XFS_ITEM_PUSHBUF 3 881 - 882 - /* 883 - * This structure is used to maintain a list of block ranges that have been 884 - * freed in the transaction. The ranges are listed in the perag[] busy list 885 - * between when they're freed and the transaction is committed to disk. 
886 - */ 887 - 888 - typedef struct xfs_log_busy_slot { 889 - xfs_agnumber_t lbc_ag; 890 - ushort lbc_idx; /* index in perag.busy[] */ 891 - } xfs_log_busy_slot_t; 892 - 893 - #define XFS_LBC_NUM_SLOTS 31 894 - typedef struct xfs_log_busy_chunk { 895 - struct xfs_log_busy_chunk *lbc_next; 896 - uint lbc_free; /* free slots bitmask */ 897 - ushort lbc_unused; /* first unused */ 898 - xfs_log_busy_slot_t lbc_busy[XFS_LBC_NUM_SLOTS]; 899 - } xfs_log_busy_chunk_t; 900 - 901 - #define XFS_LBC_MAX_SLOT (XFS_LBC_NUM_SLOTS - 1) 902 - #define XFS_LBC_FREEMASK ((1U << XFS_LBC_NUM_SLOTS) - 1) 903 - 904 - #define XFS_LBC_INIT(cp) ((cp)->lbc_free = XFS_LBC_FREEMASK) 905 - #define XFS_LBC_CLAIM(cp, slot) ((cp)->lbc_free &= ~(1 << (slot))) 906 - #define XFS_LBC_SLOT(cp, slot) (&((cp)->lbc_busy[(slot)])) 907 - #define XFS_LBC_VACANCY(cp) (((cp)->lbc_free) & XFS_LBC_FREEMASK) 908 - #define XFS_LBC_ISFREE(cp, slot) ((cp)->lbc_free & (1 << (slot))) 909 873 910 874 /* 911 875 * This is the type of function which can be given to xfs_trans_callback() ··· 930 950 unsigned int t_items_free; /* log item descs free */ 931 951 xfs_log_item_chunk_t t_items; /* first log item desc chunk */ 932 952 xfs_trans_header_t t_header; /* header for in-log trans */ 933 - unsigned int t_busy_free; /* busy descs free */ 934 - xfs_log_busy_chunk_t t_busy; /* busy/async free blocks */ 953 + struct list_head t_busy; /* list of busy extents */ 935 954 unsigned long t_pflags; /* saved process flags state */ 936 955 } xfs_trans_t; 937 956 ··· 1004 1025 void xfs_trans_cancel(xfs_trans_t *, int); 1005 1026 int xfs_trans_ail_init(struct xfs_mount *); 1006 1027 void xfs_trans_ail_destroy(struct xfs_mount *); 1007 - xfs_log_busy_slot_t *xfs_trans_add_busy(xfs_trans_t *tp, 1008 - xfs_agnumber_t ag, 1009 - xfs_extlen_t idx); 1010 1028 1011 1029 extern kmem_zone_t *xfs_trans_zone; 1012 1030
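The hunk above replaces the fixed 31-slot `xfs_log_busy_chunk` arrays with a kernel-style intrusive list: each busy extent carries its own `list_head`, so the transaction's `t_busy` list grows without slot bookkeeping or overflow chunks. A minimal self-contained rendering of that idea (the `busy_extent` payload and helper names are illustrative, not the kernel's):

```c
#include <stddef.h>

/* Minimal doubly-linked list in the style of the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h->prev = h;
}

/* Insert 'entry' right after the head (newest first). */
static void list_add(struct list_head *entry, struct list_head *h)
{
    entry->next = h->next;
    entry->prev = h;
    h->next->prev = entry;
    h->next = entry;
}

static void list_del(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->next = e->prev = NULL;
}

/* Recover the containing object from its embedded list_head. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * A busy extent embeds its own linkage, so INIT_LIST_HEAD(&tp->t_busy)
 * plus list_add/list_del is all the transaction needs.
 */
struct busy_extent {
    unsigned int     agno;   /* hypothetical payload field */
    struct list_head list;
};
```

This is why `xfs_trans_free()` above can simply `list_for_each_entry_safe()` over `tp->t_busy` and clear each extent, where the old code had to scan slot bitmaps across chained chunks.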
+23 -23
fs/xfs/xfs_trans_buf.c
··· 114 114 xfs_buf_item_init(bp, tp->t_mountp); 115 115 bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); 116 116 ASSERT(!(bip->bli_flags & XFS_BLI_STALE)); 117 - ASSERT(!(bip->bli_format.blf_flags & XFS_BLI_CANCEL)); 117 + ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_CANCEL)); 118 118 ASSERT(!(bip->bli_flags & XFS_BLI_LOGGED)); 119 119 if (reset_recur) 120 120 bip->bli_recur = 0; ··· 511 511 bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); 512 512 ASSERT(bip->bli_item.li_type == XFS_LI_BUF); 513 513 ASSERT(!(bip->bli_flags & XFS_BLI_STALE)); 514 - ASSERT(!(bip->bli_format.blf_flags & XFS_BLI_CANCEL)); 514 + ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_CANCEL)); 515 515 ASSERT(atomic_read(&bip->bli_refcount) > 0); 516 516 517 517 /* ··· 619 619 620 620 bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); 621 621 ASSERT(!(bip->bli_flags & XFS_BLI_STALE)); 622 - ASSERT(!(bip->bli_format.blf_flags & XFS_BLI_CANCEL)); 622 + ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_CANCEL)); 623 623 ASSERT(atomic_read(&bip->bli_refcount) > 0); 624 624 bip->bli_flags |= XFS_BLI_HOLD; 625 625 trace_xfs_trans_bhold(bip); ··· 641 641 642 642 bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); 643 643 ASSERT(!(bip->bli_flags & XFS_BLI_STALE)); 644 - ASSERT(!(bip->bli_format.blf_flags & XFS_BLI_CANCEL)); 644 + ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_CANCEL)); 645 645 ASSERT(atomic_read(&bip->bli_refcount) > 0); 646 646 ASSERT(bip->bli_flags & XFS_BLI_HOLD); 647 647 bip->bli_flags &= ~XFS_BLI_HOLD; ··· 704 704 bip->bli_flags &= ~XFS_BLI_STALE; 705 705 ASSERT(XFS_BUF_ISSTALE(bp)); 706 706 XFS_BUF_UNSTALE(bp); 707 - bip->bli_format.blf_flags &= ~XFS_BLI_CANCEL; 707 + bip->bli_format.blf_flags &= ~XFS_BLF_CANCEL; 708 708 } 709 709 710 710 lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)bip); ··· 762 762 ASSERT(!(XFS_BUF_ISDELAYWRITE(bp))); 763 763 ASSERT(XFS_BUF_ISSTALE(bp)); 764 764 ASSERT(!(bip->bli_flags & (XFS_BLI_LOGGED | XFS_BLI_DIRTY))); 765 - ASSERT(!(bip->bli_format.blf_flags & 
XFS_BLI_INODE_BUF)); 766 - ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); 765 + ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_INODE_BUF)); 766 + ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); 767 767 ASSERT(lidp->lid_flags & XFS_LID_DIRTY); 768 768 ASSERT(tp->t_flags & XFS_TRANS_DIRTY); 769 769 return; ··· 774 774 * in the buf log item. The STALE flag will be used in 775 775 * xfs_buf_item_unpin() to determine if it should clean up 776 776 * when the last reference to the buf item is given up. 777 - * We set the XFS_BLI_CANCEL flag in the buf log format structure 777 + * We set the XFS_BLF_CANCEL flag in the buf log format structure 778 778 * and log the buf item. This will be used at recovery time 779 779 * to determine that copies of the buffer in the log before 780 780 * this should not be replayed. ··· 792 792 XFS_BUF_UNDELAYWRITE(bp); 793 793 XFS_BUF_STALE(bp); 794 794 bip->bli_flags |= XFS_BLI_STALE; 795 - bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_DIRTY); 796 - bip->bli_format.blf_flags &= ~XFS_BLI_INODE_BUF; 797 - bip->bli_format.blf_flags |= XFS_BLI_CANCEL; 795 + bip->bli_flags &= ~(XFS_BLI_INODE_BUF | XFS_BLI_LOGGED | XFS_BLI_DIRTY); 796 + bip->bli_format.blf_flags &= ~XFS_BLF_INODE_BUF; 797 + bip->bli_format.blf_flags |= XFS_BLF_CANCEL; 798 798 memset((char *)(bip->bli_format.blf_data_map), 0, 799 799 (bip->bli_format.blf_map_size * sizeof(uint))); 800 800 lidp->lid_flags |= XFS_LID_DIRTY; ··· 802 802 } 803 803 804 804 /* 805 - * This call is used to indicate that the buffer contains on-disk 806 - * inodes which must be handled specially during recovery. They 807 - * require special handling because only the di_next_unlinked from 808 - * the inodes in the buffer should be recovered. The rest of the 809 - * data in the buffer is logged via the inodes themselves. 805 + * This call is used to indicate that the buffer contains on-disk inodes which 806 + * must be handled specially during recovery. 
They require special handling 807 + * because only the di_next_unlinked from the inodes in the buffer should be 808 + * recovered. The rest of the data in the buffer is logged via the inodes 809 + * themselves. 810 810 * 811 - * All we do is set the XFS_BLI_INODE_BUF flag in the buffer's log 812 - * format structure so that we'll know what to do at recovery time. 811 + * All we do is set the XFS_BLI_INODE_BUF flag in the items flags so it can be 812 + * transferred to the buffer's log format structure so that we'll know what to 813 + * do at recovery time. 813 814 */ 814 - /* ARGSUSED */ 815 815 void 816 816 xfs_trans_inode_buf( 817 817 xfs_trans_t *tp, ··· 826 826 bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); 827 827 ASSERT(atomic_read(&bip->bli_refcount) > 0); 828 828 829 - bip->bli_format.blf_flags |= XFS_BLI_INODE_BUF; 829 + bip->bli_flags |= XFS_BLI_INODE_BUF; 830 830 } 831 831 832 832 /* ··· 908 908 ASSERT(XFS_BUF_ISBUSY(bp)); 909 909 ASSERT(XFS_BUF_FSPRIVATE2(bp, xfs_trans_t *) == tp); 910 910 ASSERT(XFS_BUF_FSPRIVATE(bp, void *) != NULL); 911 - ASSERT(type == XFS_BLI_UDQUOT_BUF || 912 - type == XFS_BLI_PDQUOT_BUF || 913 - type == XFS_BLI_GDQUOT_BUF); 911 + ASSERT(type == XFS_BLF_UDQUOT_BUF || 912 + type == XFS_BLF_PDQUOT_BUF || 913 + type == XFS_BLF_GDQUOT_BUF); 914 914 915 915 bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); 916 916 ASSERT(atomic_read(&bip->bli_refcount) > 0);
+3 -111
fs/xfs/xfs_trans_item.c
··· 299 299 void 300 300 xfs_trans_free_items( 301 301 xfs_trans_t *tp, 302 + xfs_lsn_t commit_lsn, 302 303 int flags) 303 304 { 304 305 xfs_log_item_chunk_t *licp; ··· 312 311 * Special case the embedded chunk so we don't free it below. 313 312 */ 314 313 if (!xfs_lic_are_all_free(licp)) { 315 - (void) xfs_trans_unlock_chunk(licp, 1, abort, NULLCOMMITLSN); 314 + (void) xfs_trans_unlock_chunk(licp, 1, abort, commit_lsn); 316 315 xfs_lic_all_free(licp); 317 316 licp->lic_unused = 0; 318 317 } ··· 323 322 */ 324 323 while (licp != NULL) { 325 324 ASSERT(!xfs_lic_are_all_free(licp)); 326 - (void) xfs_trans_unlock_chunk(licp, 1, abort, NULLCOMMITLSN); 325 + (void) xfs_trans_unlock_chunk(licp, 1, abort, commit_lsn); 327 326 next_licp = licp->lic_next; 328 327 kmem_free(licp); 329 328 licp = next_licp; ··· 438 437 } 439 438 440 439 return freed; 441 - } 442 - 443 - 444 - /* 445 - * This is called to add the given busy item to the transaction's 446 - * list of busy items. It must find a free busy item descriptor 447 - * or allocate a new one and add the item to that descriptor. 448 - * The function returns a pointer to busy descriptor used to point 449 - * to the new busy entry. The log busy entry will now point to its new 450 - * descriptor with its ???? field. 451 - */ 452 - xfs_log_busy_slot_t * 453 - xfs_trans_add_busy(xfs_trans_t *tp, xfs_agnumber_t ag, xfs_extlen_t idx) 454 - { 455 - xfs_log_busy_chunk_t *lbcp; 456 - xfs_log_busy_slot_t *lbsp; 457 - int i=0; 458 - 459 - /* 460 - * If there are no free descriptors, allocate a new chunk 461 - * of them and put it at the front of the chunk list. 462 - */ 463 - if (tp->t_busy_free == 0) { 464 - lbcp = (xfs_log_busy_chunk_t*) 465 - kmem_alloc(sizeof(xfs_log_busy_chunk_t), KM_SLEEP); 466 - ASSERT(lbcp != NULL); 467 - /* 468 - * Initialize the chunk, and then 469 - * claim the first slot in the newly allocated chunk. 
470 - */ 471 - XFS_LBC_INIT(lbcp); 472 - XFS_LBC_CLAIM(lbcp, 0); 473 - lbcp->lbc_unused = 1; 474 - lbsp = XFS_LBC_SLOT(lbcp, 0); 475 - 476 - /* 477 - * Link in the new chunk and update the free count. 478 - */ 479 - lbcp->lbc_next = tp->t_busy.lbc_next; 480 - tp->t_busy.lbc_next = lbcp; 481 - tp->t_busy_free = XFS_LIC_NUM_SLOTS - 1; 482 - 483 - /* 484 - * Initialize the descriptor and the generic portion 485 - * of the log item. 486 - * 487 - * Point the new slot at this item and return it. 488 - * Also point the log item at its currently active 489 - * descriptor and set the item's mount pointer. 490 - */ 491 - lbsp->lbc_ag = ag; 492 - lbsp->lbc_idx = idx; 493 - return lbsp; 494 - } 495 - 496 - /* 497 - * Find the free descriptor. It is somewhere in the chunklist 498 - * of descriptors. 499 - */ 500 - lbcp = &tp->t_busy; 501 - while (lbcp != NULL) { 502 - if (XFS_LBC_VACANCY(lbcp)) { 503 - if (lbcp->lbc_unused <= XFS_LBC_MAX_SLOT) { 504 - i = lbcp->lbc_unused; 505 - break; 506 - } else { 507 - /* out-of-order vacancy */ 508 - cmn_err(CE_DEBUG, "OOO vacancy lbcp 0x%p\n", lbcp); 509 - ASSERT(0); 510 - } 511 - } 512 - lbcp = lbcp->lbc_next; 513 - } 514 - ASSERT(lbcp != NULL); 515 - /* 516 - * If we find a free descriptor, claim it, 517 - * initialize it, and return it. 
518 - */ 519 - XFS_LBC_CLAIM(lbcp, i); 520 - if (lbcp->lbc_unused <= i) { 521 - lbcp->lbc_unused = i + 1; 522 - } 523 - lbsp = XFS_LBC_SLOT(lbcp, i); 524 - tp->t_busy_free--; 525 - lbsp->lbc_ag = ag; 526 - lbsp->lbc_idx = idx; 527 - return lbsp; 528 - } 529 - 530 - 531 - /* 532 - * xfs_trans_free_busy 533 - * Free all of the busy lists from a transaction 534 - */ 535 - void 536 - xfs_trans_free_busy(xfs_trans_t *tp) 537 - { 538 - xfs_log_busy_chunk_t *lbcp; 539 - xfs_log_busy_chunk_t *lbcq; 540 - 541 - lbcp = tp->t_busy.lbc_next; 542 - while (lbcp != NULL) { 543 - lbcq = lbcp->lbc_next; 544 - kmem_free(lbcp); 545 - lbcp = lbcq; 546 - } 547 - 548 - XFS_LBC_INIT(&tp->t_busy); 549 - tp->t_busy.lbc_unused = 0; 550 440 }
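For contrast with the list-based replacement, the deleted `xfs_trans_add_busy()` above managed each chunk's 31 slots with a free bitmask (`XFS_LBC_FREEMASK`, `XFS_LBC_CLAIM`, `XFS_LBC_VACANCY`) plus a first-never-used index. The bookkeeping it implemented can be sketched as follows (simplified names, and only the in-chunk slot claim, not the chunk chaining):

```c
#define NUM_SLOTS 31
#define FREEMASK  ((1u << NUM_SLOTS) - 1)

/* A chunk of slots: bit set in 'free' means the slot is available. */
struct chunk {
    unsigned int   free;    /* free-slot bitmask */
    unsigned short unused;  /* first slot never yet handed out */
};

static void chunk_init(struct chunk *c)
{
    c->free = FREEMASK;
    c->unused = 0;
}

/*
 * Claim the first never-used slot by clearing its free bit, as the
 * removed XFS_LBC_CLAIM() macro did. Returns -1 when the chunk has no
 * vacancy left.
 */
static int chunk_claim(struct chunk *c)
{
    int slot;

    if (c->unused >= NUM_SLOTS || !(c->free & FREEMASK))
        return -1;
    slot = c->unused++;
    c->free &= ~(1u << slot);
    return slot;
}
```

Every claim and free had to touch this mask, and chains of chunks had to be walked to find vacancies; the `list_head` approach in this series deletes all of it.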
+8 -7
fs/xfs/xfs_trans_priv.h
··· 35 35 struct xfs_log_item_desc *xfs_trans_first_item(struct xfs_trans *); 36 36 struct xfs_log_item_desc *xfs_trans_next_item(struct xfs_trans *, 37 37 struct xfs_log_item_desc *); 38 - void xfs_trans_free_items(struct xfs_trans *, int); 39 - void xfs_trans_unlock_items(struct xfs_trans *, 40 - xfs_lsn_t); 41 - void xfs_trans_free_busy(xfs_trans_t *tp); 42 - xfs_log_busy_slot_t *xfs_trans_add_busy(xfs_trans_t *tp, 43 - xfs_agnumber_t ag, 44 - xfs_extlen_t idx); 38 + 39 + void xfs_trans_unlock_items(struct xfs_trans *tp, xfs_lsn_t commit_lsn); 40 + void xfs_trans_free_items(struct xfs_trans *tp, xfs_lsn_t commit_lsn, 41 + int flags); 42 + 43 + void xfs_trans_item_committed(struct xfs_log_item *lip, 44 + xfs_lsn_t commit_lsn, int aborted); 45 + void xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp); 45 46 46 47 /* 47 48 * AIL traversal cursor.
+2
fs/xfs/xfs_types.h
··· 75 75 76 76 typedef __uint16_t xfs_prid_t; /* prid_t truncated to 16bits in XFS */ 77 77 78 + typedef __uint32_t xlog_tid_t; /* transaction ID type */ 79 + 78 80 /* 79 81 * These types are 64 bits on disk but are either 32 or 64 bits in memory. 80 82 * Disk based types: