Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

orangefs: get rid of knob code...

Christoph Hellwig sent in a reversion of "orangefs: remember count
when reading." because:

->read_iter calls can race with each other and one or
more ->flush calls. Remove the scheme to store the read
count in the file private data, as it is completely racy and
can cause use after free or double free conditions.

Christoph's reversion left Orangefs unable to compile, much less
work. I added a patch that fixed that, but Intel's kbuild test robot
pointed out that sending Christoph's patch upstream followed by mine
would break bisection because of the compile failure. So I have
combined the reversion and my patch... here's the commit message
that was in my patch:

Logically, optimal Orangefs "pages" are 4 megabytes. Reading
large Orangefs files 4096 bytes at a time is like trying to
kick a dead whale down the beach. Before Christoph's revert of
"orangefs: remember count when reading." I tried to give users
a knob whereby they could, for example, use "count" with
read(2) or bs with dd(1) to get whatever they considered an
appropriate number of bytes at a time from Orangefs and fill
as many page cache pages as they could at once.

Without the racy code that Christoph reverted, Orangefs won't
even compile, much less work. So this replaces the reverted
logic, which used the private file data, with a static number
of bytes to read from Orangefs.

I ran tests like the following to determine what a
reasonable static number of bytes might be:

dd if=/pvfsmnt/asdf of=/dev/null count=128 bs=4194304
dd if=/pvfsmnt/asdf of=/dev/null count=256 bs=2097152
dd if=/pvfsmnt/asdf of=/dev/null count=512 bs=1048576
.
.
.
dd if=/pvfsmnt/asdf of=/dev/null count=4194304 bs=128

Reads seem faster using the static number, so my "knob code"
wasn't just racy, it wasn't even a good idea...

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Reported-by: kbuild test robot <lkp@intel.com>

+7 -62
+1 -25
fs/orangefs/file.c
@@ -346,22 +346,7 @@
 		struct iov_iter *iter)
 {
 	int ret;
-	struct orangefs_read_options *ro;
-
 	orangefs_stats.reads++;
-
-	/*
-	 * Remember how they set "count" in read(2) or pread(2) or whatever -
-	 * users can use count as a knob to control orangefs io size and later
-	 * we can try to help them fill as many pages as possible in readpage.
-	 */
-	if (!iocb->ki_filp->private_data) {
-		iocb->ki_filp->private_data = kmalloc(sizeof *ro, GFP_KERNEL);
-		if (!iocb->ki_filp->private_data)
-			return(ENOMEM);
-		ro = iocb->ki_filp->private_data;
-		ro->blksiz = iter->count;
-	}
 
 	down_read(&file_inode(iocb->ki_filp)->i_rwsem);
 	ret = orangefs_revalidate_mapping(file_inode(iocb->ki_filp));
@@ -635,12 +650,6 @@
 	return rc;
 }
 
-static int orangefs_file_open(struct inode * inode, struct file *file)
-{
-	file->private_data = NULL;
-	return generic_file_open(inode, file);
-}
-
 static int orangefs_flush(struct file *file, fl_owner_t id)
 {
 	/*
@@ -647,9 +668,6 @@
 	 */
 	struct inode *inode = file->f_mapping->host;
 	int r;
-
-	kfree(file->private_data);
-	file->private_data = NULL;
 
 	if (inode->i_state & I_DIRTY_TIME) {
 		spin_lock(&inode->i_lock);
@@ -670,7 +694,7 @@
 	.lock = orangefs_lock,
 	.unlocked_ioctl = orangefs_ioctl,
 	.mmap = orangefs_file_mmap,
-	.open = orangefs_file_open,
+	.open = generic_file_open,
 	.flush = orangefs_flush,
 	.release = orangefs_file_release,
 	.fsync = orangefs_fsync,
+6 -33
fs/orangefs/inode.c
@@ -259,46 +259,19 @@
 	pgoff_t index; /* which page */
 	struct page *next_page;
 	char *kaddr;
-	struct orangefs_read_options *ro = file->private_data;
 	loff_t read_size;
-	loff_t roundedup;
 	int buffer_index = -1; /* orangefs shared memory slot */
 	int slot_index; /* index into slot */
 	int remaining;
 
 	/*
-	 * If they set some miniscule size for "count" in read(2)
-	 * (for example) then let's try to read a page, or the whole file
-	 * if it is smaller than a page. Once "count" goes over a page
-	 * then lets round up to the highest page size multiple that is
-	 * less than or equal to "count" and do that much orangefs IO and
-	 * try to fill as many pages as we can from it.
-	 *
-	 * "count" should be represented in ro->blksiz.
-	 *
-	 * inode->i_size = file size.
+	 * Get up to this many bytes from Orangefs at a time and try
+	 * to fill them into the page cache at once. Tests with dd made
+	 * this seem like a reasonable static number, if there was
+	 * interest perhaps this number could be made setable through
+	 * sysfs...
 	 */
-	if (ro) {
-		if (ro->blksiz < PAGE_SIZE) {
-			if (inode->i_size < PAGE_SIZE)
-				read_size = inode->i_size;
-			else
-				read_size = PAGE_SIZE;
-		} else {
-			roundedup = ((PAGE_SIZE - 1) & ro->blksiz) ?
-				((ro->blksiz + PAGE_SIZE) & ~(PAGE_SIZE -1)) :
-				ro->blksiz;
-			if (roundedup > inode->i_size)
-				read_size = inode->i_size;
-			else
-				read_size = roundedup;
-
-		}
-	} else {
-		read_size = PAGE_SIZE;
-	}
-	if (!read_size)
-		read_size = PAGE_SIZE;
+	read_size = 524288;
 
 	if (PageDirty(page))
 		orangefs_launder_page(page);
-4
fs/orangefs/orangefs-kernel.h
@@ -239,10 +239,6 @@
 	kgid_t gid;
 };
 
-struct orangefs_read_options {
-	ssize_t blksiz;
-};
-
 extern struct orangefs_stats orangefs_stats;
 
 /*