[colug-432] Troubleshooting Suggestions: File Truncation Issue

William E. T. linux.hacker at gmail.com
Sat Jan 9 23:13:14 EST 2016


At work, I rebooted the servers for one tier of an application after which
we started seeing errors.

Based on a trace file(1), the vendor is saying it's a problem with the OS.
They found where the application:

1.  Opens a new file
2.  Writes to the file
3.  Closes the file
4.  Creates a link to the file under its real name
5.  Removes the original file
6.  Stats the file and finds a size of zero

We would expect the file to have varying sizes based upon the data that was
written to it.
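
For readers who want to see the sequence without reading the trace, here is a
minimal sketch of one pass in C. It is not the vendor's code and not the
program in (2); the function name write_link_stat_once and its parameters are
made up for illustration, and it assumes the link in step 4 is a hard link
created with link(2) and that the stat in step 6 is done on the new name.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: one pass of the open/write/close/link/unlink/stat
 * sequence. Returns the size reported by stat() for final_path, or -1 if
 * any system call fails. final_path must not already exist, because
 * link() fails with EEXIST in that case.
 */
long long write_link_stat_once(const char *tmp_path, const char *final_path,
                               const char *data, size_t len)
{
    struct stat st;
    size_t done = 0;

    /* Step 1: open a new file under its temporary name. */
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return -1; }

    /* Step 2: write the data, retrying on short writes. */
    while (done < len) {
        ssize_t n = write(fd, data + done, len - done);
        if (n < 0) { perror("write"); close(fd); unlink(tmp_path); return -1; }
        done += (size_t)n;
    }

    /* Step 3: close the file. */
    if (close(fd) < 0) { perror("close"); unlink(tmp_path); return -1; }

    /* Step 4: hard-link the temporary name to the real name (assumed). */
    if (link(tmp_path, final_path) < 0) {
        perror("link");
        unlink(tmp_path);
        return -1;
    }

    /* Step 5: remove the temporary name; the data stays reachable. */
    if (unlink(tmp_path) < 0) { perror("unlink"); return -1; }

    /* Step 6: stat the real name; the size should equal len. */
    if (stat(final_path, &st) < 0) { perror("stat"); return -1; }
    return (long long)st.st_size;
}
```

A caller would run this in a loop from several processes at once, using
pid-based file names, removing final_path between passes, and flagging any
pass where the return value is 0 even though len was greater than 0.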

Since it is not practical for others to run this application, I wrote an
ugly C program(2) to mimic this behavior. (Run it via ./truncationbug
$$.pre.tsidx $$.tsidx sample.txt 1000, where sample.txt can be as simple as
a file containing a single three-letter word.)  We have validated that this
triggers the bug, although it can take anywhere from 2,000 iterations to
millions, and I have only triggered it when running multiple instances in
parallel (which is why I'm using $$ to grab the pid; I can use the same
command in multiple shells).

Things we have tried:
1.  Booting an older kernel; the reboot had brought the servers up on a
newer kernel
2.  Re-kickstarting the servers; we switched distributions to
CentOS 7 this spring and they've run fine since then until this reboot, so
the hope was to very quickly revert to a known good configuration
3.  Upgrading the firmware on the raid controllers from
3.04, 3.42, and 3.52 to 6.68
4.  Running xfs_repair on the file system

Background Information
We have 30 servers at this tier; they are split evenly between two
different manufacturers.  The 15 HP ProLiant 380p with Smart Array P420i
raid controllers are experiencing this issue.  They were effectively
running CentOS 7.2; we've reverted (via kickstarted reinstall) to 7.0.  The
volume is a raid10 volume over 20 disks.  I haven't been able to reproduce
the issue on the raid1 volume with the OS partitions, but I haven't
extensively tested it either.  I've been able to reproduce the problem
under other users, including root.

At this point HP support says they see no indication of a hardware
issue and that they only support hardware.

I really appreciate any thoughts or suggestions.

Thanks,
Bill

1.  http://pastebin.com/mFmZzYEL
2.  https://gist.github.com/w3ttr3y/167b349d2ab67e3aa9d2