<div dir="ltr">At work, I rebooted the servers for one tier of an application, after which we started seeing errors.<div><br></div><div>Based on a trace file(1), the vendor says it's a problem with the OS. They found where the application:</div><div><br></div><div>1. Opens a new file</div><div>2. Writes to the file</div><div>3. Closes the file</div><div>4. Creates a link to its real name</div><div>5. Removes the original file</div><div>6. Stats the file and finds a size of zero</div><div><br></div><div>We would expect the file to have varying sizes based upon the data that was written to it.</div><div><br></div><div>Since it is not practical for others to run this application, I wrote an ugly C program(2) to mimic this behavior. (It is run via ./truncationbug $$.pre.tsidx $$.tsidx sample.txt 1000, where sample.txt can be as simple as a file containing a three-letter word.) We have validated that this triggers the bug, although it can take anywhere from 2,000 iterations to millions, and I have only triggered it when running multiple instances in parallel (which is why I'm using $$ to grab the PID; I can use the same command in multiple shells).</div><div><br></div><div>Things we have tried:</div><div>1. Booting an older kernel; with the reboot, the servers had come up on a newer kernel</div><div>2. Re-kickstarting the servers; we switched distributions to CentOS 7 this spring and they had run fine from then until this reboot, so the hope was to very quickly revert to a known good configuration</div><div>3. Upgrading the firmware on the RAID controller; the servers were running 3.04, 3.42, and 3.52, and all were upgraded to 6.68</div><div>4. Running xfs_repair on the file system</div><div><br></div><div>Background Information</div><div>We have 30 servers at this tier; they are split evenly between two different manufacturers. The 15 HP ProLiant 380p servers with Smart Array P420i RAID controllers are experiencing this issue. 
They were effectively running CentOS 7.2; we've reverted (via a kickstarted reinstall) to 7.0. The volume is a RAID 10 volume over 20 disks. I haven't been able to reproduce the issue on the RAID 1 volume with the OS partitions, but I haven't extensively tested it either. I've been able to reproduce the problem under other users, including root.</div><div><br></div><div>At this point, HP support says they see no indication of a hardware issue and that they only support hardware.</div><div><br></div><div>I would really appreciate any thoughts or suggestions.</div><div><br></div><div>Thanks,</div><div>Bill</div>
<div><br></div><div>1. <a href="http://pastebin.com/mFmZzYEL" target="_blank">http://pastebin.com/mFmZzYEL</a></div><div>2. <a href="https://gist.github.com/w3ttr3y/167b349d2ab67e3aa9d2" target="_blank">https://gist.github.com/w3ttr3y/167b349d2ab67e3aa9d2</a></div></div>
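<div><br></div><div>For readers who want to see the six steps listed above in code without opening the gist, here is a minimal sketch of a single iteration. It is not the program from (2); the function name write_link_stat and its parameters are hypothetical, and it assumes the application does not call fsync() before linking, since the trace summary does not mention one.</div>

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: performs one open/write/close/link/unlink/stat
 * cycle and returns the size reported by stat(), or -1 if any system
 * call fails. On a healthy system the return value should equal len.
 */
long long write_link_stat(const char *tmp_name, const char *real_name,
                          const char *data, size_t len)
{
    struct stat st;
    size_t off = 0;
    int fd;

    /* Remove any leftover from a previous iteration; link() would
     * otherwise fail with EEXIST. A missing file is not an error here. */
    unlink(real_name);

    /* 1. Open a new file under the temporary name. */
    fd = open(tmp_name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* 2. Write the data, retrying on short writes. */
    while (off < len) {
        ssize_t n = write(fd, data + off, len - off);
        if (n <= 0) {
            close(fd);
            unlink(tmp_name);
            return -1;
        }
        off += (size_t)n;
    }

    /* 3. Close the file. No fsync() is issued (an assumption, see above). */
    if (close(fd) != 0) {
        unlink(tmp_name);
        return -1;
    }

    /* 4. Create a hard link under the real name. */
    if (link(tmp_name, real_name) != 0) {
        unlink(tmp_name);
        return -1;
    }

    /* 5. Remove the original (temporary) name. */
    if (unlink(tmp_name) != 0)
        return -1;

    /* 6. Stat the real name and report its size. */
    if (stat(real_name, &st) != 0)
        return -1;

    return (long long)st.st_size;
}
```

<div>Calling this in a loop from several processes at once, each with its own pair of file names, and flagging any iteration where the result differs from len would correspond to the parallel runs described above. Whether this sketch triggers the problem has not been tested.</div>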