[colug-432] Troubleshooting Suggestions: File Truncation Issue

Sun Jan 10 09:13:22 EST 2016

Top posting.  Yeah, I know I'm a heathen.

My initial thoughts are:

1) hard links typically don't work across filesystem boundaries, and
2) does the program check the exit code from the link command and give 
appropriate error messages?

I think I understand the symptom okay, but don't have good enough 
visibility into the application and underlying system configuration.  It 
may be something simple, or it may be something complex... and there's 
really not enough data for a complex problem diagnosis.

On 2016-01-09 23:13, William E. T. wrote:
> At work, I rebooted the servers for one tier of an application after 
> which we started seeing errors.
>
> Based on a trace file(1), the vendor is sayings its a problem with the 
> OS. They found where the applications:
>
> 1.   Opens a new file
> 2.  Writes to the file
> 3.  Closes the file
> 4.  Creates a link to its real name
> 5.  Removes the original file
> 6.  stats the file to find a zero size
>
> We would expect the file to have varying sizes based upon the data 
> that was written to it.
>
> Since it is not practical for others to run this application, I wrote 
> an ugly C program(2) to mimc this behavior. (ran via ./truncationbug 
> $$.pre.tsidx $$.tsidx sample.txt 1000 where sample.txt can be as 
> simple as having a three letter word in it)  We have validated that 
> this triggers the bug, although it can take anywhere from 2,000 
> iterations to millions and I have only triggered it if I am running 
> multiple instance in parallel (why I'm using $$ to grab the pid; I can 
> use the same command in multiple shells)
>
> Things we have tried:
> 1.  Booting an older kernel; with the reboot, they booted a newer kernel
> 2.  We tried re-kickstarting the servers; we switched distributions to 
> CentOS 7 this spring and they've ran fine since then until this 
> reboot, so the hope was to very quickly revert to a known good 
> configuration
> 3.  Upgrading the firmware on the raid controller -- they were running 
> 3.04, 3.42, and 3.52 to 6.68
> 4.  Running xfs_repair on the file system
>
> Background Information
> We have 30 servers at this tier; they are split evenly between two 
> different manufactures.  The 15 HP Proliant 380p with Smart Array 
> P420i raid controllers are experiencing this issue.  They were 
> effectively running CentOS 7.2; we've reverted (via kickstarted 
> reinstall) to 7.0  The volume is a raid10 volume over 20 disks.  I 
> haven't been able to reproduce the issue on the raid1 volume with the 
> OS partitions, but I haven't extensively tested it either.  I've been 
> able to reproduce the problem under others users including root.
>
> At this point HP support is saying they see no indications of a 
> hardware issue and saying they only support hardware.
>
> I really appreciate any thoughts or suggestions.
>
> Thanks,
> Bill
>
> 1. http://pastebin.com/mFmZzYEL
> 2. https://gist.github.com/w3ttr3y/167b349d2ab67e3aa9d2
>
>
> _______________________________________________
> colug-432 mailing list
> colug-432 at colug.net
> http://lists.colug.net/mailman/listinfo/colug-432

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.colug.net/pipermail/colug-432/attachments/20160110/93d6172d/attachment-0001.html