[Sidefx-houdini-list] Distributed sims failing

Antoine Durr antoinedurr at gmail.com
Tue Jan 31 12:14:46 EST 2017


It sounds like you’re running out of disk space on one of the machines used for the distributed sim.
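
A quick way to test that guess is to check free space on each farm machine while the sim is running. The sketch below is only an illustration: the 10 GB threshold and the temp-directory path are assumptions, not values from HQueue or Houdini, so substitute whichever directories your slices actually write to.

```python
import shutil
import tempfile


def low_disk_paths(paths, min_free_gb=10.0):
    """Return (path, free_gb) for each path with less free space than min_free_gb."""
    low = []
    for path in paths:
        usage = shutil.disk_usage(path)      # named tuple: total, used, free (bytes)
        free_gb = usage.free / 1024 ** 3     # bytes -> GiB
        if free_gb < min_free_gb:
            low.append((path, free_gb))
    return low


if __name__ == "__main__":
    # Hypothetical path list: replace with the cache/temp directories used by the sim.
    paths_to_check = [tempfile.gettempdir()]
    for path, free_gb in low_disk_paths(paths_to_check):
        print(f"LOW DISK: {path} has only {free_gb:.1f} GB free")
```

Run it on every machine that takes a slice; a machine that reports low space around the point where the job stalls would be the first suspect.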

— Antoine

> On Jan 31, 2017, at 7:11 AM, Gary Jaeger <gary at corestudio.com> wrote:
> 
> Hi Gary,
> 
> If I were to guess it almost looks like a general network disruption.  I'm basing that on the "Write pipe error" messages. Perhaps a disruption occurs while a message is passed from one machine to another so the message is incomplete?
> 
> Anyway, would you be able to post the job output and diagnostics files for one of the failed slice jobs?  And also the .hip file?
> 
> I can give it a whirl here and see if it's an issue with HQueue or distributed sims.
> 
> Cheers,
> Rob
> 
> On 2017-01-29 7:08 PM, Gary Jaeger wrote:
>> A quick follow up. I watched one of the machines on the farm doing a sim - I
>> just did a quick splash tank to make sure it wasn't my scene. Early on the
>> CPUs are pegged and everything moves along. RAM is not even close to being
>> an issue, the process is using up about 2GB. Looks like maybe it's the
>> hython process? Anyway, at some point the CPU usage just drops to nothing.
>> The process is still alive, but doesn't seem to be doing anything.
>> 
>> On Sun, Jan 29, 2017 at 12:30 PM, Gary Jaeger <gary at corestudio.com> wrote:
>> 
>>> Anybody have any insight into this? I have a flip sim that I want to
>>> distribute. I'm pretty sure it's all set up correctly, because the sim
>>> starts and all the slices get part of the way through, but always end up
>>> failing.
>>> 
>>> I've tried both slice and slice along. When I try a slice along, the job
>>> has been getting about 14% through, then just hanging up. No error
>>> messages, etc. It just never progresses. I've also tried slice and chopping
>>> the sim into 4 quadrants. In that case I was seeing things like this:
>>> 
>>> 
>>> ALF_PROGRESS 27%
>>> ALF_PROGRESS 28%
>>> Read error on ack: Error Occurred of 12
>>> Error occurred in message 12 state is 5
>>> ---- Pump enters error status ----
>>> Tracker reports an error, aborting
>>> 
>>> 
>>> ALF_PROGRESS 27%
>>> ALF_PROGRESS 28%
>>> Tracker reports an error, aborting
>>> Write pipe error: Error Occurred offset 0 of 4
>>> Error occurred in message 4 state is 9
>>> EOF in pipe at position 0 of 4
>>> Error occurred in message 4 state is 6
>>> 
>>> --
>>> 
>>> Though the tracker task isn't reporting any errors in hqueue that I can
>>> see.
>>> 
>>> Any ideas?
>>> 
>>> 
>>> 
>>> --
>>> Gary Jaeger // Core Studio
>>> 249 Princeton Avenue
>>> Half Moon Bay, CA 94019
>>> 650.728.7957 (direct) • 650.728.7060 (main)
>>> http://corestudio.com
>>> 
>> 
>> 
> 
> _______________________________________________
> Sidefx-houdini-list mailing list
> Sidefx-houdini-list at sidefx.com
> https://lists.sidefx.com:443/mailman/listinfo/sidefx-houdini-list



