[Sidefx-houdini-list] Distributed sims failing

Gary Jaeger gary at corestudio.com
Tue Jan 31 12:46:27 EST 2017


Thanks Antoine-

Just checked and they all have at least 600GB avail on their boot drives. 
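
For anyone wanting to script the same check across farm machines, here is a minimal sketch, assuming Python 3 is available on each client. The path list and the 50 GiB threshold are placeholders, not values from this thread, and the helper names are made up for illustration.

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB


def free_gib(path):
    """Return free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / GIB


def low_space_paths(paths, min_free_gib):
    """Return the paths whose filesystem has less than `min_free_gib` free."""
    return [p for p in paths if free_gib(p) < min_free_gib]


if __name__ == "__main__":
    # Placeholder list: swap in the boot drive, the temp directory, and
    # whatever shared storage the slices write to.
    paths = ["."]
    for p in paths:
        print(f"{p}: {free_gib(p):.1f} GiB free")
    print("below threshold:", low_space_paths(paths, 50.0))
```

The boot drive is only one candidate; it may be worth pointing this at the temp directory and the shared sim cache location as well.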

Gary Jaeger / 650.728.7957 direct / 415.518.1419 mobile
http://corestudio.com
> On Jan 31, 2017, at 9:14 AM, Antoine Durr <antoinedurr at gmail.com> wrote:
> 
> It sounds like you’re running out of disk space on one of the machines used for the distributed sim.
> 
> — Antoine
> 
>> On Jan 31, 2017, at 7:11 AM, Gary Jaeger <gary at corestudio.com> wrote:
>> 
>> Hi Gary,
>> 
>> If I were to guess it almost looks like a general network disruption.  I'm basing that on the "Write pipe error" messages. Perhaps a disruption occurs while a message is passed from one machine to another so the message is incomplete?
>> 
>> Anyway, would you be able to post the job output and diagnostics files for one of the failed slice jobs?  And also the .hip file?
>> 
>> I can give it a whirl here and see if it's an issue with HQueue or distributed sims.
>> 
>> Cheers,
>> Rob
>> 
>> On 2017-01-29 7:08 PM, Gary Jaeger wrote:
>>> A quick follow-up. I watched one of the machines on the farm doing a sim - I
>>> just did a quick splash tank to make sure it wasn't my scene. Early on the
>>> CPUs are pegged and everything moves along. RAM is not even close to being
>>> an issue, the process is using up about 2GB. Looks like maybe it's the
>>> hython process? Anyway, at some point the CPU usage just drops to nothing.
>>> The process is still alive, but doesn't seem to be doing anything.
>>> 
>>> On Sun, Jan 29, 2017 at 12:30 PM, Gary Jaeger <gary at corestudio.com> wrote:
>>> 
>>>> Anybody have any insight into this? I have a flip sim that I want to
>>>> distribute. I'm pretty sure it's all set up correctly, because the sim
>>>> starts and all the slices get part of the way through, but always end up
>>>> failing.
>>>> 
>>>> I've tried both slice and slice along. When I try a slice along, the job
>>>> has been getting about 14% through, then just hanging up. No error
>>>> messages, etc. It just never progresses. I've also tried slice and chopping
>>>> the sim into 4 quadrants. In that case I was seeing things like this:
>>>> 
>>>> 
>>>> ALF_PROGRESS 27%
>>>> ALF_PROGRESS 28%
>>>> Read error on ack: Error Occurred of 12
>>>> Error occurred in message 12 state is 5
>>>> ---- Pump enters error status ----
>>>> Tracker reports an error, aborting
>>>> 
>>>> 
>>>> ALF_PROGRESS 27%
>>>> ALF_PROGRESS 28%
>>>> Tracker reports an error, aborting
>>>> Write pipe error: Error Occurred offset 0 of 4
>>>> Error occurred in message 4 state is 9
>>>> EOF in pipe at position 0 of 4
>>>> Error occurred in message 4 state is 6
>>>> 
>>>> --
>>>> 
>>>> Though the tracker task isn't reporting any errors in hqueue that I can
>>>> see.
>>>> 
>>>> Any ideas?
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Gary Jaeger // Core Studio
>>>> 249 Princeton Avenue
>>>> Half Moon Bay, CA 94019
>>>> 650.728.7957 (direct) • 650.728.7060 (main)
>>>> http://corestudio.com
>>>> 
>>> 
>>> 
>> 
>> _______________________________________________
>> Sidefx-houdini-list mailing list
>> Sidefx-houdini-list at sidefx.com
>> https://lists.sidefx.com:443/mailman/listinfo/sidefx-houdini-list
> 



