[Sidefx-houdini-list] Distributed sims failing

rvinluan rvinluan at sidefx.com
Fri Feb 3 14:34:26 EST 2017


Hi Gary,

Sorry for the late reply.

I tried out the splash tank .hip file on our farm here and was able to 
simulate through all 120 frames across 5 slices/machines.  So I don't 
think the errors are specific to the scene or simulation setup.

Looking at the errors in the job output, this looks like a general
networking issue.  It's as if communication between the client machines
is disrupted, causing a breakdown.  I noticed from the diagnostics files
that you are using macOS for your client machines, so I wonder if the
networking issue is specific to Mac.  For example, we have a couple of
users here on MacBook Pros running El Capitan and Sierra, and they
complain that their Macs disconnect from the Wi-Fi for inexplicable
reasons at least once or twice a day.
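
One way to check for intermittent drops like this is to probe each
client machine repeatedly from another machine and log when a connection
fails.  The following is only a minimal sketch in plain Python, not
anything HQueue provides: the function names are hypothetical, and the
host and port are placeholders for whatever service is known to be
listening on the machine being tested.

```python
import socket
import time


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def watch(host, port, checks, interval=5.0):
    """Probe (host, port) `checks` times; return a list of (timestamp, ok)."""
    results = []
    for i in range(checks):
        results.append((time.time(), can_connect(host, port)))
        if i < checks - 1:
            time.sleep(interval)
    return results
```

Running watch() against each farm machine for the length of a typical
sim and looking for False entries would show whether one machine drops
off the network around the time the slices fail.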

When I tested the splash tank scene I used an all Linux farm since I 
didn't have enough dedicated Mac machines.

Maybe try simplifying the setup by reducing the number of 
slices/machines and reducing the number of frames?  Perhaps only one of 
the machines has a networking issue which is causing everything to 
fail.  You can submit the sim job to different machines to see if a 
certain combination succeeds or fails.
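
The trial-and-error over machine combinations can be organized as a
leave-one-out test: run the job once per machine with that machine
excluded, and note which runs succeed.  This is a hedged sketch in plain
Python; the machine names and helper functions are hypothetical, and
submitting the jobs themselves is left to be done by hand.

```python
def leave_one_out(machines):
    """Yield (excluded, remaining) pairs, dropping one machine at a time."""
    for i, excluded in enumerate(machines):
        yield excluded, machines[:i] + machines[i + 1:]


def suspects(machines, succeeded_without):
    """Return machines whose exclusion made the job succeed.

    `succeeded_without` maps a machine name to True if the job completed
    when that machine was left out of the farm.
    """
    return [m for m in machines if succeeded_without.get(m)]


# Example with placeholder names: the job only finishes without "mac3".
farm = ["mac1", "mac2", "mac3", "mac4"]
outcome = {"mac1": False, "mac2": False, "mac3": True, "mac4": False}
print(suspects(farm, outcome))
```

If no single exclusion makes the job succeed, the problem is probably
not limited to one machine.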

Cheers,
Rob

On 2017-01-31 12:46 PM, Gary Jaeger wrote:
> Thanks Antoine-
>
> Just checked and they all have at least 600GB avail on their boot drives.
>
> Gary Jaeger / 650.728.7957 direct / 415.518.1419 mobile
> http://corestudio.com
>> On Jan 31, 2017, at 9:14 AM, Antoine Durr <antoinedurr at gmail.com> wrote:
>>
>> It sounds like you’re running out of disk space on one of the machines used for the distributed sim.
>>
>> — Antoine
>>
>>> On Jan 31, 2017, at 7:11 AM, Gary Jaeger <gary at corestudio.com> wrote:
>>>
>>> Hi Gary,
>>>
>>> If I were to guess it almost looks like a general network disruption.  I'm basing that on the "Write pipe error" messages. Perhaps a disruption occurs while a message is passed from one machine to another so the message is incomplete?
>>>
>>> Anyway, would you be able to post the job output and diagnostics files for one of the failed slice jobs?  And also the .hip file?
>>>
>>> I can give it a whirl here and see if it's an issue with HQueue or distributed sims.
>>>
>>> Cheers,
>>> Rob
>>>
>>> On 2017-01-29 7:08 PM, Gary Jaeger wrote:
>>>> A quick follow-up. I watched one of the machines on the farm doing a sim - I
>>>> just did a quick splash tank to make sure it wasn't my scene. Early on the
>>>> CPUs are pegged and everything moves along. RAM is not even close to being
>>>> an issue, the process is using up about 2GB. Looks like maybe it's the
>>>> hython process? Anyway, at some point the CPU usage just drops to nothing.
>>>> The process is still alive, but doesn't seem to be doing anything.
>>>>
>>>> On Sun, Jan 29, 2017 at 12:30 PM, Gary Jaeger <gary at corestudio.com> wrote:
>>>>
>>>>> Anybody have any insight into this? I have a FLIP sim that I want to
>>>>> distribute. I'm pretty sure it's all set up correctly, because the sim
>>>>> starts and all the slices get part of the way through, but always end up
>>>>> failing.
>>>>>
>>>>> I've tried both slice and slice along. When I try a slice along, the job
>>>>> has been getting about 14% through, then just hanging up. No error
>>>>> messages, etc. It just never progresses. I've also tried slice and chopping
>>>>> the sim into 4 quadrants. In that case I was seeing things like this:
>>>>>
>>>>>
>>>>> ALF_PROGRESS 27%
>>>>> ALF_PROGRESS 28%
>>>>> Read error on ack: Error Occurred of 12
>>>>> Error occurred in message 12 state is 5
>>>>> ---- Pump enters error status ----
>>>>> Tracker reports an error, aborting
>>>>>
>>>>>
>>>>> ALF_PROGRESS 27%
>>>>> ALF_PROGRESS 28%
>>>>> Tracker reports an error, aborting
>>>>> Write pipe error: Error Occurred offset 0 of 4
>>>>> Error occurred in message 4 state is 9
>>>>> EOF in pipe at position 0 of 4
>>>>> Error occurred in message 4 state is 6
>>>>>
>>>>> --
>>>>>
>>>>> Though the tracker task isn't reporting any errors in hqueue that I can
>>>>> see.
>>>>>
>>>>> Any ideas?
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Gary Jaeger // Core Studio
>>>>> 249 Princeton Avenue
>>>>> Half Moon Bay, CA 94019
>>>>> 650.728.7957 (direct) • 650.728.7060 (main)
>>>>> http://corestudio.com
>>>>>
>>>>
>>> _______________________________________________
>>> Sidefx-houdini-list mailing list
>>> Sidefx-houdini-list at sidefx.com
>>> https://lists.sidefx.com:443/mailman/listinfo/sidefx-houdini-list



