[Sidefx-houdini-list] Distributed sims failing

Gary Jaeger gary at corestudio.com
Sat Feb 4 11:47:49 EST 2017


Thanks so much Rob! Yeah, it must be something specific to the
communication between the machines and simming. Our farm has been very
reliable for all sorts of distributed tasks, including writing ifd files
and rendering mantra. But this sim thing is killing us!

On Fri, Feb 3, 2017 at 11:34 AM, rvinluan <rvinluan at sidefx.com> wrote:

> Hi Gary,
>
> Sorry for the late reply.
>
> I tried out the splash tank .hip file on our farm here and was able to
> simulate through all 120 frames across 5 slices/machines.  So I don't think
> the errors are specific to the scene or simulation setup.
>
> Looking at the errors from the job output it looks like general networking
> issues.  It's as if communication between the client machines is disrupted
> causing a breakdown.  I noticed from the diagnostics files that you are
> using MacOS for your client machines so I wonder if the networking issue is
> specific to Mac.  For example, we have a couple of users here on MacBook
> Pros using El Capitan and Sierra and they complain that their Macs
> disconnect from the wifi for inexplicable reasons at least once or twice a
> day.
>
> When I tested the splash tank scene I used an all Linux farm since I
> didn't have enough dedicated Mac machines.
>
> Maybe try simplifying the setup by reducing the number of slices/machines
> and reducing the number of frames?  Perhaps only one of the machines has a
> networking issue which is causing everything to fail.  You can submit the
> sim job to different machines to see if a certain combination succeeds or
> fails.
>
> Cheers,
> Rob
>
> On 2017-01-31 12:46 PM, Gary Jaeger wrote:
>
>> Thanks Antoine-
>>
>> Just checked and they all have at least 600GB avail on their boot drives.
>>
>> Gary Jaeger / 650.728.7957 direct / 415.518.1419 mobile
>> http://corestudio.com
>>
>> On Jan 31, 2017, at 9:14 AM, Antoine Durr <antoinedurr at gmail.com> wrote:
>>>
>>> It sounds like you’re running out of disk space on one of the machines
>>> used for the distributed sim.
>>>
>>> — Antoine
>>>
>>> On Jan 31, 2017, at 7:11 AM, Gary Jaeger <gary at corestudio.com> wrote:
>>>>
>>>> Hi Gary,
>>>>
>>>> If I were to guess it almost looks like a general network disruption.
>>>> I'm basing that on the "Write pipe error" messages. Perhaps a disruption
>>>> occurs while a message is passed from one machine to another so the message
>>>> is incomplete?
>>>>
>>>> Anyway, would you be able to post the job output and diagnostics files
>>>> for one of the failed slice jobs?  And also the .hip file?
>>>>
>>>> I can give it a whirl here and see if it's an issue with HQueue or
>>>> distributed sims.
>>>>
>>>> Cheers,
>>>> Rob
>>>>
>>>> On 2017-01-29 7:08 PM, Gary Jaeger wrote:
>>>>
>>>>> A quick follow up. I watched one of the machines on the farm doing a
>>>>> sim - I just did a quick splash tank to make sure it wasn't my scene.
>>>>> Early on the CPUs are pegged and everything moves along. RAM is not
>>>>> even close to being an issue, the process is using up about 2GB. Looks
>>>>> like maybe it's the hython process? Anyway, at some point the CPU usage
>>>>> just drops to nothing. The process is still alive, but doesn't seem to
>>>>> be doing anything.
>>>>>
>>>>> On Sun, Jan 29, 2017 at 12:30 PM, Gary Jaeger <gary at corestudio.com>
>>>>> wrote:
>>>>>
>>>>>> Anybody have any insight into this? I have a flip sim that I want to
>>>>>> distribute. I'm pretty sure it's all set up correctly, because the sim
>>>>>> starts and all the slices get part of the way through, but always end
>>>>>> up failing.
>>>>>>
>>>>>> I've tried both slice and slice along. When I try a slice along, the
>>>>>> job has been getting about 14% through, then just hanging up. No error
>>>>>> messages, etc. It just never progresses. I've also tried slice and
>>>>>> chopping the sim into 4 quadrants. In that case I was seeing things
>>>>>> like this:
>>>>>>
>>>>>>
>>>>>> ALF_PROGRESS 27%
>>>>>> ALF_PROGRESS 28%
>>>>>> Read error on ack: Error Occurred of 12
>>>>>> Error occurred in message 12 state is 5
>>>>>> ---- Pump enters error status ----
>>>>>> Tracker reports an error, aborting
>>>>>>
>>>>>>
>>>>>> ALF_PROGRESS 27%
>>>>>> ALF_PROGRESS 28%
>>>>>> Tracker reports an error, aborting
>>>>>> Write pipe error: Error Occurred offset 0 of 4
>>>>>> Error occurred in message 4 state is 9
>>>>>> EOF in pipe at position 0 of 4
>>>>>> Error occurred in message 4 state is 6
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Though the tracker task isn't reporting any errors in hqueue that I
>>>>>> can see.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Gary Jaeger // Core Studio
>>>>>> 249 Princeton Avenue
>>>>>> Half Moon Bay, CA 94019
>>>>>> 650.728.7957 (direct) • 650.728.7060 (main)
>>>>>> http://corestudio.com
>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>> Sidefx-houdini-list mailing list
>>>> Sidefx-houdini-list at sidefx.com
>>>> https://lists.sidefx.com:443/mailman/listinfo/sidefx-houdini-list
>>>>
>
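
For anyone hitting the same "Write pipe error" / "Read error on ack" messages: since Rob's read is that communication between the client machines is being disrupted, one cheap thing to rule out is whether every client can open a TCP connection to the machine hosting the tracker at all. Below is a minimal sketch of such a check in plain Python. The host names and the port number in the usage comment are placeholders, not values from this thread, and the script knows nothing about HQueue or the sim tracker itself. It only tests whether a TCP connection can be opened.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        # create_connection raises OSError (including timeouts and DNS
        # failures) when the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_hosts(hosts, port, timeout=3.0):
    """Return the hosts that refuse or time out on the given port."""
    return [h for h in hosts if not can_connect(h, port, timeout)]


# Hypothetical usage, run from each farm machine in turn:
#   unreachable_hosts(["farm01.local", "farm02.local"], port=8000)
# Substitute the real host names and whatever port your tracker listens on.
```

A single successful connection does not rule out intermittent drops, so it may be worth running this in a loop while a sim job is active and logging the timestamps of any failures.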



-- 
Gary Jaeger // Core Studio
249 Princeton Avenue
Half Moon Bay, CA 94019
650.728.7957 (direct) • 650.728.7060 (main)
http://corestudio.com
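
Rob's suggestion quoted above (submit the sim to different machine combinations and see which succeed) can be organised with a few lines of Python. This is only a bookkeeping sketch: the host names are made up, job submission is left to you, and it assumes a single bad machine fails every job it takes part in, which may not hold for an intermittent network fault.

```python
from itertools import combinations


def candidate_groups(hosts, group_size):
    """List every group of `group_size` hosts to try, one sim job per group."""
    return [tuple(g) for g in combinations(hosts, group_size)]


def suspect_hosts(results):
    """Given {host_group: succeeded}, return the hosts that appear in every
    failed group and in no successful group."""
    failed = [set(g) for g, ok in results.items() if not ok]
    passed = [set(g) for g, ok in results.items() if ok]
    if not failed:
        return set()
    suspects = set.intersection(*failed)
    for group in passed:
        suspects -= group
    return suspects


# Hypothetical farm of four machines, tested two at a time.
hosts = ["mac01", "mac02", "mac03", "mac04"]
groups = candidate_groups(hosts, 2)

# Fill this in by hand from the actual job outcomes; here mac03 is
# pretended to be the faulty machine for illustration.
results = {g: ("mac03" not in g) for g in groups}
print(suspect_hosts(results))
```

If the result is an empty set even though some jobs failed, no single machine explains the failures, which would point back at something shared such as the network or the tracker.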


