[Sidefx-houdini-list] Distributed sims failing

Andy Nicholas andy at andynicholas.com
Sun Feb 5 16:01:33 EST 2017


Have a look at Checkpoints on the Cache tab of the ROP.

If you only want to store the last completed simulated frame, then set both the Checkpoint Trail Length and the Checkpoint Interval to 1. Bear in mind that if the checkpoint files are large and slow to write, writing one every frame can add a significant amount of time to the sim.

The way Houdini deals with clearing them out can be a bit unreliable, so make sure you manually delete them once you’re done with them, otherwise they can screw up new simulations later (or just use $HIPNAME in the checkpoint filename and make sure you save a new scene each time you re-sim).
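
For the manual cleanup, something along these lines might do as a starting point. It is only a sketch: the "<hipname>.<frame>.sim" naming with zero-padded frame numbers is an assumption (match it to whatever you typed in the checkpoint filename field), and the function names are made up for illustration, not part of Houdini. It defaults to a dry run so you can see what would go before anything is deleted.

```python
from pathlib import Path


def stale_checkpoints(cache_dir, hip_name, keep_last=0):
    """List checkpoint files for one scene, oldest first.

    Assumes files are named "<hip_name>.<frame>.sim" with zero-padded
    frame numbers, so sorting by name is the same as sorting by frame.
    The newest `keep_last` files are left out of the result.
    """
    files = sorted(Path(cache_dir).glob(f"{hip_name}.*.sim"))
    return files[:max(0, len(files) - keep_last)]


def delete_checkpoints(cache_dir, hip_name, keep_last=0, dry_run=True):
    """Delete stale checkpoints and return their names.

    With dry_run=True (the default) nothing is removed; the names are
    only reported.
    """
    removed = []
    for path in stale_checkpoints(cache_dir, hip_name, keep_last):
        if not dry_run:
            path.unlink()
        removed.append(path.name)
    return removed
```

Calling delete_checkpoints("/path/to/checkpoints", "shot_v1", keep_last=1) would report everything for that scene except the most recent checkpoint, which is the one you would resume from. Pass dry_run=False once the list looks right.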


> On 5 Feb 2017, at 15:29, Gary Jaeger <gary at CORESTUDIO.COM> wrote:
> 
> I’ve actually had a few sims go all the way through, though it’s far from consistent. So it CAN work, but something is definitely wonky. 
> 
> Is it possible to pick up a partially saved sim? For instance, if my distributed sim gets 30% through and dies, is it possible to just pick up where it left off? I don’t see anything obvious in the docs. Thanks.
> 
> Gary Jaeger / 650.728.7957 direct / 415.518.1419 mobile
> http://corestudio.com
>> On Feb 4, 2017, at 8:47 AM, Gary Jaeger <gary at corestudio.com> wrote:
>> 
>> Thanks so much Rob! Yeah, it must be something specific to the communication between the machines and simming. Our farm has been very reliable for all sorts of distributed tasks, including writing ifd files and rendering mantra. But this sim thing is killing us!
>> 
>> On Fri, Feb 3, 2017 at 11:34 AM, rvinluan <rvinluan at sidefx.com> wrote:
>> Hi Gary,
>> 
>> Sorry for the late reply.
>> 
>> I tried out the splash tank .hip file on our farm here and was able to simulate through all 120 frames across 5 slices/machines.  So I don't think the errors are specific to the scene or simulation setup.
>> 
>> Looking at the errors from the job output it looks like general networking issues.  It's as if communication between the client machines is disrupted causing a breakdown.  I noticed from the diagnostics files that you are using MacOS for your client machines so I wonder if the networking issue is specific to Mac.  For example, we have a couple of users here on MacBook Pros using El Capitan and Sierra and they complain that their Macs disconnect from the wifi for inexplicable reasons at least once or twice a day.
>> 
>> When I tested the splash tank scene I used an all Linux farm since I didn't have enough dedicated Mac machines.
>> 
>> Maybe try simplifying the setup by reducing the number of slices/machines and reducing the number of frames?  Perhaps only one of the machines has a networking issue which is causing everything to fail.  You can submit the sim job to different machines to see if a certain combination succeeds or fails.
>> 
>> Cheers,
>> Rob
>> 
>> On 2017-01-31 12:46 PM, Gary Jaeger wrote:
>> Thanks Antoine-
>> 
>> Just checked and they all have at least 600GB avail on their boot drives.
>> 
>> Gary Jaeger / 650.728.7957 direct / 415.518.1419 mobile
>> http://corestudio.com
>> 
>> On Jan 31, 2017, at 9:14 AM, Antoine Durr <antoinedurr at gmail.com> wrote:
>> 
>> It sounds like you’re running out of disk space on one of the machines used for the distributed sim.
>> 
>> — Antoine
>> 
>> On Jan 31, 2017, at 7:11 AM, Gary Jaeger <gary at corestudio.com> wrote:
>> 
>> Hi Gary,
>> 
>> If I were to guess it almost looks like a general network disruption.  I'm basing that on the "Write pipe error" messages. Perhaps a disruption occurs while a message is passed from one machine to another so the message is incomplete?
>> 
>> Anyway, would you be able to post the job output and diagnostics files for one of the failed slice jobs?  And also the .hip file?
>> 
>> I can give it a whirl here and see if it's an issue with HQueue or distributed sims.
>> 
>> Cheers,
>> Rob
>> 
>> On 2017-01-29 7:08 PM, Gary Jaeger wrote:
>> A quick follow up. I watched one of the machines on the farm doing a sim - I
>> just did a quick splash tank to make sure it wasn't my scene. Early on the
>> CPUs are pegged and everything moves along. RAM is not even close to being
>> an issue, the process is using up about 2GB. Looks like maybe it's the
>> hython process? Anyway, at some point the CPU usage just drops to nothing.
>> The process is still alive, but doesn't seem to be doing anything.
>> 
>> On Sun, Jan 29, 2017 at 12:30 PM, Gary Jaeger <gary at corestudio.com> wrote:
>> 
>> Anybody have any insight into this? I have a flip sim that I want to
>> distribute. I'm pretty sure it's all set up correctly, because the sim
>> starts and all the slices get part of the way through, but always end up
>> failing.
>> 
>> I've tried both slice and slice along. When I try a slice along, the job
>> has been getting about 14% through, then just hanging up. No error
>> messages, etc. It just never progresses. I've also tried slice and chopping
>> the sim into 4 quadrants. In that case I was seeing things like this:
>> 
>> 
>> ALF_PROGRESS 27%
>> ALF_PROGRESS 28%
>> Read error on ack: Error Occurred of 12
>> Error occurred in message 12 state is 5
>> ---- Pump enters error status ----
>> Tracker reports an error, aborting
>> 
>> 
>> ALF_PROGRESS 27%
>> ALF_PROGRESS 28%
>> Tracker reports an error, aborting
>> Write pipe error: Error Occurred offset 0 of 4
>> Error occurred in message 4 state is 9
>> EOF in pipe at position 0 of 4
>> Error occurred in message 4 state is 6
>> 
>> --
>> 
>> Though the tracker task isn't reporting any errors in hqueue that I can
>> see.
>> 
>> Any ideas?
>> 
>> 
>> 
>> --
>> Gary Jaeger // Core Studio
>> 249 Princeton Avenue
>> Half Moon Bay, CA 94019
>> 650.728.7957 (direct) • 650.728.7060 (main)
>> http://corestudio.com
>> 
>> 
>> _______________________________________________
>> Sidefx-houdini-list mailing list
>> Sidefx-houdini-list at sidefx.com <mailto:Sidefx-houdini-list at sidefx.com>
>> https://lists.sidefx.com:443/mailman/listinfo/sidefx-houdini-list <https://lists.sidefx.com/mailman/listinfo/sidefx-houdini-list>
>> 
>> 
>> 
>> -- 
>> Gary Jaeger // Core Studio
>> 249 Princeton Avenue
>> Half Moon Bay, CA 94019
>> 650.728.7957 (direct) • 650.728.7060 (main)
>> http://corestudio.com



