[Sidefx-houdini-list] render farms, pipelines and file systems

Antoine Durr antoine at floqfx.com
Wed Jan 7 00:42:39 EST 2009


On Jan 6, 2009, at 7:28 PM, Drew Whitehouse wrote:

> Hi all,
>
> I'm interested in hearing about experiences people have had
> implementing render farms where you *don't* have a shared file system

This puts a really interesting spin on the problem.  I think most vfx
facilities have centralized file servers that all their workstations and
render nodes can access, e.g. via NFS automount.  Doing it via rsync,
ftp, rcp, or some other protocol that is not like NFS seems particularly
challenging.

> between workstations and render farms. I currently have a solution
> that involves shipping resources back and forth. However, it requires
> parsing IFD files and heuristically determining what resources are
> needed for a render and what results needed to be transported back
> post render. I'd like to make this more robust and I'm sure that this
> is something that large facilities have to solve sooner or later.

Well, no, not with centralized file servers.  That's kind of the whole
point.  That being said, there are many rendering wrappers out there
that repoint the output filepath to be local, complete the render, and
then copy the frame back to its originally intended location.  I've had
to do that in the past when network bandwidth was an issue, or automount
reliability was weak.
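
A minimal sketch of such a wrapper, in Python.  The names render_locally
and build_command are made up for illustration; build_command stands in
for whatever produces your renderer's command line given an output path,
and nothing here is specific to any one renderer.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def render_locally(build_command, final_path):
    """Render to local scratch, then copy the frame to its intended path.

    build_command is a hypothetical callable: it takes the local output
    path and returns the renderer command line as a list of strings.
    """
    final_path = Path(final_path)
    with tempfile.TemporaryDirectory(prefix="localrender_") as scratch:
        local_path = Path(scratch) / final_path.name
        # The renderer only ever writes to local disk.
        subprocess.run(build_command(str(local_path)), check=True)
        final_path.parent.mkdir(parents=True, exist_ok=True)
        # Copy under a temporary name and rename afterwards, so nobody
        # picks up a half-written frame at the final location.
        partial = final_path.with_name(final_path.name + ".part")
        shutil.copy2(local_path, partial)
        partial.replace(final_path)
    return final_path
```

If the render fails, check=True raises before anything is copied, so a
bad frame never lands in the intended location.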

> And
> the same problems must crop up whatever software is used (prman,
> MentalRay etc). Following is a bunch of questions that come to mind,
> some houdini specific, others more general.

Yes indeed.  But that is why IFD/RIB/etc. are so useful, because you can
hack them in transit, e.g. make a filter that first copies the texture to
a local temp dir, and repoints the texture call to use that file that is
now local.
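
A rough sketch of such a filter in Python, for a plain-text scene
description.  It assumes texture references show up as double-quoted
paths ending in .rat, which is a simplification; the pattern, the
function name localize_textures and the cache layout are all
illustrative and would need adapting to real IFD or RIB syntax.

```python
import hashlib
import re
import shutil
from pathlib import Path

# Assumption: textures appear as double-quoted paths ending in .rat.
TEXTURE_REF = re.compile(r'"([^"]+\.rat)"')


def localize_textures(lines, cache_dir):
    """Yield scene lines with texture paths repointed to local copies."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    def repoint(match):
        source = Path(match.group(1))
        if not source.is_file():
            return match.group(0)  # leave unresolved references alone
        # Prefix with a hash of the full source path so that textures
        # with the same file name in different directories don't clash.
        digest = hashlib.sha1(str(source).encode()).hexdigest()[:12]
        local = cache_dir / f"{digest}_{source.name}"
        if not local.exists():
            shutil.copy2(source, local)
        return f'"{local}"'

    for line in lines:
        yield TEXTURE_REF.sub(repoint, line)
```

Used as a stream filter, this would read the scene file line by line and
write the rewritten lines to the renderer's input.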

>
> Do you generate IFD's out on the farm or on artists workstations, and
> if so how are resources synced across to where they are needed ?
>
> Is it the case that you purchase expensive global file systems and
> maintain a single shared file space ? If so, are they robust and
> performant ?

There is a hitch with "single shared file space."  The challenge is to
make a single virtual filesystem that does *not* have the file server's
name in it.  I.e., you want:

   /job/show/scene/this/that

not

   /fs23/job/show/scene/this/that

as the latter will fail if data has to be moved around to balance disk
usage and paths are already hardcoded into IFD/RIB files.
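
To illustrate the indirection, here is a small Python sketch.  The table
SHOW_LOCATION, the server name /fs07 and the function physical_path are
made up for the example; in practice this mapping would normally live in
the automounter or symlink layer rather than in application code.

```python
from pathlib import PurePosixPath

# Hypothetical table: which file server currently holds each show.
# Moving a show to another server means editing only this table.
SHOW_LOCATION = {"show": "/fs23", "othershow": "/fs07"}


def physical_path(logical, locations=SHOW_LOCATION):
    """Map a logical /job/<show>/... path to its current server path."""
    parts = PurePosixPath(logical).parts
    if len(parts) < 3 or parts[:2] != ("/", "job"):
        raise ValueError(f"not a /job path: {logical}")
    server = locations[parts[2]]  # parts[2] is the show name
    return str(PurePosixPath(server, *parts[1:]))
```

Scene files keep referring to /job/show/scene/this/that; if the show
moves off fs23, only the table entry changes and the hardcoded paths
stay valid.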


>
> Obviously there are significant performance issues to consider. Eg you
> are working with Gb's of texture per frame which you don't want to

There's a good solution to that: rat files.  They are mip-mapped, so at
render time, only the bucket that is currently being rendered, at the
needed resolution, is actually pulled across the net.  I'm not sure how
often we have actual GBs of textures that are all needed on a single
frame, that would not scale down to manageable MBs with mip-mapping.
Failing that, you have to netcache the textures (the filter I described
above), but you run the risk that you're accessing stale data.
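
One way to reduce the stale-data risk is to compare the cached copy
against the source before trusting it.  The Python sketch below uses
file size and modification time as the freshness test, which is an
assumption on my part: it misses edits that change neither, and the
stat call still costs a network round trip, just a far cheaper one than
the copy.  The names is_stale and refresh are illustrative.

```python
import shutil
from pathlib import Path


def is_stale(source, cached):
    """True if the cached copy is missing or differs in size or mtime."""
    source, cached = Path(source), Path(cached)
    if not cached.exists():
        return True
    src, dst = source.stat(), cached.stat()
    # Whole seconds only: filesystems differ in timestamp resolution.
    return (src.st_size, int(src.st_mtime)) != (dst.st_size, int(dst.st_mtime))


def refresh(source, cached):
    """Re-copy the file only when the cached copy looks stale."""
    if is_stale(source, cached):
        Path(cached).parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, cached)  # copy2 preserves the source mtime
        return True
    return False
```

A checksum would be a stricter test, at the price of reading the whole
source file over the network, which defeats the purpose of the cache.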

> ship across to a render node every time a frame is rendered. There are
> many possible solutions to this problem but I'd love to hear about
> experiences in the trenches ...
>
> At what point do you find that single NFS servers grind to halt ? Is
> scalability and locality a solved problem or something you're always
> having to tweak ?

In a typical network (in the vfx biz), you're always either short on
cpus, short on bandwidth to get data to/from said cpus, or short on disk
space, or more typically, all three.  The speed of your file server (and
to a significant extent, your network topology and setup) will determine
how many clients (render cpus) can access your data (file servers) at a
given time.

You monitor your queue jobs for "efficiency", i.e. how well the cpu was
utilized.  High 90's are good, but if you start to drop below 80% or so
(YMMV), you're incurring a distinct I/O hit.  This might be either that
your job is waiting for a lot of data, or is in heavy contention for disk
access.  If the latter is the case, you have to throttle back the number
of simultaneous cpus that are rendering.  Or, the fix might simply be to
stagger the launch of the various frames, so that you balance your
network load.

The point being that there are multiple ways that your stuff slows down,
and you can really only tell by actually throwing a bunch of jobs at your
network and seeing how much it can take.  Then, depending on the $
availability, you buy faster network equipment, more/faster disk, upgrade
to GigE or 10GigE, bigger switches, infiniband on the disk servers,
bigger switches, more cpus, bigger switches, and so on :-).  The art in
all this is automating the load balancing, so that you use the bandwidth
you have effectively.
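
As a sketch of what automating the throttle could look like, in Python:
efficiency is taken here to be cpu time divided by wall-clock time per
core, which is my reading of "how well the cpu was utilized".  The 80%
floor comes from the rule of thumb above; the 95% ceiling, the 10% step
and both function names are assumptions for illustration only.

```python
def cpu_efficiency(cpu_seconds, wall_seconds, cores=1):
    """Fraction of the available cpu time that the job actually used."""
    if wall_seconds <= 0 or cores <= 0:
        raise ValueError("wall_seconds and cores must be positive")
    return cpu_seconds / (wall_seconds * cores)


def next_concurrency(current, efficiencies, low=0.80, high=0.95, step=0.1):
    """Suggest how many render cpus to run at once for the next batch.

    Below `low` the jobs are probably starved for I/O, so back off.
    Above `high` there is likely headroom, so add clients.
    """
    if not efficiencies:
        return current
    mean = sum(efficiencies) / len(efficiencies)
    delta = max(1, round(current * step))
    if mean < low:
        return max(1, current - delta)
    if mean > high:
        return current + delta
    return current
```

For example, a frame that used 540 cpu seconds over 600 seconds of wall
clock on one core ran at 90% efficiency, which this scheme would leave
alone.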

-- Antoine

>
> Are you restricted to certain workstation architectures ?
>
> Does your solution restrict you to certain applications ?
>
> Any feedback appreciated,
> Drew
>
> -- 
> Drew Whitehouse
> ANU Supercomputer Facility Vizlab
> _______________________________________________
> Sidefx-houdini-list mailing list
> Sidefx-houdini-list at sidefx.com
> https://lists.sidefx.com:443/mailman/listinfo/sidefx-houdini-list


Antoine Durr
Floq FX Inc.
10659 Cranks Rd.
Culver City, CA 90230
310/430-2473




