[Sidefx-houdini-list] render farms, pipelines and file systems
antoine at floqfx.com
Wed Jan 7 00:42:39 EST 2009
On Jan 6, 2009, at 7:28 PM, Drew Whitehouse wrote:
> Hi all,
> I'm interested in hearing about experiences people have had
> implementing render farms where you *don't* have a shared file system
This puts a really interesting spin on the problem. I think most vfx
facilities have centralized file
servers that all their workstations and render nodes can access, e.g.
via NFS automount. Doing
it via rsync, ftp, rcp, or some other protocol that is not like NFS
seems particularly challenging.
> between workstations and render farms. I currently have a solution
> that involves shipping resources back and forth. However, it requires
> parsing IFD files and heuristically determining what resources are
> needed for a render and what results needed to be transported back
> post render. I'd like to make this more robust and I'm sure that this
> is something that large facilities have to solve sooner or later.
Well, no, not with centralized file servers. That's kind of the
whole point. That being said, there
are many rendering wrappers out there that repoint the output
filepath to be local, complete the
render, and then copy the frame back to its originally intended
location. I've had to do that in the
past when network bandwidth was an issue, or automount reliability
was a problem.
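That repoint-render-copy pattern boils down to a few lines of Python. This is an illustrative sketch, not any studio's actual wrapper; the `render_fn` callable stands in for launching mantra (or any renderer) with its output path overridden:

```python
import os
import shutil
import tempfile

def render_local_then_copy(render_fn, final_output):
    """Render into node-local scratch, then copy the frame to its
    intended network location. render_fn is a stand-in for whatever
    launches the renderer with the output path repointed."""
    local_dir = tempfile.mkdtemp(prefix="render_")
    local_out = os.path.join(local_dir, os.path.basename(final_output))
    try:
        render_fn(local_out)                   # render to local disk
        shutil.copy2(local_out, final_output)  # ship frame to the server
    finally:
        shutil.rmtree(local_dir)               # clean node-local scratch
    return final_output
```

The try/finally matters on a farm: scratch disks fill up fast if failed renders leave their temp dirs behind.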
> the same problems must crop up whatever software is used (prman,
> MentalRay etc). Following is a bunch of questions that come to mind,
> some houdini specific, others more general.
Yes indeed. But that is why IFD/RIB/etc. are so useful, because you
can hack them in transit, e.g.
make a filter that first copies the texture to a local temp dir, and
repoints the texture call to use that
file that is now local.
> Do you generate IFD's out on the farm or on artists workstations, and
> if so how are resources synced across to where they are needed ?
> Is it the case that you purchase expensive global file systems and
> maintain a single shared file space ? If so, are they robust and
> performant ?
There is a hitch with "single shared file space." The challenge is
to make a single virtual filesystem
that does *not* have the file server's name in it. I.e., you want
something like /jobs/myshow/tex/wood.rat rather than
/net/fileserver01/jobs/myshow/tex/wood.rat,
as the latter will fail if data has to be moved around to balance
disk usage and paths are already
hardcoded into IFD/RIB files.
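For example, with a Linux autofs indirect map the server names live only in the map, never in the paths that artists or IFDs see (hostnames and export paths below are made up):

```
# /etc/auto.master
/jobs   /etc/auto.jobs

# /etc/auto.jobs -- keys become /jobs/<key>; to move a show to another
# server you edit this one map, and hardcoded /jobs/... paths keep working
myshow     -rw,intr   fileserver01:/export/jobs/myshow
othershow  -rw,intr   fileserver02:/export/jobs/othershow
```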
> Obviously there are significant performance issues to consider. E.g. you
> are working with Gb's of texture per frame which you don't want to
There's a good solution to that: rat files. They are mip-mapped, so
at render time, only the bucket that is
currently being rendered, at the needed resolution, is actually
pulled across the net. I'm not sure how often
we have actual GBs of textures, all needed on a single frame,
that wouldn't scale down to manageable MBs
with mip-mapping. Failing that, you have to netcache the textures
(the filter I described above), but you run the risk
that you're accessing stale data.
> ship across to a render node every time a frame is rendered. There are
> many possible solutions to this problem but I'd love to hear about
> experiences in the trenches ...
> At what point do you find that single NFS servers grind to a halt ? Is
> scalability and locality a solved problem or something you're always
> having to tweak ?
In a typical network (in the vfx biz), you're always either short on
cpus, short on bandwidth to get data to/from said cpus,
or short on disk space, or more typically, all three. The speed of
your file server (and to a significant extent, your network
topology and setup) will determine how many clients (render cpus) can
access your data (file servers) at a given time.
You monitor your queue jobs for "efficiency", i.e. how well the cpu
was utilized. High 90's are good, but if you start to drop
below 80% or so (YMMV), you're incurring a distinct I/O hit. This
might be either that your job is waiting for a lot of data, or is in
heavy contention for disk access. If the latter is the case, you
have to throttle back the number of simultaneous cpus that
are rendering. Or, the fix might simply be to stagger the launch
of the various frames, so that you balance your
network load. The point being that there are multiple ways that your
stuff slows down, and you can really only tell by
actually throwing a bunch of jobs at your network and seeing how much it
can take. Then, depending on the $ availability,
you buy faster network equipment, more/faster disk, upgrade to GigE
or 10GigE, bigger switches, infiniband on the disk
servers, bigger switches, more cpus, bigger switches, and so on :-).
The art in all this is automating the load balancing,
so that you use the bandwidth you have effectively.
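The throttling heuristic above might reduce to something like this; the 80% floor and the proportional scale-down are illustrative numbers, not a recommendation:

```python
def throttle(current_slots, cpu_seconds, wall_seconds, floor=0.80):
    """Given aggregate cpu vs. wall time for recent frames, keep the
    current number of render slots if efficiency is healthy, otherwise
    scale the slot count down proportionally to the shortfall."""
    eff = cpu_seconds / wall_seconds
    if eff >= floor:
        return current_slots  # cpus are busy; I/O is keeping up
    return max(1, int(current_slots * eff / floor))
```

A queue manager would call this periodically and feed the result back as the farm's concurrency limit.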
> Are you restricted to certain workstation architectures ?
> Does your solution restrict you to certain applications ?
> Any feedback appreciated,
> Drew Whitehouse
> ANU Supercomputer Facility Vizlab
Floq FX Inc.
10659 Cranks Rd.
Culver City, CA 90230