Journey to the Cloud
Issue: Volume 41 Issue 2: (Edition 2 2018)

FuseFX is an award-winning visual effects studio specializing in visual effects for episodic television, film, commercials, games, and special venues. FuseFX employs around 300 people and has three studio locations: the flagship Los Angeles office, New York City, and Vancouver, BC.

Today, FuseFX's three locations have more than 60 television shows in production simultaneously, in addition to various commercial and feature film projects. The company has provided visual effects for all the major studios on such productions as American Horror Story, Marvel's Agents of S.H.I.E.L.D., and The Tick.

Jason Fotter, co-founder and CTO at FuseFX, is very aware of the challenges that come with building and running a renderfarm. He states, "For me, it's been a 'learn as you go' process. I've been surprised many times throughout the growth of the company. The amount of power and heat that a renderfarm generates and the infrastructure needed to carry it is massive."

He continues: "I've found over the years that no matter what size farm you have, you can easily overrun it at any given moment. The more you have, the more you will use. The problem arises when you are up against a delivery and time is not on your side. We need to be able to act quickly at these moments, and that's hard to do with physical infrastructure. Power, cooling, and physical space are all finite resources that put limits on what you can achieve."

An ever-present constraint is that episodic television shows have tight deadlines. "We have two to three weeks to get our work done [with episodic television]. Feature films have six months to a year or more. Commercials define their own schedules," Fotter says. "TV is a churning process. You get your shots in, you get two or three weeks to do them, and Boom! they're out. Next episode, same thing. Next episode, same thing. It's really fast-paced."

Aggressive schedules mean success can bring its own set of problems. Even renting equipment may not be a feasible solution. Consider how long it takes to order, deliver, and rack and stack the nodes, and the challenge of finding available rental hardware along with enough data-center space, power, networking, and cooling. It may seem like there's no answer, unless you start looking at the cloud.

"Before the cloud, I don't know if there was a solution. Maybe really expensive co-location, or some other crazy scenario, but the cloud started to become a reasonable way for us to get some of our more pressing render jobs done," says Fotter.

The First Steps

For the first foray into the cloud, FuseFX teamed up with Bracket Computing. At the time, Bracket Computing was a startup that focused primarily on cloud security, but they helped FuseFX get started. "We had some connections with them, and they asked if we were interested in the cloud. It was the right place at the right time," Fotter says. "I said, 'I would like to see if we can leverage the compute power on the cloud, but I don't have any experience with cloud computing. You guys know the cloud, I know what's needed for a renderfarm, let's see if we can figure something out.' They helped me understand the cloud, and together we built the beginnings of our cloud rendering workflow."

Around the same time, FuseFX opened its remote offices in New York and Vancouver. From their inception, the cloud was built into the workflows there. The immediate problem was how to transfer data to and from these locations. The company wanted to use each office as needed for production work and rendering. To solve the problem, the company designed and implemented its own synchronization software, powered by its proprietary production platform, Nucleus. With it, they can define any asset, specify where it needs to be - including the cloud - intelligently get it there, and send back the results.

Enter QF2

Late last year, Fotter learned that Bracket Computing was no longer going to be an option, and he began to look for alternatives. He clarifies, "I was really focused on price and performance. Who had the features that we were looking for? Who wanted to develop a relationship with us in VFX rendering? I thought our process was really innovative, and I wanted someone who felt the same way."

While he was evaluating his options, Amazon bought Thinkbox, the creators of Deadline, software that manages rendering pipelines. FuseFX was already running Deadline in the cloud, and AWS was looking for just such a customer, so Fotter knew he had found the partner FuseFX was looking for.

One of Fotter's and FuseFX's goals was to expand their virtual renderfarm. With the Bracket solution, he was running a single, high-powered Linux instance on AWS, but the storage architecture couldn't handle more than 200 to 300 virtual machines.

Fotter knew he needed fast clustered storage if he wanted to run more instances. He adds, "We came up with all kinds of ideas. We thought about leveraging S3 and syncing everything to the local machines, but that didn't fit with the way we work. We talked to Avere multiple times, but they're very NFS-centric and we're a Windows shop. Nothing was really hitting the mark for exactly what I was looking for."

FuseFX already had a Qumulo File Fabric (QF2) cluster on premises. QF2 is a modern, highly scalable file storage system. It can scale to billions of files, handles small and large files with equal efficiency, and gives administrators real-time insight and control. Fotter had spoken with Qumulo about his need for a cloud-based solution. When he learned that the company was working on extending QF2 to AWS, Fotter jumped at the chance to try it out. The team experimented with a single instance early on and liked what they saw. When the four-node cluster became available, he was ready to integrate it into his production workflow.

The Test of The Tick

The QF2 cluster was put to the test when the company was working on an episode of The Tick. Fotter describes the situation: "Our process is that people work during the day, submit their jobs, then we render overnight. When they come in the next day, they look at the frames, evaluate where they're at, and either send it off to the next task or they might decide they need to re-render something.

"And again, we only have two to three weeks for a single episode. We often start a project close to the delivery of the first episodes. We don't have a lot of time to waste. If we have a problem, it's always a critical problem. We came in one morning and discovered there had been problems overnight. There must have been 50 jobs queued up that hadn't rendered a single frame. The stress level of the production team was pretty high at that moment. We had been targeting 1,000 machines as a maximum target for capacity. I knew that a moment would come where we would want to burst that high, and it was apparent that now was that time. Each EC2 Spot instance was 32 cores, so that's 32,000 cores at one time!"

Fotter told his render wranglers that if they had a frame to render, to turn on a node for it. "Just get it done," he recalls saying. "We knew that with QF2, we would be able to support that kind of throughput. And we did it. We got the frames rendered in the cloud and got them back down on premise." He says they were actually rendering so fast that the bottleneck was getting the frames back from the cloud cluster.

"We saved ourselves. That's actual proof that the solution works. There's no possible way I could install 1,000 machines in our network here. I don't have the power or cooling to support them," says Fotter. "We were able to make the decision, and in less than one hour be rendering on 1,000 machines. After the jobs finished, we simply terminated the instances. When I think about how easy it was, it still doesn't sound real."

Chris Leslie is the supervising systems engineer at FuseFX. To quantify QF2 performance, he offers the following: "At the peak we saw 40,000 IOPS. The highest throughput was 3.87GB/sec."
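As a rough back-of-the-envelope check, the figures quoted above can be combined as follows. This is only a sketch: it assumes the storage peaks coincided with the full 1,000-instance burst and that the load was spread evenly across instances, neither of which is stated here. The variable names are illustrative.

```python
# Figures quoted in the article.
INSTANCES = 1_000            # peak number of EC2 Spot instances
CORES_PER_INSTANCE = 32
PEAK_IOPS = 40_000
PEAK_THROUGHPUT_GB_S = 3.87

total_cores = INSTANCES * CORES_PER_INSTANCE

# Assumption (not from the article): both storage peaks occurred during the
# 1,000-instance burst, with load spread evenly across instances.
iops_per_instance = PEAK_IOPS / INSTANCES
# Decimal units: 1 GB = 1000 MB.
mb_per_s_per_instance = PEAK_THROUGHPUT_GB_S * 1000 / INSTANCES

print(f"{total_cores} cores in total")
print(f"{iops_per_instance:.0f} IOPS per instance")
print(f"{mb_per_s_per_instance:.2f} MB/sec per instance")
```

Under those assumptions, each render node needed only a modest share of the cluster's capacity. The demand on storage came from the number of nodes reading and writing at once.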

The Pipeline

Besides QF2, the FuseFX pipeline uses EC2 Spot Instances for scalable, low-cost computing, Deadline for queue management and for managing bids on the Spot Instances, Thinkbox Marketplace usage-based licensing (UBL) for flexible licensing, and V-Ray for rendering.

Fotter explains how the UBL store works. "If you exhaust your local licenses, you can purchase per-minute or per-hour licenses of Deadline and V-Ray. Once your local license limit is reached, the software sends those requests to the store, monitors the usage, and deducts from that time. It's like a calling card. You buy a calling card with an hour of calling time on it and every call you make deducts from that." Everything is coordinated by the on-premises server, which is connected to the cloud instances with a VPN.
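The calling-card idea can be illustrated with a small model. This is a minimal sketch of the general pattern Fotter describes (use local licenses first, then draw down a prepaid balance), not the actual Thinkbox Marketplace or Deadline API. The class `LicensePool` and its fields are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class LicensePool:
    """Hypothetical model: fixed local seats plus a prepaid balance of minutes."""
    local_seats: int
    prepaid_minutes: float
    local_in_use: int = 0

    def checkout(self) -> str:
        """Return which kind of license a new render task receives."""
        if self.local_in_use < self.local_seats:
            self.local_in_use += 1
            return "local"
        if self.prepaid_minutes > 0:
            return "metered"
        raise RuntimeError("no local seats free and prepaid balance exhausted")

    def release(self, kind: str, minutes_used: float) -> None:
        """Free a local seat, or deduct metered usage from the prepaid balance."""
        if kind == "local":
            self.local_in_use -= 1
        else:
            self.prepaid_minutes = max(0.0, self.prepaid_minutes - minutes_used)


# Two local seats and one hour of prepaid time.
pool = LicensePool(local_seats=2, prepaid_minutes=60.0)
kinds = [pool.checkout() for _ in range(3)]   # third request overflows to metered
pool.release("metered", minutes_used=15.0)    # a 15-minute render is deducted
print(kinds, pool.prepaid_minutes)
```

In this toy version, the third task runs on metered time because both local seats are taken, and the prepaid balance falls from 60 to 45 minutes once it finishes.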

Once a job's data is synchronized to the QF2 cluster in AWS, rendering can occur both locally and in the cloud at the same time. A local machine can, for example, pick up the first frame, and a cloud node can pick up the second frame. Deadline manages the distribution so that the cloud is simply an extension of the on-premises renderfarm.
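The frame-by-frame hand-off described above can be pictured as a single shared queue that every worker pulls from, wherever that worker runs. The sketch below is a simplified illustration of that idea and not how Deadline itself is implemented. The function `distribute` and the worker names are hypothetical, and workers are polled in round-robin order purely to keep the example deterministic.

```python
from collections import deque


def distribute(frames, workers):
    """Hand out frames one at a time from a shared queue.

    Each worker takes the next available frame when its turn comes, so local
    and cloud machines draw from the same pool of work.
    """
    queue = deque(frames)
    assignments = {worker: [] for worker in workers}
    while queue:
        for worker in workers:
            if not queue:
                break
            assignments[worker].append(queue.popleft())
    return assignments


# One on-premises machine and one cloud node sharing a six-frame job.
result = distribute(range(1, 7), ["local-01", "cloud-01"])
print(result)
```

Here the local machine takes frame 1, the cloud node takes frame 2, and so on. Neither worker needs to know where the other is running.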

FuseFX is still working on automation. Leslie explains, "We use a custom AMI that has some internal automation. For that, we use CloudFormation. It gets itself on the network, mounts the Qumulo storage, sets up the Deadline slaves, and a few other things. Right now, we start and terminate the QF2 instances manually."

Fotter adds, "If we have a long-term timeframe where we know we're not going to use QF2, we terminate it and we tell the Qumulo support team. We've learned that we should tell them when we're turning it off because they monitor it so nicely that, otherwise, when we do terminate it, people start calling me to tell me my cloud cluster is down."

Lessons Learned

Fotter has learned quite a bit since FuseFX first began using the cloud. He explains, "Getting the workflow right is the biggest challenge. Rendering is complicated, and visual effects is an inherently inefficient process. The more that you can create efficiencies in the workflow, the better off you're going to be."

Solving the data synchronization issue is the hardest part, contends Fotter, because render jobs require a lot of assets: textures, geometry, simulation caches, and whatever else is needed to create the final image. When you're rendering in the cloud, if you're missing one little texture and that job renders incorrectly, you've wasted all that money, he adds.
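One way to guard against the missing-texture problem is a preflight check that compares what a job needs with what has already reached cloud storage. The following is a minimal sketch of that idea. It is not FuseFX's Nucleus software, and the function name and file paths are made up for illustration.

```python
def missing_assets(required, synced):
    """Return the required asset paths not yet present on cloud storage."""
    return sorted(set(required) - set(synced))


# Hypothetical job manifest and the set of files already synchronized.
job_assets = [
    "shots/ep01/sh010/geo/hero.abc",
    "shots/ep01/sh010/tex/suit_diffuse.exr",
    "shots/ep01/sh010/cache/smoke.vdb",
]
on_cloud = {
    "shots/ep01/sh010/geo/hero.abc",
    "shots/ep01/sh010/cache/smoke.vdb",
}

missing = missing_assets(job_assets, on_cloud)
ready_to_render = not missing
print(missing, ready_to_render)
```

A job would be released to cloud nodes only when the missing list is empty, so a forgotten texture is caught before any paid render time is spent.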

"We've gone through those pains," Fotter notes. "We've learned the hard way, but being committed to the process and knowing that you can create a solution has always been my focus. So, to boil it down, my advice is to test it. Come up with a plan, test it, be committed to it, and really understand your workflow from start to finish."

Fotter also affirms the importance of file-based data to his workflow. "It would be nice to be able to use object storage, but we don't have a single product in our environment that uses it. It doesn't make sense. We're a file-based workflow. That's the way the visual effects process works. We have a large amount of files on a file system. We read them. We pull them into our applications," he says. "We work on them. We do our creative work, and we create more files."

Files are the medium of exchange between applications that were not necessarily written by the same company. How do you get something from the animation package into the rendering package? Those are two different disciplines, two different areas of focus, so you must create workflows that integrate across applications, and a file is the way to do that, according to Fotter.

"It follows then, that without a high-performance file system in the cloud, our workflow would be impossible," says Fotter. "QF2 is at the foundation of our AWS storage solution. Without it, we wouldn't be able to expand to the capacity that we have."