Skills: Python (intermediate), basic compositing concepts, familiarity with Nuke or similar tools
Software: Python 3.10+, PyTorch 2.3+, Nuke 13+ (or After Effects), CUDA-capable GPU (recommended)
Time: 2-3 hours for full implementation
Introduction
Rotoscoping is one of the most time-intensive tasks in VFX production. Artists spend hours manually drawing and refining masks for complex plates—particularly when dealing with hair, motion blur, or occlusion. Meta's Segment Anything Model 2 (SAM2) offers a breakthrough: interactive segmentation that can handle these edge cases while integrating into production pipelines.
In this tutorial, you'll build a production-aware roto assistant that:
- Accepts interactive user prompts (clicks, boxes) for segmentation
- Handles temporal consistency across frames
- Exports masks compatible with Nuke and other compositing tools
- Manages GPU memory efficiently for high-resolution plates
Set Up Your Environment
First, let's set up the Python environment with all required dependencies:
# Create a new virtual environment
python -m venv sam2-roto-env
source sam2-roto-env/bin/activate # On Windows: sam2-roto-env\Scripts\activate
# Install PyTorch (CUDA 11.8 version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# Install SAM2
pip install git+https://github.com/facebookresearch/segment-anything-2.git
# Install additional dependencies
pip install opencv-python numpy pillow tqdm
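Before going further, confirm that PyTorch can see your GPU:
# Quick sanity check -- should print your torch version and True on a CUDA machine
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"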
Download SAM2 Model Checkpoints
SAM2 provides multiple model sizes. For production work, we recommend the sam2_hiera_large checkpoint, which balances quality and performance:
# Download checkpoint
mkdir checkpoints
cd checkpoints
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt
cd ..
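If VRAM is limited, the smaller checkpoints from the same release are drop-in replacements; pair each with its matching config (e.g. sam2_hiera_s.yaml or sam2_hiera_t.yaml). The URLs below follow the same naming pattern as the large checkpoint:
# Optional: smaller checkpoints for limited VRAM
wget -P checkpoints https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_small.pt
wget -P checkpoints https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt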
Prepare Your VFX Plates
SAM2 works best with properly prepared input sequences. Here's how to prepare your footage:
Convert to Image Sequence
SAM2's video predictor expects a folder of individual frames. The reference frame loader in the SAM2 repo reads JPEG sequences named by frame number, so if you're working with a video file, extract frames like this:
# Using FFmpeg to extract numbered JPEG frames
ffmpeg -i input_plate.mov -qscale:v 2 frames/%05d.jpg
# If you export from Nuke (EXR or PNG), convert to a numbered JPEG proxy sequence for SAM2 input
Downscale for Processing (Optional)
For 4K+ plates, consider downscaling to 2K for initial segmentation, then upscale the mask:
import cv2
import os

input_dir = 'frames_4k'
output_dir = 'frames_2k'
os.makedirs(output_dir, exist_ok=True)

for filename in os.listdir(input_dir):
    img = cv2.imread(os.path.join(input_dir, filename))
    img_2k = cv2.resize(img, (1920, 1080), interpolation=cv2.INTER_AREA)
    cv2.imwrite(os.path.join(output_dir, filename), img_2k)
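After segmentation, the masks need to go back to the original plate resolution before compositing. Here's a minimal sketch, assuming the masks live in the output_masks directory created later in this tutorial and that the original plate is UHD (3840x2160); adjust the paths and size for your shot:
import cv2
import os

mask_dir = 'output_masks'          # 2K masks exported later in this tutorial
upscaled_dir = 'output_masks_4k'   # assumed output location
target_size = (3840, 2160)         # width, height of the original plate

os.makedirs(upscaled_dir, exist_ok=True)
for filename in sorted(os.listdir(mask_dir)):
    mask = cv2.imread(os.path.join(mask_dir, filename), cv2.IMREAD_GRAYSCALE)
    # Nearest-neighbor keeps the matte binary; switch to INTER_LINEAR for a softer edge
    mask_4k = cv2.resize(mask, target_size, interpolation=cv2.INTER_NEAREST)
    cv2.imwrite(os.path.join(upscaled_dir, filename), mask_4k)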
Wire Up SAM2 for Segmentation
Now let's build the core segmentation engine. We'll create a class that handles SAM2 inference with interactive prompts:
import torch
from sam2.build_sam import build_sam2_video_predictor


class SAM2RotoAssistant:
    def __init__(self, checkpoint_path, model_cfg='sam2_hiera_l.yaml'):
        """Initialize the SAM2 video predictor."""
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.predictor = build_sam2_video_predictor(model_cfg, checkpoint_path, device=self.device)
        print(f"Using device: {self.device}")

    def load_sequence(self, frame_dir):
        """Load a frame sequence and initialize the inference state."""
        with torch.inference_mode(), torch.autocast(self.device.type, dtype=torch.float16):
            state = self.predictor.init_state(video_path=frame_dir)
        return state

    def add_click_prompt(self, state, frame_idx, points, labels):
        """
        Add interactive click prompts on a single frame.

        points: list of [x, y] pixel coordinates
        labels: list of 1 (foreground) or 0 (background), one per point
        """
        with torch.inference_mode():
            _, out_obj_ids, out_mask_logits = self.predictor.add_new_points(
                inference_state=state,
                frame_idx=frame_idx,
                obj_id=0,
                points=points,
                labels=labels,
            )
        return out_mask_logits

    def propagate_masks(self, state):
        """Propagate the prompted object mask across all frames."""
        masks = {}
        with torch.inference_mode():
            for frame_idx, obj_ids, mask_logits in self.predictor.propagate_in_video(state):
                masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
        return masks
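Here's how the pieces fit together for a single foreground click on the first frame; the paths and coordinates are placeholders for your own plate:
# Placeholder paths and click position -- substitute your own checkpoint and frame folder
assistant = SAM2RotoAssistant('checkpoints/sam2_hiera_large.pt')
state = assistant.load_sequence('frames_2k')

# One foreground click roughly on the subject in frame 0
assistant.add_click_prompt(state, frame_idx=0, points=[[960, 540]], labels=[1])

# Track the object through the rest of the sequence
masks = assistant.propagate_masks(state)
print(f"Generated masks for {len(masks)} frames")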
SAM2 can consume significant GPU memory for high-resolution sequences. If you encounter out-of-memory errors:
- Use the sam2_hiera_small checkpoint instead
- Process sequences in smaller chunks (e.g., 50-frame batches; see the sketch below)
- Pass offload_video_to_cpu=True (and offload_state_to_cpu=True) to init_state so frame data and tracking state stay in system RAM rather than VRAM
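One way to chunk the work without losing the shared inference state is to cap how many frames each propagation pass tracks and resume where it stopped. This sketch assumes propagate_in_video accepts start_frame_idx and max_frame_num_to_track arguments, as in the current SAM2 release; check your installed version:
import torch

def propagate_in_chunks(assistant, state, total_frames, chunk_size=50):
    """Propagate masks in bounded passes to limit peak GPU memory."""
    masks = {}
    for start in range(0, total_frames, chunk_size):
        with torch.inference_mode():
            for frame_idx, obj_ids, mask_logits in assistant.predictor.propagate_in_video(
                state,
                start_frame_idx=start,
                max_frame_num_to_track=chunk_size,
            ):
                masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached blocks between passes
    return masks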
Integrate With Your Compositing Workflow
Now let's connect this to Nuke. We'll build a small toolset that lets artists:
- Load a plate sequence
- Add interactive foreground/background clicks
- Preview the segmentation
- Export masks as a Nuke-compatible image sequence
Export Masks for Nuke
import os
import cv2


def export_masks_for_nuke(masks, output_dir, frame_format='frame_%04d.png'):
    """Export binary masks as an 8-bit PNG sequence."""
    os.makedirs(output_dir, exist_ok=True)
    for frame_idx, mask in masks.items():
        # Convert the boolean mask (1, H, W) to an 8-bit image (0 or 255)
        mask_8bit = (mask[0] * 255).astype('uint8')
        # Write to disk, numbering frames from 1 to match the source sequence
        filename = frame_format % (frame_idx + 1)
        cv2.imwrite(os.path.join(output_dir, filename), mask_8bit)
    print(f"Exported {len(masks)} masks to {output_dir}")
Create a Simple Interactive UI
For production use, you'll want a GUI. Here's a minimal example using OpenCV's window system:
import os
import cv2


def interactive_segmentation(assistant, frame_dir):
    """Simple click-based interface for segmentation."""
    state = assistant.load_sequence(frame_dir)

    # Load the first frame for display
    first_frame_path = os.path.join(frame_dir, sorted(os.listdir(frame_dir))[0])
    display_frame = cv2.imread(first_frame_path)

    points = []
    labels = []

    def mouse_callback(event, x, y, flags, param):
        if event == cv2.EVENT_LBUTTONDOWN:  # Foreground click
            points.append([x, y])
            labels.append(1)
            cv2.circle(display_frame, (x, y), 5, (0, 255, 0), -1)
        elif event == cv2.EVENT_RBUTTONDOWN:  # Background click
            points.append([x, y])
            labels.append(0)
            cv2.circle(display_frame, (x, y), 5, (0, 0, 255), -1)

    cv2.namedWindow('SAM2 Roto Assistant')
    cv2.setMouseCallback('SAM2 Roto Assistant', mouse_callback)

    while True:
        cv2.imshow('SAM2 Roto Assistant', display_frame)
        key = cv2.waitKey(1) & 0xFF
        if key == ord('q'):  # Quit
            break
        elif key == ord('p'):  # Propagate
            if points:
                assistant.add_click_prompt(state, 0, points, labels)
                masks = assistant.propagate_masks(state)
                export_masks_for_nuke(masks, 'output_masks')
                print("Masks exported!")

    cv2.destroyAllWindows()
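A small entry point wires the assistant and the UI together; the paths match the ones used earlier in this tutorial, so adjust them for your shot:
if __name__ == '__main__':
    assistant = SAM2RotoAssistant('checkpoints/sam2_hiera_large.pt')
    interactive_segmentation(assistant, 'frames_2k')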
Visualize and Iterate
Once you've generated masks, bring them back into Nuke to refine:
- Import the mask sequence as a Read node
- Copy the mask into the plate's alpha channel (Copy or ShuffleCopy), then use a Premult node to apply it to your original plate
- Add a RotoPaint node for manual refinements on problem frames
- Use edge operations (Dilate/Erode) to fine-tune the mask edge
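If you prefer to script this setup, here is a minimal sketch using Nuke's Python API. The file paths are the example ones from the earlier steps, the frame range is a placeholder, and the knob names (from0/to0) are those of Nuke's standard Copy node; treat it as a starting point rather than a finished tool:
import nuke

# Example paths and frame range -- adjust to your shot
plate = nuke.nodes.Read(file='frames_4k/%05d.jpg', first=1, last=100)
mask = nuke.nodes.Read(file='output_masks_4k/frame_%04d.png', first=1, last=100)

# Copy the mask's red channel into the plate's alpha, then premultiply
copy = nuke.nodes.Copy(inputs=[plate, mask])
copy['from0'].setValue('rgba.red')
copy['to0'].setValue('rgba.alpha')
premult = nuke.nodes.Premult(inputs=[copy])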
SAM2 can struggle with heavy motion blur. For best results:
- Add more interactive prompts on blurred frames
- Use Nuke's MotionBlur node to match the original plate blur on the mask
- Consider temporal smoothing with a FrameBlend node
Production Considerations
Before deploying this in a production pipeline, consider:
- Latency: SAM2 inference on a 100-frame sequence at 2K resolution takes ~2-3 minutes on an A100 GPU. Budget accordingly for interactive sessions.
- Quality assurance: Always review masks frame-by-frame. SAM2 is excellent but not perfect—plan for manual cleanup time.
- Version control: Save prompt coordinates and model versions so artists can reproduce results; a minimal sidecar format is sketched after this list.
- Batch processing: For large shows, create a job submission system that queues SAM2 inference on render nodes.
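For the version-control point, a lightweight approach is to write a JSON sidecar next to each exported mask sequence. This is a minimal sketch; the field names are just a suggested convention:
import json
import time


def save_prompt_sidecar(path, points, labels, checkpoint_path, frame_idx=0):
    """Record the prompts and model used so a result can be reproduced later."""
    sidecar = {
        'created': time.strftime('%Y-%m-%dT%H:%M:%S'),
        'checkpoint': checkpoint_path,
        'prompt_frame': frame_idx,
        'points': points,   # [[x, y], ...] in prompt-frame pixel coordinates
        'labels': labels,   # 1 = foreground, 0 = background
    }
    with open(path, 'w') as f:
        json.dump(sidecar, f, indent=2)


# Example: store the sidecar alongside the exported masks
save_prompt_sidecar('output_masks/prompts.json', [[960, 540]], [1],
                    'checkpoints/sam2_hiera_large.pt')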