Skills: Python (intermediate), basic compositing concepts, familiarity with Nuke or similar tools
Software: Python 3.10+, PyTorch 2.3+, Nuke 13+ (or After Effects), CUDA-capable GPU (recommended)
Time: 2-3 hours for full implementation
Introduction
Rotoscoping is one of the most time-intensive tasks in VFX production. Artists spend hours manually drawing and refining masks for complex plates—particularly when dealing with hair, motion blur, or occlusion. Meta's Segment Anything Model 2 (SAM2) offers a breakthrough: interactive segmentation that can handle these edge cases while integrating into production pipelines.
In this tutorial, you'll build a production-aware roto assistant that:
- Accepts interactive user prompts (clicks, boxes) for segmentation
- Handles temporal consistency across frames
- Exports masks compatible with Nuke and other compositing tools
- Manages GPU memory efficiently for high-resolution plates
Set Up Your Environment
First, let's set up the Python environment with all required dependencies:
# Create a new virtual environment
python -m venv sam2-roto-env
source sam2-roto-env/bin/activate # On Windows: sam2-roto-env\Scripts\activate
# Install PyTorch (CUDA 11.8 version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# Install SAM2
pip install git+https://github.com/facebookresearch/segment-anything-2.git
# Install additional dependencies
pip install opencv-python numpy pillow tqdm
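Before going further, confirm that PyTorch can see your GPU:
# Quick sanity check -- should print your torch version and True on a CUDA machine
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"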
Download SAM2 Model Checkpoints
SAM2 provides multiple model sizes. For production work, we recommend the sam2_hiera_large checkpoint, which balances quality and performance:
# Download checkpoint
mkdir checkpoints
cd checkpoints
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt
cd ..
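If VRAM is limited, the smaller checkpoints from the same release are drop-in replacements; pair each with its matching config (e.g. sam2_hiera_s.yaml or sam2_hiera_t.yaml). The URLs below follow the same naming pattern as the large checkpoint:
# Optional: smaller checkpoints for limited VRAM
wget -P checkpoints https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_small.pt
wget -P checkpoints https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt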
Prepare Your VFX Plates
SAM2 works best with properly prepared input sequences. Here's how to prepare your footage:
Convert to Image Sequence
SAM2's video predictor expects a folder of individual frames. The reference frame loader in the SAM2 repo reads JPEG sequences named by frame number, so if you're working with a video file, extract frames like this:
# Using FFmpeg to extract numbered JPEG frames
ffmpeg -i input_plate.mov -qscale:v 2 frames/%05d.jpg
# If you export from Nuke (EXR or PNG), convert to a numbered JPEG proxy sequence for SAM2 input
Downscale for Processing (Optional)
For 4K+ plates, consider downscaling to 2K for initial segmentation, then upscale the mask:
import cv2
import os

input_dir = 'frames_4k'
output_dir = 'frames_2k'
os.makedirs(output_dir, exist_ok=True)

for filename in os.listdir(input_dir):
    img = cv2.imread(os.path.join(input_dir, filename))
    img_2k = cv2.resize(img, (1920, 1080), interpolation=cv2.INTER_AREA)
    cv2.imwrite(os.path.join(output_dir, filename), img_2k)
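After segmentation, the masks need to go back to the original plate resolution before compositing. Here's a minimal sketch, assuming the masks live in the output_masks directory created later in this tutorial and that the original plate is UHD (3840x2160); adjust the paths and size for your shot:
import cv2
import os

mask_dir = 'output_masks'          # 2K masks exported later in this tutorial
upscaled_dir = 'output_masks_4k'   # assumed output location
target_size = (3840, 2160)         # width, height of the original plate

os.makedirs(upscaled_dir, exist_ok=True)
for filename in sorted(os.listdir(mask_dir)):
    mask = cv2.imread(os.path.join(mask_dir, filename), cv2.IMREAD_GRAYSCALE)
    # Nearest-neighbor keeps the matte binary; switch to INTER_LINEAR for a softer edge
    mask_4k = cv2.resize(mask, target_size, interpolation=cv2.INTER_NEAREST)
    cv2.imwrite(os.path.join(upscaled_dir, filename), mask_4k)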
Wire Up SAM2 for Segmentation
Now let's build the core segmentation engine. We'll create a class that handles SAM2 inference with interactive prompts:
import torch
from sam2.build_sam import build_sam2_video_predictor


class SAM2RotoAssistant:
    def __init__(self, checkpoint_path, model_cfg='sam2_hiera_l.yaml'):
        """Initialize the SAM2 video predictor."""
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.predictor = build_sam2_video_predictor(model_cfg, checkpoint_path, device=self.device)
        print(f"Using device: {self.device}")

    def load_sequence(self, frame_dir):
        """Load a frame sequence and initialize the inference state."""
        with torch.inference_mode(), torch.autocast(self.device.type, dtype=torch.float16):
            state = self.predictor.init_state(video_path=frame_dir)
        return state

    def add_click_prompt(self, state, frame_idx, points, labels):
        """
        Add interactive click prompts on a single frame.

        points: list of [x, y] pixel coordinates
        labels: list of 1 (foreground) or 0 (background), one per point
        """
        with torch.inference_mode():
            _, out_obj_ids, out_mask_logits = self.predictor.add_new_points(
                inference_state=state,
                frame_idx=frame_idx,
                obj_id=0,
                points=points,
                labels=labels,
            )
        return out_mask_logits

    def propagate_masks(self, state):
        """Propagate the prompted object mask across all frames."""
        masks = {}
        with torch.inference_mode():
            for frame_idx, obj_ids, mask_logits in self.predictor.propagate_in_video(state):
                masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
        return masks
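Here's how the pieces fit together for a single foreground click on the first frame; the paths and coordinates are placeholders for your own plate:
# Placeholder paths and click position -- substitute your own checkpoint and frame folder
assistant = SAM2RotoAssistant('checkpoints/sam2_hiera_large.pt')
state = assistant.load_sequence('frames_2k')

# One foreground click roughly on the subject in frame 0
assistant.add_click_prompt(state, frame_idx=0, points=[[960, 540]], labels=[1])

# Track the object through the rest of the sequence
masks = assistant.propagate_masks(state)
print(f"Generated masks for {len(masks)} frames")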
SAM2 can consume significant GPU memory for high-resolution sequences. If you encounter out-of-memory errors:
- Use the sam2_hiera_small checkpoint instead
- Process sequences in smaller chunks (e.g., 50-frame batches; see the sketch below)
- Pass offload_video_to_cpu=True (and offload_state_to_cpu=True) to init_state so frame data and tracking state stay in system RAM rather than VRAM
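One way to chunk the work without losing the shared inference state is to cap how many frames each propagation pass tracks and resume where it stopped. This sketch assumes propagate_in_video accepts start_frame_idx and max_frame_num_to_track arguments, as in the current SAM2 release; check your installed version:
import torch

def propagate_in_chunks(assistant, state, total_frames, chunk_size=50):
    """Propagate masks in bounded passes to limit peak GPU memory."""
    masks = {}
    for start in range(0, total_frames, chunk_size):
        with torch.inference_mode():
            for frame_idx, obj_ids, mask_logits in assistant.predictor.propagate_in_video(
                state,
                start_frame_idx=start,
                max_frame_num_to_track=chunk_size,
            ):
                masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached blocks between passes
    return masks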
Integrate With Your Compositing Workflow
Now let's connect this to Nuke. We'll build a small toolset that lets artists:
- Load a plate sequence
- Add interactive foreground/background clicks
- Preview the segmentation
- Export masks as a Nuke-compatible image sequence
Export Masks for Nuke
import os
import cv2


def export_masks_for_nuke(masks, output_dir, frame_format='frame_%04d.png'):
    """Export binary masks as an 8-bit PNG sequence."""
    os.makedirs(output_dir, exist_ok=True)
    for frame_idx, mask in masks.items():
        # Convert the boolean mask (1, H, W) to an 8-bit image (0 or 255)
        mask_8bit = (mask[0] * 255).astype('uint8')
        # Write to disk, numbering frames from 1 to match the source sequence
        filename = frame_format % (frame_idx + 1)
        cv2.imwrite(os.path.join(output_dir, filename), mask_8bit)
    print(f"Exported {len(masks)} masks to {output_dir}")
Create a Simple Interactive UI
For production use, you'll want a GUI. Here's a minimal example using OpenCV's window system:
import os
import cv2


def interactive_segmentation(assistant, frame_dir):
    """Simple click-based interface for segmentation."""
    state = assistant.load_sequence(frame_dir)

    # Load the first frame for display
    first_frame_path = os.path.join(frame_dir, sorted(os.listdir(frame_dir))[0])
    display_frame = cv2.imread(first_frame_path)

    points = []
    labels = []

    def mouse_callback(event, x, y, flags, param):
        if event == cv2.EVENT_LBUTTONDOWN:  # Foreground click
            points.append([x, y])
            labels.append(1)
            cv2.circle(display_frame, (x, y), 5, (0, 255, 0), -1)
        elif event == cv2.EVENT_RBUTTONDOWN:  # Background click
            points.append([x, y])
            labels.append(0)
            cv2.circle(display_frame, (x, y), 5, (0, 0, 255), -1)

    cv2.namedWindow('SAM2 Roto Assistant')
    cv2.setMouseCallback('SAM2 Roto Assistant', mouse_callback)

    while True:
        cv2.imshow('SAM2 Roto Assistant', display_frame)
        key = cv2.waitKey(1) & 0xFF
        if key == ord('q'):  # Quit
            break
        elif key == ord('p'):  # Propagate
            if points:
                assistant.add_click_prompt(state, 0, points, labels)
                masks = assistant.propagate_masks(state)
                export_masks_for_nuke(masks, 'output_masks')
                print("Masks exported!")

    cv2.destroyAllWindows()
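A small entry point wires the assistant and the UI together; the paths match the ones used earlier in this tutorial, so adjust them for your shot:
if __name__ == '__main__':
    assistant = SAM2RotoAssistant('checkpoints/sam2_hiera_large.pt')
    interactive_segmentation(assistant, 'frames_2k')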
Visualize and Iterate
Once you've generated masks, bring them back into Nuke to refine:
- Import the mask sequence as a Read node
- Copy the mask into the plate's alpha channel (Copy or ShuffleCopy), then use a Premult node to apply it to your original plate
- Add a RotoPaint node for manual refinements on problem frames
- Use edge operations (Dilate/Erode) to fine-tune the mask edge
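If you prefer to script this setup, here is a minimal sketch using Nuke's Python API. The file paths are the example ones from the earlier steps, the frame range is a placeholder, and the knob names (from0/to0) are those of Nuke's standard Copy node; treat it as a starting point rather than a finished tool:
import nuke

# Example paths and frame range -- adjust to your shot
plate = nuke.nodes.Read(file='frames_4k/%05d.jpg', first=1, last=100)
mask = nuke.nodes.Read(file='output_masks_4k/frame_%04d.png', first=1, last=100)

# Copy the mask's red channel into the plate's alpha, then premultiply
copy = nuke.nodes.Copy(inputs=[plate, mask])
copy['from0'].setValue('rgba.red')
copy['to0'].setValue('rgba.alpha')
premult = nuke.nodes.Premult(inputs=[copy])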
SAM2 can struggle with heavy motion blur. For best results:
- Add more interactive prompts on blurred frames
- Use Nuke's MotionBlur node to match the original plate blur on the mask
- Consider temporal smoothing with a FrameBlend node
Production Considerations
Before deploying this in a production pipeline, consider:
- Latency: SAM2 inference on a 100-frame sequence at 2K resolution takes ~2-3 minutes on an A100 GPU. Budget accordingly for interactive sessions.
- Quality assurance: Always review masks frame-by-frame. SAM2 is excellent but not perfect—plan for manual cleanup time.
- Version control: Save prompt coordinates and model versions so artists can reproduce results; a minimal sidecar format is sketched after this list.
- Batch processing: For large shows, create a job submission system that queues SAM2 inference on render nodes.
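For the version-control point, a lightweight approach is to write a JSON sidecar next to each exported mask sequence. This is a minimal sketch; the field names are just a suggested convention:
import json
import time


def save_prompt_sidecar(path, points, labels, checkpoint_path, frame_idx=0):
    """Record the prompts and model used so a result can be reproduced later."""
    sidecar = {
        'created': time.strftime('%Y-%m-%dT%H:%M:%S'),
        'checkpoint': checkpoint_path,
        'prompt_frame': frame_idx,
        'points': points,   # [[x, y], ...] in prompt-frame pixel coordinates
        'labels': labels,   # 1 = foreground, 0 = background
    }
    with open(path, 'w') as f:
        json.dump(sidecar, f, indent=2)


# Example: store the sidecar alongside the exported masks
save_prompt_sidecar('output_masks/prompts.json', [[960, 540]], [1],
                    'checkpoints/sam2_hiera_large.pt')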