Interactive exploration of shot metadata, model training datasets, and production lineage. Treating data as strategic infrastructure, not scattered artifacts.
Production data is fragmented across ShotGrid, editorial systems, render logs, and QC databases. There is no unified view of how shots, assets, and model training data relate, which makes it difficult to trace model decisions or audit dataset quality.
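As a rough illustration of what that unified view could look like, the sketch below models shots, training examples, and model versions as nodes in a single graph with typed edges between them. Every class, field, and relationship name here is a hypothetical placeholder, not an existing schema.

```python
# Hypothetical unified schema spanning ShotGrid, editorial, render logs, and QC.
# Names are illustrative only; nothing here mirrors a real production database.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shot:
    shot_id: str            # e.g. a ShotGrid code like "SEQ010_SH0040"
    sequence: str
    source_system: str      # "shotgrid", "editorial", "render_log", "qc"


@dataclass
class TrainingExample:
    example_id: str
    shot_id: str                   # which shot the frame or crop came from
    annotation_quality: float      # QC score in [0.0, 1.0]
    tags: List[str] = field(default_factory=list)  # e.g. ["reflective", "hair"]


@dataclass
class ModelVersion:
    model_id: str
    trained_on: List[str] = field(default_factory=list)  # TrainingExample ids


@dataclass
class Edge:
    src: str
    rel: str                # "DERIVED_FROM", "TRAINED_ON", "CORRECTED_BY", ...
    dst: str
```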
When an AI model makes a mistake, you need to understand: What training data influenced this decision? Which shots contributed similar examples? How has artist feedback modified the model over time? Without data lineage, these questions are unanswerable.
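These questions only become answerable if every decision is logged with links back to its model version and dataset snapshot at inference time. A minimal sketch of such a provenance record, with entirely hypothetical fields:

```python
# Minimal provenance record for a single model decision (hypothetical fields).
# Capturing this at inference time is what makes the questions above traceable.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class InferenceRecord:
    inference_id: str
    model_id: str                     # which model version produced the output
    dataset_snapshot: str             # immutable id of the training set behind that version
    shot_id: str                      # the production shot the model ran on
    artist_correction: Optional[str]  # id of a correction event, if the output was fixed
    created_at: datetime
```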
Can we trace model decisions back to training examples? If a segmentation model fails on a reflective surface, which training images taught it about reflections? This explainability builds trust and identifies dataset gaps.
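Assuming a graph along the lines sketched above lives in Neo4j, that trace could be a single path query. The labels, relationship types, and properties below are hypothetical; the call itself uses the official Neo4j Python driver.

```python
# Sketch: trace a failed inference back to tagged training examples.
# Assumed hypothetical schema: (:Inference)-[:PRODUCED_BY]->(:ModelVersion)
# -[:TRAINED_ON]->(:TrainingExample)-[:DERIVED_FROM]->(:Shot).
from neo4j import GraphDatabase

QUERY = """
MATCH (inf:Inference {id: $inference_id})-[:PRODUCED_BY]->(m:ModelVersion)
      -[:TRAINED_ON]->(ex:TrainingExample)-[:DERIVED_FROM]->(s:Shot)
WHERE $tag IN ex.tags
RETURN ex.id AS example, ex.annotation_quality AS quality, s.code AS shot
ORDER BY quality DESC
"""

def trace_failure(uri, auth, inference_id, tag="reflective"):
    """Return the tagged training examples behind a given inference."""
    driver = GraphDatabase.driver(uri, auth=auth)
    try:
        with driver.session() as session:
            result = session.run(QUERY, inference_id=inference_id, tag=tag)
            return [record.data() for record in result]
    finally:
        driver.close()
```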
How does data quality correlate with model performance? By visualizing the relationships between annotation quality, shot complexity, and inference accuracy, we can prioritize the dataset improvements with the highest impact.
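A rough first pass at that analysis, assuming per-shot rows with annotation quality, complexity, and accuracy have already been exported from the graph. All column names and values here are made up for illustration.

```python
# Sketch: rank-correlate dataset attributes against inference accuracy.
# Column names are hypothetical; assumes one row per shot exported from the graph.
import pandas as pd
from scipy.stats import spearmanr

rows = pd.DataFrame({
    "annotation_quality": [0.92, 0.61, 0.78, 0.55, 0.88],
    "shot_complexity":    [3, 7, 5, 8, 2],        # e.g. a 1-10 QC rating
    "inference_accuracy": [0.95, 0.70, 0.84, 0.62, 0.91],
})

for feature in ("annotation_quality", "shot_complexity"):
    rho, p = spearmanr(rows[feature], rows["inference_accuracy"])
    print(f"{feature}: rho={rho:.2f} (p={p:.2f})")
```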
What's the ROI of human feedback? When artists correct AI outputs, those corrections become training data. Tracking this loop shows which corrections improve the model most, guiding where to invest annotation effort.
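One possible way to measure that loop: log each batch of corrections with the error category it targets, then compare a per-category validation metric before and after retraining. Everything in the sketch below, from field names to the metric, is an assumption rather than an existing pipeline.

```python
# Sketch: estimate which correction categories move the model most.
# All field names are hypothetical; metrics come from per-category validation runs.
from dataclasses import dataclass


@dataclass
class CorrectionBatch:
    category: str         # e.g. "reflections", "motion_blur"
    n_corrections: int    # artist fixes folded back into the training set
    metric_before: float  # per-category IoU (or similar) before retraining
    metric_after: float   # same metric after retraining on the corrections


def roi(batches):
    """Metric gain per correction, highest-leverage categories first."""
    scored = [
        (b.category, (b.metric_after - b.metric_before) / max(b.n_corrections, 1))
        for b in batches
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


print(roi([
    CorrectionBatch("reflections", 40, 0.58, 0.71),
    CorrectionBatch("motion_blur", 120, 0.66, 0.69),
]))
```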
Designing the data schema and graph relationships. Prototyping with a small dataset from personal projects before scaling to production volumes. Exploring Neo4j as the graph database and D3.js for visualization.
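The prototype ingest could be as small as the sketch below, which writes a shot and a derived training example into Neo4j with the official Python driver; the URI, credentials, labels, and relationship names are placeholders. D3.js would then visualize an export of these same nodes and edges.

```python
# Sketch of the prototype ingest: write a shot and a derived training example
# into Neo4j. URI, credentials, and labels are placeholders for the prototype.
from neo4j import GraphDatabase

INGEST = """
MERGE (s:Shot {code: $shot_code})
MERGE (ex:TrainingExample {id: $example_id})
SET ex.annotation_quality = $quality, ex.tags = $tags
MERGE (ex)-[:DERIVED_FROM]->(s)
"""

def ingest_example(shot_code, example_id, quality, tags):
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))
    try:
        with driver.session() as session:
            session.run(INGEST, shot_code=shot_code, example_id=example_id,
                        quality=quality, tags=tags)
    finally:
        driver.close()


ingest_example("SEQ010_SH0040", "ex_0001", 0.9, ["reflective"])
```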
The goal is a system that makes production data explorable and valuable, not just archived. Data becomes infrastructure that improves over time rather than a static historical record.