TL;DR: We introduce XFactor, the first geometry-free model to achieve true self-supervised, pose-free novel view synthesis (NVS) by learning transferable latent camera pose representations.
We identify that the key criterion for determining whether a pose-free model is truly capable of novel view synthesis (NVS) is transferability: whether a pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior work on self-supervised NVS and find that their predicted poses do not transfer: the same set of poses leads to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geometry-free self-supervised model capable of true NVS. XFactor combines pairwise pose estimation with a simple augmentation scheme applied to its inputs and outputs, which jointly enables disentangling camera pose from scene content and facilitates geometric reasoning. Remarkably, XFactor achieves transferability with unconstrained latent pose variables, without any 3D inductive biases or concepts from multi-view geometry, such as an explicit parameterization of poses as elements of SE(3). We introduce a new metric to quantify transferability and, through large-scale experiments, demonstrate that XFactor significantly outperforms prior pose-free NVS transformers; probing experiments further show that its latent poses are highly correlated with real-world camera poses.
We identify transferability as the key criterion for true novel view synthesis: a pose representation is transferable if the same latent poses produce the same camera trajectory when applied to different scenes. Prior methods fail this criterion; their poses are entangled with scene content.
To quantify transferability, we introduce True Pose Similarity (TPS), a standardized metric that measures the degree to which latent pose representations transfer between scenes. Given two sequences of frames with predicted poses, TPS uses an external oracle (such as VGGT) to recover camera poses from the rendered images and compares the trajectories using standard metrics such as Relative Rotation Accuracy (RRA) or Area Under the Curve (AUC). Specifically, we render a target scene using poses extracted from a source scene and measure whether the rendered trajectory matches the expected camera motion; high TPS indicates that the pose representation transfers geometric information independently of scene content.
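As a concrete sketch of how such an evaluation can be wired up, the snippet below computes an RRA-based TPS score over consecutive frame pairs. It assumes hypothetical helpers: `model.encode_poses` (latent pose extraction), `render` (rendering a scene along latent poses), and `oracle_poses` (a wrapper around an oracle such as VGGT that returns 3x3 camera rotations per frame). These names are illustrative, not the paper's API.

```python
# Minimal TPS sketch: extract latent poses from a source sequence, re-render
# a different scene along them, and check that the oracle recovers the same
# camera trajectory from both sequences.
import numpy as np

def relative_rotations(R):
    """Relative rotation from frame i to i+1 for a list of 3x3 matrices."""
    return [R[i + 1] @ R[i].T for i in range(len(R) - 1)]

def rotation_angle_deg(Ra, Rb):
    """Geodesic angle (in degrees) between two rotation matrices."""
    cos = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def rra(R_pred, R_gt, tau_deg=15.0):
    """Relative Rotation Accuracy: fraction of relative rotations within tau."""
    angles = [rotation_angle_deg(a, b)
              for a, b in zip(relative_rotations(R_pred), relative_rotations(R_gt))]
    return float(np.mean([a <= tau_deg for a in angles]))

def true_pose_similarity(source_frames, target_scene, model, render, oracle_poses):
    # 1) Extract latent poses from the source sequence.
    latent_poses = model.encode_poses(source_frames)
    # 2) Re-render the *target* scene along those latent poses.
    rendered = render(target_scene, latent_poses)
    # 3) Recover camera rotations from both sequences with the oracle and
    #    score how well the rendered trajectory matches the source trajectory.
    return rra(oracle_poses(rendered), oracle_poses(source_frames))
```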
XFactor learns transferable latent camera poses with a two-stage architecture: a pose encoder that extracts relative camera pose representations from frame pairs, and a renderer that uses these poses to synthesize novel views. The key ingredient is an augmentation scheme applied to the inputs and outputs that enforces the disentanglement of pose from scene content.
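To make the two-stage structure concrete, here is a minimal PyTorch sketch: a pose encoder that maps a frame pair to an unconstrained latent pose, and a renderer conditioned on that latent via feature modulation. The module names, dimensions, small convolutional backbones, and FiLM-style conditioning are illustrative assumptions, not the paper's exact architecture.

```python
# Two-stage sketch: pair-wise pose encoder + latent-pose-conditioned renderer.
import torch
import torch.nn as nn

class PairPoseEncoder(nn.Module):
    """Maps a (reference, target) frame pair to an unconstrained latent pose."""
    def __init__(self, dim=256, pose_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(6, 64, 7, stride=4, padding=3), nn.GELU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(dim, pose_dim)  # plain linear head, no SE(3)

    def forward(self, ref, tgt):
        return self.head(self.backbone(torch.cat([ref, tgt], dim=1)))

class LatentPoseRenderer(nn.Module):
    """Synthesizes the target view from the reference frame and a latent pose."""
    def __init__(self, dim=256, pose_dim=16):
        super().__init__()
        self.encode = nn.Conv2d(3, dim, 8, stride=8)   # patchify the reference
        self.film = nn.Linear(pose_dim, 2 * dim)       # pose -> (scale, shift)
        self.decode = nn.ConvTranspose2d(dim, 3, 8, stride=8)

    def forward(self, ref, pose):
        h = self.encode(ref)
        scale, shift = self.film(pose).chunk(2, dim=-1)
        h = h * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.decode(h)

# Self-supervised objective: reconstruct the target frame from the reference
# frame and the latent pose extracted from the pair.
encoder, renderer = PairPoseEncoder(), LatentPoseRenderer()
ref, tgt = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
loss = nn.functional.mse_loss(renderer(ref, encoder(ref, tgt)), tgt)
```

Note that nothing here constrains the latent pose to SE(3): the pose head is an ordinary linear layer, matching the unconstrained-latent setting described above.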
The following videos demonstrate the transferability of XFactor's learned latent pose representations across different scenarios. In each example, the poses are extracted from one video sequence and used to re-render the same camera trajectory in a different scene. This capability distinguishes XFactor from prior self-supervised methods such as RUST and RayZer, whose learned pose representations remain entangled with scene content and fail to transfer meaningfully across scenes.
We evaluate XFactor on multiple datasets and demonstrate that it significantly outperforms prior methods on both reconstruction quality and transferability metrics.
Figure 2: Quantitative evaluation of transferability. XFactor achieves significantly higher transfer accuracy compared to baseline methods across all datasets.
We demonstrate that XFactor learns meaningful latent pose representations that correlate with ground-truth camera poses, despite never being trained with explicit 3D supervision or SE(3) constraints; a sketch of such a probe is given below Figure 3.
Figure 3: Probing experiments show that our learned latent poses are highly correlated with ground-truth camera poses, demonstrating that the model learns meaningful geometric representations.
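Below is a minimal sketch of the kind of linear probe behind such an analysis, assuming paired arrays of latent poses and corresponding ground-truth relative poses are already available. The ridge-regression probe, the flattened [R | t] target, and the placeholder data are illustrative assumptions, not the paper's exact protocol.

```python
# Linear probing sketch: if a simple regressor can predict ground-truth poses
# from the latents, the latents encode pose information (high held-out R^2).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def probe(latents, gt_poses, alpha=1.0):
    """Fit a ridge-regression probe from latent poses to ground-truth poses
    and report held-out R^2."""
    z_train, z_test, p_train, p_test = train_test_split(
        latents, gt_poses, test_size=0.2, random_state=0)
    reg = Ridge(alpha=alpha).fit(z_train, p_train)
    return reg.score(z_test, p_test)

# Random placeholders standing in for real encoder outputs and pose labels:
latents = np.random.randn(1000, 16)   # latent poses from the pose encoder
gt = np.random.randn(1000, 12)        # flattened [R | t] per frame pair
print(f"held-out R^2: {probe(latents, gt):.3f}")
```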
@misc{mitchelryu2025xfactor,
  title={True Self-Supervised Novel View Synthesis is Transferable},
  author={Thomas W. Mitchel and Hyunwoo Ryu and Vincent Sitzmann},
  year={2025},
  eprint={2510.13063},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.13063},
}