Hi, I found a possible inconsistency in the coordinate convention for extrinsics used in AnySplat, which might cause confusion, especially when integrating VGGT.
After checking both VGGT’s data and model outputs, as well as how the losses are computed, I confirmed that VGGT uses world-to-camera (w2c) extrinsics consistently throughout the pipeline.
However, in AnySplat the situation appears reversed: the input data and the encoder outputs both use camera-to-world (c2w) extrinsics, yet one part of the code assumes the w2c convention.
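For readers less familiar with the two conventions, here is a minimal sketch (toy values, not taken from either codebase) of how c2w and w2c extrinsics relate: one is simply the inverse of the other, and for a rigid transform the inverse has the closed form `[Rᵀ | -Rᵀ t]`.

```python
import numpy as np

# Hypothetical 4x4 camera-to-world extrinsic (c2w): maps camera
# coordinates into world coordinates.
R = np.eye(3)                   # rotation (identity for this toy example)
t = np.array([1.0, 2.0, 3.0])   # camera position in world coordinates

c2w = np.eye(4)
c2w[:3, :3] = R
c2w[:3, 3] = t

# The world-to-camera (w2c) matrix is the inverse; for a rigid
# transform this equals [R^T | -R^T t].
w2c = np.linalg.inv(c2w)

# Round-tripping recovers the identity.
assert np.allclose(w2c @ c2w, np.eye(4))
```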
Specifically, in the training setup:
```python
if self.model.encoder.pred_pose:
    self.loss_pose = HuberLoss(
        alpha=self.train_cfg.pose_loss_alpha,
        delta=self.train_cfg.pose_loss_delta,
    )
```
This loss calls:
```python
GT_pose_enc = extri_intri_to_pose_encoding(
    context_extrinsics, context_intrinsics, image_size_hw
)
```
According to the function’s intended design, extri_intri_to_pose_encoding expects world-to-camera (w2c) extrinsics. However, context_extrinsics in AnySplat are actually camera-to-world (c2w).
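If the extrinsics really are c2w at that call site, one possible fix is to invert them before encoding. The helper below is a hypothetical sketch (not part of the AnySplat codebase) assuming `context_extrinsics` is a batch of 4x4 c2w matrices:

```python
import numpy as np

def c2w_to_w2c(extrinsics: np.ndarray) -> np.ndarray:
    """Hypothetical helper: invert a batch of (N, 4, 4) camera-to-world
    matrices to world-to-camera. Not part of the AnySplat codebase."""
    return np.linalg.inv(extrinsics)

# Hypothetical call-site fix, assuming context_extrinsics are c2w:
# GT_pose_enc = extri_intri_to_pose_encoding(
#     c2w_to_w2c(context_extrinsics),  # convert to w2c first
#     context_intrinsics, image_size_hw)
```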
Although this pose loss is not currently used in the final training pipeline, the mismatch makes the coordinate conventions harder to interpret and can mislead users who try to integrate or debug extrinsics-related features.
Thanks for your great work on this project! A clarification of the extrinsics coordinate flow throughout the codebase would be very helpful for users working on pose alignment and dataset integration.