diff --git a/README.md b/README.md index 668f8cb6..e46c75a3 100644 --- a/README.md +++ b/README.md @@ -3,7 +3,8 @@ [![demo](assets/InternNav.gif "demo")](https://www.youtube.com/watch?v=fD0F1jIax5Y) [![HomePage](https://img.shields.io/badge/HomePage-144B9E?logo=ReactOS&logoColor=white)](https://internrobotics.github.io/internvla-n1.github.io/) -[![Technique Report](https://img.shields.io/badge/Paper-B31B1B?logo=arXiv&logoColor=white)](https://internrobotics.github.io/internvla-n1.github.io/static/pdfs/InternVLA_N1.pdf) +[![Technical Report โ€” InternVLA-N1](https://img.shields.io/badge/Technical_Report-InternVLA--N1-BB2649?logo=adobeacrobatreader&logoColor=white)](https://internrobotics.github.io/internvla-n1.github.io/static/pdfs/InternVLA_N1.pdf) +[![DualVLN Paper โ€” arXiv](https://img.shields.io/badge/arXiv-DualVLN-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2512.08186) [![doc](https://img.shields.io/badge/Document-FFA500?logo=readthedocs&logoColor=white)](https://internrobotics.github.io/user_guide/internnav/index.html) [![GitHub star chart](https://img.shields.io/github/stars/InternRobotics/InternNav?style=square)](https://github.com/InternRobotics/InternNav) [![GitHub Issues](https://img.shields.io/github/issues/InternRobotics/InternNav)](https://github.com/InternRobotics/InternNav/issues) @@ -34,9 +35,9 @@ The toolbox supports the most comprehensive 6 datasets \& benchmarks and 10+ pop The toolbox supports the most advanced high-quality navigation dataset, InternData-N1, which includes 3k+ scenes and 830k VLN data covering diverse embodiments and scenes, and the first dual-system navigation foundation model with leading performance on all the benchmarks and zero-shot generalization capability in the real world, InternVLA-N1. ## ๐Ÿ”ฅ News - | Time | Update | |---------|--------| +| 2025/12 | Training code for InternVLA-N1 is now available. This release provides two model configurations: InternVLA-N1 (Dual System) with NavDP* and InternVLA-N1 (Dual System) DualVLN . For model architecture and training details, please refer to the [DualVLN paper](https://arxiv.org/abs/2512.08186).| | 2025/11 | InternNav v0.2.0 released โ€” added distributed evaluation support for VLN-PE.| | 2025/10 | Add a [inference-only demo](scripts/notebooks/inference_only_demo.ipynb) of InternVLA-N1. | | 2025/10 | InternVLA-N1 [technical report](https://internrobotics.github.io/internvla-n1.github.io/static/pdfs/InternVLA_N1.pdf) is released. Please check our [homepage](https://internrobotics.github.io/internvla-n1.github.io/). | @@ -48,9 +49,10 @@ The toolbox supports the most advanced high-quality navigation dataset, InternDa - [๐Ÿ  Introduction](#-introduction) - [๐Ÿ”ฅ News](#-news) - [๐Ÿ“š Getting Started](#-getting-started) -- [๐Ÿ“ฆ Overview of Benchmark \& Model Zoo](#-overview-of-benchmark-and-model-zoo) +- [๐Ÿ“ฆ Overview of Benchmark \& Model Zoo](#-overview) - [๐Ÿ”ง Customization](#-customization) - [๐Ÿ‘ฅ Contribute](#-contribute) +- [๐Ÿš€ Community Deployment & Best Practices](#-community-deployment--best-practices) - [๐Ÿ”— Citation](#-citation) - [๐Ÿ“„ License](#-license) - [๐Ÿ‘ Acknowledgements](#-acknowledgements) @@ -59,133 +61,147 @@ The toolbox supports the most advanced high-quality navigation dataset, InternDa Please refer to the [documentation](https://internrobotics.github.io/user_guide/internnav/quick_start/index.html) for quick start with InternNav, from installation to training or evaluating supported models. 
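As a quick sanity check after installation, the released System-2 checkpoint can be loaded directly with Hugging Face `transformers`. This is a minimal sketch, assuming the checkpoint keeps the Qwen2.5-VL format used by the agents in this repository; the dtype and device settings are illustrative only:

```python
# Minimal sketch: load the InternVLA-N1 System-2 checkpoint from Hugging Face.
# The model id matches the Model Zoo links below; dtype/device settings are
# illustrative and can be adapted to your hardware.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_path = "InternRobotics/InternVLA-N1-S2"
processor = AutoProcessor.from_pretrained(model_path)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()
```

For end-to-end training and evaluation, follow the documentation linked above.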
-## 📦 Overview of Benchmark and Model Zoo
+## 📦 Overview

-### Datasets \& Benchmarks
+### 🧪 Supported Benchmarks

-System2 (VLN-CE)
-System1 (VN)
+VLN Benchmarks
-Whole-system (VLN)
+VN Benchmarks

-### Models
+### 🤗 Model Zoo & Downloads

-System2 (VLN-CE)
+🧠 VLN Single-System
-System1 (VN)
+🎯 VN System (System1)
-Whole-system (VLN)
+🤝 VLN Multi-System
-### Benchmark Results - -#### VLN-CE Task -| Model | Dataset/Benchmark | NE | OS | SR | SPL | Download | -| ------ | ----------------- | -- | -- | --------- | -- | --------- | -| `InternVLA-N1 (S2)` | R2R | 4.89 | 60.6 | 55.4 | 52.1| [Model](https://huggingface.co/InternRobotics/InternVLA-N1-S2) | -| `InternVLA-N1` | R2R | **4.83** | **63.3** | **58.2** | **54.0** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1) | -| `InternVLA-N1 (S2)` | RxR | 6.67 | 56.5 | 48.6 | 42.6 | [Model](https://huggingface.co/InternRobotics/InternVLA-N1-S2) | -| `InternVLA-N1` | RxR | **5.91** | **60.8** | **53.5** | **46.1** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1) | -| `InternVLA-N1-Preview (S2)` | R2R | 5.09 | 60.9 | 53.7 | 49.7 | [Model](https://huggingface.co/InternRobotics/InternVLA-N1-Preview-S2) | -| `InternVLA-N1-Preview` | R2R | **4.76** | **63.4** | **56.7** | **52.6** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1-Preview) | -| `InternVLA-N1-Preview (S2)` | RxR | 6.39 | 60.1 | 50.5 | 43.3 | [Model](https://huggingface.co/InternRobotics/InternVLA-N1-Preview-S2) | -| `InternVLA-N1-Preview` | RxR | **5.65** | **63.2** | **53.5** | **45.7** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1-Preview) | - -#### VLN-PE Task -| Model | Dataset/Benchmark | NE | OS | SR | SPL | Download | -| ------ | ----------------- | -- | -- | -- | --- | --- | -| `Seq2Seq` | Flash | 8.27 | 43.0 | 15.7 | 9.7 | [Model](https://huggingface.co/InternRobotics/VLN-PE) | -| `CMA` | Flash | 7.52 | 45.0 | 24.4 | 18.2 | [Model](https://huggingface.co/InternRobotics/VLN-PE) | -| `RDP` | Flash | 6.98 | 42.5 | 24.9 | 17.5 | [Model](https://huggingface.co/InternRobotics/VLN-PE) | -| `InternVLA-N1-Preview` | Flash | **4.21** | **68.0** | **59.8** | **54.0** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1-Preview) | -| `InternVLA-N1` | Flash | **4.13** | **67.6** | **60.4** | **54.9** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1) | -| `Seq2Seq` | Physical | 7.88 | 28.1 | 15.1 | 10.7 | [Model](https://huggingface.co/InternRobotics/VLN-PE) | -| `CMA` | Physical | 7.26 | 31.4 | 22.1 | 18.6 | [Model](https://huggingface.co/InternRobotics/VLN-PE) | -| `RDP` | Physical | 6.72 | 36.9 | 25.2 | 17.7 | [Model](https://huggingface.co/InternRobotics/VLN-PE) | -| `InternVLA-N1-Preview` | Physical | **5.31** | **49.0** | **42.6** | **35.8** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1-Preview) | -| `InternVLA-N1` | Physical | **4.73** | **56.7** | **50.6** | **43.3** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1) | - -#### Visual Navigation Task - PointGoal Navigation -| Model | Dataset/Benchmark | SR | SPL | Download | -| ------ | ----------------- | -- | -- | --------- | -| `iPlanner` | ClutteredEnv | 84.8 | 83.6 | [Model](https://github.com/InternRobotics/NavDP?tab=readme-ov-file#%EF%B8%8F-installation-of-baseline-library) | -| `ViPlanner` | ClutteredEnv | 72.4 | 72.3 | [Model](https://github.com/InternRobotics/NavDP?tab=readme-ov-file#%EF%B8%8F-installation-of-baseline-library) | -| `InternVLA-N1 (S1)` | ClutteredEnv | **89.8** | **87.7** | [Model](https://github.com/InternRobotics/NavDP?tab=readme-ov-file#%EF%B8%8F-installation-of-baseline-library) | -| `iPlanner` | InternScenes | 48.8 | 46.7 | [Model](https://github.com/InternRobotics/NavDP?tab=readme-ov-file#%EF%B8%8F-installation-of-baseline-library) | -| `ViPlanner` | InternScenes | 54.3 | 52.5 | 
[Model](https://github.com/InternRobotics/NavDP?tab=readme-ov-file#%EF%B8%8F-installation-of-baseline-library) | -| `InternVLA-N1 (S1)` | InternScenes | **65.7** | **60.7** | [Model](https://github.com/InternRobotics/NavDP?tab=readme-ov-file#%EF%B8%8F-installation-of-baseline-library) | - - - -**NOTE:** -- VLN-CE RxR benchmark and StreamVLN will be supported soon. + + + +### ๐Ÿ“Š Benchmark Results + + +#### VLN-CE Benchmarks + +**๐Ÿ“ R2R Dataset** +| Model | Observation | NE โ†“ | OS โ†‘ | SR โ†‘ | SPL โ†‘ | +|-------|-------------|------|------|------|-------| +| InternVLA-N1-wo-dagger (S2) + [ShortestPathFollower](https://aihabitat.org/docs/habitat-lab/habitat.tasks.nav.shortest_path_follower.ShortestPathFollower.html) | - | 4.89 | 60.6 | 55.4 | 52.1 | +| InternVLA-N1-wo-dagger (Dual System) with NavDP* | RGB-D | 4.83 | 63.3 | 58.2 | 54.0 | +| InternVLA-N1 (S2) + [ShortestPathFollower](https://aihabitat.org/docs/habitat-lab/habitat.tasks.nav.shortest_path_follower.ShortestPathFollower.html) | - | 4.25 | 68.3 | 60.9 | 55.2 | +| InternVLA-N1 (Dual System) with NavDP* | RGB-D | 4.22 | 70.4 | 64.1 | 58.1 | +| InternVLA-N1 (Dual System) DualVLN | RGB | **4.05** | **70.7** | **64.3** | **58.5** | + +**๐Ÿ“ RxR Dataset** +| Model | Observation | NE โ†“ | SR โ†‘ | SPL โ†‘ | nDTW โ†‘ | +|-------|-------------|------|------|------|-------| +| InternVLA-N1 (S2) + [ShortestPathFollower](https://aihabitat.org/docs/habitat-lab/habitat.tasks.nav.shortest_path_follower.ShortestPathFollower.html) | - | 5.71 | 63.5 | 55.0 | 46.8 | +| InternVLA-N1 (Dual System) with NavDP* | RGB-D | 4.70 | 59.7 | 50.6 | 69.7 | +| InternVLA-N1 (Dual System) DualVLN | RGB | **4.58** | **61.4** | **51.8** | **70.0** | + +--- + +#### VLN-PE Benchmarks + +**๐Ÿ“ Flash Controller on R2R Unseen** +| Model | NE โ†“ | OS โ†‘ | SR โ†‘ | SPL โ†‘ | +|-------|------|------|------|-------| +| Seq2Seq | 8.27 | 43.0 | 15.7 | 9.7 | +| CMA | 7.52 | 45.0 | 24.4 | 18.2 | +| RDP | 6.98 | 42.5 | 24.9 | 17.5 | +| InternVLA-N1 (System 2) + iPlanner | 4.91 | 55.53 | 47.07 | 41.09 | +| InternVLA-N1 (System 2) + NavDP | 4.22 | 67.33 | 58.72 | 50.98 | +| InternVLA-N1 (Dual System) DualVLN | **3.90** | **69.93** | **63.62** | **56.49** | + +**๐Ÿ“ Physical Controller on R2R Unseen** +| Model | NE โ†“ | OS โ†‘ | SR โ†‘ | SPL โ†‘ | +|-------|------|------|------|-------| +| Seq2Seq | 7.88 | 28.1 | 15.1 | 10.7 | +| CMA | 7.26 | 31.4 | 22.1 | 18.6 | +| RDP | 6.72 | 36.9 | 25.2 | 17.7 | +| InternVLA-N1 (Dual System) DualVLN | **4.66** | **55.9** | **51.6** | **42.49** | + + +#### Visual Navigation Benchmarks + +**๐Ÿ“ ClutteredEnv Dataset** +| Model | SR โ†‘ | SPL โ†‘ | +|-------|------|-------| +| iPlanner | 84.8 | 83.6 | +| ViPlanner | 72.4 | 72.3 | +| NavDP | **89.8** | **87.7** | + +**๐Ÿ“ InternScenes Dataset** +| Model | SR โ†‘ | SPL โ†‘ | +|-------|------|-------| +| iPlanner | 48.8 | 46.7 | +| ViPlanner | 54.3 | 52.5 | +| NavDP | **65.7** | **60.7** | + +--- ## ๐Ÿ”ง Customization @@ -198,6 +214,23 @@ For example, raising issues, fixing bugs in the framework, and adapting or addin **Note:** We welcome the feedback of the model's zero-shot performance when deploying in your own environment. Please show us your results and offer us your future demands regarding the model's capability. 
We will select the most valuable ones and collaborate with users together to solve them in the next few months :) +## ๐Ÿš€ Community Deployment & Best Practices + +We are excited to see InternNav being deployed and extended by the community across different robots and real-world scenarios. +Below are selected community-driven deployment guides and solution write-ups, which may serve as practical references for advanced users. + +- **IROS Challenge Nav Track: Champion Solution (2025)** + A complete system-level solution and design analysis for Vision-and-Language Navigation in Physical Environments. + ๐Ÿ”— https://zhuanlan.zhihu.com/p/1969046543286907790 + +- **Go2 Series Deployment Tutorial (ShanghaiTech University)** + Step-by-step edge deployment guide for InternNav-based perception and navigation. + ๐Ÿ”— https://github.com/cmjang/InternNav-deploy + +- **G1 Series Deployment Tutorial (Wuhan University)** + Detailed educational materials on vision-language navigation deployment. + ๐Ÿ”— [*Chapter 5: Vision-Language Navigation (Part II)*](https://mp.weixin.qq.com/s/p3cJzbRvecMajiTh9mXoAw) + ## ๐Ÿ”— Citation If you find our work helpful, please cite: @@ -240,6 +273,15 @@ If you use the specific pretrained models and benchmarks, please kindly cite the year = {2025}, booktitle={arXiv}, } +@misc{wei2025groundslowfastdualsystem, + title={Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation}, + author={Meng Wei and Chenyang Wan and Jiaqi Peng and Xiqian Yu and Yuqiang Yang and Delin Feng and Wenzhe Cai and Chenming Zhu and Tai Wang and Jiangmiao Pang and Xihui Liu}, + year={2025}, + eprint={2512.08186}, + archivePrefix={arXiv}, + primaryClass={cs.RO}, + url={https://arxiv.org/abs/2512.08186}, +} ``` diff --git a/internnav/agent/__init__.py b/internnav/agent/__init__.py index 74aa4bcd..8ccc4f96 100644 --- a/internnav/agent/__init__.py +++ b/internnav/agent/__init__.py @@ -1,13 +1,8 @@ from internnav.agent.base import Agent from internnav.agent.cma_agent import CmaAgent +from internnav.agent.dialog_agent import DialogAgent +from internnav.agent.internvla_n1_agent import InternVLAN1Agent from internnav.agent.rdp_agent import RdpAgent from internnav.agent.seq2seq_agent import Seq2SeqAgent -from internnav.agent.internvla_n1_agent import InternVLAN1Agent -__all__ = [ - 'Agent', - 'CmaAgent', - 'RdpAgent', - 'Seq2SeqAgent', - 'InternVLAN1Agent' -] +__all__ = ['Agent', 'DialogAgent', 'CmaAgent', 'RdpAgent', 'Seq2SeqAgent', 'InternVLAN1Agent'] diff --git a/internnav/agent/base.py b/internnav/agent/base.py index a68626f6..02a566d5 100644 --- a/internnav/agent/base.py +++ b/internnav/agent/base.py @@ -25,6 +25,7 @@ def decorator(agent_class): if agent_type in cls.agents: raise ValueError(f"Agent {agent_type} already registered.") cls.agents[agent_type] = agent_class + return agent_class return decorator diff --git a/internnav/agent/dialog_agent.py b/internnav/agent/dialog_agent.py new file mode 100644 index 00000000..8eec7a92 --- /dev/null +++ b/internnav/agent/dialog_agent.py @@ -0,0 +1,472 @@ +import argparse +import copy +import itertools +import random +import re +import time +from collections import OrderedDict +from typing import Any, Dict + +import numpy as np +import quaternion +import torch +from PIL import Image, ImageDraw + +from internnav.agent import Agent +from internnav.configs.agent import AgentCfg + +try: + from transformers import ( + AutoProcessor, + AutoTokenizer, + Qwen2_5_VLForConditionalGeneration, + ) + from depth_camera_filtering 
import filter_depth + from habitat.tasks.nav.shortest_path_follower import ShortestPathFollower +except Exception as e: + print(f"Warning: ({e}), Habitat Evaluation is not loaded in this runtime. Ignore this if not using Habitat.") + +DEFAULT_IMAGE_TOKEN = "" + + +def split_and_clean(text): + import re + + parts = re.split(r'()', text) + results = [] + for part in parts: + if part == '': + results.append(part) + else: + clean_part = part.replace('\n', '').strip() + if clean_part: + results.append(clean_part) + return results + + +@Agent.register('dialog') +class DialogAgent(Agent): + """Vision-language navigation agent that can either move or ask an oracle via dialog. The agent builds a multimodal + chat prompt from current/historical RGB observations (and optional look-down frames), runs a Qwen2.5-VL model to + produce either an action sequence, a pixel waypoint, or a dialog query, then converts the model output into + simulator actions and (optionally) a world-space navigation goal. + + Args: + agent_config (AgentCfg): AgentCfg containing model_settings (e.g., task name, sensor config, model path, mode, + resizing, dialog flags, and generation parameters) and runtime info such as local_rank. + """ + + def __init__(self, agent_config: AgentCfg): + self.agent_config = agent_config + self.task_name = self.agent_config.model_settings['task_name'] + + # sensor config + self.sim_sensors_config = self.agent_config.model_settings['sim_sensors_config'] + self._camera_height = self.sim_sensors_config.rgb_sensor.position[1] + self._min_depth = self.sim_sensors_config.depth_sensor.min_depth + self._max_depth = self.sim_sensors_config.depth_sensor.max_depth + self._camera_fov = np.deg2rad(self.sim_sensors_config.depth_sensor.hfov) + self._fx = self._fy = self.sim_sensors_config.depth_sensor.width / (2 * np.tan(self._camera_fov / 2)) + + # model + self.model_args = argparse.Namespace(**self.agent_config.model_settings) + + self.task = self.model_args.task + self.append_look_down = self.model_args.append_look_down + self.resize_h = self.model_args.resize_h + self.resize_w = self.model_args.resize_w + + tokenizer = AutoTokenizer.from_pretrained(self.model_args.model_path, use_fast=True) + processor = AutoProcessor.from_pretrained(self.model_args.model_path) + processor.tokenizer = tokenizer + processor.tokenizer.padding_side = 'left' + + self.device = torch.device('cuda', self.agent_config.model_settings['local_rank']) + if self.model_args.mode == 'dual_system': + raise NotImplementedError("Dual System mode is not supported in DialogAgent.") + elif self.model_args.mode == 'system2': + model = Qwen2_5_VLForConditionalGeneration.from_pretrained( + self.model_args.model_path, + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2", + device_map={"": self.device}, + ) + else: + raise ValueError(f"Invalid mode: {self.model_args.mode}") + + model.eval() + + self.model = model + self.processor = processor + self.num_history = self.model_args.num_history + + # prompt + if 'dialog' in self.task_name or self.agent_config.model_settings['dialog_enabled']: + prompt = "You are an autonomous navigation assistant. Your task is to . Where should you go next to stay on track? There is an oracle can help you to complete the task in current environment, you can either choose to move or talk. If choosing to talk, please say something that can help you better to find the target object. 
If choosing to move, when you want to output a waypoint you need to TILT DOWN (โ†“) by 30 degrees then output the next waypoint\'s coordinates in the image. In case the next waypoint is out of view, utilize the turn actions: TURN LEFT (โ†) or TURN RIGHT (โ†’) by 15 degrees. Please output STOP when you have successfully completed the task." + else: + prompt = "You are an autonomous navigation assistant. Your task is to . Where should you go next to stay on track? When you want to output a waypoint you need to TILT DOWN (โ†“) by 30 degrees then output the next waypoint\'s coordinates in the image. In case the next waypoint is out of view, utilize the turn actions: TURN LEFT (โ†) or TURN RIGHT (โ†’) by 15 degrees. Please output STOP when you have successfully completed the task." + answer = "" + self.conversation = [{"from": "human", "value": prompt}, {"from": "gpt", "value": answer}] + + self.conjunctions = [ + 'you can see ', + 'in front of you is ', + 'there is ', + 'you can spot ', + 'you are toward the ', + 'ahead of you is ', + 'in your sight is ', + ] + + self.actions2idx = OrderedDict( + { + 'STOP': [0], + "โ†‘": [1], + "โ†": [2], + "โ†’": [3], + "โ†“": [5], + } + ) + + def convert_input(self, obs, info): + # update new information after env.step + depth = obs["depth"] + depth = filter_depth(depth.reshape(depth.shape[:2]), blur_type=None) + depth = depth * (self._max_depth - self._min_depth) + self._min_depth + self.depth = depth * 1000 # get depth + + rgb = obs["rgb"] + image = Image.fromarray(rgb).convert('RGB') # raw observation image + self.save_raw_image = image.copy() # get rgb + + x, y = obs["gps"] + camera_yaw = obs["compass"][0] + agent_state = info['agent state'] + height = agent_state.position[1] - self.initial_height # Habitat GPS makes west negative, so flip y + camera_position = np.array([x, -y, self._camera_height + height]) + self.tf_camera_to_episodic = ( + self.xyz_yaw_pitch_to_tf_matrix(camera_position, camera_yaw, np.deg2rad(30)) @ self.get_axis_align_matrix() + ) # get transformation from camera to agent + + # if last action is look down, save the newest look down image for later pixel selection + if self.last_action == 5: + self.look_down_image = image + self.save_raw_image = self.look_down_image.copy() + elif self.last_action != 6: + image = image.resize((self.resize_w, self.resize_h)) + self.rgb_list.append(image) + return obs + + def convert_output(self, env, llm_outputs: str): + if '' in llm_outputs: + self.question = llm_outputs.replace('', '') + return 6 + else: + if bool(re.search(r'\d', llm_outputs)): # output pixel goal + # get pixel goal + self.forward_action = 0 + coord = [int(c) for c in re.findall(r'\d+', llm_outputs)] + print('coords:', coord) + try: + pixel_goal = [int(coord[1]), int(coord[0])] # switch the goal o + except Exception as e: + print(f"Invalid pixel goal: len(coor)!=2: {e}") + return 0 + + # trans pixel goal to global goal + try: + self.goal = self.pixel_to_gps( + pixel_goal, self.depth / 1000, self.intrinsic_matrix, self.tf_camera_to_episodic + ) + except Exception as e: + print(f"Invalid pixel goal: out of image size range: {e}") + return 0 + self.goal = (self.transformation_matrix @ np.array([-self.goal[1], 0, -self.goal[0], 1]))[:3] + if not env._env.sim.pathfinder.is_navigable(np.array(self.goal)): + self.goal = np.array(env._env.sim.pathfinder.snap_point(np.array(self.goal))) + + # paint pixel goal + draw = ImageDraw.Draw(self.save_raw_image, 'RGB') + x, y, r = pixel_goal[0], pixel_goal[1], 2 + draw.ellipse([(x - r, y - r), (x 
+ r, y + r)], fill=(255, 0, 0)) + + # look down --> horizontal + env.step(4) + env.step(4) + + if self.append_look_down and self.look_down_image is not None: + self.prev_look_image = self.look_down_image.resize((self.resize_w, self.resize_h)) + action = self.agent.get_next_action(self.goal) + if action == 0: + self.goal = None + self.messages = [] + print('conduct a random action 2') + self.last_action = 2 + return 2 + print('predicted goal', pixel_goal, self.goal, flush=True) + else: + self.action_seq = self.parse_actions(llm_outputs) + print('actions', self.action_seq, flush=True) + + def inference(self, obs, info): + if self.last_action == 6: + self.dialogs.append({'role': 'navigator', 'message': self.question, 'true_idx': self.step_id}) + self.dialogs.append({'role': 'oracle', 'message': obs['npc_answer'], 'true_idx': self.step_id}) + self.messages.append({'role': 'assistant', 'content': [{'type': 'text', 'text': self.last_llm_outputs}]}) + self.messages.append({'role': 'user', 'content': [{'type': 'text', 'text': obs['npc_answer']}]}) + elif self.last_action == 5: + sources = [{"from": "human", "value": ""}, {"from": "gpt", "value": ""}] + self.input_images += [self.look_down_image] + self.messages.append({'role': 'assistant', 'content': [{'type': 'text', 'text': self.last_llm_outputs}]}) + input_img_id = -1 + else: + sources = copy.deepcopy(self.conversation) + sources[0]["value"] = sources[0]["value"].replace('', info['episode_instruction']) + cur_images = self.rgb_list[-1:] # current observation + if self.step_id == 0: + history_id = [] + else: + history_id = np.unique(np.linspace(0, self.step_id - 1, self.num_history, dtype=np.int32)).tolist() + # add dialod history + dialogs_idx = np.sort(list(set([i['true_idx'] for i in self.dialogs]))).tolist() + history_id = np.sort(np.unique(np.concatenate([history_id, dialogs_idx]).astype(np.int32))).tolist() + placeholder = [''] * (len(history_id) + 1) + for n in dialogs_idx: + pos = history_id.index(n) + output = "" + for dialog in self.dialogs: + if dialog['true_idx'] == n: + output += f"<|{dialog['role']}|>{dialog['message']}" + placeholder[pos + 1] = "<|dialog_start|>" + output + "<|dialog_end|>" + # add image history + placeholder = (DEFAULT_IMAGE_TOKEN + '\n').join(placeholder) + sources[0]["value"] += f' These are your historical observations: {placeholder}.' + if self.append_look_down: + if self.prev_look_image is not None: + sources[0]["value"] += f' Your previous look down image is:{DEFAULT_IMAGE_TOKEN}.' + else: + sources[0]["value"] += ' Your previous look down image is not here.' + history_id = sorted(history_id) + print('history_id', self.step_id, history_id) + # prepare images + if self.append_look_down: + if self.prev_look_image is not None: + self.input_images = [self.rgb_list[i] for i in history_id] + [self.prev_look_image] + cur_images + else: + self.input_images = [self.rgb_list[i] for i in history_id] + cur_images + else: + self.input_images = [self.rgb_list[i] for i in history_id] + cur_images + input_img_id = 0 + + if self.last_action != 6: + # prompt text + prompt = random.choice(self.conjunctions) + DEFAULT_IMAGE_TOKEN + sources[0]["value"] += f" {prompt}." 
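+        # Interleave text and images: split_and_clean() splits the accumulated prompt on the
+        # image placeholder token, and each placeholder is paired with the matching entry of
+        # self.input_images when the chat-template messages are assembled below.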
+ prompt_instruction = copy.deepcopy(sources[0]["value"]) + + # prompt images + parts = split_and_clean(prompt_instruction) + content = [] + for i in range(len(parts)): + if parts[i] == "": + content.append({"type": "image", "image": self.input_images[input_img_id]}) + input_img_id += 1 + else: + content.append({"type": "text", "text": parts[i]}) + + self.messages.append({'role': 'user', 'content': content}) + # inference + text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) + print('step_id', self.step_id, ' ', text) + inputs = self.processor(text=[text], images=self.input_images, return_tensors="pt").to(self.model.device) + with torch.no_grad(): + output_ids = self.model.generate( + **inputs, max_new_tokens=self.agent_config.model_settings['max_new_tokens'], do_sample=False + ) + llm_outputs = self.processor.tokenizer.decode( + output_ids[0][inputs.input_ids.shape[1] :], skip_special_tokens=True + ) + print('step_id:', self.step_id, 'output text:', llm_outputs) + return llm_outputs + + def step(self, obs: Dict[str, Any], env, info): + print(f'{self.agent_config.model_name} Agent step') + start = time.time() + # convert obs to model input + self.step_id = info['step'] + obs = self.convert_input(obs, info) + if len(self.action_seq) == 0 and self.goal is None: + llm_outputs = self.inference(obs, info) + self.last_llm_outputs = llm_outputs + action = self.convert_output(env, llm_outputs) + with open(info['output_path'], 'a') as f: + f.write(str(self.step_id) + " " + llm_outputs + "\n") + else: + action = None + + if action is None: + if len(self.action_seq) != 0: + action = self.action_seq.pop(0) + elif self.goal is not None: + action = self.agent.get_next_action(self.goal) + action = action.detach().cpu().numpy()[0] if isinstance(action, torch.Tensor) else action + action = action[0] if hasattr(action, "__len__") else action + + self.forward_action += 1 + print('forward_action', self.forward_action, flush=True) + if self.forward_action > 8: + self.goal = None + self.messages = [] + self.forward_action = 0 + end = time.time() + print(f'time: {round(end-start, 4)}s') + # return a meaningless action to do nothing + return 7 + if action == 0: + self.goal = None + self.messages = [] + end = time.time() + print(f'time: {round(end-start, 4)}s') + # return a meaningless action to do nothing + return 7 + else: + action = 0 + + end = time.time() + print(f'time: {round(end-start, 4)}s') + self.last_action = action + return action + + def reset(self, env): + self.intrinsic_matrix = self.get_intrinsic_matrix(self.sim_sensors_config.rgb_sensor) + self.agent = ShortestPathFollower(env._env.sim, 0.25, False) + + # params saving and initialization + agent_state = env._env.sim.get_agent_state() + rotation_matrix = quaternion.as_rotation_matrix(agent_state.rotation) + self.transformation_matrix = np.eye(4) + self.transformation_matrix[:3, :3] = rotation_matrix + self.transformation_matrix[:3, 3] = agent_state.position # get transformation from world to agent + self.initial_height = agent_state.position[1] # get initial height + + self.last_action = None + self.messages = [] + self.rgb_list = [] + self.action_seq = [] + self.goal = None + self.prev_look_image = None + self.look_down_image = None # params for qwen model + + self.dialogs = [] + + # params for saving + self.save_raw_image = None + + def get_intrinsic_matrix(self, sensor_cfg) -> np.ndarray: + width = sensor_cfg.width + height = sensor_cfg.height + fov = sensor_cfg.hfov + fx = (width / 2.0) / 
np.tan(np.deg2rad(fov / 2.0)) + fy = fx # Assuming square pixels (fx = fy) + cx = (width - 1.0) / 2.0 + cy = (height - 1.0) / 2.0 + + intrinsic_matrix = np.array( + [[fx, 0.0, cx, 0.0], [0.0, fy, cy, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]] + ) + return intrinsic_matrix + + def get_axis_align_matrix(self): + ma = np.array([[0, 0, 1, 0], [-1, 0, 0, 0], [0, -1, 0, 0], [0, 0, 0, 1]]) + return ma + + def xyz_yaw_to_tf_matrix(self, xyz: np.ndarray, yaw: float) -> np.ndarray: + x, y, z = xyz + transformation_matrix = np.array( + [ + [np.cos(yaw), -np.sin(yaw), 0, x], + [np.sin(yaw), np.cos(yaw), 0, y], + [0, 0, 1, z], + [0, 0, 0, 1], + ] + ) + return transformation_matrix + + def xyz_pitch_to_tf_matrix(self, xyz: np.ndarray, pitch: float) -> np.ndarray: + """Converts a given position and pitch angle to a 4x4 transformation matrix. + + Args: + xyz (np.ndarray): A 3D vector representing the position. + pitch (float): The pitch angle in radians for y axis. + + Returns: + np.ndarray: A 4x4 transformation matrix. + """ + + x, y, z = xyz + transformation_matrix = np.array( + [ + [np.cos(pitch), 0, np.sin(pitch), x], + [0, 1, 0, y], + [-np.sin(pitch), 0, np.cos(pitch), z], + [0, 0, 0, 1], + ] + ) + return transformation_matrix + + def xyz_yaw_pitch_to_tf_matrix(self, xyz: np.ndarray, yaw: float, pitch: float) -> np.ndarray: + """Converts a given position and yaw, pitch angles to a 4x4 transformation matrix. + + Args: + xyz (np.ndarray): A 3D vector representing the position. + yaw (float): The yaw angle in radians. + pitch (float): The pitch angle in radians for y axis. + + Returns: + np.ndarray: A 4x4 transformation matrix. + """ + x, y, z = xyz + rot1 = self.xyz_yaw_to_tf_matrix(xyz, yaw)[:3, :3] + rot2 = self.xyz_pitch_to_tf_matrix(xyz, pitch)[:3, :3] + transformation_matrix = np.eye(4) + transformation_matrix[:3, :3] = rot1 @ rot2 + transformation_matrix[:3, 3] = xyz + return transformation_matrix + + def pixel_to_gps(self, pixel, depth, intrinsic, tf_camera_to_episodic): + """Back-project a 2D image pixel into 3D using the depth map and camera intrinsics. + + Args: + pixel (Tuple[int, int] | List[int] | np.ndarray): pixel coordinate in (v, u) indexing as used here. + depth (np.ndarray): depth image of shape (H, W) in meters, where depth[v, u] returns the metric depth. + intrinsic (np.ndarray): camera intrinsic matrix. + tf_camera_to_episodic (np.ndarray): homogeneous transform of shape (4, 4) mapping camera-frame points to + the episodic frame. + + Returns: + Tuple[float, float]: coordinates in the episodic frame. 
+ """ + v, u = pixel + z = depth[v, u] + print("depth", z) + + x = (u - intrinsic[0, 2]) * z / intrinsic[0, 0] + y = (v - intrinsic[1, 2]) * z / intrinsic[1, 1] + point_camera = np.array([x, y, z, 1.0]) + + # Transform to episodic frame + point_episodic = tf_camera_to_episodic @ point_camera + point_episodic = point_episodic[:3] / point_episodic[3] + + x = point_episodic[0] + y = point_episodic[1] + + return (x, y) # same as habitat gps + + def parse_actions(self, output): + action_patterns = '|'.join(re.escape(action) for action in self.actions2idx) + regex = re.compile(action_patterns) + matches = regex.findall(output) + actions = [self.actions2idx[match] for match in matches] + actions = itertools.chain.from_iterable(actions) + return list(actions) diff --git a/internnav/agent/internvla_n1_agent.py b/internnav/agent/internvla_n1_agent.py index 8e696af7..d2e1d278 100644 --- a/internnav/agent/internvla_n1_agent.py +++ b/internnav/agent/internvla_n1_agent.py @@ -207,7 +207,7 @@ def s2_thread_func(): self.s2_thread.daemon = True self.s2_thread.start() - def should_infer_s2(self, mode="sync"): + def should_infer_s2(self, mode="partial_async"): """Function: Enables the sys2 inference thread depending on the mode. mode: just support 2 modes: "sync" and "partial_async". "sync": Synchronous mode (navdp_version >= 0.0), Sys1 and Sys2 execute in a sequential inference chain. @@ -298,8 +298,6 @@ def step(self, obs): if self.sys1_infer_times > 0: self.dual_forward_step += 1 - # print('Output action:', output, self.dual_forward_step) - else: self.look_down = False # 2. If output is in latent form, execute latent S1 @@ -333,13 +331,9 @@ def step(self, obs): .unsqueeze(-1) .to(self.device) ) # [1, 2, 224, 224, 1] - self.s1_output = self.policy.s1_step_latent( - rgbs, depths, self.s2_output.output_latent, use_async=True - ) + self.s1_output = self.policy.s1_step_latent(rgbs, depths, self.s2_output.output_latent) else: - self.s1_output = self.policy.s1_step_latent( - rgb, depth * 10000.0, self.s2_output.output_latent, use_async=False - ) + self.s1_output = self.policy.s1_step_latent(rgb, depth * 10000.0, self.s2_output.output_latent) else: assert False, f"S2 output should be either action or latent, but got neither! 
{self.s2_output}" @@ -372,6 +366,7 @@ def step(self, obs): if self.dual_forward_step > self.sys2_max_forward_step: print("!!!!!!!!!!!!") print("ERR: self.dual_forward_step ", self.dual_forward_step, " > ", self.sys2_max_forward_step) + print("Potential reason: sys1 infers empty trajectory list []") print("!!!!!!!!!!!!") print('Output discretized traj:', output['action'], self.dual_forward_step) diff --git a/internnav/agent/internvla_n1_agent_realworld.py b/internnav/agent/internvla_n1_agent_realworld.py index 3165bfae..1cb71c19 100644 --- a/internnav/agent/internvla_n1_agent_realworld.py +++ b/internnav/agent/internvla_n1_agent_realworld.py @@ -222,10 +222,10 @@ def step_s2(self, rgb, depth, pose, instruction, intrinsic, look_down=False): **inputs, max_new_tokens=128, do_sample=False, - use_cache=True, - past_key_values=self.past_key_values, + # use_cache=True, + # past_key_values=self.past_key_values, return_dict_in_generate=True, - raw_input_ids=copy.deepcopy(inputs.input_ids), + # raw_input_ids=copy.deepcopy(inputs.input_ids), ) output_ids = outputs.sequences @@ -253,5 +253,5 @@ def step_s2(self, rgb, depth, pose, instruction, intrinsic, look_down=False): return action_seq, None, None def step_s1(self, latent, rgb, depth): - all_trajs = self.model.generate_traj(latent, rgb, depth, use_async=True) + all_trajs = self.model.generate_traj(latent, rgb, depth) return all_trajs diff --git a/internnav/configs/evaluator/__init__.py b/internnav/configs/evaluator/__init__.py index 27e63a31..7eb1c4f8 100644 --- a/internnav/configs/evaluator/__init__.py +++ b/internnav/configs/evaluator/__init__.py @@ -39,15 +39,17 @@ class MetricCfg(BaseModel): class TaskCfg(BaseModel): task_name: Optional[str] = None - task_settings: Dict[str, Any] - scene: SceneCfg + task_settings: Dict[str, Any] = None + scene: SceneCfg = None robot_name: Optional[str] = None robot: Optional[RobotCfg] = None robot_flash: Optional[bool] = None + flash_collision: Optional[bool] = None robot_usd_path: Optional[str] = None camera_resolution: Optional[List[int]] = None metric: Optional[MetricCfg] = None camera_prim_path: Optional[str] = None + one_step_stand_still: Optional[bool] = None class EvalDatasetCfg(BaseModel): diff --git a/internnav/configs/evaluator/vln_default_config.py b/internnav/configs/evaluator/vln_default_config.py index 71bb6967..e22841fd 100644 --- a/internnav/configs/evaluator/vln_default_config.py +++ b/internnav/configs/evaluator/vln_default_config.py @@ -239,6 +239,9 @@ def get_config(evaluator_cfg: EvalCfg): # add the flash controller in, by flash flag. 
if evaluator_cfg.task.robot_flash: + vln_move_by_flash_cfg.type = ( + 'VlnMoveByFlashCollisionController' if evaluator_cfg.task.flash_collision else 'VlnMoveByFlashController' + ) robot.controllers.append(ControllerCfg(controller_settings=vln_move_by_flash_cfg.model_dump())) if evaluator_cfg.task.robot_flash or evaluator_cfg.eval_settings.get('vis_output', True): diff --git a/internnav/dataset/internvla_n1_lerobot_dataset.py b/internnav/dataset/internvla_n1_lerobot_dataset.py new file mode 100644 index 00000000..52981731 --- /dev/null +++ b/internnav/dataset/internvla_n1_lerobot_dataset.py @@ -0,0 +1,1385 @@ +import copy +import itertools +import json +import os +import random +import re +import time +from dataclasses import dataclass +from typing import Dict, List, Sequence, Tuple + +import numpy as np +import torch +import transformers +from decord import VideoReader +from PIL import Image +from torch.utils.data import Dataset +from torchcodec.decoders import VideoDecoder +from transformers.image_utils import to_numpy_array + +from .vlln_lerobot_dataset import VLLNDataset +from .rope2d import get_rope_index_2, get_rope_index_25 + +# Define placeholders for dataset paths +CAMBRIAN_737K = { + "annotation_path": "PATH_TO_CAMBRIAN_737K_ANNOTATION", + "data_path": "", +} + +CAMBRIAN_737K_PACK = { + "annotation_path": f"PATH_TO_CAMBRIAN_737K_ANNOTATION_PACKED", # noqa: F541 + "data_path": "", +} + +MP_DOC = { + "annotation_path": "PATH_TO_MP_DOC_ANNOTATION", + "data_path": "PATH_TO_MP_DOC_DATA", +} + +CLEVR_MC = { + "annotation_path": "PATH_TO_CLEVR_MC_ANNOTATION", + "data_path": "PATH_TO_CLEVR_MC_DATA", +} + +VIDEOCHATGPT = { + "annotation_path": "PATH_TO_VIDEOCHATGPT_ANNOTATION", + "data_path": "PATH_TO_VIDEOCHATGPT_DATA", +} + + +R2R_125CM_0_30 = { + "data_path": "traj_data/r2r", + "height": 125, + "pitch_1": 0, + "pitch_2": 30, +} + +R2R_125CM_0_45 = { + "data_path": "traj_data/r2r", + "height": 125, + "pitch_1": 0, + "pitch_2": 45, +} + +R2R_60CM_15_15 = { + "data_path": "traj_data/r2r", + "height": 60, + "pitch_1": 15, + "pitch_2": 15, +} + +R2R_60CM_30_30 = { + "data_path": "traj_data/r2r", + "height": 60, + "pitch_1": 30, + "pitch_2": 30, +} + +RxR_125CM_0_30 = { + "data_path": "traj_data/rxr", + "height": 125, + "pitch_1": 0, + "pitch_2": 30, +} + +RxR_125CM_0_45 = { + "data_path": "traj_data/rxr", + "height": 125, + "pitch_1": 0, + "pitch_2": 45, +} + +RxR_60CM_15_15 = { + "data_path": "traj_data/rxr", + "height": 60, + "pitch_1": 15, + "pitch_2": 15, +} + +RxR_60CM_30_30 = { + "data_path": "traj_data/rxr", + "height": 60, + "pitch_1": 30, + "pitch_2": 30, +} + +SCALEVLN_125CM_0_30 = { + "data_path": "traj_data/scalevln", + "height": 125, + "pitch_1": 0, + "pitch_2": 30, +} + +SCALEVLN_125CM_0_45 = { + "data_path": "traj_data/scalevln", + "height": 125, + "pitch_1": 0, + "pitch_2": 45, +} + +SCALEVLN_60CM_30_30 = { + "data_path": "traj_data/scalevln", + "height": 60, + "pitch_1": 30, + "pitch_2": 30, +} + +data_dict = { + "cambrian_737k": CAMBRIAN_737K, + "cambrian_737k_pack": CAMBRIAN_737K_PACK, + "mp_doc": MP_DOC, + "clevr_mc": CLEVR_MC, + "videochatgpt": VIDEOCHATGPT, + "r2r_125cm_0_30": R2R_125CM_0_30, + "r2r_125cm_0_45": R2R_125CM_0_45, + "r2r_60cm_15_15": R2R_60CM_15_15, + "r2r_60cm_30_30": R2R_60CM_30_30, + "rxr_125cm_0_30": RxR_125CM_0_30, + "rxr_125cm_0_45": RxR_125CM_0_45, + "rxr_60cm_15_15": RxR_60CM_15_15, + "rxr_60cm_30_30": RxR_60CM_30_30, + "scalevln_125cm_0_30": SCALEVLN_125CM_0_30, + "scalevln_125cm_0_45": SCALEVLN_125CM_0_45, + "scalevln_60cm_30_30": 
SCALEVLN_60CM_30_30, +} + + +def parse_sampling_rate(dataset_name): + match = re.search(r"%(\d+)$", dataset_name) + if match: + return int(match.group(1)) / 100.0 + return 1.0 + + +def read_jsonl(path): + with open(path, "r") as f: + return [json.loads(line) for line in f] + + +def data_list(dataset_names): + config_list = [] + for dataset_name in dataset_names: + sampling_rate = parse_sampling_rate(dataset_name) + dataset_name = re.sub(r"%(\d+)$", "", dataset_name) + if dataset_name in data_dict.keys(): + config = data_dict[dataset_name].copy() + config["sampling_rate"] = sampling_rate + config_list.append(config) + else: + raise ValueError(f"do not find {dataset_name}") + return config_list + + +IGNORE_INDEX = -100 +IMAGE_TOKEN_INDEX = 151655 +VIDEO_TOKEN_INDEX = 151656 +TRAJ_TOKEN_INDEX = 151667 +DEFAULT_IMAGE_TOKEN = "" +DEFAULT_VIDEO_TOKEN = "