👁️ Perception Guide · Updated June 2026

Computer Vision for Robots 2026

From classical OpenCV to SAM2 and FoundationPose — a practical guide to robot perception stacks with ROS 2 integration code.

YOLO v11 · SAM2 · FoundationPose · Grounded-SAM2 · OpenCV · Isaac ROS SLAM

Tool Comparison

| Tool | Category | Speed | ROS 2 | Score |
|---|---|---|---|---|
| YOLO v11 (Best Real-time Detection) | detection | 2–5ms / frame (RTX 4090) | ✅ | 95 |
| Segment Anything Model 2 (SAM2) (Best Promptable Segmentation) | segmentation | 40–120ms / frame (RTX 3090) | ✅ | 92 |
| FoundationPose (Best 6D Pose Estimation) | pose | 20–50ms / frame (RTX 4090) | ✅ | 91 |
| Grounded-SAM2 (GDINO + SAM2) (Best Open-World Detection+Segmentation) | foundation | 200–400ms / frame (RTX 4090) | ✅ | 89 |
| OpenCV 4.x (Best Classical CV Foundation) | foundation | < 1ms for classical algorithms | ✅ | 90 |
| Isaac ROS Visual SLAM (Best GPU-Accelerated SLAM) | depth | 30Hz (Jetson AGX Orin) | ✅ | 88 |

Tool Reviews

Best Real-time Detection

YOLO v11

detection · CUDA 11.8+ / Jetson (TensorRT)

95/100

Ultralytics YOLO v11 (released as YOLO11): production-ready real-time object detection pretrained on the 80 COCO classes, with instance segmentation, pose, and oriented bounding box (OBB) variants in the same framework.

Speed

2–5ms / frame (RTX 4090)

Accuracy

54.7 mAP50-95 (YOLO11x, COCO val2017)

ROS 2

✅ Official

GPU

CUDA 11.8+

Best for

Real-time object detection, Robot grasping target detection, Safety zone monitoring, Instance segmentation

✓ Pros

  • Fastest inference in its accuracy class — 2ms on RTX 4090
  • Unified model for detect/segment/pose/OBB in one framework
  • Official Ultralytics ROS 2 wrapper (ultralytics_ros)
  • TensorRT export for Jetson deployment (20+ FPS on Orin NX)
  • Best community + COCO-pretrained weights ecosystem
  • Python API + CLI — easiest fine-tuning pipeline

✗ Cons

  • Small object detection still weaker than two-stage detectors
  • Custom dataset annotation required for domain-specific robots

Best Promptable Segmentation

Segment Anything Model 2 (SAM2)

segmentation · CUDA 11.8+ / 8GB+ VRAM

92/100

Meta's SAM2 — promptable image and video segmentation. Point, box, or mask prompt → precise mask, even for novel objects without training.

Speed

40–120ms / frame (RTX 3090)

Accuracy

79.8 J&F (DAVIS)

ROS 2

✅ Community (ros2_sam2)

GPU

CUDA 11.8+

Best for

Zero-shot object segmentation, Manipulation mask generation, Novel object handling, Video object tracking

✓ Pros

  • Zero-shot — segments ANY object from a point/box prompt without retraining
  • Video tracking: propagate masks across frames for manipulation
  • SAM2 streaming mode for near-real-time (40ms) video segmentation
  • Huge open-source ecosystem: Grounded-SAM2, EfficientSAM, MobileSAM
  • Works on objects the robot has never seen, which is critical for open-world manipulation
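For manipulation, a SAM2 mask usually has to be turned into a 3D target for the gripper. The following is a minimal sketch of one way to do that, assuming a binary mask given as rows of 0/1, a depth image in metres aligned to the same pixels, and pinhole intrinsics `fx`, `fy`, `cx`, `cy`. The function names and sample values are illustrative and are not part of SAM2.

```python
def mask_centroid(mask):
    """Return the (u, v) pixel centroid of a binary mask, or None if empty."""
    total = sum_u = sum_v = 0
    for v, row in enumerate(mask):
        for u, value in enumerate(row):
            if value:
                total += 1
                sum_u += u
                sum_v += v
    if total == 0:
        return None
    return sum_u / total, sum_v / total


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at depth_m into camera XYZ."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m


def mask_to_point(mask, depth, fx, fy, cx, cy):
    """Back-project the mask centroid using the depth value at that pixel."""
    centroid = mask_centroid(mask)
    if centroid is None:
        return None
    u, v = centroid
    depth_m = depth[round(v)][round(u)]
    if depth_m <= 0:  # many depth cameras report 0 for invalid pixels
        return None
    return backproject(u, v, depth_m, fx, fy, cx, cy)


# Illustrative 4x4 example: a 2x2 blob in the middle, flat depth of 0.5 m.
demo_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
demo_depth = [[0.5] * 4 for _ in range(4)]
demo_point = mask_to_point(demo_mask, demo_depth, fx=100.0, fy=100.0, cx=1.5, cy=1.5)
```

The centroid of a concave mask can fall outside the object, so a real pipeline would likely sample depth over the whole mask (for example, the median) rather than at a single pixel.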

✗ Cons

  • 40–120ms latency — not suited for 30Hz real-time control loops
  • Requires a CUDA GPU; running on Jetson generally needs a quantized variant
  • Mask quality degrades on transparent/reflective objects

Best 6D Pose Estimation

FoundationPose

pose · CUDA 11.8+ / 16GB VRAM

91/100

NVIDIA FoundationPose — unified 6D pose estimation for both model-based and model-free scenarios. No per-object fine-tuning required.

Speed

20–50ms / frame (RTX 4090)

Accuracy

AUC 0.937 (YCB-Video)

ROS 2

✅ Official

GPU

CUDA 11.8+

Best for

Robot grasping pose, Known object tracking, Novel object pose (model-free), Pick & place bin picking

✓ Pros

  • Model-free mode: estimates pose for an object from 42 reference images alone, with no CAD model
  • Industry-leading YCB-Video AUC 0.937 — beats BundleSDF, MegaPose
  • NVIDIA Isaac ROS FoundationPose package for ROS 2 deployment
  • Real-time pose tracking at 20Hz on NVIDIA GPU
  • Handles symmetric objects and heavy occlusion
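A 6D pose is a rotation R and a translation t that map points from the object frame into the camera frame: p_cam = R · p_obj + t. The sketch below applies such a pose to a grasp point defined on the object, which is the step that connects a pose estimate to a gripper target. The plain-list matrix layout and the function names are assumptions for illustration, not the FoundationPose output format.

```python
import math


def transform_point(rotation, translation, point):
    """Apply p_cam = R @ p_obj + t using plain lists (R is 3x3, row-major)."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def rot_z(angle_rad):
    """Rotation matrix for a rotation of angle_rad about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Example: object rotated 90 degrees about z, 0.5 m in front of the camera.
R = rot_z(math.pi / 2)
t = (0.0, 0.0, 0.5)
grasp_in_object = (0.1, 0.0, 0.0)  # 10 cm along the object's x axis
grasp_in_camera = transform_point(R, t, grasp_in_object)
```

With the object turned 90 degrees, the grasp point that sat on the object's x axis ends up along the camera's y axis, 0.5 m away in z.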

✗ Cons

  • Requires NVIDIA GPU — no AMD/Apple Silicon support
  • 16GB VRAM minimum for full model (L4 or RTX 3090+)
  • Setup complexity: CUDA builds, Isaac ROS containers required

Best Open-World Detection+Segmentation

Grounded-SAM2 (GDINO + SAM2)

foundation · CUDA 11.8+ / 12GB+ VRAM

89/100

Grounding DINO + SAM2 pipeline — text prompt → detect any object → segment it. No classes, no training, just text description.

Speed

200–400ms / frame (RTX 4090)

Accuracy

ODINW avg. 57.0 AP

ROS 2

✅ Official

GPU

CUDA 11.8+

Best for

Open-vocabulary manipulation, Language-guided grasping, Task instruction following, Household robot applications

✓ Pros

  • Text to segmentation: 'pick up the blue cup' → mask in one pipeline
  • No training required — works on any object describable in language
  • Foundation for LLM-robot integration (robot receives language task → vision locates target)
  • Large open-source ecosystem on GitHub (IDEA Research)

✗ Cons

  • 200–400ms latency — requires async architecture for real-time control
  • GDINO text prompt sensitivity — ambiguous descriptions cause failures
  • Combined model memory footprint 12GB+ VRAM

Best Classical CV Foundation

OpenCV 4.x

foundation · CUDA (optional) / CPU default

90/100

OpenCV 4.10 — the universal computer vision library. SIFT, ORB, optical flow, camera calibration, stereo vision, and DNN module for deep model inference.

Speed

< 1ms for classical algorithms

Accuracy

Varies by algorithm

ROS 2

✅ Official

GPU

CUDA (optional)

Best for

Camera calibration, ArUco marker detection, Optical flow, Classical feature matching, Production edge deployment

✓ Pros

  • Universal standard — every roboticist knows it, every framework integrates it
  • ArUco marker detection: fastest pose estimation without GPU
  • Camera calibration: checkerboard to camera matrix in 50 lines
  • DNN module: run ONNX and TensorFlow models (PyTorch models via ONNX export) without installing the training framework
  • CPU-only deployment: runs on most embedded Linux systems with no GPU
  • cv_bridge in ROS 2 for seamless sensor_msgs/Image conversion

✗ Cons

  • Classical algorithms can't match deep learning accuracy on complex tasks
  • DNN module trails dedicated frameworks for GPU performance
  • Verbose C++ API — Python bindings cleaner but less documented

Best GPU-Accelerated SLAM

Isaac ROS Visual SLAM

depth · Jetson or discrete NVIDIA GPU (required)

88/100

NVIDIA Isaac ROS Visual SLAM — cuVSLAM visual-inertial odometry at 30Hz on Jetson, integrated with Nav2.

Speed

30Hz (Jetson AGX Orin)

Accuracy

< 1% ATE on TUM RGB-D

ROS 2

✅ Official

GPU

Jetson or discrete NVIDIA GPU (required)

Best for

Mobile robot localization, Indoor navigation without LiDAR, Stereo camera SLAM, Nav2 integration

✓ Pros

  • 30Hz with roughly 4cm localization accuracy on Jetson AGX Orin, the strongest edge SLAM option reviewed here
  • Stereo + IMU fusion — robust to dynamic lighting
  • Publishes /visual_slam/tracking/odometry → direct Nav2 integration
  • Free in Isaac ROS package — no license cost
  • Outperforms CPU ORB-SLAM3 by 10× in speed on Jetson

✗ Cons

  • Requires NVIDIA Jetson or GeForce/Quadro GPU — no CPU fallback
  • Isaac ROS container ecosystem (Docker) adds setup complexity
  • Degrades in feature-poor environments (white walls, dark rooms)

ROS 2 Integration Quick Reference

YOLO v11

ultralytics_ros

/yolo/detections → vision_msgs/Detection2DArray

pip install ultralytics && ros2 launch ultralytics_ros yolo.launch.py model:=yolo11n.pt
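Once detections arrive on /yolo/detections, a downstream node typically filters them before acting. The sketch below assumes the vision_msgs 4.x message layout, where each Detection2D carries `results[i].hypothesis.class_id` and `results[i].hypothesis.score`. Whether `class_id` holds a label such as "cup" or a numeric id depends on the wrapper's configuration. The helper `best_detection`, the node name `cup_finder`, and the 0.5 threshold are hypothetical, and rclpy is imported lazily so the filtering logic can be exercised without a ROS install.

```python
from types import SimpleNamespace


def best_detection(detections, wanted_class, min_score=0.5):
    """Return (score, detection) for the highest-scoring match, or None.

    `detections` is any iterable of objects shaped like vision_msgs/Detection2D:
    each has `.results`, whose items carry `.hypothesis.class_id` and
    `.hypothesis.score` (layout assumed from vision_msgs 4.x).
    """
    best = None
    for det in detections:
        for result in det.results:
            hyp = result.hypothesis
            if hyp.class_id != wanted_class or hyp.score < min_score:
                continue
            if best is None or hyp.score > best[0]:
                best = (hyp.score, det)
    return best


def main():
    # Imported here so the helper above stays usable without ROS installed.
    import rclpy
    from rclpy.node import Node
    from vision_msgs.msg import Detection2DArray

    class CupFinder(Node):  # hypothetical node, for illustration only
        def __init__(self):
            super().__init__("cup_finder")
            self.create_subscription(
                Detection2DArray, "/yolo/detections", self.on_detections, 10)

        def on_detections(self, msg):
            hit = best_detection(msg.detections, "cup")
            if hit is not None:
                self.get_logger().info(f"cup found, score={hit[0]:.2f}")

    rclpy.init()
    rclpy.spin(CupFinder())


def fake_detection(class_id, score):
    """Build a stand-in object with the same attribute layout, for testing."""
    hyp = SimpleNamespace(class_id=class_id, score=score)
    return SimpleNamespace(results=[SimpleNamespace(hypothesis=hyp)])
```

Calling `main()` from a ROS 2 entry point starts the subscriber; `fake_detection` exists only so the filter can be checked offline.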

SAM2

ros2_sam2 (community)

/sam2/masks → sensor_msgs/Image

pip install 'git+https://github.com/facebookresearch/sam2' && ros2 run ros2_sam2 sam2_node

FoundationPose

isaac_ros_foundationpose

/pose_estimation/output → geometry_msgs/PoseArray

Docker: isaac_ros_foundationpose:latest (NVIDIA NGC)

OpenCV

cv_bridge (built-in)

sensor_msgs/Image ↔ cv::Mat

sudo apt install ros-jazzy-cv-bridge ros-jazzy-vision-opencv

Recommended Stacks by Task

Pick & place — known objects

~70ms total

YOLO v11 (detect) + FoundationPose (6D pose) + MoveIt 2 (ROS 2 motion planning)

Gold standard for industrial bin picking. FoundationPose gives gripper approach angle.
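The ~70ms figure can be sanity-checked against the per-stage speeds quoted in the comparison table. A small sketch using those ranges (YOLO v11 at 2–5ms, FoundationPose at 20–50ms) follows. The headroom is whatever remains for image transport, pre-processing, and the hand-off to planning, which this guide does not break down, so treat it as a rough bound rather than a measurement.

```python
STAGES_MS = {
    "YOLO v11 (detect)": (2, 5),
    "FoundationPose (6D pose)": (20, 50),
}


def budget(stages, total_ms):
    """Sum best/worst-case stage latencies and report headroom under total_ms."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return {"best_ms": best, "worst_ms": worst, "headroom_ms": total_ms - worst}


result = budget(STAGES_MS, total_ms=70)
print(result)  # {'best_ms': 22, 'worst_ms': 55, 'headroom_ms': 15}
```

Running the same function with the Grounded-SAM2 (200–400ms) and FoundationPose ranges gives 220–450ms, which is consistent with the ~450ms quoted for the novel-object stack.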

Pick & place — novel objects

~450ms

Grounded-SAM2 (text→mask) + FoundationPose model-free

For household robots receiving language instructions ('pick up the mug').
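At roughly 450ms per perception cycle, this stack cannot sit inside a control loop directly. One common pattern is to run the slow model in a background thread and let the control loop read only the most recent result. The sketch below illustrates that pattern with the standard library. `slow_infer` is a stand-in for the real Grounded-SAM2 call (scaled down to 50ms so the demo finishes quickly), and all class and function names are hypothetical.

```python
import threading
import time


class LatestResult:
    """Thread-safe holder for the most recent perception output."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._stamp = None

    def set(self, value):
        with self._lock:
            self._value = value
            self._stamp = time.monotonic()

    def get(self):
        """Return (value, age_seconds); both are None until the first result."""
        with self._lock:
            if self._stamp is None:
                return None, None
            return self._value, time.monotonic() - self._stamp


def perception_worker(latest, stop, infer):
    """Run the slow model in a loop, storing each result as it finishes."""
    while not stop.is_set():
        latest.set(infer())


def control_loop(latest, steps, dt_s, max_age_s):
    """Fixed-rate loop that never waits on perception; it only reads the cache."""
    log = []
    for _ in range(steps):
        target, age = latest.get()
        if target is None or age > max_age_s:
            log.append("hold")  # no fresh target: keep a safe command
        else:
            log.append(f"track {target}")
        time.sleep(dt_s)
    return log


def slow_infer():
    """Stand-in for a Grounded-SAM2 call; sleeps to mimic inference time."""
    time.sleep(0.05)
    return "mug"


latest = LatestResult()
stop = threading.Event()
worker = threading.Thread(target=perception_worker, args=(latest, stop, slow_infer))
worker.start()
log = control_loop(latest, steps=20, dt_s=0.01, max_age_s=0.5)
stop.set()
worker.join()
```

The `max_age_s` check matters as much as the threading: if perception stalls, the loop falls back to a hold command instead of chasing a stale target.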

Navigation without LiDAR

33ms odometry

Isaac ROS Visual SLAM + Nav2 + RealSense D435i

Replaces a $2,000 LiDAR with a $200 depth camera; indoor performance is comparable.

Safety zone monitoring

< 10ms

YOLO v11 (pose estimation mode) + OpenCV safety zone logic

Human skeleton detection plus zone geometry supports ISO/TS 15066-style collaborative safety monitoring; a camera-only pipeline is not safety-rated on its own.
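The zone logic itself can be a point-in-polygon test on each detected keypoint. Below is a minimal sketch using ray casting in pure Python, assuming keypoints arrive as (x, y, confidence) tuples in image pixels and the zone is a polygon in the same image coordinates. OpenCV's `cv2.pointPolygonTest` performs the equivalent check. The function names and the 0.5 confidence threshold are illustrative.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside polygon [(x, y), ...]."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray cast to the right of the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def zone_violated(keypoints, zone, min_conf=0.5):
    """True if any sufficiently confident keypoint falls inside the zone.

    keypoints: iterable of (x, y, confidence) in image pixels.
    """
    return any(
        conf >= min_conf and point_in_polygon((x, y), zone)
        for x, y, conf in keypoints
    )


# Illustrative zone: a 200x200 pixel square in the image.
ZONE = [(100, 100), (300, 100), (300, 300), (100, 300)]
```

A test in the image plane ignores depth, so a person standing behind the zone can still trigger it; checking in 3D would require depth data or a calibrated ground plane.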

ArUco marker pose

< 2ms

OpenCV ArUco + cv_bridge + ROS 2 TF publisher

Simplest pose estimation. Use for structured environments, calibration targets.
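One practical detail in this stack: OpenCV pose routines such as `cv2.solvePnP` return orientation as a Rodrigues rotation vector, while a ROS 2 TF transform stores it as a quaternion. A minimal conversion sketch in pure Python follows; the function name is illustrative, and the (x, y, z, w) ordering matches geometry_msgs/Quaternion.

```python
import math


def rvec_to_quaternion(rvec):
    """Convert a Rodrigues rotation vector to a unit quaternion (x, y, z, w).

    The vector's direction is the rotation axis and its length is the
    rotation angle in radians.
    """
    rx, ry, rz = rvec
    angle = math.sqrt(rx * rx + ry * ry + rz * rz)
    if angle < 1e-12:
        return (0.0, 0.0, 0.0, 1.0)  # no rotation
    scale = math.sin(angle / 2.0) / angle
    return (rx * scale, ry * scale, rz * scale, math.cos(angle / 2.0))


# Example: a marker rotated 90 degrees about the camera's z axis.
quat = rvec_to_quaternion((0.0, 0.0, math.pi / 2))
```

The resulting tuple can be copied field by field into the `rotation` member of a TransformStamped, with the translation vector from the same pose call filling `translation`.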