UnoBench

UnoBench is a benchmark for obstruction-aware robotic grasping in cluttered scenes.

It is designed to evaluate whether a method can perform target-centric obstruction reasoning before grasping, either from natural-language instructions or from object-indexed visual representations.

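UnoBench supports both vision-language models and traditional or modular robotic reasoning algorithms through two complementary settings: an NLP setting driven by natural-language instructions and a SoM (Set-of-Mark) setting driven by object-indexed images.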

Unlike conventional grasping datasets that mainly focus on direct object detection or grasp pose prediction, UnoBench emphasizes which objects obstruct the target object and how such obstruction relationships can be represented, measured, and reasoned about.

Dataset Overview

UnoBench contains synthetic cluttered-scene data with RGB images, Set-of-Mark images, annotations, and obstruction-related metadata.

The benchmark focuses on target-centric obstruction reasoning. Given a target object, the method is expected to identify the objects that block or constrain access to the target before grasping.

The dataset provides:

Sample data

Natural Language Image: [image]
Query object name: "red and orange toy drill"
Target object coordinates: [(x1, y1)] (any point on the yellow detergent bottle)
Set-of-Mark Image: [image]
Query object ID: "Object 2"
Target object IDs: ["Object 3"]
Metadata:
      
{
  "index": 1992,
  "image_path": "images/image_000578.png",
  "som_image_path": "images_som/image_000578.png",
  "image_id": 578,
  "query_object": {
    "obj_id": 2,
    "object_name": "red and orange toy drill"
  },
  "target_objects": [
    {
      "obj_id": 3,
      "object_name": "yellow detergent bottle"
    }
  ],
  "occlusion_paths": [
    [3, 2]
  ],
  "difficulty": "Easy",
  "k_min": 1,
  "num_paths": 1,
  "only_som": 0
}
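A record like the one above can be read programmatically. The following Python sketch parses a copy of the sample record and derives the obstructing objects from `occlusion_paths`. It assumes, based only on this single sample, that each path contains the query object's ID together with the IDs of the objects obstructing it, and that `num_paths` equals the number of paths. The helper name `obstructors_from_paths` is hypothetical and not part of the dataset tooling.

```python
import json

# Copy of the sample record shown above, wrapped in braces so it parses as JSON.
SAMPLE = """
{
  "index": 1992,
  "image_path": "images/image_000578.png",
  "som_image_path": "images_som/image_000578.png",
  "image_id": 578,
  "query_object": {"obj_id": 2, "object_name": "red and orange toy drill"},
  "target_objects": [{"obj_id": 3, "object_name": "yellow detergent bottle"}],
  "occlusion_paths": [[3, 2]],
  "difficulty": "Easy",
  "k_min": 1,
  "num_paths": 1,
  "only_som": 0
}
"""


def obstructors_from_paths(record):
    """Collect every ID on any occlusion path except the query object itself.

    Assumption (inferred from one sample only): a path holds the query
    object's ID plus the IDs of the objects that obstruct it.
    """
    query_id = record["query_object"]["obj_id"]
    return {
        obj_id
        for path in record["occlusion_paths"]
        for obj_id in path
        if obj_id != query_id
    }


record = json.loads(SAMPLE)
obstructors = obstructors_from_paths(record)
target_ids = {obj["obj_id"] for obj in record["target_objects"]}
print("obstructors from paths:", obstructors)
print("target object IDs:", target_ids)
```

For this sample the two sets coincide, which is consistent with `target_objects` listing the obstructing objects; whether that holds for every record is not stated here.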
    
    

Dataset Structure

UnoBench/
├── images.zip
├── images_som.zip
├── annotations.zip
├── test_GT_small.json
├── test_nlp_small.jsonl
├── test_som_small.jsonl
└── meta_data/
    ├── Synthetic_train.json
    ├── image_id_scene_view_id_mapping.json
    ├── name_for_all.json
    └── occ_info/
        ├── obs_information.json
        └── masks.zip
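After downloading, it can be useful to confirm that the local copy matches the tree above. The sketch below is one way to do that in Python; the list of paths is copied from the tree, while the function name `missing_files` and the default root `./UnoBench` are illustrative choices.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED = [
    "images.zip",
    "images_som.zip",
    "annotations.zip",
    "test_GT_small.json",
    "test_nlp_small.jsonl",
    "test_som_small.jsonl",
    "meta_data/Synthetic_train.json",
    "meta_data/image_id_scene_view_id_mapping.json",
    "meta_data/name_for_all.json",
    "meta_data/occ_info/obs_information.json",
    "meta_data/occ_info/masks.zip",
]


def missing_files(root):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # Report anything absent from a local download at ./UnoBench.
    for rel in missing_files("./UnoBench"):
        print("missing:", rel)
```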

File Description

Main Archives

| File | Description |
| --- | --- |
| images.zip | RGB images of synthetic cluttered scenes. |
| images_som.zip | Set-of-Mark images with object IDs / visual prompts. |
| annotations.zip | Instance segmentation masks associated with each image. The SoM ID of each object can be found here. |

Test Files

| File | Description |
| --- | --- |
| test_GT_small.json | Ground-truth obstruction annotations for the small test split. |
| test_nlp_small.jsonl | Evaluation samples for the NLP setting. |
| test_som_small.jsonl | Evaluation samples for the SoM setting. |

Metadata

| File | Description |
| --- | --- |
| meta_data/Synthetic_train.json | Metadata for the synthetic training split. |
| meta_data/image_id_scene_view_id_mapping.json | Mapping between image IDs, scene IDs, and view IDs. |
| meta_data/name_for_all.json | Human-annotated descriptions for each object used in the dataset. |
| meta_data/occ_info/obs_information.json | Obstruction / occlusion relationship information for each object pair, including obstruction ratio, contact point, and obstruction degree. |
| meta_data/occ_info/masks.zip | Instance masks for each obstruction pair. |

Download

You can download the full dataset with:

hf download rjiao/UnoBench \
  --repo-type dataset \
  --local-dir ./UnoBench

Or download individual files:

hf download rjiao/UnoBench images.zip \
  --repo-type dataset \
  --local-dir ./UnoBench

hf download rjiao/UnoBench images_som.zip \
  --repo-type dataset \
  --local-dir ./UnoBench

hf download rjiao/UnoBench annotations.zip \
  --repo-type dataset \
  --local-dir ./UnoBench

Extraction

After downloading, unzip the main archives and the obstruction mask archive:

cd UnoBench

unzip images.zip
unzip images_som.zip
unzip annotations.zip
unzip meta_data/occ_info/masks.zip -d meta_data/occ_info/

After extraction, the dataset should contain RGB images, Set-of-Mark images, annotation files, and obstruction-related metadata.
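On systems without `unzip`, the same extraction can be done with Python's standard `zipfile` module. The sketch below mirrors the commands above; the function name `extract_all` is hypothetical, and archives that are not present are simply skipped.

```python
import zipfile
from pathlib import Path

# (archive, destination) pairs relative to the dataset root,
# mirroring the unzip commands above.
ARCHIVES = [
    ("images.zip", "."),
    ("images_som.zip", "."),
    ("annotations.zip", "."),
    ("meta_data/occ_info/masks.zip", "meta_data/occ_info"),
]


def extract_all(root):
    """Extract each archive found under `root`; return the names extracted."""
    root = Path(root)
    extracted = []
    for archive, dest in ARCHIVES:
        archive_path = root / archive
        if not archive_path.is_file():
            continue  # skip archives that were not downloaded
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(root / dest)
        extracted.append(archive)
    return extracted


if __name__ == "__main__":
    print("extracted:", extract_all("./UnoBench"))
```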

Evaluation Settings

UnoBench provides two complementary evaluation settings to support different types of methods.

NLP Setting

In the NLP setting, each sample provides a natural-language instruction and the corresponding RGB image.

The method needs to identify the target object from language and visual input, then reason about which objects obstruct the target before grasping.

This setting is suitable for evaluating:

  • Vision-language models.
  • Language-conditioned robotic grasping methods.
  • End-to-end multimodal reasoning models.
  • Methods that jointly perform language grounding and visual reasoning.

Relevant file:

test_nlp_small.jsonl
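The sample data indicate that, in this setting, a target is expressed as a coordinate lying on the obstructing object. One plausible way to check a predicted point is to test whether it falls inside that object's instance mask. The sketch below illustrates the idea on a toy mask. It assumes points are `(x, y)` pixel coordinates and masks are row-major 2D arrays of 0/1 values; neither convention is confirmed here, and `point_hits_mask` is a hypothetical helper, not the benchmark's evaluation code.

```python
def point_hits_mask(point, mask):
    """Return True if the (x, y) pixel lies on a nonzero cell of a row-major mask."""
    x, y = int(round(point[0])), int(round(point[1]))
    height = len(mask)
    width = len(mask[0]) if height else 0
    if not (0 <= x < width and 0 <= y < height):
        return False  # points outside the image never count as hits
    return bool(mask[y][x])


# Toy 3x4 mask: the "object" occupies the two right-most columns of rows 0 and 1.
toy_mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]

print(point_hits_mask((2, 1), toy_mask))  # True: column 2, row 1 is on the object
print(point_hits_mask((0, 0), toy_mask))  # False: background pixel
```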

SoM Setting

In the SoM setting, each sample provides a Set-of-Mark image where objects are explicitly indexed with IDs.

This setting does not require natural-language input. Instead, the target and candidate objects can be represented through object IDs or structured object-level information.

This makes the SoM setting suitable for evaluating:

  • Vision-language models with object-indexed visual prompts.
  • Traditional perception-reasoning pipelines.
  • Graph-based obstruction reasoning algorithms.
  • Modular robotic grasping systems.
  • Methods that operate on object IDs, masks, or structured scene representations.

Relevant file:

test_som_small.jsonl
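Because the sample data express SoM answers as strings such as "Object 3", a model's free-form response usually has to be converted to integer IDs before comparison. The following sketch shows one way to do this with a regular expression. The pattern and the function name `parse_object_ids` are assumptions for illustration; the exact response format expected by the benchmark is not specified here.

```python
import re

# Matches "Object 3", "object 3", "Object3", and similar variants.
_ID_PATTERN = re.compile(r"object\s*(\d+)", re.IGNORECASE)


def parse_object_ids(text):
    """Extract integer object IDs from text, in order of first appearance."""
    seen = []
    for match in _ID_PATTERN.finditer(text):
        obj_id = int(match.group(1))
        if obj_id not in seen:
            seen.append(obj_id)
    return seen


print(parse_object_ids('["Object 3"]'))  # [3]
print(parse_object_ids("Remove object 5, then Object 3 and object 5."))  # [5, 3]
```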

Ground Truth

The ground-truth obstruction annotations used for evaluation are provided in:

test_GT_small.json
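The official metrics are not described in this section. As a generic illustration only, predicted obstructing-object IDs can be compared with ground-truth IDs using set-based precision, recall, F1, and exact match. The function name `set_scores` and the choice of metrics are assumptions, not the benchmark's scoring protocol.

```python
def set_scores(predicted, ground_truth):
    """Compare two collections of object IDs as sets.

    Returns precision, recall, F1, and whether the sets match exactly.
    If both sets are empty, the prediction is treated as fully correct.
    """
    predicted, ground_truth = set(predicted), set(ground_truth)
    if not predicted and not ground_truth:
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0, "exact_match": True}
    true_pos = len(predicted & ground_truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "exact_match": predicted == ground_truth,
    }


# One correct obstructor (3) plus one spurious prediction (4).
print(set_scores([3, 4], [3]))
```

In the example, the spurious ID lowers precision to 0.5 while recall stays at 1.0, so exact match is false.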

Metadata Format

The metadata files provide scene-level and object-level information, including image IDs, scene IDs, view IDs, object names, and obstruction relationships.

Typical fields may include:

scene_id
view_id
target_object
occlusion_paths
top_objects
depends_on
num_paths
difficulty

The obstruction information is organized in a target-centric manner. For each target object, the dataset describes the objects that obstruct it and the corresponding obstruction paths.
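Fields such as `difficulty` and `num_paths` make it easy to profile a split before running experiments. The sketch below counts records per difficulty label and per number of paths. It assumes the metadata has already been loaded as a list of dictionaries shaped like the sample record; the second toy record, including its "Hard" label, is invented purely for illustration, and `summarize` is a hypothetical helper.

```python
from collections import Counter


def summarize(records):
    """Count records per difficulty label and per number of occlusion paths."""
    by_difficulty = Counter(r.get("difficulty", "unknown") for r in records)
    by_num_paths = Counter(
        # Fall back to counting the paths if `num_paths` is absent.
        r.get("num_paths", len(r.get("occlusion_paths", [])))
        for r in records
    )
    return by_difficulty, by_num_paths


toy_records = [
    # Values taken from the sample record in this README.
    {"difficulty": "Easy", "num_paths": 1, "occlusion_paths": [[3, 2]]},
    # Invented record for illustration only.
    {"difficulty": "Hard", "occlusion_paths": [[5, 4, 1], [6, 1]]},
]

by_difficulty, by_num_paths = summarize(toy_records)
print(dict(by_difficulty))
print(dict(by_num_paths))
```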

Intended Use

UnoBench is designed for research on:

  • Obstruction-aware robotic grasping.
  • Vision-language reasoning for robotics.
  • Language-conditioned manipulation in cluttered scenes.
  • Object-level reasoning and scene understanding.
  • Set-of-Mark prompting for robotic perception.
  • Traditional and modular obstruction reasoning pipelines.
  • Graph-based reasoning for robotic manipulation.
  • Evaluation of high-level reasoning before low-level grasp execution.

Notes

UnoBench focuses on high-level obstruction reasoning before grasping. It is not limited to a specific low-level grasp planner or robot controller.

Researchers can use UnoBench to evaluate whether a method can correctly identify obstructing objects before executing a grasp. The benchmark can therefore be used together with different downstream grasping pipelines, including both learning-based and classical robotic systems.

Citation