Robotic grasping in cluttered environments remains a significant challenge due to occlusions and complex object arrangements. We have developed $\text{ThinkGrasp}$, a plug-and-play vision-language grasping system that leverages GPT-4o's contextual reasoning to plan grasping strategies in heavily cluttered environments. $\text{ThinkGrasp}$ can identify and generate grasp poses for target objects even when they are heavily obstructed or nearly invisible, using goal-oriented language to guide the removal of obstructing objects. This approach progressively uncovers the target object and ultimately grasps it in a few steps with a high success rate. In both simulated and real experiments, $\text{ThinkGrasp}$ achieved a high success rate and significantly outperformed state-of-the-art methods in heavily cluttered environments and with diverse unseen objects, demonstrating strong generalization capabilities.
We have developed a plug-and-play system for occlusion handling that efficiently combines visual and language information to assist robotic grasping. To improve reliability, we implemented a robust error-handling framework in which GPT-4o provides only the target object name while LangSAM and VLPart perform the image segmentation. This division of tasks ensures that errors from the language model do not propagate into the segmentation process, leading to higher success rates and safer grasp poses in diverse and cluttered environments.
Our system's modular design enables easy integration into various robotic platforms and grasping systems. It is compatible with 6-DoF two-finger grippers, demonstrating strong generalization capabilities. It quickly adapts to new language goals and novel objects through simple prompts, making it highly versatile and scalable.
Our system uses an iterative pipeline for grasping in cluttered environments. Given an initial RGB-D scene observation ($224{\times}224$ for simulation, $640{\times}480$ for real robot) and a natural language instruction:
First, GPT-4o performs "imagine segmentation", analyzing the scene and instruction to identify potential target objects or parts. GPT-4o suggests grasp locations by proposing specific points within a $3{\times}3$ grid, focusing on the safest and most advantageous parts for grasping.
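For reference, this GPT-4o step can be expressed as a single multimodal chat call. The sketch below is only illustrative: the prompt wording, the JSON response schema, and the `propose_target` helper are placeholders rather than the exact prompt used by the system, and it relies on the official `openai` Python client.

```python
import base64
import json

from openai import OpenAI  # official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def propose_target(rgb_png: bytes, instruction: str) -> dict:
    """Ask GPT-4o for the next object (or part) to grasp and a coarse grasp
    location, expressed as one cell of a 3x3 grid over that object.
    The prompt wording and JSON schema are illustrative, not the system's."""
    image_b64 = base64.b64encode(rgb_png).decode()
    prompt = (
        f"Instruction: {instruction}\n"
        "Name the single object or object part to grasp next, preferring the "
        "safest and most advantageous part. Also choose one cell of a 3x3 grid "
        "over that object (1 = top-left ... 9 = bottom-right) as the preferred "
        'grasp location. Answer as JSON: {"target": str, "grid_cell": int}.'
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```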
The system then uses either LangSAM or VLPart for segmentation, based on whether the target is an object or a part. GPT-4o adjusts its selections based on new visual input after each grasp, updating predictions for the target object and preferred grasping location.
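The choice between the two segmenters is a simple dispatch on whether GPT-4o named a whole object or an object part. A minimal sketch, assuming both models are wrapped behind a common `(image, prompt) -> mask` callable; the keyword heuristic is a placeholder, since the actual decision comes from GPT-4o's structured answer rather than string matching.

```python
from typing import Callable

import numpy as np

# Assumed common interface for both segmenters: (RGB image, text prompt) -> binary mask.
Segmenter = Callable[[np.ndarray, str], np.ndarray]

# Placeholder heuristic; the real decision comes from GPT-4o's answer.
PART_KEYWORDS = ("handle", "lid", "cap", "rim", "spout", "edge")


def segment_target(rgb: np.ndarray, target: str,
                   langsam: Segmenter, vlpart: Segmenter) -> np.ndarray:
    """Route whole-object prompts to LangSAM and part-level prompts to VLPart."""
    is_part = any(keyword in target.lower() for keyword in PART_KEYWORDS)
    return (vlpart if is_part else langsam)(rgb, target)
```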
To determine the optimal grasp pose, the system generates candidate poses from the cropped point cloud. We used GraspNet-1Billion for simulations and FGC-GraspNet for real-robot tests to ensure consistent results. Candidate poses are evaluated based on their proximity to the preferred location and their grasp quality scores.
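Candidate ranking can then be written as a weighted trade-off between the network's grasp-quality score and the distance from each candidate to the preferred 3D location; the linear combination and the `dist_weight` parameter below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def rank_grasps(centers: np.ndarray, scores: np.ndarray,
                preferred_xyz: np.ndarray, dist_weight: float = 1.0) -> np.ndarray:
    """Return candidate indices sorted best-first.

    centers:       (N, 3) grasp centers from GraspNet-1Billion / FGC-GraspNet.
    scores:        (N,)   grasp quality scores from the same network.
    preferred_xyz: (3,)   point back-projected from GPT-4o's chosen grid cell.
    """
    dists = np.linalg.norm(centers - preferred_xyz[None, :], axis=1)
    combined = scores - dist_weight * dists  # assumed trade-off; higher is better
    return np.argsort(-combined)
```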
This closed-loop process allows the system to adapt its strategy based on updated observations after each grasp attempt, effectively managing heavy clutter until the task is completed or the maximum number of iterations is reached.
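Putting the pieces together, the closed loop reads roughly as below. The `env` wrapper, its methods, and the termination check are assumed interfaces around the camera, the segmenters, the grasp network, and the arm; the helper functions reuse the sketches above.

```python
import cv2  # used only to PNG-encode the observation for GPT-4o


def run_episode(instruction: str, env, max_iters: int = 50) -> bool:
    """Iterate perceive -> reason -> segment -> grasp until the goal object is
    retrieved or the iteration budget is exhausted."""
    for _ in range(max_iters):
        rgb, depth = env.observe()                                # RGB-D observation
        png_bytes = cv2.imencode(".png", rgb)[1].tobytes()
        plan = propose_target(png_bytes, instruction)             # GPT-4o proposal
        mask = segment_target(rgb, plan["target"], env.langsam, env.vlpart)
        centers, scores = env.grasp_candidates(rgb, depth, mask)  # cropped point cloud
        preferred = env.cell_to_xyz(mask, depth, plan["grid_cell"])
        best = rank_grasps(centers, scores, preferred)[0]
        env.execute_grasp(best)                 # index into the candidate set (assumed)
        if env.goal_reached(instruction):       # target retrieved?
            return True
    return False
```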
We evaluated $\text{ThinkGrasp}$ in a simulation environment using a UR5 arm, a ROBOTIQ-85 gripper, and an Intel RealSense L515 camera. Images were resized to $224{\times}224$ pixels and segmented by LangSAM to obtain precise object masks. We compared our solution with Vision-Language Grasping (VLG) and OVGrasp, using the same GraspNet backbone for a fair comparison. Additionally, we tested the performance of directly using GPT-4o to select grasp points.
Clutter experiments involved tasks like grasping round objects and retrieving items for specific uses. Each test case was run 15 times and measured by Task Success Rate and Motion Number.
In heavy clutter scenarios, our system handled up to 30 unseen objects with up to 50 action attempts per run.
Our system significantly outperformed baselines in overall success rates and efficiency metrics, achieving an average success rate of 0.980, an average step count of 3.39, and an average success step count of 3.32 in clutter cases.
Ablation studies demonstrated the effectiveness of our system components, with each tested configuration confirming the importance of the corresponding part for overall performance.
We extended our system to real-world environments using a UR5 robotic arm, Robotiq 85 gripper, and RealSense D455 camera. Observations were processed using MoveIt and ROS on a workstation with a 12GB 2080Ti GPU. Our model, deployed via Flask on dual 3090 GPUs, provided grasp pose predictions within 10 seconds via the GPT-4o API.
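A minimal sketch of such a Flask service is shown below, assuming the server receives a JSON payload with the observation and instruction and returns a single 6-DoF pose; the route, payload format, and `plan_grasp` stub are placeholders rather than the released code.

```python
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)


def plan_grasp(rgb: np.ndarray, depth: np.ndarray, instruction: str) -> np.ndarray:
    """Placeholder: the deployed service would run the full pipeline here."""
    return np.zeros(7, dtype=np.float32)  # x, y, z, qx, qy, qz, qw


@app.route("/grasp", methods=["POST"])
def grasp():
    """Accept an RGB-D observation plus instruction and return one grasp pose."""
    payload = request.get_json()
    rgb = np.asarray(payload["rgb"], dtype=np.uint8)
    depth = np.asarray(payload["depth"], dtype=np.float32)
    pose = plan_grasp(rgb, depth, payload["instruction"])
    return jsonify({"position": pose[:3].tolist(), "orientation": pose[3:].tolist()})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```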
Real-world experiments showed that our system outperformed VL-Grasp, confirming the improvements introduced by our strategic part grasping and heavy clutter handling mechanisms.
Our results indicate a high success rate in identifying and grasping target objects, even in cluttered environments. Failures were primarily due to the limitations of single-image observations, low-quality grasp poses from the downstream grasp model, and variations in UR5 robot stability. Addressing these factors is crucial for further improving system performance.
I would like to extend my heartfelt thanks to Jie Fu, Dian Wang, and Hanhan Zhou for their invaluable support and insightful discussions during the early stages of this project. Their input helped me navigate through complex problems and refine my ideas.
Special appreciation goes to Mingfu Liang for his excellent advice on video production and pipeline design. His contributions greatly enhanced the clarity and effectiveness of our presentation.
I am deeply grateful to my friends who offered constant encouragement and support. Additionally, a warm thank you to my dog, Cookie, and my cat, Lucas, whose companionship and emotional support provided me with much-needed comfort and motivation throughout this journey.
@misc{qian2024thinkgrasp,
title={ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter},
author={Yaoyao Qian and Xupeng Zhu and Ondrej Biza and Shuo Jiang and Linfeng Zhao and Haojie Huang and Yu Qi and Robert Platt},
year={2024},
eprint={2407.11298},
archivePrefix={arXiv},
primaryClass={cs.RO}
}