Naming Objects for Vision-and-Language Manipulation

2023, arXiv (Cornell University)

https://doi.org/10.48550/ARXIV.2303.02871

Abstract

Robot manipulation tasks specified by natural language instructions require a common understanding of the target object between the human and the robot. However, such instructions are often ambiguous because they lack important information or do not describe the target object precisely enough to complete the task. To resolve this ambiguity, we hypothesize that "naming" the target objects in advance reduces the ambiguity of natural language instructions. We propose a robot system and method that associates names with the appearance of objects ahead of time, so that in a later manipulation task, an instruction can refer to an object by its unique name and disambiguate it easily. To demonstrate the effectiveness of our approach, we build a system that memorizes the target objects and show that naming the objects facilitates detection of the target objects and improves the success rate of manipulation instructions. With this method, the success rate of the object manipulation task increases by 31% on ambiguous instructions.
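
As a rough illustration of the naming idea described above, the sketch below registers objects under unique, human-given names together with an appearance embedding, and later resolves a named object in an instruction to the best-matching detected candidate. This is a minimal sketch under stated assumptions, not the authors' implementation: the class and method names (ObjectMemory, register, resolve) are hypothetical, and the random vectors stand in for learned appearance features.

```python
# Minimal sketch of name-based disambiguation (hypothetical, not the paper's code).
# Objects are memorized ahead of time as (name, appearance embedding) pairs;
# at instruction time, the named object is matched to the detected candidate
# with the highest cosine similarity.

import numpy as np


class ObjectMemory:
    def __init__(self):
        # name -> normalized appearance embedding captured during the naming phase
        self.entries: dict[str, np.ndarray] = {}

    def register(self, name: str, embedding: np.ndarray) -> None:
        """Memorize an object under a unique, human-given name."""
        self.entries[name] = embedding / np.linalg.norm(embedding)

    def resolve(self, name: str, candidate_embeddings: list[np.ndarray]) -> int:
        """Return the index of the detected candidate whose appearance best
        matches the memorized embedding of the named object."""
        target = self.entries[name]
        scores = [
            float(target @ (c / np.linalg.norm(c))) for c in candidate_embeddings
        ]
        return int(np.argmax(scores))


# Usage: name two visually similar cups, then disambiguate "pick up Max's cup".
memory = ObjectMemory()
memory.register("Max's cup", np.random.rand(128))        # placeholder features
memory.register("Anna's cup", np.random.rand(128))
detections = [np.random.rand(128), np.random.rand(128)]  # per-candidate features
best = memory.resolve("Max's cup", detections)
print(f"Grasp candidate {best}")
```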
