Visually Grounded Interaction and Language (ViGIL)

NeurIPS 2018 Workshop, Montreal, Canada

Friday, 7th December, 08:30 AM to 06:30 PM, Room: 512 CDGH


The dominant paradigm in modern natural language understanding is learning statistical language models from text-only corpora. This approach is founded on a distributional notion of semantics, i.e. that the "meaning" of a word is based only on its relationship to other words. While effective for many applications, methods in this family suffer from limited semantic understanding, as they miss learning from the multimodal and interactive environment in which communication often takes place - the symbols of language thus are not grounded in anything concrete. The symbol grounding problem first highlighted this limitation, that "meaningless symbols (i.e. words) cannot be grounded in anything but other meaningless symbols" [18].

On the other hand, humans acquire language by communicating about and interacting within a rich, perceptual environment. This behavior provides the necessary grounding for symbols, tying them to concrete objects or concepts (physical or psychological). Thus, recent work has aimed to bridge vision, interactive learning, and natural language understanding through language learning tasks based on natural images (ReferIt [1], GuessWhat?! [2], Visual Question Answering [3,4,5,6], Visual Dialog [7], Captioning [8]) or through embodied agents performing interactive tasks [13,14,17,22,23,24,26] in physically simulated environments (DeepMind Lab [9], Baidu XWorld [10], OpenAI Universe [11], House3D [20], Matterport3D [21], GIBSON [24], MINOS [25], AI2-THOR [19], StreetLearn [17]), often drawing on the recent successes of deep learning and reinforcement learning. We believe this line of research poses a promising, long-term solution to the grounding problem faced by current, popular language understanding models.

While machine learning research exploring visually-grounded language learning may be in its early stages, it may be possible to draw insights from the rich research literature on human language acquisition. In neuroscience, recent progress in fMRI technology has enabled a better understanding of the interplay between language, vision and other modalities [15,16], suggesting that the brain shares neural representations of concepts across vision and language. Separately, developmental cognitive scientists have argued that children's acquisition of words is closely linked to their learning of the underlying concepts in the real world [12].

This workshop thus aims to gather people from various backgrounds - machine learning, computer vision, natural language processing, neuroscience, cognitive science, psychology, and philosophy - to share and debate their perspectives on why grounding may (or may not) be important in building machines that truly understand natural language.

Important Dates

Paper Submission Deadline: November 1st, 2018 - Midnight Pacific Time
Final Decisions: November 9th, 2018
Workshop Date: December 7th, 2018

Call for Papers

We invite high-quality paper submissions on the following topics:

  • language acquisition or learning through interactions
  • visual captioning, dialog, and question-answering
  • reasoning in language and vision
  • visual synthesis from language
  • transfer learning in language and vision tasks
  • navigation in virtual worlds with natural-language instructions
  • machine translation with visual cues
  • novel tasks that combine language, vision and actions
  • understanding and modeling the relationship between language and vision in humans
  • semantic systems and modeling of natural language and visual stimuli representations in the human brain
  • audio visual scene-aware dialog systems - Visual Question Answering
  • image/video captioning
  • lip reading
  • audio-visual scene understanding - Sound localization
  • audio-visual speech processing
  • audio-visual fusion

Accepted papers will be presented during joint poster sessions, with exceptional submissions selected for spotlight oral presentation. Accepted papers will be made publicly available as non-archival reports, allowing future submissions to archival conferences or journals.

Submissions should be up to 4 pages excluding references, acknowledgements, and supplementary material, and should be anonymous and in NIPS format. The review process is double-blind.

We also welcome published papers that are within the scope of the workshop (without re-formatting). These papers do not have to be anonymous. They are eligible for poster sessions and will only have a very light review process.

Please submit your paper to the following address: If you have any questions, send an email to:

Accepted workshop papers are eligible for the pool of reserved conference tickets (one ticket per accepted paper).


Schedule

  • 08:30 AM : Opening Remarks
  • 08:40 AM : Invited Speaker 1: Stevan Harnad - The symbol grounding problem | Slides
  • 09:20 AM : Invited Speaker 2: Antonio Torralba - Learning to See and Hear
  • 10:00 AM : Audio Visual Semantic Understanding Challenge
  • 10:15 AM : Spotlights (6*2min)
    • A Distributional Semantic Model of Visually Indirect Grounding for Abstract Words - Akira Utsumi [pdf]
    • Visual Reasoning by Progressive Module Network - SeungWook Kim, Makarand Tapaswi, Sanja Fidler [pdf]
    • Generating Diverse Programs with Instruction Conditioned Reinforced Adversarial Learning - Aishwarya Agrawal, Mateusz Malinowski, Felix Hill, Ali Eslami, Oriol Vinyals, Tejas Kulkarni [pdf]
    • Learning to Caption Images by Asking Natural Language Questions - Kevin Shen, Amlan Kar, Sanja Fidler [pdf]
    • Generating Animated Videos of Human Activities from Natural Language Descriptions - Angela S. Lin, Lemeng Wu, Rodolfo Corona, Kevin Tai, Qixing Huang, Raymond J. Mooney [pdf]
    • Multimodal Abstractive Summarization for Open-Domain Videos - Jindrich Libovicky, Shruti Palaskar, Spandana Gella, Florian Metze [pdf]
  • 10:30 AM : Coffee Break (20min)
  • 10:50 AM : Invited Speaker 3: Douwe Kiela - Learning Multimodal Embeddings | Slides
  • 11:30 AM : Invited Speaker 4: Roozbeh Mottaghi - Interactive Scene Understanding
  • 12:10 PM : Poster Session (lunch provided!)
  • 01:10 PM : Break
  • 01:40 PM : Invited Speaker 5: Angeliki Lazaridou - Emergence of (linguistic) communication through multi-agent interactions
  • 02:20 PM : Invited Speaker 6: Barbara Landau - Learning simple spatial terms: Core and more
  • 03:00 PM : Coffee Break & Poster Session (50 mins)
  • 03:50 PM : Invited Speaker 7: Joyce Y. Chai - Language Communication with Robots
  • 04:30 PM : Invited Speaker 8: Christopher Manning - Towards real-world visual reasoning | Slides
  • 05:10 PM : Panel Discussion
  • 06:00 PM : Closing Remarks

Invited Speakers

Stevan Harnad is Professor of Cognitive Sciences in the Department of Psychology at Université du Québec à Montréal, and Professor of Web Science in the Department of Electronics and Computer Science, at University of Southampton, UK. His research is on categorisation, communication, cognition and the open research web. [Webpage]

Barbara Landau is Professor of Cognitive Science at Johns Hopkins University. Landau is interested in human knowledge of language and space, and the relationships between these two foundational systems of knowledge. [Webpage]

Joyce Chai is a Professor at Michigan State University. Her lab investigates language use in a variety of contexts and develops approaches for situated language processing to facilitate situated communication with artificial agents. [Webpage]

Christopher Manning is the inaugural Thomas M. Siebel Professor in Machine Learning in the Departments of Computer Science and Linguistics at Stanford University. His research goal is computers that can intelligently process, understand, and generate human language material. [Webpage]

Antonio Torralba is a Professor at Massachusetts Institute of Technology. He is interested in building systems that can perceive the world like humans do. Although his work focuses on computer vision, he is also interested in other modalities such as audition and touch. [Webpage]

Roozbeh Mottaghi is a Research Scientist at the Allen Institute for Artificial Intelligence. His research interests include computer vision, reasoning, natural language, and action. [Webpage]

Douwe Kiela is a research scientist at Facebook AI Research in New York. His research interests are in natural language processing, semantics, grounding, computer vision, deep learning and artificial general intelligence. [Webpage]

Angeliki Lazaridou is a research scientist at DeepMind. Her primary research interests are in the area of natural language processing (NLP), and specifically, in multimodal semantics. [Webpage]


The workshop was broadcast via BlueJeans. You can find the recordings here: Morning Session | Afternoon Session

Accepted Papers

  • Modelling Visual Properties and Visual Context in Multimodal Semantics - Christopher Davis, Luana Bulat, Anita Vero, Ekaterina Shutova [pdf]
  • Guiding Policies with Language via Meta-Learning - John D. Co-Reyes, Abhishek Gupta, Suvansh Sanjeev, Nick Altieri, John DeNero, Pieter Abbeel, Sergey Levine [pdf]
  • Language coverage and generalization in RNN-based continuous sentence embeddings for interacting agents - Luca Celotti, Simon Brodeur, Jean Rouat [pdf]
  • Scene Graph Parsing by Attention Graph - Martin Andrews, Yew Ken Chia, Sam Witteveen [pdf]
  • Visual Entailment Task for Visually-Grounded Language Learning - Ning Xie, Farley Lai, Derek Doran, Asim Kadav [pdf]
  • On transfer learning using a MAC model variant - Vincent Marois, T.S. Jayram, Vincent Albouy, Tomasz Kornuta, Younes Bouhadjar, Ahmet S. Ozcan [pdf]
  • Learning Capsule Networks with Images and Text - Yufei Feng, Xiaodan Zhu, Yifeng Li, Yuping Ruan, Michael Greenspan [pdf]
  • Multimodal Abstractive Summarization for Open-Domain Videos - Jindrich Libovicky, Shruti Palaskar, Spandana Gella, Florian Metze [pdf]
  • Generating Animated Videos of Human Activities from Natural Language Descriptions - Angela S. Lin, Lemeng Wu, Rodolfo Corona, Kevin Tai, Qixing Huang, Raymond J. Mooney [pdf]
  • Mixture of Regression Experts in fMRI Encoding - Subba Reddy Oota, Adithya Avvaru, Naresh Mawani, Raju S. Bapi [pdf]
  • TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments - Howard Chen, Alane Suhr, Dipendra Misra, Yoav Artzi [pdf]
  • Following Formulaic Map Instructions in a Street Simulation Environment - Volkan Cirik, Yuan Zhang, Jason Baldridge [pdf]
  • Keep Drawing It: Iterative language-based image generation and editing - Alaaeldin El-Nouby, Shikhar Sharma, Hannes Schulz, Devon Hjelm, Layla El Asri, Samira Ebrahimi Kahou, Yoshua Bengio, Graham W. Taylor [pdf]
  • Blindfold Baselines for Embodied QA - Ankesh Anand, Eugene Belilovsky, Kyle Kastner, Hugo Larochelle, Aaron Courville [pdf]
  • Object-oriented Targets for Visual Navigation using Rich Semantic Representations - Jean-Benoit Delbrouck, Stéphane Dupont [pdf]
  • CLEAR: A Dataset for Compositional Language and Elementary Acoustic Reasoning - Jerome Abdelnour, Giampiero Salvi, Jean Rouat [pdf]
  • Learning to Caption Images by Asking Natural Language Questions - Kevin Shen, Amlan Kar, Sanja Fidler [pdf]
  • Audio-Visual Fusion for Sentiment Classification using Cross-Modal Autoencoder - Sri Harsha Dumpala, Imran Sheikh, Rupayan Chakraborty, Sunil Kumar Kopparapu [pdf]
  • Compositional Hard Negatives for Visual Semantic Embeddings via an Adversary - Avishek Joey Bose, Huan Ling, Yanshuai Cao [pdf]
  • SARN: Relational Reasoning through Sequential Attention - Jinwon An, Sungwon Lyu, Sungzoon Cho [pdf]
  • Generating Diverse Programs with Instruction Conditioned Reinforced Adversarial Learning - Aishwarya Agrawal, Mateusz Malinowski, Felix Hill, Ali Eslami, Oriol Vinyals, Tejas Kulkarni [pdf]
  • Large-Scale Answerer in Questioner’s Mind for Visual Dialog Question Generation - Sang-Woo Lee, Tong Gao, Sohee Yang, Jaejun Yoo, Jung-Woo Ha [pdf]
  • Variational learning across domains with triplet information - Rita Kuznetsova, Oleg Bakhteev, Alexandr Ogaltsov [pdf]
  • Product Title Refinement via Multi-Modal Generative Adversarial Learning - Jian-Guo Zhang, Pengcheng Zou, Zhao Li, Yao Wan, Ye Liu, Xiuming Pan, Yu Gong, Philip S. Yu [pdf]
  • How2: A Large-scale Dataset for Multimodal Language Understanding - Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loic Barrault, Lucia Specia, Florian Metze [pdf]
  • Generating Diverse and Accurate Visual Captions by Comparative Adversarial Learning - Dianqi Li, Qiuyuan Huang, Xiaodong He, Lei Zhang, Ming-Ting Sun [pdf]
  • Zero-Shot Image Classification Guided by Natural Language Descriptions of Classes: A Meta-Learning Approach - R. Lily Hu, Caiming Xiong, Richard Socher [pdf]
  • Visual Reasoning by Progressive Module Network - SeungWook Kim, Makarand Tapaswi, Sanja Fidler [pdf]
  • Scene Graphs for Interpretable Video Anomaly Classification - Nicholas F. Y. Chen, Zhiyuan Du, Khin Hua Ng [pdf]
  • Learning Unsupervised Visual Grounding Through Semantic Self-Supervision - Syed Ashar Javed, Shreyas Saxena, Vineet Gandhi [pdf]
  • Towards Audio to Scene Image Synthesis using Generative Adversarial Network - Chia-Hung Wan, Shun-Po Chuang, Hung-Yi Lee [pdf]
  • Attend and Attack: Attention Guided Adversarial Attacks on Visual Question Answering Models - Vasu Sharma, Ankita Kalra, Vaibhav, Simral Chaudhary, Labhesh Patel, LP Morency [pdf]
  • An Interpretable Model for Scene Graph Generation - Ji Zhang, Kevin Shih, Andrew Tao, Bryan Catanzaro, Ahmed Elgammal [pdf]
  • A Corpus for Reasoning About Natural Language Grounded in Photographs - Alane Suhr, Stephanie Zhou, Iris Zhang, Huajun Bai, Yoav Artzi [pdf]
  • A Distributional Semantic Model of Visually Indirect Grounding for Abstract Words - Akira Utsumi [pdf]
  • Efficient Visual Dialog Policy Learning via Positive Memory Retention - Rui Zhao, Volker Tresp [pdf]
  • Choose Your Neuron: Incorporating Domain Knowledge through Neuron-Importance - Ramprasaath R. Selvaraju, Prithvijit Chattopadhyay, Mohamed Elhoseiny, Tilak Sharma, Dhruv Batra, Devi Parikh, Stefan Lee [pdf]
  • A Bayesian Approach to Phrase Understanding through Cross-Situational Learning - Amir Aly, Tadahiro Taniguchi, Daichi Mochihashi [pdf]
  • Embodied Question Answering in Photorealistic Environments with Point Cloud Perception - Erik Wijmans*, Samyak Datta*, Oleksandr Maksymets*, Abhishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, Dhruv Batra [pdf]
  • Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering - Medhini Narasimhan, Svetlana Lazebnik, Alexander Schwing [pdf]
  • DVQA: Understanding Data Visualizations via Question Answering - Kushal Kafle, Brian Price, Scott Cohen, Christopher Kanan [pdf]
  • Dialog-based Interactive Image Retrieval - Hui Wu, Xiaoxiao Guo, Rogerio Feris, Yu Cheng, Steven Rennie, Gerald Tesauro [pdf]
  • TallyQA: Answering Complex Counting Questions - Manoj Acharya, Kushal Kafle, Christopher Kanan [pdf]
  • Efficient Gradient Computation for Structured Output Learning with Rational and Tropical Losses - Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, Dmitry Storcheus, Scott Yang [pdf]
  • Systematic Generalization: What Is Required and Can It Be Learned? - Shikhar Murty, Dzmitry Bahdanau, Michael Noukhovitch, Thien Nguyen, Harm de Vries, Aaron Courville [pdf]
  • Incremental Object Model Learning from Multimodal Human-Robot Interactions - Pablo Azagra, Ana Cristina Murillo, Manuel Lopes, Javier Civera [pdf] - [supp]


Organizers

  • Florian Strub - University of Lille, Inria | DeepMind
  • Harm de Vries - University of Montreal
  • Erik Wijmans - Georgia Tech
  • Samyak Datta - Georgia Tech
  • Ethan Perez - New York University
  • Stefan Lee - Georgia Tech
  • Peter Anderson - Georgia Tech
  • Dhruv Batra - Georgia Tech | Facebook AI Research
  • Aaron Courville - University of Montreal
  • Olivier Pietquin - Google Brain
  • Devi Parikh - Georgia Tech | Facebook AI Research





References

  1. Sahar Kazemzadeh et al. "ReferItGame: Referring to Objects in Photographs of Natural Scenes." EMNLP, 2014.
  2. Harm de Vries et al. "GuessWhat?! Visual object discovery through multi-modal dialogue." CVPR, 2017.
  3. Stanislaw Antol et al. "VQA: Visual Question Answering." ICCV, 2015.
  4. Mateusz Malinowski et al. "Ask Your Neurons: A Neural-based Approach to Answering Questions about Images." ICCV, 2015.
  5. Mateusz Malinowski et al. "A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input." NIPS, 2014.
  6. Donald Geman et al. "Visual Turing test for computer vision systems." PNAS, 2015.
  7. Abhishek Das et al. "Visual Dialog." CVPR, 2017.
  8. Anna Rohrbach et al. "Generating Descriptions with Grounded and Co-Referenced People." CVPR, 2017.
  9. Charles Beattie et al. "DeepMind Lab." arXiv, 2016.
  10. Haonan Yu et al. "Guided Feature Transformation (GFT): A Neural Language Grounding Module for Embodied Agents." arXiv, 2018.
  11. OpenAI. "OpenAI Universe." 2016.
  12. Alison Gopnik et al. "Semantic and cognitive development in 15- to 21-month-old children." Journal of Child Language, 1984.
  13. Abhishek Das et al. "Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning." ICCV, 2017.
  14. Karl Moritz Hermann et al. "Grounded Language Learning in a Simulated 3D World." arXiv, 2017.
  15. Alexander G. Huth et al. "Natural speech reveals the semantic maps that tile human cerebral cortex." Nature, 2016.
  16. Alexander G. Huth et al. "Decoding the semantic content of natural movies from human brain activity." Frontiers in Systems Neuroscience, 2016.
  17. Piotr Mirowski et al. "Learning to Navigate in Cities Without a Map." arXiv, 2018.
  18. Stevan Harnad. "The symbol grounding problem." CNLS, 1989.
  19. E Kolve, R Mottaghi, D Gordon, Y Zhu, A Gupta, A Farhadi. "AI2-THOR: An Interactive 3D Environment for Visual AI." arXiv, 2017.
  20. Yi Wu et al. "House3D: A Rich and Realistic 3D Environment." arXiv, 2017.
  21. Angel Chang et al. "Matterport3D: Learning from RGB-D Data in Indoor Environments." arXiv, 2017.
  22. Abhishek Das et al. "Embodied Question Answering." CVPR, 2018.
  23. Peter Anderson et al. "Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments." CVPR, 2018.
  24. Fei Xia et al. "Gibson Env: Real-World Perception for Embodied Agents." CVPR, 2018.
  25. Manolis Savva et al. "MINOS: Multimodal indoor simulator for navigation in complex environments." arXiv, 2017.
  26. Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, Ali Farhadi. "IQA: Visual Question Answering in Interactive Environments." CVPR, 2018.