Visually Grounded Interaction and Language (ViGIL)

NIPS 2018 Workshop, Montreal, Canada

Friday, 7th December, 08:00 AM to 06:30 PM, Room: TBA


The dominant paradigm in modern natural language understanding is learning statistical language models from text-only corpora. This approach is founded on a distributional notion of semantics, i.e. that the "meaning" of a word is based only on its relationship to other words. While effective for many applications, methods in this family suffer from limited semantic understanding, as they miss learning from the multimodal and interactive environment in which communication often takes place - the symbols of language are thus not grounded in anything concrete. The symbol grounding problem first highlighted this limitation, that "meaningless symbols (i.e. words) cannot be grounded in anything but other meaningless symbols" [18].

On the other hand, humans acquire language by communicating about and interacting within a rich, perceptual environment. This behavior provides the necessary grounding for symbols, tying them to concrete objects or concepts, whether physical or psychological. Thus, recent work has aimed to bridge vision, interactive learning, and natural language understanding through language learning tasks based on natural images (ReferIt [1], GuessWhat?! [2], Visual Question Answering [3,4,5,6], Visual Dialog [7], Captioning [8]) or through embodied agents performing interactive tasks [13,14,17,22,23,24,26] in physically simulated environments (DeepMind Lab [9], Baidu XWorld [10], OpenAI Universe [11], House3D [20], Matterport3D [21], GIBSON [24], MINOS [25], AI2-THOR [19], StreetLearn [17]), often drawing on the recent successes of deep learning and reinforcement learning. We believe this line of research poses a promising, long-term solution to the grounding problem faced by current, popular language understanding models.

While machine learning research exploring visually-grounded language learning may be in its early stages, it may be possible to draw insights from the rich research literature on human language acquisition. In neuroscience, recent progress in fMRI technology has enabled a better understanding of the interplay between language, vision and other modalities [15,16], suggesting that the brain shares neural representations of concepts across vision and language. Separately, developmental cognitive scientists have argued that children's acquisition of words is closely linked to their learning the underlying concepts in the real world [12].

This workshop thus aims to gather people from various backgrounds - machine learning, computer vision, natural language processing, neuroscience, cognitive science, psychology, and philosophy - to share and debate their perspectives on why grounding may (or may not) be important in building machines that truly understand natural language.

Important Dates

Paper Submission Deadline: November 1st, 2018 - Midnight Pacific Time
Final Decisions: November 9th, 2018
Workshop Date: December 7th, 2018

Call for Papers

We invite high-quality paper submissions on the following topics:

  • language acquisition or learning through interactions
  • visual captioning, dialog, and question-answering
  • reasoning in language and vision
  • visual synthesis from language
  • transfer learning in language and vision tasks
  • navigation in virtual worlds with natural-language instructions
  • machine translation with visual cues
  • novel tasks that combine language, vision and actions
  • understanding and modeling the relationship between language and vision in humans
  • semantic systems and modeling of natural language and visual stimuli representations in the human brain
  • audio-visual scene-aware dialog systems - Visual Question Answering
  • image/video captioning
  • lip reading
  • audio-visual scene understanding - Sound localization
  • audio-visual speech processing
  • audio-visual fusion

Accepted papers will be presented during joint poster sessions, with exceptional submissions selected for spotlight oral presentation. Accepted papers will be made publicly available as non-archival reports, allowing future submissions to archival conferences or journals.

Submissions should be up to 4 pages excluding references, acknowledgements, and supplementary material, and should be anonymous and in NIPS format. The review process is double-blind.

We also welcome published papers that are within the scope of the workshop (without re-formatting). These papers do not have to be anonymous. They are eligible for poster sessions and will undergo only a light review process.

Please submit your paper to the following address: If you have any questions, send an email to:

Accepted workshop papers are eligible for the pool of reserved conference tickets (one ticket per accepted paper).


Schedule

  • 08:30 AM : Opening Remarks
  • 08:40 AM : Invited Speaker 1: Stevan Harnad
  • 09:20 AM : Invited Speaker 2: Antonio Torralba
  • 10:00 AM : Audio Visual Semantic Understanding Challenge
  • 10:15 AM : Spotlights (2*7min)
  • 10:30 AM : Coffee Break (20min)
  • 10:50 AM : Invited Speaker 3: Douwe Kiela
  • 11:30 AM : Invited Speaker 4: Ali Farhadi
  • 12:10 PM : Poster Session
  • 01:10 PM : Break
  • 01:40 PM : Invited Speaker 5: Angeliki Lazaridou
  • 02:20 PM : Invited Speaker 6: Barbara Landau - Learning simple spatial terms: Core and more
  • 03:00 PM : Coffee Break & Poster Session (50 mins)
  • 03:50 PM : Invited Speaker 7: Joyce Y. Chai - Language Communication with Robots
  • 04:30 PM : Invited Speaker 8: Christopher Manning
  • 05:10 PM : Panel Discussion
  • 06:00 PM : Closing Remarks

Invited Speakers

Stevan Harnad is Professor of Cognitive Sciences in the Department of Psychology at Université du Québec à Montréal, and Professor of Web Science in the Department of Electronics and Computer Science at the University of Southampton, UK. His research is on categorisation, communication, cognition and the open research web. [Webpage]

Barbara Landau is Professor of Cognitive Science at Johns Hopkins University. Landau is interested in human knowledge of language and space, and the relationships between these two foundational systems of knowledge. [Webpage]

Joyce Chai is a Professor at Michigan State University. Her lab investigates language use in a variety of contexts and develops approaches for situated language processing to facilitate situated communication with artificial agents. [Webpage]

Christopher Manning is the inaugural Thomas M. Siebel Professor in Machine Learning in the Departments of Computer Science and Linguistics at Stanford University. His research goal is computers that can intelligently process, understand, and generate human language material. [Webpage]

Antonio Torralba is a Professor at the Massachusetts Institute of Technology. He is interested in building systems that can perceive the world like humans do. Although his work focuses on computer vision, he is also interested in other modalities such as audition and touch. [Webpage]

Ali Farhadi is an Associate Professor in the Department of Computer Science and Engineering at the University of Washington. He also leads the project Plato at the Allen Institute for Artificial Intelligence. He is mainly interested in computer vision, machine learning, the intersection of natural language and vision, analysis of the role of semantics in visual understanding, and visual reasoning. [Webpage]

Douwe Kiela is a research scientist at Facebook AI Research in New York. His research interests are in natural language processing, semantics, grounding, computer vision, deep learning and artificial general intelligence. [Webpage]

Angeliki Lazaridou is a research scientist at DeepMind. Her primary research interests are in the area of natural language processing (NLP), and specifically, in multimodal semantics. [Webpage]


Organizers

  • Florian Strub (University of Lille, Inria | DeepMind)
  • Harm de Vries (University of Montreal)
  • Erik Wijmans (Georgia Tech)
  • Samyak Datta (Georgia Tech)
  • Ethan Perez (New York University)
  • Stefan Lee (Georgia Tech)
  • Peter Anderson (Georgia Tech)
  • Dhruv Batra (Georgia Tech | Facebook AI Research)
  • Aaron Courville (University of Montreal)
  • Olivier Pietquin (Google Brain)
  • Devi Parikh (Georgia Tech | Facebook AI Research)





References

  1. Sahar Kazemzadeh et al. "ReferItGame: Referring to Objects in Photographs of Natural Scenes." EMNLP, 2014.
  2. Harm de Vries et al. "GuessWhat?! Visual object discovery through multi-modal dialogue." CVPR, 2017.
  3. Stanislaw Antol et al. "VQA: Visual Question Answering." ICCV, 2015.
  4. Mateusz Malinowski et al. "Ask Your Neurons: A Neural-based Approach to Answering Questions about Images." ICCV, 2015.
  5. Mateusz Malinowski et al. "A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input." NIPS, 2014.
  6. Donald Geman et al. "Visual Turing test for computer vision systems." PNAS, 2015.
  7. Abhishek Das et al. "Visual Dialog." CVPR, 2017.
  8. Anna Rohrbach et al. "Generating Descriptions with Grounded and Co-Referenced People." CVPR, 2017.
  9. Charles Beattie et al. "DeepMind Lab." arXiv, 2016.
  10. Haonan Yu et al. "Guided Feature Transformation (GFT): A Neural Language Grounding Module for Embodied Agents." arXiv, 2018.
  11. OpenAI. "OpenAI Universe." 2016.
  12. Alison Gopnik et al. "Semantic and cognitive development in 15- to 21-month-old children." Journal of Child Language, 1984.
  13. Abhishek Das et al. "Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning." ICCV, 2017.
  14. Karl Moritz Hermann et al. "Grounded Language Learning in a Simulated 3D World." arXiv, 2017.
  15. Alexander G. Huth et al. "Natural speech reveals the semantic maps that tile human cerebral cortex." Nature, 2016.
  16. Alexander G. Huth et al. "Decoding the semantic content of natural movies from human brain activity." Frontiers in Systems Neuroscience, 2016.
  17. Piotr Mirowski et al. "Learning to Navigate in Cities Without a Map." arXiv, 2018.
  18. Stevan Harnad. "The symbol grounding problem." CNLS, 1989.
  19. Eric Kolve et al. "AI2-THOR: An Interactive 3D Environment for Visual AI." arXiv, 2017.
  20. Yi Wu et al. "House3D: A Rich and Realistic 3D Environment." arXiv, 2017.
  21. Angel Chang et al. "Matterport3D: Learning from RGB-D Data in Indoor Environments." arXiv, 2017.
  22. Abhishek Das et al. "Embodied Question Answering." CVPR, 2018.
  23. Peter Anderson et al. "Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments." CVPR, 2018.
  24. Fei Xia et al. "Gibson Env: Real-World Perception for Embodied Agents." CVPR, 2018.
  25. Manolis Savva et al. "MINOS: Multimodal indoor simulator for navigation in complex environments." arXiv, 2017.
  26. Daniel Gordon et al. "IQA: Visual Question Answering in Interactive Environments." CVPR, 2018.