AI Talk: Visual common sense

July 15th, 2020 / By V. “Juggy” Jagannathan, PhD

This week’s AI Talk…

We have all heard the phrase: “A picture is worth a thousand words.” I saw that literally come to life last week! The occasion was the 58th annual meeting of the Association for Computational Linguistics, held virtually this year. Professor Yejin Choi, from the University of Washington and the Allen Institute for Artificial Intelligence, gave a keynote address at a workshop with “grand” in its title: “The second grand challenge and workshop on Human Multimodal Language.”

So, what was the talk about? This example gives a pretty accurate picture: “Given an image of a man struggling to stay afloat in water, we can reason that the man fell into the water sometime in the past; the intent of that man at the moment is to stay alive, and he will need help in the near future or else he will get washed away.” The researchers from the Allen Institute have built a system called VisualComet to come up with inferences from images, as suggested by the above example.

Making computers display common sense has been the holy grail for AI scientists for a very long time. In 1984, Doug Lenat, a researcher at MCC, created CYC, one of the first explicitly curated knowledge bases to encode common sense. It uses a logic-based representation of commonsense facts and concepts. This was long before modern machine learning took center stage. CYC and its successor, OpenCyc, now have over 2 million facts related to over 200,000 concepts.

Fast forward to a few years ago: The same team from the Allen Institute mentioned above came up with a system called COMeT. Check out this excellent article in Quanta Magazine about the work done by Professor Choi and her team on this front. The main insight is that one should be able to leverage the advances in deep learning, particularly the spectacular success of a new generation of “Transformers” that crunch away on millions of text documents, to create a model that can generate text. But how can one get these models to generate commonsense inferences? That’s the breakthrough! They created a new representation scheme called ATOMIC, which relies on natural language to encode everyday inferential knowledge about events. A crowdsourcing strategy was deployed to collect 300,000 events, which were then linked by 877,000 inferential relations. The linkages were represented as a graph, and the team used this dataset to train a model.
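To make the idea of events linked by inferential relations a little more concrete, here is a minimal Python sketch of such a graph. It is only an illustration: the class name, the three relation labels and the example inferences are my own assumptions, not the actual ATOMIC schema or data. The last method shows one plausible way a graph like this could be flattened into (input, target) text pairs for training a text-generation model.

```python
from collections import defaultdict

# Illustrative relation labels (assumed for this sketch, not ATOMIC's own names).
BEFORE = "before"
ATTRIBUTE = "attribute"
AFTER = "after"


class InferenceGraph:
    """Events connected to free-text inferences by labeled relations."""

    def __init__(self):
        # Maps (event, relation) to the list of inferences attached to it.
        self._edges = defaultdict(list)

    def add(self, event, relation, inference):
        """Link an event to an inference under the given relation."""
        self._edges[(event, relation)].append(inference)

    def infer(self, event, relation):
        """Return the stored inferences for an event and relation."""
        return list(self._edges.get((event, relation), []))

    def training_pairs(self):
        """Flatten the graph into (input text, target text) pairs."""
        pairs = []
        for (event, relation), inferences in self._edges.items():
            for inference in inferences:
                pairs.append((f"{event} [{relation}]", inference))
        return pairs


def build_example_graph():
    """Build a tiny graph with made-up entries around a golf-course event."""
    graph = InferenceGraph()
    event = "PersonX goes to a golf course"
    graph.add(event, BEFORE, "PersonX wants to have fun")
    graph.add(event, ATTRIBUTE, "PersonX is sporty")
    graph.add(event, AFTER, "PersonX plays a round of golf")
    return graph
```

Calling `build_example_graph().training_pairs()` yields one text pair per edge, which is the shape of data a generative model could be fine-tuned on.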

How does the model perform? You can go to this website and check for yourself. I typed in “I went to a golf course,” and it came up with a range of inferences about what could have happened before this event, possible attributes of the person involved and what could happen afterwards. Most were reasonable, like “Person x wants to have fun,” while some were a bit bizarre, like “Then golf ball hits a home run!” Clearly they have some way to go here, but this is definitely a promising direction and, in a few years, it has achieved a degree of sophistication that OpenCyc never achieved after three decades of trying.

The most recent work by this group, VisualComet, takes the COMeT effort to pictures. How can one infer commonsense notions from pictures? They took a similar tack and created a training corpus using crowdsourcing. Each volunteer who signed up to annotate pictures was paid $15/hour, which worked out to $4 per image. Some 60,000 images were annotated this way, each with a descriptive summary. The annotators’ task was to divine what could have happened before the event depicted in the image, infer what is likely happening during the event and postulate what happens afterwards. As in the case of COMeT, these textual descriptions were captured as graphs. The images were mostly drawn from movies, and to help weed out silly inferences such as “before, Person1 needed to be born,” the annotators were shown a few frames from the movie before the event and a few afterwards. You can see the progress they have made with this effort here.
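For readers who like to see the shape of the data, here is a rough Python sketch of what one annotated image might look like, using the drowning-man example from earlier in this post. The class, field names and image identifier are hypothetical and not taken from the VisualComet release. The two helper functions are back-of-the-envelope arithmetic that takes the $15/hour, $4 per image and 60,000 image figures above at face value.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ImageAnnotation:
    """One annotated image (hypothetical layout, for illustration only)."""
    image_id: str
    summary: str
    before: List[str] = field(default_factory=list)  # what likely happened earlier
    during: List[str] = field(default_factory=list)  # what is likely happening now
    after: List[str] = field(default_factory=list)   # what likely happens next

    def as_edges(self):
        """Flatten the annotation into (image_id, relation, text) graph edges."""
        edges = []
        for relation in ("before", "during", "after"):
            for text in getattr(self, relation):
                edges.append((self.image_id, relation, text))
        return edges


# Figures quoted in the post; treated here as exact for a rough estimate.
HOURLY_RATE_USD = 15
PAY_PER_IMAGE_USD = 4
IMAGES_ANNOTATED = 60_000


def minutes_per_image():
    """Annotation time per image implied by the hourly and per-image pay."""
    return PAY_PER_IMAGE_USD * 60 / HOURLY_RATE_USD


def total_annotation_cost_usd():
    """Rough total spend if every image cost the quoted per-image amount."""
    return PAY_PER_IMAGE_USD * IMAGES_ANNOTATED


example = ImageAnnotation(
    image_id="frame_0001",  # made-up identifier
    summary="A man struggles to stay afloat in water",
    before=["the man fell into the water"],
    during=["the man wants to stay alive"],
    after=["the man will need help"],
)
```

Taken at face value, those figures imply roughly 16 minutes of annotator time per image and a total on the order of $240,000, though the real numbers surely varied from image to image.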

We still have a long way to go before computers display true common sense, but the progress made by this group suggests that fresh approaches utilizing neural computing can go a long way toward making this dream a reality sooner rather than later.

I am always looking for feedback, and if you would like me to cover a story, please let me know. “See something, say something!” Leave me a comment below or ask a question on my blogger profile page.

V. “Juggy” Jagannathan, PhD, is Director of Research for 3M M*Modal and is an AI Evangelist with four decades of experience in AI and Computer Science research.