Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions

¹The Hong Kong Polytechnic University  ²Case Western Reserve University
NeurIPS 2024 (Oral)
*Indicates Equal Contribution

We introduce the YesBut dataset to examine VLMs' capability in understanding humor, with a specific emphasis on humor derived from contrasting narratives (juxtaposition). (Comic by Anton Gudim)

Abstract

Recent advancements in large multimodal language models have demonstrated remarkable proficiency across a wide range of tasks. Yet, these models still struggle with understanding the nuances of human humor through juxtaposition, particularly when it involves the nonlinear narratives that underpin many jokes and humor cues. This paper investigates this challenge by focusing on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capabilities in recognizing and interpreting these comics, ranging from literal content comprehension to deep narrative reasoning. Through extensive experimentation and analysis of recent commercial and open-source large (vision) language models, we assess their ability to comprehend the complex narrative interplay that underlies the humor in these comics. Our results show that even state-of-the-art models still lag behind human performance on this task. Our findings offer insights into the current limitations and potential improvements for AI in understanding human creative expressions.

YesBut Dataset Overview

Our benchmark consists of YesBut comics featuring contradictory narratives. Specifically, each sample includes:
(1) a two-panel comic that forms a narrative with an inherent contradiction;
(2) a literal description of the comic narrative;
(3) an explanation of the contradiction within the narrative;
(4) the deep philosophy or underlying message the comic aims to convey;
(5) a title for the comic.
Based on these components, we construct various tasks for comic understanding.
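
Concretely, one sample can be pictured as the following record. This is a minimal sketch in Python: the field names and the example values are illustrative, not the dataset's actual schema.

from dataclasses import dataclass

@dataclass
class YesButSample:
    comic_image: str          # the two-panel comic (path or URL)
    literal_description: str  # (2) literal narrative of both panels
    contradiction: str        # (3) explanation of the humorous contradiction
    philosophy: str           # (4) deeper message the comic conveys
    title: str                # (5) comic title

# Invented example for illustration only.
sample = YesButSample(
    comic_image="comics/0001.png",
    literal_description='"Yes" panel: a runner cheers at a marathon finish line; '
                        '"But" panel: he takes the elevator up one floor.',
    contradiction="He endures 42 km for sport yet avoids a single flight of stairs.",
    philosophy="Fitness as achievement can coexist with everyday laziness.",
    title="Finish Line",
)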

Data Construction Pipeline

[Figure: data construction pipeline]


For each comic, we annotate the corresponding literal description, contradiction explanation, underlying philosophy, and comic title. The annotation process consists of two key stages: a human-AI collaborative annotation stage (steps 1 & 2) followed by a quality check and cross-verification stage (step 3). Gold-standard annotations are primarily obtained through human annotators. ('Pos' and 'Neg' in the figure denote the positive and negative options, respectively.)
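
The flow can be summarized by the sketch below; vlm_draft, human_review, and cross_verify are hypothetical stand-ins for the actual tools and annotators, not parts of our released code.

def annotate(comic_image: str, vlm_draft, human_review, cross_verify) -> dict:
    # Stage 1 (steps 1 & 2): human-AI collaboration. A model drafts the
    # literal description; a human annotator corrects it and adds the
    # contradiction explanation, underlying philosophy, and title.
    annotation = human_review(comic_image, vlm_draft(comic_image))

    # Stage 2 (step 3): a second annotator cross-verifies the result;
    # samples that fail the quality check are re-annotated.
    while not cross_verify(comic_image, annotation):
        annotation = human_review(comic_image, vlm_draft(comic_image))
    return annotation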

Evaluating Large Models' Understanding of Humor in Juxtaposition: Task Designs from Our Paper

We aim to evaluate the capabilities of recent large (vision) language models in understanding humor through contradictions. This is challenging because it requires both social reasoning about human events and nonlinear logical reasoning about the narratives, going beyond a literal understanding of the comic. We design a series of tasks that require different levels of narrative understanding and reasoning ability to evaluate the models' performance in reading comics.

[Figure: overview of the task designs]

Navigating VLM Failures: Lessons and Future Pathways

  • Visual Misinterpretation: The model incorrectly interprets the image contents.

    => This highlights the need for future research to improve models’ visual interpretation capabilities.

  • In-depth Reasoning about the Relationship: Models also struggle to reason in depth about the relationship between the two panels by recognizing their differences and similarities.


    => Future work might incorporate recent advanced reasoning approaches (e.g., multi-agent debate [68], test-time compute scaling [69]) to further improve model performance.

  • Hallucination and Incorrect Association: Models may also hallucinate details that are not in the image or draw incorrect associations between the two panels.


    => This suggests the need for improving world knowledge and social understanding abilities to enhance model performance on this task.

Potential Applications

  • VLM / LLM Evaluation
    As a benchmark, this dataset can be used to evaluate the reasoning, comic understanding, and humor understanding abilities of a vision language model; this is how we evaluate VLMs' humor understanding in our paper (see the sketch below).

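    A minimal sketch of such an evaluation loop follows; query_vlm is a hypothetical wrapper around whichever VLM API is being tested, and the prompt wording is illustrative rather than the exact prompt used in our paper.

    def evaluate(samples, query_vlm) -> float:
        # Accuracy on multiple-choice questions about each comic's contradiction.
        # Each sample holds a comic image, candidate options, and a gold label
        # such as "B"; query_vlm(image, text) returns the model's raw answer.
        correct = 0
        for s in samples:
            prompt = (
                "Which option best explains the humorous contradiction in this "
                "two-panel comic? Answer with the option letter.\n"
                + "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(s["options"]))
            )
            answer = query_vlm(image=s["comic_image"], text=prompt)
            if answer.strip().upper().startswith(s["label"]):
                correct += 1
        return correct / len(samples)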

  • Generative task
    In the future, we intend to explore more deeply how AI can not only interpret but also creatively engage with content. This includes generating pivotal turning points from one perspective and creating counterpoints to given scenarios, such as generating the counterpart of a "YES" image (see the sketch below).

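    As a hypothetical illustration of this direction, one could prompt a model with the "Yes" panel's description and ask it to invent a contradicting "But" panel; the prompt below is made up for illustration and is not taken from our paper.

    # Hypothetical prompt for counterpart generation (illustration only).
    COUNTERPART_PROMPT = (
        'Here is the first ("Yes") panel of a two-panel comic:\n'
        "{yes_description}\n\n"
        'Write a second ("But") panel that humorously contradicts it, '
        "in one sentence."
    )

    def build_counterpart_prompt(yes_description: str) -> str:
        return COUNTERPART_PROMPT.format(yes_description=yes_description)

    print(build_counterpart_prompt("A man proudly waters a tiny plant on his desk."))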

  • VLM image understanding
    We will explore in more depth how VLMs understand these images and how to improve their comprehension of such humorous content, for example by reducing the hallucinations seen in our samples through stronger reasoning and by improving VLMs' grasp of the images' deep semantics.


Ethics Statement

  • Copyright and License
    All data samples are sourced from publicly available content on social media platforms. We ensure copyright compliance by linking to the original comics rather than redistributing them. In addition, we obtained permission from the original artists (e.g., Anton Gudim and Liz Climo) to build our benchmark on these public images. We also commit to open-sourcing our annotated benchmark with the corresponding link for each comic image, and we diligently review samples to filter out potentially offensive or harmful content.

  • The Large Vision Language Models
    The VLMs utilized in our experiments are pretrained using diverse web corpora, which may introduce biases in their outputs. We advise users to conscientiously evaluate the ethical implications of generated outputs when employing them in future research endeavors.

  • Data Annotation
    Eight human judges are engaged in our annotation process. We compensate these judges with an average hourly wage of $11, ensuring fair remuneration for their contributions.

Citation

If you find our work helpful, please consider citing us:

@article{hu2024cracking,
  title={Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions},
  author={Hu, Zhe and Liang, Tuo and Li, Jing and Lu, Yiren and Zhou, Yunlai and Qiao, Yiran and Ma, Jing and Yin, Yu},
  journal={arXiv preprint arXiv:2405.19088},
  year={2024}
}