[YesBut-v2] When ‘YES’ Meets ‘BUT’: Can AI Comprehend Contradictory Humor Through Comparative Reasoning?

Tuo Liang*¹,
Zhe Hu*², Jing Li², Hao Zhang¹, Yiren Lu¹, Yunlai Zhou¹, Yiran Qiao¹, Disheng Liu¹, Jeirui Peng¹, Jing Ma¹, Yu Yin¹ ✉

¹Case Western Reserve University
²The Hong Kong Polytechnic University

*Indicates Equal Contribution


YESBUT_v2 Dataset

In previous work, we introduced YesBut-v1 to examine VLMs' capability to understand humor, with a specific emphasis on humor derived from contrasting narratives. However, that version had some limitations. In YesBut-v2, we expand the original YESBUT dataset from 349 to 1,262 images to enhance its diversity and robustness. We also conduct more comprehensive and fine-grained analyses to better understand model performance. Finally, we propose a simple yet effective pipeline to improve VLMs’ ability to comprehend humor in juxtaposed comics.

Abstract

Understanding humor, especially when it involves complex and contradictory narratives, remains a significant challenge for large vision-language models (VLMs). This limitation hinders AI’s ability to engage in human-like reasoning and cultural expression. In this paper, we investigate this challenge through an in-depth analysis of comics that juxtapose panels to create humor through contradictions. We introduce YESBUT, a novel benchmark with 1,262 comic images from diverse multilingual and multicultural contexts, featuring comprehensive annotations that capture various aspects of narrative understanding. Using this benchmark, we systematically evaluate a wide range of VLMs through four complementary tasks spanning from surface content comprehension to deep narrative reasoning. Our extensive experiments reveal that even the most advanced models significantly underperform compared to humans, with common failures in visual perception, key element identification, and hallucinations. We further investigate text-based training strategies and social knowledge augmentation methods to enhance model performance. Our findings not only highlight critical weaknesses in VLMs’ understanding of cultural and creative expressions but also provide pathways toward developing context-aware models capable of deeper narrative reasoning.

YesBut Dataset Overview

Our benchmark consists of YesBut comics featuring contradictory narratives. Specifically, each sample includes:
(1) a two-panel comic that contains a contradictory narrative;
(2) a literal description of the comic narratives;
(3) an explanation that illustrates the contradiction within the narrative;
(4) the underlying symbolism or message conveyed by the comic;
(5) a title of the comic;
(6) additional features, including social knowledge and linguistic context necessary for interpreting the comic.
Based on these components, we construct various tasks and analyses for comic understanding.
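The six components listed above can be pictured as a single record per comic. The following is a minimal sketch only: the class name `YesButSample`, its field names, and the placeholder values are illustrative assumptions and not the released dataset schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class YesButSample:
    """One annotated comic. Field names are hypothetical, not the released schema."""

    image_path: str            # (1) two-panel comic with a contradictory narrative
    literal_description: str   # (2) literal description of the comic narratives
    contradiction: str         # (3) explanation of the contradiction
    symbolism: str             # (4) underlying symbolism or message
    title: str                 # (5) title of the comic
    # (6) additional features needed to interpret the comic
    social_knowledge: List[str] = field(default_factory=list)
    linguistic_context: str = ""

    def missing_fields(self) -> List[str]:
        """Return the names of required text fields that are empty."""
        required = {
            "image_path": self.image_path,
            "literal_description": self.literal_description,
            "contradiction": self.contradiction,
            "symbolism": self.symbolism,
            "title": self.title,
        }
        return [name for name, value in required.items() if not value.strip()]


# Placeholder content, invented purely for illustration.
sample = YesButSample(
    image_path="comics/example.png",
    literal_description="Left panel: <what is drawn>. Right panel: <what is drawn>.",
    contradiction="<how the right panel contradicts the left panel>",
    symbolism="<the message the comic conveys>",
    title="<short title>",
    social_knowledge=["<a social norm the reader is expected to know>"],
    linguistic_context="<language of any text in the comic>",
)
print(sample.missing_fields())
```

Keeping the additional features optional reflects that not every comic needs social knowledge to be interpreted; this is a design assumption of the sketch.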

Dataset Statistics and Attribute Distribution

Our dataset consists of 1,262 comics, each accompanied by high-quality annotations. A statistical breakdown of the annotated components, including their quantity and average length, is presented in the table below.

Component		#Num	Avg. Len.
Image		1,262	-
Literal Description		1,262	134
Explicit Contradiction		1,262	33
Underlying Symbolism		5,048	26
Title		5,048	6
Additional Features	Social Knowledge	3,407	97
Additional Features	Linguistic Context	1,262	1
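As a quick reading aid, the counts in the table can be divided by the number of comics to get the average number of entries per comic. The numbers below are copied from the table; reading the four symbolism and title entries per comic as candidate options is our assumption, not something the table states.

```python
# Counts copied from the statistics table above.
NUM_COMICS = 1262
counts = {
    "Literal Description": 1262,
    "Explicit Contradiction": 1262,
    "Underlying Symbolism": 5048,
    "Title": 5048,
    "Social Knowledge": 3407,
    "Linguistic Context": 1262,
}

# Average number of annotated entries per comic for each component.
per_comic = {name: n / NUM_COMICS for name, n in counts.items()}

for name, ratio in per_comic.items():
    print(f"{name}: {ratio:.2f} per comic")
```

Symbolism and title each come out at exactly 4 entries per comic, while social knowledge averages about 2.70 entries per comic.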

In addition to the basic text length statistics, we compiled more comprehensive statistics on the content of the 1,262 images, covering linguistic context (left), social knowledge (middle), and humor categories (right).

Data Construction Pipeline

For each comic, we annotate the corresponding literal description, contradiction explanation, underlying philosophy, and comic title. The annotation process consists of two key stages: a human-AI collaborative annotation stage followed by a quality check and cross-verification stage. Gold-standard annotations are primarily obtained through human annotators. ('Pos' and 'Neg' in the figure represent the positive and negative options, respectively.)

Evaluating Large Models' Understanding of Humor in Juxtaposition: Task Designs from Our Paper

We aim to evaluate the capabilities of recent large (visual) language models in understanding humor through contradictions. This is challenging because it requires both social reasoning about human events and comparative reasoning about the narratives, going beyond the literal understanding of the comic. We design a series of tasks that require different levels of narrative understanding and reasoning abilities to evaluate the models’ performance in reading comics.
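For tasks where a model must pick one option among several candidates, performance can be summarized as accuracy against the gold label. The sketch below is an assumption about how such scoring could look, not the paper's evaluation code; the function name `accuracy` and the option letters are hypothetical.

```python
from typing import List


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predictions that match the gold option label.

    Labels are compared case-insensitively after trimming whitespace,
    since model outputs often vary in formatting.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


# Hypothetical model outputs for four comics, with gold option letters.
model_outputs = ["A", "b", "C", " D "]
gold_labels = ["A", "B", "D", "D"]
print(accuracy(model_outputs, gold_labels))
```

Free-form tasks such as writing a literal description need a different kind of metric, which this sketch does not cover.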

Knowledge Augmentation Analysis

Here is an example of a comic that requires social knowledge to be fully understood. Comprehending the comic demands not only reasoning ability from the model but also a comprehensive understanding of social events and human behavioral norms. We conduct experiments by enriching the model’s input prompts with annotated social knowledge tailored to each comic’s specific context. As shown in the figure below, incorporating this annotated social knowledge leads to significantly better performance compared to using only the image as input.
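The augmentation described above amounts to prepending the annotated social knowledge to the text prompt that accompanies the image. A minimal sketch of that idea follows; the function `build_prompt`, the prompt wording, and the placeholder facts are assumptions for illustration and are not the prompts used in the paper.

```python
from typing import List, Optional


def build_prompt(question: str, social_knowledge: Optional[List[str]] = None) -> str:
    """Build the text prompt sent to a VLM alongside the comic image.

    If social knowledge is provided, it is listed before the question;
    otherwise the prompt is the question alone (image-only baseline).
    """
    parts = []
    if social_knowledge:
        facts = "\n".join(f"- {fact}" for fact in social_knowledge)
        parts.append("Relevant social knowledge:\n" + facts)
    parts.append(question)
    return "\n\n".join(parts)


question = "Explain the contradiction between the two panels of this comic."

# Baseline: the model sees only the image and the question.
baseline_prompt = build_prompt(question)

# Augmented: annotated knowledge (placeholder text here) precedes the question.
augmented_prompt = build_prompt(question, ["<annotated social knowledge for this comic>"])
print(augmented_prompt)
```

Running both prompts on the same comics and comparing task scores isolates the contribution of the added knowledge.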


Model Finetuning for Deep Reasoning Tasks

Data Generation and Finetuning:


Navigating VLM Failures: Lessons and Future Pathways

Visual Perception Error: Models incorrectly identify elements of the image.
=> These perceptual errors cascade into subsequent reasoning processes, establishing flawed premises that undermine higher-level understanding.

Key Element Omission: Models fail to recognize or acknowledge significant visual elements present in the comic.
=> Such omissions eliminate essential information required for understanding the comic’s humor.

Incorrect Association: Models fabricate non-existent information (hallucinations) about the visual content.
=> These hallucinated associations impose incorrect interpretive frameworks that fundamentally alter the comic’s intended meaning.

Potential Applications

VLM / MLLM / LLM Evaluation
As a benchmark, this dataset can be used to evaluate the reasoning, comic understanding, and humor understanding abilities of vision-language models. The following results show how we evaluate the humor understanding ability of VLMs in our paper.
Generative task
In the future, we intend to explore more deeply how AI can not only interpret but also creatively engage with content. This includes generating pivotal turning points from one perspective and creating counterpoints to given scenarios, such as generating a "YES" image’s counterpart. The following is a simple example.
VLM image understanding
We will explore in more depth how VLMs understand these images and how to improve their ability to understand humorous images. We can address the hallucinations in the samples by improving the model’s reasoning ability, and thereby improve VLMs’ understanding of the deep semantics of the images.

Ethics Statement

Copyright and License
All data samples collected are sourced from publicly available content on social media platforms. We ensure compliance with copyright by utilizing original links to comics without infringement. In addition, we obtained permission from the artists (e.g., Anton Gudim and Liz Climo) to conduct our benchmark using these public images. We also commit to open-sourcing our annotated benchmark, providing corresponding links to each comic image. We diligently review samples, filtering out potentially offensive or harmful content.
The Large Vision Language Models
The VLMs utilized in our experiments are pretrained using diverse web corpora, which may introduce biases in their outputs. We advise users to conscientiously evaluate the ethical implications of generated outputs when employing them in future research endeavors.
Data Annotation
Eight human judges are engaged in our annotation process. We compensate these judges with an average hourly wage of $11, ensuring fair remuneration for their contributions.

Citation

If you find our work helpful, please consider citing us:

@article{liang2025yesbut,
  title={When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?},
  author={Tuo Liang and Zhe Hu and Jing Li and Hao Zhang and Yiren Lu and Yunlai Zhou and Yiran Qiao and Disheng Liu and Jeirui Peng and Jing Ma and Yu Yin},
  journal={arXiv preprint arXiv:2503.23137},
  year={2025},
  url={https://arxiv.org/abs/2503.23137}
}