[YesBut-v2] When ‘YES’ Meets ‘BUT’: Can AI Comprehend Contradictory Humor Through Comparative Reasoning?

1Case Western Reserve University
2The Hong Kong Polytechnic University
*Indicates Equal Contribution

In previous work, we introduced YesBut-v1 to examine VLMs' ability to understand humor, with a specific emphasis on humor derived from contrasting narratives. That work had several limitations. In YesBut-v2, we expand the original YESBUT dataset from 349 to 1,262 images to improve its diversity and robustness. We also conduct more comprehensive and fine-grained analyses to better understand model performance. Finally, we propose a simple yet effective pipeline to improve VLMs’ ability to comprehend humor in juxtaposed comics.

Abstract

Understanding humor, especially when it involves complex and contradictory narratives, remains a significant challenge for large vision-language models (VLMs). This limitation hinders AI’s ability to engage in human-like reasoning and cultural expression. In this paper, we investigate this challenge through an in-depth analysis of comics that juxtapose panels to create humor through contradictions. We introduce YESBUT, a novel benchmark with 1,262 comic images from diverse multilingual and multicultural contexts, featuring comprehensive annotations that capture various aspects of narrative understanding. Using this benchmark, we systematically evaluate a wide range of VLMs through four complementary tasks spanning from surface content comprehension to deep narrative reasoning. Our extensive experiments reveal that even the most advanced models significantly underperform humans, with common failures in visual perception, key element identification, and hallucinations. We further investigate text-based training strategies and social knowledge augmentation methods to enhance model performance. Our findings not only highlight critical weaknesses in VLMs’ understanding of cultural and creative expressions but also provide pathways toward developing context-aware models capable of deeper narrative reasoning.

YesBut Dataset Overview

Our benchmark consists of YesBut comics featuring contradictory narratives. Specifically, each sample includes:
(1) a two-panel comic that contains a contradictory narrative;
(2) a literal description of the comic narratives;
(3) an explanation that illustrates the contradiction within the narrative;
(4) the underlying symbolism or message conveyed by the comic;
(5) a title of the comic;
(6) additional features, including social knowledge and linguistic context necessary for interpreting the comic.
Based on these components, we construct various tasks and analyses for comic understanding.
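The six components above can be pictured as one record per comic. The following is a minimal sketch only: the class name, field names, file path, and example values are invented for illustration and are not the released dataset schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class YesButSample:
    """One benchmark entry. Field names are illustrative assumptions."""
    image_path: str               # (1) the two-panel comic
    literal_description: str      # (2) what the panels literally show
    contradiction: str            # (3) why the two panels conflict
    symbolism: str                # (4) underlying message
    title: str                    # (5) comic title
    social_knowledge: List[str] = field(default_factory=list)  # (6) background facts
    linguistic_context: str = ""  # (6) language of any text in the comic


# A made-up example, not taken from the dataset.
sample = YesButSample(
    image_path="comics/0001.png",  # hypothetical path
    literal_description="Left: a person buys a gym membership. "
                        "Right: the same person takes the escalator to the gym.",
    contradiction="The person pays to exercise but avoids the stairs.",
    symbolism="Intentions and habits often diverge.",
    title="Fitness Goals",
)
print(sample.title)
```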

Dataset Statistics and Attribute Distribution

Our dataset consists of 1,262 comics, each accompanied by high-quality annotations. A statistical breakdown of the annotated components, including their quantity and average length, is presented in the table below.

Component                                  #Num     Avg. Len.
Image                                      1,262    -
Literal Description                        1,262    134
Explicit Contradiction                     1,262    33
Underlying Symbolism                       5,048    26
Title                                      5,048    6
Additional Features: Social Knowledge      3,407    97
Additional Features: Linguistic Context    1,262    1
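
As a quick reading aid, the counts in the table can be turned into per-comic averages. The numbers below are copied from the table; the interpretation is ours. Four symbolism and title entries per comic would be consistent with one correct option plus distractors, but the page does not state that split.

```python
# Counts copied from the statistics table.
num_comics = 1262
counts = {
    "Underlying Symbolism": 5048,
    "Title": 5048,
    "Social Knowledge": 3407,
}

# Average number of annotated entries per comic for each component.
per_comic = {name: n / num_comics for name, n in counts.items()}

for name, value in per_comic.items():
    print(f"{name}: {value:.2f} per comic")
```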

Beyond these basic text-length statistics, we compiled more detailed statistics on the content of the 1,262 images, covering linguistic context (left), social knowledge (middle), and humor categories (right).


Data Construction Pipeline


For each comic, we annotate the corresponding literal description, contradiction explanation, underlying philosophy, and comic title. The annotation process consists of two key stages: a human-AI collaborative annotation stage, followed by a quality check and cross-verification stage. Gold-standard annotations are primarily obtained through human annotators. ('Pos' and 'Neg' in the figure represent the positive and negative options, respectively.)
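
To illustrate how positive and negative options might be combined into a multiple-choice item, here is a minimal sketch. The function name `build_question`, the fixed seed, and the example options are our assumptions; the page does not describe how options are ordered.

```python
import random
from typing import List, Tuple


def build_question(pos: str, negs: List[str], seed: int = 0) -> Tuple[List[str], int]:
    """Shuffle one positive and several negative options.

    Returns the shuffled options and the index of the positive option.
    A seeded generator keeps the order reproducible across runs.
    """
    options = [pos] + list(negs)
    random.Random(seed).shuffle(options)
    return options, options.index(pos)


# Made-up options for illustration only.
options, answer = build_question(
    pos="Intentions and habits often diverge.",
    negs=["Exercise is expensive.", "Escalators are dangerous.", "Gyms are crowded."],
)
print(options[answer])
```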

Evaluating Large Models' Understanding of Humor in Juxtaposition: Task Designs from Our Paper

We aim to evaluate the capabilities of recent large (visual) language models in understanding humor through contradictions. This is challenging because it requires both social reasoning about human events and comparative reasoning about the narratives, going beyond the literal understanding of the comic. We design a series of tasks that require different levels of narrative understanding and reasoning abilities to evaluate the models’ performance in reading comics.
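
For selection-style tasks such as symbolism selection and title matching, a natural score is the fraction of items answered correctly. The sketch below assumes predictions and gold answers are option indices; the exact metrics used in the paper are not stated on this page.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Fraction of items where the predicted option index matches the gold index."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Three of four hypothetical predictions match the gold answers.
print(accuracy([0, 1, 2, 3], [0, 1, 0, 3]))
```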


Knowledge Augmentation Analysis


Here is an example of a comic that requires social knowledge to be fully understood. Comprehending the comic demands not only reasoning ability from the model, but also a comprehensive understanding of social events and human behavioral norms. We conduct experiments by enriching the model’s input prompts with annotated social knowledge tailored to each comic’s specific context. As shown in the figure below, incorporating this annotated social knowledge leads to significantly better performance compared to using only the image as input.
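
The augmentation step can be sketched as prepending the annotated facts to the text prompt that accompanies the image. The prompt wording, the function name `build_prompt`, and the example facts are our assumptions, not the prompts used in the experiments.

```python
from typing import List, Optional


def build_prompt(
    question: str,
    options: List[str],
    social_knowledge: Optional[List[str]] = None,
) -> str:
    """Build the text part of a VLM prompt, optionally with social knowledge."""
    lines: List[str] = []
    if social_knowledge:
        # Knowledge-augmented setting: list the annotated facts first.
        lines.append("Background knowledge:")
        lines.extend(f"- {fact}" for fact in social_knowledge)
        lines.append("")
    lines.append(question)
    # Label options (A), (B), (C), ...
    lines.extend(f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options))
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


print(build_prompt(
    "Which title best fits the comic?",
    ["Fitness Goals", "Rush Hour"],
    social_knowledge=["People often buy gym memberships they rarely use."],
))
```

Calling the same function without `social_knowledge` gives the image-only baseline prompt, so the two settings differ only in the added background lines.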


Model Finetuning for Deep Reasoning Tasks

Data Generation and Finetuning: To overcome the scarcity of large-scale comic datasets, we propose a text-only training approach that leverages LLMs for synthetic data generation. Using GPT-4o, we create 20,000 narrative descriptions with corresponding reasoning questions, based on few-shot prompting from a small labeled set. The resulting dataset is used to finetune only the language components of VLMs with LoRA, enhancing deep reasoning without modifying the visual modules. The figure below compares performance with (w/) and without (w/o) finetuning (FT) on the deep reasoning tasks (i.e., the symbolism selection and title matching tasks).
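
For readers unfamiliar with LoRA, the idea is to keep a weight matrix W frozen and learn a low-rank update, so a layer computes y = W x + (alpha / r) * B (A x). The pure-Python sketch below shows only this arithmetic. The rank, alpha, matrix sizes, and values are arbitrary, and the page does not state which settings or language-model layers were used.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: List[float], alpha: float) -> List[float]:
    """Compute y = W x + (alpha / r) * B (A x). Only A and B would be trained."""
    r = len(A)  # rank = number of rows in A
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen 2x2 weight
A = [[1.0, 0.0]]              # r x k with r = 1
x = [1.0, 1.0]

# With B initialised to zero, the adapted layer matches the frozen layer.
y_init = lora_forward(W, A, [[0.0], [0.0]], x, alpha=2.0)

# After training, a non-zero B shifts the output through the low-rank path.
y_tuned = lora_forward(W, A, [[1.0], [2.0]], x, alpha=2.0)
print(y_init, y_tuned)
```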

Navigating VLM Failures: Lessons and Future Pathways

  • Visual Perception Error: Models incorrectly identify elements in the image.

    => These perceptual errors cascade into subsequent reasoning processes, establishing flawed premises that undermine higher-level understanding.

  • Key Element Omission: Models fail to recognize or acknowledge significant visual elements present in the comic.

    => Such omissions eliminate essential information required for understanding the comic’s humor.

  • Incorrect Association: Models hallucinate information that is not present in the visual content.

    => These hallucinated associations impose incorrect interpretive frameworks that fundamentally alter the comic’s intended meaning.

Potential Applications

  • VLM / MLLM / LLM Evaluation
    As a benchmark, this dataset can be used to evaluate the reasoning, comic understanding, and humor understanding abilities of a vision-language model. The following result shows how we evaluate the humor understanding ability of VLMs in our paper.


  • Generative task
    In the future, we intend to explore more deeply how AI can not only interpret but also creatively engage with content. This includes generating pivotal turning points from one perspective and creating counterpoints to given scenarios, such as generating the counterpart of a "YES" image. The following is a simple example.


  • VLM image understanding
    We will explore in more depth how VLMs understand these images and how to improve their ability to understand humorous images. Improving the model’s reasoning ability could reduce the hallucinations seen in the samples and strengthen its grasp of the images’ deeper semantics.


Ethics Statement

  • Copyright and License
    All data samples are sourced from publicly available content on social media platforms. We ensure compliance with copyright by using original links to the comics. We also obtained permission from the artists (e.g., Anton Gudim and Liz Climo) to build our benchmark on these public images. We commit to open-sourcing our annotated benchmark, providing the corresponding link for each comic image. We diligently review samples and filter out potentially offensive or harmful content.

  • The Large Vision Language Models
    The VLMs utilized in our experiments are pretrained using diverse web corpora, which may introduce biases in their outputs. We advise users to conscientiously evaluate the ethical implications of generated outputs when employing them in future research endeavors.

  • Data Annotation
    Eight human judges are engaged in our annotation process. We compensate these judges with an average hourly wage of $11, ensuring fair remuneration for their contributions.

Citation

If you find our work helpful, please consider citing us:

@article{Tuo,
  title={When ‘YES’ Meets ‘BUT’: Can AI Comprehend Contradictory Humor in Comics?},
  author={Tuo Liang and Zhe Hu and Jing Li and Hao Zhang and Yiren Lu and Yunlai Zhou and Yiran Qiao and Disheng Liu and Jierui Peng and Jing Ma and Yu Yin},
  journal={arXiv preprint arXiv:},
  year={2025}
}