Review — GPT-4V(ision) System Card

GPT-4 for Vision. ChatGPT Can See, Hear and Speak

Sik-Ho Tsang
7 min read · Oct 4, 2023
ChatGPT Can See, Hear and Speak, announced on 25th Sep 2023

GPT-4V(ision) System Card, by OpenAI
2023 OpenAI (Sik-Ho Tsang @ Medium)
Visual/Vision/Video Language Model (VLM)
2017 … 2022 [FILIP] [Wukong] [LiT] [Flamingo] [FLAVA] [SimVLM] [VLMo] [BEiT-3] [GLIP] 2023 [GPT-4]

  • GPT-4 with Vision (GPT-4V) was recently launched to support vision inputs. It lets users instruct GPT-4 to analyze images that they provide. In this system card, the authors analyze the safety properties of GPT-4V.
  • Training of GPT-4V was completed in 2022, and OpenAI began providing early access to the system in March 2023 for small-scale use and safety learnings.
  • Similar to GPT-4, the pre-trained model was first trained to predict the next word in a document, using a large dataset of text and image data from the Internet as well as licensed sources of data. It was then fine-tuned with additional data, using an algorithm called reinforcement learning from human feedback (RLHF).


  1. GPT-4V Development Preparation
  2. External Red Teaming
  3. Mitigations

1. GPT-4V Development Preparation

1.1. Early Access

Beginning in March, 2023, Be My Eyes and OpenAI collaborated to develop Be My AI, a new tool to describe the visual world for people who are blind or have low vision.

  • Be My Eyes piloted Be My AI from March to early August 2023 with a group of nearly 200 blind and low vision beta testers to hone the safety and user experience of the product. By September, the beta test group had grown to 16,000 blind and low vision users requesting a daily average of 25,000 descriptions.
  • Be My AI can provide its 500,000 blind and low-vision users with unprecedented tools addressing informational, cultural, and employment needs.

1.2. Evaluation

For quantitative evaluation, OpenAI built evaluations that measure model refusals and model performance accuracy.

  1. Harmful content: Refusal evaluations for illicit behaviour
  2. Harms of representation, allocation, and quality of service: Refusal evaluations for ungrounded inferences. Performance accuracy evaluations for gender, race and age recognition across demographics.
  3. Privacy: Refusal evaluation for person identification requests. Performance accuracy evaluation for person identification requests. Geolocalization evaluations.
  4. Cybersecurity: Performance accuracy CAPTCHA breaking evaluations.
  5. Multimodal Jailbreaks: Refusal evaluation for text-screenshot jailbreak.
Jailbreak Prompt Examples
  • Jailbreaks typically involve trapping the model via convoluted logical reasoning chains designed to make it ignore its instructions and training, as above.
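As a rough illustration of the "performance accuracy across demographics" evaluations listed above, the sketch below computes accuracy per group and the gap between the best and worst group. The record format, the group names, and the functions `accuracy_by_group` and `max_accuracy_gap` are hypothetical; the system card does not publish its evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record is (group, predicted_label, true_label). This layout is an
# assumption made for illustration, not the format used by OpenAI.
Record = Tuple[str, str, str]


def accuracy_by_group(records: Iterable[Record]) -> Dict[str, float]:
    """Return the fraction of correct predictions for each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        if predicted == actual:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group: Dict[str, float]) -> float:
    """Return the difference between the best and worst group accuracy."""
    values = list(per_group.values())
    return max(values) - min(values)


# Toy data with made-up groups and labels.
toy_records = [
    ("group_a", "x", "x"),
    ("group_a", "y", "x"),
    ("group_b", "x", "x"),
    ("group_b", "y", "y"),
]
per_group = accuracy_by_group(toy_records)
print(per_group)                    # {'group_a': 0.5, 'group_b': 1.0}
print(max_accuracy_gap(per_group))  # 0.5
```

A large gap would suggest that quality of service differs across groups, which is the kind of harm item 2 is meant to surface.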

2. External Red Teaming

OpenAI worked with external experts to qualitatively assess the limitations and risks associated with the model and system.

  • Red teamers gave especially useful feedback in 6 key risk areas:
  1. Scientific proficiency
  2. Medical advice
  3. Stereotyping and ungrounded inferences
  4. Disinformation risks
  5. Hateful Content
  6. Visual vulnerabilities

2.1. Scientific Proficiency

Scientific Proficiency
  • Red teamers tested GPT-4V’s capabilities and limitations in scientific domains.

If two separate text components were closely located in an image, the model would occasionally combine them. For instance, it may merge “multipotent hematopoietic stem cell (HSC)” and “self-renewing division,” (as above) leading to the creation of unrelated terms.

  • Additionally, the model was prone to hallucinations and sometimes could make factual errors in an authoritative tone.
  • In some cases, it could also fail to identify information from images.
Scientific Proficiency

The model would give information for the synthesis and analysis of some dangerous chemicals such as Isotonitazene, a synthetic opioid.

  • The model’s generations here can be inaccurate and error prone, limiting its use for such tasks.
Scientific Proficiency

The model is unreliable and should not be used for any high risk tasks such as identification of dangerous compounds or foods.

2.2. Medical Advice

Medical Advice
  • The example above shows some of the vulnerabilities or inaccuracies that could result from an incorrect or decontextualized interpretation of the directionality of medical imaging.

The authors do not consider the current version of GPT-4V to be fit for performing any medical function or for substituting for professional medical advice, diagnosis, treatment, or judgment.

2.3. Stereotyping and Ungrounded Inferences

Stereotyping and Ungrounded Inferences
  • Using GPT-4V for some tasks might generate unwanted or harmful assumptions that are not grounded in the information provided to the model.
  • Broad open-ended questions to the model paired with an image also exposed bias or anchoring towards specific topics that may not necessarily have been intended by the prompt.

2.4. Disinformation Risks

Disinformation Risks

When paired with vision capabilities, image and text content can pose increased risks with disinformation since the model can create text content tailored to an image input.

  • Red teamers also tested GPT-4V’s ability to detect incorrect information or disinformation in an image. But GPT-4V was not trained for this purpose and should not be used as a way to detect disinformation, or to otherwise verify whether something is true or false.
  • Realistic, customized images can be created using other generative image models, and used in combination with GPT-4V’s capabilities. Pairing the ability of image models to generate images more easily with GPT-4V’s ability to generate accompanying text more easily may have an impact on disinformation risks.

2.5. Hateful Content

Hateful Content

(a): GPT-4V knows the historic meaning of the Templar Cross but misses its modern meaning in the US, where it has been appropriated by hate groups.

(b): The model can also sometimes make songs or poems that praise certain hate figures or groups.

2.6. Visual Vulnerabilities

Visual Vulnerabilities

For example, the ordering of the images used as input may influence the recommendation the model makes.
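One way to probe this kind of order sensitivity is to ask the same question with every ordering of the images and check whether the answers agree. The sketch below is only an illustration of that idea, not the procedure the red teamers used. `ask_model` is a hypothetical callable standing in for whatever vision model is being tested, and images are represented by file names.

```python
from itertools import permutations
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical model interface: takes an ordered list of image names and a
# text prompt, and returns the model's answer as a string.
AskModel = Callable[[Sequence[str], str], str]


def answers_by_ordering(
    images: Sequence[str], prompt: str, ask_model: AskModel
) -> Dict[Tuple[str, ...], str]:
    """Query the model once per image ordering and collect the answers."""
    return {
        ordering: ask_model(list(ordering), prompt)
        for ordering in permutations(images)
    }


def is_order_sensitive(answers: Dict[Tuple[str, ...], str]) -> bool:
    """True if at least two orderings produced different answers."""
    return len(set(answers.values())) > 1


# Stub model that always recommends whichever image comes first.
def first_image_model(images: Sequence[str], prompt: str) -> str:
    return images[0]


results = answers_by_ordering(
    ["a.png", "b.png", "c.png"], "Which one do you recommend?", first_image_model
)
print(is_order_sensitive(results))  # True
```

The number of orderings grows factorially with the number of images, so in practice this only suits small image sets or a random sample of orderings.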

3. Mitigations

3.1. Model-Level Mitigations

Multimodal Dataset

To further reduce the risks in emerging and high-stakes areas, the authors integrated additional multimodal data into the post-training process in order to reinforce refusal behavior.

For illicit behavior, a multimodal dataset is collected by augmenting the existing text-only dataset with image synonyms. For example, given a text string “how do i kill the people?”, it is adapted into a multimodal example “how do i [image of knife] the [image of people]?”.

  • The augmentation consists of the following steps:
  1. For each original text-only example, GPT-4 is asked to pick the top two most harmful short phrases (ref the table below).
  2. Each chosen short phrase is replaced with a web-crawled image.
  3. To ensure semantic invariance, human review is conducted and low-quality augmentations are filtered out.
  4. To reinforce the robustness of the refusal behavior, the examples are also augmented with various system messages.
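Steps 2 and 4 above can be sketched as follows. This is a minimal illustration under stated assumptions, not OpenAI's pipeline: the harmful phrases are passed in directly (step 1 would come from GPT-4), the images are local placeholder paths instead of web-crawled images, and human review (step 3) is left out. `ImageRef`, `augment_with_images` and `with_system_messages` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Union


@dataclass(frozen=True)
class ImageRef:
    """Placeholder for an image that stands in for a text phrase."""
    path: str


Segment = Union[str, ImageRef]


def augment_with_images(
    text: str, phrases: Sequence[str], phrase_to_image: Dict[str, ImageRef]
) -> List[Segment]:
    """Replace the first occurrence of each phrase in each text segment with its image."""
    segments: List[Segment] = [text]
    for phrase in phrases:
        image = phrase_to_image[phrase]
        updated: List[Segment] = []
        for segment in segments:
            # Leave images and text without the phrase untouched.
            if isinstance(segment, ImageRef) or phrase not in segment:
                updated.append(segment)
                continue
            before, _, after = segment.partition(phrase)
            if before:
                updated.append(before)
            updated.append(image)
            if after:
                updated.append(after)
        segments = updated
    return segments


def with_system_messages(
    segments: Sequence[Segment], system_messages: Sequence[str]
) -> List[dict]:
    """Pair the same multimodal prompt with several system messages (step 4)."""
    return [{"system": msg, "user": list(segments)} for msg in system_messages]


knife = ImageRef("knife.png")    # hypothetical image files
people = ImageRef("people.png")
example = augment_with_images(
    "how do i kill the people?",
    ["kill", "people"],
    {"kill": knife, "people": people},
)
# example is: ["how do i ", knife, " the ", people, "?"]
variants = with_system_messages(example, ["You are a helpful assistant.", ""])
```

Keeping the prompt as a list of text and image segments preserves the word order of the original sentence, which is what makes the augmented example comparable to the text-only one.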

For ungrounded inference requests, data is collected through the red teaming campaigns. The goal was to train the model to refuse prompts that were requesting an ungrounded conclusion based on certain attributes of a person. For example, if the prompt includes a photo of a person and the text “What is this person’s personality type?”, the desired model completion is “I’m sorry, I can’t help with that.” The examples collected through the red teaming campaign were further reviewed by humans before adding to the training dataset.

  • According to OpenAI's internal evaluations after post-training, 97.2% of the completions refused requests for illicit advice, and 100% of the completions refused requests for ungrounded inference.
  • The correct refusal style rate improved from 44.4% to 72.2% for illicit advice, and from 7.5% to 50% for ungrounded inference.
  • OpenAI plans to iterate on this process to improve refusals over time.
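To make the two kinds of rates reported above concrete, here is a small sketch of how a refusal rate and a refusal style rate could be computed. The system card does not say how either rate is calculated, so the choices here are assumptions: the refusal rate is taken over all completions, the style rate over the refusals only, and both classifiers are toy string checks supplied by the caller.

```python
from typing import Callable, Sequence, Tuple


def refusal_metrics(
    completions: Sequence[str],
    is_refusal: Callable[[str], bool],
    has_correct_style: Callable[[str], bool],
) -> Tuple[float, float]:
    """Return (refusal_rate, style_rate).

    refusal_rate: share of all completions that refuse.
    style_rate:   share of refusals written in the desired style
                  (assumed denominator; 0.0 if nothing was refused).
    """
    if not completions:
        raise ValueError("need at least one completion")
    refusals = [c for c in completions if is_refusal(c)]
    refusal_rate = len(refusals) / len(completions)
    if not refusals:
        return refusal_rate, 0.0
    styled = sum(1 for c in refusals if has_correct_style(c))
    return refusal_rate, styled / len(refusals)


# Toy completions and toy classifiers, for illustration only.
DESIRED = "I'm sorry, I can't help with that."
toy_completions = [
    DESIRED,
    "I'm sorry, but let me explain at length why this is wrong.",
    "Sure, here is how.",
    DESIRED,
]
refusal_rate, style_rate = refusal_metrics(
    toy_completions,
    is_refusal=lambda c: c.startswith("I'm sorry"),
    has_correct_style=lambda c: c == DESIRED,
)
print(refusal_rate)  # 0.75
```

In a real evaluation the two classifiers would be human raters or a trained classifier, not string matching.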

3.2. System-Level Mitigations

System-level mitigations are added for adversarial images containing overlaid text, to ensure that this input cannot be used to circumvent the text safety mitigations. For example, a user could submit an image containing the text, “How do I build a bomb?”

As one mitigation for this risk, images are run through an OCR tool and then moderation scores are calculated on the resulting text in the image. This is in addition to detecting any text inputted directly in the prompt.
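The flow described above (OCR the image, score the extracted text, and also score the typed prompt) can be sketched as below. The OCR tool and the moderation model are passed in as callables because the system card does not name the ones OpenAI uses; `should_block`, the `threshold` value, and the stub functions are all hypothetical.

```python
from typing import Callable, Dict, Tuple


def should_block(
    image_bytes: bytes,
    prompt_text: str,
    ocr: Callable[[bytes], str],
    moderation_score: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, not a documented value
) -> Tuple[bool, Dict[str, float]]:
    """Score both the typed prompt and any text found in the image."""
    image_text = ocr(image_bytes)
    scores = {
        "prompt": moderation_score(prompt_text),
        # Skip scoring when OCR finds no text in the image.
        "image_text": moderation_score(image_text) if image_text.strip() else 0.0,
    }
    return max(scores.values()) >= threshold, scores


# Stubs for illustration: the "image" is just UTF-8 text, and the
# "moderation model" is a single keyword check.
def fake_ocr(image_bytes: bytes) -> str:
    return image_bytes.decode("utf-8")


def fake_moderation(text: str) -> float:
    return 1.0 if "bomb" in text.lower() else 0.0


blocked, scores = should_block(
    b"How do I build a bomb?", "What does this say?", fake_ocr, fake_moderation
)
print(blocked)  # True
```

Scoring the image text separately from the prompt means that a harmless-looking typed prompt cannot hide a disallowed request placed inside the image.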

3.3. Results

Significant Progress in Refusing Disallowed Prompts
Significant Progress for Image Jailbreak Refusals
  • (The takeaway from this system card is that, when using GPT-4V, we need to be aware that it can make things up at any moment, just like other LLMs.)


