
Apple Quietly Launches MM1, Advancing Multimodal Large Language Models

Anurag Patnaik

Mar 20, 2024

In an unanticipated move that could reshape the landscape of artificial intelligence, Apple has introduced MM1, a cutting-edge multimodal Large Language Model (LLM) capable of image captioning, visual question answering, and natural language inference. Announced on March 17, 2024, this release signifies Apple's significant foray into the AI domain, showcasing a commitment to advancing AI and machine learning technologies across its products and services.

While Apple has traditionally maintained a low profile in the AI arms race, the release of MM1 positions the tech giant as a formidable player in the field of multimodal AI. Developed by a dedicated team of researchers, MM1 stands out with its 30 billion parameters, trained on a diverse dataset encompassing over 1 billion images and 30 trillion words. This extensive training enables MM1 to perform a wide array of tasks, from image understanding to complex natural language processing, setting new benchmarks in AI performance and versatility.

Fig. 1: MM1 can perform in-context predictions thanks to its large-scale multimodal pre-training. This allows MM1 to (a) count objects and follow custom formatting, (b) refer to parts of the images and perform OCR, (c) demonstrate common-sense and word knowledge about everyday objects, and (d) perform basic math functions.

The MM1 model is built on transformer architectures throughout: a decoder-only transformer language model handles text, while a Vision Transformer (ViT) image encoder, trained with a CLIP objective, processes visual input. A vision-language connector projects the encoder's output into the language model's token space. This combination allows MM1 to excel in tasks involving both text and visual data, offering powerful capabilities in AI-driven applications.
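The pipeline described above can be sketched end to end. Everything below is a toy stand-in (random weights, made-up dimensions, none of them Apple's): it only illustrates how visual tokens and text tokens end up in a single sequence that the language model consumes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions, not MM1's real ones.
D_VIT, D_LLM, N_PATCHES, N_TEXT = 64, 128, 9, 5

def vision_encoder(image_patches):
    """Stand-in for a ViT: map each flattened image patch to a D_VIT-dim embedding."""
    W = rng.normal(size=(image_patches.shape[-1], D_VIT))
    return image_patches @ W            # (N_PATCHES, D_VIT)

def vl_connector(patch_embeddings):
    """Stand-in for the vision-language connector: project visual
    embeddings into the LLM's token-embedding space."""
    W = rng.normal(size=(D_VIT, D_LLM))
    return patch_embeddings @ W         # (N_PATCHES, D_LLM)

# Fake inputs: 9 flattened image patches and 5 text-token embeddings.
patches = rng.normal(size=(N_PATCHES, 48))
text_tokens = rng.normal(size=(N_TEXT, D_LLM))

# The LLM sees one interleaved sequence of visual and text tokens.
visual_tokens = vl_connector(vision_encoder(patches))
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)                  # (14, 128)
```

The key point is the connector: because its output lives in the same embedding space as text tokens, the language model needs no architectural changes to attend over images.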

Key to MM1's success is its comprehensive training regime, which includes a vast array of images with captions, an extensive corpus of textual data, and code examples from various repositories. This multimodal approach not only enhances the model's understanding of complex information but also positions it as a leader in AI benchmarks, outperforming competitors in language understanding, image recognition, and code generation tasks.

Comparative analyses in the paper show MM1's strength across several benchmarks. Its pre-trained models set state-of-the-art few-shot results on captioning and visual question answering tasks, and after supervised fine-tuning MM1 remains competitive with prominent models such as GPT-4V on established multimodal benchmarks, confirming its status as a cutting-edge AI model.

MM1's potential applications are vast and varied, spanning healthcare, education, and e-commerce. In healthcare, it could revolutionize the diagnosis and monitoring of diseases through advanced image analysis. In education, MM1 could personalize learning experiences and assist in grading multi-format assignments. For e-commerce, it offers enhanced product recommendations and customer service through its ability to understand and generate natural, context-rich interactions.

  • Model Architecture: MM1 explores both dense models (up to 30 billion parameters) and mixture-of-experts (MoE) variants.

  • Pre-training Data: The team experimented with different data combinations for pre-training, finding that mixing image-caption pairs, interleaved text and image documents, and text-only data was crucial for optimal performance.

  • Key Factors for Performance: Through ablation studies (systematically removing parts of the model to see what effect it has on performance), they identified image resolution as the most significant factor impacting performance, even more than model size itself. The specific design of the vision-language connector, the part that bridges image and text understanding, had less influence.
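The one-factor-at-a-time bookkeeping behind such ablations can be sketched as follows. The scoring function here is a pure placeholder (real ablations train and evaluate a model per configuration), and the grid values are illustrative, not the paper's actual sweep:

```python
# Hypothetical ablation grid: each axis is one factor to vary in isolation.
resolutions = [224, 336, 448]     # image resolution, in pixels
model_sizes_b = [3, 7, 30]        # parameters, in billions

def toy_eval(resolution, size_b):
    """Placeholder metric; real ablations would train/evaluate a model."""
    return 0.01 * resolution + 0.1 * size_b

baseline = {"resolution": 336, "size_b": 3}

# Ablate one factor at a time, holding every other factor at the baseline.
results = {}
for res in resolutions:
    results[("resolution", res)] = toy_eval(res, baseline["size_b"])
for size in model_sizes_b:
    results[("size_b", size)] = toy_eval(baseline["resolution"], size)

# Rank configurations by score to see which factor moved the needle most.
for key, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(key, round(score, 2))
```

Holding all other factors at a fixed baseline is what lets the researchers attribute a performance change to a single design decision.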

Fig. 2: MM1 can follow instructions and reason across images, answering correctly when prompted with chain-of-thought reasoning.


The Apple research team behind MM1, a family of multimodal large language models (MLLMs), aimed to create a system that could process and understand both text and image data [1]. Their approach focused on several key points:

  • Addressing Multimodal Challenges: Unlike traditional large language models (LLMs) that focus on text data, MM1 tackles the challenge of integrating and understanding both text and image data. This is achieved through a multimodal architecture with a vision encoder specifically designed to process visual information and a vision-language connector that bridges the gap between the two modalities.

  • Data-Centric Training: The study emphasizes the importance of a well-curated pre-training dataset. MM1 benefits from a diverse mix of data sources including image-caption pairs, interleaved image-text documents, and text-only data. This variety helps the model learn the relationships between image and text representations, improving performance in tasks like image captioning and visual question answering.

  • Ablation Studies for Targeted Improvement: Apple's researchers employed ablation studies, a technique where they systematically removed components of the model to analyze their impact on performance. This helped them identify the most crucial factors for MLLM development. Interestingly, the study revealed that image resolution has a greater influence on performance than model size itself. This suggests focusing on the quality and structure of the visual data used for pre-training can be more impactful than simply scaling up the model.

Fig. 3: Left: Model ablations: which visual encoder to use, how to feed rich visual data, and how to connect the visual representation to the LLM. Right: Data ablations: type of data, and their mixture.

Image Encoder: A ViT-L/14 model trained with a CLIP loss on DFN-5B and VeCap-300M; images of size 336×336.

Vision-Language Connector: C-Abstractor with 144 image tokens.

Pre-training Data: A mix of captioned images (45%), interleaved image-text documents (45%), and text-only (10%) data.

Language Model: A 1.2B transformer decoder-only language model. To evaluate the different design decisions, the team measured zero-shot and few-shot (4- and 8-shot) performance on a variety of captioning and VQA tasks.
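The base ablation configuration above can be collected into a single config object, which is how such sweeps are typically parameterized in practice. The field names here are illustrative (not Apple's code); the values come directly from the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MM1AblationConfig:
    """Base configuration for the ablation experiments, per the paper."""
    image_encoder: str = "ViT-L/14 (CLIP loss, DFN-5B + VeCap-300M)"
    image_resolution: int = 336           # pixels per side
    vl_connector: str = "C-Abstractor"
    image_tokens: int = 144               # visual tokens fed to the LLM
    caption_frac: float = 0.45            # captioned images
    interleaved_frac: float = 0.45        # interleaved image-text documents
    text_only_frac: float = 0.10          # text-only data
    llm: str = "1.2B decoder-only transformer"

cfg = MM1AblationConfig()

# The data-mixture fractions should sum to 1.
mix = cfg.caption_frac + cfg.interleaved_frac + cfg.text_only_frac
assert abs(mix - 1.0) < 1e-9
print(cfg.image_tokens)  # 144
```

A frozen dataclass makes each ablation run a small, hashable diff against this baseline, which keeps one-factor-at-a-time comparisons honest.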

Takeaways and Lessons:

  • Importance of Multimodal Learning: MM1's success highlights the potential of multimodal learning for tasks that require understanding the relationship between images and text. This has implications for various applications, such as image retrieval, visual search, and generating more comprehensive descriptions of content.

  • Data Diversity is Key: The study underscores the importance of using diverse data sources during pre-training. Including different combinations of text-and-image data helps the model learn a richer set of relationships between the two modalities.

  • Focus on Data Quality: Beyond just the amount of data, Apple's research suggests that prioritizing the quality and structure of the data used for pre-training can have a significant impact on MLLM performance. This highlights the need for careful data curation and exploration of factors like image resolution for optimal model development.

By combining these elements, Apple's MM1 family of models achieved state-of-the-art (SOTA) results in pre-training metrics. Additionally, MM1 showed competitive performance on established multimodal benchmarks after fine-tuning for specific tasks. The paper also highlights advantages of MM1 like enhanced in-context learning and multi-image reasoning capabilities.

In essence, Apple focused on a combination of model design, diverse pre-training data, and identifying the most impactful factors to create a high-performing MLLM capable of understanding both text and image data.

Apple's MM1 marks a significant milestone in the development of multimodal AI, promising to impact numerous sectors with its advanced capabilities. While challenges remain, the potential of MM1 to transform AI applications is undeniable, setting a new standard for future AI developments. As Apple continues to refine and expand its AI research, the tech world eagerly anticipates further innovations from the Cupertino-based giant.
