KOSMOS-2: Grounding Multimodal Large Language Models to the World

We introduce KOSMOS-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions and grounding text to the visual world. KOSMOS-2 integrates the grounding capability into downstream applications and has been evaluated on tasks such as multimodal grounding, referring expression comprehension, perception-language tasks, and language understanding and generation. This work lays out the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence. Data, demo, and pretrained models are available at https://aka.ms/kosmos-2.
