
KOSMOS-2: Grounding Multimodal Large Language Models to the World

We introduce KOSMOS-2, a Multimodal Large Language Model (MLLM) that adds two new capabilities: perceiving object descriptions and grounding text to the visual world. KOSMOS-2 integrates the grounding capability into downstream applications and is evaluated on tasks including multimodal grounding, referring expression comprehension, perception-language tasks, and language understanding and generation. This work lays the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence. Data, demo, and pretrained models are available at