A multimodal model capable of seeing your images and interpreting them. Ideal for visual recognition, OCR, object detection, etc.