Ferret adds referring and grounding capabilities to a multimodal large language model (LLM). For referring, a user can indicate a region or an object with a point, a box, or any free-form shape. Each regionN placeholder in the input is replaced by the proposed hybrid representation before the input is fed into the LLM. For grounding, Ferret can accurately ground any open-vocabulary description. Each boxN in the output denotes the predicted bounding box coordinates.
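
The placeholder substitution described in the caption can be illustrated with a minimal sketch. The caption does not give the format of the hybrid representation, so this example assumes it is the region's coordinates written as text, followed by a token that stands in for a continuous region feature. The names `expand_regions` and `FEATURE_TOKEN`, and the token string itself, are hypothetical and are not taken from Ferret's code.

```python
import re

# Hypothetical token standing in for the continuous region feature (assumption).
FEATURE_TOKEN = "<region_feat>"


def expand_regions(prompt, regions):
    """Replace each `regionN` placeholder with coordinate text plus a feature token.

    `regions[N]` holds the coordinates for `regionN`: two numbers for a point,
    four for a box. This layout is an assumption made for illustration.
    """
    def substitute(match):
        index = int(match.group(1))          # the N in `regionN`
        coord_text = ", ".join(str(c) for c in regions[index])
        return f"[{coord_text}] {FEATURE_TOKEN}"

    return re.sub(r"\bregion(\d+)\b", substitute, prompt)


prompt = "What is the object region0 doing next to region1?"
regions = [(10, 20, 110, 220), (300, 40)]    # one box, one point
expanded = expand_regions(prompt, regions)
print(expanded)
```

In a real model the feature token would be swapped for an embedding computed from the image region; the sketch only shows the text-level substitution.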