Recognize objects in images and generate textual descriptions of image content: https://github.dev/xinyu1205/recognize-anything
Official PyTorch Implementation of Recognize Anything: A Strong Image Tagging Model and Tag2Text: Guiding Vision-Language Model via Image Tagging.
Both Tag2Text and RAM exhibit strong recognition ability. We have combined Tag2Text and RAM with localization models (Grounding-DINO and SAM) to build a strong visual semantic analysis pipeline in the Grounded-SAM project.
RAM is a strong image tagging model, which can recognize any common category with high accuracy.
RAM significantly improves the tagging ability of the Tag2Text framework.
Tag2Text is an efficient and controllable vision-language model with tagging guidance.
| | Name | Backbone | Data | Illustration | Checkpoint |
|---|---|---|---|---|---|
| 1 | RAM-14M | Swin-Large | COCO, VG, SBU, CC-3M, CC-12M | Provide strong image tagging ability. | Download link |
| 2 | Tag2Text-14M | Swin-Base | COCO, VG, SBU, CC-3M, CC-12M | Support comprehensive captioning and tagging. | Download link |
Download RAM pretrained checkpoints.
(Optional) To use RAM and Tag2Text in other projects, it is better to install recognize-anything as a package:
```bash
pip install -e .
```
Then the RAM and Tag2Text models can be imported in other projects:
```python
from ram.models import ram, tag2text
```
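As a quick sanity check, the snippet below sketches package-based tagging of a single demo image. It is a minimal sketch, not a fixed API: the constructor arguments (`image_size`, `vit`), the `inference_ram` helper, and the preprocessing transform mirror the repository's `inference_ram.py` demo script at the time of writing and may differ between versions, so verify against your installed copy.

```python
# Hedged sketch: load RAM as a package and tag one demo image.
# Constructor arguments and the inference helper follow the repo's demo
# scripts (assumptions); check them against your installed version.
import torch
from PIL import Image
from torchvision import transforms

from ram.models import ram
from ram import inference_ram  # assumed helper exposed by the package

device = "cuda" if torch.cuda.is_available() else "cpu"

# 384x384 preprocessing as used by the demo scripts (assumption).
transform = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = ram(pretrained="pretrained/ram_swin_large_14m.pth",
            image_size=384, vit="swin_l")
model.eval()
model = model.to(device)

image = transform(Image.open("images/demo/demo1.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    res = inference_ram(image, model)  # returns English and Chinese tags in the demo script
print("Image Tags:", res[0])
print("图像标签:", res[1])
```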
Get the English and Chinese outputs of the images:
```bash
python inference_ram.py --image images/demo/demo1.jpg \
  --pretrained pretrained/ram_swin_large_14m.pth
```

First, customize the recognition categories in `build_openset_label_embedding`, then get the tags of the images:

```bash
python inference_ram_openset.py --image images/openset_example.jpg \
  --pretrained pretrained/ram_swin_large_14m.pth
```

Get the tagging and captioning results:

```bash
python inference_tag2text.py --image images/demo/demo1.jpg \
  --pretrained pretrained/tag2text_swin_14m.pth
```

Or get the tagging and specified captioning results (optional):

```bash
python inference_tag2text.py --image images/demo/demo1.jpg \
  --pretrained pretrained/tag2text_swin_14m.pth \
  --specified-tags "cloud,sky"
```

We release two datasets, OpenImages-common (214 seen classes) and OpenImages-rare (200 unseen classes). Copy or sym-link test images of OpenImages v6 to datasets/openimages_common_214/imgs/ and datasets/openimages_rare_200/imgs/.
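The copy/sym-link step can be scripted; a minimal sketch is below. The OpenImages v6 source directory is a hypothetical path, and linking the full test split into both folders is only for convenience (the evaluation reads just the images it needs).

```python
# Hedged sketch: sym-link OpenImages v6 test images into the benchmark folders.
# OPENIMAGES_DIR is a hypothetical path; point it at your local OpenImages v6 test split.
import os
from pathlib import Path

OPENIMAGES_DIR = Path("/data/openimages_v6/test")  # assumption
TARGETS = [
    Path("datasets/openimages_common_214/imgs"),
    Path("datasets/openimages_rare_200/imgs"),
]

for target in TARGETS:
    target.mkdir(parents=True, exist_ok=True)
    for img in OPENIMAGES_DIR.glob("*.jpg"):
        link = target / img.name
        if not link.exists():
            os.symlink(img.resolve(), link)  # use shutil.copy2 to copy instead of linking
```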
To evaluate RAM on OpenImages-common:

```bash
python batch_inference.py \
  --model-type ram \
  --checkpoint pretrained/ram_swin_large_14m.pth \
  --dataset openimages_common_214 \
  --output-dir outputs/ram
```
To evaluate RAM's open-set capability on OpenImages-rare:

```bash
python batch_inference.py \
  --model-type ram \
  --checkpoint pretrained/ram_swin_large_14m.pth \
  --open-set \
  --dataset openimages_rare_200 \
  --output-dir outputs/ram_openset
```
To evaluate Tag2Text on OpenImages-common:

```bash
python batch_inference.py \
  --model-type tag2text \
  --checkpoint pretrained/tag2text_swin_14m.pth \
  --dataset openimages_common_214 \
  --output-dir outputs/tag2text
```
Please refer to `batch_inference.py` for more options. To get the P/R reported in Table 3 of our paper, pass `--threshold=0.86` for RAM and `--threshold=0.68` for Tag2Text.
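For intuition about what the threshold does, here is a hedged sketch of overall precision/recall for multi-label tagging. It illustrates the metric only, with made-up tags and scores; `batch_inference.py` remains the reference evaluation.

```python
# Hedged sketch: overall precision/recall for multi-label tagging at a score threshold.
# Illustration only; the official numbers come from batch_inference.py.

def precision_recall(pred_scores, gt_tags, threshold):
    """pred_scores: list of {tag: score} per image; gt_tags: list of sets of ground-truth tags."""
    tp = fp = fn = 0
    for scores, gts in zip(pred_scores, gt_tags):
        preds = {tag for tag, s in scores.items() if s >= threshold}
        tp += len(preds & gts)   # predicted and correct
        fp += len(preds - gts)   # predicted but wrong
        fn += len(gts - preds)   # missed ground-truth tags
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example with made-up scores, using the RAM threshold of 0.86:
scores = [{"cat": 0.95, "sofa": 0.90, "dog": 0.40}]
gts = [{"cat", "sofa", "blanket"}]
print(precision_recall(scores, gts, threshold=0.86))  # (1.0, 0.666...)
```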
To run batch inference on custom images, you can set up your own dataset following the format of the two given datasets.
At present, we can only open-source the forward function of Tag2Text. To train/finetune Tag2Text on a custom dataset, you can refer to the complete training codebase of BLIP and make the corresponding modifications.
The training code of RAM cannot be open-sourced for now, as it is still going through the company's internal review process.