 # recognize-anything
 
+[![Web Demo](https://img.shields.io/badge/🤗-HuggingFace%20Space-cyan.svg)](https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text)
+[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mhd-medfa/recognize-anything/blob/main/recognize_anything_demo.ipynb)
+
+Official PyTorch Implementation of <a href="https://recognize-anything.github.io/">Recognize Anything: A Strong Image Tagging Model </a> and <a href="https://tag2text.github.io/">Tag2Text: Guiding Vision-Language Model via Image Tagging</a>.
+
+- **Recognize Anything Model (RAM)** is an image tagging model that can recognize any common category with high accuracy.
+- **Tag2Text** is a vision-language model guided by tagging, which supports captioning, retrieval, and tagging.
+
+<!-- Welcome to try our [RAM & Tag2Text web Demo! 🤗](https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text) -->
+
+Both Tag2Text and RAM exhibit strong recognition ability.
+We have combined Tag2Text and RAM with localization models (Grounding-DINO and SAM) and developed a strong visual semantic analysis pipeline in the [Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything) project.
+
+![](./images/ram_grounded_sam.jpg)
+
+
+
+## :bulb: Highlight of RAM
+RAM is a strong image tagging model that can recognize any common category with high accuracy.
+- **Strong and general.** RAM exhibits exceptional image tagging capabilities with powerful zero-shot generalization:
+    - RAM showcases impressive zero-shot performance, significantly outperforming CLIP and BLIP.
+    - RAM even surpasses fully supervised methods (e.g., ML-Decoder).
+    - RAM performs competitively with the Google tagging API.
+- **Reproducible and affordable.** RAM has a low reproduction cost thanks to its open-source, annotation-free training dataset.
+- **Flexible and versatile.** RAM offers remarkable flexibility, catering to various application scenarios.
+
+
+<p align="center">
+ <table class="tg">
+  <tr>
+    <td class="tg-c3ow"><img src="images/experiment_comparison.png" align="center" width="800" ></td>
+  </tr>
+  <p align="center">(Green indicates fully supervised learning; blue indicates zero-shot performance.)</p>
+</table>
+</p>
+
+<p align="center">
+ <table class="tg">
+  <tr>
+    <td class="tg-c3ow"><img src="images/tagging_results.jpg" align="center" width="800" ></td>
+  </tr>
+</table>
+</p>
+
+RAM significantly improves the tagging ability of the Tag2Text framework.
+- **Accuracy.** RAM utilizes a **data engine** to **generate** additional annotations and **clean** incorrect ones, resulting in **higher accuracy** compared to Tag2Text.
+- **Scope.** RAM expands the number of fixed tags from 3,400+ to **[6,400+](./ram/data/ram_tag_list.txt)** (reduced to 4,500+ distinct semantic tags after merging synonyms), covering **more valuable categories**.
+  Moreover, RAM is equipped with **open-set capability**, making it feasible to recognize tags not seen during training.
+
+## :sunrise: Highlight of Tag2Text
+Tag2Text is an efficient and controllable vision-language model with tagging guidance.
+- **Tagging.** Tag2Text recognizes **[3,400+](./ram/data/tag_list.txt)** commonly used categories without manual annotations.
+- **Captioning.** Tag2Text integrates **tag information** into text generation as **guiding elements**, resulting in **more controllable and comprehensive descriptions**.
+- **Retrieval.** Tag2Text provides **tags** as **additional visible alignment indicators** for image-text retrieval.
+
+<p align="center">
+ <table class="tg">
+  <tr>
+    <td class="tg-c3ow"><img src="images/tag2text_framework.png" align="center" width="800" ></td>
+  </tr>
+</table>
+</p>
+
+
+<!-- ## :sparkles: Highlight Projects with other Models
+- [Tag2Text/RAM with Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything) is a strong and general pipeline for visual semantic analysis, which can automatically **recognize**, detect, and segment an image!
+- [Ask-Anything](https://github.com/OpenGVLab/Ask-Anything) is a multifunctional video question answering tool. Tag2Text provides powerful tagging and captioning capabilities as a fundamental component.
+- [Prompt-can-anything](https://github.com/positive666/Prompt-Can-Anything) is a gradio web library that integrates SOTA multimodal large models, including Tag2Text as the core model for image understanding -->
+
+
+<!-- 
+## :fire: News
+
+- **`2023/06/08`**: We release the [Recognize Anything Model (RAM) Tag2Text web demo 🤗](https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text), checkpoints and inference code!
+- **`2023/06/07`**: We release the [Recognize Anything Model (RAM)](https://recognize-anything.github.io/), a strong image tagging model!
+- **`2023/06/05`**: Tag2Text is combined with [Prompt-can-anything](https://github.com/positive666/Prompt-Can-Anything).
+- **`2023/05/20`**: Tag2Text is combined with [VideoChat](https://github.com/OpenGVLab/Ask-Anything).
+- **`2023/04/20`**: We marry Tag2Text with [Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything).
+- **`2023/04/10`**: Code and checkpoints are now available!
+- **`2023/03/14`**: [Tag2Text web demo 🤗](https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text) is available on Hugging Face Space!  -->
+
+
+
+
+
+## :writing_hand: TODO 
+
+- [x] Release Tag2Text demo.
+- [x] Release checkpoints.
+- [x] Release inference code.
+- [x] Release RAM demo and checkpoints.
+- [x] Release training code.
+- [ ] Release training datasets.
+
+
+
+## :toolbox: Checkpoints
+
+<!-- insert a table -->
+<table>
+  <thead>
+    <tr style="text-align: right;">
+      <th></th>
+      <th>Name</th>
+      <th>Backbone</th>
+      <th>Data</th>
+      <th>Illustration</th>
+      <th>Checkpoint</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>1</th>
+      <td>RAM-14M</td>
+      <td>Swin-Large</td>
+      <td>COCO, VG, SBU, CC-3M, CC-12M</td>
+      <td>Provides strong image tagging ability.</td>
+      <td><a href="https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text/blob/main/ram_swin_large_14m.pth">Download link</a></td>
+    </tr>
+    <tr>
+      <th>2</th>
+      <td>Tag2Text-14M</td>
+      <td>Swin-Base</td>
+      <td>COCO, VG, SBU, CC-3M, CC-12M</td>
+      <td>Supports comprehensive captioning and tagging.</td>
+      <td><a href="https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text/blob/main/tag2text_swin_14m.pth">Download link</a></td>
+    </tr>
+  </tbody>
+</table>
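+
+If you prefer to fetch a checkpoint in code rather than via the links above, one hedged option (assuming the files remain hosted in the Hugging Face Space linked in the table) is `huggingface_hub`:
+
+```python
+from huggingface_hub import hf_hub_download
+
+# Assumption: the checkpoint is stored in the Space referenced by the table above.
+ckpt_path = hf_hub_download(
+    repo_id="xinyu1205/Recognize_Anything-Tag2Text",
+    repo_type="space",
+    filename="ram_swin_large_14m.pth",
+    local_dir="pretrained",
+)
+print(ckpt_path)
+```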
+
+
+## :running: Model Inference
+
+### **Setting Up** ###
+
+1. Install the dependencies:
+
+```bash
+pip install -r requirements.txt
+```
+
+2. Download RAM pretrained checkpoints.
+
+3. (Optional) To use RAM and Tag2Text in other projects, it is best to install recognize-anything as a package:
+
+```bash
+pip install -e .
+```
+
+The RAM and Tag2Text models can then be imported into other projects:
+
+```python
+from ram.models import ram, tag2text
+```
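+
+As a minimal sketch of programmatic use (not an authoritative API reference): the helper names `get_transform` / `inference_ram` and the constructor arguments `image_size` / `vit` below are assumptions based on the bundled inference scripts, so please verify them against `inference_ram.py`.
+
+```python
+import torch
+from PIL import Image
+
+from ram import get_transform, inference_ram  # assumed helpers, see inference_ram.py
+from ram.models import ram
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+# Preprocessing and model construction; image_size/vit values mirror the demo script.
+transform = get_transform(image_size=384)
+model = ram(pretrained="pretrained/ram_swin_large_14m.pth",
+            image_size=384, vit="swin_l")
+model.eval()
+model.to(device)
+
+# Tag a single image; inference_ram is assumed to return (English tags, Chinese tags).
+image = transform(Image.open("images/demo/demo1.jpg")).unsqueeze(0).to(device)
+english_tags, chinese_tags = inference_ram(image, model)
+print(english_tags)
+```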
+
+### **RAM Inference** ###
+
+Get the English and Chinese tagging outputs for an image:
+
+```bash
+python inference_ram.py  --image images/demo/demo1.jpg \
+--pretrained pretrained/ram_swin_large_14m.pth
+```
+
+
+### **RAM Inference on Unseen Categories (Open-Set)** ###
+
+First, customize the recognition categories in [build_openset_label_embedding](./ram/utils/openset_utils.py), then get the tags of the images:
+
+```bash
+python inference_ram_openset.py  --image images/openset_example.jpg \
+--pretrained pretrained/ram_swin_large_14m.pth
+```
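+
+If you prefer to set custom categories in code rather than editing that file, here is a rough sketch; the helper imports and the model attributes touched below (`tag_list`, `label_embed`, `num_class`, `class_threshold`) are assumptions mirroring `inference_ram_openset.py`, so double-check them there:
+
+```python
+import numpy as np
+import torch
+from torch import nn
+
+from ram import inference_ram_openset                 # assumed helper
+from ram.utils import build_openset_label_embedding   # assumed import path
+
+# `model` and `image` are prepared as in the RAM snippet above.
+categories = ["corgi", "espresso machine", "paraglider"]  # example custom labels
+label_embed, categories = build_openset_label_embedding(categories)
+
+# Swap the fixed tag head for the custom open-set categories.
+model.tag_list = np.array(categories)
+model.label_embed = nn.Parameter(label_embed.float())
+model.num_class = len(categories)
+# Open-set tags have no tuned per-class thresholds, so use one global value.
+model.class_threshold = torch.ones(model.num_class) * 0.5
+
+english_tags = inference_ram_openset(image, model)
+print(english_tags)
+```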
+
+
+### **Tag2Text Inference** ###
+
+Get the tagging and captioning results:
+
+```bash
+python inference_tag2text.py  --image images/demo/demo1.jpg \
+--pretrained pretrained/tag2text_swin_14m.pth
+```
+
+Or get the captioning results guided by user-specified tags (optional):
+
+```bash
+python inference_tag2text.py  --image images/demo/demo1.jpg \
+--pretrained pretrained/tag2text_swin_14m.pth \
+--specified-tags "cloud,sky"
+```
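+
+For programmatic use, a similar hedged sketch (the `tag2text` constructor arguments, the `inference_tag2text` helper and its return order, and the 0.68 threshold are assumptions; consult `inference_tag2text.py` for the exact API):
+
+```python
+import torch
+from PIL import Image
+
+from ram import get_transform, inference_tag2text  # assumed helpers
+from ram.models import tag2text
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+transform = get_transform(image_size=384)
+model = tag2text(pretrained="pretrained/tag2text_swin_14m.pth",
+                 image_size=384, vit="swin_b")
+model.threshold = 0.68  # tagging threshold; matches the evaluation setting mentioned below
+model.eval()
+model.to(device)
+
+image = transform(Image.open("images/demo/demo1.jpg")).unsqueeze(0).to(device)
+# Assumed return order: (predicted tags, user-specified tags, generated caption).
+tags, specified_tags, caption = inference_tag2text(image, model, "None")
+print(tags, "|", caption)
+```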
+
+
+### **Batch Inference and Evaluation** ###
+We release two datasets `OpenImages-common` (214 seen classes) and `OpenImages-rare` (200 unseen classes). Copy or sym-link test images of [OpenImages v6](https://storage.googleapis.com/openimages/web/download_v6.html) to `datasets/openimages_common_214/imgs/` and `datasets/openimages_rare_200/imgs`.
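+
+For example, a small (hypothetical) helper to sym-link an already-downloaded OpenImages test split into the expected folder; `src_dir` is a placeholder to adapt to your local layout:
+
+```python
+import os
+from pathlib import Path
+
+src_dir = Path("/data/openimages/test")                # your local OpenImages copy (placeholder)
+dst_dir = Path("datasets/openimages_common_214/imgs")  # layout expected by batch_inference.py
+dst_dir.mkdir(parents=True, exist_ok=True)
+
+for img in src_dir.glob("*.jpg"):
+    link = dst_dir / img.name
+    if not link.exists():
+        os.symlink(img.resolve(), link)
+```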
+
+To evaluate RAM on `OpenImages-common`:
+
+```bash
+python batch_inference.py \
+  --model-type ram \
+  --checkpoint pretrained/ram_swin_large_14m.pth \
+  --dataset openimages_common_214 \
+  --output-dir outputs/ram
+```
+
+To evaluate RAM open-set capability on `OpenImages-rare`:
+
+```bash
+python batch_inference.py \
+  --model-type ram \
+  --checkpoint pretrained/ram_swin_large_14m.pth \
+  --open-set \
+  --dataset openimages_rare_200 \
+  --output-dir outputs/ram_openset
+```
+
+To evaluate Tag2Text on `OpenImages-common`:
+
+```bash
+python batch_inference.py \
+  --model-type tag2text \
+  --checkpoint pretrained/tag2text_swin_14m.pth \
+  --dataset openimages_common_214 \
+  --output-dir outputs/tag2text
+```
+
+Please refer to `batch_inference.py` for more options. To reproduce the P/R values in Table 3 of our paper, pass `--threshold=0.86` for RAM and `--threshold=0.68` for Tag2Text.
+
+To run batch inference on custom images, you can set up your own dataset following the structure of the two given datasets.
+
+
+## :golfing: Model Training/Finetuning
+
+
+### **Tag2Text** ###
+At present, we can only open-source [the forward function of Tag2Text](./ram/models/tag2text.py#L141).
+To train/finetune Tag2Text on a custom dataset, you can refer to the complete training codebase of [BLIP](https://github.com/salesforce/BLIP/tree/main) and make the following modifications:
+1. Replace the "models/blip.py" file with the current "[tag2text.py](./ram/models/tag2text.py)" model file;
+2. Load additional tags based on the original dataloader (see the sketch below).
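+
+As a rough illustration of step 2, a BLIP-style dataset that also yields tags might look like the following; the class name and the annotation keys (`"image"`, `"caption"`, `"tags"`) are hypothetical and should be adapted to your own data:
+
+```python
+import json
+
+from PIL import Image
+from torch.utils.data import Dataset
+
+class TaggedCaptionDataset(Dataset):
+    """Hypothetical dataset: yields (image, caption, tags) instead of (image, caption)."""
+
+    def __init__(self, ann_file, image_root, transform):
+        self.ann = json.load(open(ann_file))
+        self.image_root = image_root
+        self.transform = transform
+
+    def __len__(self):
+        return len(self.ann)
+
+    def __getitem__(self, idx):
+        item = self.ann[idx]
+        image = Image.open(f"{self.image_root}/{item['image']}").convert("RGB")
+        image = self.transform(image)
+        caption = item["caption"]
+        tags = ",".join(item["tags"])  # extra tag supervision consumed by tag2text.py
+        return image, caption, tags
+```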
+
+### **RAM** ###
+
+The training code of RAM cannot be open-sourced for now, as it is still going through the company's internal processes.
+
+