
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Vision-Language Foundation Model for Remote Sensing

RS5M Dataset

Pre-trained Vision-Language Models (VLMs) utilizing extensive image-text paired data have demonstrated unprecedented image-text association capabilities, achieving remarkable results across various downstream tasks. A critical challenge is how to transfer existing large-scale pre-trained VLMs, which are trained on common objects, to domain-specific downstream tasks. In this paper, we propose a new framework that includes the Domain pre-trained Vision-Language Model (DVLM), bridging the gap between the General Vision-Language Model (GVLM) and domain-specific downstream tasks. Moreover, we present an image-text paired dataset in the field of remote sensing (RS), RS5M, which has 5 million RS images with English descriptions. The dataset is obtained by filtering publicly available image-text paired datasets and by captioning label-only RS datasets with a pre-trained VLM. It is the first large-scale RS image-text paired dataset. Additionally, we fine-tuned the CLIP model and tried several Parameter-Efficient Fine-Tuning methods on RS5M to implement the DVLM. Experimental results show that our proposed dataset is highly effective for various tasks, and our model GeoRSCLIP improves upon the baseline or previous state-of-the-art model by 3% ~ 20% in Zero-shot Classification (ZSC) tasks, 3% ~ 6% in Remote Sensing Cross-Modal Text–Image Retrieval (RSCTIR) and 4% ~ 5% in Semantic Localization (SeLo) tasks.


GeoRSCLIP Model

Installation

  • Install PyTorch following the instructions on the official website (we tested torch 2.0.1 with CUDA 11.8 and torch 2.1.0 with CUDA 12.1):
  pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
  • Install the other dependencies:
  pip install pillow pandas scikit-learn ftfy tqdm matplotlib transformers adapter-transformers open_clip_torch pycocotools timm clip-benchmark torch-rs

Usage

  • Clone the checkpoint repository:
  git clone https://huggingface.co/Zilun/GeoRSCLIP
  cd GeoRSCLIP
  • Unzip the test data:
  unzip data/rs5m_test_data.zip
  • Run the inference script:
  python codebase/inference.py --ckpt-path /your/local/path/to/RS5M_ViT-B-32.pt --test-dataset-dir /your/local/path/to/rs5m_test_data
  • (Optional) If you just want to load the GeoRSCLIP model:

```python
import open_clip
import torch
from inference_tool import get_preprocess

# ViT-B/32 checkpoint, loaded on top of the OpenAI pre-trained weights
ckpt_path = "/your/local/path/to/RS5M_ViT-B-32.pt"
model, _, _ = open_clip.create_model_and_transforms("ViT-B/32", pretrained="openai")
checkpoint = torch.load(ckpt_path, map_location="cpu")
msg = model.load_state_dict(checkpoint, strict=False)  # inspect msg for missing/unexpected keys
model = model.to("cuda")
img_preprocess = get_preprocess(image_resolution=224)
```

  For the ViT-H/14 checkpoint, load the corresponding weights on top of the laion2b pre-trained model:

```python
import open_clip
import torch
from inference_tool import get_preprocess

# ViT-H/14 checkpoint
ckpt_path = "/your/local/path/to/RS5M_ViT-H-14.pt"
model, _, _ = open_clip.create_model_and_transforms("ViT-H/14", pretrained="laion2b_s32b_b79k")
checkpoint = torch.load(ckpt_path, map_location="cpu")
msg = model.load_state_dict(checkpoint, strict=False)
model = model.to("cuda")
img_preprocess = get_preprocess(image_resolution=224)
```
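  • As a quick sanity check after loading, a minimal zero-shot scoring sketch is shown below. It is not part of the released codebase: it assumes the `model` and `img_preprocess` objects from above (ViT-B/32 variant), a recent open_clip with `get_tokenizer`, and an illustrative image path and class prompts.

```python
from PIL import Image

# Tokenizer matching the loaded variant ("ViT-B-32" here; use "ViT-H-14" for the H/14 model).
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Hypothetical inputs -- replace with your own image and prompts.
image = img_preprocess(Image.open("example_rs_image.jpg")).unsqueeze(0).to("cuda")
texts = tokenizer([
    "a satellite image of an airport",
    "a satellite image of a forest",
]).to("cuda")

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    # Cosine similarity via L2-normalized features, scaled as in CLIP.
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(probs)  # per-prompt probabilities for the image
```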

Experiment Result

  • All tasks

    Zero-shot classification (accuracy, %):

    | Model | EuroSAT | RESISC45 | AID |
    |---|---|---|---|
    | GeoRSCLIP-ViTB32 | 61.40 | 72.74 | 74.42 |
    | GeoRSCLIP-ViTH14 | 67.47 | 73.83 | 76.33 |

    Cross-modal retrieval (R@k, %):

    | Model | Benchmark | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | mR |
    |---|---|---|---|---|---|---|---|---|
    | GeoRSCLIP-ViTB32 | RSITMD | 17.92 | 34.96 | 46.02 | 14.12 | 41.46 | 57.52 | 35.33 |
    | GeoRSCLIP-ViTB32 | RSICD | 12.17 | 28.45 | 38.61 | 9.31 | 26.51 | 41.28 | 26.06 |
    | GeoRSCLIP-ViTH14 | RSITMD | 23.45 | 42.92 | 53.32 | 18.01 | 44.60 | 59.96 | 40.38 |
    | GeoRSCLIP-ViTH14 | RSICD | 14.27 | 29.55 | 40.44 | 11.38 | 30.80 | 44.41 | 28.48 |

    Semantic localization (SeLo):

    | Model | Rsu | Rda | Ras | Rmi |
    |---|---|---|---|---|
    | GeoRSCLIP-ViTB32 | 0.755636 | 0.730925 | 0.258044 | 0.744670 |
    | GeoRSCLIP-ViTH14 | 0.759515 | 0.741806 | 0.256649 | 0.749430 |
  • RSCTIR Task

    • RSICD test set

      | Method | Paradigm | Tuned on | R@1 (I2T) | R@5 (I2T) | R@10 (I2T) | R@1 (T2I) | R@5 (T2I) | R@10 (T2I) | mR |
      |---|---|---|---|---|---|---|---|---|---|
      | LW-MCR | Supervised | RSICD | 3.29% | 12.52% | 19.93% | 4.66% | 17.51% | 30.02% | 14.66% |
      | VSE++ | Supervised | RSICD | 3.38% | 9.51% | 17.46% | 2.82% | 11.32% | 18.10% | 10.43% |
      | AFMFN | Supervised | RSICD | 5.39% | 15.08% | 23.40% | 4.90% | 18.28% | 31.44% | 16.42% |
      | KCR | Supervised | RSICD | 5.84% | 22.31% | 36.12% | 4.76% | 18.59% | 27.20% | 19.14% |
      | GaLR | Supervised | RSICD | 6.59% | 19.85% | 31.04% | 4.69% | 19.48% | 32.13% | 18.96% |
      | SWAN | Supervised | RSICD | 7.41% | 20.13% | 30.86% | 5.56% | 22.26% | 37.41% | 20.61% |
      | HVSA | Supervised | RSICD | 7.47% | 20.62% | 32.11% | 5.51% | 21.13% | 34.13% | 20.16% |
      | PIR | Supervised | RSICD | 9.88% | 27.26% | 39.16% | 6.97% | 24.56% | 38.92% | 24.46% |
      | FAAMI | Supervised | RSICD | 10.44% | 22.66% | 30.89% | 8.11% | 25.59% | 41.37% | 23.18% |
      | Multilanguage | Supervised | RSICD | 10.70% | 29.64% | 41.53% | 9.14% | 28.96% | 44.59% | 27.42% |
      | PE-RSITR | GVLM + FT | RSICD | 14.13% | 31.51% | 44.78% | 11.63% | 33.92% | 50.73% | 31.12% |
      | MTGFE | Supervised | RSICD | 15.28% | 37.05% | 51.60% | 8.67% | 27.56% | 43.92% | 30.68% |
      | RemoteCLIP | GVLM + FT | RET-3 + DET-10 + SEG-4 | 17.02% | 37.97% | 51.51% | 13.71% | 37.11% | 54.25% | 35.26% |
      | CLIP-Baseline | GVLM | - | 5.31% | 14.18% | 23.70% | 5.78% | 17.73% | 27.76% | 15.74% |
      | GeoRSCLIP-FT | GVLM + FT | RS5M + RSICD | 22.14% | 40.53% | 51.78% | 15.26% | 40.46% | 57.79% | 38.00% |
      | GeoRSCLIP-FT | GVLM + FT | RS5M + RET-2 | 21.13% | 41.72% | 55.63% | 15.59% | 41.19% | 57.99% | 38.87% |
    • RSITMD test set

      | Method | Paradigm | Tuned on | R@1 (I2T) | R@5 (I2T) | R@10 (I2T) | R@1 (T2I) | R@5 (T2I) | R@10 (T2I) | mR |
      |---|---|---|---|---|---|---|---|---|---|
      | LW-MCR | Supervised | RSITMD | 10.18% | 28.98% | 39.82% | 7.79% | 30.18% | 49.78% | 27.79% |
      | VSE++ | Supervised | RSITMD | 10.38% | 27.65% | 39.60% | 7.79% | 24.87% | 38.67% | 24.83% |
      | AFMFN | Supervised | RSITMD | 11.06% | 29.20% | 38.72% | 9.96% | 34.03% | 52.96% | 29.32% |
      | HVSA | Supervised | RSITMD | 13.20% | 32.08% | 45.58% | 11.43% | 39.20% | 57.45% | 33.15% |
      | SWAN | Supervised | RSITMD | 13.35% | 32.15% | 46.90% | 11.24% | 40.40% | 60.60% | 34.11% |
      | GaLR | Supervised | RSITMD | 14.82% | 31.64% | 42.48% | 11.15% | 36.68% | 51.68% | 31.41% |
      | FAAMI | Supervised | RSITMD | 16.15% | 35.62% | 48.89% | 12.96% | 42.39% | 59.96% | 35.99% |
      | MTGFE | Supervised | RSITMD | 17.92% | 40.93% | 53.32% | 16.59% | 48.50% | 67.43% | 40.78% |
      | PIR | Supervised | RSITMD | 18.14% | 41.15% | 52.88% | 12.17% | 41.68% | 63.41% | 38.24% |
      | Multilanguage | Supervised | RSITMD | 19.69% | 40.26% | 54.42% | 17.61% | 49.73% | 66.59% | 41.38% |
      | PE-RSITR | GVLM + FT | RSITMD | 23.67% | 44.07% | 60.36% | 20.10% | 50.63% | 67.97% | 44.47% |
      | RemoteCLIP | GVLM + FT | RET-3 + DET-10 + SEG-4 | 27.88% | 50.66% | 65.71% | 22.17% | 56.46% | 73.41% | 49.38% |
      | CLIP-Baseline | GVLM | - | 9.51% | 23.01% | 32.74% | 8.81% | 27.88% | 43.19% | 24.19% |
      | GeoRSCLIP-FT | GVLM + FT | RS5M + RSITMD | 30.09% | 51.55% | 63.27% | 23.54% | 57.52% | 74.60% | 50.10% |
      | GeoRSCLIP-FT | GVLM + FT | RS5M + RET-2 | 32.30% | 53.32% | 67.92% | 25.04% | 57.88% | 74.38% | 51.81% |

GeoRSSD

RS5M Dataset Download (About 500GB, 128 webdataset tars)

  • RS5M
  • Geometa
  • MetaFile

How to use this dataset

Option 1 (Recommended)

  • We provide webdataset-format files containing paired images and text for sequential data I/O. Do NOT untar the files.
  1. Download the webdataset files from the link provided above. The dataset directory should look like this:
        /nas/zilun/RS5M_v5/webdataset                                                       
        ├── train                        
            ├── pub11-train-0000.tar                                                         
            ├── pub11-train-0001.tar
            ├── ......
            ├── pub11-train-0030.tar                                         
            ├── pub11-train-0031.tar
            ├── rs3-train-0000.tar                                              
            ├── rs3-train-0001.tar
            ├── ......
            ├── rs3-train-0030.tar                                              
            ├── rs3-train-0031.tar
        ├── val                        
            ├── pub11-val-0000.tar                                                         
            ├── pub11-val-0001.tar
            ├── ......
            ├── pub11-val-0030.tar                                         
            ├── pub11-val-0031.tar
            ├── rs3-val-0000.tar                                              
            ├── rs3-val-0001.tar
            ├── ......
            ├── rs3-val-0030.tar                                              
            ├── rs3-val-0031.tar
    
  2. An example data I/O pipeline using the webdataset files is provided in "dataloader.py". Throughput is ~1800 images per second (measured with a Ryzen 3950X CPU and dual-channel 3200 MHz DDR4 RAM). A minimal standalone sketch is shown after this list.
  3. Run the following to try it out:
    python dataloader.py --train_dir /media/zilun/mx500/RS5M/data/train --val_dir /media/zilun/mx500/RS5M/data/val --num_worker 16 --batch_size 400 --num_shuffle 10000
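For reference, here is a minimal standalone reading sketch (independent of dataloader.py). It assumes the webdataset package and that each sample in the tars stores its image under a `jpg` key and its caption under a `txt` key; inspect a shard first if your keys differ:

```python
import glob

import webdataset as wds  # pip install webdataset

# Assumed shard location -- adjust to your download directory.
shards = sorted(glob.glob("/nas/zilun/RS5M_v5/webdataset/train/*.tar"))

dataset = (
    wds.WebDataset(shards)
    .shuffle(10000)          # sample-level shuffle buffer
    .decode("pil")           # decode image bytes into PIL images
    .to_tuple("jpg", "txt")  # yield (image, caption) pairs
)

# Iterate a single sample to verify the pipeline works.
for image, caption in dataset:
    print(image.size, caption[:80])
    break
```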

Option 2

  • We also provide the raw image files, which can be used together with the metafiles from Hugging Face. Given the large volume of image data, an SSD is recommended.
  1. Download the files from the Dropbox or Baidu disk link provided. The dataset directory should look like this:
        /nas/zilun/RS5M_v5/img_only                                                      
        ├── pub11                        
            ├── pub11.tar.gz_aa                                                       
            ├── pub11.tar.gz_ab
            ├── ......
            ├── pub11.tar.gz_ba                                              
            ├── pub11.tar.gz_bc
        ├── rs3                        
            ├── ben
                ├── ben.tar.gz_aa                                       
            ├── fmow
                ├── fmow.tar.gz_aa
                ├── fmow.tar.gz_ab
                ├── ......
                ├── fmow.tar.gz_ap
                ├── fmow.tar.gz_aq
            ├── millionaid
                ├── millionaid.tar.gz_aa
                ├── millionaid.tar.gz_ab
                ├── ......
                ├── millionaid.tar.gz_ap
                ├── millionaid.tar.gz_aq                                   
  2. Combine and untar the files. You will then have the image files.
     # optional: how the dataset was split and compressed in the first place
     tar -I pigz -cvf - pub11 | split --bytes=500MB - pub11.tar.gz_
    
     # combine the parts into a single archive
     cat pub11.tar.gz_* > pub11.tar.gz
    
     # extract
     pigz -dc pub11.tar.gz | tar -xvf - -C /data/zilun/RS5M_v5/img_only/
    

Statistics

PUB11 Subset

| Name | Amount | After Keyword Filtering | Downloaded | Invalid (removed) | Duplicate (removed) | Outlier (removed by VLM & RS detector) | Remaining |
|---|---|---|---|---|---|---|---|
| LAION2B | 2.3B | 1,980,978 | 1,737,584 | 102 | 343,017 | 333,686 | 1,060,779 |
| COYO700M | 746M | 680,089 | 566,076 | 28 | 245,650 | 94,329 | 226,069 |
| LAIONCOCO | 662M | 3,014,283 | 2,549,738 | 80 | 417,689 | 527,941 | 1,604,028 |
| LAION400M | 413M | 286,102 | 241,324 | 25 | 141,658 | 23,860 | 75,781 |
| WIT | 37M | 98,540 | 93,754 | 0 | 74,081 | 9,299 | 10,374 |
| YFCC15M | 15M | 27,166 | 25,020 | 0 | 265 | 15,126 | 9,629 |
| CC12M | 12M | 18,892 | 16,234 | 0 | 1,870 | 4,330 | 10,034 |
| Redcaps | 12M | 2,842 | 2,686 | 0 | 228 | 972 | 1,486 |
| CC3M | 3.3M | 12,563 | 11,718 | 1 | 328 | 1,817 | 9,572 |
| SBU | 1M | 102 | 91 | 0 | 4 | 36 | 51 |
| VG | 0.1M | 26 | 26 | 0 | 0 | 20 | 6 |
| Total | 4.2B | 6,121,583 | 5,244,251 | 236 | 1,224,790 | 1,011,416 | 3,007,809 |

RS3 Subset

| Name | Amount | Original Split | Has Class Label |
|---|---|---|---|
| FMoW | 727,144 | Train | Yes |
| BigEarthNet | 344,385 | Train | Yes |
| MillionAID | 990,848 | Test | No |
| Total | 2,062,377 | - | - |

Geo-Statistics

  • Statistics of geo-metadata for images containing UTM zone, latitude, and longitude information:

    • YFCC14M: 7841
    • FMoW: 727,144
    • BigEarthNet: 344,385


  • Entities with the "GPE" (geopolitical entity) label were extracted using NLTK's NER (a sketch follows this list):

    • Applied to captions in the PUB11 subset
    • Extraction result
    • 880,354 image-text pairs contain a "GPE" entity, most of which are city or country names.
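A minimal sketch of this kind of extraction with NLTK is shown below; it illustrates the technique, and the authors' exact pipeline may differ (the example caption is invented):

```python
import nltk

# One-time downloads of the standard NLTK models (names may differ in very new NLTK releases).
for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)

def extract_gpe(caption):
    """Return the GPE (geopolitical entity) spans found in a caption."""
    tokens = nltk.word_tokenize(caption)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)  # keeps fine-grained labels such as GPE
    return [
        " ".join(token for token, _ in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == "GPE"
    ]

print(extract_gpe("An aerial view of Zurich, Switzerland in winter."))
# e.g. ['Zurich', 'Switzerland']
```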

BLIP2 fine-tuned with RSITMD dataset

  • Tuned with LoRA (a hedged configuration sketch follows)
  • Checkpoint and inference code can be found through this link
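For orientation, a sketch of attaching LoRA adapters to BLIP-2 with Hugging Face PEFT is given below. The base checkpoint follows the paper's mention of the OPT 6.7B variant in half precision, but the LoRA rank, alpha, dropout, and target modules are illustrative assumptions, not the released configuration:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import Blip2ForConditionalGeneration, Blip2Processor

# BLIP-2 with the OPT 6.7B language model, loaded in half precision.
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b", torch_dtype=torch.float16
)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-6.7b")

# Illustrative LoRA hyperparameters -- assumptions, not the authors' exact config.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in the OPT decoder
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```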

Image-Text Pair Rating Tool

Awesome Remote Sensing Vision-Language Models & Papers

Contact

Email: [email protected]

WeChat: zilun960822

Slack Group: https://join.slack.com/t/visionlanguag-fks1990/shared_invite/zt-290vxhx5y-SUkCzf2aH3G9eu3lye2YvQ

Acknowledgement

We thank Delong Chen; his ITRA framework helped us fine-tune the CLIP-like models: https://itra.readthedocs.io/en/latest/Contents/introduction/overview.html

BibTeX Citation

If you use RS5M or GeoRSCLIP in a research paper, we would appreciate your citing the following:

@misc{zhang2023rs5m,
      title={RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model}, 
      author={Zilun Zhang and Tiancheng Zhao and Yulong Guo and Jianwei Yin},
      year={2023},
      eprint={2306.11300},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Some other citations:

@article{Long2021DiRS,
  title={On Creating Benchmark Dataset for Aerial Image Interpretation: Reviews, Guidances and Million-AID},
  author={Yang Long and Gui-Song Xia and Shengyang Li and Wen Yang and Michael Ying Yang and Xiao Xiang Zhu and Liangpei Zhang and Deren Li},
  journal={IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing},
  year={2021},
  volume={14},
  pages={4205-4230}
}

@inproceedings{Sumbul_2019,
  title={Bigearthnet: A Large-Scale Benchmark Archive for Remote Sensing Image Understanding},
  url={http://dx.doi.org/10.1109/IGARSS.2019.8900532},
  DOI={10.1109/igarss.2019.8900532},
  booktitle={IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium},
  publisher={IEEE},
  author={Sumbul, Gencer and Charfuelan, Marcela and Demir, Begum and Markl, Volker},
  year={2019},
  month=jul
}

@inproceedings{fmow2018,
  title={Functional Map of the World},
  author={Christie, Gordon and Fendley, Neil and Wilson, James and Mukherjee, Ryan},
  booktitle={CVPR},
  year={2018}
}


rs5m's Issues

Dataset format

Dear authors,
What form does the dataset take? Is the only annotation the image caption (text description), or are there also class labels, bounding boxes, segmentation masks, or other annotations?
Thanks!

Links for download and CLIP model weights

Dear authors,
Thank you for your work! When will the dataset be available for download? Additionally, when will the weights of the CLIP model fine-tuned on the RS5M dataset be released?
Thanks
Xavi

Question about the training data

Hi, the image sizes differ across the subsets, e.g., some are 500×500 and others 120×120. How are they handled before training? Are the training images resized to a uniform size?

Image Size and GSD

Hello, I would like to inquire about the range of image dimensions and Ground Sampling Distance (GSD) in the RS5M dataset. It seems that this information is not specified in the paper. Thank you.

msrgb and rgb

fmow/train/airport/airport_0/airport_0_0_msrgb.jpg,fmow,fmow_airport_0_0_msrgb.jpg,"above the Winter landscapes of Etimesgut, Turkey, the view is transfixed by airport at its center and center-left blocks. with sharpness ensured by a ground sample distance of 2.45 meters, it's a moment frozen from the utm zone 36S, timestamped 8 o'clock, February 7, 2002. this satellite image shows an airport close to a city"

fmow/train/airport/airport_0/airport_0_0_rgb.jpg,fmow,fmow_airport_0_0_rgb.jpg,"sailing over Etimesgut, Turkey in Winter, the lens emphasizes airport nestled at the center and center-left blocks. the details, highlighted by a ground sample distance of 2.45 meters, associate it with the utm zone 36S, time-stamped 8 o'clock, February 7, 2002. a satellite image of a large airport"

Hello,

I have a few questions regarding image datasets:

1. Are these two images the same?
2. Is "msrgb" an abbreviation for "multispectral RGB", referring to the selection of three bands (RGB channels) from a multispectral image, specifically from the fMoW-full dataset?
3. Does "rgb" come from the fMoW-rgb dataset?
4. What is the difference between "msrgb" and "rgb"?
Thank you.

Some questions about training

Thank you for your great work. I ran into some problems reproducing GeoRSCLIP: full fine-tuning open_clip (RS5M) + the RSICD training set gives a result around 34, well below the 38 reported in the paper. Could you share some implementation details? Looking forward to your reply.

Clip weights

Dear author, I would like to inquire about the release date for CLIP weights fine-tuned on the RS5M dataset. At present, it appears that only the BLIP-2 inference code is available.

Inference error with the fine-tuned CLIP model

Running inference with the LoRA weights raises the error below. What could be the cause? Thanks!
2023-10-18 10:04:14.469988: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Seed set to 2023
---------load dataset---------
Resolving data files: 100% 454/454 [00:00<00:00, 94581.73it/s]
---------load model------
Loading checkpoint shards: 100% 4/4 [03:02<00:00, 45.72s/it]
use lora weights for blip2
---------start test--------------
<PIL.TiffImagePlugin.TiffImageFile image mode=RGB size=256x256 at 0x78FCDA7F6920>
Traceback (most recent call last):
File "/content/gdrive/MyDrive/blip/blip2_peft_inference.py", line 106, in
main()
File "/content/gdrive/MyDrive/blip/blip2_peft_inference.py", line 71, in main
generated_output = model.generate(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/blip_2/modeling_blip_2.py", line 1880, in generate
outputs = self.language_model.generate(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1644, in generate
input_ids, model_kwargs = self._expand_inputs_for_generation(
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 742, in _expand_inputs_for_generation
input_ids = input_ids.repeat_interleave(expand_size, dim=0)
File "/usr/local/lib/python3.10/dist-packages/torch/_meta_registrations.py", line 963, in meta_repeat_interleave_Tensor
raise RuntimeError("cannot repeat_interleave a meta tensor without output_size")
RuntimeError: cannot repeat_interleave a meta tensor without output_size

Weights

Hello, could you provide a Baidu disk link for GeoRSSD? Downloads from Hugging Face keep failing.

A question about finetuning on RSICD

Thank you very much for your outstanding work!
I have a question that I haven't quite understood. When fine-tuning your RS5M model on RSICD or RSITMD using the method outlined in the paper (InfoNCE, lr=1e-6), I did not achieve the expected performance. Taking RSICD as an example, the paper and the RET-2 weights you provided give a result around 38, but when I fine-tuned my own RS5M ViT-B-32 model, the result was around 34. Could you provide more details on fine-tuning RET-2 or RSICD so that I can better replicate the process? Thank you very much.

How are the FID scores for RS-SD and SD computed?

Thank you for the outstanding work of building such a large remote sensing dataset!
I have a small question about the FID scores of RS-SD and SD and hope you can clarify it.
As shown below, you computed FID scores separately for RS-SD and SD:
[screenshot: FID results]
But as we know, FID requires samples from the real distribution as a reference:
[screenshot: FID formula]
Unlike image-to-image generation, where the source image can be compared directly with the generated one, Figures 24 and 25 in the paper are both text-to-image examples. My question is: how were the ground-truth samples constructed, i.e., where does the X in the formula above come from?
Looking forward to your reply, thanks!

The method for generating captions

Hello:
Thank you for sharing such a great dataset and model. I have a question about the caption-generation method I noticed in your paper: captions can be generated by the tuned BLIP2 model (tuning details can be found in Appendix B10) [16] with the OPT 6.7B checkpoint in half precision from Hugging Face.
Could you please send me your program for generating captions?
Thanks.

A question about the RS3 subset

Thank you for such great work on RS5M.
I am following your work and have encountered some problems. I would like to ask the following:

  1. There are two caption tags in RS3, top-cap and ssl-cap. What is the meaning of and the difference between the two, especially ssl-cap? I would also like to know which one you ultimately used to train the model.
  2. Where did you download the BigEarthNet subset of RS3? The file names in the official download are not sequence IDs, but the data in your paper are indexed by sequence ID (e.g., 45786). What should I do to use this data: where should I download it, and how should I process it?
    Very much looking forward to your reply!

Code for Fine-tuning GeoRSCLIP

Great paper and work!

I have not found the fine-tuning or training code in the repository. Could you release it?

Thanks!

about RS-SD

Hello! The paper mentions that you fine-tuned Stable Diffusion on RS5M. Could you share exactly how SD was fine-tuned? Were all parameters, including the VAE, unfrozen for fine-tuning? If convenient, could you open-source the RS-SD weights? Many thanks!

Size mismatch when loading the pre-trained weights

Code snippet:
ckpt_path = "E:\Modelpth\RS5M_ViT-B-32.pt"
model, _, _ = open_clip.create_model_and_transforms("ViT-B/32", pretrained="openai")
checkpoint = torch.load(ckpt_path, map_location="cpu")
msg = model.load_state_dict(checkpoint, strict=False)
model = model.to("cuda")
img_preprocess = get_preprocess(
image_resolution=224,
)
Error:
RuntimeError: Error(s) in loading state_dict for CLIP:
size mismatch for visual.positional_embedding: copying a param with shape torch.Size([50, 768]) from checkpoint, the shape in current model is torch.Size([65, 768]).

Inconsistent code between the 1p and 20p models at RS-SD inference

Hi, when running RS-SD inference I followed your instructions on Hugging Face and used the Dreambooth code:

from diffusers import StableDiffusionPipeline
import torch

model_id = r"F:\huggingface\Zilun\GeoRSSD\checkpoint\20p\22000"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = ["there is an aerial view of some buildings", "an aerial view of the campus and surrounding fields"]
images = pipe(prompt, num_inference_steps=50, guidance_scale=7.5, height=512, width=512).images

for index, image in enumerate(images):
    image.save(prompt[index] + ".png")

The 20p model works fine, but the 1p model raises the following error:

OSError: Error no file named scheduler_config.json found in directory F:\huggingface\Zilun\GeoRSSD\checkpoint\1p\18000.

The full traceback:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm 2023.3\plugins\python\helpers\pydev\pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1, in <module>
  File "C:\Program Files\JetBrains\PyCharm 2023.3\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2023.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "D:\Code\Python\Git\kohya_ss\geo_rs_sd.py", line 12, in <module>
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
  File "D:\Code\Python\Git\kohya_ss\venv\lib\site-packages\diffusers\pipelines\pipeline_utils.py", line 1105, in from_pretrained
    loaded_sub_model = load_sub_model(
  File "D:\Code\Python\Git\kohya_ss\venv\lib\site-packages\diffusers\pipelines\pipeline_utils.py", line 475, in load_sub_model
    loaded_sub_model = load_method(cached_folder, **loading_kwargs)
  File "D:\Code\Python\Git\kohya_ss\venv\lib\site-packages\diffusers\schedulers\scheduling_utils.py", line 140, in from_pretrained
    config, kwargs, commit_hash = cls.load_config(
  File "D:\Code\Python\Git\kohya_ss\venv\lib\site-packages\diffusers\configuration_utils.py", line 364, in load_config
    raise EnvironmentError(
OSError: Error no file named scheduler_config.json found in directory F:\huggingface\Zilun\GeoRSSD\checkpoint\1p\18000.

I checked the 1p model folder and it indeed does not contain scheduler_config.json, while the 20p folder does. How can this be resolved? Could you share your inference code?
