Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
Meng Chu¹, Zhedong Zheng²*, Wei Ji¹, Tingyu Wang³, Tat-Seng Chua¹
¹ School of Computing, National University of Singapore, Singapore
² FST and ICI, University of Macau, China
³ School of Communication Engineering, Hangzhou Dianzi University, China
For users who run into CUDA out-of-memory errors, we have prepared a smaller test set that fits on 24 GB GPUs. Download the dataset from https://huggingface.co/datasets/truemanv5666/GeoText1652_Dataset and use test_24G_version.json
GeoText-1652 is a benchmark dataset, published at ECCV 2024, for natural language-guided drone navigation with spatial relation matching. It connects natural language processing, computer vision, and robotics, with the aim of enabling more intuitive and flexible drone control systems.
- Multi-platform imagery: drone, satellite, and ground cameras
- Covers multiple universities with no overlap between train and test sets
- Rich annotations including global descriptions, bounding boxes, and spatial relations
The training and test sets both include images, global descriptions, bbox-text pairs, and building numbers. The 33 universities in the training set and the 39 universities in the test set do not overlap. Three platforms are covered: drone, satellite, and ground cameras.
| Split | #Imgs | #Global Descriptions | #Bbox-Texts | #Classes | #Universities | 
|---|---|---|---|---|---|
| Training (Drone) | 37,854 | 113,562 | 113,367 | 701 | 33 | 
| Training (Satellite) | 701 | 2,103 | 1,709 | 701 | 33 | 
| Training (Ground) | 11,663 | 34,989 | 14,761 | 701 | 33 | 
| Test (Drone) | 51,355 | 154,065 | 140,179 | 951 | 39 | 
| Test (Satellite) | 951 | 2,853 | 2,006 | 951 | 39 | 
| Test (Ground) | 2,921 | 8,763 | 4,023 | 793 | 39 | 
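A quick arithmetic check on the table above: in every split, the number of global descriptions is exactly three times the number of images. The sketch below re-derives that ratio from the table values. The variable and function names are illustrative and are not part of the released code.

```python
# (images, global descriptions) per split, copied from the statistics table.
SPLITS = {
    "Training (Drone)": (37_854, 113_562),
    "Training (Satellite)": (701, 2_103),
    "Training (Ground)": (11_663, 34_989),
    "Test (Drone)": (51_355, 154_065),
    "Test (Satellite)": (951, 2_853),
    "Test (Ground)": (2_921, 8_763),
}


def descriptions_per_image(splits):
    """Return the ratio of global descriptions to images for each split."""
    return {name: descs / imgs for name, (imgs, descs) in splits.items()}


ratios = descriptions_per_image(SPLITS)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} global descriptions per image")
```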
- Google Drive: GeoText-1652 Dataset
- HuggingFace Hub:
  - Dataset: truemanv5666/GeoText1652_Dataset
  - Model: truemanv5666/GeoText1652_model
 
This dataset is designed to support the development and testing of models in geographical location recognition, providing images from multiple views at numerous unique locations.
GeoText_Dataset_Official/
├── test/
│ ├── gallery_no_train(250)/
│ │ ├── 0001/
│ │ │ ├── drone_view.jpg
│ │ │ ├── street_view.jpg
│ │ │ ├── satellite_view.jpg
│ │ ├── 0002/
│ │ ├── ... // More locations
│ │ ├── 0250/
│ ├── query(701)/
│ │ ├── 0001/
│ │ │ ├── drone_view.jpg
│ │ │ ├── street_view.jpg
│ │ │ ├── satellite_view.jpg
│ │ ├── 0002/
│ │ ├── ... // More locations
│ │ ├── 0701/
├── train/
│ ├── 0001/
│ │ ├── drone_view.jpg
│ │ ├── street_view.jpg
│ │ ├── satellite_view.jpg
│ ├── 0002/
│ ├── ... // More locations
│ ├── 0701/
├── test_951_version.json
├── train.json
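After downloading, it can be useful to confirm that the folder layout matches the tree above. The following is a minimal sketch that counts image files per location folder. It assumes one folder per location directly under a split directory, and the root path in the usage section is a placeholder for wherever the dataset was unpacked.

```python
from pathlib import Path

# File extensions treated as images (an assumption; adjust if needed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_location(split_dir):
    """Map each location folder (e.g. '0001') to its number of image files."""
    counts = {}
    for location in sorted(Path(split_dir).iterdir()):
        if location.is_dir():
            counts[location.name] = sum(
                1
                for f in location.rglob("*")
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


if __name__ == "__main__":
    root = Path("GeoText_Dataset_Official/train")  # placeholder local path
    if root.is_dir():
        counts = count_images_per_location(root)
        print(f"{len(counts)} locations, {sum(counts.values())} images")
```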
Example entry in train.json:
{
  "image_id": "0839/image-43.jpeg",
  "image": "train/0839/image-43.jpeg",
  "caption": "In the center of the image is a large, modern office building...",
  "sentences": [
    "The object in the center of the image is a large office building with several floors and a white facade",
    "On the upper middle side of the building, there is a street with cars driving on it",
    "On the middle right side of the building, there is a small parking lot with several cars parked in it"
  ],
  "bboxes": [
    [0.408688485622406, 0.6883664131164551, 0.38859522342681885, 0.6234817504882812],
    [0.2420489490032196, 0.3855597972869873, 0.30488067865371704, 0.2891976535320282],
    [0.7388443350791931, 0.8320053219795227, 0.5213109254837036, 0.33447015285491943]
  ]
}
- Caption: Provides a global description for the entire image.
- Sentences: Each sentence is aligned with a specific part of the image, related to the bounding boxes.
- Bounding Boxes: Specified as arrays of coordinates in the format [cx, cy, w, h].
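To illustrate the annotation format, the sketch below pairs each sentence with its bounding box and converts the box from [cx, cy, w, h] to pixel corner coordinates. It assumes the box values are normalized to the range 0 to 1 relative to the image size, which the example values suggest but the text above does not state. The 512x512 image size and the helper names are hypothetical.

```python
def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized [cx, cy, w, h] box to pixel [x1, y1, x2, y2]."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    # Clamp to the image, since a box can extend slightly past the border.
    return [max(0.0, x1), max(0.0, y1), min(float(img_w), x2), min(float(img_h), y2)]


def region_pairs(entry, img_w, img_h):
    """Pair each sentence with its bounding box in pixel coordinates."""
    return [
        (sentence, cxcywh_to_xyxy(box, img_w, img_h))
        for sentence, box in zip(entry["sentences"], entry["bboxes"])
    ]


# Two regions taken from the example entry above (sentences shortened).
entry = {
    "sentences": [
        "The object in the center of the image is a large office building",
        "On the upper middle side of the building, there is a street",
    ],
    "bboxes": [
        [0.408688485622406, 0.6883664131164551, 0.38859522342681885, 0.6234817504882812],
        [0.2420489490032196, 0.3855597972869873, 0.30488067865371704, 0.2891976535320282],
    ],
}

# 512x512 is an assumed image size used only for this illustration.
pairs = region_pairs(entry, img_w=512, img_h=512)
for sentence, (x1, y1, x2, y2) in pairs:
    print(f"({x1:.0f}, {y1:.0f}) to ({x2:.0f}, {y2:.0f}): {sentence}")
```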
- Git
- Git Large File Storage (LFS)
- Conda
- Clone the repository:
  git clone https://github.com/MultimodalGeo/GeoText-1652.git
- Install Miniconda:
  wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  sh Miniconda3-latest-Linux-x86_64.sh
- Create and activate conda environment:
  conda create -n gt python=3.8
  conda activate gt
- Install requirements:
  cd GeoText-1652
  pip install -r requirements.txt
- Install and configure Git LFS:
  apt install git-lfs
  git lfs install
- Download dataset and model:
  git clone https://huggingface.co/datasets/truemanv5666/GeoText1652_Dataset
  git clone https://huggingface.co/truemanv5666/GeoText1652_model
- Extract dataset images:
  cd GeoText1652_Dataset/images
  find . -type f -name "*.tar.gz" -print0 | xargs -0 -I {} bash -c 'tar -xzf "{}" -C "$(dirname "{}")" && rm "{}"'
- Update configuration files:
  - Update re_bbox.yaml with correct paths
  - Update method/configs/config_swinB_384.json with correct ckpt path
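On systems where the `find`/`xargs` extraction step is inconvenient (for example, without a POSIX shell), the same effect can be approximated in Python. This is a sketch using only the standard library, not part of the official repository: it extracts every `*.tar.gz` archive into its own directory and then deletes the archive. The function name and the default path are assumptions.

```python
import tarfile
from pathlib import Path


def extract_archives(root):
    """Extract each *.tar.gz under root next to itself, then delete the archive."""
    extracted = []
    # sorted() materializes the list first, so files created during
    # extraction are not picked up by the same traversal.
    for archive in sorted(Path(root).rglob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            # Only use this on archives from a source you trust.
            tar.extractall(path=archive.parent)
        archive.unlink()
        extracted.append(archive)
    return extracted


if __name__ == "__main__":
    images_dir = Path("GeoText1652_Dataset/images")  # assumed clone location
    if images_dir.is_dir():
        done = extract_archives(images_dir)
        print(f"Extracted and removed {len(done)} archives")
```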
From the Method directory:
cd Method
python3 run.py --task "re_bbox" --dist "l4" --evaluate --output_dir "output/eva" --checkpoint "/root/GeoText-1652/GeoText1652_model/geotext_official_checkpoint.pth"
Evaluation paths:
- Full test (951 cases): GeoText1652_Dataset/test_951_version.json
- 24GB GPU version (~190 cases): GeoText1652_Dataset/test_24G_version.json
24GB Version Results on Two 3090Ti:
| Text Query R@1 | Text Query R@5 | Text Query R@10 | Image Query R@1 | Image Query R@5 | Image Query R@10 |
|---|---|---|---|---|---|
| 29.9 | 46.3 | 54.1 | 50.1 | 81.2 | 90.3 |
Full evaluation results are in the paper.
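For readers unfamiliar with the R@K columns, the sketch below computes Recall@K under its usual retrieval definition: the percentage of queries for which at least one correct gallery item appears among the top K results. It is a generic illustration on toy data, not the evaluation code used by `run.py`, and all names in it are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, ks=(1, 5, 10)):
    """Percentage of queries with at least one correct item in the top K.

    ranked_ids:   one list per query, gallery ids sorted best match first.
    relevant_ids: one set per query, the ids that count as correct.
    """
    hits = {k: 0 for k in ks}
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        for k in ks:
            if any(item in relevant for item in ranking[:k]):
                hits[k] += 1
    n = len(ranked_ids)
    return {k: 100.0 * hits[k] / n for k in ks}


# Toy example: four queries over a three-item gallery.
rankings = [
    ["a", "b", "c"],  # correct item ranked first
    ["c", "a", "b"],  # correct item ranked second
    ["b", "c", "a"],  # correct item ranked third
    ["c", "b", "a"],  # correct item not in the gallery
]
relevant = [{"a"}, {"a"}, {"a"}, {"z"}]

scores = recall_at_k(rankings, relevant, ks=(1, 2, 3))
print(scores)
```

Because a query counts as a hit as soon as any correct item is retrieved, R@K can only stay the same or grow as K increases, which matches the pattern in the table above.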
Training:
nohup python3 run.py --task "re_bbox" --dist "l4" --output_dir "output/train" --checkpoint "/root/GeoText-1652/GeoText1652_model/geotext_official_checkpoint.pth" &
If you find GeoText-1652 useful for your work, please cite:
@inproceedings{chu2024towards, 
  title={Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching}, 
  author={Chu, Meng and Zheng, Zhedong and Ji, Wei and Wang, Tingyu and Chua, Tat-Seng}, 
  booktitle={ECCV}, 
  year={2024} 
}
We would like to express our gratitude to the creators of X-VLM for their excellent work, which has significantly contributed to this project.