A more powerful version than Virtual Scientists, which supports a million-agent-level scientific collaboration simulation. Our scientific collaboration includes six sections: (1) collaborator selection; (2) topic selection; (3) idea generation; (4) novelty check; (5) abstract generation; (6) review generation.
[2025-05]
- The related paper is available on Arxiv.
[2025-05]
- We release a simple reinforcement learning (RL)-based algorithm for collaborator selection in the
RL-Basedbranch.
[2025-04]
- We release the code and data of VirSci-v2, which is a powerful platform for scientfic collaboration simulation.
git clone https://github.com/RenqiChen/Virtual-Scientists-v2
conda create --name virsci python=3.11
conda activate virsci
Install dependencies of the basic multi-agent framework CAMEL.
cd camel-master
pip install --upgrade pip setuptools
pip install -e . # This will install dependencies in pyproject.toml and install camel in editable mode
Then, install the following necessary packages.
pip install ollama
pip install faiss-gpu
Some other dependencies can be installed as needed.
In our experiments, we use ollama to deploy the llama3.1-8b and llama3.1-70b language models and mxbai-embed-large embedding model. The details of deployment could refer to URL. Here we show some key steps:
- Ollama should be installed. The linux version:
curl -fsSL https://ollama.com/install.sh | sh
- Run ollama in the path where ollama is installed:
./ollama serve
- Pull a model to use with the library:
./ollama pull llama3.1
./ollama pull llama3.1:70b
./ollama pull mxbai-embed-large
- Install the ollama python library in your environment:
pip install ollama
- Complete the installation and close the terminal.
The raw data is based on the AMiner Computer Science Dataset and Open Academic Graph.
After preprocessing, the used data is publicly available at Google Drive.
The files in Google Drive is related to def _init_ of class Platform in sci_platform/sci_platform_fast.py.
- Computer Science Dataset
- Past paper database is put in the
Papers/papers.tar.gz, which is used inpaper_folder_path. The corresponding embedding database is put in theEmbeddings/faiss_index.index, which is used inpaper_index_path. - Contemporary paper database is put in the
Papers/papers_future.tar.gz, which is used infuture_paper_folder_path. The corresponding embedding database is put in theEmbeddings/faiss_index_future.index, which is used inpaper_future_index_path. - Author knowledge bank is put in the
Authors/books.tar, which is used in ininput_dirinsci_platform/configs/knowledge_config.jsonandauthor_folder_path. - Adjacency matrix is put in the
adjacency.txt, which is used inadjacency_matrix_dir.
- Open Academic Graph Dataset
- Past paper database is put in the
Papers/papers_OAG.zip, which is used inpaper_folder_path. The corresponding embedding database is put in theEmbeddings/faiss_index_OAG.index, which is used inpaper_index_path. - Contemporary paper database is put in the
Papers/papers_future_OAG.tar.gz, which is used infuture_paper_folder_path. The corresponding embedding database is put in theEmbeddings/faiss_index_OAG_future.index, which is used inpaper_future_index_path. - Author knowledge bank is put in the
Authors/books_OAG.zip, which is used in ininput_dirinsci_platform/configs/knowledge_config.jsonandauthor_folder_path. - Adjacency matrix is put in the
weight_matrix.txt, which is used inadjacency_matrix_dir.
Note
Please replace all paths in sci_platform/sci_platform_fast.py with your own settings after download the data.
Here we explain the roles of several critial files.
sci_platform/configs/deploy_config.pydefines all hyper-parameter settings.sci_platform/social_agent/sci_agent.pydefines the customized scientist agent in this project.sci_platform/social_agent/channel.pydefines the message sending and receiving, which is the lowest-level module.sci_platform/inferencecontrols the messages sent to or received from the channel, which corresponds to different threads.sci_platform/run_fast.pyis the main execution file.sci_platform/sci_platform_fast.pydefines the platform for the initialization of our multi-agent system.sci_platform/utils/prompt.pycontains all the prompts used.sci_platform/utils/scientist_utils.pycontains all the common functions used.sci_platform/sci_team/SciTeam.pydefines the execution mechanism of each scientist team.
Our code support different environment settings. The commonly used arguments in deploy_config.py:
- Deploy Setup
ips: the ips for the LLM model deployment
port: the ports of the ip for the LLM model deployment
- Experiment Setup
agent_num: how many independent scientists are included in the simulation
runs: how many times does the program run
team_limit: the max number of teams for a scientist
max_discuss_iteration: the max discussion iterations for a team in a step
max_team_member: the max team member of a team (including the team leader)
epochs: the allowed time steps for one program run (the publish of a complete paper usually needs 5 epochs)
model_name: the LLM base model for simulation (e.g., llama3.1)
leader_mode: who is the leader (e.g., normal or random)
- Checkpoint Setup
checkpoint: use the checkpoint or create a new program
test_time: the name of the test as a checkpoint
load_time: the name of the loaded checkpoint
In deploy_config.py, set the ips=['127.0.0.1']. In port2.sh, CUDA_VISIBLE_DEVICES means how many GPUs are used and OLLAMA_HOST=0.0.0.0:XXXXX means the port of one GPU is deployed with a LLM model.
cd sci_platform
bash port2.sh
In deploy_config.py, set the ips includes the ips of all machines. port1.sh is to deploy LLM models on these distributed machines.
cd sci_platform
bash port1.sh
bash port2.sh
All information during simulation will be saved at sci_platform/{self.checkpoint_path}/{self.test_time} folder. In this folder, the structure of data follows:
βββ citation
βββ faiss
βββ paper
βββ team
βββ database_large.db
βββ weight_matrix.txt
citation folder saves the citations of paper which are cited in the nearest epoch.
faiss folder saves the embeddings of all paper, including the original paper from public database and newly published paper from the simulation.
paper folder saves the text of all paper in the form of .txt, including the original paper from public database and newly published paper from the simulation.
team folder saves the team data of each scientist.
database_large.db saves the data of all paper, including the original paper from public database and newly published paper from the simulation. We use sqlite3 Python Library to manage the simulation database, where the form of data follows:
cursor.execute('''
CREATE TABLE IF NOT EXISTS papers (
id INTEGER PRIMARY KEY,
title TEXT,
authors TEXT,
cite_papers TEXT,
abstract TEXT,
year INTEGER,
citation INTEGER,
reviews TEXT,
discipline TEXT,
keyword TEXT
)
''')
While title, authors, abstract, keyword are esay to understand, several other items require explaination:
idis the ID number of the paper.cite_papersis the reference of the paper, composed of numbers and combined with comma.yearis the publication time of the paper. If the paper is extracted from the original public database, the value will be set as-1. If the paper is generated from the simulation, the value will set asepoch(e.g., 1).citationis the citation count of the paper.reviewsis a list of review scores, which stores the results of review at different epochs.disciplinedenotes the discipline of the paper.
weight_matrix.txt saves the collaboration times between scientist after weighted sum.
Please feel free to evaluate these results for specific Science of Science analysis.
This project is supported by Shanghai Artificial Intelligence Laboratory.
The multi-agent framework in this work is based on the CAMEL.
The concurrent distributed system in this work is based on the OASIS.
The raw data is based on the AMiner Computer Science Dataset and the Open Academic Graph.
This repository is licensed under the Apache-2.0 License.
If you find this repository useful, please consider citing our work:
@article{chen2025ai,
title={AI-Driven Automation Can Become the Foundation of Next-Era Science of Science Research},
author={Chen, Renqi and Su, Haoyang and Tang, Shixiang and Yin, Zhenfei and Wu, Qi and Li, Hui and Sun, Ye and Dong, Nanqing and Ouyang, Wanli and Torr, Philip},
journal={arXiv preprint arXiv:2505.12039},
year={2025}
}
@software{Chen_VirSciv2_2025,
author = {Renqi Chen and Haoyang Su},
month = apr,
title = {{VirSci-v2}},
url = {https://github.com/RenqiChen/Virtual-Scientists-v2},
year = {2025}
}