90 changes: 90 additions & 0 deletions datasets/arxiv-2023/README.md
@@ -0,0 +1,90 @@
# ARXIV-2023

## Dataset Description

arxiv-2023 was collected for comparison with ogbn-arxiv. Both datasets are directed citation networks: each node is a paper published on arXiv, and each edge indicates that one paper cites another.

Statistics:
- Nodes: 33868
- Edges: 305672
- Number of Classes: 40
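
As a quick sanity check on these figures, the average number of citations per paper and the graph density follow directly from the node and edge counts. The snippet below is a minimal sketch that uses only those two numbers; the variable names are illustrative and not part of the dataset files, and the density formula assumes the graph has no self-loops.

```python
# Counts taken from the statistics listed above.
num_nodes = 33868
num_edges = 305672

# Each directed edge is one citation, so this is the mean out-degree
# (equivalently, the mean in-degree) per paper.
avg_degree = num_edges / num_nodes

# Density of a directed graph, assuming no self-loops: edges / (n * (n - 1)).
density = num_edges / (num_nodes * (num_nodes - 1))

print(f"average degree: {avg_degree:.2f}")
print(f"density: {density:.6f}")
```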

#### Citation

- Original Source
+ [Website](https://github.com/TRAIS-Lab/LLM-Structured-Data)
+ LICENSE: [<license type>](<URL to license>)



```
@misc{huang2023llms,
title={Can LLMs Effectively Leverage Graph Structural Information: When and Why},
author={Jin Huang and Xingjian Zhang and Qiaozhu Mei and Jiaqi Ma},
year={2023},
eprint={2309.16595},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```

- Current Version
+ [Website](https://github.com/TRAIS-Lab/LLM-Structured-Data)
+ LICENSE: [<license type>](<URL to license>)



```
@misc{huang2023llms,
title={Can LLMs Effectively Leverage Graph Structural Information: When and Why},
author={Jin Huang and Xingjian Zhang and Qiaozhu Mei and Jiaqi Ma},
year={2023},
eprint={2309.16595},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```

- Previous Version
+ [Website](<URL to website>)
+ LICENSE: [<license type>](<URL to license>)


## Available Tasks

### <Task Name>



- Task type: `NodeClassification`


#### Citation

```
@misc{huang2023llms,
title={Can LLMs Effectively Leverage Graph Structural Information: When and Why},
author={Jin Huang and Xingjian Zhang and Qiaozhu Mei and Jiaqi Ma},
year={2023},
eprint={2309.16595},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```

<!-- Insert the BibTeX citation into the above code block. -->

## Preprocessing

The data files and the task config file in GLI format are generated by the `arxiv-2023.ipynb` notebook. The raw data were acquired from the TRAIS-Lab/LLM-Structured-Data repository.
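
The generated task config stores each split as a reference to a key inside an `.npz` file. The sketch below shows one way to resolve those references with the standard library. The JSON text is a trimmed copy of the split fields in `task_node_classification_1.json`, and `split_locations` is a hypothetical helper, not part of GLI.

```python
import json

# Trimmed copy of the split fields in task_node_classification_1.json;
# the real file contains additional keys that are not reproduced here.
NPZ_NAME = "arxiv-2023__task_node_classification_1__707e9444940a9744a72ae8a990fe9136.npz"
task_config_text = json.dumps({
    "target": "Node/NodeLabel",
    "num_classes": 40,
    "train_set": {"file": NPZ_NAME, "key": "train_set"},
    "val_set": {"file": NPZ_NAME, "key": "val_set"},
    "test_set": {"file": NPZ_NAME, "key": "test_set"},
})

task = json.loads(task_config_text)


def split_locations(task):
    """Map each split name to the (file, key) pair that stores that split.

    Hypothetical helper for illustration only.
    """
    return {
        name: (task[name]["file"], task[name]["key"])
        for name in ("train_set", "val_set", "test_set")
    }


splits = split_locations(task)
for name, (file_name, key) in splits.items():
    print(name, "->", file_name, key)
```

With NumPy installed, each split could then be read with `numpy.load(file_name)[key]`, assuming the `.npz` file sits next to the task config.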

### Requirements

```
openai
pytorch
PyG
ogb
```
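
Before running the notebook, it may help to confirm that these packages are importable. The following is a minimal sketch using only the standard library. The import names are assumptions (`pytorch` is imported as `torch` and PyG as `torch_geometric`), and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Import names assumed for the requirements listed above:
# pytorch -> torch, PyG -> torch_geometric.
REQUIRED_MODULES = ["openai", "torch", "torch_geometric", "ogb"]


def missing_modules(module_names):
    """Return the names that cannot be found in the current environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```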


53 changes: 37 additions & 16 deletions datasets/arxiv-2023/arxiv-2023.ipynb
@@ -16,14 +16,15 @@
 },
 {
 "cell_type": "code",
-"execution_count": 18,
+"execution_count": 57,
 "metadata": {},
 "outputs": [],
 "source": [
 "import os\n",
 "import torch\n",
 "import pandas as pd\n",
-"import numpy"
+"import numpy\n",
+"import json"
 ]
 },
 {
@@ -110,10 +111,23 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 61,
 "metadata": {},
-"outputs": [],
-"source": []
+"outputs": [
+{
+"data": {
+"text/plain": [
+"33868"
+]
+},
+"execution_count": 61,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
+"source": [
+"num_nodes"
+]
 },
 {
 "cell_type": "markdown",
@@ -122,7 +136,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 46,
+"execution_count": 56,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -134,25 +148,32 @@
 " feature=[\"Node/NodeFeature\"],\n",
 " target=\"Node/NodeLabel\",\n",
 " num_classes=40,\n",
-" train_set=train_mask,\n",
-" val_set=\n"
+" train_set=numpy.array(train_mask),\n",
+" val_set=numpy.array(val_mask),\n",
+" test_set=numpy.array(test_mask),\n",
+" task_id=\"1\"\n",
+")\n"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 53,
+"execution_count": 62,
 "metadata": {},
 "outputs": [
 {
-"ename": "SyntaxError",
-"evalue": "invalid syntax (3587969684.py, line 1)",
-"output_type": "error",
-"traceback": [
-"\u001b[0;36m Cell \u001b[0;32mIn[53], line 1\u001b[0;36m\u001b[0m\n\u001b[0;31m hi=[y[i] for i <30000]\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n"
-]
+"data": {
+"text/plain": [
+"torch.Size([2, 305672])"
+]
+},
+"execution_count": 62,
+"metadata": {},
+"output_type": "execute_result"
 }
 ],
-"source": []
+"source": [
+"edge_index.shape"
+]
 }
 ],
 "metadata": {
6 changes: 3 additions & 3 deletions datasets/arxiv-2023/task_node_classification_1.json
@@ -7,15 +7,15 @@
 "target": "Node/NodeLabel",
 "num_classes": 40,
 "train_set": {
-"file": "arxiv-2023__task_node_classification_1__96c4df545b7523dab7ea703813687c06.npz",
+"file": "arxiv-2023__task_node_classification_1__707e9444940a9744a72ae8a990fe9136.npz",
 "key": "train_set"
 },
 "val_set": {
-"file": "arxiv-2023__task_node_classification_1__96c4df545b7523dab7ea703813687c06.npz",
+"file": "arxiv-2023__task_node_classification_1__707e9444940a9744a72ae8a990fe9136.npz",
 "key": "val_set"
 },
 "test_set": {
-"file": "arxiv-2023__task_node_classification_1__96c4df545b7523dab7ea703813687c06.npz",
+"file": "arxiv-2023__task_node_classification_1__707e9444940a9744a72ae8a990fe9136.npz",
 "key": "test_set"
 }
 }