- By executing the `arxivcollector.py` script from the command line.
### Step 1: obtain an arXiv search results URL
To obtain an arXiv search results URL for your search query, go to [https://arxiv.org/](https://arxiv.org/) or to the [advanced search page](https://arxiv.org/search/advanced) and construct your search query. Press the big blue "Search" button and wait until you arrive on the page that displays the search results. Now copy the entire URL as is, and you're done ✅.
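The copied URL is an ordinary query string, so you can also assemble one programmatically if you prefer. A minimal sketch, assuming the `query` and `searchtype` parameters of arXiv's simple-search form (the query below is just an example):

```python
from urllib.parse import urlencode

# Build an arXiv search-results URL by hand; the parameter names
# mirror arxiv.org's simple-search form, the query is illustrative.
params = {"query": "large language models", "searchtype": "all"}
url = "https://arxiv.org/search/?" + urlencode(params)
print(url)  # https://arxiv.org/search/?query=large+language+models&searchtype=all
```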
### Step 2: use ArXivCollector in one of two ways

#### In Python
Run the following Python code (e.g., in a script or from a Jupyter notebook).
```python
from arxivcollector import ArXivCollector

# Initiate a new instance of the ArXivCollector class
collector = ArXivCollector()
```

After running this with your own search URL and title, a new file should appear in the parent directory of arXivCollector.
#### From the command line
The first argument after `arxivcollector.py` is the search URL, the second argument is your title, and the third argument is the type of the output file (`csv` or `bibtex`).
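For example (a hypothetical invocation; substitute your own search URL and title):

```shell
# Search URL and title are illustrative placeholders
python arxivcollector.py "https://arxiv.org/search/?query=test&searchtype=all" "my_search" csv
```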
```python
# Inspired by: Fatima, R., Yasin, A., Liu, L., Wang, J., & Afzal, W. (2023). Retrieving arXiv, SocArXiv, and SSRN metadata for initial review screening. Information and Software Technology, 161, 107251. https://doi.org/10.1016/j.infsof.2023.107251

import argparse
import datetime
import logging
import sys
import urllib.parse

import httpx
import pandas as pd
from bibtexparser.bibdatabase import BibDatabase
from bibtexparser.bwriter import BibTexWriter
from bs4 import BeautifulSoup

MAX_RETRIES = 3


class ArXivCollector():
    def __init__(self,
        user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
```
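The three positional command-line arguments described above could be parsed with `argparse`, which the script imports. A hypothetical sketch only, with illustrative argument names (not necessarily the script's actual ones):

```python
import argparse

# Hypothetical CLI sketch: search URL, title, and output type as
# positional arguments; names here are illustrative.
parser = argparse.ArgumentParser(description="Collect arXiv search results")
parser.add_argument("url", help="arXiv search results URL")
parser.add_argument("title", help="title used to name the output file")
parser.add_argument("output_type", choices=["csv", "bibtex"], help="output file format")

args = parser.parse_args(["https://arxiv.org/search/?query=test", "my_search", "csv"])
print(args.title, args.output_type)  # my_search csv
```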