Commit 45d22b9

Merge pull request #1 from anamatoso/main

Minor Corrections and Some Improvements

2 parents 81ef257 + ab12fab

File tree: 6 files changed (+161 −58 lines)

GitHub Actions workflow (new file)

Lines changed: 41 additions & 0 deletions

```yaml
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Python package

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.8", "3.9", "3.10", "3.11"]

    steps:
    - uses: actions/checkout@v4
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip setuptools wheel
        python -m pip install flake8 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        PYTHONPATH=. pytest
```

README.md

Lines changed: 20 additions & 9 deletions

````diff
@@ -1,3 +1,8 @@
+[![Python 3.8](https://img.shields.io/badge/python-3.8-blue.svg)](https://www.python.org/downloads/release/python-360/)
+[![Python 3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-360/)
+[![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/release/python-360/)
+[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-360/)
+
 arXivCollector
 ======
 
@@ -20,34 +25,35 @@ Getting started
 ------
 
 **arXivCollector** can be used in two ways:
-- By importing the `arXivCollector()` class;
+- By importing the `ArXivCollector()` class;
 - By executing the `arxivcollectory.py` script from the command line.
 
 ### Step 1: obtain an arXiv search results URL
 To obtain an arXiv search results URL for your search query, go to [https://arxiv.org/](https://arxiv.org/) or to the [advanced search page](https://arxiv.org/search/advanced) and construct your search query. Press the big blue button that says "Search", wait until you arrive on the page that displays the search results. Now copy the entire URL as is, and you're done ✅.
 
-### Step 2: use arXivCollector in one of two ways
+### Step 2: use ArXivCollector in one of two ways
 #### In Python
 Run the following Python code (e.g., in a script or from a Jupyter notebook).
 
 ```python
-from arxiv import arXivCollector
+from arxivcollector import ArXivCollector
 
 # Initiate a new instance of the arXivCollector class
-collector = arXivCollector()
-# Set the title of the exported file (optional)
+collector = ArXivCollector()
+# Set the title and type of the exported file
 collector.set_title("Parrots")
+collector.set_mode("csv")
 # Pass the search URL to the run method
 collector.run('https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=stochastic+parrot&terms-0-field=title&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first')
 ```
 
 After running this with your own search URL and title, a new file should appear in the parent directory of arXivCollector.
 
 #### From the commandline
-The first argument after `arxivcollectory.py` is the search URL, the second argument is your title.
+The first argument after `arxivcollectory.py` is the search URL, the second argument is your title, and the third argument is the type of the output file (csv or bibtex).
 
 ```bash
-python arxivcollector.py "https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=stochastic+parrot&terms-0-field=title&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first" "Parrots"
+python arxivcollector.py "https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=stochastic+parrot&terms-0-field=title&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first" "output" "csv"
 ```
 
 Special thanks
@@ -62,7 +68,7 @@ Fatima, R., Yasin, A., Liu, L., Wang, J., & Afzal, W. (2023). Retrieving arXiv,
 API
 ------
 
-### Class: arXivCollector
+### Class: ArXivCollector
 
 This class is used to collect metadata from the arXiv website and save it in either BibTeX or CSV format.
 
@@ -83,14 +89,19 @@ Initializes an instance of the ArXiv class.
 
 Sets the title of the output file.
 
+#### `set_mode(self, mode: str)`
+
+Sets the type of the output file.
+
 ##### Parameters:
 
 - `title` (str): The title to set.
+- `mode` (str): The type of file to set.
 
 #### `run(self, url)`
 
 Starts the collection process for the specified URL.
 
 ##### Parameters:
 
-- `url` (str): The URL to start the collection process for.
+- `url` (str): The URL to start the collection process for.
````
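
The API section above now documents `set_title`, `set_mode`, and `run` together. As a complement to the README's CSV example, here is a minimal sketch of the BibTeX path (the short search URL is a placeholder, not one from this commit):

```python
from arxivcollector import ArXivCollector

# Placeholder URL: substitute any arXiv search results URL.
url = "https://arxiv.org/search/?searchtype=all&query=stochastic+parrot"

collector = ArXivCollector()
collector.set_title("Parrots")  # after this PR, output goes to Parrots.bib
collector.set_mode("bibtex")    # "csv" would produce Parrots.csv instead
collector.run(url)
```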

arxivcollector.py

Lines changed: 64 additions & 49 deletions

```diff
@@ -1,44 +1,49 @@
 # Inspired by: Fatima, R., Yasin, A., Liu, L., Wang, J., & Afzal, W. (2023). Retrieving arXiv, SocArXiv, and SSRN metadata for initial review screening. Information and Software Technology, 161, 107251. https://doi.org/10.1016/j.infsof.2023.107251
 
-import httpx
-from bs4 import BeautifulSoup
-from bibtexparser.bwriter import BibTexWriter
-from bibtexparser.bibdatabase import BibDatabase
-import pandas as pd
+import argparse
 import datetime
-import urllib.parse
+import logging
 import sys
-import argparse
-import logging
+import urllib.parse
+
+import httpx
+import pandas as pd
+from bibtexparser.bibdatabase import BibDatabase
+from bibtexparser.bwriter import BibTexWriter
+from bs4 import BeautifulSoup
 
 MAX_RETRIES = 3
 
+
 class ArXivCollector():
-    def __init__(self,
+    def __init__(self,
                  user_agent="Mozilla/5.0 (X11; Linux x86_64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
                  num_abstracts=50,
                  arxiv_doi_prefix="https://doi.org/10.48550",
                  default_item_type="ARTICLE",
-                 verbose=False,
+                 verbose=False,
                  mode="bibtex") -> None:
         self.user_agent = user_agent
         self.num_abstracts = num_abstracts
         self.arxiv_doi_prefix = arxiv_doi_prefix
         self.default_item_type = default_item_type
         self.verbose = verbose
-        self.client = httpx.Client(headers={"User-Agent": self.user_agent,})
+        self.client = httpx.Client(headers={"User-Agent": self.user_agent})
         self.title = datetime.datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
         self.mode = mode
 
         logging.basicConfig(level=logging.INFO,
-                            force = True, handlers=[logging.StreamHandler(sys.stdout)])
+                            force=True, handlers=[logging.StreamHandler(sys.stdout)])
 
         # Error handling for the mode parameter
         if self.mode not in ["bibtex", "csv"]:
             raise ValueError("The mode parameter must be either 'bibtex' or 'csv'.")
 
     def set_title(self, title: str):
-        self.title = f"{self.title}_{title}"
+        self.title = f"{title}"
+
+    def set_mode(self, mode: str):
+        self.mode = f"{mode}"
 
     def send_request(self, url, method="GET"):
         for attempt in range(MAX_RETRIES):
```
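
The first hunk cuts off at the top of `send_request`'s retry loop, and the next hunk shows only its tail. Bridging them, a sketch of the loop they imply; the request call and exception handling inside the loop are assumptions (they fall outside the diff), and only `MAX_RETRIES`, the `for`/`else` shape, and the failure logging come from the source:

```python
import logging

import httpx

MAX_RETRIES = 3

def send_request(client: httpx.Client, url: str, method: str = "GET"):
    for attempt in range(MAX_RETRIES):
        try:
            # Assumed body: issue the request, treat HTTP errors as retryable.
            response = client.request(method, url)
            response.raise_for_status()
            return response
        except httpx.HTTPError:
            logging.warning(f"Attempt {attempt + 1}/{MAX_RETRIES} failed.")
    else:
        # This is the branch visible in the hunk: all attempts exhausted.
        logging.error(f"Failed to send request after {MAX_RETRIES} attempts.")
        return None
```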
```diff
@@ -52,11 +57,11 @@ def send_request(self, url, method="GET"):
         else:
             logging.error(f"Failed to send request after {MAX_RETRIES} attempts.")
             return None
-
-    def extract_text(self,soup:BeautifulSoup,selector):
+
+    def extract_text(self, soup: BeautifulSoup, selector):
         try:
             text = soup.select_one(selector).getText(strip=True)
-        except AttributeError as err:
+        except AttributeError:
             text = None
         return text
 
```
```diff
@@ -65,32 +70,37 @@ def find_data(self, soup: BeautifulSoup, keyword) -> str:
         for p in soup.select('p'):
             if p.getText(strip=True).startswith(keyword):
                 temp = p.getText(strip=True).split(';')
-                sub = temp[0].strip().removeprefix('Submitted')
-                ann = temp[-1].strip().removeprefix('originally announced')
+                sub = temp[0].strip()[len('Submitted'):]
+                ann = temp[-1].strip()[len('originally announced'):]
                 # Convert sub to a datetime object
                 sub = datetime.datetime.strptime(sub, "%d %B, %Y")
                 break
         return sub, ann
-
-    def parse_html(self,response:httpx.Response):
-        soup = BeautifulSoup(response.content,'html.parser')
+
+    def parse_html(self, response: httpx.Response):
+        soup = BeautifulSoup(response.content, 'html.parser')
 
         lis = soup.select('li.arxiv-result')
-        if len(lis) == 0: return []
-        for i,li in enumerate(lis,start=1):
-            title =self.extract_text(li,'p.title')
+        if len(lis) == 0:
+            return []
+        for i, li in enumerate(lis, start=1):
+            title = self.extract_text(li, 'p.title')
             if self.verbose:
-                print(i,title)
-
+                print(i, title)
+
             temp_authors = li.select('p.authors>a')
             authors = ' AND '.join([', '.join(j.getText(strip=True).split()[::-1]) for j in temp_authors])
 
-            Abstract = self.extract_text(li,'span.abstract-full').removesuffix('△ Less')
+            abstract_text = self.extract_text(li, 'span.abstract-full')
+            if abstract_text:
+                Abstract = abstract_text[:-len('△ Less')]
+            else:
+                Abstract = ''
 
-            extracted_text = self.extract_text(li,'p.comments > span:nth-of-type(2)')
+            extracted_text = self.extract_text(li, 'p.comments > span:nth-of-type(2)')
             note = extracted_text if extracted_text else ""
 
-            sub,ann = self.find_data(li,'Submitted')
+            sub, ann = self.find_data(li, 'Submitted')
 
             # Construct ID from first author's last name and year of submission
             id = authors.split(',')[0] + str(sub.year)
```
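
A note on the `removeprefix` changes in this hunk: `str.removeprefix` and `str.removesuffix` exist only on Python 3.9 and later, so replacing them with slicing is what lets the scraper run on the 3.8 entry in the new CI matrix. The slice is safe for the `'Submitted'` prefix because the surrounding `startswith` check guarantees it is present (the abstract's `[:-len('△ Less')]` slice likewise assumes its suffix is always there). A quick illustration of the equivalence, with a made-up date string:

```python
text = "Submitted 14 March, 2024"

# Python 3.9+ only:
# date_part = text.removeprefix("Submitted")

# Python 3.8-compatible equivalent used in this commit; identical output
# whenever the string is known to start with the prefix.
date_part = text[len("Submitted"):]
print(date_part.strip())  # 14 March, 2024
```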
```diff
@@ -100,18 +110,18 @@ def parse_html(self,response:httpx.Response):
                 pdf = li.select_one('p.list-title > span > a[href*="pdf"]')['href']
             except TypeError:
                 pdf = ""
-
+
             month_abbr = ["", "jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]
 
-            yield { # BibTeX-friendly format
-                "title":title,
-                "author":authors,
-                "abstract":Abstract,
-                "note":note,
-                "year":str(sub.year),
+            yield {  # BibTeX-friendly format
+                "title": title,
+                "author": authors,
+                "abstract": Abstract,
+                "note": note,
+                "year": str(sub.year),
                 "month": month_abbr[sub.month],
-                "doi": f"{self.arxiv_doi_prefix}/arXiv.{link.split('/')[-1]}", # Construct the DOI from the arXiv ID
-                "howpublished" : fr"\url{{{pdf}}}",
+                "doi": f"{self.arxiv_doi_prefix}/arXiv.{link.split('/')[-1]}",  # Construct the DOI from the arXiv ID
+                "howpublished": fr"\url{{{pdf}}}",
                 "ENTRYTYPE": self.default_item_type,
                 "ID": id
             }
```
```diff
@@ -123,10 +133,10 @@ def run(self, url):
             # Parse the URL and its parameters
             parsed_url = urllib.parse.urlparse(url)
             params = urllib.parse.parse_qs(parsed_url.query)
-
+
             # Update the 'start' parameter
-            params['start'] = [page*self.num_abstracts]
-
+            params['start'] = [page * self.num_abstracts]
+
             # Construct the new URL
             new_query = urllib.parse.urlencode(params, doseq=True)
             if 'advanced' not in params:
```
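
For context on this hunk: `run` pages through the results by rewriting the `start` query parameter before each request, in steps of `num_abstracts` (50 by default). A standalone sketch of that rewrite using the same `urllib.parse` calls visible in the diff; how `run` reassembles the final URL falls partly outside the hunk, so the `_replace(...).geturl()` step here is an assumption:

```python
import urllib.parse

def page_url(url: str, page: int, num_abstracts: int = 50) -> str:
    # Parse the URL, overwrite start=page*num_abstracts, and rebuild it.
    parsed = urllib.parse.urlparse(url)
    params = urllib.parse.parse_qs(parsed.query)
    params['start'] = [page * num_abstracts]
    new_query = urllib.parse.urlencode(params, doseq=True)
    return parsed._replace(query=new_query).geturl()

print(page_url("https://arxiv.org/search/?query=parrots&start=0", 2))
# https://arxiv.org/search/?query=parrots&start=100
```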
```diff
@@ -136,35 +146,40 @@ def run(self, url):
             results = list(self.parse_html(res))
             self.mainLIST.extend(results)
             logging.info(f"Scraped abstracts {page*self.num_abstracts} - {len(self.mainLIST)}")
-
+
             if self.mode == 'bibtex':
                 # Create a BibDatabase
                 db = BibDatabase()
                 db.entries = self.mainLIST
-
+
                 # Write the BibDatabase to a BibTeX file
                 writer = BibTexWriter()
-                with open(f'Bibliography_ArXiv_{self.title}.bib', 'w') as bibfile:
+                with open(f'{self.title}.bib', 'w') as bibfile:
                     bibfile.write(writer.write(db))
             elif self.mode == 'csv':
                 # Convert the list of dictionaries to a DataFrame
                 df = pd.DataFrame(self.mainLIST)
-
+
                 # Write the DataFrame to a CSV file
-                df.to_csv(f'Bibliography_ArXiv_{self.title}.csv', index=False)
+                df.to_csv(f'{self.title}.csv', index=False)
 
             page += 1
-            if len(results) < self.num_abstracts: break
+            if len(results) < self.num_abstracts:
+                break
+
 
 def main():
     parser = argparse.ArgumentParser(description='Retrieve arXiv metadata.')
     parser.add_argument('url', help='The URL to scrape.')
     parser.add_argument('title', help='The title for the output file.')
+    parser.add_argument('mode', help='The file type of the output file.')
     args = parser.parse_args()
 
-    arxiv = ArXiv()
+    arxiv = ArXivCollector()
     arxiv.set_title(args.title)
+    arxiv.set_mode(args.mode)
     arxiv.run(args.url)
 
+
 if __name__ == '__main__':
-    main()
+    main()
```

example.py

Lines changed: 16 additions & 0 deletions

```python
from arxivcollector import ArXivCollector


def create(url, title="output", mode='bibtex'):
    # Initiate a new instance of the arXivCollector class
    collector = ArXivCollector()
    # Set the title of the exported file (optional)
    collector.set_title(title)
    collector.set_mode(mode)
    # Pass the search URL to the run method
    collector.run(url)


url = 'https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=stochastic+parrot&terms-0-field=title&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first'

create(url, "output", "csv")
```
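
One quirk worth noting: the final `create(url, "output", "csv")` call sits at module top level, so any `import example`, including the one in the new test below, kicks off a live scrape at import time. A conventional guard, shown as a sketch rather than as part of this commit:

```python
# Hypothetical guard (not in the committed file): keeps example.py runnable
# as a script while making `from example import create` side-effect free.
if __name__ == "__main__":
    create(url, "output", "csv")
```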

requirements.txt

Lines changed: 5 additions & 0 deletions

```text
bibtexparser==1.4.1
bs4==0.0.2
httpx==0.27.0
numpy==1.24.4
pandas==2.0.3
```

tests/test_generate_csv.py

Lines changed: 15 additions & 0 deletions

```python
from pathlib import Path

from example import create


def test_create_csv():
    url = 'https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=stochastic+parrot&terms-0-field=title&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first'
    create(url, "output", "csv")
    create(url, "output", "bibtex")

    # Check if file exists
    filenamecsv = "output.csv"
    filenamebib = "output.bib"
    assert Path(filenamecsv).exists(), "csv file was not created."
    assert Path(filenamebib).exists(), "bibtex file was not created."
```
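
Because `create` drives the live arXiv site, this test is network-dependent. A hedged sketch of a skip guard one could wrap around it; `pytest.mark.skipif` is standard pytest, while the reachability probe and its parameters are assumptions, not part of this commit:

```python
import socket

import pytest


def _arxiv_reachable(host="arxiv.org", port=443, timeout=3.0) -> bool:
    # Cheap TCP probe so the scrape is skipped when offline.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


@pytest.mark.skipif(not _arxiv_reachable(), reason="arxiv.org is unreachable")
def test_create_csv():
    ...  # body as in the committed test
```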
