Commit 45d22b9

Merge pull request #1 from anamatoso/main

Minor Corrections and Some Improvements

2 parents 81ef257 + ab12fab

File tree: 6 files changed (+161 −58 lines)

GitHub Actions workflow (new file)

Lines changed: 41 additions & 0 deletions

```yaml
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Python package

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.8", "3.9", "3.10", "3.11"]

    steps:
    - uses: actions/checkout@v4
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip setuptools wheel
        python -m pip install flake8 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        PYTHONPATH=. pytest
```

README.md

Lines changed: 20 additions & 9 deletions

````diff
@@ -1,3 +1,8 @@
+[![Python 3.8](https://img.shields.io/badge/python-3.8-blue.svg)](https://www.python.org/downloads/release/python-360/)
+[![Python 3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-360/)
+[![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/release/python-360/)
+[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-360/)
+
 arXivCollector
 ======
 
@@ -20,34 +25,35 @@ Getting started
 ------
 
 **arXivCollector** can be used in two ways:
-- By importing the `arXivCollector()` class;
+- By importing the `ArXivCollector()` class;
 - By executing the `arxivcollectory.py` script from the command line.
 
 ### Step 1: obtain an arXiv search results URL
 To obtain an arXiv search results URL for your search query, go to [https://arxiv.org/](https://arxiv.org/) or to the [advanced search page](https://arxiv.org/search/advanced) and construct your search query. Press the big blue button that says "Search", wait until you arrive on the page that displays the search results. Now copy the entire URL as is, and you're done ✅.
 
-### Step 2: use arXivCollector in one of two ways
+### Step 2: use ArXivCollector in one of two ways
 #### In Python
 Run the following Python code (e.g., in a script or from a Jupyter notebook).
 
 ```python
-from arxiv import arXivCollector
+from arxivcollector import ArXivCollector
 
 # Initiate a new instance of the arXivCollector class
-collector = arXivCollector()
-# Set the title of the exported file (optional)
+collector = ArXivCollector()
+# Set the title and type of the exported file
 collector.set_title("Parrots")
+collector.set_mode("csv")
 # Pass the search URL to the run method
 collector.run('https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=stochastic+parrot&terms-0-field=title&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first')
 ```
 
 After running this with your own search URL and title, a new file should appear in the parent directory of arXivCollector.
 
 #### From the commandline
-The first argument after `arxivcollectory.py` is the search URL, the second argument is your title.
+The first argument after `arxivcollectory.py` is the search URL, the second argument is your title, and the third argument is the type of the output file (csv or bibtex).
 
 ```bash
-python arxivcollector.py "https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=stochastic+parrot&terms-0-field=title&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first" "Parrots"
+python arxivcollector.py "https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=stochastic+parrot&terms-0-field=title&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first" "output" "csv"
 ```
 
 Special thanks
@@ -62,7 +68,7 @@ Fatima, R., Yasin, A., Liu, L., Wang, J., & Afzal, W. (2023). Retrieving arXiv,
 API
 ------
 
-### Class: arXivCollector
+### Class: ArXivCollector
 
 This class is used to collect metadata from the arXiv website and save it in either BibTeX or CSV format.
 
@@ -83,14 +89,19 @@ Initializes an instance of the ArXiv class.
 
 Sets the title of the output file.
 
+#### `set_mode(self, mode: str)`
+
+Sets the type of the output file.
+
 ##### Parameters:
 
 - `title` (str): The title to set.
+- `mode` (str): The type of file to set.
 
 #### `run(self, url)`
 
 Starts the collection process for the specified URL.
 
 ##### Parameters:
 
-- `url` (str): The URL to start the collection process for.
+- `url` (str): The URL to start the collection process for.
````
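
The API section above now documents `set_title`, `set_mode`, and `run` together. As a complement to the README's CSV example, here is a minimal sketch of the BibTeX path (the short search URL is a placeholder, not one from this commit):

```python
from arxivcollector import ArXivCollector

# Placeholder URL: substitute any arXiv search results URL.
url = "https://arxiv.org/search/?searchtype=all&query=stochastic+parrot"

collector = ArXivCollector()
collector.set_title("Parrots")  # after this PR, output goes to Parrots.bib
collector.set_mode("bibtex")    # "csv" would produce Parrots.csv instead
collector.run(url)
```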

arxivcollector.py

Lines changed: 64 additions & 49 deletions

```diff
@@ -1,44 +1,49 @@
 # Inspired by: Fatima, R., Yasin, A., Liu, L., Wang, J., & Afzal, W. (2023). Retrieving arXiv, SocArXiv, and SSRN metadata for initial review screening. Information and Software Technology, 161, 107251. https://doi.org/10.1016/j.infsof.2023.107251
 
-import httpx
-from bs4 import BeautifulSoup
-from bibtexparser.bwriter import BibTexWriter
-from bibtexparser.bibdatabase import BibDatabase
-import pandas as pd
+import argparse
 import datetime
-import urllib.parse
+import logging
 import sys
-import argparse
-import logging
+import urllib.parse
+
+import httpx
+import pandas as pd
+from bibtexparser.bibdatabase import BibDatabase
+from bibtexparser.bwriter import BibTexWriter
+from bs4 import BeautifulSoup
 
 MAX_RETRIES = 3
 
+
 class ArXivCollector():
-    def __init__(self,
+    def __init__(self,
                  user_agent="Mozilla/5.0 (X11; Linux x86_64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
                  num_abstracts=50,
                  arxiv_doi_prefix="https://doi.org/10.48550",
                  default_item_type="ARTICLE",
-                 verbose=False,
+                 verbose=False,
                  mode="bibtex") -> None:
         self.user_agent = user_agent
         self.num_abstracts = num_abstracts
         self.arxiv_doi_prefix = arxiv_doi_prefix
         self.default_item_type = default_item_type
         self.verbose = verbose
-        self.client = httpx.Client(headers={"User-Agent": self.user_agent,})
+        self.client = httpx.Client(headers={"User-Agent": self.user_agent})
         self.title = datetime.datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
         self.mode = mode
 
         logging.basicConfig(level=logging.INFO,
-                            force = True, handlers=[logging.StreamHandler(sys.stdout)])
+                            force=True, handlers=[logging.StreamHandler(sys.stdout)])
 
         # Error handling for the mode parameter
         if self.mode not in ["bibtex", "csv"]:
             raise ValueError("The mode parameter must be either 'bibtex' or 'csv'.")
 
     def set_title(self, title: str):
-        self.title = f"{self.title}_{title}"
+        self.title = f"{title}"
+
+    def set_mode(self, mode: str):
+        self.mode = f"{mode}"
 
     def send_request(self, url, method="GET"):
         for attempt in range(MAX_RETRIES):
```
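
The first hunk cuts off at the top of `send_request`'s retry loop, and the next hunk shows only its tail. Bridging them, a sketch of the loop they imply; the request call and exception handling inside the loop are assumptions (they fall outside the diff), and only `MAX_RETRIES`, the `for`/`else` shape, and the failure logging come from the source:

```python
import logging

import httpx

MAX_RETRIES = 3

def send_request(client: httpx.Client, url: str, method: str = "GET"):
    for attempt in range(MAX_RETRIES):
        try:
            # Assumed body: issue the request, treat HTTP errors as retryable.
            response = client.request(method, url)
            response.raise_for_status()
            return response
        except httpx.HTTPError:
            logging.warning(f"Attempt {attempt + 1}/{MAX_RETRIES} failed.")
    else:
        # This is the branch visible in the hunk: all attempts exhausted.
        logging.error(f"Failed to send request after {MAX_RETRIES} attempts.")
        return None
```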
```diff
@@ -52,11 +57,11 @@ def send_request(self, url, method="GET"):
         else:
             logging.error(f"Failed to send request after {MAX_RETRIES} attempts.")
             return None
-
-    def extract_text(self,soup:BeautifulSoup,selector):
+
+    def extract_text(self, soup: BeautifulSoup, selector):
         try:
             text = soup.select_one(selector).getText(strip=True)
-        except AttributeError as err:
+        except AttributeError:
             text = None
         return text
 
```
```diff
@@ -65,32 +70,37 @@ def find_data(self, soup: BeautifulSoup, keyword) -> str:
         for p in soup.select('p'):
             if p.getText(strip=True).startswith(keyword):
                 temp = p.getText(strip=True).split(';')
-                sub = temp[0].strip().removeprefix('Submitted')
-                ann = temp[-1].strip().removeprefix('originally announced')
+                sub = temp[0].strip()[len('Submitted'):]
+                ann = temp[-1].strip()[len('originally announced'):]
                 # Convert sub to a datetime object
                 sub = datetime.datetime.strptime(sub, "%d %B, %Y")
                 break
         return sub, ann
-
-    def parse_html(self,response:httpx.Response):
-        soup = BeautifulSoup(response.content,'html.parser')
+
+    def parse_html(self, response: httpx.Response):
+        soup = BeautifulSoup(response.content, 'html.parser')
 
         lis = soup.select('li.arxiv-result')
-        if len(lis) == 0: return []
-        for i,li in enumerate(lis,start=1):
-            title =self.extract_text(li,'p.title')
+        if len(lis) == 0:
+            return []
+        for i, li in enumerate(lis, start=1):
+            title = self.extract_text(li, 'p.title')
             if self.verbose:
-                print(i,title)
-
+                print(i, title)
+
             temp_authors = li.select('p.authors>a')
             authors = ' AND '.join([', '.join(j.getText(strip=True).split()[::-1]) for j in temp_authors])
 
-            Abstract = self.extract_text(li,'span.abstract-full').removesuffix('△ Less')
+            abstract_text = self.extract_text(li, 'span.abstract-full')
+            if abstract_text:
+                Abstract = abstract_text[:-len('△ Less')]
+            else:
+                Abstract = ''
 
-            extracted_text = self.extract_text(li,'p.comments > span:nth-of-type(2)')
+            extracted_text = self.extract_text(li, 'p.comments > span:nth-of-type(2)')
             note = extracted_text if extracted_text else ""
 
-            sub,ann = self.find_data(li,'Submitted')
+            sub, ann = self.find_data(li, 'Submitted')
 
             # Construct ID from first author's last name and year of submission
             id = authors.split(',')[0] + str(sub.year)
```
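
A note on the `removeprefix` changes in this hunk: `str.removeprefix` and `str.removesuffix` exist only on Python 3.9 and later, so replacing them with slicing is what lets the scraper run on the 3.8 entry in the new CI matrix. The slice is safe for the `'Submitted'` prefix because the surrounding `startswith` check guarantees it is present (the abstract's `[:-len('△ Less')]` slice likewise assumes its suffix is always there). A quick illustration of the equivalence, with a made-up date string:

```python
text = "Submitted 14 March, 2024"

# Python 3.9+ only:
# date_part = text.removeprefix("Submitted")

# Python 3.8-compatible equivalent used in this commit; identical output
# whenever the string is known to start with the prefix.
date_part = text[len("Submitted"):]
print(date_part.strip())  # 14 March, 2024
```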
```diff
@@ -100,18 +110,18 @@ def parse_html(self,response:httpx.Response):
                 pdf = li.select_one('p.list-title > span > a[href*="pdf"]')['href']
             except TypeError:
                 pdf = ""
-
+
             month_abbr = ["", "jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]
 
-            yield { # BibTeX-friendly format
-                "title":title,
-                "author":authors,
-                "abstract":Abstract,
-                "note":note,
-                "year":str(sub.year),
+            yield {  # BibTeX-friendly format
+                "title": title,
+                "author": authors,
+                "abstract": Abstract,
+                "note": note,
+                "year": str(sub.year),
                 "month": month_abbr[sub.month],
-                "doi": f"{self.arxiv_doi_prefix}/arXiv.{link.split('/')[-1]}", # Construct the DOI from the arXiv ID
-                "howpublished" : fr"\url{{{pdf}}}",
+                "doi": f"{self.arxiv_doi_prefix}/arXiv.{link.split('/')[-1]}",  # Construct the DOI from the arXiv ID
+                "howpublished": fr"\url{{{pdf}}}",
                 "ENTRYTYPE": self.default_item_type,
                 "ID": id
             }
```
```diff
@@ -123,10 +133,10 @@ def run(self, url):
             # Parse the URL and its parameters
             parsed_url = urllib.parse.urlparse(url)
             params = urllib.parse.parse_qs(parsed_url.query)
-
+
             # Update the 'start' parameter
-            params['start'] = [page*self.num_abstracts]
-
+            params['start'] = [page * self.num_abstracts]
+
             # Construct the new URL
             new_query = urllib.parse.urlencode(params, doseq=True)
             if 'advanced' not in params:
```
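
For context on this hunk: `run` pages through the results by rewriting the `start` query parameter before each request, in steps of `num_abstracts` (50 by default). A standalone sketch of that rewrite using the same `urllib.parse` calls visible in the diff; how `run` reassembles the final URL falls partly outside the hunk, so the `_replace(...).geturl()` step here is an assumption:

```python
import urllib.parse

def page_url(url: str, page: int, num_abstracts: int = 50) -> str:
    # Parse the URL, overwrite start=page*num_abstracts, and rebuild it.
    parsed = urllib.parse.urlparse(url)
    params = urllib.parse.parse_qs(parsed.query)
    params['start'] = [page * num_abstracts]
    new_query = urllib.parse.urlencode(params, doseq=True)
    return parsed._replace(query=new_query).geturl()

print(page_url("https://arxiv.org/search/?query=parrots&start=0", 2))
# https://arxiv.org/search/?query=parrots&start=100
```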
```diff
@@ -136,35 +146,40 @@ def run(self, url):
             results = list(self.parse_html(res))
             self.mainLIST.extend(results)
             logging.info(f"Scraped abstracts {page*self.num_abstracts} - {len(self.mainLIST)}")
-
+
             if self.mode == 'bibtex':
                 # Create a BibDatabase
                 db = BibDatabase()
                 db.entries = self.mainLIST
-
+
                 # Write the BibDatabase to a BibTeX file
                 writer = BibTexWriter()
-                with open(f'Bibliography_ArXiv_{self.title}.bib', 'w') as bibfile:
+                with open(f'{self.title}.bib', 'w') as bibfile:
                     bibfile.write(writer.write(db))
             elif self.mode == 'csv':
                 # Convert the list of dictionaries to a DataFrame
                 df = pd.DataFrame(self.mainLIST)
-
+
                 # Write the DataFrame to a CSV file
-                df.to_csv(f'Bibliography_ArXiv_{self.title}.csv', index=False)
+                df.to_csv(f'{self.title}.csv', index=False)
 
             page += 1
-            if len(results) < self.num_abstracts: break
+            if len(results) < self.num_abstracts:
+                break
+
 
 def main():
     parser = argparse.ArgumentParser(description='Retrieve arXiv metadata.')
     parser.add_argument('url', help='The URL to scrape.')
     parser.add_argument('title', help='The title for the output file.')
+    parser.add_argument('mode', help='The file type of the output file.')
     args = parser.parse_args()
 
-    arxiv = ArXiv()
+    arxiv = ArXivCollector()
     arxiv.set_title(args.title)
+    arxiv.set_mode(args.mode)
     arxiv.run(args.url)
 
+
 if __name__ == '__main__':
-    main()
+    main()
```

example.py

Lines changed: 16 additions & 0 deletions

```python
from arxivcollector import ArXivCollector


def create(url, title="output", mode='bibtex'):
    # Initiate a new instance of the arXivCollector class
    collector = ArXivCollector()
    # Set the title of the exported file (optional)
    collector.set_title(title)
    collector.set_mode(mode)
    # Pass the search URL to the run method
    collector.run(url)


url = 'https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=stochastic+parrot&terms-0-field=title&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first'

create(url, "output", "csv")
```
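
One quirk worth noting: the final `create(url, "output", "csv")` call sits at module top level, so any `import example`, including the one in the new test below, kicks off a live scrape at import time. A conventional guard, shown as a sketch rather than as part of this commit:

```python
# Hypothetical guard (not in the committed file): keeps example.py runnable
# as a script while making `from example import create` side-effect free.
if __name__ == "__main__":
    create(url, "output", "csv")
```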

requirements.txt

Lines changed: 5 additions & 0 deletions

```text
bibtexparser==1.4.1
bs4==0.0.2
httpx==0.27.0
numpy==1.24.4
pandas==2.0.3
```

tests/test_generate_csv.py

Lines changed: 15 additions & 0 deletions

```python
from pathlib import Path

from example import create


def test_create_csv():
    url = 'https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=stochastic+parrot&terms-0-field=title&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first'
    create(url, "output", "csv")
    create(url, "output", "bibtex")

    # Check if file exists
    filenamecsv = "output.csv"
    filenamebib = "output.bib"
    assert Path(filenamecsv).exists(), "csv file was not created."
    assert Path(filenamebib).exists(), "bibtex file was not created."
```
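
Because `create` drives the live arXiv site, this test is network-dependent. A hedged sketch of a skip guard one could wrap around it; `pytest.mark.skipif` is standard pytest, while the reachability probe and its parameters are assumptions, not part of this commit:

```python
import socket

import pytest


def _arxiv_reachable(host="arxiv.org", port=443, timeout=3.0) -> bool:
    # Cheap TCP probe so the scrape is skipped when offline.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


@pytest.mark.skipif(not _arxiv_reachable(), reason="arxiv.org is unreachable")
def test_create_csv():
    ...  # body as in the committed test
```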
