A powerful and intelligent plagiarism detection system designed to uphold academic integrity and ensure the authenticity of written works. Named after Heimdallr, the all-seeing Norse god, this system possesses unparalleled vigilance and a watchful eye over the realm of academic content.
More information Heimdallr.
- Heimdallr
- Comprehensive Plagiarism Detection: Heimdallr scans through a vast array of sources, including web content, books, papers, photographs, and previous student works, leaving no room for undetected plagiarism.
- Precise Location Identification: It doesn't just flag potential instances of plagiarism; Heimdallr provides pinpoint accuracy by highlighting the exact lines or words in question, enabling users to swiftly address the issue.
- Source Attribution: The system not only detects plagiarism but also identifies the source that is being plagiarized, whether it's a website, a specific publication, or the work of another student.
Heimdallr stands as a guardian of academic integrity, inspired by the mythological figure known for his unwavering vigilance. With its advanced detection capabilities and user-friendly interface, Heimdallr is the ultimate solution for educational institutions, researchers, and students to maintain the highest standards of academic honesty and originality.
- Variables prefixed with
FASTAPI_are used to configure the API UI.
| Name | Description | Default Value |
|---|---|---|
| FASTAPI_DEBUG | Debug Mode | False |
| FASTAPI_PROJECT_NAME | Swagger Title | Heimdallr |
| FASTAPI_PROJECT_DESCRIPTION | Swagger Description | ... |
| FASTAPI_PROJECT_LICENSE | License info | ... |
| FASTAPI_PROJECT_CONTACT | Contact details | ... |
| FASTAPI_VERSION | Application Version | template.version |
| FASTAPI_DOCS_URL | Swagger Endpoint | /docs |
| FASTAPI_MODEL_PATH | Trained model path | /app/models/topic_predictor.joblib |
| FASTAPI_DETECT_PLAGIARISM | Whether to detect plagiarism or not | True |
| FASTAPI_SIMILARITY_THRESHOLD | Minimum similarity percentage | 0.95 |
- Variables prefixed with
UVICORN_are used to configure the server.
| Name | Description | Default Value |
|---|---|---|
| UVICORN_HOST | Server Host | '127.0.0.1' |
| UVICORN_PORT | Server Port | 8000 |
| UVICORN_LOG_LEVEL | Log Level | 'info' |
| UVICORN_RELOAD | Enable/Disable Reload | False |
- Variables prefixed with
MONGO_are used for MongoDB connection.
| Name | Description | Default Value |
|---|---|---|
| MONGO_CLIENT | Server Host | 'mongodb://localhost:27017' |
| MONGO_DATABASE | Server Database | 'heimdallr-local' |
| MONGO_USER | User LogIn credentials | None |
| MONGO_PASSWORD | User credentials | None |
This project uses make as an adaptation layer.
Run make help to see all available commands.
HIGHLY RECOMMENDED: Use docker-compose to run the application locally.
-
Run:
docker-compose up
This will:
- Build the application image
- Start a
MongoDBserver - Start the application
- Train a model for topic prediction
- Run DB migrations
-
Go to http://localhost:8000/docs to see the API documentation.
-
Use the
Verify AssignmentPOSTmethod to verify a document. You can use the rifkin_test document as an example. -
See the logs for the
heimdallrservice to see the results. e.g:INFO: Started server process [1] INFO: Waiting for application startup. INFO: Loading model from /app/models/topic_predictor_dev.joblib INFO: Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) INFO: 172.20.0.1:52912 - "POST /api/v1/assignments HTTP/1.1" 202 Accepted INFO: Read Assignment(id=32750af8-bdfe-4508-a37a-5aa9718a1188, author=Franco Zanette) INFO: Similar(0.992015) to Assignment(id=891bdd47-5ec0-40cf-ad42-9e7c99a78648, author=David Choren) INFO: Similar(0.994410) to Assignment(id=4bdd4797-1617-45be-8b64-bd36b97908ff, author=Levy Nazareno Isaac) INFO: Similar(0.991231) to Assignment(id=35cc1575-ea46-416c-95cd-260cb7efda75, author=Leon Peralta)NOTE: The
POSTmethod returns a202 Acceptedresponse. This means that the document is being verified in the background -
When verification is complete you should a log similar to:
INFO: Finished comparison with Assignment(id=4bdd4797-1617-45be-8b64-bd36b97908ff, author=Levy Nazareno Isaac) in 117.274768 seconds. INFO: Finished comparison with Assignment(id=35cc1575-ea46-416c-95cd-260cb7efda75, author=Leon Peralta) in 107.669727 seconds. INFO: Finished comparison with Assignment(id=891bdd47-5ec0-40cf-ad42-9e7c99a78648, author=David Choren) in 147.236305 seconds. INFO: Assignment e68d59ff-d725-4dfc-940a-b5ab449cf98c verified.NOTE: Verification time depends on the size of the document and the number of documents in the database. It may mate take a while.
-
Retrieve the assignment with the
GETmethod, with itsUUIDas parameter. e.g:{ "data": { "id": "32750af8-bdfe-4508-a37a-5aa9718a1188", "title": "TP5-Franco Zanette.docx.docx", "topic": "Emerging Systems", "author": "Franco Zanette", "similarities": [ { "id": "4bdd4797-1617-45be-8b64-bd36b97908ff", "author": "Levy Nazareno Isaac", "plagiarism": 0.19135095761849497, "similarities": [ { "present": "Puede describir el vínculo entre las leyes de la termodinámica de Newton y la “factura entrópica”.", "compared": "Puede describir el vínculo entre las leyes de la termodinámica de Newton y la “factura entrópica”.", "plagiarism": 1 }, { "present": "Qué dice Rifkin que la “internet de las cosas IOT” le aportará a la 3ra revolución industrial?", "compared": "Qué dice Rifkin que la “internet de las cosas IOT” le aportará a la 3ra revolución industrial?", "plagiarism": 1 }, { "present": "NOTA: las respuestas no deberán superar en su conjunto a 2 páginas del mismo formato que esta guía.", "compared": "NOTA: las respuestas no deberán superar en su conjunto a 2 páginas del mismo formato que esta guía.", "plagiarism": 1 }, { "present": "Podría caracterizar la Primera y Segunda revolución industrial al decir de Rifkin?", "compared": "Podría caracterizar la Primera y Segunda revolución industrial al decir de Rifkin?", "plagiarism": 1 }, { "present": "Qué inventos son las metáforas de cada infraestructura en cada una de esas etapas.", "compared": "Qué inventos son las metáforas de cada infraestructura en cada una de esas etapas.", "plagiarism": 1 }, { "present": "Qué ejemplos actuales de “procomunes” se le ocurren?", "compared": "Qué ejemplos actuales de “procomunes” se le ocurren?", "plagiarism": 1 }, { "present": "Qué límites le ve Ud. a los procomunes como forma de producción?", "compared": "Qué límites le ve Ud. a los procomunes como forma de producción?", "plagiarism": 1 }, { "present": "qué estaría faltando?", "compared": "qué estaría faltando?", "plagiarism": 1 }, { "present": "La principal limitación de los procomunes como forma de producción la incapacidad de la sociedad de proteger los recursos procomunes de la sobreexplotación por parte un individuo", "compared": "El internet de las cosas permitirá unificar la comunicación, la energía y la logística posibilitando la optimización de los procesos .", "plagiarism": 0.9502497961617389 }, { "present": "E. intangible o “sin peso”", "compared": "E. intangible o “sin peso”", "plagiarism": 1 } ] }, { "id": "35cc1575-ea46-416c-95cd-260cb7efda75", "author": "Leon Peralta", "plagiarism": 0.11538461538461539, "similarities": [ { "present": "Qué dice Rifkin que la “internet de las cosas IOT” le aportará a la 3ra revolución industrial?", "compared": "Qué dice Rifkin que la “internet de las cosas IOT” le aportará a la 3ra revolución industrial?", "plagiarism": 1 }, { "present": "Podría caracterizar la Primera y Segunda revolución industrial al decir de Rifkin?", "compared": "Podría caracterizar la Primera y Segunda revolución industrial al decir de Rifkin?", "plagiarism": 1 }, { "present": "Qué inventos son las metáforas de cada infraestructura en cada una de esas etapas.", "compared": "Qué inventos son las metáforas de cada infraestructura en cada una de esas etapas.", "plagiarism": 1 }, { "present": "Qué límites le ve Ud. a los procomunes como forma de producción?", "compared": "Qué límites le ve Ud. a los procomunes como forma de producción?", "plagiarism": 1 }, { "present": "Qué ejemplos actuales de “procomunes” se le ocurren?", "compared": "Qué ejemplos actuales de “procomunes” se le ocurren?", "plagiarism": 1 }, { "present": "qué estaría faltando?", "compared": "qué estaría faltando?", "plagiarism": 1 } ] }, { "id": "891bdd47-5ec0-40cf-ad42-9e7c99a78648", "author": "David Choren", "plagiarism": 0.1519146892150015, "similarities": [ { "present": "Las plataformas tecnológicas de la primera y segunda revolución industrial estaban centralizadas y sometidas a un control jerarquizado y su explotación estaba basada en la idea de que los recursos de la Tierra están para el servicio de la personas y el lucro..", "compared": "Las plataformas tecnológicas de la primera y la segunda revoluciones industriales estaban centralizadas y sometidas a un control jerarquizado.", "plagiarism": 0.9535765262616537 }, { "present": "3.¿Qué dice Rifkin que la “internet de las cosas IOT” le aportará a la 3ra revolución industrial?", "compared": "Qué dice Rifkin que la “internet de las cosas IOT” le aportará a la 3ra revolución industrial?", "plagiarism": 0.992672017325718 }, { "present": "Qué límites le ve Ud. a los procomunes como forma de producción?", "compared": "Qué límites le ve Ud. a los procomunes como forma de producción?", "plagiarism": 1 }, { "present": "NOTA: las respuestas no deberán superar en su conjunto a 2 páginas del mismo formato que esta guía.", "compared": "NOTA: las respuestas no deberán superar en su conjunto a 2 páginas del mismo formato que esta guía.", "plagiarism": 0.9813714487124504 }, { "present": "qué estaría faltando?", "compared": "qué estaría faltando?", "plagiarism": 1 }, { "present": "2.¿Podría caracterizar la Primera y Segunda revolución industrial al decir de Rifkin?", "compared": "Podría caracterizar la Primera y Segunda revolución industrial al decir de Rifkin?", "plagiarism": 0.989975333730042 }, { "present": "E. intangible o “sin peso”", "compared": "E. intangible o “sin peso”", "plagiarism": 1 }, { "present": "5.¿Qué ejemplos actuales de “procomunes” se le ocurren?", "compared": "Qué ejemplos actuales de “procomunes” se le ocurren?", "plagiarism": 0.9819685131502129 } ] } ] } } -
To stop the application run:
docker-compose down
This package uses poetry for dependency management and Python 3.10 as interpreter.
Install poetry in the system site_packages. DO NOT INSTALL IT in a virtual environment itself.
To install poetry, run:
pip install poetry-
Clone the repository
git clone "git@github.com:tomasanchez/heimdallr.git" -
Install dependencies
cd cosmic-fastapi && poetry install
Note that poetry doesn't activate the virtual environment for you. You have to do it manually. Or prefix subsequent the commands with
poetry run
You can view the environment that poetry uses with
poetry env info
To activate run:
poetry shell
-
Download
spaCytrained Spanish Modelpoetry run python -m spacy download es_core_news_lg
-
Train model for topic prediction (Optional)
- Run script
poetry run python -m heimdallr.train
- Update
FASTAPI_MODEL_PATHenvironment variable with the new model pathexport FASTAPI_MODEL_PATH=/path/to/new/model
-
Activate pre-commit hooks (Optional)
Using pre-commit to run some checks before committing is highly recommended.
To activate the pre-commit hooks run:
pre-commit install
To run the checks manually run:
poetry run pre-commit run --all-files
The following checks are run:
black,flake8,isort,mypy,pylint.
-
Either:
poetry run python -m heimdallr.main
NOTE: A
MongoDBserver running to verify documents.docker-compose.ymlcontains a service configured for that. -
Go to http://localhost:8000/docs to see the API documentation.
For supporting .doc files you must have antiword installed. Otherwise, the API will not verify .doc files, but
It will still work for .pdf and
If using Ubuntu, run:
sudo apt-get update -y && sudo apt-get install -y antiwordIf using Mac OSX, run:
brew install antiwordRun the application in docker. Requires no further configuration.
- Run:
docker-compose up
- Go to http://localhost:8000/docs to see the API documentation.
You can run the tests with:
poetry run pytestor with the make command:
make testTo generate a coverage report add --cov src.
poetry run pytest --cov srcOr with the make command:
make coverTo update the dependencies run:
poetry update- Performance improvements:
- Maybe filter documents by topic. Right now, all documents are being compared.
- There is a lot to improve about processing speed. A possible solution is to use a
Celerytask queue to process documents in the background, as usingspaCyis CPU intensive. - Better response management. Right now, if a document is being processed, the API will return a
202 Acceptedresponse, but it is impossible to know any errors that may occur during processing. It's only possible to know when the process is finished by checking the logs. Or if by retrieving its ID it doesn't return a404 Not Found.
- Interesting Topics: their definition was more related to what were the assignment requirements.
- Internet scrapping: It was not implemented due to time constraints. It would be a great addition to the system.
Also, it would relative easy to implement: an
Adaptercapable of scrapping the web and returning anAssignmentwill do the job. - Batch Processing
- Title Identification: Only file names are used to identify documents. More NLP domain knowledge is needed to identify titles.
- Better Author recognition: Right now is using the first identified name. Some documents don't include authors
name, others aren't their name first mentioned. Sometimes another token is wrongly recognized as a
PERtoken. - Requirements Identification: Requirements are being considered as plagiarism. A better approach would be to identify them and exclude them from the comparison like for each topic have a list of requirements and exclude them.
- FastAPI official Documentation
- Pydantic official Documentation
- Cosmic Python
- Cosmic FastAPI template
- spaCy Documentation
This project is licensed under the terms of the MIT license unless otherwise specified. See LICENSE for
more details or visit https://mit-license.org/.
This project was designed and developed by Tomás Sánchez <info@tomsanchez.com.ar>.
If you find this project useful, please consider supporting its development by sponsoring it.