A Question Answering (QA) system designed specifically for the Malay language, allowing users to ask natural-language questions and receive accurate answers based on a provided context. It supports context input via file upload or raw text and includes a user-friendly Gradio interface. The system also uses Retrieval-Augmented Generation (RAG) to handle multiple document inputs for better information retrieval.
- Contextual QA: RAG-based question answering with support for `.txt`, `.pdf`, and `.docx` files.
- Gradio UI: Intuitive chatbot interface for users to chat with the QA system.
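Reading uploaded context files could be handled with a small dispatch on the file extension. The sketch below is illustrative (the repository's actual loader may differ) and assumes `pypdf` and `python-docx` for the binary formats:

```python
from pathlib import Path

def extract_text(path: str) -> str:
    """Return the plain text of a .txt, .pdf, or .docx file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8")
    if suffix == ".pdf":
        from pypdf import PdfReader  # assumed dependency
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if suffix == ".docx":
        from docx import Document  # assumed dependency (python-docx)
        return "\n".join(p.text for p in Document(path).paragraphs)
    raise ValueError(f"Unsupported file type: {suffix}")
```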
- Source: Web-scraped from The Star (Tech articles) using newspaper3k and Selenium.
- Preprocessing: Question generation using valhalla/t5-base-qg-hl.
- Translation: The original English article text is translated to Malay using malaya's translation model.
- Languages: English (original) and Malay (translated).
- Formats: `.csv`, `.jsonl`, `.parquet`.
- Available on HuggingFace Datasets.
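Writing the same QA records to `.csv` and `.jsonl` takes only the standard library; the records and column names below are made up for illustration, since the dataset's actual schema is not shown here:

```python
import csv
import json

# Hypothetical records; the real dataset's columns may differ.
records = [
    {"question": "Apakah ibu negara Malaysia?",
     "context": "Kuala Lumpur ialah ibu negara Malaysia.",
     "answer": "Kuala Lumpur"},
    {"question": "What is the capital of Malaysia?",
     "context": "Kuala Lumpur is the capital of Malaysia.",
     "answer": "Kuala Lumpur"},
]

# .csv: one header row, one record per row.
with open("qa_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "context", "answer"])
    writer.writeheader()
    writer.writerows(records)

# .jsonl: one JSON object per line.
with open("qa_dataset.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```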
- Training: Fine-tuned from `timpal0l/mdeberta-v3-base-squad2`.
- Dataset used: The model is trained on both the English and Malay QA datasets.
- Architecture: Transformer-based (DeBERTa).
- Available on HuggingFace Models.
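An extractive QA model like mDeBERTa predicts a start and an end logit for every token and returns the highest-scoring span from the context. A toy sketch of that span-selection step, with made-up logits rather than real model output:

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e], with s <= e < s + max_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy example: logits peak on the last token.
tokens = ["Kuala", "Lumpur", "ialah", "ibu", "negara", "Malaysia"]
start_logits = [0.1, 0.2, 0.0, 0.0, 0.0, 2.5]
end_logits   = [0.0, 0.1, 0.0, 0.0, 0.0, 3.0]
s, e = best_span(start_logits, end_logits)
answer = " ".join(tokens[s:e + 1])  # → "Malaysia"
```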
- Clone the repository.
- Create a virtual environment.
- Install CUDA, or check your installed CUDA version with this command (cmd):

  ```
  nvcc --version
  ```

- Install PyTorch according to your installed CUDA version.
- Install the required dependencies:

  ```
  pip install -r requirements.txt
  ```
- Start the Gradio application:

  ```
  python app.py
  ```

  Wait until it shows this output:

  ```
  Device set to use cuda:0
  * Running on local URL: http://127.0.0.1:7860
  * To create a public link, set `share=True` in `launch()`.
  ```

- Navigate to http://localhost:7860/ and start using the application.
- Done!
- Provide Context:
  - Upload `.txt`, `.pdf`, or `.docx` files, or
  - Paste raw text into the "Extracted Context" field.
- Ask Questions:
- Enter your question in the input box.
- The model will return an answer based on the uploaded/pasted context.
- RAG Pipeline:
When multiple documents are provided, the system uses RAG to extract relevant content before answering.
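The retrieval step can be sketched with a simple bag-of-words relevance score. This is a deliberate simplification with hypothetical helper names; real RAG pipelines typically rank with TF-IDF or dense embeddings:

```python
from collections import Counter

def score(question: str, chunk: str) -> float:
    """Count question words that also appear in the chunk (case-insensitive)."""
    q_words = set(question.lower().split())
    c_words = Counter(chunk.lower().split())
    return sum(c_words[w] for w in q_words)

def retrieve(question: str, documents: list[str], top_k: int = 2) -> str:
    """Join the top_k most relevant documents into one context string."""
    ranked = sorted(documents, key=lambda d: score(question, d), reverse=True)
    return "\n\n".join(ranked[:top_k])

docs = [
    "Gradio provides a simple web UI for machine learning demos.",
    "RAG retrieves relevant passages before the model answers a question.",
    "The Star publishes technology articles from Malaysia.",
]
context = retrieve("How does RAG answer a question?", docs, top_k=1)
```

The returned `context` is then passed to the QA model in place of a single raw document.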
Huge thanks to:
- HuggingFace for model and dataset hosting.
- Gradio for the interactive UI.
- The Star for the original tech articles.
- malaya for the Malay translation model.