A Question Answering (QA) system designed specifically for the Malay language, allowing users to ask natural-language questions and receive accurate answers based on a provided context. It supports context input via file upload or raw text and includes a user-friendly Gradio interface. The system also uses Retrieval-Augmented Generation (RAG) to handle multiple document inputs for better information retrieval.
- Contextual QA: RAG-based question answering with support for `.txt`, `.pdf`, and `.docx` files.
- Gradio UI: Intuitive chatbot interface for users to chat with the QA system.
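Reading uploaded context files could be handled with a small dispatch on the file extension. The sketch below is illustrative (the repository's actual loader may differ) and assumes `pypdf` and `python-docx` for the binary formats:

```python
from pathlib import Path

def extract_text(path: str) -> str:
    """Return the plain text of a .txt, .pdf, or .docx file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8")
    if suffix == ".pdf":
        from pypdf import PdfReader  # assumed dependency
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if suffix == ".docx":
        from docx import Document  # assumed dependency (python-docx)
        return "\n".join(p.text for p in Document(path).paragraphs)
    raise ValueError(f"Unsupported file type: {suffix}")
```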
- Source: Web-scraped from The Star (Tech articles) using newspaper3k and Selenium.
- Preprocessing: Question generation using valhalla/t5-base-qg-hl.
- Translation: The original English article text is translated to Malay using malaya's translation model.
- Languages: English (original) and Malay (translated).
- Formats: `.csv`, `.jsonl`, `.parquet`.
- Available on HuggingFace Datasets.
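Writing the same QA records to `.csv` and `.jsonl` takes only the standard library; the records and column names below are made up for illustration, since the dataset's actual schema is not shown here:

```python
import csv
import json

# Hypothetical records; the real dataset's columns may differ.
records = [
    {"question": "Apakah ibu negara Malaysia?",
     "context": "Kuala Lumpur ialah ibu negara Malaysia.",
     "answer": "Kuala Lumpur"},
    {"question": "What is the capital of Malaysia?",
     "context": "Kuala Lumpur is the capital of Malaysia.",
     "answer": "Kuala Lumpur"},
]

# .csv: one header row, one record per row.
with open("qa_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "context", "answer"])
    writer.writeheader()
    writer.writerows(records)

# .jsonl: one JSON object per line.
with open("qa_dataset.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```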
- Training: Fine-tuned from `timpal0l/mdeberta-v3-base-squad2`.
- Dataset used: The model is trained on both the English and Malay QA datasets.
- Architecture: Transformer-based (DeBERTa).
- Available on HuggingFace Models.
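An extractive QA model like mDeBERTa predicts a start and an end logit for every token and returns the highest-scoring span from the context. A toy sketch of that span-selection step, with made-up logits rather than real model output:

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e], with s <= e < s + max_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy example: logits peak on the last token.
tokens = ["Kuala", "Lumpur", "ialah", "ibu", "negara", "Malaysia"]
start_logits = [0.1, 0.2, 0.0, 0.0, 0.0, 2.5]
end_logits   = [0.0, 0.1, 0.0, 0.0, 0.0, 3.0]
s, e = best_span(start_logits, end_logits)
answer = " ".join(tokens[s:e + 1])  # → "Malaysia"
```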
- Clone the repository.
- Create a virtual environment.
- Install CUDA, or check your installed CUDA version with this command (cmd):

  ```
  nvcc --version
  ```

- Install PyTorch according to your installed CUDA version.
- Install the required dependencies:

  ```
  pip install -r requirements.txt
  ```
- Start the Gradio application:

  ```
  python app.py
  ```

  Wait until it shows this output:

  ```
  Device set to use cuda:0
  * Running on local URL: http://127.0.0.1:7860
  * To create a public link, set `share=True` in `launch()`.
  ```

- Navigate to http://localhost:7860/ and start using the application.
- Done!
- Provide Context:
  - Upload `.txt`, `.pdf`, or `.docx` files, or
  - Paste raw text into the "Extracted Context" field.
- Ask Questions:
- Enter your question in the input box.
- The model will return an answer based on the uploaded/pasted context.
- RAG Pipeline:
When multiple documents are provided, the system uses RAG to extract relevant content before answering.
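The retrieval step can be sketched with a simple bag-of-words relevance score. This is a deliberate simplification with hypothetical helper names; real RAG pipelines typically rank with TF-IDF or dense embeddings:

```python
from collections import Counter

def score(question: str, chunk: str) -> float:
    """Count question words that also appear in the chunk (case-insensitive)."""
    q_words = set(question.lower().split())
    c_words = Counter(chunk.lower().split())
    return sum(c_words[w] for w in q_words)

def retrieve(question: str, documents: list[str], top_k: int = 2) -> str:
    """Join the top_k most relevant documents into one context string."""
    ranked = sorted(documents, key=lambda d: score(question, d), reverse=True)
    return "\n\n".join(ranked[:top_k])

docs = [
    "Gradio provides a simple web UI for machine learning demos.",
    "RAG retrieves relevant passages before the model answers a question.",
    "The Star publishes technology articles from Malaysia.",
]
context = retrieve("How does RAG answer a question?", docs, top_k=1)
```

The returned `context` is then passed to the QA model in place of a single raw document.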
Huge thanks to:
- HuggingFace for model and dataset hosting.
- Gradio for the interactive UI.
- The Star for the original tech articles.
- malaya for the Malay translation model.