Arabic Financial Instruction-Tuning Dataset & Models
SAHM is the first large-scale Arabic financial NLP benchmark covering both modern financial analysis and Islamic/Shari’ah-compliant reasoning, introduced in our paper SAHM: Arabic Financial Instruction-Tuning Dataset And Models.
It includes 14,000+ high-quality Arabic samples across eight tasks, derived from:
- AAOIFI Shari’ah standards
- Official fatwa archives
- Corporate earnings reports
- Market news
- Business and accounting exams
- Islamic finance regulatory material
SAHM also introduces:
🟦 SAHM-7B-Instruct, an Arabic financial instruction-tuned model
🟦 A unified evaluation framework
🟦 First-of-its-kind datasets for Islamic finance and Arabic corporate analysis
Our benchmark includes eight diverse tasks:
- Islamic Finance Shari’ah Standards QA
- Islamic Financial Fatwa QA
- Islamic Financial Fatwa MCQ
- Business MCQ
- Accounting MCQ
- Financial Report Sentiment Analysis
- Report Extractive Summarization
- Event–Cause Reasoning QA
These tasks reflect real Arabic financial workflows, combining modern finance with Islamic jurisprudence (fiqh al-muʿāmalāt).
Each dataset adheres to a unified JSON schema and standardized evaluation protocol.
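As an illustration of what such a unified schema enables, here is a sketch of a record in instruction-tuning form. The field names below are hypothetical, not the paper's actual schema:

```python
import json

# Hypothetical record illustrating a unified instruction-tuning schema;
# the actual SAHM field names may differ.
record = {
    "task": "fatwa_qa",
    "instruction": "A question about the permissibility of a financial transaction.",
    "input": "",
    "output": "A short Arabic reference answer.",
}

# Every record carries the same top-level keys, so one loader and one
# evaluation protocol can serve all eight tasks.
assert set(record) == {"task", "instruction", "input", "output"}
print(json.dumps(record, ensure_ascii=False)[:30])
```

Keeping every task in the same shape is what makes a single standardized evaluation harness possible.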
| Task | #Train | #Eval | Format | Capability |
|---|---|---|---|---|
| Shari’ah Standards QA | 1,621 | 406 | QA | Islamic finance legal reasoning |
| Islamic Fatwa QA | 11,703 | 250 | QA | Faith-based financial rulings |
| Event–Cause Reasoning | 160 | 40 | QA | Financial causal inference |
| Extractive Summarization | 160 | 40 | Summary | Financial disclosure extraction |
| Fatwa MCQ | – | 250 | MCQ | Recognition-style reasoning |
| Business MCQ | 381 | 76 | MCQ | Business fundamentals |
| Accounting MCQ | 95 | 24 | MCQ | Numerical & IFRS reasoning |
| Sentiment Analysis | 160 | 40 | MCQ | Financial polarity detection |
Full details appear in Table 1 of the paper.
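The MCQ-format tasks are naturally scored by exact-match accuracy over answer choices. A minimal sketch of such a scorer (the function name and normalization are ours, not SAHM's evaluation code):

```python
def mcq_accuracy(predictions, references):
    """Exact-match accuracy (%) over answer letters, e.g. 'A'-'D'.

    Case and surrounding whitespace are normalized before comparison.
    """
    assert len(predictions) == len(references)
    correct = sum(p.strip().upper() == r.strip().upper()
                  for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

# Two of three answers match after normalization.
print(round(mcq_accuracy(["A", "b", "C"], ["A", "B", "D"]), 2))  # → 66.67
```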
SAHM-7B-Instruct is a 7B Arabic-centric model instruction-tuned on all SAHM datasets.
Built on top of ALLAM-7B, it achieves:
- Best MCQ performance among Arabic/open models
- A +37.5-point improvement in Accounting MCQ accuracy
- Strong business & sentiment accuracy
- Competent but still developing open-ended reasoning
See Table 2 in the paper for the full comparison.
| Model | Mean Accuracy (%) |
|---|---|
| GPT-5 | 73.9 |
| GPT-4o | 67.0 |
| Qwen2.5-72B | 60.4 |
| Fanar-1-9B | 53.9 |
| ALLAM-7B | 56.1 |
| SAHM-7B-Instruct (ours) | 71.7 |
Average Judge Score (0–10):
| Model | Score |
|---|---|
| GPT-5 | 8.98 |
| Claude 4 Sonnet | 7.77 |
| GPT-4o | 7.08 |
| Gemini 2.5 Pro | 5.73 |
| ALLAM-7B | 5.05 |
| Fanar-1-9B | 4.82 |
| SAHM-7B-Instruct | 5.07 |
(Open-ended tasks remain significantly harder for current Arabic models.)
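The judge scores above are per-model averages over open-ended items on a 0–10 scale. A minimal aggregation helper (ours, not SAHM's evaluation code) might look like:

```python
from statistics import mean

def average_judge_score(scores):
    """Clamp per-item judge scores to the 0-10 scale and average them."""
    clamped = [min(10.0, max(0.0, s)) for s in scores]
    return round(mean(clamped), 2)

# Three hypothetical per-item scores from an LLM judge.
print(average_judge_score([8.0, 9.5, 9.44]))  # → 8.98
```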
Install:

```bash
git clone https://github.com/mbzuai-nlp/SAHM
cd SAHM
pip install -r requirements.txt
```

Load a dataset:

```python
from sahm import load_dataset

ds = load_dataset("sahm", "fatwa_qa")
print(ds["train"][0])
```

Run inference with SAHM-7B-Instruct:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("mbzuai-nlp/SAHM-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("mbzuai-nlp/SAHM-7B-Instruct")

# "Explain the ruling on a murabaha sale when the seller does not own the goods."
prompt = "اشرح حكم بيع المرابحة في حالة عدم تملك السلعة."
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))
```

Repository structure:

```
SAHM/
│
├── data/
│   ├── shariah_standards/
│   ├── fatwa_qa/
│   ├── mcq/
│   ├── sentiment/
│   ├── summarization/
│   └── event_cause/
│
├── models/
│   └── SAHM-7B-Instruct/
│
├── docs/
│   └── assets/logo.png
│
├── evaluation/
└── README.md
```

If you use SAHM, please cite:
```bibtex
@article{sahm2025,
  title={SAHM: Arabic Financial Instruction-Tuning Dataset And Models},
  author={Elbadry, Rania and Ahmad, Sarfraz and Bouch, Dani and Ahsan, Momina and Peng, Xueqing and Huang, Jimin and AlMahri, Muhra and Khalil, Marwa Elsaid and Wang, Yuxia and Lahlou, Salem and Stoyanov, Veselin and Ananiadou, Sophia and Nakov, Preslav and Xie, Zhuohan},
  year={2025},
  institution={MBZUAI}
}
```

The dataset and code are released under the MIT License.
