This project demonstrates the application of Non-Negative Matrix Factorization (NMF) for topic modeling on a dataset of abstracts. Below, you will find a detailed explanation of the dataset, target variable, mathematical background of NMF, and evaluation methodology.
"""
## 📂 Project Structure
├── data/
│ └── NLP_Topic_modeling_Data.csv # abstracts with 31 discipline labels
├── NMF_TOPIC_MODELING.ipynb # Full analysis pipeline
└── README.md # Documentation
"""The dataset consists of research abstracts across various scientific disciplines. Each entry contains:
- `id`: a unique identifier for each abstract.
- `ABSTRACT`: the text of the research abstract, which is the main input for topic modeling.
- `Physics`, `Mathematics`, `Computer Science`, etc.: binary columns (31 in total) indicating the fields of study associated with each abstract.
## 🎯 Target Variable

The target variable for this project is the `ABSTRACT` column, which contains the text used for topic extraction and modeling.
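A minimal sketch of loading the dataset with pandas (the file path follows the project structure above):

```python
import pandas as pd

# Load the abstracts together with their 31 binary field-of-study labels
df = pd.read_csv("data/NLP_Topic_modeling_Data.csv")

# The ABSTRACT column is the input text for topic modeling
abstracts = df["ABSTRACT"].astype(str).tolist()
print(f"Loaded {len(abstracts)} abstracts")
```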
## 🧮 Mathematical Background of NMF

We use Non-Negative Matrix Factorization (NMF) for topic modeling. NMF is a dimensionality reduction technique that factorizes a non-negative matrix $V$ into two non-negative matrices $W$ and $H$, such that:

$$V \approx WH$$

where:

- $V$: the document-term matrix,
- $W$: the document-topic matrix,
- $H$: the topic-term matrix.

The optimization problem solved by NMF is:

$$\min_{W \geq 0,\ H \geq 0} \lVert V - WH \rVert_F^2$$

where $\lVert \cdot \rVert_F$ denotes the Frobenius norm.
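A minimal sketch of this factorization with scikit-learn (the TF-IDF settings and `n_components=10` are illustrative assumptions, not the notebook's exact configuration):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Build the document-term matrix V from the raw abstracts
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
V = vectorizer.fit_transform(abstracts)

# Factorize V ≈ WH under non-negativity constraints
model = NMF(n_components=10, init="nndsvd", random_state=42)
W = model.fit_transform(V)   # document-topic matrix
H = model.components_        # topic-term matrix

# Frobenius-norm reconstruction error ||V - WH||_F
print("Reconstruction error:", model.reconstruction_err_)
```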
## 📈 Topic Coherence

The coherence score measures the semantic similarity between the words in a topic; a higher coherence score indicates more interpretable topics. It is calculated as follows:
**1. Preprocessing**

- The text is cleaned by removing stop words, punctuation, and irrelevant tokens.
- The text is tokenized into individual words (tokens).
- Words are lemmatized to their root forms.
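A minimal preprocessing sketch using NLTK (the exact cleaning rules in the notebook may differ):

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase and drop punctuation / non-alphabetic characters
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenize on whitespace, remove stop words, and lemmatize
    return [lemmatizer.lemmatize(tok) for tok in text.split()
            if tok not in stop_words and len(tok) > 2]

tokenized_docs = [preprocess(doc) for doc in abstracts]
```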
**2. Topic Extraction**

- After applying NMF, each topic is represented as a ranked list of words: the most significant words for that topic, determined by their weights in the topic-term matrix $H$.
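A sketch of ranking the top words per topic from the rows of $H$ (variable names follow the earlier snippets):

```python
import numpy as np

def top_words(H, feature_names, n_top=10):
    # For each topic (row of H), take the highest-weighted terms
    return [[feature_names[i] for i in np.argsort(row)[::-1][:n_top]]
            for row in H]

feature_names = vectorizer.get_feature_names_out()
topics = top_words(H, feature_names)
for k, words in enumerate(topics):
    print(f"Topic {k}: {', '.join(words)}")
```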
**3. Pairwise Word Similarity**

- For each topic, pairs of the top $N$ words are created.
- A similarity measure, such as Pointwise Mutual Information (PMI), is calculated for each pair based on their co-occurrence in the original dataset.
**4. Average Coherence**

- The coherence score for a topic is the average of the pairwise similarities of its words.
- The overall coherence score $C$ across all topics is:

$$C = \frac{1}{N} \sum_{i=1}^{N} \text{Coherence}(\text{Topic}_i)$$

where $N$ is the number of topics and $\text{Coherence}(\text{Topic}_i)$ is the coherence score of the $i$-th topic.
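A simplified, hand-rolled sketch of steps 3–4, estimating PMI from document-level co-occurrence counts (an illustration only; the notebook itself relies on gensim's `CoherenceModel`, described below):

```python
import math
from itertools import combinations

def pmi_coherence(topics, tokenized_docs, top_n=10):
    # Document-frequency counts over the preprocessed corpus
    doc_sets = [set(doc) for doc in tokenized_docs]
    n_docs = len(doc_sets)

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    topic_scores = []
    for words in topics:
        pair_scores = []
        for w1, w2 in combinations(words[:top_n], 2):
            p1 = doc_freq(w1) / n_docs
            p2 = doc_freq(w2) / n_docs
            p12 = doc_freq(w1, w2) / n_docs
            if p12 > 0:
                pair_scores.append(math.log(p12 / (p1 * p2)))  # PMI of the pair
        # Coherence(Topic_i): average pairwise PMI within this topic
        topic_scores.append(sum(pair_scores) / len(pair_scores) if pair_scores else 0.0)
    # Overall coherence C: mean over all N topics
    return sum(topic_scores) / len(topic_scores)

print("Average PMI coherence:", pmi_coherence(topics, tokenized_docs))
```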
## 🔍 Choosing the Number of Topics

To identify the optimal number of topics $k$, multiple values were tested. The best $k$ was chosen based on:

- **Maximizing the coherence score.** The `CoherenceModel` from `gensim.models` evaluates the quality of a topic model by measuring how coherent (semantically meaningful) its topics are. It works by comparing the words within each topic and checking their co-occurrence patterns or similarity. It supports several coherence measures, such as `c_v`, `c_uci`, `u_mass`, and `c_npmi` (NPMI-based), which differ in how they quantify word relationships (e.g., cosine similarity between word vectors or frequency of co-occurrence). The model takes the topic model (or its top words), the corpus or tokenized texts, and the dictionary as inputs and returns a coherence score, where higher values indicate better topic coherence.
- **Minimizing the reconstruction error.**
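A sketch of that selection loop (the candidate values of $k$ are illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

dictionary = Dictionary(tokenized_docs)

results = {}
for k in (5, 10, 15, 20):
    nmf_k = NMF(n_components=k, init="nndsvd", random_state=42).fit(V)
    topics_k = top_words(nmf_k.components_, feature_names)
    # Keep only words known to the gensim dictionary (vectorizer and
    # NLTK tokenizations can disagree, e.g. on lemmatized forms)
    topics_k = [[w for w in t if w in dictionary.token2id] for t in topics_k]
    cm = CoherenceModel(topics=topics_k, texts=tokenized_docs,
                        dictionary=dictionary, coherence="c_v")
    results[k] = (cm.get_coherence(), nmf_k.reconstruction_err_)

for k, (coherence, error) in results.items():
    print(f"k={k}: coherence={coherence:.3f}, reconstruction error={error:.3f}")
```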
## 🧪 Evaluation

The evaluation was conducted using:

- **Topic Coherence Score:** ensures that the extracted topics are interpretable and meaningful.
- **Reconstruction Error:** measures how well the factorized matrices $W$ and $H$ approximate the original matrix $V$; a lower reconstruction error indicates a better approximation.
## 🚀 Getting Started

1. Clone this repository:

   ```bash
   git clone https://github.com/Topic-Modeling-with-NMF
   cd Topic-Modeling-with-NMF
   ```

2. Install the required libraries:

   ```bash
   pip install pandas numpy nltk scikit-learn gensim seaborn matplotlib wordcloud
   ```