K-Means Clustering Customer Segmentation is a user-friendly, interactive web application built with Python, scikit-learn, and Streamlit. It enables businesses and data enthusiasts to segment customers based on annual income and spending score using the K-Means clustering algorithm. The app provides real-time predictions, clear visualizations, and supports easy retraining with new data or features. Ideal for marketing, retail, banking, and more, it helps identify high-value or at-risk customer groups, personalize offers, and optimize business strategies. The modular codebase and comprehensive documentation make it easy to customize, extend, and deploy in various environments.
- Overview
 - Business Use Cases
 - Quick Start
 - Features
 - Project Structure
 - Technical Architecture
 - Dataset
 - K-Means Clustering
 - How It Works: Step-by-Step
 - Behind the Scenes: Code Structure
 - Customization & Extensibility
 - Sample Input/Output
 - Installation & Requirements
 - Usage
 - Advanced Usage
 - Troubleshooting
 - Best Practices
 - Security & Privacy
 - Documentation
 - Contributing
 - FAQ
 - Support
 - Community & Social
 - Changelog
 - Roadmap
 - Glossary
 - References & Acknowledgements
 - License
 - Citation
 
K-Means Clustering Customer Segmentation is an end-to-end, interactive web application for segmenting customers using unsupervised machine learning. Built with Python, scikit-learn, and Streamlit, this project enables businesses and data enthusiasts to:
- Identify distinct customer groups based on spending patterns and income
 - Visualize clusters for actionable business insights
 - Experiment with new data and retrain models easily
 
Business Value:
- Target marketing campaigns to specific customer segments
 - Personalize offers and improve customer retention
 - Discover high-value or at-risk customer groups
 
- Retail: Segment shoppers to tailor promotions and loyalty programs.
 - Banking: Identify high-value clients for premium services.
 - E-commerce: Personalize recommendations and offers.
 - Hospitality: Group guests for targeted experiences.
 - Telecom: Detect churn-prone customers and upsell opportunities.
 - Education: Cluster students for personalized learning paths.
 
- Clone the repository: `git clone <repo-url>`
- Enter the project directory: `cd KMeans-Clustering-Customer-Segmentation`
- Install dependencies: `pip install -r requirements.txt`
- Launch the app: `streamlit run app/main.py`
- Open your browser: Visit http://localhost:8501
 
| Feature | Description | 
|---|---|
| Interactive Web UI | User-friendly Streamlit interface for input and results | 
| Real-time Prediction | Instantly predicts customer segment from input values | 
| Visualizations | Cluster plots, Elbow method, and more (add your screenshots!) | 
| Easy Retraining | Jupyter notebook for model retraining with new data/features | 
| Modular Codebase | Clean separation of UI, model, and logic for easy customization | 
| Deployment Ready | Simple to deploy on Streamlit Cloud, Heroku, or Docker | 
| Documentation | Extensive docs for dataset, clustering, and deployment | 
KMeans-Clustering-Customer-Segmentation/
├── app/
│   ├── main.py         # Streamlit app entry point
│   ├── model.py        # Model loading and prediction logic
│   └── ui.py           # Streamlit UI components
├── dataset/
│   └── mall_customers.csv  # Customer data
├── model/
│   ├── model_training.ipynb # Jupyter notebook for training
│   └── model.pkl            # Trained KMeans model
├── docs/
│   ├── dataset.md
│   ├── kmeans-clustering.md
│   └── streamlit.md
├── requirements.txt
└── README.md
flowchart TD
    A["User Input (Streamlit UI)"] --> B["Model Loader (app/model.py)"]
    B --> C["Trained KMeans Model (model/model.pkl)"]
    A --> D["UI Logic (app/ui.py)"]
    B --> E[Prediction Output]
    D --> E
    E --> F["Visualization (matplotlib/seaborn)"]
    F --> G[Display Results in Streamlit]
    subgraph Data Science
        C
        F
    end
- File: `dataset/mall_customers.csv`
- Source: Kaggle Mall Customers Dataset
- Columns:
  - CustomerID: Unique identifier
  - Gender: Male/Female
  - Age: Customer age
  - Annual Income (k$): Annual income in thousands of dollars
  - Spending Score (1-100): Score assigned by the mall based on customer behavior and spending

Note: The default model uses only Annual Income (k$) and Spending Score (1-100) for clustering.

K-Means is an unsupervised algorithm that partitions data into k clusters, grouping similar data points together. It is widely used for customer segmentation due to its simplicity and effectiveness.

How it works:
- Choose `k` cluster centers (centroids)
- Assign each data point to the nearest centroid
- Update centroids as the mean of assigned points
- Repeat until assignments stabilize

Why K-Means?
- Fast and scalable
- Intuitive results for business users
- Well-suited for numerical features

For more, see docs/kmeans-clustering.md.
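
To make the four steps concrete, here is a minimal pure-Python sketch of the K-Means loop. It is an illustration only: the app itself relies on scikit-learn's `KMeans`, and the function name `kmeans`, the helper `squared_distance`, and the sample points are made up for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points with the basic K-Means loop; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Choose k of the data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Made-up (annual income, spending score) pairs forming two obvious groups.
customers = [(15, 80), (16, 85), (17, 78), (85, 15), (88, 20), (90, 12)]
centroids, labels = kmeans(customers, k=2)
print(centroids)
print(labels)
```

The cluster indices themselves carry no meaning: which group ends up as `0` or `1` depends on the random initialization, so only the grouping should be interpreted.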
- Data Preparation:
  - Load and explore the dataset
  - Select relevant features (default: income & spending score)
- Model Training:
  - Use the Elbow Method to find optimal `k`
  - Train KMeans on selected features
  - Save the trained model as `model/model.pkl`
- Web Application:
  - User enters income and spending score
  - App loads the trained model and predicts the segment
  - Results and (optionally) cluster visualizations are displayed
 
- `app/main.py`: Streamlit entry point; initializes app, loads model, and handles routing
- `app/model.py`: Handles model loading and prediction logic
- `app/ui.py`: Contains Streamlit UI components for input and output
- `model/model_training.ipynb`: Jupyter notebook for data exploration, training, and saving the model

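
As a rough idea of what the loading and prediction logic in `app/model.py` might involve, here is a small sketch. The function names `load_model` and `predict_segment` are hypothetical, and it assumes the model file was written with `pickle`; the real module may be organized differently.

```python
import pickle
from pathlib import Path

# Location of the trained model as listed in the project structure.
MODEL_PATH = Path("model") / "model.pkl"


def load_model(path=MODEL_PATH):
    """Unpickle and return the trained clustering model."""
    with open(path, "rb") as f:
        return pickle.load(f)


def predict_segment(model, annual_income, spending_score):
    """Return the cluster index for a single customer as a plain int."""
    # scikit-learn estimators expect a 2-D input: one row per customer.
    features = [[annual_income, spending_score]]
    return int(model.predict(features)[0])
```

A caller would then do something like `predict_segment(load_model(), 60, 42)`. Because unpickling can execute arbitrary code, only load model files that you created yourself or otherwise trust.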
- Add More Features:
  - Edit `model/model_training.ipynb` to include more columns (e.g., Age, Gender)
  - Update the app UI in `app/ui.py` to accept new inputs
- Use Your Own Data:
  - Replace `dataset/mall_customers.csv` with your dataset (same or similar format)
  - Retrain the model using the notebook
- Change Number of Clusters:
  - Adjust `k` in the notebook and retrain
- Deploy Anywhere:
  - See `docs/streamlit.md` for deployment guides (Streamlit Cloud, Docker, etc.)
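
When adding a feature such as Age, keep in mind that K-Means is distance-based, so columns on very different scales can dominate the result. One common approach, shown here as a sketch rather than as the project's actual code, is to standardize the features inside a scikit-learn pipeline. The rows below are made up, and the choice of three clusters is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up rows: Age, Annual Income (k$), Spending Score (1-100).
X = np.array([
    [19, 15, 39], [21, 15, 81], [23, 16, 77], [31, 17, 40],
    [35, 60, 50], [40, 62, 48], [45, 58, 52], [50, 61, 49],
    [29, 87, 90], [32, 88, 85], [58, 86, 15], [60, 90, 12],
])

# Scale each column to zero mean and unit variance, then cluster.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X)
print(labels)

# New customers go through the same scaling before prediction.
new_customer = np.array([[30, 60, 42]])
print(pipeline.predict(new_customer))
```

If you save the whole pipeline instead of the bare `KMeans` object, the app applies the same scaling at prediction time without extra code. A categorical column such as Gender would first need to be encoded as numbers.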
 
Sample Input:
- Annual Income (k$): 60
- Spending Score (1-100): 42

Sample Output:
Predicted Segment: 3
This customer belongs to the "Average Income, Average Spending" group.
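
K-Means only returns a cluster index, so a readable name like the one in the sample output has to come from somewhere else. One possible approach, sketched below, is to derive the name from the cluster's centroid. The function names and the thresholds of 40 and 70 are hypothetical and not taken from the project.

```python
def describe_segment(centroid_income, centroid_score, low=40, high=70):
    """Build a readable name from a cluster centroid (thresholds are assumptions)."""

    def level(value):
        if value < low:
            return "Low"
        if value > high:
            return "High"
        return "Average"

    return f"{level(centroid_income)} Income, {level(centroid_score)} Spending"


def format_result(segment, centroid_income, centroid_score):
    """Format the two-line message in the style of the sample output."""
    name = describe_segment(centroid_income, centroid_score)
    return (
        f"Predicted Segment: {segment}\n"
        f'This customer belongs to the "{name}" group.'
    )


# Hypothetical centroid for cluster 3, located mid-range on both axes.
print(format_result(3, 55.0, 50.0))
```

Deriving names from centroids keeps the labels correct after retraining, when the same group of customers may receive a different cluster index.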
- Python: 3.7 or higher
- Install all dependencies: `pip install -r requirements.txt`
- `requirements.txt` includes:
  - pandas
  - numpy
  - scikit-learn
  - matplotlib
  - seaborn
  - streamlit
  - jupyter, ipykernel (optional, for notebook)
 
- Run the Streamlit app: `streamlit run app/main.py`
- Open your browser: Go to http://localhost:8501
- Interact:
  - Enter "Annual Income (k$)" and "Spending Score (1-100)"
  - Click "Predict" to see the customer segment
 
- Retrain the Model:
  - Open `model/model_training.ipynb` in Jupyter
  - Modify code or data as needed
  - Run all cells to retrain and save a new model
  - Restart the app to use the updated model
- Deploy Online:
  - See `docs/streamlit.md` for deployment instructions
 
| Problem | Solution | 
|---|---|
| `ModuleNotFoundError` | Run `pip install -r requirements.txt` |
| Streamlit not launching | Check Python version and Streamlit installation | 
| Model file not found | Retrain model using the notebook | 
| Port 8501 already in use | Use streamlit run app/main.py --server.port <other_port> | 
| UI not updating after retrain | Restart Streamlit app | 
- Always explore your data before training
- Use the Elbow Method to select the best `k`
- Document any changes to the dataset or features
 - Test the app after retraining the model
 - Use virtual environments for dependency management
 - Add screenshots to the README for better engagement
 
- No personal data is stored by the app; all predictions are in-memory
 - If using real customer data, ensure compliance with GDPR or local privacy laws
 - Do not upload sensitive data to public repositories
 
- `docs/dataset.md`: Dataset details and schema
- `docs/kmeans-clustering.md`: K-Means theory and implementation
- `docs/streamlit.md`: Streamlit and deployment guides

Contributions are welcome! To contribute:
- Fork the repository
- Create a new branch (`git checkout -b feature/your-feature`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to your branch (`git push origin feature/your-feature`)
- Open a Pull Request
 
Best Practices:
- Write clear, concise commit messages
 - Add docstrings and comments
 - Test your code before submitting
 
Q: Can I use a different dataset?
A: Yes! Replace dataset/mall_customers.csv and retrain the model.
Q: How do I add more features?
A: Update feature selection in the notebook and app UI.
Q: The app doesn't start or throws an error. What should I do?
A: Ensure all dependencies are installed and Python version is compatible. Check error messages for details.
Q: How do I deploy this app online?
A: See docs/streamlit.md for deployment instructions.
- Open an issue for bugs or feature requests
 - Email: ptnhanit230104@gmail.com
 
- Discussions (ask questions, share ideas)
 - Contributors
 - Suggest a Slack/Discord channel for real-time help!
 
- v1.0: Initial release with Streamlit app, model training notebook, and documentation
 - v1.1: Improved modularity, added advanced usage and deployment docs
 - v1.2: Enhanced README, added FAQ and troubleshooting
 
- Add more clustering algorithms (DBSCAN, Hierarchical)
 - Add user authentication for private deployments
 - Enable export of cluster assignments
 - Add more visualizations (3D plots, interactive charts)
 - Docker Compose for multi-service deployment
 - Add REST API for programmatic access
 - Internationalization (i18n) support
 
- K-Means: Unsupervised clustering algorithm
 - Cluster: Group of similar data points
 - Centroid: Center of a cluster
 - Elbow Method: Technique to find optimal number of clusters
 - Streamlit: Python library for building web apps
 - scikit-learn: Python ML library
 
This project is licensed under the MIT License. See the LICENSE file for details.
If you use this project in your research, please cite as:
@misc{KMeansClusteringCustomerSegmentation,
  author = {Nhan Pham Thanh},
  title = {K-Means Clustering Customer Segmentation},
  year = {2024},
  howpublished = {\url{https://github.com/NhanPhamThanh-IT/KMeans-Clustering-Customer-Segmentation}}
}
For more information, see the documentation in the docs/ folder.
Add your screenshots to the docs/ folder and reference them above for a more visual README!