Welcome to the Data Science Library Hub, a curated collection of the most pivotal and innovative tools in the Python Data Science ecosystem. My aim is for this repository to serve as a comprehensive resource for data scientists, analysts, and enthusiasts.
This repository is a roadmap to the vast landscape of Python libraries that drive analysis, insights, and machine learning. From data manipulation with Pandas and NumPy, to building sophisticated models with Scikit-learn, to creating visualizations with Matplotlib and Seaborn, these tools form the core of day-to-day Data Science work.
Here, you will also discover libraries that are pushing the boundaries, whether through elegant solutions for complex problems or by introducing new paradigms altogether. Libraries like Altair for declarative visualizations, PyCaret and MLflow for automating and managing machine learning workflows, and Pingouin for advanced statistical analysis show the exciting direction of our field.
Each library is listed with its description, GitHub link, and tags for easy navigation and reference.
In the realm of Data Science libraries, each tool often possesses a unique set of capabilities, potentially spanning multiple functional areas. To simplify categorization, we have grouped these libraries into five distinct tags, acknowledging that some libraries could fit into more than one category.
- Data Manipulation: This category encompasses libraries that are integral to processing and transforming data. They are the backbone of data analysis, offering functionalities for sorting, filtering, and summarizing data.
- Data Visualization: Libraries under this tag focus on the graphical representation of data. They enable users to create a wide array of charts, graphs, and other visual tools to make data more understandable and engaging.
- Machine Learning: This category includes libraries specifically designed for developing machine learning models. They provide tools for training, testing, and deploying algorithms that can learn from and make predictions on data.
- Advanced Tools: Here, you will find libraries that offer specialized functionalities, often for specific, more complex tasks in data science, such as high-performance computing, advanced statistical models, or large-scale data processing.
- Development and Debugging: Libraries in this group are geared towards facilitating the development process itself, including code writing, testing, and debugging. They enhance the efficiency and quality of the development workflow.
In the realm of Data Science, certain tools form the backbone of data analysis and modeling. This section, Key Data Science Tools, focuses on the essential libraries that are foundational for any Data Science practitioner. Ranging from data manipulation to basic visualization and statistical analysis, these tools are the building blocks for developing robust Data Science solutions. They include well-known libraries like Pandas for data processing, Scikit-learn for machine learning, and Matplotlib for plotting and visualization, among others. Whether you are just starting out or are a seasoned data scientist, these are the tools you will turn to time and again.
| Library Name | Description | GitHub Link | Tags |
|---|---|---|---|
| matplotlib | 2D plotting library for Python | Matplotlib | Data Visualization |
| seaborn | Statistical data visualization | Seaborn | Data Visualization |
| altair | Declarative statistical visualization library for Python | Altair | Data Visualization |
| pandas | Data manipulation and analysis | Pandas | Data Manipulation |
| numpy | Numerical computing with Python | NumPy | Data Manipulation |
| scipy | Scientific computing and technical computing | SciPy | Data Manipulation |
| math | Mathematical functions defined by the C standard | - | Data Manipulation |
| itertools | Functions creating iterators for efficient looping | - | Data Manipulation |
| Scikit-learn | Machine learning in Python | Scikit-learn | Machine Learning |
| yaml | YAML parser and emitter for Python | YAML | Advanced Tools |
| joblib | Lightweight pipelining: using Python functions as pipeline jobs | joblib | Advanced Tools |
| pickle | Serialize Python object structures | - | Advanced Tools |
| bs4 (BeautifulSoup) | Pulling data out of HTML and XML files | BS4 | Advanced Tools |
| requests | HTTP library for Python | Requests | Advanced Tools |
| urllib | URL handling modules for Python | - | Advanced Tools |
| xlsxwriter | Python module for creating Excel XLSX files | xlsxwriter | Advanced Tools |
| time | Time access and conversions | - | Development and Debugging |
| sqlite3 | Database engine included with Python | - | Development and Debugging |
| IPython | Powerful interactive shell for Python | IPython | Development and Debugging |
| datetime | Basic date and time types | - | Development and Debugging |
| warnings | Non-fatal alerts used to issue cautionary advice | - | Development and Debugging |
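To illustrate how several of these foundational libraries fit together, here is a minimal sketch that builds a synthetic dataset with NumPy and pandas, fits a model with Scikit-learn, and plots the result with Matplotlib. The column names and data are purely illustrative, not tied to any particular project.

```python
# Minimal sketch: pandas + NumPy for data handling, scikit-learn for modeling,
# Matplotlib for visualization. The dataset is synthetic and illustrative only.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Build a small synthetic dataset instead of reading a real file.
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 3 * df["x"] + rng.normal(0, 2, 200)

# Split the data, fit a linear model, and predict on the held-out set.
X_train, X_test, y_train, y_test = train_test_split(
    df[["x"]], df["y"], random_state=0
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Plot actual vs. predicted values.
plt.scatter(X_test["x"], y_test, alpha=0.6, label="actual")
plt.scatter(X_test["x"], y_pred, marker="x", color="red", label="predicted")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```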
Beyond the basics, the Data Science landscape is enriched by a variety of innovative tools that push the boundaries of data analysis and model development. In Innovative Data Science Tools, we explore libraries that bring unique functionalities, offer enhanced performance, or simplify complex processes in groundbreaking ways. These tools might not be as widely known as the foundational ones, but they are invaluable for specialized tasks, advanced analysis, and for keeping your methodologies at the cutting edge. From libraries that handle large datasets with ease, to those that provide novel approaches to machine learning and data visualization, this section is dedicated to the tools that are reshaping the future of Data Science.
| Library Name | Description | GitHub Link | Tags |
|---|---|---|---|
| YellowBrick | A suite of visualization and diagnostic tools for faster model selection. | YellowBrick | Data Visualization |
| Missingno | Visualize missing values in your dataset with ease. | Missingno | Data Visualization |
| stickyland | Break the linear presentation of Jupyter Notebooks with sticky cells! | stickyland | Data Visualization |
| PyGWalker | Turn your pandas dataframe into an interactive UI for visual analysis | PyGWalker | Data Visualization |
| lux | Automatically visualize your pandas dataframe via a single print! | lux | Data Visualization |
| SweetViz | In-depth EDA report in two lines of code. | SweetViz | Data Visualization |
| Pivot Table JS | Drag-n-drop tools to group, pivot, plot dataframe. | Pivot Table JS | Data Visualization |
| DABEST | Data Analysis using Bootstrapped ESTimation | DABEST | Data Visualization |
| tableone | Create "Table 1" summaries for research papers | TableOne | Data Visualization |
| statannot | Add statistical annotations on an existing boxplot/barplot | StatAnnot | Data Visualization |
| imbalanced-learn | A variety of methods to handle class imbalance. | Imbalanced Learn | Data Manipulation |
| Modin | Scale Pandas workflows by changing a single line of code (the import). | Modin | Data Manipulation |
| Parallel-Pandas | Parallelize Pandas across all CPU cores for faster computation. | Parallel Pandas | Data Manipulation |
| Vaex | High performance package for lazy Out-of-Core DataFrames. | Vaex | Data Manipulation |
| statsmodels | Statistical testing and data exploration at fingertips. | StatsModels | Data Manipulation |
| Pandas-Profiling | Generate a high-level EDA report of your data in no time. | Pandas Profiling | Data Manipulation |
| Category-encoders | Over 15 categorical data encoders. | Category Encoders | Data Manipulation |
| DuckDB | Run SQL queries on DataFrame. | DuckDB | Data Manipulation |
| Numexpr | Fast, multi-threaded evaluation of NumPy expressions across CPU cores. | NumExpr | Data Manipulation |
| CSV-Kit | Explore, query, and describe CSV files from the terminal. | CSV-Kit | Data Manipulation |
| pingouin | Statistics in Python | Pingouin | Data Manipulation |
| Sidetable | Supercharge Pandas' value_counts() method. | Side Table | Data Manipulation |
| PyCaret | Automate ML workflows with this low-code library. | PyCaret | Machine Learning |
| mlflow | Open source platform for the machine learning lifecycle | MLflow | Machine Learning |
| SHAP | Explain the output of any ML model in a few lines of code. | SHAP | Machine Learning |
| Featuretools | Automated feature engineering for ML models. | Feature Tools | Machine Learning |
| Lazy Predict | Build many basic models with little code and see which ones perform better, without any parameter tuning. | Lazy Predict | Machine Learning |
| openai | OpenAI API client library | OpenAI | Machine Learning |
| Skorch | Leverage the power of PyTorch with the elegance of sklearn. | Skorch | Machine Learning |
| mlxtend | A collection of utility functions for processing, evaluating, and visualizing models. | mlxtend | Machine Learning |
| Pandas ML | Pandas data wrangling + Sklearn algorithms + Matplotlib visualization. | PandasML | Machine Learning |
| Faiss | Efficient algorithms for similarity search and clustering dense vectors. | Faiss | Machine Learning |
| Pytest | An elegant testing framework to test your code. | PyTest | Development and Debugging |
| Streamlit | Create and host data-based Python web apps in a few lines of code. | Streamlit | Development and Debugging |
| Faker | Generate fake yet meaningful data in seconds. | Faker | Development and Debugging |
| Icecream | Don't debug with print(). Use icecream instead. | IceCream | Development and Debugging |
| Pyforest | Automatic package import for commonly used Python libraries. | PyForest | Development and Debugging |
| PySnooper | Debug your code by logging executed lines and variable changes. | PySnooper | Development and Debugging |
| watermark | IPython magic to display timestamps, version numbers, and hardware information | watermark | Development and Debugging |
| ipywidgets | Interactive HTML widgets for Jupyter notebooks | ipywidgets | Development and Debugging |
| myst_nb | Jupyter Notebooks in Sphinx Documentation | MyST-NB | Development and Debugging |
This table summarizes each library with its description, GitHub link, and relevant tags, providing a clear and concise overview of these Innovative Data Science Tools.
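As one example of how concise these tools can be, the sketch below trains a scikit-learn model and explains its predictions with SHAP. It assumes the shap package is installed and uses a bundled scikit-learn dataset purely for illustration.

```python
# Minimal sketch: explain a tree-based model with SHAP.
# Assumes shap is installed; the diabetes dataset is illustrative only.
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Train a simple regression model on a bundled dataset.
data = load_diabetes(as_frame=True)
X, y = data.data, data.target
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Compute SHAP values and summarize per-feature impact on predictions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
```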
Your contributions are what make the Data Science Library Hub a dynamic and community-driven resource. If you have suggestions for adding new libraries or improving the existing content, your insights are incredibly valuable to me and the broader data science community.
- 🆕 Suggesting New Libraries: If you know of a library that you believe should be included in this collection, please let me know! I am always on the lookout for tools that can benefit data scientists, whether they're well-established or emerging gems. When suggesting a new library, it would be helpful if you could provide a brief description, its primary functionalities, and why you think it's a valuable addition to this list.
- 📝 Improving Existing Content: If you have additional information, updates, or corrections for any of the libraries listed, feel free to share your knowledge.
- 📚 Sharing Examples and Tutorials: Practical examples and tutorials are always beneficial. If you have used any of these libraries in your projects and want to share your experience or code snippets, I would be delighted to include them.
- 📫 Open an Issue: Start by opening an issue in this GitHub repository. Describe the contribution you want to make, whether it's adding a new library, improving an existing one, or providing additional resources.
- 🍴 Fork and Edit: Fork this repository, make your changes, and then submit a pull request with your contributions.
- 🔍 Review: I will review your submission, and if everything is in order, your contributions will be merged into the project.
- 🏆 Credit: All contributors will be duly credited for their work. We believe in recognizing the efforts of the community members.
We welcome contributions from everyone, regardless of your level of experience. Every bit of information helps, and collective knowledge makes this resource better for everyone.
For more information, suggestions, or questions, you can contact me via GitHub, X, or LinkedIn.