This repository is a curated list of essential Data Science libraries, showcasing the core tools like Pandas and NumPy for data manipulation, as well as innovative libraries such as Seaborn and PyCaret for advanced data visualization and automated machine learning workflows.

imarranz/data-science-library-hub

Data Science Library Hub

Welcome to the Data Science Library Hub, a curated collection of the most pivotal and innovative tools in the Python Data Science ecosystem. My aim is to serve as a comprehensive resource for data scientists, analysts, and enthusiasts.

This repository is a roadmap to the vast landscape of Python libraries that drive analysis, insights, and machine learning. From data manipulation with Pandas and NumPy, to modeling with Scikit-learn, to visualization with Matplotlib and Seaborn, these tools form the core of day-to-day Data Science work.

Here, you will also discover libraries that are pushing the boundaries, whether through elegant solutions for complex problems or by introducing new paradigms altogether. Libraries like Altair for declarative visualizations, PyCaret and MLflow for automating and managing machine learning workflows, and Pingouin for advanced statistical analysis show the exciting direction of our field.

Each library is listed with its description, GitHub link, and tags for easy navigation and reference.

List of Libraries

In the realm of Data Science libraries, each tool often possesses a unique set of capabilities, potentially spanning multiple functional areas. To simplify categorization, we have grouped these libraries into five distinct tags, acknowledging that some libraries could fit into more than one category.

  • Data Manipulation: This category encompasses libraries that are integral to processing and transforming data. They are the backbone of data analysis, offering functionalities for sorting, filtering, and summarizing data.

  • Data Visualization: Libraries under this tag focus on the graphical representation of data. They enable users to create a wide array of charts, graphs, and other visual tools to make data more understandable and engaging.

  • Machine Learning: This category includes libraries specifically designed for developing machine learning models. They provide tools for training, testing, and deploying algorithms that can learn from and make predictions on data.

  • Advanced Tools: Here, you will find libraries that offer specialized functionalities, often for specific, more complex tasks in data science, such as high-performance computing, advanced statistical models, or large-scale data processing.

  • Development and Debugging: Libraries in this group are geared towards facilitating the development process itself, including code writing, testing, and debugging. They enhance the efficiency and quality of the development workflow.
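The tagging scheme above can be pictured as a small data structure. The following is only a sketch: the `LibraryEntry` class, the `by_tag` helper, and the second tag on the `statsmodels` entry are hypothetical and are not part of this repository. They simply illustrate how an entry could carry one or more of the five tags and be looked up by tag.

```python
from dataclasses import dataclass

# The five tags described in the list above.
TAGS = (
    "Data Manipulation",
    "Data Visualization",
    "Machine Learning",
    "Advanced Tools",
    "Development and Debugging",
)


@dataclass(frozen=True)
class LibraryEntry:
    """One catalog row (hypothetical structure, for illustration only)."""

    name: str
    description: str
    tags: tuple  # a library may carry more than one tag

    def __post_init__(self):
        # Reject any tag that is not one of the five known categories.
        unknown = set(self.tags) - set(TAGS)
        if unknown:
            raise ValueError(f"unknown tags: {sorted(unknown)}")


def by_tag(entries, tag):
    """Return the names of all entries carrying the given tag, in input order."""
    return [entry.name for entry in entries if tag in entry.tags]


catalog = [
    LibraryEntry("pandas", "Data manipulation and analysis", ("Data Manipulation",)),
    LibraryEntry("seaborn", "Statistical data visualization", ("Data Visualization",)),
    # Assumption: the extra "Advanced Tools" tag is illustrative only; it shows
    # a library that fits into more than one category.
    LibraryEntry(
        "statsmodels",
        "Statistical testing and data exploration",
        ("Data Manipulation", "Advanced Tools"),
    ),
]

print(by_tag(catalog, "Data Manipulation"))
```

Storing tags as a tuple rather than a single string keeps the "more than one category" case simple: a lookup only has to check membership.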

Key Data Science Tools

Certain tools form the backbone of data analysis and modeling. This section focuses on the essential libraries that are foundational for any Data Science practitioner. Ranging from data manipulation to basic visualization and statistical analysis, these tools are the building blocks for developing robust Data Science solutions. They include well-known libraries like Pandas for data processing, Scikit-learn for machine learning, and Matplotlib for plotting and visualization, among others. Whether you are just starting out or are a seasoned data scientist, these are the tools you will turn to time and again.

| Library Name | Description | GitHub Link | Tags |
| --- | --- | --- | --- |
| matplotlib | 2D plotting library for Python | Matplotlib | Data Visualization |
| seaborn | Statistical data visualization | Seaborn | Data Visualization |
| altair | Declarative statistical visualization library for Python | Altair | Data Visualization |
| pandas | Data manipulation and analysis | Pandas | Data Manipulation |
| numpy | Numerical computing with Python | NumPy | Data Manipulation |
| scipy | Scientific and technical computing | SciPy | Data Manipulation |
| math | Mathematical functions defined by the C standard | - | Data Manipulation |
| itertools | Functions creating iterators for efficient looping | - | Data Manipulation |
| Scikit-learn | Machine learning in Python | Scikit-learn | Machine Learning |
| yaml | YAML parser and emitter for Python | YAML | Advanced Tools |
| joblib | Lightweight pipelining: using Python functions as pipeline jobs | joblib | Advanced Tools |
| pickle | Serialize Python object structures | - | Advanced Tools |
| bs4 (BeautifulSoup) | Pulling data out of HTML and XML files | BS4 | Advanced Tools |
| requests | HTTP library for Python | requests | Advanced Tools |
| urllib | URL handling modules for Python | - | Advanced Tools |
| xlsxwriter | Python module for creating Excel XLSX files | xlsxwriter | Advanced Tools |
| time | Time access and conversions | - | Development and Debugging |
| sqlite3 | Database engine included with Python | - | Development and Debugging |
| IPython | Powerful interactive shell for Python | IPython | Development and Debugging |
| datetime | Basic date and time types | - | Development and Debugging |
| warnings | Non-fatal alerts used to issue cautionary advice | - | Development and Debugging |
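As a rough illustration of how a few of the standard-library entries in this list (`sqlite3`, `itertools`, `math`, and `datetime`) can work together, here is a minimal sketch. The `sales` table, its columns, and the sample rows are made up for this example and do not come from the repository.

```python
import itertools
import math
import sqlite3
from datetime import date

# Made-up sample rows: (day, category, amount).
rows = [
    (date(2024, 1, 1).isoformat(), "a", 0.1),
    (date(2024, 1, 1).isoformat(), "b", 0.2),
    (date(2024, 1, 2).isoformat(), "a", 0.3),
    (date(2024, 1, 2).isoformat(), "a", 0.4),
]

# sqlite3: an in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, category TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# itertools.groupby only groups consecutive items, so the query sorts by
# category first. math.fsum avoids accumulating floating-point error.
cursor = conn.execute("SELECT category, amount FROM sales ORDER BY category")
totals = {
    category: math.fsum(amount for _, amount in group)
    for category, group in itertools.groupby(cursor, key=lambda row: row[0])
}
conn.close()

print(totals)
```

In practice the same aggregation could be written as a SQL `GROUP BY` or with pandas; the point here is only to show the standard-library pieces side by side.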

Innovative Data Science Tools

Beyond the basics, the Data Science landscape is enriched by a variety of innovative tools that push the boundaries of data analysis and model development. In Innovative Data Science Tools, we explore libraries that bring unique functionalities, offer enhanced performance, or simplify complex processes in groundbreaking ways. These tools might not be as widely known as the foundational ones, but they are invaluable for specialized tasks, advanced analysis, and for keeping your methodologies at the cutting edge. From libraries that handle large datasets with ease, to those that provide novel approaches to machine learning and data visualization, this section is dedicated to the tools that are reshaping the future of Data Science.

| Library Name | Description | GitHub Link | Tags |
| --- | --- | --- | --- |
| YellowBrick | A suite of visualization and diagnostic tools for faster model selection. | YellowBrick | Data Visualization |
| Missingno | Visualize missing values in your dataset with ease. | Missingno | Data Visualization |
| stickyland | Break the linear presentation of Jupyter Notebooks with sticky cells! | stickyland | Data Visualization |
| PyGWalker | Turn your pandas dataframe into an interactive UI for visual analysis. | PyGWalker | Data Visualization |
| lux | Automatically visualize your pandas dataframe via a single print! | lux | Data Visualization |
| SweetViz | In-depth EDA report in two lines of code. | SweetViz | Data Visualization |
| Pivot Table JS | Drag-and-drop tools to group, pivot, and plot a dataframe. | Pivot Table JS | Data Visualization |
| DABEST | Data Analysis using Bootstrapped ESTimation. | DABEST | Data Visualization |
| tableone | Create "Table 1" summaries for research papers. | TableOne | Data Visualization |
| statannot | Add statistical annotations on an existing boxplot/barplot. | StatAnnot | Data Visualization |
| imbalanced-learn | A variety of methods to handle class imbalance. | Imbalanced Learn | Data Manipulation |
| Modin | Boost Pandas' performance up to 70x by modifying the import. | Modin | Data Manipulation |
| Parallel-Pandas | Parallelize Pandas across all CPU cores for faster computation. | Parallel Pandas | Data Manipulation |
| Vaex | High-performance package for lazy out-of-core DataFrames. | Vaex | Data Manipulation |
| statsmodels | Statistical testing and data exploration at your fingertips. | StatsModels | Data Manipulation |
| Pandas-Profiling | Generate a high-level EDA report of your data in no time. | Pandas Profiling | Data Manipulation |
| Category-encoders | Over 15 categorical data encoders. | Category Encoders | Data Manipulation |
| DuckDB | Run SQL queries on DataFrames. | DuckDB | Data Manipulation |
| Numexpr | Parallelize NumPy to all CPU cores for 20x speedup. | NumExpr | Data Manipulation |
| CSV-Kit | Explore, query, and describe CSV files from the terminal. | CSV-Kit | Data Manipulation |
| pingouin | Statistics in Python. | Pingouin | Data Manipulation |
| Sidetable | Supercharge Pandas' value_counts() method. | Side Table | Data Manipulation |
| PyCaret | Automate ML workflows with this low-code library. | PyCaret | Machine Learning |
| mlflow | Open source platform for the machine learning lifecycle. | MLflow | Machine Learning |
| SHAP | Explain the output of any ML model in a few lines of code. | SHAP | Machine Learning |
| Featuretools | Automated feature engineering for ML models. | Feature Tools | Machine Learning |
| Lazy Predict | Build many basic models with little code and see which ones work better without any parameter tuning. | Lazy Predict | Machine Learning |
| openai | OpenAI API client library. | OpenAI | Machine Learning |
| Skorch | Leverage the power of PyTorch with the elegance of sklearn. | Skorch | Machine Learning |
| mlxtend | A collection of utility functions for processing, evaluating, and visualizing models. | mlxtend | Machine Learning |
| Pandas ML | Pandas data wrangling + Sklearn algorithms + Matplotlib visualization. | PandasML | Machine Learning |
| Faiss | Efficient algorithms for similarity search and clustering of dense vectors. | Faiss | Machine Learning |
| Pytest | An elegant testing framework to test your code. | PyTest | Development and Debugging |
| Streamlit | Create and host data-based Python web apps in a few lines of code. | Streamlit | Development and Debugging |
| Faker | Generate fake yet meaningful data in seconds. | Faker | Development and Debugging |
| Icecream | Don't debug with print(). Use icecream instead. | IceCream | Development and Debugging |
| Pyforest | Automatic package import for commonly used Python libraries. | PyForest | Development and Debugging |
| PySnooper | Trace your code to track new variables and their updates. | PySnooper | Development and Debugging |
| watermark | IPython magic to display timestamps, version numbers, and hardware information. | WaterMark | Development and Debugging |
| ipywidgets | Interactive HTML widgets for Jupyter notebooks. | iPyWidgets | Development and Debugging |
| myst_nb | Jupyter Notebooks in Sphinx documentation. | MyST-NB | Development and Debugging |

This table summarizes each library with its description, GitHub link, and relevant tags, providing a clear and concise overview of these Innovative Data Science Tools.

Contributing

Your contributions are what make the Data Science Library Hub a dynamic and community-driven resource. If you have suggestions for adding new libraries or improving the existing content, your insights are incredibly valuable to me and the broader data science community.

How to Contribute

  • 🆕 Suggesting New Libraries: If you know of a library that you believe should be included in this collection, please let me know! I am always on the lookout for tools that can benefit data scientists, whether they're well-established or emerging gems. When suggesting a new library, it would be helpful if you could provide a brief description, its primary functionalities, and why you think it's a valuable addition to this list.

  • 📝 Improving Existing Content: If you have additional information, updates, or corrections for any of the libraries listed, feel free to share your knowledge.

  • 📚 Sharing Examples and Tutorials: Practical examples and tutorials are always beneficial. If you have used any of these libraries in your projects and want to share your experience or code snippets, I would be delighted to include them.

Contribution Process

  1. 📫 Open an Issue: Start by opening an issue in this GitHub repository. Describe the contribution you want to make, whether it's adding a new library, improving an existing one, or providing additional resources.

  2. 🍴 Fork and Edit: Fork this repository, make your changes, and then submit a pull request with your contributions.

  3. 🔍 Review: I will review your submission, and if everything is in order, your contributions will be merged into the project.

  4. 🏆 Credit: All contributors will be duly credited for their work. We believe in recognizing the efforts of the community members.

We welcome contributions from everyone, regardless of your level of experience. Every bit of information helps, and collective knowledge makes this resource better for everyone.

Contact

For more information, suggestions, or questions, you can contact me via GitHub, X or LinkedIn.
