
Commit 6e00bbb

Merge pull request #353 from VinciGit00/pre/beta
Pre/Beta update
2 parents 00a392b + dd2b3a8 commit 6e00bbb

249 files changed: +12193 -2115 lines changed


.gitignore

Lines changed: 2 additions & 0 deletions
@@ -21,13 +21,15 @@ docs/source/_templates/
 docs/source/_static/
 .env
 venv/
+.venv/
 .vscode/
 
 # exclude pdf, mp3
 *.pdf
 *.mp3
 *.sqlite
 *.google-cookie
+*.python-version
 examples/graph_examples/ScrapeGraphAI_generated_graph
 examples/**/result.csv
 examples/**/result.json
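
The two added patterns can be checked with a short sketch. It assumes a simplified matching model built on Python's `fnmatch`, which only approximates real gitignore semantics; the helper `is_ignored` and the sample paths are hypothetical. It illustrates that `*.python-version` also covers a bare `.python-version` file, the name used for version pin files by tools such as rye and pyenv.

```python
from fnmatch import fnmatchcase

# Patterns added to .gitignore in this commit.
ADDED_PATTERNS = [".venv/", "*.python-version"]


def is_ignored(path: str, patterns=ADDED_PATTERNS) -> bool:
    """Rough approximation of gitignore matching for the two added patterns."""
    parts = path.split("/")
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: ignore anything below a directory of that name.
            if pattern.rstrip("/") in parts[:-1]:
                return True
        elif fnmatchcase(parts[-1], pattern):
            # File pattern: compare against the final path component only.
            return True
    return False


for candidate in [".venv/bin/python", ".python-version", "venv.txt", "src/main.py"]:
    print(candidate, is_ignored(candidate))
```

For authoritative results on a real checkout, `git check-ignore -v <path>` reports which rule matches.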

CHANGELOG.md

Lines changed: 240 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 15 additions & 12 deletions
@@ -1,5 +1,7 @@
 
 # 🕷️ ScrapeGraphAI: You Only Scrape Once
+[English](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/README.md) | [中文](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/chinese.md)
+
 [![Downloads](https://static.pepy.tech/badge/scrapegraphai)](https://pepy.tech/project/scrapegraphai)
 [![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen)](https://github.com/pylint-dev/pylint)
 [![Pylint](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml/badge.svg)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml)
@@ -17,7 +19,7 @@ Just say which information you want to extract and the library will do it for yo
 
 ## 🚀 Quick install
 
-The reference page for Scrapegraph-ai is available on the official page of pypy: [pypi](https://pypi.org/project/scrapegraphai/).
+The reference page for Scrapegraph-ai is available on the official page of PyPI: [pypi](https://pypi.org/project/scrapegraphai/).
 
 ```bash
 pip install scrapegraphai
@@ -28,7 +30,7 @@ pip install scrapegraphai
 ## 🔍 Demo
 Official streamlit demo:
 
-[![My Skills](https://skillicons.dev/icons?i=react)](https://scrapegraph-ai-demo.streamlit.app/)
+[![My Skills](https://skillicons.dev/icons?i=react)](https://scrapegraph-ai-web-dashboard.streamlit.app)
 
 Try it directly on the web using Google Colab:
 
@@ -162,13 +164,23 @@ print(result)
 
 The output will be an audio file with the summary of the projects on the page.
 
+## Sponsors
+<div style="text-align: center;">
+<a href="https://serpapi.com?utm_source=scrapegraphai">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;">
+</a>
+<a href="https://dashboard.statproxies.com/?refferal=scrapegraph">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 15%;">
+</a>
+</div>
+
 ## 🤝 Contributing
 
 Feel free to contribute and join our Discord server to discuss with us improvements and give us suggestions!
 
 Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md).
 
-[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/gkxQDAjfeX)
+[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/uJN7TYcpNa)
 [![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
 [![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)
 
@@ -179,15 +191,6 @@ Wanna visualize the roadmap in a more interactive way? Check out the [markmap](h
 
 ## ❤️ Contributors
 [![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)
-## Sponsors
-<div style="text-align: center;">
-<a href="https://serpapi.com?utm_source=scrapegraphai">
-<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;">
-</a>
-<a href="https://dashboard.statproxies.com/?refferal=scrapegraph">
-<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 10%;">
-</a>
-</div>
 
 ## 🎓 Citations
 If you have used our library for research purposes please quote us with the following reference:

docs/chinese.md

Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
+# 🕷️ ScrapeGraphAI: You Only Scrape Once
+[![Downloads](https://static.pepy.tech/badge/scrapegraphai)](https://pepy.tech/project/scrapegraphai)
+[![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen)](https://github.com/pylint-dev/pylint)
+[![Pylint](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml/badge.svg)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml)
+[![CodeQL](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/codeql.yml/badge.svg)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/codeql.yml)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)
+
+ScrapeGraphAI is a *web scraping* Python library that uses large language models and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).
+
+Just tell the library which information you want to extract, and it will do it for you!
+
+<p align="center">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/scrapegraphai_logo.png" alt="Scrapegraph-ai Logo" style="width: 50%;">
+</p>
+
+## 🚀 Quick install
+
+The reference page for Scrapegraph-ai is available on the official PyPI site: [pypi](https://pypi.org/project/scrapegraphai/)
+
+```bash
+pip install scrapegraphai
+```
+**Note**: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries 🐱
+
+## 🔍 Demo
+
+Official Streamlit demo:
+
+[![My Skills](https://skillicons.dev/icons?i=react)](https://scrapegraph-ai-web-dashboard.streamlit.app)
+
+Try it directly on Google Colab:
+
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing)
+
+## 📖 Documentation
+
+The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.readthedocs.io/en/latest/).
+
+You can also check out the Docusaurus [version](https://scrapegraph-doc.onrender.com/).
+
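+## 💻 Usage
+
+The main scraping pipelines that can be used to extract information from a website (or a local file) are:
+
+- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
+- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
+- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file;
+- `SmartScraperMultiGraph`: multi-page scraper, given a single prompt.
+Different LLMs can be used through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models through **Ollama**.
+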
+### Case 1: SmartScraper using local models
+Make sure you have [Ollama](https://ollama.com/) installed and have downloaded the models with the `ollama pull` command.
+
+```python
+from scrapegraphai.graphs import SmartScraperGraph
+
+graph_config = {
+    "llm": {
+        "model": "ollama/mistral",
+        "temperature": 0,
+        "format": "json",  # Ollama needs the format to be specified explicitly
+        "base_url": "http://localhost:11434",  # set the Ollama URL
+    },
+    "embeddings": {
+        "model": "ollama/nomic-embed-text",
+        "base_url": "http://localhost:11434",  # set the Ollama URL
+    },
+    "verbose": True,
+}
+
+smart_scraper_graph = SmartScraperGraph(
+    prompt="List me all the projects with their descriptions",
+    # also accepts a string with the already downloaded HTML code
+    source="https://perinim.github.io/projects",
+    config=graph_config
+)
+
+result = smart_scraper_graph.run()
+print(result)
+```
+
+The output will be a list of projects with their descriptions, like the following:
+
+```python
+{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
+```
+
+### Case 2: SearchGraph using mixed models
+We use **Groq** for the LLM and **Ollama** for the embeddings.
+
+```python
+from scrapegraphai.graphs import SearchGraph
+
+# Define the configuration for the graph
+graph_config = {
+    "llm": {
+        "model": "groq/gemma-7b-it",
+        "api_key": "GROQ_API_KEY",
+        "temperature": 0
+    },
+    "embeddings": {
+        "model": "ollama/nomic-embed-text",
+        "base_url": "http://localhost:11434",  # set the Ollama URL as needed
+    },
+    "max_results": 5,
+}
+
+# Create the SearchGraph instance
+search_graph = SearchGraph(
+    prompt="List me all the traditional recipes from Chioggia",
+    config=graph_config
+)
+
+# Run the graph
+result = search_graph.run()
+print(result)
+```
+
+The output will be a list of recipes, like the following:
+
+```python
+{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
+```
+
+### Case 3: SpeechGraph using OpenAI
+
+You just need to pass the OpenAI API key and the model name.
+
+```python
+from scrapegraphai.graphs import SpeechGraph
+
+graph_config = {
+    "llm": {
+        "api_key": "OPENAI_API_KEY",
+        "model": "gpt-3.5-turbo",
+    },
+    "tts_model": {
+        "api_key": "OPENAI_API_KEY",
+        "model": "tts-1",
+        "voice": "alloy"
+    },
+    "output_path": "audio_summary.mp3",
+}
+
+# ************************************************
+# Create the SpeechGraph instance and run it
+# ************************************************
+
+speech_graph = SpeechGraph(
+    prompt="Make a detailed audio summary of the projects.",
+    source="https://perinim.github.io/projects/",
+    config=graph_config,
+)
+
+result = speech_graph.run()
+print(result)
+```
+The output will be an audio file with the summary of the projects on the page.
+
+## Sponsors
+
+<div style="text-align: center;">
+<a href="https://serpapi.com?utm_source=scrapegraphai">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;">
+</a>
+<a href="https://dashboard.statproxies.com/?refferal=scrapegraph">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 15%;">
+</a>
+</div>
+
+## 🤝 Contributing
+
+Feel free to contribute and join our Discord server to discuss improvements and share suggestions with us!
+
+Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md).
+
+[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/uJN7TYcpNa)
+[![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
+[![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)
+
+
+## 📈 Roadmap
+
+Check out the project roadmap [here](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/README.md)! 🚀
+
+Want to visualize the roadmap in a more interactive way? Check out [markmap](https://markmap.js.org/repl) and visualize it by copy-pasting the markdown content into the editor!
+
+## ❤️ Contributors
+[![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)
+
+
+## 🎓 Citations
+
+If you have used our library for research purposes, please cite us with the following reference:
+```text
+@misc{scrapegraph-ai,
+author = {Marco Perini, Lorenzo Padoan, Marco Vinciguerra},
+title = {Scrapegraph-ai},
+year = {2024},
+url = {https://github.com/VinciGit00/Scrapegraph-ai},
+note = {A Python library for scraping leveraging large language models}
+}
+```
+## Authors
+
+<p align="center">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/logo_authors.png" alt="Authors_logos">
+</p>
+
+## Contact Info
+| | Contact Info |
+|--------------------|----------------------|
+| Marco Vinciguerra | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/marco-vinciguerra-7ba365242/) |
+| Marco Perini | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/perinim/) |
+| Lorenzo Padoan | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/lorenzo-padoan-4521a2154/) |
+
+## 📜 License
+
+ScrapeGraphAI is licensed under the MIT License. See the [LICENSE](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/LICENSE) file for more information.
+
+## Acknowledgements
+
+- We would like to thank all the contributors to the project and the open-source community for their support.
+- ScrapeGraphAI is meant to be used for data exploration and research purposes only. We are not responsible for any misuse of the library.
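
The SearchGraph example in `docs/chinese.md` passes the literal string `"GROQ_API_KEY"` as the key, which reads as a placeholder. A minimal sketch of one way to fill it from the environment follows. The helper `build_graph_config` and the choice of an environment variable named `GROQ_API_KEY` are assumptions for illustration, not part of the library; the dictionary layout is copied from the example in the diff.

```python
import os


def build_graph_config(env=os.environ) -> dict:
    """Build the SearchGraph config from the diff, reading the key from `env`."""
    api_key = env.get("GROQ_API_KEY")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("GROQ_API_KEY is not set")
    return {
        "llm": {
            "model": "groq/gemma-7b-it",
            "api_key": api_key,
            "temperature": 0,
        },
        "embeddings": {
            "model": "ollama/nomic-embed-text",
            "base_url": "http://localhost:11434",
        },
        "max_results": 5,
    }


# Example with an explicit mapping, so nothing depends on the real environment.
config = build_graph_config({"GROQ_API_KEY": "example-key"})
print(config["llm"]["model"])
```

The resulting dictionary can then be passed as `config=` exactly as in the example, which keeps the secret out of source control.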

docs/source/conf.py

Lines changed: 7 additions & 17 deletions
@@ -23,27 +23,17 @@
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
 
-extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon','sphinx_wagtail_theme']
+extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon']
 
 templates_path = ['_templates']
 exclude_patterns = []
 
 # -- Options for HTML output -------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
 
-# html_theme = 'sphinx_rtd_theme'
-html_theme = 'sphinx_wagtail_theme'
-
-html_theme_options = dict(
-    project_name = "ScrapeGraphAI",
-    logo = "scrapegraphai_logo.png",
-    logo_alt = "ScrapeGraphAI",
-    logo_height = 59,
-    logo_url = "https://scrapegraph-ai.readthedocs.io/en/latest/",
-    logo_width = 45,
-    github_url = "https://github.com/VinciGit00/Scrapegraph-ai/tree/main/docs/source/",
-    footer_links = ",".join(
-        ["Landing Page|https://scrapegraphai.com/",
-        "Docusaurus|https://scrapegraph-doc.onrender.com/docs/intro"]
-    ),
-)
+html_theme = 'furo'
+html_theme_options = {
+    "source_repository": "https://github.com/VinciGit00/Scrapegraph-ai/",
+    "source_branch": "main",
+    "source_directory": "docs/source/",
+}
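
The three new theme options identify where a page's source lives in the repository. The sketch below shows how such values could be combined into a per-page edit link. The helper `edit_url` is hypothetical and the URL layout is an assumption modeled on GitHub-style edit URLs; it is not Furo's actual implementation.

```python
# Values copied from the new html_theme_options in conf.py.
theme_options = {
    "source_repository": "https://github.com/VinciGit00/Scrapegraph-ai/",
    "source_branch": "main",
    "source_directory": "docs/source/",
}


def edit_url(options: dict, page: str) -> str:
    """Join repository, branch, directory and page into one edit link (assumed layout)."""
    repo = options["source_repository"].rstrip("/")
    branch = options["source_branch"]
    directory = options["source_directory"].strip("/")
    return f"{repo}/edit/{branch}/{directory}/{page}"


print(edit_url(theme_options, "index.rst"))
```

Stripping the slashes before joining means the trailing `/` on the repository and directory values does not produce doubled separators.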

docs/source/getting_started/installation.rst

Lines changed: 9 additions & 2 deletions
@@ -25,11 +25,18 @@ The library is available on PyPI, so it can be installed using the following com
 
 It is highly recommended to install the library in a virtual environment (conda, venv, etc.)
 
-If you clone the repository, you can install the library using `poetry <https://python-poetry.org/docs/>`_:
+If you clone the repository, it is recommended to use a package manager like `rye <https://rye.astral.sh/>`_.
+To install the library using rye, you can run the following commands:
 
 .. code-block:: bash
 
-   poetry install
+   rye pin 3.10
+   rye sync
+   rye build
+
+.. caution::
+
+   **Rye** must be installed first by following the instructions on the `official website <https://rye.astral.sh/>`_.
 
 Additionally on Windows when using WSL
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
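
The caution about installing Rye first can be turned into a small pre-flight check. This is a sketch only: the helper names `rye_setup_commands`, `rye_is_installed` and `run_setup` are hypothetical, and the command list simply mirrors the three commands shown in the updated installation page.

```python
import shutil
import subprocess


def rye_setup_commands(python_version: str = "3.10") -> list[list[str]]:
    """Return the three commands from the installation page as argument lists."""
    return [["rye", "pin", python_version], ["rye", "sync"], ["rye", "build"]]


def rye_is_installed() -> bool:
    """Check whether a `rye` executable is available on PATH."""
    return shutil.which("rye") is not None


def run_setup() -> None:
    """Run the setup commands in order, stopping at the first failure."""
    if not rye_is_installed():
        raise RuntimeError("rye not found; install it from https://rye.astral.sh/ first")
    for command in rye_setup_commands():
        subprocess.run(command, check=True)
```

`run_setup()` is deliberately not called here, since it modifies the working directory; call it from the root of the cloned repository.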

docs/source/index.rst

Lines changed: 9 additions & 0 deletions
@@ -32,6 +32,15 @@
 
    modules/modules
 
+.. toctree::
+   :hidden:
+   :caption: EXTERNAL RESOURCES
+
+   GitHub <https://github.com/VinciGit00/Scrapegraph-ai>
+   Discord <https://discord.gg/uJN7TYcpNa>
+   Linkedin <https://www.linkedin.com/company/scrapegraphai/>
+   Twitter <https://twitter.com/scrapegraphai>
+
 Indices and tables
 ==================
 