
Commit dad8f86

jirispilka, BraniganLee, and bilgeyucel authored
Add description of Apify-haystack integration (#248)
Add description of Apify-haystack integration (#248)

* Add description of Apify-haystack integration
* Apply suggestions from code review
* Update integrations/apify.md
* Handle review comments
* Replace apify logo (with transparent background)

Co-authored-by: BraniganLee <daniel.lee@apify.com>
Co-authored-by: Bilge Yücel <bilge.yucel@deepset.ai>
1 parent 451f266 commit dad8f86

File tree

2 files changed: +244, -0 lines changed

integrations/apify.md

Lines changed: 244 additions & 0 deletions
@@ -0,0 +1,244 @@
---
layout: integration
name: Apify
description: Extract data from the web and automate web tasks using Apify-Haystack integration.
authors:
  - name: apify
    socials:
      github: https://github.com/apify
      twitter: https://x.com/apify
      linkedin: https://www.linkedin.com/company/apifytech
pypi: https://pypi.org/project/apify-haystack
repo: https://github.com/apify/apify-haystack
type: Data Ingestion
report_issue: https://github.com/apify/apify-haystack/issues
logo: /logos/apify.png
version: Haystack 2.0
toc: true
---

### Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
  - [ApifyDatasetFromActorCall on its own](#apifydatasetfromactorcall-on-its-own)
  - [ApifyDatasetFromActorCall in a RAG pipeline](#apifydatasetfromactorcall-in-a-rag-pipeline)
- [License](#license)

## Overview

[Apify](https://apify.com) is a web scraping and data extraction platform.
It helps automate web tasks and extract content from e-commerce websites, social media (Facebook, Instagram, TikTok), search engines, online maps, and more.
Apify provides more than two thousand ready-made cloud solutions called Actors.

## Installation

Install the Apify-haystack integration:

```bash
pip install apify-haystack
```

## Usage

Once installed, you will have access to more than two thousand ready-made apps called Actors at [Apify Store](https://apify.com/store). For example, you can:

- Load a dataset from Apify and convert it to a Haystack Document
- Extract data from Facebook/Instagram and save it in the InMemoryDocumentStore
- Crawl websites, scrape text content, and store it in the InMemoryDocumentStore
- Build a Retrieval-Augmented Generation (RAG) pipeline: extract text from a website and use it for question answering

The integration implements the following components (you can find their usage in these [examples](https://github.com/apify/apify-haystack/tree/main/src/apify_haystack/examples)):

- `ApifyDatasetLoader`: Load a dataset created by an Apify Actor (see the sketch after this list)
- `ApifyDatasetFromActorCall`: Call an Apify Actor, load the dataset, and convert it to Haystack Documents
- `ApifyDatasetFromTaskCall`: Call an Apify task, load the dataset, and convert it to Haystack Documents
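
Since `ApifyDatasetLoader` is the one component not demonstrated later on this page, here is a minimal sketch of how it might be used. The dataset ID is a placeholder and the constructor arguments are assumptions based on the other components; see the linked examples for the authoritative usage.

```python
from haystack import Document

from apify_haystack import ApifyDatasetLoader

# Hypothetical ID of a dataset produced by an earlier Actor run (placeholder)
dataset_id = "YOUR-DATASET-ID"


def dataset_mapping_function(dataset_item: dict) -> Document:
    """Map one dataset item to a Haystack Document (fields assumed to match Website Content Crawler output)."""
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


# Assumes the Apify API token is available as APIFY_API_TOKEN in the environment
loader = ApifyDatasetLoader(dataset_id=dataset_id, dataset_mapping_function=dataset_mapping_function)
documents = loader.run().get("documents")
print(f"Loaded {len(documents)} documents from dataset {dataset_id}")
```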

You need an Apify account and an Apify API token to run these examples.
You can start with a free account at [Apify](https://apify.com/) and get your [Apify API token](https://docs.apify.com/platform/integrations/api#api-token).

In the examples below, specify `apify_api_token` (or export it as an environment variable, as sketched next) and run the script.
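
A minimal sketch of wiring the tokens in from the environment: the variable names `APIFY_API_TOKEN` and `OPENAI_API_KEY` match the ones read in the RAG example below, and exporting them beforehand (or putting them in a `.env` file) is an assumption about your setup.

```python
import os

from dotenv import load_dotenv

# Load variables from a local .env file, if present (e.g. a line such as
# APIFY_API_TOKEN=apify_api_xxx), then fall back to the shell environment.
load_dotenv()

apify_api_token = os.getenv("APIFY_API_TOKEN", "")
openai_api_key = os.getenv("OPENAI_API_KEY", "")  # only needed for the RAG example

if not apify_api_token:
    raise RuntimeError("Set APIFY_API_TOKEN in your environment or .env file")
```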

### ApifyDatasetFromActorCall on its own

Use Apify's [Website Content Crawler](https://apify.com/apify/website-content-crawler) to crawl a website, scrape text content, and convert it to Haystack Documents. You can browse other Actors in [Apify Store](https://apify.com/store).

In the example below, the text content is extracted from https://haystack.deepset.ai/.
You can control the number of crawled pages using the `maxCrawlPages` parameter. For a detailed overview of the parameters, please refer to the [Website Content Crawler input schema](https://apify.com/apify/website-content-crawler/input-schema).

The script should produce the following output (truncated to a single Document):

```text
Document(id=a617d376*****, content: 'Introduction to Haystack 2.x)
Haystack is an open-source framework fo...', meta: {'url': 'https://docs.haystack.deepset.ai/docs/intro'}
```

```python
import os

from dotenv import load_dotenv
from haystack import Document

from apify_haystack import ApifyDatasetFromActorCall

# Set your Apify API token here or load it from a .env file (APIFY_API_TOKEN=...)
load_dotenv()
apify_api_token = "" or os.getenv("APIFY_API_TOKEN")

actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 3,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.deepset.ai/"}],
}


def dataset_mapping_function(dataset_item: dict) -> Document:
    """Convert an Apify dataset item to a Haystack Document.

    Website Content Crawler returns a dataset with the following output fields:
    {
        "url": "https://haystack.deepset.ai",
        "text": "Haystack is an open-source framework for building production-ready LLM applications",
    }
    """
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


actor = ApifyDatasetFromActorCall(
    actor_id=actor_id,
    run_input=run_input,
    dataset_mapping_function=dataset_mapping_function,
    apify_api_token=apify_api_token,
)
print(f"Calling the Apify Actor {actor_id} ... crawling will take some time ...")
print("You can monitor the progress at: https://console.apify.com/actors/runs")

dataset = actor.run().get("documents")

print(f"Loaded {len(dataset)} documents from the Apify Actor {actor_id}:")
for d in dataset:
    print(d)
```

### ApifyDatasetFromActorCall in a [RAG pipeline](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline)

*Retrieval-Augmented Generation (RAG):* extract text content from a website and use it for question answering.
This example answers questions about the https://haystack.deepset.ai website using the extracted text content.

Expected output:
```text
question: "What is haystack?"
answer: Haystack is an open-source framework for building production-ready LLM applications
```

In addition to the Apify API token, you also need an OpenAI API key to run this example.

```python
import os

from dotenv import load_dotenv
from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.utils.auth import Secret

from apify_haystack import ApifyDatasetFromActorCall

# Set your Apify and OpenAI API tokens here or load them from a .env file
load_dotenv()
apify_api_token = "" or os.getenv("APIFY_API_TOKEN")
openai_api_key = "" or os.getenv("OPENAI_API_KEY")

actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 1,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.deepset.ai/"}],
}


def dataset_mapping_function(dataset_item: dict) -> Document:
    """Convert an Apify dataset item to a Haystack Document.

    Website Content Crawler returns a dataset with the following output fields:
    {
        "url": "https://haystack.deepset.ai",
        "text": "Haystack is an open-source framework for building production-ready LLM applications",
    }
    """
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


apify_dataset_loader = ApifyDatasetFromActorCall(
    actor_id=actor_id,
    run_input=run_input,
    dataset_mapping_function=dataset_mapping_function,
    apify_api_token=apify_api_token,
)

# Components
print("Initializing components...")
document_store = InMemoryDocumentStore()

docs_embedder = OpenAIDocumentEmbedder(api_key=Secret.from_token(openai_api_key))
text_embedder = OpenAITextEmbedder(api_key=Secret.from_token(openai_api_key))
retriever = InMemoryEmbeddingRetriever(document_store)
generator = OpenAIGenerator(model="gpt-3.5-turbo", api_key=Secret.from_token(openai_api_key))

# Load documents from Apify
print("Crawling and indexing documents...")
print("You can visit https://console.apify.com/actors/runs to monitor the progress")
docs = apify_dataset_loader.run()
embeddings = docs_embedder.run(docs.get("documents"))
document_store.write_documents(embeddings["documents"])

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)

# Add components to your pipeline
print("Initializing pipeline...")
pipe = Pipeline()
pipe.add_component("embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", generator)

# Now, connect the components to each other
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

question = "What is haystack?"

print("Running pipeline ... ")
response = pipe.run({"embedder": {"text": question}, "prompt_builder": {"question": question}})

print(f"question: {question}")
print(f"answer: {response['llm']['replies'][0]}")

# Other questions
examples = [
    "Who created Haystack?",
    "Are there any upcoming events or community talks?",
]

for example in examples:
    response = pipe.run({"embedder": {"text": example}, "prompt_builder": {"question": example}})
    print(f"question: {example}")
    print(f"answer: {response['llm']['replies'][0]}")
```

## License
`apify-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.

logos/apify.png (70.7 KB)
