---
layout: integration
name: Apify
description: Extract data from the web and automate web tasks using Apify-Haystack integration.
authors:
  - name: apify
    socials:
      github: https://github.com/apify
      twitter: https://x.com/apify
      linkedin: https://www.linkedin.com/company/apifytech
pypi: https://pypi.org/project/apify-haystack
repo: https://github.com/apify/apify-haystack
type: Data Ingestion
report_issue: https://github.com/apify/apify-haystack/issues
logo: /logos/apify.png
version: Haystack 2.0
toc: true
---

### Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
  - [ApifyDatasetFromActorCall on its own](#apifydatasetfromactorcall-on-its-own)
  - [ApifyDatasetFromActorCall in a RAG pipeline](#apifydatasetfromactorcall-in-a-rag-pipeline)
- [License](#license)

## Overview

[Apify](https://apify.com) is a web scraping and data extraction platform.
It helps automate web tasks and extract content from e-commerce websites, social media (Facebook, Instagram, TikTok), search engines, online maps, and more.
Apify provides more than two thousand ready-made cloud solutions called Actors.

## Installation

Install the Apify-Haystack integration:
```bash
pip install apify-haystack
```

## Usage

Once installed, you have access to more than two thousand ready-made Actors at [Apify Store](https://apify.com/store). With the integration you can, for example:

- Load a dataset from Apify and convert it to Haystack Documents
- Extract data from Facebook/Instagram and save it in the InMemoryDocumentStore
- Crawl websites, scrape text content, and store it in the InMemoryDocumentStore
- Build Retrieval-Augmented Generation (RAG) pipelines: extract text from a website and answer questions about it

The integration implements the following components (you can find their usage in these [examples](https://github.com/apify/apify-haystack/tree/main/src/apify_haystack/examples)):
- `ApifyDatasetLoader`: Load a dataset created by an Apify Actor
- `ApifyDatasetFromActorCall`: Call an Apify Actor, load the dataset, and convert it to Haystack Documents
- `ApifyDatasetFromTaskCall`: Call an Apify task, load the dataset, and convert it to Haystack Documents
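
Conceptually, each of these components funnels every Apify dataset item (a plain `dict` produced by an Actor) through a `dataset_mapping_function` that you supply, yielding one Haystack Document per item. A dependency-free sketch of that mapping (the `Doc` dataclass is a stand-in for Haystack's `Document`, and the sample item is invented):

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for haystack.Document: text content plus arbitrary metadata."""
    content: str
    meta: dict = field(default_factory=dict)


def dataset_mapping_function(dataset_item: dict) -> Doc:
    # Pick out the fields this Actor emits; Website Content Crawler uses "text" and "url"
    return Doc(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


item = {"url": "https://haystack.deepset.ai", "text": "Haystack is an open-source framework"}
doc = dataset_mapping_function(item)
print(doc.content)  # the scraped text becomes the Document body
```

The real components call this function once per dataset item and return the resulting list of Documents under the `"documents"` key.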

You need an Apify account and an Apify API token to run the examples.
You can start with a free account at [Apify](https://apify.com/) and get your [Apify API token](https://docs.apify.com/platform/integrations/api#api-token).

In the examples below, set `apify_api_token` and run the script.
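
One convenient way to provide the tokens (an assumption about your project layout, not a requirement of the integration) is a `.env` file next to the script, which `load_dotenv()` reads; the values below are placeholders:

```text
# .env
APIFY_API_TOKEN=your-apify-api-token
OPENAI_API_KEY=your-openai-api-key
```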


### ApifyDatasetFromActorCall on its own

Use Apify's [Website Content Crawler](https://apify.com/apify/website-content-crawler) to crawl a website, scrape text content, and convert it to Haystack Documents. You can browse other Actors in [Apify Store](https://apify.com/store).

In the example below, the text content is extracted from https://haystack.deepset.ai/.
You can control the number of crawled pages using the `maxCrawlPages` parameter. For a detailed overview of the parameters, refer to the [Website Content Crawler input schema](https://apify.com/apify/website-content-crawler/input-schema).

The script should produce the following output (truncated to a single Document):
```text
Document(id=a617d376*****, content: 'Introduction to Haystack 2.x
Haystack is an open-source framework fo...', meta: {'url': 'https://docs.haystack.deepset.ai/docs/intro'})
```

```python
import os

from dotenv import load_dotenv
from haystack import Document

from apify_haystack import ApifyDatasetFromActorCall

# Set APIFY_API_TOKEN here or load it from a .env file
load_dotenv()
apify_api_token = "" or os.getenv("APIFY_API_TOKEN")

actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 3,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.deepset.ai/"}],
}


def dataset_mapping_function(dataset_item: dict) -> Document:
    """Convert an Apify dataset item to a Haystack Document.

    Website Content Crawler returns a dataset with the following output fields:
    {
        "url": "https://haystack.deepset.ai",
        "text": "Haystack is an open-source framework for building production-ready LLM applications",
    }
    """
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


actor = ApifyDatasetFromActorCall(
    actor_id=actor_id,
    run_input=run_input,
    dataset_mapping_function=dataset_mapping_function,
    apify_api_token=apify_api_token,
)
print(f"Calling the Apify Actor {actor_id} ... crawling will take some time ...")
print("You can monitor the progress at: https://console.apify.com/actors/runs")

dataset = actor.run().get("documents")

print(f"Loaded {len(dataset)} documents from the Apify Actor {actor_id}:")
for d in dataset:
    print(d)
```

### ApifyDatasetFromActorCall in a [RAG pipeline](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline)

*Retrieval-Augmented Generation (RAG):* extract text content from a website and use it for question answering.
The example answers questions about the https://haystack.deepset.ai website using the extracted text content.

Expected output:
```text
question: "What is haystack?"
answer: Haystack is an open-source framework for building production-ready LLM applications
```

In addition to the Apify API token, this example also requires an OpenAI API key.

```python
import os

from dotenv import load_dotenv
from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.utils.auth import Secret

from apify_haystack import ApifyDatasetFromActorCall

# Set APIFY_API_TOKEN and OPENAI_API_KEY here or load them from a .env file
load_dotenv()
apify_api_token = "" or os.getenv("APIFY_API_TOKEN")
openai_api_key = "" or os.getenv("OPENAI_API_KEY")

actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 1,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.deepset.ai/"}],
}


def dataset_mapping_function(dataset_item: dict) -> Document:
    """Convert an Apify dataset item to a Haystack Document.

    Website Content Crawler returns a dataset with the following output fields:
    {
        "url": "https://haystack.deepset.ai",
        "text": "Haystack is an open-source framework for building production-ready LLM applications",
    }
    """
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


apify_dataset_loader = ApifyDatasetFromActorCall(
    actor_id=actor_id,
    run_input=run_input,
    dataset_mapping_function=dataset_mapping_function,
    apify_api_token=apify_api_token,
)

# Components
print("Initializing components...")
document_store = InMemoryDocumentStore()

docs_embedder = OpenAIDocumentEmbedder(api_key=Secret.from_token(openai_api_key))
text_embedder = OpenAITextEmbedder(api_key=Secret.from_token(openai_api_key))
retriever = InMemoryEmbeddingRetriever(document_store)
generator = OpenAIGenerator(model="gpt-3.5-turbo", api_key=Secret.from_token(openai_api_key))

# Load documents from Apify
print("Crawling and indexing documents...")
print("You can visit https://console.apify.com/actors/runs to monitor the progress")
docs = apify_dataset_loader.run()
embeddings = docs_embedder.run(docs.get("documents"))
document_store.write_documents(embeddings["documents"])

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)

# Add components to your pipeline
print("Initializing pipeline...")
pipe = Pipeline()
pipe.add_component("embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", generator)

# Now, connect the components to each other
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

question = "What is haystack?"

print("Running pipeline ... ")
response = pipe.run({"embedder": {"text": question}, "prompt_builder": {"question": question}})

print(f"question: {question}")
print(f"answer: {response['llm']['replies'][0]}")

# Other questions
examples = [
    "Who created Haystack?",
    "Are there any upcoming events or community talks?",
]

for example in examples:
    response = pipe.run({"embedder": {"text": example}, "prompt_builder": {"question": example}})
    print(f"question: {example}")
    print(f"answer: {response['llm']['replies'][0]}")
```


## License

`apify-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.