
Add Elasticsearch #215


Open · wants to merge 7 commits into main

Conversation

@evanvolgas commented Apr 19, 2025

I'm still working on the LTR, SemanticSearch, etc. classes, but I've got the basics of an Elasticsearch engine up and running and have tested chapter 4 (further tests are coming). I'll comment below when I believe this is ready for testing/merge.

In the meantime, I'd appreciate a quick set of eyeballs to make sure I'm not wasting my time with this. The LTR and semantic search classes aren't there yet, but the Collection and Engine should both be in good shape.

Tested from the notebook server using existing Jupyter notebooks and the following:

# Import necessary libraries
import aips
import os
from aips.data_loaders.products import load_dataframe

# Set environment variables for Elasticsearch
os.environ["AIPS_ES_HOST"] = "aips-elasticsearch"
os.environ["AIPS_ES_PORT"] = "9201"

# Set Elasticsearch as the active engine
aips.set_engine("elasticsearch")

# Check if Elasticsearch is running
aips.healthcheck()

# Initialize the engine and create a collection
engine = aips.get_engine()
collection = engine.create_collection("products")

# Check if the data file exists
!ls -la /tmp/notebooks/data/retrotech/products.csv

# If the file doesn't exist, download the data
!mkdir -p /tmp/notebooks/data/retrotech
![ ! -f /tmp/notebooks/data/retrotech/products.csv ] && \
    ([ ! -d /tmp/notebooks/retrotech ] && git clone --depth=1 https://github.com/ai-powered-search/retrotech.git /tmp/notebooks/retrotech || true) && \
    cd /tmp/notebooks/retrotech && \
    tar -xvf products.tgz -C '/tmp/notebooks/data/retrotech/'

# Load the product data
products_dataframe = load_dataframe("/tmp/notebooks/data/retrotech/products.csv")

# Write the data to Elasticsearch
collection.write(products_dataframe)

# Verify indexing by performing a simple search
response = collection.search(
    query="ipad",
    query_fields=["name", "manufacturer", "long_description"],
    limit=5
)

# Print the search results
print(f"Found {len(response['docs'])} documents:")
for doc in response["docs"]:
    print(f"- {doc.get('name', 'Unknown')} (Score: {doc.get('score', 0)})")
"""
Found 5 documents:
- iPad® - Refurbished Digital A/V Adapter (Score: 7.6593866)
- iPad® - Refurbished USB Power Adapter (Score: 7.6593866)
- iPad® - Refurbished Dock Connector-to-VGA Adapter (Score: 7.6593866)
- iPad® - Refurbished Keyboard Dock (Score: 7.6593866)
- iPad® - Refurbished Digital Camera Connection Kit (Score: 7.6593866)
"""

Also, would you be open to a PR that formats everything with Black?

@kosmikdc (Collaborator)

Hey Evan, this is looking pretty good. A few comments:

  • It seems you've got some duplication in the Docker files and the Docker config .yml for Elasticsearch.
  • As far as schema management goes, Elasticsearch is much more like OpenSearch than Solr, so I'd model the schema creation/management after what the OpenSearch code is doing, perhaps even reusing the OpenSearch schema config. A rough sketch of what I mean follows this list.
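
For instance, something like this, with the caveat that the index name and field types here are illustrative rather than taken from the repo's actual config (8.x-style keyword arguments on the official Python client):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://aips-elasticsearch:9201")

# Field names/types are illustrative, not the repo's actual schema
products_schema = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "manufacturer": {"type": "text"},
            "long_description": {"type": "text"},
        }
    }
}

# Like OpenSearch (and unlike Solr's managed schema), the mapping goes
# in a single index-creation call
es.indices.create(index="products", mappings=products_schema["mappings"])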

@evanvolgas (Author)

Thanks for the feedback! I'll need the weekend, but I'll make the changes and some more progress soon.

@evanvolgas (Author)

@kosmikdc I think this is what you're after?

I've added a test script (I'll move it to a test folder or something before this is ready for merge) that was helpful for me. I'd like to review the dense_vector and script_score documentation a bit to make sure I didn't screw that up, and obviously I still need to test and implement some things (LTR, semantic search, etc.).
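
For reference, this is roughly the query shape I want to verify against the docs. It's a sketch, not the final class code: "embedding" is a hypothetical dense_vector field name, and the query vector would come from a model rather than being hard-coded.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://aips-elasticsearch:9201")

# Script-score kNN over a hypothetical dense_vector field "embedding"
query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            # cosineSimilarity ranges over [-1, 1]; Elasticsearch requires
            # non-negative scores, hence the +1.0 shift
            "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
            "params": {"query_vector": [0.1, 0.2, 0.3]},  # placeholder values
        },
    }
}
response = es.search(index="products", query=query, size=5)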

From the schema management perspective, though, is this closer to what you were after?

I'm committed to finishing this PR and making the book accessible to ES users as well (I simply can't convince my org to switch products, but I'd love to teach members of my team how to use these concepts and tools). It'll be a weekend thing, but I can spend enough weekends on it to get ES support into the book. Please keep the feedback coming :)

@kosmikdc (Collaborator)

Mornin' @evanvolgas, love your enthusiasm and contributions here. The test script is an excellent idea; it'll serve many purposes as the ES integration comes together :)

The schema management is better in the sense that it now renders the correct format, but perhaps the OpenSearch config should be generalized (a simple rename and directory move) and then shared by ElasticsearchEngine rather than duplicated. At least it's functional for now.
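
To make that concrete, here's a minimal sketch of the shared-config idea; the paths, file layout, and method names are hypothetical, not the repo's actual structure:

import json

from elasticsearch import Elasticsearch

SHARED_SCHEMA_DIR = "engines/shared/schemas"  # hypothetical location

def load_schema(collection_name):
    # The same engine-agnostic schema file OpenSearchEngine would read
    with open(f"{SHARED_SCHEMA_DIR}/{collection_name}.json") as f:
        return json.load(f)

class ElasticsearchEngine:
    def __init__(self, client: Elasticsearch):
        self.client = client

    def create_collection(self, name):
        schema = load_schema(name)
        # Identical create-index shape to OpenSearch, so one config serves both
        self.client.indices.create(index=name,
                                   settings=schema.get("settings", {}),
                                   mappings=schema.get("mappings", {}))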

One more thing: the AIPS system is set up to utilize Spark for data processing, which simplifies and standardizes many data operations, mainly batch reading and writing (aips.spark.create_view_from_collection() and Collection.write()). To use Spark with a given search engine, the Spark connector just has to be installed on the Jupyter image, similar to the other Spark installs. Once the Elasticsearch Spark connector is hooked in, the batch data functionality in ElasticsearchCollection.write() and in the multimethod aips.spark.create_view_from_collection() would simplify greatly, roughly like the sketch below.
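
A sketch, assuming the elasticsearch-spark (elasticsearch-hadoop) connector jar is on the Jupyter image; the option names come from the connector's docs, while the class shape and host/port values are just the test settings from this PR:

from pyspark.sql import DataFrame

class ElasticsearchCollection:
    def __init__(self, name):
        self.name = name

    def write(self, dataframe: DataFrame):
        # Hand batch indexing to the connector instead of hand-rolled bulk calls
        (dataframe.write
            .format("org.elasticsearch.spark.sql")
            .option("es.nodes", "aips-elasticsearch")
            .option("es.port", "9201")
            .option("es.nodes.wan.only", "true")  # single-node Docker setup
            .mode("append")
            .save(self.name))  # the index name doubles as the save path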

I'll be contributing however I can in the coming weeks.
