
Add Elasticsearch #215


Open · wants to merge 7 commits into main

Conversation

@evanvolgas commented Apr 19, 2025

I'm still working on the LTR, SemanticSearch, etc. classes, but I've got the basics of an Elasticsearch engine up and running and have tested chapter 4 (further tests are coming). I'll comment below when I believe this is ready for testing/merge.

In the meantime, I'd appreciate a quick set of eyeballs to make sure I'm not wasting my time with this. The LTR and semantic search classes aren't there yet, but the Collection and Engine should both be in good shape.

Tested from the notebook server using existing Jupyter notebooks and the following:

# Import necessary libraries
import aips
import os
from aips.data_loaders.products import load_dataframe

# Set environment variables for Elasticsearch
os.environ["AIPS_ES_HOST"] = "aips-elasticsearch"
os.environ["AIPS_ES_PORT"] = "9201"

# Set Elasticsearch as the active engine
aips.set_engine("elasticsearch")

# Check if Elasticsearch is running
aips.healthcheck()

# Initialize the engine and create a collection
engine = aips.get_engine()
collection = engine.create_collection("products")

# Check if the data file exists
!ls -la /tmp/notebooks/data/retrotech/products.csv

# If the file doesn't exist, download the data
!mkdir -p /tmp/notebooks/data/retrotech
![ ! -f /tmp/notebooks/data/retrotech/products.csv ] && \
    ([ ! -d /tmp/notebooks/retrotech ] && git clone --depth=1 https://github.com/ai-powered-search/retrotech.git /tmp/notebooks/retrotech || true) && \
    cd /tmp/notebooks/retrotech && \
    tar -xvf products.tgz -C '/tmp/notebooks/data/retrotech/'

# Load the product data
products_dataframe = load_dataframe("/tmp/notebooks/data/retrotech/products.csv")

# Write the data to Elasticsearch
collection.write(products_dataframe)

# Verify indexing by performing a simple search
response = collection.search(
    query="ipad",
    query_fields=["name", "manufacturer", "long_description"],
    limit=5
)

# Print the search results
print(f"Found {len(response['docs'])} documents:")
for doc in response["docs"]:
    print(f"- {doc.get('name', 'Unknown')} (Score: {doc.get('score', 0)})")
"""
Found 5 documents:
- iPad® - Refurbished Digital A/V Adapter (Score: 7.6593866)
- iPad® - Refurbished USB Power Adapter (Score: 7.6593866)
- iPad® - Refurbished Dock Connector-to-VGA Adapter (Score: 7.6593866)
- iPad® - Refurbished Keyboard Dock (Score: 7.6593866)
- iPad® - Refurbished Digital Camera Connection Kit (Score: 7.6593866)
"""

Also, would you be open to a PR that formats everything with Black?

@kosmikdc (Collaborator)

Hey Evan, this is looking pretty good. A few comments:

  • It seems you've got some duplication in the Docker files and the Docker config .yml for Elasticsearch.
  • As far as schema management goes, Elasticsearch is much more like OpenSearch than Solr, so I'd model the schema creation/management after what the OpenSearch code is doing, perhaps even reusing the OpenSearch schema config. A rough sketch of what I mean follows this list.
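
For instance, something like this, with the caveat that the index name and field types here are illustrative rather than taken from the repo's actual config (8.x-style keyword arguments on the official Python client):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://aips-elasticsearch:9201")

# Field names/types are illustrative, not the repo's actual schema
products_schema = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "manufacturer": {"type": "text"},
            "long_description": {"type": "text"},
        }
    }
}

# Like OpenSearch (and unlike Solr's managed schema), the mapping goes
# in a single index-creation call
es.indices.create(index="products", mappings=products_schema["mappings"])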

@evanvolgas (Author)

Thanks for the feedback! I'll need the weekend, but I'll make the changes and some more progress soon.

@evanvolgas (Author)

@kosmikdc I think this is what you're after?

I've added a test script (I'll move it to a test folder or something before this is ready for merge) that was helpful for me. I'd like to review the dense_vector and script_score documentation a bit to make sure I didn't screw that up, and obviously I still need to test and implement some things (LTR, semantic search, etc.).
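
For reference, this is roughly the query shape I want to verify against the docs. It's a sketch, not the final class code: "embedding" is a hypothetical dense_vector field name, and the query vector would come from a model rather than being hard-coded.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://aips-elasticsearch:9201")

# Script-score kNN over a hypothetical dense_vector field "embedding"
query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            # cosineSimilarity ranges over [-1, 1]; Elasticsearch requires
            # non-negative scores, hence the +1.0 shift
            "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
            "params": {"query_vector": [0.1, 0.2, 0.3]},  # placeholder values
        },
    }
}
response = es.search(index="products", query=query, size=5)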

From the schema management perspective, though, is this closer to what you were after?

I'm committed to finishing this PR and making the book accessible to ES users as well (I simply can't convince my org to switch products, but I'd love to teach members of my team how to use these concepts and tools). It'll be a weekend thing, but I can spend enough weekends on it to get ES support into the book. Please keep the feedback coming :)

@kosmikdc (Collaborator)

Mornin' @evanvolgas, love your enthusiasm and contributions here. The test script is an excellent idea; it'll serve many purposes as the ES integration comes together :)

The schema management is better in the sense that it now renders the correct format, but perhaps the OpenSearch config should be generalized (a simple rename and directory move) and then shared by ElasticsearchEngine rather than duplicated. At least it's functional for now.
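
To make that concrete, here's a minimal sketch of the shared-config idea; the paths, file layout, and method names are hypothetical, not the repo's actual structure:

import json

from elasticsearch import Elasticsearch

SHARED_SCHEMA_DIR = "engines/shared/schemas"  # hypothetical location

def load_schema(collection_name):
    # The same engine-agnostic schema file OpenSearchEngine would read
    with open(f"{SHARED_SCHEMA_DIR}/{collection_name}.json") as f:
        return json.load(f)

class ElasticsearchEngine:
    def __init__(self, client: Elasticsearch):
        self.client = client

    def create_collection(self, name):
        schema = load_schema(name)
        # Identical create-index shape to OpenSearch, so one config serves both
        self.client.indices.create(index=name,
                                   settings=schema.get("settings", {}),
                                   mappings=schema.get("mappings", {}))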

One more thing: the AIPS system is set up to utilize Spark for data processing, which simplifies and standardizes many data operations, mainly batch reading and writing (aips.spark.create_view_from_collection() and Collection.write()). To use Spark with a given search engine, the Spark connector just has to be installed on the Jupyter image, similar to the other Spark installs. Once the Elasticsearch Spark connector is hooked in, the batch data functionality in ElasticsearchCollection.write() and in the multimethod aips.spark.create_view_from_collection() would simplify greatly, roughly like the sketch below.
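
A sketch, assuming the elasticsearch-spark (elasticsearch-hadoop) connector jar is on the Jupyter image; the option names come from the connector's docs, while the class shape and host/port values are just the test settings from this PR:

from pyspark.sql import DataFrame

class ElasticsearchCollection:
    def __init__(self, name):
        self.name = name

    def write(self, dataframe: DataFrame):
        # Hand batch indexing to the connector instead of hand-rolled bulk calls
        (dataframe.write
            .format("org.elasticsearch.spark.sql")
            .option("es.nodes", "aips-elasticsearch")
            .option("es.port", "9201")
            .option("es.nodes.wan.only", "true")  # single-node Docker setup
            .mode("append")
            .save(self.name))  # the index name doubles as the save path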

I'll be contributing however I can in the coming weeks.
