Create a server client object in Outlines #1541
Replies: 3 comments
-
To clarify, is this running a server that Outlines handles, or do we assume the user has an external vLLM server they're managing?

General idea notes

If we assume the user has their own server, I think this is a great idea. I've found it to be best practice to separate inference from logic, so that it's easy to swap your inference backend from a dev server running a tiny model to a production inference server just by changing the server URL and model name. This would also allow us to replicate the actual value of Instructor, which is serving a consistent API across model providers.

I think we would want to target vLLM first. As you mentioned, vLLM exposes the same API the OpenAI client expects, though vLLM supports additional features on top of it. I would delegate most or all of the API to the underlying client; in principle we could just have a thin wrapper class.

Async vs. sync

I would provide a sync client first and then an async variant separately, though I feel like the async variant is a lower priority, mostly because sync is easier to work with. That said, I'm not a big async user, so opinions are welcome there. @RobinPicard, it seems like you have a preference for async-first, is that right?
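A minimal sketch of what such a sync client could look like, under these assumptions: the user points the standard OpenAI client at a vLLM server at `http://localhost:8000/v1`, the output type is a Pydantic model, and the schema is passed through vLLM's `guided_json` extra parameter. The class name `ServerClient` and everything else here are illustrative, not an existing Outlines API.

```python
import openai
from pydantic import BaseModel


class ServerClient:
    """Hypothetical wrapper: delegates the request to the OpenAI client
    and only translates the output type into server arguments."""

    def __init__(self, client: openai.OpenAI, model_name: str):
        self.client = client
        self.model_name = model_name

    def __call__(self, prompt: str, output_type: type[BaseModel]) -> str:
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
            # Assumes a vLLM OpenAI-compatible server, which accepts a JSON
            # schema through the `guided_json` extra-body parameter.
            extra_body={"guided_json": output_type.model_json_schema()},
        )
        return response.choices[0].message.content


class Character(BaseModel):
    name: str
    age: int


client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
model = ServerClient(client, "TheBloke/Mistral-7B-OpenOrca-AWQ")
print(model("Create a character.", Character))
```

An async variant would mirror this with `openai.AsyncOpenAI` and an `async def __call__`, which is part of why the sync version seems like the natural starting point.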
-
It also occurs to me that we could maybe create a separate package to fake OpenAI compatibility, similar to our vllm wrapper but across all model types. We'd basically have a FastAPI wrapper that managed a few common arguments, but primarily:

```python
import outlines
import outlines_server
import transformers

MODEL_NAME = "TheBloke/Mistral-7B-OpenOrca-AWQ"

model = outlines.from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained(MODEL_NAME),
    transformers.AutoTokenizer.from_pretrained(MODEL_NAME),
)

outlines_server.serve(model)
```

which would handle incoming schemas/grammars/etc. This would let us provide a sufficient OpenAI wrapper around any of our backends.
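If such an `outlines_server` existed, the consumer side could be the stock OpenAI client; the URL and the shape of the structured-output argument below are assumptions about this hypothetical server, not anything that exists today.

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="TheBloke/Mistral-7B-OpenOrca-AWQ",
    messages=[{"role": "user", "content": "Create a character."}],
    # The hypothetical server would translate this schema into whichever
    # structured-generation mechanism the underlying Outlines backend uses.
    extra_body={
        "json_schema": {
            "type": "object",
            "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
            "required": ["name", "age"],
        }
    },
)
print(response.choices[0].message.content)
```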
-
We want this to be as simple as possible. In the spirit of our other model integrations, the user would be in charge of passing the client (the OpenAI client for vLLM, for instance) and we would only do the output type translation. I think that should keep the integrations very low maintenance.
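Under that design, the only code Outlines would own is the output type translation itself. A rough sketch, assuming a Pydantic model as the output type and vLLM's `guided_json` parameter as the target; the function name is hypothetical.

```python
from pydantic import BaseModel


class Character(BaseModel):
    name: str
    age: int


def translate_output_type(output_type: type[BaseModel]) -> dict:
    # Map the Outlines-style output type to the extra request arguments
    # understood by the vLLM OpenAI-compatible server.
    return {"guided_json": output_type.model_json_schema()}


# The user-supplied OpenAI client would receive this via `extra_body`.
print(translate_output_type(Character))
```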
-
LLM users often like to use a language model that runs on a server and to which they then make requests (using vLLM, for instance). An issue is that talking to a server directly is quite inconvenient, as you need to build the requests, parse the responses, handle errors, etc. yourself.
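To illustrate the inconvenience, here is roughly what the manual workflow looks like today against a vLLM OpenAI-compatible server; the URL, model name, and the `guided_json` field are assumptions about a locally running server.

```python
import requests

payload = {
    "model": "TheBloke/Mistral-7B-OpenOrca-AWQ",
    "messages": [{"role": "user", "content": "Create a character."}],
    # Structured output has to be spelled out as a raw JSON schema by hand.
    "guided_json": {
        "type": "object",
        "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
        "required": ["name", "age"],
    },
}
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```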
@rlouf suggested we create an object in Outlines that would be used to call an LLM running on a server. This object would be similar to a `Model`, as it would also be called with a prompt and an output type and would return the completion. On top of providing convenience to users, this feature would help strengthen Outlines' position as the best interface for using LLMs.

A few questions on this topic: