Ollama brings the power of Large Language Models (LLMs) directly to your local machine. It removes the complexity of cloud-based solutions by offering a user-friendly framework for running these powerful models.
Ollama is a platform designed to simplify running machine learning models locally. It offers an intuitive interface for managing and deploying models without extensive technical knowledge. By streamlining setup and execution, Ollama lets developers harness advanced models directly on their local machines, which makes for easier use and faster iterations in development cycles.
However, Ollama does come with a notable limitation when it comes to containerized deployments. To download and manage models, Ollama must be actively running and serving before the models can be accessed. This requirement complicates the deployment process within containers, as it necessitates additional steps to ensure the service is up and operational before any model interactions can occur. Consequently, this adds complexity to Continuous Integration (CI) and Continuous Deployment (CD) pipelines, potentially hindering seamless automation and scaling efforts.
Ollama's Docker Hub page has clear instructions for running Ollama in two steps. Ollama must first be running before you can download a model and have it ready for prompting.
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama run llama3
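For CI, the same two-step flow can be scripted with a readiness wait between the steps. The following is only a minimal sketch: `OLLAMA_URL`, `MODEL`, `SLEEP_SECS`, and the retry count are names and values assumed for illustration, and it uses the non-interactive `ollama pull` rather than `ollama run`.

```shell
#!/bin/sh
# Sketch: wait for the Ollama server, then pull a model.
# OLLAMA_URL, MODEL, SLEEP_SECS and the retry count are illustrative assumptions.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"
MODEL="${MODEL:-llama3}"

# Poll the server's root endpoint until it answers or attempts run out.
wait_for_ollama() {
  attempts="${1:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -sf "$OLLAMA_URL/" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "${SLEEP_SECS:-2}"
  done
  return 1
}

# Pull only once the server is reachable; otherwise fail the step.
pull_when_ready() {
  if wait_for_ollama "$@"; then
    # Assumes the container is named "ollama", as in the command above.
    docker exec ollama ollama pull "$MODEL"
  else
    echo "Ollama did not become ready at $OLLAMA_URL" >&2
    return 1
  fi
}
```

Calling `pull_when_ready 30` after the `docker run` step would wait up to roughly a minute before giving up. This works, but it is exactly the kind of glue script that a process supervisor can replace.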
On their Discord there is a help query about how to do this in one shot, with a solution that is good but not something I would put in production due to its lack of process orchestration and supervision. It's on GitHub as autollama, and I recommend checking it out to learn some new tricks.
This is where I leveraged my past experience with s6-overlay to set up serve and pull in a single container, with serve as a longrun and pull as a oneshot that depends on serve being up and running.
The directory structure for it is as below.
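As a rough idea of what such a layout could look like, here is a sketch that scaffolds an s6-overlay v3 style `s6-rc.d` tree into a scratch directory. It follows general s6-overlay conventions and is not a copy of the repo; the `scaffold_s6` function, the script path `/etc/s6-overlay/scripts/pull`, and the `MODEL` variable are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: scaffold a hypothetical s6-rc.d tree with a "serve" longrun
# and a "pull" oneshot that depends on it. Paths and names are assumptions.
scaffold_s6() {
  root="$1"
  rc="$root/etc/s6-overlay/s6-rc.d"
  scripts="$root/etc/s6-overlay/scripts"
  mkdir -p "$rc/serve" "$rc/pull/dependencies.d" "$rc/user/contents.d" "$scripts"

  # serve: a supervised long-running service.
  echo longrun > "$rc/serve/type"
  printf '#!/bin/sh\nexec ollama serve\n' > "$rc/serve/run"
  chmod +x "$rc/serve/run"

  # pull: a oneshot whose "up" file points at a script.
  echo oneshot > "$rc/pull/type"
  echo /etc/s6-overlay/scripts/pull > "$rc/pull/up"
  # An empty file named after the dependency declares "pull needs serve".
  : > "$rc/pull/dependencies.d/serve"

  printf '#!/bin/sh\nexec ollama pull "${MODEL:-llama3}"\n' > "$scripts/pull"
  chmod +x "$scripts/pull"

  # Register both services in the default "user" bundle.
  : > "$rc/user/contents.d/serve"
  : > "$rc/user/contents.d/pull"
}
```

Running `scaffold_s6 ./rootfs` would produce a tree that could be copied into an image. One caveat with this sketch: s6-rc treats a longrun as up once its process has started unless readiness notification is configured, so a real pull script may still need to wait or retry until the server accepts connections.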
It runs flawlessly, with pull well supervised and orchestrated through to completion. Even when the download gets hammered by slow internet speeds, it keeps the process going without a glitch.
Currently there is a known issue in s6-overlay around service wait time, which initially caused the oneshot to time out. I had to set S6_CMD_WAIT_FOR_SERVICES_MAXTIME=0 to disable the limit so that the model download would not fail.
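One way to apply that setting is as a container environment variable at run time. The sketch below assumes a hypothetical image tag `ollama-s6:latest` and reuses the volume, port, and container name from the Docker Hub command; the value is in milliseconds, and 0 means no time limit.

```shell
#!/bin/sh
# Sketch: start the combined container with the s6 wait limit disabled.
# IMAGE is a hypothetical tag; substitute whatever you built.
IMAGE="${IMAGE:-ollama-s6:latest}"

run_ollama_s6() {
  docker run -d \
    -e S6_CMD_WAIT_FOR_SERVICES_MAXTIME=0 \
    -v ollama:/root/.ollama \
    -p 11434:11434 \
    --name ollama \
    "$IMAGE"
}
```

Setting the same variable with an `ENV` line in the Dockerfile would bake it into the image instead, which avoids relying on every caller to pass it.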
It is alive! At this point I was just super happy with how smoothly it came up.
On subsequent runs, pull only fetches the diff, if any, without downloading the whole model again.
Ollama also has an API that you can prompt, and it's a charm to play around with.
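For example, a prompt can be sent to the `/api/generate` endpoint with curl. This is a minimal sketch: the helper names `build_payload` and `prompt_ollama` are made up for illustration, and the payload builder is deliberately naive, assuming the prompt contains no quotes or backslashes.

```shell
#!/bin/sh
# Sketch: send a single non-streaming prompt to a local Ollama server.
# OLLAMA_URL and the helper names are illustrative assumptions.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Build the JSON request body. Naive: no escaping of quotes or backslashes.
build_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# POST the prompt; the server replies with a JSON object.
prompt_ollama() {
  curl -s "$OLLAMA_URL/api/generate" -d "$(build_payload "$1" "$2")"
}
```

With the container up and the model pulled, `prompt_ollama llama3 "Why is the sky blue?"` should return a JSON document whose `response` field holds the generated text.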
With serve and pull in a single container served alongside your application, this simplifies not only your deployments but also your CI, letting you test without overcomplicating things with hacked-together scripts.
I have put the repo on GitHub as ollama-s6 for anyone looking to productionize their Ollama deployments.
Originally published at https://dev.to on May 29, 2024.