Building your own self-hosted AI Assistant on Civo with Llama 4
Learn how to build and self-host your own AI assistant using Llama 4 on Civo, and understand how to take full ownership of your AI future.
Written by
Software Engineer @ GoCardless
Relying on expensive third-party AI APIs is no longer your only option. With the release of Llama 4, developers now have access to cutting-edge open-source models that can run entirely on their own infrastructure with no rate limits, no hidden costs, and complete control over data privacy.
Llama 4 introduces powerful multimodal capabilities and a context window of up to 10 million tokens, making it a serious contender against proprietary models like GPT-4o. When combined with Civo, deploying and scaling your own AI assistant becomes not just possible, but surprisingly simple.
In this tutorial, you’ll learn how to build and self-host your own AI assistant using Llama 4 on Civo. Whether you’re looking to explore advanced AI workloads or bring a real assistant into production, this hands-on walkthrough will show you how to take full ownership of your AI future.
Prerequisites
To follow this tutorial, you will need the following:
- Familiarity with Kubernetes and Docker basics.
- A Civo account to create and manage your Kubernetes cluster.
- kubectl installed and configured to interact with the cluster.
- Docker installed locally to containerize the application.
What’s new with Llama 4?

Meta’s latest release, Llama 4, marks an exciting leap forward in open-source AI. With native multimodal capabilities, Llama 4 can seamlessly understand both text and images, opening the door to richer, more versatile applications. It also introduces an industry-leading context window of up to 10 million tokens (on the Scout model), making it ideal for projects that demand deep memory and personalization.
Across benchmarks, the Llama 4 Scout and Maverick models deliver impressive results, outperforming previous Llama generations and even rivaling top models like Gemini and GPT-4o in reasoning, coding, and multilingual tasks, all while offering outstanding cost efficiency.
Llama 4 vs. Llama 3: A comparison
While Llama 3 laid the foundation for cutting-edge open-source AI, Llama 4 brings several significant improvements that set it apart. These enhancements make Llama 4 a stronger contender for a wider range of use cases, especially when dealing with more complex tasks. Here’s a comparison of the key differences:

| Feature | Llama 3 | Llama 4 |
| --- | --- | --- |
| Modalities | Text only | Native text and image understanding |
| Context window | Up to 128K tokens (Llama 3.1) | Up to 10 million tokens (Scout) |
| Architecture | Dense transformer | Mixture-of-experts (MoE) |
| Benchmarks | Strong for its generation | Rivals Gemini and GPT-4o in reasoning, coding, and multilingual tasks |
In summary, Llama 4 represents a major leap forward from Llama 3, especially in its ability to handle multimodal data, manage long context windows, and perform complex reasoning tasks. These enhancements make it an even more attractive option for developers looking to build advanced, real-world AI applications with greater flexibility and depth.
Building your own self-hosted AI assistant
Step 1: Set up Civo Kubernetes cluster
To create a Kubernetes cluster, follow the detailed instructions provided in the Civo Kubernetes documentation.
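If you prefer the command line, the Civo CLI can do the same in a couple of commands. This is a sketch: the cluster name is illustrative, and available node sizes vary by region, so check `civo kubernetes size` for what your account offers.

```shell
# Create a small cluster and wait until it is ready (name and size are examples)
civo kubernetes create llama4-cluster --nodes 2 --size g4s.kube.medium --wait

# Merge the cluster's kubeconfig into your local config and point kubectl at it
civo kubernetes config llama4-cluster --save

# Verify the nodes are up
kubectl get nodes
```

Once `kubectl get nodes` shows your nodes in the `Ready` state, the cluster is good to go.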
Step 2: Building the Llama 4 API server

Now that your Kubernetes cluster is ready, let's move on to building the core of your AI assistant, the API server that will interact with the Llama 4 model.
We’ll use FastAPI, a modern, high-performance Python framework, to quickly build and expose the API. FastAPI makes it easy to define API routes, handle request validation, and automatically generate interactive documentation for your endpoints.
1. Create app.py
Start by creating a new file called app.py. This file will serve two main purposes:
- Load the Llama 4 model using Hugging Face’s Transformers library.
- Expose a simple API endpoint where users can send prompts and receive model-generated responses.
Here’s an example app.py. First, import the necessary libraries:
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import pipeline
import uvicorn
```
Initialize FastAPI app:
```python
app = FastAPI()
```
Load the Llama 4 model using the Hugging Face Transformers pipeline:
```python
pipe = pipeline(
    "image-text-to-text",
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
)
```
Define request structure:
```python
class PromptRequest(BaseModel):
    role: str
    content: str
```
Define a POST endpoint to accept prompts:
```python
@app.post("/generate")
async def generate_text(request: PromptRequest):
    try:
        messages = [{"role": request.role, "content": request.content}]
        output = pipe(messages)
        return {"response": output}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
What this app does:
- When a user sends a `POST` request to `/generate` with a prompt like `"Explain Kubernetes"`, the model completes the prompt and sends back the generated text.
- If anything goes wrong (such as a model error), the endpoint returns a clean error response.
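You can sanity-check the request schema without starting the server at all. The sketch below reuses the `PromptRequest` model from app.py to show how pydantic accepts a well-formed payload and rejects a malformed one before it ever reaches the model (assuming pydantic is installed, which FastAPI requires anyway):

```python
from pydantic import BaseModel, ValidationError

# Same request schema as in app.py
class PromptRequest(BaseModel):
    role: str
    content: str

# A well-formed payload parses cleanly
req = PromptRequest(role="user", content="Explain Kubernetes")
assert req.role == "user"

# A payload missing a required field is rejected before it reaches the model
try:
    PromptRequest(role="user")
    rejected = False
except ValidationError:
    rejected = True
print("missing field rejected:", rejected)
```

This is the validation FastAPI performs automatically for every incoming request, which is why the endpoint itself only has to handle model errors.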
2. Create requirements.txt
You also need to create a requirements.txt file to list all the Python libraries your app will need when running inside the container.
Here’s what you'll include:
```
fastapi
uvicorn
transformers
torch
```

- `fastapi`: to build and expose the web API.
- `uvicorn`: to serve your FastAPI app.
- `transformers`: to load and run the Llama 4 model.
- `torch`: the backend required for running the model computations.
Later, when you build your Docker image, these dependencies will be automatically installed inside the container.
With your API server and dependencies now ready, you're all set to move on to containerizing the application and preparing it for deployment on your Civo Kubernetes cluster.
Step 3: Dockerize the application
Now that your API server is ready, it’s time to package it into a Docker container. This will make it portable, easy to deploy, and ready for Kubernetes.
1. Create a Dockerfile
Inside your project directory (where app.py and requirements.txt are located), create a file called Dockerfile. This file will tell Docker how to build your container image.
Here’s a simple Dockerfile for your project:
```dockerfile
# Use an official lightweight Python image
FROM python:3.10-slim

# Set the working directory
WORKDIR /app

# Copy project files into the container
COPY requirements.txt .
COPY app.py .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Expose port 8000 for the FastAPI app
EXPOSE 8000

# Command to run the app
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```
Quick breakdown of the Dockerfile:
- It uses a minimal Python image to keep the container lightweight.
- It copies your app files into the container.
- It installs all dependencies listed in `requirements.txt`.
- It exposes port `8000`, which matches the port your FastAPI app listens on.
- It sets the container to run your app automatically on startup.
2. Build the Docker Image
Once your Dockerfile is ready, you can build your container image by running the following command inside your project directory:
```shell
docker build -t your-dockerhub-username/llama4-api .
```

The trailing `.` tells Docker to use the current directory as the build context.
Replace your-dockerhub-username with your actual Docker Hub username.
3. Push the image to Docker Hub
After building the image, push it to a container registry like Docker Hub so that Kubernetes can pull it later:
First, log in to Docker Hub:
```shell
docker login
```
Then push your image:
```shell
docker push your-dockerhub-username/llama4-api
```
With these steps, your application is now containerized and ready to be deployed to your Kubernetes cluster.
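Before deploying, you can smoke-test the image locally. A sketch, with caveats: the `HF_TOKEN` value is a placeholder (the meta-llama weights are gated on Hugging Face, so the container needs an access token to download them), and Llama 4 Scout is far too large for most laptops, so treat this as an optional check on machines with enough memory:

```shell
# Run the container locally, mapping the API port.
# The model download happens on first start and can take a long time.
docker run --rm -p 8000:8000 -e HF_TOKEN=your-hf-token your-dockerhub-username/llama4-api

# In another terminal, check the interactive docs FastAPI generates
curl -s http://localhost:8000/docs | head
```

If the `/docs` page responds, the server wiring is correct even before the model finishes loading prompts.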
Step 4: Deploy Llama 4 on Civo Kubernetes
Now that your app is containerized and pushed to Docker Hub, it’s time to deploy it on your Kubernetes cluster.
You’ll write two YAML files:
- One to deploy your app
- One to expose it so you can access it from outside the cluster
1. Create deployment.yaml
First, create a file named deployment.yaml. This file defines the Kubernetes Deployment that will run your container on the cluster.
Here’s a basic example:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama4-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama4-api
  template:
    metadata:
      labels:
        app: llama4-api
    spec:
      containers:
      - name: llama4-api
        image: your-dockerhub-username/llama4-api
        ports:
        - containerPort: 8000
```
Quick breakdown:
- It tells Kubernetes to run one replica of your container.
- It uses the Docker image you pushed earlier.
- It exposes port `8000` inside the container, matching your FastAPI app.
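One practical note: Llama 4 Scout is a large model, so for real workloads you will want GPU nodes. If your cluster runs the NVIDIA device plugin, you can request a GPU in the container spec; this is a sketch, and the exact resource name depends on which device plugin your cluster uses:

```yaml
        resources:
          limits:
            nvidia.com/gpu: 1
```

Without a GPU request, the pod will fall back to CPU inference, which is far too slow for a model of this size.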
2. Create service.yaml
Now create a second file called service.yaml. This file will expose your app so you can access it from outside the cluster.
Here’s an example:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: llama4-api-service
spec:
  selector:
    app: llama4-api
  type: LoadBalancer
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
```
Quick breakdown:
- It tells Kubernetes to create a LoadBalancer service (great for public access).
- Port `80` on the service forwards to your app's internal port `8000`.
- The service automatically finds the pods labeled `app: llama4-api`.
Note: If you prefer to keep the service private and access it through `kubectl port-forward`, you can change `type: LoadBalancer` to `type: ClusterIP`.
3. Apply the YAMLs
Now that you’ve created the YAML files, apply them to your Kubernetes cluster using the following commands:
```shell
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
```
This will deploy your app and expose it to the outside world.
You can check if everything is running with:
```shell
kubectl get pods
kubectl get services
```
Once your deployment and service are up, your self-hosted Llama 4 assistant will be live on your Kubernetes cluster, ready to accept prompts.
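If something doesn't come up cleanly, a few standard kubectl commands help pinpoint the problem (using the deployment name and labels from the manifests above):

```shell
# Wait for the rollout to finish
kubectl rollout status deployment/llama4-api

# Inspect events if pods are stuck in Pending or CrashLoopBackOff
kubectl describe pod -l app=llama4-api

# Tail the API server logs
kubectl logs -l app=llama4-api --tail=50
```

A pod stuck in `Pending` usually means the node lacks the memory (or GPU) the model needs; the `describe` output will say so in its events.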
Step 5: Automating unit test generation with Llama 4
Now that your Llama 4 API server is deployed, it’s time to put it to work on a practical, high-impact use case: automating unit test generation.
Writing unit tests is one of the most repetitive and time-consuming tasks in software development. While essential for maintaining code quality, it often ends up rushed or overlooked under tight deadlines. In fact, studies show that developers spend up to 30% of their coding time on code maintenance tasks, including writing and maintaining tests. Automating this process helps reduce repetitive work, improve test coverage, and free up valuable time for more critical development tasks.
By using Llama 4’s structured reasoning capabilities, you can automate this tedious process, allowing developers to focus on more complex and impactful tasks. This isn’t just a productivity boost; it’s a practical way to improve code reliability and maintain higher test coverage with less manual effort.
In this step, you’ll learn how to turn your self-hosted AI assistant into a powerful code quality tool that generates clean, well-structured unit tests on demand.
1. Access the API
You can expose the API using one of two methods:
Port-forwarding (if you used ClusterIP service):
```shell
kubectl port-forward svc/llama4-api-service 8000:80
```

Local port `8000` maps to the service's port `80`, which in turn targets the app's port `8000`.
LoadBalancer (if your service is of type LoadBalancer) - simply grab the external IP address from:
```shell
kubectl get svc
```
Once ready, your API will be accessible at http://localhost:8000 or via the LoadBalancer’s IP.
2. Send a prompt request
Using Postman or curl, send a POST request to the /generate endpoint.
Example prompt:
```
POST http://localhost:8000/generate
Content-Type: application/json

{
  "role": "user",
  "content": "Write Python unit tests for a function that connects to a database, retrieves user data by ID, and handles connection errors."
}
```
If everything is configured correctly, Llama 4 will respond with a generated Python unit test based on the prompt.
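The same request as a curl one-liner (assuming the port-forward or LoadBalancer address from the previous step; swap in your own IP if you used a LoadBalancer):

```shell
curl -s -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"role": "user", "content": "Write Python unit tests for a function that retrieves user data by ID."}'
```

The JSON response contains the generated code under the `response` key, matching the endpoint defined in app.py.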
3. Verify the response
You should receive a structured output similar to this:
```python
import unittest
from unittest.mock import patch, MagicMock
from my_module import get_user_by_id  # Assuming your function is named get_user_by_id

class TestGetUserById(unittest.TestCase):
    @patch('my_module.database_connection')
    def test_get_user_success(self, mock_db_conn):
        # Mock the database connection and cursor
        mock_cursor = MagicMock()
        mock_cursor.fetchone.return_value = {'id': 1, 'name': 'John Doe'}
        mock_db_conn.return_value.cursor.return_value = mock_cursor

        user = get_user_by_id(1)

        self.assertEqual(user['id'], 1)
        self.assertEqual(user['name'], 'John Doe')
        mock_cursor.execute.assert_called_once_with(
            "SELECT * FROM users WHERE id = %s", (1,)
        )

    @patch('my_module.database_connection')
    def test_connection_error(self, mock_db_conn):
        # Simulate a connection error
        mock_db_conn.side_effect = Exception("Connection failed")

        with self.assertRaises(Exception) as context:
            get_user_by_id(1)

        self.assertIn("Connection failed", str(context.exception))

if __name__ == '__main__':
    unittest.main()
```
This confirms that your self-hosted AI assistant can understand code snippets and generate useful unit tests automatically.
Feel free to experiment by providing different code examples (e.g., a sorting function, a database connection function) and see how Llama 4 adapts its test generation. This highlights its real-world value for developers aiming to automate tedious coding tasks.
Benefits of hosting Llama 4 on Civo
By deploying your Llama 4 assistant on Civo’s Kubernetes infrastructure, you’re tapping into a platform that’s built to make running AI workloads easier, faster, and more affordable. Here’s why Civo is a great fit:

- Fast provisioning: clusters launch in minutes, so you can go from idea to running model quickly.
- Predictable costs: transparent pricing means no surprise bills as your assistant scales.
- Developer-friendly tooling: a simple dashboard, CLI, and flexible cluster configurations keep operations lightweight.
Key takeaways
In this tutorial, you built your own self-hosted AI assistant by deploying Llama 4 on Civo Kubernetes. You learned how to containerize the application, expose it securely, and interact with the model through a simple API, all without relying on closed services.
Using Civo’s fast, scalable, and developer-friendly infrastructure made the deployment process straightforward and efficient. With flexible cluster setups, quick provisioning, and predictable costs, Civo offers an ideal environment for running open-source AI workloads.
Now that you have the foundation in place, we encourage you to experiment further, whether it's generating more complex applications, building intelligent agents, or extending your assistant to handle new types of tasks. The possibilities are wide open.

Mostafa Ibrahim is a software engineer and technical writer specializing in developer-focused content for SaaS and AI platforms. He currently works as a Software Engineer at GoCardless, contributing to production systems and scalable payment infrastructure.
Alongside his engineering work, Mostafa has written more than 200 technical articles reaching over 500,000 readers. His content covers topics including Kubernetes deployments, AI infrastructure, authentication systems, and retrieval-augmented generation (RAG) architectures.