Large Language Models (LLMs) have changed how we interact with information, offering impressive capabilities to synthesize and generate text from vast amounts of training data. But if you've worked with LLMs, you've also likely encountered their limitations when dealing with specific contexts or proprietary information. While these models excel at general knowledge, they fall short when you need them to reason about your company's internal documentation, specialized datasets, or information that emerged after their training cutoff. Vector databases help solve this problem by storing domain-specific information so that the most relevant pieces can be retrieved and supplied to a large language model as context.

Qdrant is an open-source vector database designed specifically for high-dimensional similarity searches. It excels at quickly finding the most relevant information from your custom dataset, which can then be fed to an LLM as context. This tutorial will walk you through setting up Qdrant on Kubernetes.

Vector Databases & RAG

Before a vector database can turn a custom dataset, say your internal documentation, into context for a large language model, that text first needs to be transformed into vector embeddings. Vector embeddings are numerical representations that capture the semantic meaning of your content in a high-dimensional space.

When we create embeddings, we're essentially translating words, sentences, and documents into points in a mathematical space where similar concepts are positioned near each other. This translation is performed by embedding models (like OpenAI's text-embedding-ada-002 or open-source alternatives like BERT or Sentence Transformers) that have been trained to understand language.
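For example, here is a minimal sketch of generating embeddings with the open-source Sentence Transformers library (the model name is just one common choice for illustration; any embedding model works the same way):

# A minimal embedding sketch, assuming sentence-transformers is installed (pip install sentence-transformers)
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 is a small open-source model that produces 384-dimensional vectors
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["How do I reset my password?", "Steps to recover account access"]
embeddings = model.encode(sentences)

print(embeddings.shape)  # (2, 384): two sentences, each mapped to a 384-dimensional point

Semantically related sentences like these two end up close together in that space, which is exactly the property a vector database exploits.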

How Vector Databases Work

Once your data is transformed into vectors, it needs to be stored efficiently for retrieval. This is where Qdrant and other vector databases come in. They use specialized indexing structures like HNSW (Hierarchical Navigable Small World) graphs that allow for:

  1. Fast approximate nearest neighbor search: Finding the most similar vectors without exhaustively comparing against every vector in the database.
  2. Scalability to millions or billions of vectors: Maintaining performance even as your dataset grows.
  3. Filtering based on metadata: Combining semantic similarity with traditional filtering ("find similar products, but only those in stock and under $50"). A sketch of such a filtered query appears just after this list.
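To make the filtering point concrete, here is a sketch of a filtered similarity search using the Qdrant Python client. The collection name, payload fields, and query vector are all hypothetical placeholders:

from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(host="localhost", port=6333)

# Hypothetical example: find products similar to a query vector,
# but only those that are in stock and cost under $50
results = client.search(
    collection_name="products",       # hypothetical collection
    query_vector=[0.2, 0.8, 0.4],     # hypothetical query embedding
    query_filter=models.Filter(
        must=[
            models.FieldCondition(key="in_stock", match=models.MatchValue(value=True)),
            models.FieldCondition(key="price", range=models.Range(lte=50)),
        ]
    ),
    limit=5,
)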

How Vector Databases Enable Retrieval-Augmented Generation (RAG)

RAG has emerged as one of the most practical applications of vector databases in the LLM ecosystem. The process works like this:

  1. Indexing phase: Your documents are chunked into manageable pieces, transformed into vector embeddings, and stored in a vector database like Qdrant, along with the original text and metadata.
  2. Retrieval phase: When a user asks a question, their query is converted to a vector using the same embedding model, and Qdrant finds the most semantically relevant chunks from your knowledge base.
  3. Generation phase: These retrieved chunks are sent to an LLM (like GPT-4 or Claude) along with the original question as context, allowing the model to generate an answer based on your specific information. The full retrieval-and-generation loop is sketched just after this list.
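Put together, a bare-bones version of this loop looks something like the sketch below. The embed() and call_llm() helpers are placeholders for your embedding model and LLM provider of choice, and the collection name is hypothetical; this outlines the flow rather than a complete implementation:

from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)

def embed(text: str) -> list[float]:
    # Placeholder: call your embedding model here (e.g., an OpenAI or Sentence Transformers model)
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    # Placeholder: call your LLM provider here
    raise NotImplementedError

def answer(question: str) -> str:
    # Retrieval phase: embed the query and fetch the most relevant chunks
    hits = client.search(
        collection_name="docs",        # hypothetical collection built during the indexing phase
        query_vector=embed(question),
        limit=3,
    )
    context = "\n".join(hit.payload["text"] for hit in hits)

    # Generation phase: hand the retrieved context plus the question to the LLM
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)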

Prerequisites

This tutorial assumes some familiarity with Kubernetes. Additionally, you will need the following installed on your machine:

  • kubectl, for interacting with your cluster
  • Helm, for installing the Qdrant chart
  • Python 3, for running the example client code
  • The Civo CLI, if you plan to create a cluster on Civo (optional)

Deploying Qdrant on Kubernetes

Aside: GPU Support

Qdrant performs well on CPU-based instances for many workloads, and we won't be using GPU support in this demonstration. However, Qdrant also offers GPU acceleration for significantly improved vector search performance at scale. This becomes especially important when working with large collections containing millions of vectors or when handling high query throughput.

Civo's predictable GPU pricing model is particularly well-suited for both development and production Qdrant workloads, offering consistent costs without the surprise billing spikes that can occur on other cloud platforms.

For detailed instructions on configuring Qdrant with GPU support, refer to the official Qdrant GPU documentation.

Creating a Kubernetes Cluster (Optional)

If you already have a Kubernetes cluster up and running, feel free to skip this step. The important part is to ensure you are running Kubernetes v1.24+, as the Qdrant Helm chart uses gRPC health probes, which are only available from that version onward.

You can verify your Kubernetes version using:

kubectl version

Output is similar to:

Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.31.0

To create a cluster using the Civo CLI, run the following command:

civo k3s create --create-firewall --nodes 1 -m --save --switch --wait qdrant

This command launches a single-node Kubernetes cluster in your Civo account. The -m flag merges the kubeconfig for the new cluster into your existing kubeconfig, --save writes it to disk, and --switch points your kube-context at the newly created cluster.

Output is similar to:

The cluster Qdrant (dca42473-f079-44a2-8328-5fae315c005b) has been created in 2 min 43 sec

Access your cluster with:

kubectl get node

Note: For production deployments, your node size requirements will vary significantly based on several factors: the dimensionality of your vectors, size of your payload data, and total storage needs. Higher vector dimensions and larger payloads require more memory and compute resources.
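As a rough rule of thumb, Qdrant's capacity planning guidance estimates memory as number_of_vectors × vector_dimension × 4 bytes × 1.5, where the 1.5 factor covers indexing overhead. A quick back-of-the-envelope calculation:

# Back-of-the-envelope memory estimate using Qdrant's capacity planning rule of thumb:
# number_of_vectors * vector_dimension * 4 bytes * 1.5 (indexing overhead)
num_vectors = 1_000_000
dimensions = 768  # a typical transformer embedding size

bytes_needed = num_vectors * dimensions * 4 * 1.5
print(f"~{bytes_needed / 1024**3:.1f} GiB of memory")  # ~4.3 GiB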

Install Qdrant

With a cluster created, the next step is to add the Qdrant Helm repository:

helm repo add qdrant https://qdrant.github.io/qdrant-helm

Update your repository cache so it pulls the latest version of the Qdrant Helm chart:

helm repo update

Output is similar to:

Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "qdrant" chart repository
...Successfully got an update from the "chaos-mesh" chart repository
...Successfully got an update from the "metrics-server" chart repository
...Successfully got an update from the "runwhen-contrib" chart repository
...Successfully got an update from the "portainer" chart repository
...Successfully got an update from the "litmuschaos" chart repository
...Successfully got an update from the "jaegertracing" chart repository
...Successfully got an update from the "jetstack" chart repository

Finally, install the Helm release with the following command:

helm upgrade -i qdrant qdrant/qdrant --namespace vector --create-namespace

Output is similar to:

Release "qdrant" does not exist. Installing it now.
NAME: qdrant
LAST DEPLOYED: Sun Mar  2 11:59:55 2025
NAMESPACE: vector
STATUS: deployed
REVISION: 1
NOTES:
Qdrant v1.13.4 has been deployed successfully.

The full Qdrant documentation is available at https://qdrant.tech/documentation/.

The command above will also create a namespace called vector and deploy Qdrant within it.

To verify your deployment is running correctly, run the following command:

kubectl get all -n vector

Your output should be similar to:

[Screenshot: Installing Qdrant]

Interacting with Qdrant

Now that you have Qdrant running on your Kubernetes cluster, let's set up a local development environment to interact with it. We'll use Python and the Qdrant client SDK to create a collection, upload some vectors, and perform a simple similarity search.

Setting up a local Python environment

First, let's create a Python virtual environment to keep our dependencies isolated.

Create a new virtual environment:

python3 -m venv env

Activate the virtual environment:


# On Linux/macOS
source env/bin/activate
# On Windows
# env\Scripts\activate

Exposing Qdrant

To interact with our Kubernetes-hosted Qdrant instance, we need to set up port forwarding. To do this, run the following commands:

# Get the name of the Qdrant pod
POD_NAME=$(kubectl get pods --namespace vector -l "app.kubernetes.io/name=qdrant,app.kubernetes.io/instance=qdrant" -o jsonpath="{.items[0].metadata.name}")

Port forward the pod:

kubectl --namespace vector port-forward $POD_NAME 6333:6333

Installing the Qdrant client

In a new terminal, with your virtual environment activated, install the Qdrant Python client:

pip install qdrant-client

Output is similar to:

Collecting qdrant-client
  Downloading qdrant_client-1.13.2-py3-none-any.whl.metadata (10 kB)
Collecting grpcio>=1.41.0 (from qdrant-client)
  Downloading grpcio-1.70.0-cp313-cp313-macosx_10_14_universal2.whl.metadata (3.9 kB)
Collecting pydantic>=1.10.8 (from qdrant-client)
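Before writing the full example, you can quickly confirm that the client can reach your port-forwarded instance. This assumes the port forward from the previous step is still running:

# Quick connectivity check against the port-forwarded Qdrant instance
from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)
print(client.get_collections())  # a fresh install should report an empty list of collections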

Creating a simple Qdrant example

Now, let's create a Python script to interact with our Qdrant instance. Create a new file called qdrant_example.py with the following content:

from qdrant_client import QdrantClient
from qdrant_client.http import models
import numpy as np

# Connect to Qdrant
client = QdrantClient(host="localhost", port=6333)

# Create a collection to store our vectors
collection_name = "demo_collection"

# First, check if the collection already exists and delete it if it does
collections = client.get_collections().collections
collection_names = [collection.name for collection in collections]
if collection_name in collection_names:
    client.delete_collection(collection_name)

# Create a new collection with 3-dimensional vectors
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=3,  # Vector size
        distance=models.Distance.COSINE
    )
)

# Create some sample vectors to upload
# These could represent embeddings of: [fast food], [italian food], [dessert]
vectors = [
    np.array([0.9, 0.1, 0.1]),  # Fast food
    np.array([0.1, 0.9, 0.1]),  # Italian food
    np.array([0.1, 0.1, 0.9]),  # Dessert
]

# Add metadata to each vector to help us understand the results
payloads = [
    {"category": "fast food", "name": "burger"},
    {"category": "italian", "name": "pasta"},
    {"category": "dessert", "name": "ice cream"},
]

# Upload vectors with their associated metadata
client.upsert(
    collection_name=collection_name,
    points=models.Batch(
        ids=[1, 2, 3],
        vectors=[v.tolist() for v in vectors],  # convert the numpy arrays to plain lists for serialization
        payloads=payloads
    )
)

print("Collection created and vectors uploaded!")

# Now, let's search for something similar to fast food
search_vector = np.array([0.85, 0.15, 0.05])  # Similar to fast food

# Perform the search
search_results = client.search(
    collection_name=collection_name,
    query_vector=search_vector.tolist(),  # convert the numpy array to a plain list for serialization
    limit=3
)

# Display the search results
print("\nSearch results (sorted by similarity):")
for result in search_results:
    print(f"ID: {result.id}, Score: {result.score:.4f}, Category: {result.payload['category']}, Name: {result.payload['name']}")

In this example, we:

  • Created a new collection for storing vectors using client.create_collection() with a specified vector size and distance metric.
  • Uploaded three 3-dimensional vectors representing different food categories with the client.upsert() method, which allows us to add both vectors and their associated metadata in a single operation.
  • Performed a similarity search using client.search() to find vectors similar to our search query representing "fast food".

This is a basic example with toy vectors, but the same principles apply when working with real embeddings from language models, which typically have hundreds or thousands of dimensions.
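When you are ready to move past toy vectors, the only real changes are the vector size and where the vectors come from. Here is a sketch of the same flow with real embeddings, assuming sentence-transformers is installed; the model choice and document chunks are illustrative:

from qdrant_client import QdrantClient
from qdrant_client.http import models
from sentence_transformers import SentenceTransformer

client = QdrantClient(host="localhost", port=6333)
model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional embeddings

chunks = [
    "Qdrant is an open-source vector database.",
    "HNSW enables fast approximate nearest neighbor search.",
]

# The collection's vector size must match the embedding model's output dimension
client.create_collection(
    collection_name="docs_demo",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
)

client.upsert(
    collection_name="docs_demo",
    points=models.Batch(
        ids=list(range(len(chunks))),
        vectors=model.encode(chunks).tolist(),
        payloads=[{"text": c} for c in chunks],
    ),
)

hits = client.search(
    collection_name="docs_demo",
    query_vector=model.encode("What makes vector search fast?").tolist(),
    limit=1,
)
print(hits[0].payload["text"])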

Execute the original example script by running:

python qdrant_example.py

Your output should be similar to:

[Screenshot: Creating a simple Qdrant example]

Debugging

In case you run into any errors while running this example code, here are some things you can try.

  • Unable to connect to Qdrant: If you run into a connection error, double-check that the port forward from the earlier step is still active while the script runs.
  • Module not found: If Python reports that a module cannot be found, make sure the Qdrant client is installed via pip and that your virtual environment is activated.

Conclusion

Vector databases play a key role in powering many of the AI applications we know and love today. In this post, we walked through deploying Qdrant, a powerful open-source vector database, on Kubernetes. We covered everything from setting up a cluster and installing Qdrant via Helm to creating collections, uploading vectors, and performing similarity searches. If you're looking to explore LLMs or RAG further, here are a couple of ideas:

  • Swap the toy vectors in this example for real embeddings from a model like Sentence Transformers and build a small RAG pipeline over your own documents.
  • Try Qdrant's GPU acceleration for larger collections, using the official Qdrant GPU documentation as a starting point.