21 AI concepts every beginner should know before their first interview
Written by
Technical Writer at Civo
If you’re prepping for your first AI or MLOps interview, the hardest part often isn’t the hands-on element.
For me, it’s the vocabulary. Interviewers sometimes lob single-word concepts at you (“what’s quantization?”) and watch how far you can carry the thread. The questions sound clear-cut, but each one is really a doorway into a bigger topic, and the interviewer is judging how cleanly you walk through it.
In this blog, we will look at 21 concepts that recur at the beginner level. For each one, we’ll give you the question, a line on what the interviewer is really testing, and a starting point you can use to go read up on it.
This list is not meant to be exhaustive. Treat it as a good base from which to branch out, do further research, and hopefully discover some new concepts.
If you want to see these concepts in action on real infrastructure, check out Civo's machine learning tutorial library.
1. What is an LLM?
Typically, a warm-up question.
A generic answer would be something like: “They are applications like ChatGPT.”
A stronger answer covers three layers: LLM stands for large language model, a deep neural network (specifically a transformer) trained on vast amounts of text to predict the next token in a sequence.
ChatGPT is one product wrapped around such a model; GPT is the model family, and the LLM is the underlying neural network. Being able to separate model, family, and product shows you can think about AI as a system, not just an app.
2. What is GPT?
This is testing whether you know the dominant LLM architecture family by name. GPT (Generative Pre-trained Transformer) is OpenAI’s flagship LLM line, but the term has become the generic shorthand for the family of decoder-only transformer LLMs.
Mentioning that Claude, Gemini, and Llama are siblings (also generally understood to be decoder-only transformers) signals you know GPT is one branch of a wider family, not a synonym for “AI.”
3. Why is it called a “Generative Pre-trained Transformer”?
Typically, a follow-up to the previous question. Each word in the name is a real concept.
Generative means it produces new content (text) rather than classifying existing content. Pre-trained means it was trained on a huge general dataset first, before being fine-tuned for specific behavior.
Transformer is the underlying neural network architecture, introduced in the 2017 “Attention Is All You Need” paper, built around the attention mechanism. During an interview, you will need to be able to define each word independently.
4. What are model weights?
This tests whether you understand what a "model" is, concretely — a giant collection of numbers.
Model weights are the learned parameters of the neural network: the actual values that get tuned during training. A modern LLM has billions of them. They live on disk, get loaded into GPU memory at inference time, and they are the model. Everything else is just runtime plumbing.
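To make "a giant collection of numbers" concrete, here is a back-of-the-envelope sketch of how much memory the weights alone occupy. The 7-billion-parameter model is a hypothetical example, and the function name is made up for illustration; the bytes-per-weight figures are simply the storage sizes of each number format.

```python
# Bytes needed to store one weight in a few common number formats.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e9


# Hypothetical 7-billion-parameter model (an assumption for illustration).
for dtype in BYTES_PER_WEIGHT:
    print(f"{dtype}: {weight_memory_gb(7_000_000_000, dtype):.1f} GB")
```

This counts only the weights. Real deployments also need memory for activations and caches, so treat the result as a lower bound.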
5. What is a token?
Tokens are the unit of work for an LLM, and most pricing, latency, and context-window decisions live at the token level. A token is a small chunk of text produced by a tokenizer. LLMs read tokens, predict tokens, and bill per token. If you can't reason about tokens, you can't reason about cost or limits.
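A rough sketch of reasoning at the token level follows. The tokenizer here is a deliberately naive stand-in (real tokenizers split text into subword pieces, so real counts will differ), and the price passed to `estimate_cost` is a hypothetical figure, not any provider's actual rate.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    """Split text into words and punctuation marks.

    This is a toy: production tokenizers produce subword pieces instead.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def estimate_cost(num_tokens: int, price_per_million: float) -> float:
    """Cost of num_tokens at a hypothetical price per million tokens."""
    return num_tokens / 1_000_000 * price_per_million


tokens = toy_tokenize("What's quantization?")
print(tokens, len(tokens))
print(estimate_cost(len(tokens), price_per_million=3.0))
```

The useful habit is the shape of the calculation: count tokens first, then multiply by a per-token rate or compare against a limit.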
6. What is attention in an LLM?
Typically, the interviewer is testing whether you understand the core mechanism that made modern LLMs possible.
Attention is how the model decides, for each token it's generating, which of the previous tokens are most relevant to look at. It's what lets a transformer "remember" context across long inputs. You don't need to derive the math, but you should be able to say it's a weighted lookup over previous tokens.
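If it helps to see the "weighted lookup" spelled out, here is a minimal sketch of scaled dot-product attention for a single query, in plain Python. The vectors are made-up two-dimensional examples; real models use learned projections, many heads, and far larger dimensions.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    query: list[float], keys: list[list[float]], values: list[list[float]]
) -> tuple[list[float], list[float]]:
    """Scaled dot-product attention for one query over previous tokens."""
    scale = math.sqrt(len(query))
    # One relevance score per previous token: how well its key matches the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is a weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Made-up vectors for three previous tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attend([1.0, 0.0], keys, values)
print(weights, output)
```

Tokens whose keys align with the query get larger weights, so their values dominate the output.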
7. What is inference?
This tests whether you can distinguish between using a model and training one.
Inference is the act of running a trained model to produce an output. When you ask ChatGPT a question, or when an AI Overview summarizes a search result at the top of the page, that's inference. Every API call to an LLM, every autocomplete, every classifier in production: all of it is inference.
For a deep dive into running inference at scale, see Civo's guide to deploying LLMs on Kubernetes.
8. What’s the difference between training and inference?
Training is the expensive, slow, GPU-heavy, mostly one-time process of teaching the model. Inference is the cheaper-per-call, latency-sensitive, always-on process of using it.
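The contrast can be sketched with a toy model that has a single weight. This is an illustration under heavy simplification (made-up data, plain gradient descent, no GPU), but the shape is the same: training is a loop of many updates, inference is one forward pass.

```python
def predict(weight: float, x: float) -> float:
    """Inference: one cheap forward pass with a fixed weight."""
    return weight * x


def train(data: list[tuple[float, float]], steps: int = 200, lr: float = 0.01) -> float:
    """Training: repeatedly nudge the weight to reduce squared error."""
    weight = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to the weight.
        grad = sum(2 * (predict(weight, x) - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up points on y = 2x
learned = train(data)          # slow part: many passes over the data
print(predict(learned, 10.0))  # fast part: a single multiplication
```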
9. What’s the difference between deductive and inductive inference?
Deductive inference goes from general rules to specific conclusions (e.g., if all humans are mortal and Socrates is human, then Socrates is mortal). Inductive inference goes the other way, from specific observations to general patterns (every swan I've seen is white, so swans are probably white). LLMs lean heavily on inductive reasoning.
10. What is an inference engine?
An inference engine is the runtime that actually executes a model's forward pass efficiently. Examples: vLLM, TGI, llama.cpp, TensorRT-LLM, ONNX Runtime. They handle batching, KV-cache management, quantization, GPU memory, and streaming. Knowing one by name signals you've moved past notebooks.
Civo's LLM boilerplate tutorial walks through deploying Llama on a GPU Kubernetes cluster using Ollama as the inference server. This is a great way to see an inference engine in a real deployment: Kubernetes meets Llama 3.2: How to deploy AI models on GPU clusters.
11. What is KV-cache?
A natural follow-up if you mention inference engines.
When an LLM generates text, it produces tokens one at a time. For each new token, the attention mechanism needs to look at every previous token in the sequence. Without a cache, the model would redo that work from scratch on every step.
The KV-cache stores the key and value tensors computed for each previous token, so the model only needs to do new work for the current one. It is the single biggest reason generating long outputs is feasible. The trade-off is GPU memory. The cache grows with context length, and on long-context requests it is usually the first thing to exhaust your GPU memory. Inference engines like vLLM exist largely to manage this cache cleverly (PagedAttention is the well-known example).
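To see why the cache eats GPU memory, here is a sizing sketch. It assumes standard attention that stores one key and one value vector per token, per layer, per key-value head; the model shape below is a hypothetical configuration chosen for round numbers, not a statement about any specific model.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes assumes FP16 storage
) -> int:
    """Estimate KV-cache size: keys plus values for every layer and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 1024**3:.1f} GiB for one 4096-token sequence")
```

Because the estimate is linear in `seq_len` and `batch_size`, doubling either one doubles the cache, which is why long contexts and large batches compete for the same memory.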
12. What is quantization?
Quantization compresses a model by storing its weights in lower-precision number formats (FP16 to INT8 to INT4, and so on). The model gets smaller and faster to serve, at a small cost to quality. It is the single biggest lever for running large models on cheap hardware. Mentioning specific formats like GGUF, AWQ, and GPTQ earns extra credit.
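A minimal sketch of the idea, assuming the simplest scheme (symmetric INT8 with one shared scale for the whole list). Formats like GGUF, AWQ, and GPTQ are considerably more sophisticated; the weights below are made-up values.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.031, 1.0]  # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
print(quantized)
print([abs(w - r) for w, r in zip(weights, restored)])  # rounding error per weight
```

Each weight now fits in one byte instead of two or four, and the rounding error is the "small cost to quality" in miniature.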
Civo's DeepSeek deployment guide covers running quantized models on GPU Kubernetes clusters in practice: How to deploy DeepSeek-R1 on Civo GPUs.
13. What is managed inference, and why would you use it?
Managed inference services (Anthropic API, OpenAI API, Bedrock, Together, Fireworks, and so on) run models for you behind a REST API. You trade a per-token fee for not having to provision GPUs, manage drivers, or babysit an inference engine.
Civo offers GPU-accelerated managed Kubernetes as an alternative, useful when you want more control over your inference stack without the overhead of a hyperscaler. Check out more about Civo AI here.
14. What is a context window?
The context window is the maximum number of tokens an LLM can attend to in a single request, counting the prompt and the response combined. Modern frontier models reach hundreds of thousands, even millions, of tokens. If you have ever had to chunk a long document so it would fit in an LLM call, you have bumped into the context window.
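Chunking can be sketched in a few lines. This version assumes you already have a list of tokens and a per-chunk budget; the function name is hypothetical, and splitting on whitespace stands in for a real tokenizer.

```python
def chunk_tokens(tokens: list[str], max_tokens: int) -> list[list[str]]:
    """Split a token list into consecutive chunks of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


document = "a long document that will not fit into one small context window"
tokens = document.split()  # whitespace split as a stand-in for a tokenizer
for chunk in chunk_tokens(tokens, max_tokens=5):
    print(len(chunk), chunk)
```

In practice you would leave room in each budget for the instructions and the model's response, since they share the same window.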
15. What is an embedding?
An embedding is a vector (a list of numbers) that represents the meaning of a piece of text in a high-dimensional space. Texts that mean similar things end up near each other in that space. Embeddings power semantic search, deduplication, clustering, and the retrieval step of RAG.
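"Near each other" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings come from an embedding model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings; a real system would get these from a model.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))
print(cosine_similarity(embeddings["cat"], embeddings["invoice"]))
```

The first score comes out much higher than the second, which is the numeric version of "cat and kitten mean similar things."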
16. What is RAG (Retrieval Augmented Generation)?
The most common production pattern for using LLMs on your own data, and a near-guaranteed question.
RAG works in two steps. First, retrieve relevant chunks from a vector database using embeddings. Second, stuff those chunks into the LLM's prompt as context, then ask the LLM the user's actual question. RAG grounds the model in data it was not trained on, without you having to retrain anything.
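The two steps can be sketched end to end without any external services. Everything here is a stand-in: word counts play the role of embeddings, a Python list plays the role of the vector database, and the chunks and prompt wording are made up. The final call to an LLM is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Step 1: return the chunks most similar to the question."""
    query = embed(question)
    return sorted(chunks, key=lambda c: similarity(query, embed(c)), reverse=True)[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Step 2: put the retrieved chunks into the prompt as context."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


chunks = [
    "Refunds are processed within 14 days.",
    "Our office is closed on public holidays.",
    "Support is available by email.",
]
question = "How long do refunds take?"
print(build_prompt(question, retrieve(question, chunks)))
```

Swapping the toy pieces for a real embedding model and a vector database changes the quality of retrieval, not the overall flow.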
Civo has hands-on tutorials for building a RAG system on Kubernetes, including deploying a vector database (Qdrant) and wiring it to an LLM:
- Running Qdrant on Kubernetes using Civo
- Building a RAG system with Gemini for financial forecasting on Civo Kubernetes
17. What is fine-tuning?
Fine-tuning is often discussed alongside RAG, and the interviewer wants to see that you know when to reach for each.
Fine-tuning is taking a pre-trained model and continuing its training on your own data so the weights themselves shift toward your domain. It is better than RAG for teaching style or behavior, worse for teaching specific facts that change often. The two approaches are frequently combined in production. Modern fine-tuning often uses LoRA, a technique that only trains a small adapter on top of the base model instead of touching every weight.
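The appeal of LoRA shows up in simple arithmetic. For one weight matrix, full fine-tuning updates every entry, while LoRA trains two thin matrices whose product has the same shape. The 4096 by 4096 layer and rank 8 below are hypothetical values chosen for illustration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable values when updating the whole weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values in a LoRA adapter: one (d_out x rank) and one (rank x d_in) matrix."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank.
full = full_params(4096, 4096)
adapter = lora_params(4096, 4096, rank=8)
print(full, adapter, full // adapter)
```

The base weights stay frozen, so the adapter is all you need to train and store per task.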
18. What does the "temperature" parameter do?
Temperature controls the randomness of the sampling step. At temperature 0, the model always picks the single most likely next token (deterministic, dry, often repetitive). Higher temperatures spread the probability over more candidates (more creative, more risk of going off the rails). It is the single most common knob developers tune.
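Under the hood, temperature is usually applied by dividing the model's raw scores (logits) by the temperature before turning them into probabilities. The sketch below assumes that convention; the logits are made-up numbers, and temperature 0 is handled as a separate greedy pick because you cannot divide by zero.

```python
import math


def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("use greedy_pick for temperature 0")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits: list[float]) -> int:
    """Temperature 0: always choose the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
print(apply_temperature(logits, 0.5))  # sharper: top token dominates
print(apply_temperature(logits, 2.0))  # flatter: more spread across candidates
print(greedy_pick(logits))
```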
19. What is hallucination, and why does it happen?
A hallucination is when an LLM produces output that is fluent and plausible but factually wrong. It happens because the model is fundamentally a next-token predictor, not a fact lookup. When it does not "know" something, it interpolates from patterns. The mitigation playbook (RAG, citations, tool use, lower temperature, evaluation harnesses) is what most production teams are actually building.
20. What is function calling (or tool use)?
Function calling lets the LLM, instead of just generating text, emit a structured request to call a function or API you have defined (search the web, query a database, send an email, and so on). Your code runs the function, returns the result, and the LLM uses that result to keep going. It is the bridge between a language model and real action.
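The "your code runs the function" half of that loop can be sketched as a small dispatcher. The JSON shape, the `get_weather` tool, and the registry below are all assumptions made for illustration; each provider defines its own request format, so check their documentation for the real one.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool. A real one would call a weather API."""
    return f"Sunny in {city}"


# Registry of functions the model is allowed to request.
TOOLS = {"get_weather": get_weather}


def run_tool_call(raw: str) -> str:
    """Parse a structured request from the model and run the matching function."""
    call = json.loads(raw)
    func = TOOLS[call["name"]]  # raises KeyError for tools you never registered
    return func(**call["arguments"])


# Pretend the model emitted this instead of plain text.
model_output = '{"name": "get_weather", "arguments": {"city": "London"}}'
print(run_tool_call(model_output))
```

The result string would then be sent back to the model so it can continue its answer. Keeping an explicit registry means the model can only trigger functions you chose to expose.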
21. What's the difference between an LLM and an AI agent?
An LLM by itself is stateless: text in, text out, one-shot. An agent is a system built around an LLM that loops, holds memory, uses tools (function calling), and pursues goals over multiple steps. Claude Code, Cursor, and most of the "AI assistants" you will meet in 2026 are agents, not bare LLMs.
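The difference is easiest to see as a loop. In this sketch the model is replaced by a scripted stand-in (`fake_llm`), and the `CALL`/`FINAL`/`TOOL_RESULT` message format is invented for illustration; real agents use an actual model and a provider-specific tool format, but the loop, the memory, and the step limit look much the same.

```python
def fake_llm(history: list[str]) -> str:
    """Scripted stand-in for a model: request a tool, then answer with its result."""
    last = history[-1]
    if last.startswith("TOOL_RESULT"):
        return f"FINAL The sum is {last.split()[1]}"
    return "CALL add 2 3"


def add(a: int, b: int) -> int:
    """The only tool this toy agent has."""
    return a + b


def run_agent(goal: str, max_steps: int = 5) -> str:
    """Loop: ask the model, run any requested tool, feed the result back."""
    history = [f"GOAL {goal}"]  # memory carried across steps
    for _ in range(max_steps):
        reply = fake_llm(history)
        history.append(reply)
        if reply.startswith("FINAL"):
            return reply.removeprefix("FINAL").strip()
        _, name, *args = reply.split()
        if name == "add":
            result = str(add(*(int(arg) for arg in args)))
        else:
            result = "unknown_tool"
        history.append(f"TOOL_RESULT {result}")
    return "stopped: step limit reached"


print(run_agent("Add 2 and 3"))
```

A single call to `fake_llm` is the bare "text in, text out" model; `run_agent` is the system wrapped around it.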
AI in our cloud or yours
Civo AI is a full-stack AI platform that lets you build, train, deploy, and scale AI workloads using the latest NVIDIA GPUs, in our cloud or your own. It includes GPU Compute, GPU Kubernetes, relaxAI (our privacy-first AI assistant), and private cloud options, all with transparent pricing and no hyperscaler headaches.
Summary
Artificial intelligence is a growing field. The concepts explained in this blog underpin many of the current trends in AI and can serve as a good base for learning or discovering new topics, even if you're not preparing for an interview.
Ready to put these concepts into practice? Explore Civo's AI infrastructure and tutorial library to start building.

Jubril Oyetunji is a DevOps engineer and technical writer with a strong focus on cloud-native technologies and open-source tools. His work centers on creating practical tutorials that help developers better understand platforms such as Kubernetes, NGINX, Rust, and Go.
As a contract technical writer, Jubril authored an extensive library of technical guides covering cloud-native infrastructure and modern development workflows. Many of his tutorials achieved strong search rankings, helping developers around the world learn and adopt emerging technologies.