Today's large language models are getting smarter. They can chat, write articles, and even generate images. But you might have noticed something odd: their "intelligence" sometimes resembles a student cramming for exams, reading the textbook without truly understanding it. Their answers sound confident but aren't always correct, and they struggle with memory: they might forget what they just said, or lose the thread of the conversation within a single session.
These quirks are tied to how large language models "learn" knowledge. Do they memorize formulas, or do they truly internalize concepts? Microsoft's research team is working on a new technology called KBLaM (Knowledge Base-Augmented Language Model) that aims to shift AI from rote memorization toward genuine understanding.
In recent years, the most common and practical approach to addressing language models' knowledge gaps has been using RAG, or Retrieval-Augmented Generation.
The idea behind RAG is quite pragmatic: a model's training data can't cover all knowledge, especially rapidly changing information, internal company documents, or user notes that aren't part of the training corpus. So, how do we fill these gaps? The answer is simple: look the information up.
RAG equips AI with a mini-Google. When you input a question, it first searches a pre-established knowledge base (like document systems, PDFs, web pages) for relevant passages, then feeds this data along with your question into the language model, which "reads the data before answering."
Hold on, let me dig through and find what you said yesterday. (Source: Pixar)
This setup's advantage is that it doesn't require retraining the model, allows real-time data updates, and can connect to private databases, offering flexibility for users and businesses. That's why most current LLM applications, including internal company knowledge Q&A, customer service bots, and API-based assistants, use RAG or similar retrieval-augmented mechanisms.
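To make the pattern concrete, here is a minimal sketch of the retrieve-then-generate loop described above. It assumes a toy word-overlap retriever; real systems use embedding search, and the resulting prompt would be sent to whatever model API you use. The names here (`retrieve`, `build_prompt`, `knowledge_base`) are illustrative, not any product's actual interface.

```python
# Minimal RAG sketch: find the most relevant passages for a question,
# then prepend them to the prompt so the model "reads before answering".
# The retriever below scores passages by naive word overlap; production
# systems use vector embeddings instead.

def retrieve(question: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Rank passages by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(passages,
                    key=lambda p: len(q_words & set(p.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Context first, question last: the model answers from the passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

knowledge_base = [
    "The refund window for online orders is 30 days.",
    "Support is available on weekdays from 9 am to 6 pm.",
    "Gift cards cannot be exchanged for cash.",
]

question = "How long do I have to request a refund?"
prompt = build_prompt(question, retrieve(question, knowledge_base))
print(prompt)  # this string is what gets sent to the language model
```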
Examples include OpenAI's custom GPTs and Google's NotebookLM, which let users upload notes and documents for the AI to answer questions and write summaries, citing its sources. Additionally, products like Microsoft Copilot, Notion AI, Perplexity, Humata, and ChatPDF use RAG to integrate document retrieval and answer generation, making them practical and scalable AI tools.
However, RAG isn't without drawbacks. When you input too much data, the model's processing speed slows down, mainly due to the Transformer architecture used in language models. Transformers use a "self-attention mechanism," meaning each word (token) must be compared with all others to establish semantic relationships. While this helps the model understand context, it also means that as input length increases, the processing load grows quadratically. For example, with 1,000 tokens the model must compute roughly a million pairwise interactions, straining memory and computation speed.
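You can see this quadratic blow-up directly in the shape of the attention score matrix. The sketch below, assuming a single attention head with random values, is only meant to show where the million comparisons come from.

```python
# Why self-attention cost grows quadratically: every token's query is
# compared against every token's key, so the score matrix has n x n
# entries. At n = 1,000 that is already a million pairwise comparisons,
# per attention head, per layer.
import numpy as np

n, d = 1_000, 64                  # sequence length, head dimension
Q = np.random.randn(n, d)         # one query vector per token
K = np.random.randn(n, d)         # one key vector per token

scores = Q @ K.T                  # shape (n, n): all-pairs comparisons
print(scores.shape, scores.size)  # (1000, 1000) -> 1,000,000 entries
```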
Moreover, language models don't truly "understand" the data; they predict based on language patterns. This means they might find the most likely response from word combinations rather than reasoning from knowledge or logic. When data is vague or retrieval is imprecise, hallucinations are more likely to occur.
We often hope AI can be more than a personal assistant; some even expect it to be a companion or friend, remembering your preferences and past instructions. However, this memory is still limited and not always stable. You might notice that when conversations get lengthy, the AI "forgets" earlier details and needs reminders to "recall" them. This is because memory in current language models still relies on relatively experimental add-on features rather than a stable, lasting mechanism.
Even humans forget things all the time. But we probably won't forget that "dinner is eaten at night."
To address this, large language models like ChatGPT and Gemini are gradually developing "memory functions" that can remember your name, occupation, and preferences for tone or format. Users can also view, disable, or delete these memories at any time.
These features aim to create a continuous and personalized "impression" of you, allowing each conversation to better meet your needs and context—a form of "remembering."
However, remembering isn't the same as understanding. Even if AI can remember your preferences and identity, it often just "notes" or "references" the information you provide without truly grasping its meaning or logical structure. This is the challenge KBLaM addresses, aiming to help AI genuinely "absorb" the knowledge you provide and make logical, evidence-based reasoning and applications in its responses.
KBLaM (Knowledge Base-Augmented Language Model) is a technology that "builds knowledge" into language models. Its brilliance lies in integrating knowledge into the model's architecture without altering the original language model or retraining it, allowing AI to naturally reference it in responses.
The key difference from traditional RAG is that RAG retrieves data, while KBLaM understands it. RAG pulls relevant content from a knowledge base and inserts it into the model's context; the model then generates responses based on language patterns without truly "absorbing" the knowledge. KBLaM, however, transforms knowledge into vector forms the model can read, using a special attention design that lets the model genuinely reference this data during computation, making the knowledge an internal part of its reasoning.
So, how does it achieve this? Let's break down its operation:
The first step in KBLaM is converting human-readable knowledge (like text descriptions) into structured, machine-processable forms. Specifically, it encodes knowledge as (subject, relation, object) triples, facts such as "the sun is a star" or "water is a liquid," and converts each one into a key-value vector pair.
These vectors aren't just data compression; rather than having AI "refer to a passage," it's more like "evoking specific concepts." This is the main difference from RAG: RAG provides raw paragraphs or sentences, and the model can only infer meaning from textual clues. KBLaM offers semantic structure entry points, allowing the model to perform more accurate semantic matching and logical applications with lower computational burden, making it closer to "understanding."
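A rough sketch of this encoding step might look like the following. In the real system the embeddings come from a pretrained sentence encoder and the two projection matrices are learned adapters; here both are random placeholders and `embed` is a hypothetical stand-in, so treat this as an illustration of the key-value idea rather than KBLaM's actual pipeline.

```python
# Sketch of turning triples into key-value vector pairs: the key encodes
# what the fact is about (subject + relation), the value encodes its
# content. Both the embedder and the projections are placeholders.
import hashlib
import numpy as np

d_embed, d_model = 384, 64
rng = np.random.default_rng(0)

def embed(text: str) -> np.ndarray:
    """Placeholder for a pretrained sentence encoder (stable pseudo-embedding)."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(d_embed)

W_key = rng.standard_normal((d_embed, d_model))    # learned adapter in the real system
W_value = rng.standard_normal((d_embed, d_model))  # learned adapter in the real system

triples = [("the sun", "is a", "star"),
           ("water", "is a", "liquid")]

kb_keys, kb_values = [], []
for subject, relation, obj in triples:
    kb_keys.append(embed(f"{subject} {relation}") @ W_key)   # "what is this about?"
    kb_values.append(embed(obj) @ W_value)                   # "what does it say?"

kb_keys, kb_values = np.stack(kb_keys), np.stack(kb_values)
print(kb_keys.shape, kb_values.shape)  # (2, 64) each: one pair per fact
```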
This special attention design addresses the heavy computational cost of traditional Transformers. Under RAG, the self-attention mechanism treats the user's question and the retrieved passages from the knowledge base as one combined input, requiring the model to perform full attention calculations over the entire content. This means not only do the question and the data cross-reference each other, but the retrieved passages also cross-reference one another.
This happens because language models don't distinguish between questions and data, treating all inputs equally. This not only significantly increases computational load but can also cause unrelated knowledge passages to interfere with each other, reducing response accuracy.
KBLaM's "Rectangular Attention" solves this issue. In KBLaM, knowledge is no longer a large block of text for the model to "read all at once" but is organized into a format the model can quickly query. When you input a question, the model only references the corresponding knowledge, and these knowledge pieces don't interfere with each other.
In other words, it's a "one-way reference" design that reduces attention computation over the knowledge base from quadratic to linear. Even with tens of thousands of knowledge entries, the model can maintain efficient processing speed, achieving truly scalable knowledge application.
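The sketch below illustrates the one-way shape of this computation for a single attention head: only the prompt's tokens issue queries, attending over the precomputed knowledge vectors plus the prompt itself, so the score matrix is rectangular and grows linearly with the number of facts. This is an illustration of the mechanism, not KBLaM's exact implementation.

```python
# "Rectangular attention" sketch: the score matrix is
# n_prompt x (m + n_prompt) instead of (m + n_prompt)^2, so cost is
# linear in the number of facts m, and facts never attend to each other.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n_prompt, m, d = 16, 10_000, 64   # prompt length, KB size, head dimension
rng = np.random.default_rng(1)

Q = rng.standard_normal((n_prompt, d))       # queries: prompt tokens only
K_prompt = rng.standard_normal((n_prompt, d))
V_prompt = rng.standard_normal((n_prompt, d))
K_kb = rng.standard_normal((m, d))           # precomputed knowledge keys
V_kb = rng.standard_normal((m, d))           # precomputed knowledge values

K = np.concatenate([K_kb, K_prompt])         # (m + n_prompt, d)
V = np.concatenate([V_kb, V_prompt])

scores = Q @ K.T / np.sqrt(d)                # rectangular: (16, 10016)
out = softmax(scores) @ V                    # each query reads the KB one-way
print(scores.shape)                          # (16, 10016), linear in m
```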
Additionally, KBLaM is a "plug-and-play" knowledge expansion module that doesn't require modifying model weights or retraining, allowing it to work with existing models. It supports various open-source models like LLaMA 3 and Phi-3 and may integrate into more language model platforms in the future.
In summary, KBLaM's design allows language models to handle large-scale knowledge bases without slowing down computation speed, avoiding the efficiency bottleneck of traditional self-attention mechanisms where each word must cross-reference others. It also no longer just pastes knowledge into context but encodes it into vector structures the model can understand, allowing AI to use this data as if it were "built-in logic" when responding.
More importantly, KBLaM increases response transparency and credibility because the model can more clearly indicate which content comes from knowledge sources and choose to decline to answer when data is lacking rather than guessing. It makes AI more like a knowledgeable worker who has read and understood the material, rather than a student who memorizes without comprehension.
In AI's development journey, memory and understanding are not opposites but complementary. A truly mature AI should have the ability to "remember your needs" and "understand knowledge," knowing both you and how the world works. When AI can participate in thinking and decision-making, knowing what it "knows," we might finally see that smart and reliable AI partner become a reality, not just a figment of our imagination.