
Amazon proposes a new AI benchmark to measure RAG



Outline of Amazon’s recommended benchmarking process for deploying RAG for generative AI.

Amazon AWS

This year was supposed to be the year generative artificial intelligence (GenAI) takes off in the enterprise, according to many observers. One of the ways that could happen is through retrieval-augmented generation (RAG), a technique in which a large AI language model is connected to a database containing domain-specific content, such as company files.
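At its core, the RAG loop is simply “retrieve, then generate.” The sketch below is a minimal illustration of that idea, not the AWS team’s implementation: the keyword-overlap retriever and the `generate` placeholder are stand-ins for a real search index and a real LLM call.

```python
# Minimal RAG sketch: find the most relevant document for a question,
# then pass it to an LLM as context. The keyword scorer and the
# `generate` stub are illustrative placeholders only.

def retrieve(question: str, documents: list[str]) -> str:
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(documents, key=lambda d: len(q_words & set(d.lower().split())))

def generate(prompt: str) -> str:
    """Placeholder for a call to an LLM (API or local model)."""
    return f"[LLM answer conditioned on a prompt of {len(prompt)} characters]"

def rag_answer(question: str, documents: list[str]) -> str:
    context = retrieve(question, documents)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)

docs = [
    "To restart the build agent, run `sudo systemctl restart agent`.",
    "Quarterly filings must be submitted to the SEC within 40 days.",
]
print(rag_answer("How do I restart the build agent?", docs))
```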

However, RAG is an emerging technology with many drawbacks.

Also: Make Way for RAG: How Gen AI’s Balance of Power is Changing

For that reason, researchers at Amazon’s AWS propose in a new paper that a set of benchmarks be established specifically to test how well RAG can answer questions about domain-specific content.

“Our method is an automated, cost-effective, interpretable, and robust strategy for selecting optimal components for RAG systems,” lead author Gauthier Guinet and his team write in the paper, “Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation,” posted on the arXiv preprint server.

The paper is being presented at the 41st International Conference on Machine Learning, an AI conference taking place from July 21 to 27 in Vienna.

The fundamental problem, Guinet and team explain, is that while there are many benchmarks for comparing the capabilities of different large language models (LLMs) across a variety of tasks, in the RAG field in particular, there is no “standard” approach for measuring “comprehensive task-specific evaluations” of many important qualities, including “truthfulness” and “practicality.”

The authors believe that their automated approach creates a certain uniformity: “By automatically generating multiple-choice tests that match the corpus associated with each task, our approach enables standardized, scalable, and interpretable scoring of different RAG systems.”

To accomplish that task, the authors create question-answer pairs by drawing on material from four domains: AWS troubleshooting documentation on the topic of DevOps; abstracts of scientific articles from the preprint server arXiv; questions on StackExchange; and filings with the U.S. Securities and Exchange Commission, the main regulator of publicly listed companies.
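The exam-generation step can be pictured as prompting an LLM once per corpus document and asking for a question, a set of answer choices, and the index of the correct choice. The sketch below is an illustrative assumption, not the paper’s pipeline: the prompt wording, the `call_llm` helper, and the JSON format are hypothetical.

```python
# Illustrative sketch of automated exam generation: for each corpus document,
# ask an LLM to produce one multiple-choice question with distractors.
# The prompt text and `call_llm` helper are assumptions for illustration.
import json

def call_llm(prompt: str) -> str:
    """Stand-in for any chat-completion API; should return a JSON string."""
    raise NotImplementedError("plug in your preferred LLM client here")

EXAM_PROMPT = """Read the document below and write one multiple-choice question
about it. Return JSON with keys: "question", "choices" (4 strings), and
"answer" (index of the correct choice).

Document:
{doc}
"""

def generate_exam(documents: list[str]) -> list[dict]:
    exam = []
    for doc in documents:
        raw = call_llm(EXAM_PROMPT.format(doc=doc))
        item = json.loads(raw)
        item["source_doc"] = doc  # keep the ground-truth document for the "oracle" setting
        exam.append(item)
    return exam
```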

Also: Connecting generative AI to medical data has improved its usefulness for doctors

They then gave LLMs the multiple-choice tests to assess how close each LLM came to the correct answers. They had two families of open-source LLMs take these tests: Mistral, from the French company of the same name, and Meta Platforms’ Llama.

They tested the models in three scenarios. The first was a “closed-book” scenario, in which the LLM had no access to the RAG data and had to rely on its pre-trained neural “parameters”—or “weights”—to come up with an answer. The second scenario was a so-called “Oracle” form of RAG, in which the LLM was given access to the exact document used to generate a question, the ground truth, as it were.

The third form is “classical retrieval,” in which the model must search the entire data set to find the context for the question, using any of several different algorithms. The authors test several common retrieval approaches, including MultiQA, introduced in 2019 by scholars at Tel Aviv University and the Allen Institute for Artificial Intelligence, and BM25, an older but very popular approach to information retrieval.
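The three settings amount to three ways of supplying context to the same multiple-choice prompt. The sketch below is illustrative only: `answer_mcq` is a hypothetical placeholder for prompting an LLM, the exam items reuse the `question`/`choices`/`source_doc` fields from the sketch above, and BM25 scoring is shown via the open-source rank_bm25 package rather than the authors’ own retrieval stack.

```python
# Sketch of the three evaluation settings: closed-book, oracle, and
# classical retrieval (here scored with BM25 via the rank_bm25 package).
from rank_bm25 import BM25Okapi

def answer_mcq(question: str, choices: list[str], context: str = "") -> int:
    """Placeholder: prompt an LLM and return the index of the chosen option."""
    raise NotImplementedError

def closed_book(item: dict) -> int:
    # No retrieved context: the model relies only on its pretrained weights.
    return answer_mcq(item["question"], item["choices"])

def oracle(item: dict) -> int:
    # Ground-truth context: the exact document the question was generated from.
    return answer_mcq(item["question"], item["choices"], context=item["source_doc"])

def classical_retrieval(item: dict, corpus: list[str]) -> int:
    # Realistic setting: search the whole corpus for relevant context.
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    top_doc = bm25.get_top_n(item["question"].split(), corpus, n=1)[0]
    return answer_mcq(item["question"], item["choices"], context=top_doc)
```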

Also: Microsoft Azure gets ‘Model as a Service’, advanced RAG services for AI built for the enterprise

Then they ran the tests and examined the results, which are complex enough to fill numerous charts and tables on the relative strengths and weaknesses of the LLMs and the various RAG approaches. The authors even conducted a meta-analysis of their exam questions, to evaluate their usefulness, based on a well-known taxonomy of knowledge in the field of education, Bloom’s Taxonomy.

More important than the data points from the tests are the general findings that are likely to hold true for RAG — regardless of the implementation details.

One general finding is that a better RAG algorithm can improve an LLM’s performance more than simply making the LLM larger.

“Choosing the right retrieval method can often yield performance improvements that are superior to simply choosing larger LLMs,” they write.

That’s important given concerns about GenAI’s growing resource intensity. If you can do more with fewer resources, that’s a valuable avenue to explore. It also suggests that the conventional wisdom in AI at the moment, that scaling is always best, is not entirely true when it comes to solving specific problems.

Also: AI is a new attack vector that endangers businesses, according to CrowdStrike’s CTO

Equally important, the authors found that when the RAG algorithm does not work well, it can degrade an LLM’s performance compared with the plain-vanilla, closed-book version without RAG.

“Poorly aligned retrieval tool components can lead to worse accuracy than no retrieval at all,” Guinet and team explain.
