In the rapidly evolving world of AI, the Thai Retrieval Augmented Generation (TRAG) Benchmark stands out as a pioneering tool designed to evaluate the performance of large language models (LLMs) in understanding and generating human-like responses in the Thai language. This benchmark is crucial for advancing AI technology tailored to Thai, ensuring that models can handle the unique linguistic and contextual challenges presented by the language.
The primary objective of the TRAG Benchmark is to provide a standardized framework for assessing the capabilities of LLMs in processing and generating Thai language content. By focusing on enterprise Q&A data from various industries, the benchmark aims to simulate real-world scenarios where accurate and contextually appropriate responses are essential. This helps identify the strengths and weaknesses of different models, guiding further improvements and innovations in Thai language AI.
The TRAG Benchmark employs a rigorous testing methodology using enterprise Q&A data synthesized from diverse industries such as airlines, automotive, banking, healthcare, and more. The test data comprises two main components: the question and the accompanying context document.
To generate answers, these two data points are combined with standard prompts. Due to variations in how different families of models handle prompts, two standard prompts are used for each model evaluation, and the score set with the higher sum is taken as the official score of that respective model. The generated answers are then graded based on predefined criteria.
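The dual-prompt scoring described above can be sketched as follows. This is a minimal illustration, assuming hypothetical names: `PROMPT_A`, `PROMPT_B`, `model.generate()`, and `grade()` stand in for the benchmark's actual standard prompts, model API, and grading criteria.

```python
# Illustrative standard prompts; the benchmark's real prompt wording is not shown here.
PROMPT_A = ("Answer the question using only the context below.\n"
            "Context: {context}\nQuestion: {question}")
PROMPT_B = ("Context: {context}\n\n"
            "Based strictly on the context above, answer: {question}")

def score_model(model, test_set, grade):
    """Run the test set under both standard prompts; keep the higher total."""
    totals = []
    for prompt in (PROMPT_A, PROMPT_B):
        total = 0.0
        for item in test_set:
            answer = model.generate(prompt.format(context=item["context"],
                                                  question=item["question"]))
            total += grade(item, answer)  # predefined grading criteria
        totals.append(total)
    # The score set with the higher sum is taken as the official score.
    return max(totals)
```

Taking the maximum over the two prompt variants reduces the penalty a model pays for being tuned to a different prompt style than the benchmark's default.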
The scoring criteria of the TRAG Benchmark are designed to ensure a comprehensive evaluation of the models. The criteria include:
This metric measures the percentage of questions that a model answers correctly. It reflects the model's ability to understand the question and retrieve the correct information from the provided context. Answerable-question accuracy is graded by a separate set of impartial LLMs using a different predefined prompt that judges each answer as either right or wrong.
This metric assesses the model's ability to correctly respond with "Sorry, I don't know" to questions that intentionally do not have the right answer in the accompanying context. This is crucial for evaluating the model's capability to recognize when it lacks sufficient information to provide a valid response. Similar to the accuracy of answerable questions, the accuracy of unanswerable questions is also processed by a separate set of impartial LLMs with a different predefined prompt.
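The right/wrong grading for both question types can be sketched as below, assuming a `judge(prompt)` callable that wraps one of the impartial judge LLMs. The prompt wording is illustrative, not the benchmark's actual grading prompt.

```python
# Illustrative grading prompts for the two accuracy metrics.
ANSWERABLE_PROMPT = (
    "You are an impartial judge. Reference answer: {reference}\n"
    "Model answer: {answer}\n"
    "Reply with exactly RIGHT if the model answer matches the reference, "
    "otherwise WRONG."
)

UNANSWERABLE_PROMPT = (
    "You are an impartial judge. The question has no answer in the provided "
    "context, so the correct behaviour is a refusal such as "
    "\"Sorry, I don't know\".\nModel answer: {answer}\n"
    "Reply with exactly RIGHT if the model refused, otherwise WRONG."
)

def is_correct(judge, prompt_template, **fields):
    """Return True when the judge LLM rules the response RIGHT."""
    verdict = judge(prompt_template.format(**fields))
    return verdict.strip().upper() == "RIGHT"
```

Using a judge LLM rather than exact string matching matters here: a refusal can be phrased many ways in Thai or English, and a correct answer rarely matches the reference verbatim.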
Inspired by the paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," this criterion uses GPT-4 to act as an impartial judge. The LLM Judge evaluates the responses based on the following sub-criteria:
The LLM Judge then assigns a rating based on these criteria, which is used as the final score for the model. This comprehensive evaluation ensures that the models are not only accurate but also contextually and linguistically appropriate.
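An MT-Bench-style rating step can be sketched as below: the judge is asked to score the reply on the sub-criteria and to emit its score in the `[[N]]` format that MT-Bench uses, so the number can be parsed reliably. The prompt text is illustrative, not the benchmark's actual judge prompt.

```python
import re

# Illustrative judge prompt covering the sub-criteria discussed in this article.
RATING_PROMPT = (
    "Act as an impartial judge. Evaluate the assistant's reply for correctness, "
    "helpfulness, Thai language quality, and adherence to the instructions.\n"
    "Question: {question}\nContext: {context}\nReply: {reply}\n"
    "After your explanation, output your score from 1 to 10 as: Rating: [[N]]"
)

def extract_rating(judge_output):
    """Pull the numeric score out of the judge's free-form verdict."""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judge_output)
    return float(match.group(1)) if match else None
```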
Another dimension measured is the total response time, which is the time the model takes to complete the response (not just the time-to-first-token). This metric is crucial for evaluating the efficiency and practicality of the model in real-world applications.
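Measuring total response time is straightforward when the client call blocks until completion. This sketch assumes a `model.generate()` call that returns only after the whole answer is produced, which is what distinguishes this metric from time-to-first-token.

```python
import time

def timed_generate(model, prompt):
    """Generate an answer and measure total response time in seconds."""
    start = time.perf_counter()
    answer = model.generate(prompt)        # returns only after the last token
    elapsed = time.perf_counter() - start  # total response time, not TTFT
    return answer, elapsed
```

With a streaming API, the same measurement would instead stop the clock after the final chunk arrives, not the first.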
Q: How to care for diabetic patients in dental treatment?
Document Context: <Guidelines for handling people with diabetes in a dental clinic>
A: For caring for diabetic patients in dental treatment, here are some precautions and guidelines:
I hope this information is helpful for the care of diabetic patients in dental treatment. If you have any additional questions, feel free to ask.
Example of an incorrect answer (contradicting the information given in the context document):
Q: How will passengers be compensated if the flight is delayed?
Document Context: <Information about delay compensation>
A: According to the available information, if a flight is delayed, the airline will compensate passengers. However, it depends on the length and cause of the delay. Generally, if the delay is less than 2 hours, there will be no compensation. If the delay is between 2 to 4 hours, passengers might receive water. For delays over 4 hours, the airline might provide meal vouchers worth up to 100 baht. However, it is recommended to contact the airline directly for more details, as policies may vary.
Passengers should contact the airline staff immediately upon learning about the delay and keep all receipts, such as for food or accommodation, if they had to pay upfront. Additionally, each airline may have different compensation policies, so it is best to check the specific conditions of the airline you are flying with.
This criterion highlights the model's ability to recognize when it lacks sufficient information to provide a valid response. Models that perform well in this area can effectively avoid hallucinations and provide appropriate responses such as "Sorry, I don't know." This is particularly important for maintaining the reliability and trustworthiness of the model.
Q: Does this car model have ABS brakes?
Document Context: <Not provided>
A: I apologize, but the available information does not include details about the ABS brakes for this car model. Therefore, I cannot confirm whether this car model has ABS brakes. If you need specific information about the safety features of this car model, I recommend asking the dealer or checking the official website of the car manufacturer directly. If you have any other questions about cars, I am happy to help with the information available.
Q: Which hospitals can health insurance be claimed at?
Document Context: <Not provided>
A: Thank you for your question. I can provide a list of private hospitals designated by the company in Bangkok and its vicinity as follows: <List of hospitals>
The LLM Judge provides a nuanced evaluation of the model's responses, considering correctness, helpfulness, translation quality, and adherence to specific constraints. High scores in this area indicate that the model not only provides accurate information but also communicates it effectively and appropriately in Thai.
Q: What documents are needed to authorize someone to close a current account?
A: You can authorize someone else to close a current account. The documents required are:
Preparing the power of attorney document for account closure:
In cases of authorizing someone to close an account, the amount received from the account closure will be issued as a bank check (Cashier's Cheque) payable in the name of the account holder and stamped 'A/C Payee Only' or transferred to an account in the same name as the account holder. No cash will be paid to the agent.
Q: What are the different models of the Apple Watch?
A: Sorry, which specific model are you interested in? Currently, I have information about Apple Watch Ultra 2, Apple Watch Series 9, and Apple Watch SE2.
Efficient models that can generate responses quickly without compromising accuracy are highly valued. This metric ensures that the models are practical for real-world applications where timely responses are crucial.
One of the significant challenges in the TRAG Benchmark is the variation in prompt structures and best practices across different LLM families. It is difficult to give every model its optimal prompt without biasing the benchmark by over-tuning the prompt for one model at the expense of another.
For example, adding the instruction "try to provide a helpful response" causes some models, such as the Claude 3 family, to see a massive drop in the accuracy of unanswerable questions due to an increased hallucination rate. Further investigation reveals that the model follows the instruction to be helpful and therefore avoids giving a true negative answer.
Consult with our experts at Amity Solutions for additional information here