We are thrilled to introduce the Thai Retrieval Augmented Generation (TRAG) Benchmark, a groundbreaking evaluation platform designed to assess the performance of large language models (LLMs) in understanding and generating high-quality responses in the Thai language. This benchmark represents a significant step forward in the field of AI, providing a robust framework for evaluating LLMs across various dimensions and test case categories.
The TRAG Benchmark has been meticulously developed to evaluate LLMs on their ability to comprehend document context and generate accurate, contextually appropriate answers in Thai. The benchmark comprises 56 test cases, categorized into 8 distinct areas, and features 7 unique scenarios. Each test case includes a user query and its corresponding document context, ensuring a comprehensive assessment of the model's capabilities.
The test cases are distributed across the following categories:
Airline: Policies and procedures related to seat reservations, ticket pricing, and promotion deadlines.
Automotive: Knowledge about vehicle accessories, load capacity, and current promotions.
Bank: Information on required documents for account closure, authorization procedures, and opening new accounts.
CRM: Membership details, including sign-up processes, point checks, and reward calculations.
Health Care: Medical knowledge such as dental care and hypertension medications.
Human Resources: Employee healthcare benefits, insurance coverage, and reimbursement policies.
IT Gadget: User questions about mobile phones and smartphone comparisons.
Tech Support: LAN configuration, password retrieval, and WiFi router setup.
Beyond the eight categories, every test case falls into one of the 7 scenarios, defined along three dimensions:
1. Types of Questions: Single-turn Q&A and follow-up questions.
2. Language of Document Context: English, Thai, and intentionally empty contexts.
3. Availability of Information: Scenarios where the required information is either present in or absent from the document context, so the model must tailor its response accordingly.
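To make the structure concrete, here is a minimal sketch of how a single test case could be represented, combining a category, a scenario, a Thai user query, and its document context. The field names and example values are illustrative assumptions; the benchmark's actual data schema may differ.

```python
# A minimal, hypothetical representation of one TRAG test case.
# Field names and values are illustrative assumptions, not the benchmark's actual schema.
from dataclasses import dataclass


@dataclass
class TragTestCase:
    case_id: str            # unique identifier, e.g. "bank-07"
    category: str           # one of the 8 categories, e.g. "Bank"
    question_type: str      # "single-turn" or "follow-up"
    context_language: str   # "th", "en", or "empty"
    answerable: bool        # whether the context actually contains the answer
    query: str              # the user question in Thai
    document_context: str   # the retrieved document text (may be intentionally empty)


example = TragTestCase(
    case_id="bank-07",
    category="Bank",
    question_type="single-turn",
    context_language="th",
    answerable=True,
    query="ต้องใช้เอกสารอะไรบ้างในการปิดบัญชีธนาคาร",  # "What documents are needed to close a bank account?"
    document_context="...",  # document text omitted here
)
```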
The TRAG Benchmark assesses LLM performance using a two-step scoring process:
1. LLM Overall Grading: An advanced LLM, GPT-4o-2024-05-13, evaluates the overall quality of each response against the benchmark's grading criteria.
2. LLM Answerable Classification: GPT-4-0613 classifies each response to determine whether the model answered using only the provided document.
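The sketch below shows how this two-step, LLM-as-judge scoring could be wired up with the OpenAI Chat Completions API. The judge prompts and the 1-10 scale are assumptions made for illustration; they are not the benchmark's published grading prompts.

```python
# A minimal sketch of the two-step scoring flow, assuming the OpenAI Chat Completions API.
# Prompts and score scale are illustrative assumptions, not the benchmark's actual prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def grade_overall(query: str, context: str, answer: str) -> str:
    """Step 1: GPT-4o grades the overall quality of the model's answer."""
    resp = client.chat.completions.create(
        model="gpt-4o-2024-05-13",
        messages=[
            {"role": "system", "content": "You are a strict grader for Thai RAG answers. "
                                          "Rate the answer from 1 to 10 and explain briefly."},
            {"role": "user", "content": f"Question:\n{query}\n\nContext:\n{context}\n\nAnswer:\n{answer}"},
        ],
    )
    return resp.choices[0].message.content


def classify_answerable(query: str, context: str, answer: str) -> str:
    """Step 2: GPT-4 checks whether the answer relies only on the provided document."""
    resp = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[
            {"role": "system", "content": "Reply YES if the answer uses only information from the "
                                          "provided document, otherwise reply NO."},
            {"role": "user", "content": f"Question:\n{query}\n\nDocument:\n{context}\n\nAnswer:\n{answer}"},
        ],
    )
    return resp.choices[0].message.content
```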
We invite researchers and developers to use the TRAG Benchmark to evaluate and compare the performance of their models. By participating, you contribute to the advancement of Thai language AI technology, helping to refine and improve AI solutions for Thai users.
Explore the forefront of Thai language AI with the TRAG Benchmark and join us in driving the future of AI technology.
For more information and to get started with the TRAG Benchmark, visit our website.