New CoT Evaluation
The optimized query with qa_cot_few_shot works well and beats both the original staging version and the direct-answer one. However, it still hallucinates. Why does that happen, and how do we prevent it?
To reduce LLM hallucination, there are a few options:
- Prompt optimization: a concise prompt with clear instructions on generation rules, as Apple does
- Post-processing: filter out hallucinated answers by checking them against a pre-trained model such as RAG or BART (costly, low priority)
- Use a more powerful model such as Llama 3.1 or GPT-4o, which is more accurate and less likely to hallucinate
- Reduce the input context length, and with it hallucination (Lost in Instructions)
- Add an extra step after search and before QA: rerank and filter the retrievals (see the sketch after this list)
- Chunk the documents
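A minimal sketch of the rerank-and-filter step, assuming a sentence-transformers cross-encoder; the model name, threshold, and top_k are placeholders to tune on the evaluation set, not the production setup.

```python
from sentence_transformers import CrossEncoder

# Assumed reranker checkpoint; scores may be raw logits depending on the model.
RERANKER = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
SCORE_THRESHOLD = 0.0  # illustrative; tune on the evaluation set


def rerank_and_filter(query: str, retrievals: list[str], top_k: int = 5) -> list[str]:
    """Score each retrieved chunk against the query, keep the top_k chunks,
    and drop anything below the relevance threshold before building the QA prompt."""
    scores = RERANKER.predict([(query, doc) for doc in retrievals])
    ranked = sorted(zip(retrievals, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, score in ranked[:top_k] if score >= SCORE_THRESHOLD]
```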
Evaluate with positive and negative datasets; around 20 test questions each is enough. Use a powerful model to generate the ground truth, distinguish false-positive from true-negative generations, and build the TP and TN datasets. Target split: TP(20) / FP(5) / TN(5) / FN(5). Checking false negatives tells us whether the model is too conservative and missing good answers.
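A sketch of how the TP/FP/TN/FN labels could be assigned, assuming the powerful model's ground truth is reduced to a boolean "the knowledge base can answer this question"; the function and argument names are hypothetical, and this only captures the answer/abstain axis (answer correctness is judged separately).

```python
def label_generation(kb_has_answer: bool, model_answered: bool) -> str:
    """Classify one generation against the ground-truth verdict.

    kb_has_answer:  ground truth says the knowledge base can answer the question.
    model_answered: the model produced an answer instead of abstaining."""
    if kb_has_answer and model_answered:
        return "TP"  # answered when it should (target: 20 cases)
    if not kb_has_answer and model_answered:
        return "FP"  # answered without support, i.e. likely hallucination (5 cases)
    if not kb_has_answer and not model_answered:
        return "TN"  # correctly abstained (5 cases)
    return "FN"      # too conservative: abstained despite a good answer (5 cases)
```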
Used GPT-4o as the judge, evaluating the metrics below across two areas (see the judge sketch after the list):
- General RAG
  - Answer Relevancy
  - Answer Faithfulness
  - Toxicity and Bias (red teaming)
  - Hallucination
- Business Goal for Task (Summarization)
  - Answer Structure: instructional, step-by-step; code snippets are a plus; Markdown formatting
  - Answer Quality: current RAG answers are formulaic, broad, and lack character. We should probably combine the LLM's internal knowledge with RAG to generate more distinctive answers; for example, when a factual/common-sense answer conflicts with the retrieval, prefer the RAG answer. To break out of the answer template, split responses into a general domain answer plus a RAG-based completion. Broad answers can be fixed by refining the question; lack of depth needs an expert system.
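A minimal LLM-as-judge sketch for the faithfulness/hallucination metric, assuming the OpenAI Python SDK; the prompt wording and the 1-5 scale are illustrative, not the exact rubric used.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a RAG answer.
Context:
{context}

Question: {question}
Answer: {answer}

Rate faithfulness from 1 (fully hallucinated) to 5 (fully grounded in the context).
Reply as JSON: {{"faithfulness": <int>, "reason": "<one sentence>"}}"""


def judge_faithfulness(question: str, context: str, answer: str) -> dict:
    """Ask GPT-4o to rate how well the answer is grounded in the retrieved context."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(context=context, question=question, answer=answer),
        }],
    )
    return json.loads(response.choices[0].message.content)
```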
The goal of knowledge_qa_agent is to achieve 100% correctness on CL queries.
TODO:
- Sort out the hallucinated answers.
- Understand why the hallucinations happen.