Source transparency in LLM information retrieval systems
Published under the AI category.
While designing my LLM-powered chatbot, which answers questions with reference to a limited subset of my writing, I have been thinking about source attribution. The intent is to help people better evaluate the veracity, balance, and context of an answer returned by a model. Hallucination and its implications are at the forefront of my mind. I want to do what I can to ensure people can easily fact-check the outputs of an LLM retrieval system.
In the prompt I send to the OpenAI GPT 3.5 API — the model used by the chatbot — I include a list of statements that are deemed most similar, semantically, to the user’s query.
This similarity is determined by embeddings, which encode language semantics numerically; the closer two embeddings are, the more similar the statements should be. Thus, a statement such as "I like Taylor Swift's music" would likely be deemed a relevant source to pair with the question "what music do you like?".
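To make that concrete, here is a minimal sketch of how such a comparison could look in Python, assuming the pre-1.0 `openai` library and the text-embedding-ada-002 model; the helper functions are illustrative rather than the exact code behind my chatbot.

```python
# Sketch of comparing a question to a candidate source via embeddings.
# Assumes the pre-1.0 `openai` Python library and an API key already set.
import numpy as np
import openai


def embed(text: str) -> np.ndarray:
    """Return the embedding vector for a piece of text."""
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Higher values mean the two texts are closer in meaning."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


query = embed("what music do you like")
statement = embed("I like Taylor Swift's music")
unrelated = embed("The train to Edinburgh leaves at noon")

# The music statement should score noticeably higher than the unrelated one.
print(cosine_similarity(query, statement))
print(cosine_similarity(query, unrelated))
```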
Some sources are embedded at the article level, whereas others are embedded at the paragraph level. I am unsure which is more effective. My system embeds sources at the article level when the important context for each piece of information appears in only one place (usually the top of the page). This is the case with my wiki pages, for example.
My blog posts, by contrast, are embedded at the paragraph level because each paragraph generally carries plenty of context and information on its own. I do this because I can only fit so much text into a prompt before running into GPT 3.5's limits. Embedding at the paragraph level lets me send only the most relevant passages to the model, rather than a whole article, which may not fit in the case of longer writings.
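A sketch of the two chunking strategies might look like the following, assuming sources are plain text with paragraphs separated by blank lines; the function and parameter names are hypothetical.

```python
def chunk_source(text: str, source_type: str) -> list[str]:
    """Return the pieces of text that will each receive their own embedding."""
    if source_type == "wiki":
        # Wiki pages are embedded whole, so the context at the top of the
        # page stays attached to every statement on it.
        return [text]
    # Blog posts are split on blank lines so each paragraph can be
    # retrieved and sent to the model independently.
    return [paragraph.strip() for paragraph in text.split("\n\n") if paragraph.strip()]
```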
The prompt also includes direction to refer only to the sources in the prompt and, where possible, reference each source using the available metadata — the content URL, title, and date published. The bot sometimes hallucinates, but I have observed a strong success rate with sources being used as expected.
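The prompt assembly could be sketched roughly as below; this is an illustration of the idea rather than the exact prompt my bot sends, and the dictionary keys are assumptions.

```python
def build_prompt(question: str, sources: list[dict]) -> str:
    """Assemble a prompt that pairs each source with its citation metadata."""
    formatted_sources = "\n\n".join(
        f"Title: {source['title']}\n"
        f"URL: {source['url']}\n"
        f"Published: {source['date']}\n"
        f"{source['text']}"
        for source in sources
    )
    return (
        "Answer the question using only the sources below. Where possible, "
        "cite each source you use by its title, URL, and publication date. "
        "If the sources do not answer the question, say so.\n\n"
        f"Sources:\n\n{formatted_sources}\n\nQuestion: {question}"
    )
```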
I presently display the titles and URLs of the sources mentioned in the prompt sent to GPT 3.5 in the UI that accompanies the application. Next, I want to display the exact text of the sources so that it is easier to inspect the statements that have been provided to the model.
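One way the application could return that information alongside an answer is sketched below; this is not my production response format, just an illustration of pairing the answer with the exact excerpts shown to the model.

```python
def build_response(answer: str, sources: list[dict]) -> dict:
    """Return the answer together with the exact source text shown to the model."""
    return {
        "answer": answer,
        "sources": [
            {
                "title": source["title"],
                "url": source["url"],
                "excerpt": source["text"],  # the exact text sent in the prompt
            }
            for source in sources
        ],
    }
```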
The users of this feature are both me, as the person designing this information retrieval system, and anyone else who makes a query. I want to make it as easy as possible for people to evaluate a statement made by the model.
Indeed, as someone building on top of existing LLMs, I must do my best to mitigate the risks associated with hallucination and ensure the veracity of the statements my system makes.
In an ideal world, LLMs would not hallucinate; that feels like a problem to be solved in the models themselves, largely through academic and engineering work. For now, however, I can do my best to make source information visible and give both myself and others more tools for evaluating the outputs of my LLM information retrieval system.
Importantly, each response generated by my bot is given a permalink. By displaying source information on the share pages associated with each question, anyone, not solely the person who originally asked it, benefits from the same information about the sources the LLM used to answer.
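Persisting a response alongside its sources is what makes that possible. A minimal sketch, assuming SQLite and a made-up schema purely for illustration, might look like this:

```python
import json
import sqlite3
import uuid


def save_response(question: str, answer: str, sources: list[dict]) -> str:
    """Persist a response and its sources; the returned ID becomes the permalink slug."""
    connection = sqlite3.connect("responses.db")
    connection.execute(
        "CREATE TABLE IF NOT EXISTS responses "
        "(id TEXT PRIMARY KEY, question TEXT, answer TEXT, sources TEXT)"
    )
    response_id = str(uuid.uuid4())
    connection.execute(
        "INSERT INTO responses VALUES (?, ?, ?, ?)",
        (response_id, question, answer, json.dumps(sources)),
    )
    connection.commit()
    connection.close()
    return response_id
```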