Observations designing information retrieval systems built on generative AI
Published on under the AI category.
I have been experimenting with OpenAI's GPT 3.5 Turbo API for the last week or so. The result of my experimentation is James Bot, an AI bot that is built to answer questions where the answers can be provided with reference to my blog, wiki, and GitHub READMEs. My goal with this project was to explore building an information retrieval system where a user can ask a natural language query and receive a cohesive answer with reference to relevant source materials to substantiate claims.
Indeed, one of my chief reservations about the use of generative AI for text has been verifying the validity of a statement. Setting expectations by saying that a text was partially or fully generated by AI was not acceptable. Thus, from the beginning of playing around with the GPT 3.5 API, I knew that a lot of my work would be oriented around referencing. I wanted the bot to link, inline, to related blog posts. I wanted to be able to reason about why the bot may have returned a particular result.
Using the information gathered through testing -- both by myself and with others asking questions -- I refined my information architecture, changed the prompts sent to accompany queries, and made other changes to help improve the quality of answers. The architecture of how this bot is built is outside of the scope of this post. Rather, I would like to focus on some interesting results both I and the people who tested the bot have observed. I list the common themes in bullet point form below.
- The Bot doesn't have a good sense of time, despite the System prompt (the master instruction for GPT, so to speak) explicitly containing the date on which the prompt was sent. When asked to do arithmetic with dates, the bot got it wrong.
- The Bot would sometimes provide the wrong answer to a question, while citing a source. That answer was sometimes decisive. In one example, the bot answered "yes" to a question for which I knew both that (i) the answer was no and (ii) I had never addressed the query in the source material. The issue -- pending a fix -- was that the vector store returned many nearest neighbours that said "Yes", but were not answers to, or even related to, my question. I suspect the similarity was low and the vector store was returning short, generic passages because there was no information to address the user's question.
- The Bot, despite instruction in both the System prompt and the prompt prepended to a user's query, would sometimes cite sources from outside my website (in one example, American Express was cited as a source). On investigation, a hyperlink to the American Express article referenced was in the sources. The hyperlink was useful, but the source in the vector database contained no further context. The answer was of high quality, but the response wasn't related to anything I had written.
- The Bot would sometimes divulge, verbatim, the sources in its prompt. The approach taken was prompt injection, whereby I (and others) asked questions to try to get the bot to share part of its prompt. An effective prompt involved my asking for sources so I could fact-check them.
- Answers from my source material were often out of date. Some of my blog posts go back more than two years, with no further thoughts expressed about the subject matter. Thus, the Bot had no better source to cite than old posts in some cases. This prompted me to include the dates on which content was published in the list of sources (where dates were available), and ask the Bot to note when sources were old. I also want the Bot to, in all results, explicitly state the publication date of a piece. This means that if the Bot isn't clear that the information may be out of date, a human is given context in the answer about the publication date.
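The last point, and the trouble with date arithmetic in the first, can be illustrated together. The following is a minimal sketch, not the code James Bot runs: the `Source` record, the `format_sources` helper, and the two-year staleness cutoff are all hypothetical names and values chosen for illustration. The idea is to compute each source's age in ordinary code and hand the model the result, rather than asking the model to subtract dates itself.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Source:
    # Hypothetical record for one retrieved document.
    title: str
    url: str
    text: str
    published: Optional[date] = None  # None when no date is available


def format_sources(sources: list[Source], today: date, stale_after_days: int = 730) -> str:
    """Render sources for the prompt, with each source's age computed in code."""
    blocks = []
    for number, source in enumerate(sources, start=1):
        if source.published is None:
            date_note = "publication date unknown"
        else:
            age_days = (today - source.published).days
            date_note = f"published {source.published.isoformat()}, {age_days} days ago"
            # The cutoff is an assumption; pick whatever "old" means for your content.
            if age_days > stale_after_days:
                date_note += ", may be out of date"
        blocks.append(f"[{number}] {source.title} ({source.url}; {date_note})\n{source.text}")
    return "\n\n".join(blocks)
```

The resulting string would be placed in the prompt alongside an instruction to state the publication date of anything cited. Because the age is already written out, the model only has to repeat it.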
I have spent hours testing the Bot. The insights above are only a few of the many lessons I have learned while designing this system. Issues pertaining to prompt injection will likely be less of a factor with new versions of GPT, but for now it is a concern. I am designing within the technologies to which I have access today, rather than deferring to some future date for changes to the underlying model.
Notably, the issue of providing the wrong answer while citing a source that didn't explicitly address my query was an information architecture issue. A lot of time is going to be spent -- and has already been spent -- building systems that make it easier to build on top of language models, track and evaluate results from the model, "chain" prompts together to perform multi-step evaluations that require different sources or models, and more. LangChain is one example of a tool for chaining prompts. GuardRails lets you build schemas for working with language models.
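One way to address that architecture issue is to discard nearest neighbours whose similarity to the query is too low, and to tell the model plainly when nothing relevant was found. The sketch below is an illustration under assumptions rather than the fix used in James Bot: it assumes embeddings are plain lists of floats, uses cosine similarity, and the `0.8` threshold, the function names, and the fallback note are all hypothetical. A real vector store may already return scores, in which case only the filtering step applies.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_neighbours(query_vector, candidates, min_similarity=0.8):
    """Keep (score, text) pairs at or above the threshold, best match first.

    candidates is a list of (text, vector) pairs.
    """
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in candidates]
    kept = [(score, text) for score, text in scored if score >= min_similarity]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept


# Hypothetical wording; sent instead of sources when nothing passes the filter.
NO_SOURCES_NOTE = "No relevant sources were found. Say that you do not know."


def build_context(query_vector, candidates, min_similarity=0.8):
    """Return the source text for the prompt, or a note that nothing matched."""
    kept = filter_neighbours(query_vector, candidates, min_similarity)
    if not kept:
        return NO_SOURCES_NOTE
    return "\n\n".join(text for _, text in kept)
```

With a filter like this, a pile of short "Yes" passages that happen to be the closest matches, but are still far from the query, would not reach the prompt at all. The right threshold depends on the embedding model and would need tuning against real queries.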
I am still in the early stages of exploring generative text technologies. I have significant reservations about generative text (a subject, perhaps, for another post), up to and including the risk of hallucination, the prospect of citing sources that don't fully substantiate a claim
Tagged in ai.