This two-minute video from my How AI Ate My Website talk highlights the importance of cleaning up the source materials used for conversational interfaces. It illustrates the problems PDF documents can cause in answers generated by large language models and how to address them.
PDFs are special in another way, as in painfully special. Let's look at what happened to our answers when we added 370-plus PDFs to our embedding index. On the left is an answer to the question "What is design?" It's a pretty good response, sourced from a bunch of web pages.
When PDFs got added to the index, the response to this question changed a lot and not in a way that I liked. But more importantly, only one PDF was cited as a source instead of multiple web pages.
So what happened?
What happened is a great demonstration of the importance of the document processing (aka cleanup) step I emphasized before. This ugly spreadsheet shows the ugly truth of PDFs. They have a ton of layout markup to achieve their good looks.
But when breaking them down, you can easily end up with a bunch of bad content chunks like the ones here. After scoring all our content embeddings, we were able to get rid of a bunch that were effectively junk and clogging up our answers.
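The talk doesn't spell out how the chunks were scored, so what follows is only a minimal sketch of one way to flag junk chunks before they reach an index. It scores the raw chunk text with a simple heuristic rather than the embeddings themselves. The function names (`chunk_quality`, `filter_chunks`), the weighting, and the 0.5 threshold are all assumptions for illustration, not the method used in the talk.

```python
import re

# Hypothetical heuristic: prose-like chunks are mostly ordinary words,
# while PDF layout residue is mostly numbers, operators, and fragments.
WORD_PATTERN = re.compile(r"[A-Za-z][A-Za-z'\-]*[.,;:!?]?")


def chunk_quality(text: str) -> float:
    """Return a rough score from 0 to 1; higher means more prose-like."""
    tokens = text.split()
    if not tokens:
        return 0.0

    # Share of tokens that look like ordinary words.
    wordlike = sum(1 for t in tokens if WORD_PATTERN.fullmatch(t))
    word_ratio = wordlike / len(tokens)

    # Share of non-whitespace characters that are letters.
    visible = [c for c in text if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in visible) / len(visible)

    # Penalize very short chunks such as stray headers or page labels.
    length_factor = min(len(tokens) / 8, 1.0)

    return word_ratio * alpha_ratio * length_factor


def filter_chunks(chunks, threshold=0.5):
    """Keep only chunks whose score meets the (assumed) threshold."""
    return [c for c in chunks if chunk_quality(c) >= threshold]
```

In practice the threshold would need tuning against a sample of real chunks, ideally by eyeballing the lowest-scoring ones in a spreadsheet like the one shown in the video.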
Removing those now gives a much better balance of PDFs, videos, podcasts, and web pages, all of which get cited in the answer to "What is design?" More importantly, however, the answer itself actually got better.