Video: Structuring Website Content with AI

by December 4, 2023

To create useful conversational interfaces for specific sets of content like this Website, we can use a variety of AI models to add structure to videos, audio files, and text. In this 2.5 minute video from my How AI Ate My Website talk, I discuss how and also illustrate if you can model a behavior, you can probably train a machine to do it at scale.

Transcript

There's more document types than just web pages. Videos, podcasts, PDFs, images, and more. So let's look at some of these object types and see how we can break them down using AI models in a way that can then be reassembled into the Q&A interface we just saw.

For each video file, we first need to turn the audio into written text. For that, we use a speech-to-text AI model. Next, we need to break that transcript down into speakers. For that, we use a diarization model. Finally, a large language model allows us to make a summary, extract keyword topics, and generate a list of questions each video can answer.

We also explored models for identifying objects and faces, but don't use them here. But we did put together a custom model for one thing, keyframe selection. There's also a processing step that I'll get to in a bit, but first let's look at this keyframe selection use case.

We needed to pick out good thumbnails for each video to put into the user interface. Rather than manually viewing each video and selecting a specific keyframe for the thumbnail, we grabbed a bunch automatically, then quickly trained a model by providing examples of good results. Show the speaker, eyes open, no stupid grin.

In this case, you can see it nailed the which Paris girl are you backdrop, but left a little dumb grin, so not perfect. But this is a quick example of how you can really think about having AI models do a lot of things for you.

If you can model the behavior, you can probably train a machine to do it at scale. In this case, we took an existing model and just fine-tuned it with a smaller number of examples to create a useful thumbnail picker.

In addition to video files, we also have a lot of audio, podcasts, interviews, and so on. Lots of similar AI tasks to video files. But here I wanna discuss the processing step on the right.

There's a lot of cleanup work that goes into making sure our AI generated content is reliable enough to be used in citations and key parts of the product experience. We make sure proper nouns align, aka Luke is Luke. We attach metadata that we have about the files, date, type, location, and break it all down into meaningful chunks that can be then used to assemble our responses.