I thought I would share these because they were very well written and would be useful for anyone trying to understand why large documents can’t be fed directly into GPT-3 or GPT-4, and where vector databases and scraping large documents fit in.
Here is an excerpt:
The problem is that these models have limited context. You can only fit a few thousand words into them at a time. If you need to squeeze in more, tough luck! You’ve either got to fine-tune the model by training it on the data (which you can’t do with ChatGPT, only the older GPT-3 models) or, what is usually the better option, you need to extract only the text relevant to your specific prompt.
To give an example: If you want to have an answer to the question “What’s the best way to grill a steak?” you can’t just feed ChatGPT an entire cookbook all at once. These models can only understand 3000-6000 words at once, so you’ve got to be clever about which content you feed it.
Of course, a naive approach would be to just run a regex for “steak” and include those pages, but you’ll miss out on all the pages that talk about “beef”, “red meat”, “burgers”, and many other similarly related concepts. It’s easy as a human reading through the index to relate these items, but it’s very difficult to write a program to exhaustively map every related concept (without a significant time investment).
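To make the naive approach concrete, here is a small sketch (the toy “cookbook” pages are made up for illustration) showing how a literal keyword search finds only the page that says “steak” and silently skips the related beef and red-meat pages:

```python
import re

# Toy cookbook pages (invented for this example).
pages = [
    "Grilling a steak: sear over high heat, then let it rest.",
    "Cooking beef to the right internal temperature.",
    "Red meat doneness guide: from rare to well-done.",
    "Baking sourdough bread at home.",
]

# A naive regex for the literal word "steak"...
pattern = re.compile(r"\bsteak\b", re.IGNORECASE)
matches = [p for p in pages if pattern.search(p)]

# ...matches only one page, even though three are clearly relevant.
print(matches)
```

You could keep adding synonyms to the pattern (“beef”, “red meat”, “burgers”, …), but that hand-built list never ends, which is exactly the exhaustive-mapping problem the excerpt describes.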
And that’s where embeddings come in. They’re a clever way to slice up the cookbook into smaller, more manageable chunks that can be fed into the limited context of a language model like ChatGPT.
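A minimal sketch of the idea: each chunk and the question are turned into vectors, and the chunks closest to the question (by cosine similarity) are the ones you feed to the model. The three-dimensional vectors below are hand-made stand-ins; in a real system they would come from an embedding model, and the chunks would likely live in a vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors standing in for real embeddings.
chunks = {
    "Grilling a steak over high heat": [0.9, 0.8, 0.1],
    "Beef doneness and resting times": [0.8, 0.9, 0.2],
    "Baking sourdough bread":          [0.1, 0.2, 0.9],
}

# Pretend embedding of "What's the best way to grill a steak?"
query_vec = [0.85, 0.85, 0.15]

# Rank chunks by similarity to the question; the top ones go into the prompt.
ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, chunks[c]),
                reverse=True)
print(ranked[:2])  # the two meat-related chunks outrank the bread one
```

Because similarity is computed in the embedding space rather than on literal words, the “beef” chunk scores nearly as high as the “steak” chunk, which is exactly what the regex approach above cannot do.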
There are also some other good links in the articles.