Any idea how to break large documents (10K+ words) into chunks for OpenAI's GPT-3.5 Turbo model?

So I’m working on a project where I have to give users the ability to upload text docs of 10K+ words and receive responses of around the same length. In other words, a user copy/pastes or uploads a document and then asks the AI questions about the information provided. The problem is that the model only allows 4,096 tokens per request, which causes a prompt limit issue. I have been attempting to chunk the data and have GPT-3.5 Turbo produce a single output, to no avail.

Any chance someone can provide me with some guidance on how to properly chunk this data in Bubble?

Hello @doublethink158

It’s a bit involved to explain, but let me give you the broad strokes.

One of the approaches to tackle this is to use what is referred to as retrieval augmentation.

  • The idea is to partition the data into bite-sized chunks so that large datasets can be queried efficiently and the LLM only has to access the relevant pieces of information as needed.
  • You then turn those chunks into robot-numbers (vectors) using an LLM vector-embedding endpoint and store them, indexed, in a super-fast vector database.
  • Natural-language queries go through the same LLM embeddings endpoint, which turns the NL query into a robot-numbers query you can search the database with.
  • The last step is to send the top robot-numbers results to an LLM generative endpoint, which interprets them and returns an NL completion to you (see the sketch just below this list).
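To make the flow concrete, here’s a minimal Python sketch of those four steps using OpenAI’s embeddings/chat endpoints and Pinecone. Everything here is illustrative: the keys, environment, index name (`my-docs`), and the `chunks` list are placeholders, and the calls assume the pre-1.0 `openai` and 2.x `pinecone-client` libraries.

```python
import openai
import pinecone

openai.api_key = "YOUR_OPENAI_KEY"                # placeholder
pinecone.init(api_key="YOUR_PINECONE_KEY",        # placeholder
              environment="us-east-1-aws")        # placeholder

index = pinecone.Index("my-docs")  # a pre-created index, dimension 1536

# 1. Turn each text chunk into "robot numbers" (a 1536-dim embedding vector)
chunks = ["first chunk of the document...", "second chunk..."]  # from your chunking step
embeddings = openai.Embedding.create(
    model="text-embedding-ada-002", input=chunks
)["data"]

# 2. Store the vectors, keeping the original text as metadata
index.upsert(vectors=[
    (f"chunk-{i}", e["embedding"], {"text": chunks[i]})
    for i, e in enumerate(embeddings)
])

# 3. Embed the user's natural-language question the same way
question = "What does the document say about pricing?"
q_vec = openai.Embedding.create(
    model="text-embedding-ada-002", input=[question]
)["data"][0]["embedding"]

# 4. Fetch the top matches and hand them to the generative endpoint
matches = index.query(vector=q_vec, top_k=3, include_metadata=True)["matches"]
context = "\n\n".join(m["metadata"]["text"] for m in matches)

answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer["choices"][0]["message"]["content"])
```

The key point of the design is that only the top-k retrieved chunks, never the whole document, have to fit inside the 4,096-token context window.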

The data part of the above approach is about being able to handle data in its different forms (PDF files, for example), for which there are Python and JS tools. For the chunking step, you should split by token count rather than character count, both to preserve meaning and to stay within token limits like the 4,096 mentioned above. A token-counting sketch follows.
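For example, here’s one way to chunk by token count using OpenAI’s `tiktoken` tokenizer. The 500-token chunk size and 50-token overlap are just illustrative defaults, not recommendations from this thread:

```python
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most max_tokens tokens, with some
    overlap so meaning isn't cut off at chunk boundaries."""
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = enc.encode(text)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(enc.decode(window))
    return chunks

chunks = chunk_by_tokens(document_text)  # document_text: the user's uploaded doc
```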

Tools to do this:

  • Chunking: There are two no-code tools that interact with LangChain, currently the hottest AI framework for managing this and other aspects of the process. These tools need to be cloned from their GitHub repositories and hosted either locally or on a hosting service. One is called Langflow and the other is Flowise. Both are very easy to use once you have them installed (see the LangChain sketch after this list).
  • OpenAI embeddings and generative endpoints (for chat or just straightforward completions)
  • Pinecone vector database
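If you’d rather script it than use the visual tools, the same pipeline is only a few lines in LangChain’s Python API. This is a sketch based on the 2023-era `langchain` package (class locations may differ in newer releases); `document_text` and the index name are placeholders:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Token-aware splitting (uses tiktoken under the hood)
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500, chunk_overlap=50
)
docs = splitter.create_documents([document_text])

# Embed the chunks and store them in an existing Pinecone index
# (assumes pinecone.init(...) has already been called)
vectorstore = Pinecone.from_documents(
    docs, OpenAIEmbeddings(), index_name="my-docs"  # placeholder index name
)

# Retrieval + generation in one chain
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
)
print(qa.run("What does the document say about pricing?"))
```

Langflow and Flowise essentially let you wire these same components together visually instead of in code.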

This is a great video that should walk you through most of the above:

Hope the above makes sense :sweat_smile:

PS: A special mention to @eli, who, during a chat about this topic, came up with what is in my opinion the best way to refer to a vector … a “robot number” :smiley:


Thanks, I will definitely take a look at it.

