Any idea how to break large PDFs into chunks for OpenAI's davinci model?

So I’m working on a project where I have to feed PDF data to the model via the API and get it ready for questioning by users: a user uploads a PDF and then asks the AI questions about it. The problem is that the model only allows about 4,000 tokens, which causes a prompt-limit issue. Do you have any ideas on how I can solve this, for instance by breaking the PDF into chunks?

Hmm :thinking: Good question.

I’m assuming that you have already figured out a way to extract the text from the PDF (probably using OCR).
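If the PDF already has a text layer (i.e. it isn’t a scan), you may not even need OCR. Just a quick sketch using pypdf; the file name is a placeholder:

```python
from pypdf import PdfReader  # pip install pypdf


def extract_pdf_text(path: str) -> str:
    """Pull the text layer out of a (non-scanned) PDF, page by page."""
    reader = PdfReader(path)
    return "\n\n".join(page.extract_text() or "" for page in reader.pages)


text = extract_pdf_text("uploaded.pdf")  # placeholder file name
```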

You can take the extracted text and split it into parts using regex. Then send each chunk to the model along with the same question, using a backend workflow that runs asynchronously. You could then ask the model which answer is better each time, and either stick with the current answer or move on to the next chunk until you’ve gone through all of the text. Something like the sketch below.
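Here’s a rough Python sketch of the chunk-and-ask idea, assuming the legacy Completions endpoint with text-davinci-003; the model name, character budget, and prompt format are just my assumptions, so adjust them to your setup:

```python
import re

import openai  # legacy openai-python (<1.0) Completions API assumed

openai.api_key = "YOUR_API_KEY"  # placeholder

MAX_CHARS = 8000  # roughly 2,000 tokens, leaving room for the question and answer


def split_into_chunks(text, max_chars=MAX_CHARS):
    """Split the extracted text on blank lines (paragraphs) and pack the
    pieces into chunks that stay under a rough character budget."""
    paragraphs = re.split(r"\n\s*\n", text)
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current)
    return chunks


def ask_each_chunk(text, question):
    """Ask the same question against every chunk and collect the answers."""
    answers = []
    for chunk in split_into_chunks(text):
        prompt = f"Context:\n{chunk}\n\nQuestion: {question}\nAnswer:"
        response = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt,
            max_tokens=256,
            temperature=0,
        )
        answers.append(response.choices[0].text.strip())
    return answers
```

You could then make one more call that shows the model all the collected answers and asks it to pick (or merge) the best one.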

Or… you could have the model summarize the chunks and then ask the question over the summaries at the end. The downside is that the summaries might drop some detail the model needs to answer the question.
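That variant might look something like this, reusing `split_into_chunks` from the sketch above (again, the model name and prompts are just my assumptions):

```python
def summarize_then_ask(text, question):
    """Summarize each chunk first, then answer the question over the
    combined summaries in a single final call."""
    summaries = []
    for chunk in split_into_chunks(text):
        response = openai.Completion.create(
            model="text-davinci-003",
            prompt=f"Summarize the following text:\n\n{chunk}\n\nSummary:",
            max_tokens=200,
            temperature=0,
        )
        summaries.append(response.choices[0].text.strip())

    combined = "\n".join(summaries)
    final = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Context:\n{combined}\n\nQuestion: {question}\nAnswer:",
        max_tokens=256,
        temperature=0,
    )
    return final.choices[0].text.strip()
```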

Would something like that work? Just brainstorming here. :blush:

It’s like you read my mind! Exactly what I’ve been thinking! Thanks for your input, that cements it even more.

