A technique I used was to create a different data type, e.g. Chunk. In a backend workflow, turn your long text into an array of words, count how many words are in that array, loop over that array to break it up every 300 words or so and save each piece as a chunk of text, overlap a percentage of the words per chunk, send each chunk to the OpenAI embeddings endpoint to get a vector, and then upsert it into Pinecone. This is similar to how Langchain handles chunking.
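For anyone who prefers to see the idea in code, here is a minimal sketch of word-based chunking with overlap. The function name, chunk size, and overlap values are illustrative assumptions, not the exact Bubble workflow described above.

```python
# Minimal sketch: split text into ~300-word chunks that overlap by a fixed
# number of words (the values here are placeholders, not the poster's settings).
def chunk_by_words(text, chunk_size=300, overlap=30):
    words = text.split(" ")
    step = chunk_size - overlap          # advance by chunk_size minus the overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break                        # last chunk reached the end of the text
    return chunks
```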
It would be great if you could provide details of how you achieved this. There seem to be various ways to achieve the same sort of thing. How do you find this method? When I tried something similar using code from GPT-4 inserted into a Bubble plugin, it was too slow, sometimes timing out before results were returned.
In case anyone missed it, there was a Langchain no-code webinar yesterday with some very useful stuff including integrating Langchain with Bubble. Exactly what we’ve been wanting!
You can watch the webinar back here:
@misbah.sy told me that he is working on a video tutorial for the demos he presented in the webinar.
Could you please share your workflow?
I just recreated what @jdeleon1 is describing using a backend recursive workflow. Basically, it’s a loop that at each stage:
- Takes the full text and splits it into a list/array of words (each word is its own item - you split by " ")
- Selects a start index in the list/array and the item up to which to build a chunk (using "format as text" to take the selected items in the list/array and combine them back into a text string)
- Embeds that chunk into a vector using the OpenAI embeddings endpoint
- Saves the returned embedding along with the text chunk to your vector store (in my example I am using Supabase, but it should work the same for Pinecone)
- Then updates the starting index for the next chunk and re-calls the same function to start again (a rough code sketch of the loop follows below).
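For readers who think in code rather than Bubble workflows, here is a rough Python equivalent of that recursive loop. The Pinecone index name, embedding model, and metadata keys are assumptions for illustration; the actual version lives entirely in Bubble's backend workflow editor.

```python
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()                                # assumes OPENAI_API_KEY is set
index = Pinecone(api_key="YOUR_KEY").Index("my-documents")  # hypothetical index

def embed_and_upsert(full_text, doc_id, chunk_size=100, chunk_overlap=5):
    words = full_text.split(" ")                 # split the full text by spaces
    start, chunk_number = 0, 0
    while start < len(words):
        # Take the slice from the start index up to start + chunk_size
        chunk = " ".join(words[start:start + chunk_size])
        # Embed that chunk to a vector
        embedding = client.embeddings.create(
            model="text-embedding-ada-002", input=chunk
        ).data[0].embedding
        # Save the embedding plus the original text chunk to the vector store
        index.upsert(vectors=[{
            "id": f"{doc_id}-{chunk_number}",
            "values": embedding,
            "metadata": {"text": chunk},
        }])
        # Update the starting index for the next chunk, keeping the overlap
        start += chunk_size - chunk_overlap
        chunk_number += 1
```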
Screenshots:
Backend Workflow Setup
Embedding Step
Upsert to Vector Store
Call Workflow Recursively with Condition
At the start, I am using 0 as the starting index, 100 for the chunk_size (so 100 words), and 5 for the chunk_overlap (so 5 words).
The benefit of this is that it runs asynchronously on the backend once you kick it off with the original text, and it saves the embeddings to your vector store as it goes with each cycle.
It will take a little bit of time for it to cycle through each chunk and get it all updated in the vector store.
Yep, that's it. Then have another backend workflow for a query that first goes to OpenAI for the embedding and then queries Pinecone to return a result.
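In code terms, that query-side workflow might look something like the sketch below. The index name, the model, and the assumption that each match's metadata contains a `text` field are all placeholders, not the exact setup.

```python
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()                                # assumes OPENAI_API_KEY is set
index = Pinecone(api_key="YOUR_KEY").Index("my-documents")  # hypothetical index

def query_chunks(question, top_k=3):
    # Embed the question first, then ask Pinecone for the closest chunks
    q_embedding = client.embeddings.create(
        model="text-embedding-ada-002", input=question
    ).data[0].embedding
    results = index.query(vector=q_embedding, top_k=top_k, include_metadata=True)
    return [match.metadata["text"] for match in results.matches]
```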
Thank you! Much appreciated!
Amazing! Would it be possible to use GPT-3.5 to incorporate the vectors in the responses?
How could I do this by loading information from a PDF, like the many applications being created for this purpose?
thanks
Quick answer is yes. This is the flow:
Once you get the data into the vector store, you query it based on the vector of your question; it returns the text chunks that are closest in semantic similarity to your question, and then you insert them as context into your prompt to GPT-3.5.
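A hedged sketch of that last step, stitching retrieved chunks into a GPT-3.5 prompt. The system prompt wording and the helper's interface are illustrative; the chunks would come from a vector-store query like the one sketched earlier.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def answer_with_context(question, context_chunks):
    # context_chunks are the texts returned by the vector-store query
    context = "\n\n".join(context_chunks)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer the question using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```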
Is it possible to personalize the vectors and search results for individual users, similar to how chat history is customized for each user?
Tursun,
We are using different namespaces in our Pinecone database to isolate queries.
In our case, when we upload a vector and text to Pinecone, we set the namespace as the uniqueID of the thing we want to associate it with in Bubble. That way, when we query Pinecone, we can limit the search by passing the uniqueID as the namespace to search in.
Further filtering can be achieved by tagging your embeddings with metadata and then using the metadata to filter subsequent queries, but we haven't come up with a clean way to do that from a UI perspective while sticking to the chat environment.
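As a rough illustration of the namespace-plus-metadata approach (the index name, field names, and the `tags` metadata key are assumptions, not the exact setup described above):

```python
from pinecone import Pinecone

index = Pinecone(api_key="YOUR_KEY").Index("my-documents")  # hypothetical index

def upsert_for_user(bubble_unique_id, chunk_id, embedding, text, tags=None):
    # The Bubble thing's uniqueID doubles as the Pinecone namespace
    index.upsert(
        vectors=[{"id": chunk_id, "values": embedding,
                  "metadata": {"text": text, "tags": tags or []}}],
        namespace=bubble_unique_id,
    )

def query_for_user(bubble_unique_id, question_embedding, tag=None, top_k=3):
    # An optional metadata filter narrows results further within the namespace
    metadata_filter = {"tags": {"$in": [tag]}} if tag else None
    return index.query(vector=question_embedding, top_k=top_k,
                       include_metadata=True, namespace=bubble_unique_id,
                       filter=metadata_filter)
```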
Great to see this discussion here. Lots of interesting info.
I was looking at incorporating the Python text-splitter function via Pipedream to do something similar to this, but I like the potential of keeping it in Bubble since that is what I'm most comfortable with.
Currently, we're extracting text from the PDFs and, in our case, the returned text has a lot of repeating special characters in it. We end up using regex to clean the text before we start chunking. Next, we use regex to split the cleaned text string and create the chunks for embedding. It's super fast, but I have some concerns since this method does not create the overlaps.
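The exact patterns depend on what the PDF extractor returns, but the kind of clean-up described might look roughly like this (the patterns below are examples, not the ones actually used):

```python
import re

def clean_pdf_text(raw_text):
    text = re.sub(r"[^\x20-\x7E\n]+", " ", raw_text)  # drop non-printable characters
    text = re.sub(r"([.\-_*=])\1{2,}", " ", text)     # collapse runs like ..... or ----
    text = re.sub(r"\s+", " ", text)                  # normalize whitespace
    return text.strip()
```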
A couple of questions: How is the speed of the workflow and the WU consumption? How much cleaning are you doing of the original text before kicking off this backend workflow? Is any priority given to end-of-paragraph or line breaks vs. word breaks for maintaining context?
OH very cool will watch!
Thank you very much Nathan!
Yes, it all depends on how you set up the database, similar to the scenario where you have multi-tenancy in normal data within a database. As @nathan.s mentioned, you can use namespaces or metadata in Pinecone to restrict the search. Or in Supabase, you can just add a column that has the user ID and restrict the search by that.
One note on vectors: they are not a direct translation of the input; they are an approximation of the semantic content of the text. So you should always store the text_content either in the vector store or in the Bubble DB (and include the uniqueID in the vector store).
Vectors are most useful for searching for or comparing things by semantic similarity, regardless of whether they share the same words. E.g. "The dog ran across the street." and "The hound galloped over the road." are almost exactly the same in meaning, but only 'the' appears in both.
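As a toy illustration of that point (the model name is an assumption; any embedding model shows the same effect), comparing the two example sentences with cosine similarity:

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed(text):
    return client.embeddings.create(
        model="text-embedding-ada-002", input=text
    ).data[0].embedding

score = cosine_similarity(embed("The dog ran across the street."),
                          embed("The hound galloped over the road."))
print(score)  # typically much higher than for two unrelated sentences
```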
This method is definitely not going to perform as well as the Python script in Pipedream, but it takes no code and stays fully in Bubble.
Speed: slow; it goes through each chunk one at a time. Probably over a minute or two to split up any decent-sized doc.
No cleaning of the text beforehand. That is one challenge of using Bubble: there isn't a good way to store data as a variable in data-processing (workflow) steps. So in order to do that, you'd need to save the input in the DB first, then do the regex processing on it and resave it, then kick off the text-splitting process.
For this example, I am just breaking by word, not paying attention to line breaks. There might be a method to first split by "\n" line breaks, store those in the DB, then do the split by word to do the chunking.
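One possible shape for that two-pass approach, sketched in code (the function name and defaults are illustrative, not a tested workflow):

```python
def chunk_paragraph_aware(text, chunk_size=100, overlap=5):
    # First split on newlines so chunks respect paragraph boundaries,
    # then word-chunk each paragraph with a small overlap.
    chunks = []
    step = chunk_size - overlap
    for paragraph in text.split("\n"):
        words = paragraph.split(" ")
        for start in range(0, len(words), step):
            chunk = " ".join(words[start:start + chunk_size]).strip()
            if chunk:
                chunks.append(chunk)
    return chunks
```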
WU consumption: hmm, I think it was about 300-500 WUs for splitting one doc. I just reran one now and will look at the WUs after it updates in the log tab later today.
Quick update:
I took a document that was ~7k words (~40k characters) and split it into 70 chunks (~100 words per chunk). The WU cost was 165 (so about 2.3 WU per chunk).
This does not include saving any data to the Bubble DB. I think if you ended up storing the text content or other metadata in the Bubble DB, you're probably looking at an additional 1 WU per chunk.
Finally, someone has uncovered how to use Bubble directly with Langchain and a vector database, plus the concept of a custom ChatGPT for custom data. Really great video and experience; you must watch it and follow this amazing guy.
Sorry if this has been answered.
For example, is it possible to have a DB called “FAQ” and have Langchain read all of it?
Is it possible to create a replacement for Algolia for what we ultimately want to do?