A technique I used was to create a different data type, e.g. Chunk. In a backend workflow, turn your long text into an array of words, count how many words are in that array, loop over that array to break it up every 300 words or so and save each piece as a chunk of text, overlap a percentage of the words per chunk, send each chunk to the OpenAI embeddings endpoint to get a vector, and then upsert it into Pinecone. This is similar to how Langchain handles chunking.
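For anyone who prefers to see the idea in code, here is a minimal sketch of word-based chunking with overlap. The function name, chunk size, and overlap values are illustrative assumptions, not the exact Bubble workflow described above.

```python
# Minimal sketch: split text into ~300-word chunks that overlap by a fixed
# number of words (the values here are placeholders, not the poster's settings).
def chunk_by_words(text, chunk_size=300, overlap=30):
    words = text.split(" ")
    step = chunk_size - overlap          # advance by chunk_size minus the overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break                        # last chunk reached the end of the text
    return chunks
```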
It would be great if you could provide details of how you achieved this. There seem to be various ways to achieve the same sort of thing. How do you find this method? When I tried something similar using code from GPT-4 inserted into a Bubble plugin, it was too slow, sometimes timing out before results were returned.
In case anyone missed it, there was a Langchain no-code webinar yesterday with some very useful stuff including integrating Langchain with Bubble. Exactly what we’ve been wanting!
You can watch the webinar back here:
@misbah.sy told me that he is working on a video tutorial for the demos he presented in the webinar.
Could you please share your workflow?
I just recreated what @jdeleon1 is describing using a backend recursive workflow. Basically, it’s a loop that at each stage:
- Takes the full text and splits it into a list/array of words (each word is its own item - you split by " ")
- Selects a start index in the list/array and the item up to which to build a chunk (using "format as text" to take the selected items in the list/array and combine them back into a text string)
- Embeds that chunk into a vector using the OpenAI embeddings endpoint
- Saves the returned embedding along with the text chunk to your vector store (in my example I am using Supabase, but it should work the same for Pinecone)
- Then updates the starting index for the next chunk and re-calls the same function to start again (a rough code sketch of the loop follows below).
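For readers who think in code rather than Bubble workflows, here is a rough Python equivalent of that recursive loop. The Pinecone index name, embedding model, and metadata keys are assumptions for illustration; the actual version lives entirely in Bubble's backend workflow editor.

```python
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()                                # assumes OPENAI_API_KEY is set
index = Pinecone(api_key="YOUR_KEY").Index("my-documents")  # hypothetical index

def embed_and_upsert(full_text, doc_id, chunk_size=100, chunk_overlap=5):
    words = full_text.split(" ")                 # split the full text by spaces
    start, chunk_number = 0, 0
    while start < len(words):
        # Take the slice from the start index up to start + chunk_size
        chunk = " ".join(words[start:start + chunk_size])
        # Embed that chunk to a vector
        embedding = client.embeddings.create(
            model="text-embedding-ada-002", input=chunk
        ).data[0].embedding
        # Save the embedding plus the original text chunk to the vector store
        index.upsert(vectors=[{
            "id": f"{doc_id}-{chunk_number}",
            "values": embedding,
            "metadata": {"text": chunk},
        }])
        # Update the starting index for the next chunk, keeping the overlap
        start += chunk_size - chunk_overlap
        chunk_number += 1
```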
Screenshots:
Backend Workflow Setup
Embedding Step
Upsert to Vector Store
Call Workflow Recursively with Condition
At the start, I am using 0 as the starting index, 100 for the chunk_size (so 100 words), and 5 for the chunk_overlap (so 5 words).
The benefit of this is that it runs asynchronously on the backend once you kick it off with the original text, and it saves the embeddings to your vector store as it goes with each cycle.
It will take a little bit of time for it to cycle through each chunk and get it all updated in the vector store.
Yep, that's it. Then have another backend workflow for a query that first goes to OpenAI for the embedding and then queries Pinecone to return a result.
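In code terms, that query-side workflow might look something like the sketch below. The index name, the model, and the assumption that each match's metadata contains a `text` field are all placeholders, not the exact setup.

```python
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()                                # assumes OPENAI_API_KEY is set
index = Pinecone(api_key="YOUR_KEY").Index("my-documents")  # hypothetical index

def query_chunks(question, top_k=3):
    # Embed the question first, then ask Pinecone for the closest chunks
    q_embedding = client.embeddings.create(
        model="text-embedding-ada-002", input=question
    ).data[0].embedding
    results = index.query(vector=q_embedding, top_k=top_k, include_metadata=True)
    return [match.metadata["text"] for match in results.matches]
```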
Thank you! Much appreciated!
Amazing! Would it be possible to use GPT-3.5 to incorporate the vectors in the responses?
How could I do this by loading information from a PDF, like the many applications being created for this purpose?
thanks
Quick answer is yes. This is the flow:
Once you get the data into the vector store, you query it based on the vector of your question; it returns the text chunks that are closest in semantic similarity to your question, and then you insert them as context into your prompt to GPT-3.5.
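A hedged sketch of that last step, stitching retrieved chunks into a GPT-3.5 prompt. The system prompt wording and the helper's interface are illustrative; the chunks would come from a vector-store query like the one sketched earlier.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def answer_with_context(question, context_chunks):
    # context_chunks are the texts returned by the vector-store query
    context = "\n\n".join(context_chunks)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer the question using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```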
Is it possible to personalize the vectors and search results for individual users, similar to how chat history is customized for each user?
Tursun,
We are using different namespaces in our Pinecone database to isolate queries.
In our case, when we upload a vector and text to Pinecone, we set the namespace as the uniqueID of the thing we want to associate it with in Bubble. That way, when we query Pinecone, we can limit the search by passing the uniqueID as the namespace to search in.
Further filtering can be achieved by tagging your embeddings with metadata and then using the metadata to filter subsequent queries, but we haven't come up with a clean way to do that from a UI perspective while sticking to the chat environment.
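As a rough illustration of the namespace-plus-metadata approach (the index name, field names, and the `tags` metadata key are assumptions, not the exact setup described above):

```python
from pinecone import Pinecone

index = Pinecone(api_key="YOUR_KEY").Index("my-documents")  # hypothetical index

def upsert_for_user(bubble_unique_id, chunk_id, embedding, text, tags=None):
    # The Bubble thing's uniqueID doubles as the Pinecone namespace
    index.upsert(
        vectors=[{"id": chunk_id, "values": embedding,
                  "metadata": {"text": text, "tags": tags or []}}],
        namespace=bubble_unique_id,
    )

def query_for_user(bubble_unique_id, question_embedding, tag=None, top_k=3):
    # An optional metadata filter narrows results further within the namespace
    metadata_filter = {"tags": {"$in": [tag]}} if tag else None
    return index.query(vector=question_embedding, top_k=top_k,
                       include_metadata=True, namespace=bubble_unique_id,
                       filter=metadata_filter)
```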
Great to see this discussion here. Lots of interesting info.
I was looking at incorporating the Python text-splitter function via Pipedream to do something similar to this, but I like the potential of keeping it in Bubble since that is what I'm most comfortable with.
Currently, we're extracting text from the PDFs and, in our case, the returned text has a lot of repeating special characters in it. We end up using regex to clean the text before we start chunking. Next, we use regex to split the cleaned text string and create the chunks for embedding. It's super fast, but I have some concerns since this method does not create the overlaps.
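The exact patterns depend on what the PDF extractor returns, but the kind of clean-up described might look roughly like this (the patterns below are examples, not the ones actually used):

```python
import re

def clean_pdf_text(raw_text):
    text = re.sub(r"[^\x20-\x7E\n]+", " ", raw_text)  # drop non-printable characters
    text = re.sub(r"([.\-_*=])\1{2,}", " ", text)     # collapse runs like ..... or ----
    text = re.sub(r"\s+", " ", text)                  # normalize whitespace
    return text.strip()
```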
A couple of questions: How is the speed of the workflow and the WU consumption? How much cleaning are you doing of the original text before kicking off this backend workflow? Is any priority given to end-of-paragraph or line breaks vs. word breaks for maintaining context?
OH very cool will watch!
Thank you very much Nathan!
Yes, it all depends on how you set up the database, similar to the scenario where you have multi-tenancy in normal data within a database. As @nathan.s mentioned, you can use namespaces or metadata in Pinecone to restrict the search. Or in Supabase, you can just add a column that has the user ID and restrict the search by that.
One note on vectors: they are not a direct translation of the input; they are an approximation of the semantic content of the text. So you should always store the text_content either in the vector store or in the Bubble DB (and include the uniqueID in the vector store).
Vectors are most useful for searching for or comparing things by semantic similarity, regardless of whether they share the same words. E.g. "The dog ran across the street." and "The hound galloped over the road." are almost exactly the same in meaning, but only 'the' appears in both.
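As a toy illustration of that point (the model name is an assumption; any embedding model shows the same effect), comparing the two example sentences with cosine similarity:

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed(text):
    return client.embeddings.create(
        model="text-embedding-ada-002", input=text
    ).data[0].embedding

score = cosine_similarity(embed("The dog ran across the street."),
                          embed("The hound galloped over the road."))
print(score)  # typically much higher than for two unrelated sentences
```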
This method is definitely not going to perform as well as the Python script in Pipedream, but it takes no code and stays fully in Bubble.
Speed: slow; it goes through each chunk one at a time. Probably over a minute or two to split up any decent-sized doc.
No cleaning of the text beforehand. That is one challenge of using Bubble: there isn't a good way to store data as a variable in data-processing (workflow) steps. So in order to do that, you'd need to save the input in the DB first, then do the regex processing on it and resave it, then kick off the text-splitting process.
For this example, I am just breaking by word, not paying attention to line breaks. There might be a method to first split by "\n" line breaks, store those in the DB, then do the split by word to do the chunking.
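One possible shape for that two-pass approach, sketched in code (the function name and defaults are illustrative, not a tested workflow):

```python
def chunk_paragraph_aware(text, chunk_size=100, overlap=5):
    # First split on newlines so chunks respect paragraph boundaries,
    # then word-chunk each paragraph with a small overlap.
    chunks = []
    step = chunk_size - overlap
    for paragraph in text.split("\n"):
        words = paragraph.split(" ")
        for start in range(0, len(words), step):
            chunk = " ".join(words[start:start + chunk_size]).strip()
            if chunk:
                chunks.append(chunk)
    return chunks
```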
WU consumption: hmm, I think it was about 300-500 WUs for splitting one doc. I just reran one now and will look at the WUs after it updates in the log tab later today.
Quick update:
I took a document that was ~7k words (~40k characters) and split it into 70 chunks (~100 words per chunk). The WU cost was 165 (so about 2.3 WU per chunk).
This does not include saving any data to the Bubble DB. I think if you ended up storing the text content or other metadata in the Bubble DB, you're probably looking at an additional 1 WU per chunk.
Finally, someone has uncovered how to use Bubble directly with Langchain and a vector database, plus the concept of a custom ChatGPT for custom data. Really great video and experience; you must watch it and follow this amazing guy.
Sorry if this has been answered.
For example, is it possible to have a DB called “FAQ” and have Langchain read all of it?
Is it possible to create a replacement for Algolia for what we ultimately want to do?