Hello folks,
I have another take it or leave it API that you may wish to use. This allows your website to surf the web and use the responses in ChatGPT prompts.
Demo: Try it out here
This API will scrape the provided URLs, process that content into embeddings using OpenAI’s embeddings API, and upsert those embeddings to a Pinecone index and namespace of your choice. It then queries the Pinecone index with the message you provide and returns the relevant matches. The advantage of doing this externally rather than natively in Bubble is 1. WUs and 2. speed. This API will process most requests in 3-10 seconds, depending on how many URLs you provide.
Technically there’s no limit to the number of URLs you can provide - I pass a maximum of 5 at a time with no issues. If it can’t scrape a URL, it’ll time out after 10 seconds and only return the URLs it could extract.
Anyway, I can’t be bothered to write detailed docs, so I got ChatGPT to do it (I checked over it myself though).
As I said above, the API is take it or leave it. I haven’t had any issues with it. Technically multithreading isn’t recommended, but it makes things faster, which I care about, and I haven’t had any errors from the API at all. Hope someone finds it useful.
There is a suggested Bubble implementation at the bottom of the post.
If you would like me to deploy the API for you, we can discuss a fixed fee for that, and same goes if you’d like me to do the Bubble implementation. DM me or visit Not Quite Unicorns!
Overview
This API is designed to scrape content from given websites, process that content to generate embeddings using OpenAI’s API, and then store these embeddings into Pinecone, a vector database. The API also provides a feature to query Pinecone with a given message to retrieve relevant results.
Dependencies
The API uses:
- Flask: To expose endpoints.
- requests: To make HTTP requests.
- numpy: For mathematical operations.
- BeautifulSoup: For web scraping.
- uuid: To generate unique IDs.
- threading: For multi-threading.
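If you want to run it locally before deploying, a requirements file along these lines should cover it - this is my guess, not the exact file shipped with the download (uuid and threading are part of the Python standard library, so they don’t need installing):
# requirements.txt - assumed contents; pin versions as you prefer
Flask
requests
numpy
beautifulsoup4
gunicorn  # only needed if your App Engine entrypoint uses gunicorn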
API Endpoint
/browse
Method: POST
Description: This endpoint accepts a list of URLs. For each URL, the content and title are scraped, processed to get embeddings and then upserted to Pinecone in the specified namespace. The endpoint also queries Pinecone with a given message and returns the query results.
Input:
JSON payload with the following fields:
- urls: List of URLs to be processed.
- openAIAPIkey: API key for OpenAI.
- pineconeURL: URL endpoint for Pinecone.
- pineconeAPIkey: API key for Pinecone.
- namespace: Namespace in Pinecone to differentiate between different collections.
- wordLimit: Word limit for splitting content into chunks.
- uniqueID: Unique identifier for the current process (helpful for grouping data).
- message: Message to query Pinecone with.
- topK: Number of top results to return from Pinecone.
- category (optional): If provided, will be used as metadata when storing embeddings in Pinecone and as a filter when querying Pinecone.
Output:
JSON response with the following fields:
- status: Status of the processing; typically returns “processing complete”.
- numTokens: Total number of words from the content of all websites.
- queryResults: Results from querying Pinecone after upserting the URLs you provided.
- urlData: List of processed data for each URL, including the content, title, URL, and associated vector IDs. These are the data that now live in your specified Pinecone namespace.
Example Request:
{
"urls": ["<https://example.com>", "<https://example2.com>"],
"openAIAPIkey": "YOUR_OPENAI_API_KEY",
"pineconeURL": "YOUR_PINECONE_URL",
"pineconeAPIkey": "YOUR_PINECONE_API_KEY",
"namespace": "sample-namespace",
"wordLimit": 150,
"uniqueID": "sample-id",
"message": "Looking for technology news",
"topK": 5,
"category": "tech"
}
Example Response:
{
"status": "processing complete",
"numTokens": 1200,
"queryResults": [
{
"metadata": {
"content": "Sample content from example.com",
"memoryID": "sample-id",
"url": "<https://example.com>",
"title": "Sample Title",
"category": "tech"
},
"score": 0.95
},
...
],
"urlData": [
{
"url": "<https://example.com>",
"title": "Sample Title",
"content": "Full content from example.com",
"vector_ids": ["vector-id-1", "vector-id-2"]
},
...
]
}
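If you want to test the endpoint outside Bubble, a minimal Python call might look like the sketch below. The base URL is a placeholder for wherever you deploy the API; the payload mirrors the example request above.
import requests

# Placeholder URL - replace with your own deployment
API_URL = "https://YOUR-PROJECT.appspot.com/browse"

payload = {
    "urls": ["https://example.com", "https://example2.com"],
    "openAIAPIkey": "YOUR_OPENAI_API_KEY",
    "pineconeURL": "YOUR_PINECONE_URL",
    "pineconeAPIkey": "YOUR_PINECONE_API_KEY",
    "namespace": "sample-namespace",
    "wordLimit": 150,
    "uniqueID": "sample-id",
    "message": "Looking for technology news",
    "topK": 5,
    "category": "tech",
}

response = requests.post(API_URL, json=payload, timeout=60)
response.raise_for_status()
data = response.json()

# Each match carries the scraped content plus its source title and URL
for match in data["queryResults"]:
    print(match["score"], match["metadata"]["url"])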
Deploying the API
- Create a Google Cloud Platform account and make a project. Set up a billing account (you’ll get a few hundred $ to use as a trial)
- Install the Google Cloud CLI
- Run gcloud init and configure your project
- Download the API into a folder
- Navigate to that directory in the terminal
- Run gcloud app deploy
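gcloud app deploy expects an app.yaml next to the code. Something like the following is a reasonable starting point - the file name main.py and the app object name are assumptions, so adjust them to match the actual files in the download:
# app.yaml - minimal App Engine config; module and app names here are assumed
runtime: python311
entrypoint: gunicorn -b :$PORT main:app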
Notes:
- The API uses OpenAI’s API to get embeddings for content chunks.
- The API uses Pinecone’s API to store and query embeddings.
- The API is designed for batch processing of URLs and uses a constant batch size for processing chunks of text.
- The API uses a user-agent string for making requests to avoid being blocked by websites.
- Exception handling is in place to skip problematic websites and continue processing others.
- It will not render JavaScript.
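To make those last few notes concrete, here’s a simplified sketch of the kind of fetch, parse, and chunk step described above. It’s illustrative only - the deployed API’s actual code may differ:
import requests
from bs4 import BeautifulSoup

# A browser-like user agent reduces the chance of being blocked
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; browse-api/1.0)"}

def scrape(url):
    """Fetch a URL and return (title, text), or None if it can't be scraped."""
    try:
        resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return None  # skip problematic sites and carry on with the others
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else url
    # Plain HTML only - JavaScript-rendered content won't show up here
    text = " ".join(soup.get_text(separator=" ").split())
    return title, text

def chunk_words(text, word_limit):
    """Split text into chunks of at most word_limit words (cf. the wordLimit field)."""
    words = text.split()
    return [" ".join(words[i:i + word_limit]) for i in range(0, len(words), word_limit)]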
Suggested Bubble Implementation
- When user sends message with web browsing enabled, trigger custom workflow
- Use GPT-3.5 to suggest a search query for the user’s message
- Use a SERP API to search the internet using that search query
- Make a /browse API request with the top n URLs in the SERP API’s results :merged with user’s message:extract with Regex:format as text (Content: This text:formatted as JSON-safe, Delimiter: ,). The “extract with Regex” can be a regex expression that identifies URLs in the user’s own message, e.g. ([\w+]+\:\/\/)?([\w\d-]+\.)*[\w-]+[\.\:]\w+([\/\?\=\&\#\.]?[\w-]+)*\/? does a pretty good job of extracting most URLs. So, this sends the URLs found via the SERP API, plus any URLs the user provided in their message.
- Make namespace the Current user’s unique ID. A namespace is like a folder in a Pinecone index, so ensure that when you upsert to or query Pinecone, you only do so with namespace = Current user’s unique ID.
- message is the message used to query the Pinecone index after we upsert it.
- After receiving the response, save all of the vector IDs to your database in some way so you can delete them later if needed (I schedule ‘create browsing memory’ on a list (the urlData), which creates a Memory ‘thing’ for each URL that the user can manage).
- Insert the most similar returned matches into the prompt with the content, title, and URL for each one (including the title and URL means the model can ‘credit’ the source in its response, since it knows where each chunk of text came from).
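For that last step, here’s a rough sketch of how the returned matches could be formatted into prompt context. It’s written as Python for clarity; in Bubble you’d build the same string with :format as text.
def build_context(query_results, max_matches=3):
    """Format the top Pinecone matches so the model can credit its sources."""
    blocks = []
    for match in query_results[:max_matches]:
        meta = match["metadata"]
        blocks.append(
            "Source: " + meta.get("title", "") + " (" + meta.get("url", "") + ")\n"
            + meta.get("content", "")
        )
    return "\n\n".join(blocks)

# Prepend the result to the user's message before sending the ChatGPT request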
Wow, you got this far! Thanks for reading the docs. The download link is here.