Willing to Pay: How Do I Attach a Large File to ChatGPT as a Reference?

Goal

On my platform, users can add past job titles (e.g., “Project Manager”). For filtering and analysis, that’s too broad: you could be an IT Project Manager, a Construction Project Manager, etc.

In Sweden, we have SSYK (similar to the US SOC, the Standard Occupational Classification, or the globally used ISCO, the International Standard Classification of Occupations), which categorizes job roles into detailed groups.

“Project Manager” alone is meaningless, but if a user writes “I was a project manager at a construction company,” I want to send that title + context to ChatGPT via the API, have it map the title to the correct SSYK code, and get back structured JSON. Example:

{
  "raw_title": "Project Manager",
  "ssyk_hirarki_1": "5",
  "ssyk_hirarki_2": "54",
  "ssyk_hirarki_3": "541",
  "ssyk_hirarki_4": "5419",
  "ssyk_occupation": "Project Manager, Construction"
}

Problem

I know how to make API calls, but I can’t figure out how to attach the SSYK JSON file (≈15,000 lines) so that the model can reference it.

I’ve used the following API call to analyze documents (like invoices), but it seems to only accept .pdf. I tried .json and .txt, but no luck.

{
  "model": "gpt-4o-mini",
  "temperature": 0.7,
  "text": { "format": { "type": "json_object" } },
  "input": [
    {
      "role": "user",
      "content": [
        { "type": "input_file", "file_url": <file_url> },
        { "type": "input_text", "text": <instruction> }
      ]
    }
  ]
}

Question

How do I attach the SSYK JSON to the prompt so that the model can actually read it and look up the right category? Do I upload it somewhere to OpenAI beforehand?

Is there a reliable way to handle large reference lists (like 15k job codes) without converting everything to PDF?


Payment :money_with_wings:

I’d be happy to pay someone if they can point me in the right direction and help me get this working. I can share screenshots (but not my editor, my business partner isn’t comfortable with that).

Some people I look up to who might be able to help out here: @adamhholmes @boston85719 @Jici @georgecollier

You probably shouldn’t, as it’ll be an obscene number of tokens and might not lead to accurate responses.

You’re going to need to split it into chunks and upsert them into a vector database using something like Pinecone. Each SSYK code becomes one vector: you generate an embedding for it from its raw text content.

Then, when you want an AI to classify a person’s query/bio/job into a title, you query the vector database with the user’s input. That returns the vectors that are the closest semantic matches, which might reduce the total job pool to 10-50: a much more reasonable number of options to pass to the AI so it can select the most appropriate one.
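A rough Python sketch of that pipeline, assuming the openai and pinecone SDKs are installed, a Pinecone index already exists, and SSYK records have code/title/description fields (the model names and field names are my assumptions, not from the thread):

```python
def ssyk_to_text(record: dict) -> str:
    """Flatten one SSYK record into the text that gets embedded (one chunk per code)."""
    parts = [record.get("code", ""), record.get("title", ""), record.get("description", "")]
    return " - ".join(p for p in parts if p)

def embed(texts: list[str]) -> list[list[float]]:
    from openai import OpenAI  # lazy import so the pure helpers stay dependency-free
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def upsert_codes(index, records: list[dict]) -> None:
    """Embed every SSYK record and upsert it into a Pinecone index."""
    vectors = embed([ssyk_to_text(r) for r in records])
    index.upsert(vectors=[
        {"id": r["code"], "values": v, "metadata": {"title": r["title"]}}
        for r, v in zip(records, vectors)
    ])

def candidate_codes(index, user_text: str, k: int = 30) -> list[dict]:
    """Return the k nearest SSYK codes; pass only these to the chat model."""
    [query_vec] = embed([user_text])
    hits = index.query(vector=query_vec, top_k=k, include_metadata=True)
    return [{"code": m.id, "title": m.metadata["title"]} for m in hits.matches]
```

The final classification call then only sees the 30 candidates, not all 15,000 codes.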

This is probably the most elegant/reliable/accurate way to do it.

If you really just want to pass 15,000 lines, it’s static, so just paste it in a prompt and hope for the best.


Hey George :pray: thanks for the quick response! I’ve never played around with vector databases, so this will be interesting for sure. Do you think the free plan on Pinecone is fine for testing?

Is it extremely complicated to set up?

Or can I make function calls to OpenAI and give it access to the database? I have already uploaded all 6,000 records to it.

Or can I upload it to the Vector Store in the OpenAI Playground and fetch it from there?

The advantage of a vector database is that search is by semantics, not keywords.

So, searching ‘apple’ might return pear, orange, and pineapple, whilst searching ‘mug’ might return ‘glass’, ‘cup’, and ‘bottle’. They mean similar things. You can see how that’s useful for your job categorisation problem.
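A toy illustration of the idea: embeddings turn text into vectors, and “semantically similar” means the vectors point in similar directions, measured with cosine similarity. The 3-dimensional vectors below are made up for illustration; real embedding models return hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up vectors: "mug" and "cup" land close together, "tractor" far away
mug, cup, tractor = [0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]
assert cosine_similarity(mug, cup) > cosine_similarity(mug, tractor)
```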


I see, but I can’t seem to find any good tutorials on Pinecone. There isn’t really a “drop your file and it’s done” solution, I guess?

Adding to what @georgecollier said… you can also use Supabase to build your vector database; there are tons of videos on YouTube teaching how to do it. The good news is that it’s free and works very well. I do this for several AI agents I have, and the results are wonderful. I have huge vector databases and have never had any issues with responses.

If you have the opportunity to install n8n on a VPS, that’s even better, as you can define a number of other customizations and create your vector database in one click.


Cool, thanks for the addition! I did not know Supabase could be used for that, will take a look at that for sure!

I tested now with the OpenAI Playground’s Vector Store (not sure what the difference is; is this a vector database as well…?). The price I got for one API call was $0.004, which I guess is good? And the results it gave were all correct so far in my testing.
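For reference, my understanding from the docs (so double-check this): a Playground-created vector store is attached to a Responses API call via the file_search tool, not as an input_file. The store ID below is a placeholder:

```json
{
  "model": "gpt-4o-mini",
  "input": "Map this title to an SSYK code: Project Manager at a construction company",
  "tools": [
    {
      "type": "file_search",
      "vector_store_ids": ["vs_XXXXXXXX"]
    }
  ]
}
```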

But i will check Supabase for sure :slight_smile:

You can use Cloudflare Auto RAG. The free version is sufficient: you can upload your JSON and it will vectorize it automagically; then you get an endpoint with an AI layer for the classification. https://youtu.be/JUFdbkiDN2U?si=E-cFDsJn3XZpEtRF


Thanks @akamarski for the info! Taking a look at the video now :pray:

Hey @akamarski,

I’ve been testing the Cloudflare AutoRAG feature and it’s a great concept. However, I’ve encountered some unexpected behavior in the Playground that I’d like to get some clarification on.

I’ve noticed two main issues:

  1. Inaccurate Responses: Even when the documents should contain the correct information, the Playground sometimes provides incorrect answers. This suggests that it’s not always retrieving the most relevant data from the documents?
  2. General Knowledge Override: When I ask a question that is completely unrelated to the documents (e.g., “who started world war 1?”), the system first states that it can’t find the information in the documents, but then proceeds to answer the question using its pre-trained knowledge.

This second point is particularly concerning, as it seems the model is prioritizing its internal knowledge over the provided documents, even when the query is relevant to the document content. This could lead to a less reliable user experience, as the model may fail to use the specific information provided.

Is this the intended behavior, or is there a way to configure the system to strictly rely on the documents and refuse to answer questions that are outside of the provided context?

For example, when I ask “Return the code for project manager”, it returns “Project manager should fall under IT consultant”, even though there are 5 different project manager titles in the documents.

I uploaded them as .json; is that a dumb thing to do? Should it always be PDF or txt?

@georgecollier @rpetribu is this something that happens when using Pinecone as well?

My use case is different and I don’t really need concrete responses.

Typically, you should provide the guardrails, specific instructions, output format and other specifics with the prompt itself. Treat the entire prompt as a system prompt in popular models.
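A minimal sketch of what those in-prompt guardrails can look like; the wording, field names, and refusal format here are illustrative, not from any official template:

```python
# Hypothetical guardrail prompt: rules, output format, and a refusal path
# all live in the prompt itself.
SYSTEM_PROMPT = """\
You classify job descriptions into SSYK codes.

Rules:
- Use ONLY the SSYK entries in the CONTEXT block below.
- If no entry matches, reply exactly: {"error": "no_match"}.
- Never answer from general knowledge or respond to unrelated questions.

Output a JSON object with "ssyk_code" (4 digits) and "ssyk_occupation",
both copied verbatim from CONTEXT.
"""

def build_prompt(context_entries: list[str], user_text: str) -> str:
    """Combine guardrails, retrieved SSYK entries, and the user's text."""
    return f"{SYSTEM_PROMPT}\nCONTEXT:\n" + "\n".join(context_entries) + f"\n\nUSER:\n{user_text}"
```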


Okay, thank you. I will test around and hopefully find a solution for this :slight_smile:

From what you’re reporting, I believe your problem is related to two factors:

  1. Your RAG isn’t working properly;

While creating a RAG database isn’t a complex task (there are even tools that do it in one click), you need to keep in mind that there are different types of RAG (Default RAG, Reranked, Hybrid, etc.), and each one works best for different contexts (especially since they can be very complex).

Also, in this context, your RAG database may be poorly configured, with poorly divided chunks and incorrect overlap sizes, which can lead to your agent having difficulty finding the best answer.

  • How are you creating the RAG?
  • Are you using any delimiters to divide the chunks?
  • What is the overlap size of each one?
  2. Your prompt isn’t working properly;

Getting the correct answers is linked to both a well-structured knowledge base and a well-written prompt. However, ensuring that your agent operates as planned—avoiding certain topics, prioritizing others, etc.—is solely dependent on your prompt.

Are you building your agent within OpenAI?
Could you share your prompt?

Do you have experience with n8n? Because, honestly, in my opinion, there’s currently no better tool for building AI agents than this.


Hey @rpetribu, thanks again for commenting and also helping me. You’re teaching me new things with each comment, and I really appreciate it!

Right now I’m only testing on Cloudflare’s AutoRAG, by uploading the .json file that contains around 16,000 lines of SSYK codes and the corresponding titles, like 2520 - Project manager.

I assume the answer to those questions is no, since Cloudflare is doing all of that automatically?

Chunk size: 448 tokens
Chunk overlap: 20%

The plan is to use OpenAI, and honestly, maybe try Pinecone instead?

Prompt:
Provide a list of suitable SSYK codes (always 4 digits) and always the correct official name of the code you refer to that fits “Project Manager for a construction company, i was planning the projects and where teams should be”

The structure should be a list of all suggestions, only the SSYK (4 digits), the correct name of the SSYK, and a motivation for why you chose it.

Example:
3312: Bank clerks – This is the most appropriate code based on the description. Bank clerks handle customer transactions, respond to customer inquiries, and provide advice on banking services, which closely matches the work described.

Response:
*0: Operations Managers in Construction, Civil Engineering and Mining – This code is suitable as it involves planning and managing projects, and overseeing teams in the construction industry.

1370: Production Managers in Manufacturing – Although primarily focused on manufacturing, this code also involves planning and managing production processes, which can be applied to construction projects.

0: Sales and Marketing Managers – This code involves strategic planning and team management, which can be relevant to a project manager role in a construction company, particularly in terms of marketing and sales of construction services.

9: Other Civil Engineering Professions – This code encompasses a broad range of engineering professions, including those involved in construction project management, and may be suitable for a project manager with a civil engineering background.


Note: it completely ignored parts of the prompt and still gave strange numbers that have nothing to do with SSYK?

Update: I set up n8n in Docker and am running it locally for testing. After connecting it to a Supabase vector database, I’m getting 1000x better results. The mapping is now working correctly every single time! :folded_hands:

Now a question I have: there are around 10 more files/PDFs related to SSYK and different titles that I could upload, but could that make things worse / confuse the model?


:raising_hands:

That’s what I’m talking about!

That’s actually how I do it. With n8n, you can even add a free reranking tool to your Supabase RAG retrieval, which will make your RAG responses infinitely better.

Use Cohere for your reranking. It is free.
Just register and grab your API key. You don’t need to do ANY configuration…
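Roughly what that rerank step looks like with the Cohere Python SDK; this is my sketch, assuming the SDK is installed and COHERE_API_KEY is set, and you should check Cohere’s docs for the current model name:

```python
def rerank(query: str, documents: list[str], top_n: int = 5) -> list[str]:
    """Re-order candidate chunks by relevance to the query using Cohere's reranker."""
    import cohere  # lazy import: only needed when actually reranking
    co = cohere.ClientV2()  # reads COHERE_API_KEY from the environment
    resp = co.rerank(model="rerank-v3.5", query=query, documents=documents, top_n=top_n)
    # Each result carries the index of the original document it refers to
    return [documents[r.index] for r in resp.results]
```

You would call this on the chunks returned from Supabase before handing the survivors to the chat model.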

You have two options…

  1. Test it. Upload them and test them with more complex questions, analyzing the responses you receive. You probably won’t have any problems, as the data is very objective. It’s not a FAQ, for example, which is highly subjective in its interpretation.

  2. If you find that the responses have started to get worse, you can create an orchestrator agent and, below that agent, other more specific agents focused on answering questions on specific topics, each with its own knowledge base.


Cool, I need to check out reranking and Cohere; this is such a new world for me now, haha… And once again, a huge thank you, @rpetribu. This was really outside my comfort zone and knowledge, so getting this to work was really cool and motivating :folded_hands:


This topic was automatically closed after 70 days. New replies are no longer allowed.