How do I retrieve .docx and .pdf wordcounts?

Hi all

The application I am building needs to produce accurate wordcounts for uploaded Word documents and PDF files in multiple languages.

Currently I am only aware of one plugin, by Zeroqode, which I purchased a while back. It has the following two big flaws:

  1. It doesn’t read PDF wordcounts accurately, apparently because extra characters or something similar are being counted in the PDF. I tested this with the same text: the Word file showed the correct wordcount of 100 words, whereas the PDF copy of the same document produced a wordcount of 500 words.
  2. Even with uploaded Word documents, it does not show the proper wordcount for languages that use other scripts, such as Arabic and Russian. These results are way off (for example, it shows 1 word for an Arabic Word document containing 65 words).
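
For reference, counting words in Arabic or Russian needs nothing script-specific once the text is available as a Unicode string. Below is a minimal sketch in Python, not the plugin's method: `count_words` is a hypothetical helper that splits on whitespace and keeps tokens containing at least one letter or digit. It assumes scripts that separate words with spaces, so it would not work for Chinese, Japanese or Thai, and its result may differ slightly from Word's own counter for stray punctuation.

```python
import unicodedata


def count_words(text: str) -> int:
    """Count whitespace-separated tokens that contain a letter or digit.

    Hypothetical helper for illustration; assumes a space-delimited script.
    """
    count = 0
    # str.split() with no arguments splits on any Unicode whitespace.
    for token in text.split():
        # Unicode general categories starting with "L" are letters,
        # those starting with "N" are numbers, in any script.
        if any(unicodedata.category(ch)[0] in ("L", "N") for ch in token):
            count += 1
    return count


if __name__ == "__main__":
    print(count_words("Привет мир, как дела?"))
    print(count_words("مرحبا بكم في التطبيق"))
```

A token made up only of punctuation, such as a lone dash, is skipped on purpose.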

@ZeroqodeSupport @ZeroqodeTeam @levon have told me here on the forum that they cannot fix these issues in their existing plugin.

Having searched, I found no other plugins that could provide what I need. I am not sure whether any of the OCR plugins by @redvivi could give me exactly what I need?

If not, can anyone suggest the best way to achieve this, i.e. a wordcount retriever that is as accurate as possible for both .docx and .pdf uploads in all, or at least most, language scripts?

Thanks in advance


There are loads of third-party APIs that do this sort of thing and it’s a fairly simple API call to send a document and get back a result.

Have you looked further beyond Bubble plugins to see if there is an API out there that functions as you need it?

Josh @ Support Dept
Helping no-code founders get unstuck fast :rocket: save hours & ship faster with an expert :man_technologist: on-demand

Hi @josh24 Thanks for the reply

I realize APIs may be the necessary route, but it’s a bit daunting as I have no experience with them. Also, my use case is a bit more complicated than just giving wordcounts, as I need to achieve three things simultaneously:

  1. Upload files (most importantly doc(x) and pdf files, but also txt) to my Bubble database so the files themselves can be retrieved by users.
  2. Be able to give the wordcount of said files, no matter the language script (English, Russian, Arabic etc.).
  3. Be able to read/extract the actual text contained in those files for users to work on in my bubble app (in text editors).

If you or anyone else can guide me towards an API that can achieve all 3 of these, I would immensely appreciate it.
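
To show what the text-extraction and wordcount requirements involve for .docx specifically, here is a rough sketch in Python using only the standard library. It relies on the fact that a .docx file is a ZIP archive whose main text lives in `word/document.xml`. The function names `docx_to_text` and `word_count` are made up for this example, and the sketch reads the document body only (no headers, footers or footnotes). PDF extraction is not covered here and would need a separate library or service.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(data: bytes) -> str:
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper: body text only, no headers/footers/footnotes.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each run of text inside a paragraph is stored in a w:t element.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (assumes a space-delimited script)."""
    return len(text.split())
```

Usage would be along the lines of `text = docx_to_text(open("file.docx", "rb").read())` followed by `word_count(text)`; the extracted `text` is also what could be loaded into a text editor.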

Honestly, I think your only option for something like this is going to be finding one (maybe two) APIs that will cover all of these requirements.

Forget the integration part for a moment. I think you probably just need to hunt around and find a few suitable services that do all the things you need; you’re probably the best person to do this. Start with Google, or try finding APIs on sites like RapidAPI.

Hopefully you’re able to test those APIs on their sites (without integrating anything), using the kinds of files your users will be uploading, and see whether the result is what you need.

In my experience, parsing PDFs is a bit of a mixed bag depending on what you feed in. If the file is something the user has scanned (so it’s essentially a photo), then recognition of those files is pretty bad. The technology exists, e.g. Google and Apple do it really well, but the services I’ve tested all kinda suck and probably run on super old tech. If it’s a PDF created directly from text, such as an export from a docx file, then the results are usually pretty spot on.
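
One way to act on that distinction is to check how much text a PDF actually yields before trusting its wordcount. The sketch below is only a heuristic in Python: it assumes some PDF library or API has already returned one string per page, and both the name `looks_scanned` and the threshold of 20 characters per page are arbitrary choices for illustration.

```python
def looks_scanned(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF is image-only from the text extracted per page.

    Hypothetical heuristic: the threshold is an assumption, not a standard.
    """
    if not page_texts:
        # Nothing was extracted at all, so treat the file as image-only.
        return True
    # Count non-whitespace characters across all pages.
    nonblank = sum(len("".join(text.split())) for text in page_texts)
    return nonblank / len(page_texts) < min_chars_per_page
```

Files flagged this way could be routed to an OCR service, while the rest go straight to text extraction.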

Otherwise, you could look to pay a freelancer to do the research for you & then integrate it?



Ok thanks @josh24 I will do some research


Hey @phrase9, on my end I can offer an accurate multi-language word count that takes text as input.

The tricky part of your project is extracting the text from docx and PDF files, and that we can’t help you with.
