With this plugin, you can automatically extract text, data, and structure from scanned documents. Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.
WARNING: This service provides OCR-specialised operations based on a document as input. If you intend to detect text in an image such as a scene, please refer to the AWS Rekognition - Text Recognition plugin.
The AWS Textract service works in synchronous mode for PNG or JPEG files only, i.e. a request is sent and the response comes back right away within the same action.
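For anyone curious what the synchronous call looks like outside Bubble, here is a minimal Python sketch. It assumes a boto3 Textract client is passed in (for example `boto3.client("textract")`); the helper names `analyze_image_sync` and `line_texts` are made up for illustration and are not part of the plugin.

```python
from typing import Any, Dict, List


def analyze_image_sync(client: Any, image_bytes: bytes) -> List[Dict[str, Any]]:
    """Send one PNG/JPEG image and return the blocks from the same call.

    `client` is assumed to be a boto3 Textract client.
    """
    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["FORMS", "TABLES"],  # ask for form fields and tables too
    )
    return response.get("Blocks", [])


def line_texts(blocks: List[Dict[str, Any]]) -> List[str]:
    """Keep only the text of LINE blocks, in the order Textract returned them."""
    return [b["Text"] for b in blocks if b.get("BlockType") == "LINE"]
```

There is no job to track here: the blocks come back in the response of the single request, which is what "synchronous" means in this context.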
PDF files can be processed in asynchronous mode (a request is sent, processed by AWS, and the response is retrieved once the requestor has checked the job completion status). This requires you to own an AWS S3 bucket and to use AWS queue and notification services (SQS and SNS) to get notified, which adds a layer of complexity.
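To give an idea of the extra moving parts, here is a rough Python sketch of the asynchronous flow, assuming a boto3 Textract client and a PDF already uploaded to S3. The function names, the polling interval, and the choice to poll rather than consume the notification are assumptions made for illustration only.

```python
import time
from typing import Any, Callable, Dict, List


def start_pdf_job(client: Any, bucket: str, key: str,
                  sns_topic_arn: str, role_arn: str) -> str:
    """Start an asynchronous analysis of a PDF stored in S3 and return the job id."""
    response = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS", "TABLES"],
        # Textract publishes the completion status to this SNS topic.
        NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]


def collect_job_blocks(client: Any, job_id: str, poll_seconds: float = 5.0,
                       max_polls: int = 60,
                       sleep: Callable[[float], None] = time.sleep) -> List[Dict[str, Any]]:
    """Poll until the job finishes, then gather every page of results."""
    for _ in range(max_polls):
        response = client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status == "IN_PROGRESS":
            sleep(poll_seconds)
            continue
        if status != "SUCCEEDED":
            # FAILED and PARTIAL_SUCCESS are both treated as errors in this sketch.
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")
        blocks = list(response.get("Blocks", []))
        token = response.get("NextToken")
        while token:  # large documents are returned in several pages
            page = client.get_document_analysis(JobId=job_id, NextToken=token)
            blocks.extend(page.get("Blocks", []))
            token = page.get("NextToken")
        return blocks
    raise TimeoutError(f"Textract job {job_id} still running after {max_polls} polls")
```

Polling is used here only to keep the example short; in a real setup you would more likely wait for the notification and then fetch the results once.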
You can find an example of an asynchronous AWS request for another plugin in this editor. Should you require this implementation, we would be happy to investigate this possibility for you.
@redvivi
I actually need users of my web application to be able to upload a PDF to my system, and have all the data in that PDF saved in the database as attributes of one or more tables.
Is that possible?
Thanks
It is of course possible using both solutions in my previous response:
If you must use Amazon Web Services, please let us know; we would be happy to modify our plugin to do so, but keep in mind that AWS has a slightly more complex setup.
Should you want to use Amazon Web Services, please reach out to us directly via DM; we would be happy to customise our existing AWS Textract - OCR Text & Data plugin for you.
Just to let you know that we have updated the details in the plugin response and introduced PDF support, along with asynchronous requests, so you can build a comprehensive document structure, as showcased in our demo:
Oh, by the way, our demo now demonstrates how to process the OCR response for forms and tables in Bubble, especially mapping each value to its key, which is notoriously difficult as the AWS Textract response is quite convoluted.
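To illustrate why the mapping is fiddly: in the raw response, a form key and its value are separate KEY_VALUE_SET blocks, linked by id, and their text lives in yet more child WORD blocks. The Python sketch below shows one way to join them. It is not how the plugin does it internally, and `SAMPLE_BLOCKS` is a tiny hand-written stand-in for a real response.

```python
from typing import Any, Dict, List


def _child_text(block: Dict[str, Any], by_id: Dict[str, Dict[str, Any]]) -> str:
    """Join the text of a block's CHILD words (or checkbox selection status)."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def form_key_values(blocks: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map each form key's text to the text of the value block(s) it points to."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        values = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":  # link from the key block to its value block
                values.extend(_child_text(by_id[i], by_id) for i in rel["Ids"])
        pairs[_child_text(block, by_id)] = " ".join(v for v in values if v)
    return pairs


# Hand-written sample, trimmed to the fields the functions above read.
SAMPLE_BLOCKS = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
]

print(form_key_values(SAMPLE_BLOCKS))
```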
@redvivi Is there a way to filter/query results using Bubble filters on the sync module? We are trying to analyze documents from different countries and need to find specific elements within the documents, and the only “logical” way we can think about this is to filter the results by comparing them to the expected entries.
I would suggest referring to the demo editor: look at the element named “Example of first form’s value extraction” and its associated filters, which match against an existing value.
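The same matching idea, expressed in Python rather than Bubble filters, could look like the sketch below. It assumes you already have the form as a key-to-value mapping; the `expected` table of label variants per country and both function names are hypothetical.

```python
import re
from typing import Dict, List


def normalise(label: str) -> str:
    """Lower-case a label and collapse punctuation/spacing so variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def pick_expected(pairs: Dict[str, str], expected: Dict[str, List[str]]) -> Dict[str, str]:
    """For each wanted field, return the value of the first label variant found."""
    lookup = {normalise(key): value for key, value in pairs.items()}
    found = {}
    for field, variants in expected.items():
        for variant in variants:
            value = lookup.get(normalise(variant))
            if value is not None:
                found[field] = value
                break  # stop at the first variant that matches
    return found


# Made-up extraction result and label variants for two languages.
pairs = {"Name:": "Jane Doe", "Date de naissance :": "01/02/1990"}
expected = {
    "name": ["Name", "Nom"],
    "birth_date": ["Date of birth", "Date de naissance"],
}
print(pick_expected(pairs, expected))
```

Keeping the variants per field in one table means a new country only needs new labels, not new filter logic.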
Should you wish to explore Textract’s “Query” feature in synchronous operation, it is not supported yet, but we are happy to have a look at your custom use case.
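For reference, this is roughly what a query request looks like against the AWS API directly (not through the plugin). It is a sketch that assumes a boto3 Textract client is passed in; `query_answers` and the alias-to-question mapping are illustrative names.

```python
from typing import Any, Dict


def query_answers(client: Any, image_bytes: bytes, questions: Dict[str, str]) -> Dict[str, str]:
    """Ask natural-language questions about one image; return alias -> answer text.

    `questions` maps a short alias to the question text.
    """
    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["QUERIES"],
        QueriesConfig={
            "Queries": [{"Text": text, "Alias": alias} for alias, text in questions.items()]
        },
    )
    blocks = response.get("Blocks", [])
    by_id = {b["Id"]: b for b in blocks}
    answers = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        alias = block["Query"].get("Alias") or block["Query"]["Text"]
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":  # points at QUERY_RESULT blocks
                answers[alias] = " ".join(by_id[i]["Text"] for i in rel["Ids"])
    return answers
```

Questions with no answer in the document simply do not appear in the returned mapping.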