Talking Avatar with Lip-Sync and Vision Capabilities

So I’ve been working on a project recently: an interactive avatar with real-time features such as streaming audio with lip-sync data (phonemes), plus vision support, to produce a digital representation of yourself (or some other character). I’ve come up with a demo that manages to do this. Because it uses AI, it can also pick a specific facial emotion (48 of them in total) for every message you send, and the avatar’s face temporarily changes to reflect that emotion while it’s still speaking.
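To give a feel for the emotion behaviour, here’s a minimal sketch of the idea: pick a mood per message, hold it while the avatar speaks, then return to neutral. All names here (`setMood`, `speak`, the shortened emotion list) are illustrative assumptions, not the plugin’s actual API:

```javascript
// Hypothetical sketch of the per-message emotion flow.
// The real demo supports 48 emotions; this list is trimmed for illustration.
const EMOTIONS = ["neutral", "happy", "sad", "angry", "surprised"];

function pickEmotion(tag) {
  // Fall back to neutral if the AI returns an unknown label.
  return EMOTIONS.includes(tag) ? tag : "neutral";
}

async function speakWithEmotion(avatar, text, emotionTag) {
  const emotion = pickEmotion(emotionTag);
  avatar.setMood(emotion);        // temporary facial expression (assumed API)
  await avatar.speak(text);       // assumed to resolve when playback ends
  avatar.setMood("neutral");      // restore the default face afterwards
}
```

The key point is that the emotion is transient: it only lasts for the duration of the spoken message.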

This is mainly for use with AI applications where text-to-speech is involved.

It works relatively well in its current state, although there are a few tweaks I shall probably make within the coming weeks.

You can play with the demo here: avatar lip-sync demo


How it works…

It relies on a few things. You need to provide a GLB file (this is the avatar); a default one is included. The main library that runs this requires the GLB file to be based on the ReadyPlayerMe specs, as those avatars contain additional data to help with the movements.

You can use any of these utilities to create a customized avatar.


There is only one text-to-speech service supported right now, and that’s Cartesia, because of its ability to produce phonemes (symbols representing units of sound) along with timestamps marking when each sound occurs. These are needed to map the correct mouth positions to the spoken text.
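As a rough sketch of that mapping step: each timestamped phoneme gets translated into a mouth shape (viseme) with a start time and duration, which the avatar can then animate against the streaming audio. The phoneme-to-viseme table and the payload shape below are illustrative assumptions, not Cartesia’s actual schema:

```javascript
// Illustrative sketch: turn timestamped phonemes into viseme keyframes.
// Table and field names are assumptions, not Cartesia's real output format.
const PHONEME_TO_VISEME = {
  p: "PP", b: "PP", m: "PP",   // lips pressed together
  f: "FF", v: "FF",            // lower lip against upper teeth
  a: "aa", e: "E", i: "I",
  o: "O", u: "U",
};

function phonemesToVisemes(phonemes) {
  // phonemes: [{ phoneme: "p", start: 0.0, end: 0.5 }, ...] times in seconds
  return phonemes.map(({ phoneme, start, end }) => ({
    viseme: PHONEME_TO_VISEME[phoneme] ?? "sil", // unknown -> rest/silence
    start,
    duration: end - start,
  }));
}
```

In practice the keyframes would drive the avatar’s morph targets, interpolating between shapes so the mouth doesn’t snap between positions.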

If you enable your camera, it will also be able to see you by analysing captured images.

Have a play, it’s not perfect but there’s room for improvement.

Here’s the editor if you’re interested:
paul-testing-1 | Bubble Editor

Paul

Uhm, did you create this technology? Is it open-source code, or is this an API integration? Can you also give me the pricing page if this is a 3rd-party service? Obviously the technology itself could be better with a bigger budget, but I’d love to keep an eye on this for one application I’m responsible for at the moment.

Also you could probably go and raise a pre-seed round with this demo if this is your code.

Amazing work once again, Paul.
Thanks for implementing this, it’s working perfectly for us.

Keep up the amazing development :flexed_biceps::folded_hands::fire:


Hi,

So I didn’t create the tech behind it, but I have customized it to make it work within the plugin it’s currently incorporated into.

There are no additional costs as it’s open source, and the AI uses Groq with a Llama model for the demo. The only part that isn’t free is the text-to-speech service, Cartesia. New signups with them get 20,000 free credits to play around with, but they are amazingly good (better than ElevenLabs, some may say…) and a lot cheaper.

Hope that helps!
Paul

Cheers Timbo!


Well, it is open source, but it required a few tweaks to get it working in a Bubble plugin. I’ll have to get my brain into gear and come up with a proper use for it.


Okay thanks for the info.

Great work!


We now have a service…

It’s still in development, but functionally it’s all working. Over the next week it should have all the documentation completed, along with the other bits currently on my list.

Anyone interested in testing it, let me know and I’ll grant you free access.