May 2, 2025

Three Ways to Enhance a Realtime Vision Chatbot

Author: James Zhang

I've been spending a lot of time interacting with various AI systems lately, and thinking about some potential ways to make real-time AI communication feel less artificial:

One core idea I've been exploring revolves around how we humans process context. When we chat, we're aware of our surroundings – the room we're in, the social situation – but we don't usually comment on it unless it's relevant. It's just background information that shapes how we talk, which I call the peripheral context. At the same time, we're tracking the actual conversation and responding directly to what's being said, which I call the immediate context. Current AI seems to lump everything together; any input, visual or auditory, is treated as something needing an explicit reaction. My concept is an AI assistant that can differentiate between the two, like us. One stream would process the environment, letting that awareness subtly inform the assistant's style and understanding, while another stream focuses solely on the direct conversation. It wouldn't constantly point out the picture on the wall unless asked; it would simply know it's there, allowing for a smoother, more intuitive interaction.
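To make the split concrete, here is a minimal Python sketch of how the two streams might be kept apart. Every name here (PeripheralContext, ImmediateContext, build_prompt) and the rolling-window size are my own placeholders for illustration, not an existing API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PeripheralContext:
    """Background awareness: updated continuously, never narrated unprompted."""
    scene_summary: str = ""
    recent_observations: List[str] = field(default_factory=list)

    def update(self, observation: str) -> None:
        # Keep only a compact rolling window; this stream shapes tone, not content.
        self.recent_observations.append(observation)
        self.recent_observations = self.recent_observations[-5:]
        self.scene_summary = "; ".join(self.recent_observations)

@dataclass
class ImmediateContext:
    """The live dialogue the assistant actually responds to."""
    turns: List[str] = field(default_factory=list)

    def add_turn(self, utterance: str) -> None:
        self.turns.append(utterance)

def build_prompt(peripheral: PeripheralContext,
                 immediate: ImmediateContext,
                 user_input: str) -> str:
    # The peripheral stream is folded in as silent background;
    # only the immediate stream is addressed directly.
    return (
        f"[background, do not mention unless asked: {peripheral.scene_summary}]\n"
        + "\n".join(immediate.turns)
        + f"\nUser: {user_input}\nAssistant:"
    )
```

The point of the sketch is only the separation: the environment feed updates one structure, the dialogue feeds the other, and the prompt builder decides how much of the background actually leaks into a response.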

Then there's the rhythm of conversation. We don't wait for someone to completely finish their sentence before our brain starts working on a reply. We're processing, predicting, and formulating our response almost instantly, adjusting as the speaker continues. It’s this dynamic back-and-forth that makes dialogue flow. AI today mostly waits patiently for the entire input, processes it, and then delivers a complete response. It feels very stop-start. An approach I envision is for AI to start crafting a response the moment we begin speaking, continually refining it as more words come in. It might even learn when it's appropriate to interject or overlap slightly, just like we do. It sounds technically challenging, requiring a whole new way of building these systems, but the payoff in terms of naturalness could be immense.
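A rough sketch of that incremental loop, assuming a streaming speech recognizer that emits growing partial transcripts; draft_reply is a stand-in for whatever fast model would actually produce and refine the draft:

```python
from typing import Iterator

def draft_reply(partial_transcript: str) -> str:
    # Placeholder for a fast draft model; a real system would call an LLM here.
    return f"(draft for: '{partial_transcript}')"

def incremental_responder(partial_transcripts: Iterator[str]) -> str:
    """Refine a working draft on every new speech fragment instead of waiting
    for the full utterance, so the reply is ready the moment the speaker stops."""
    draft = ""
    for partial in partial_transcripts:
        draft = draft_reply(partial)   # re-draft against the latest partial transcript
    return draft                       # end of stream == end of turn in this sketch

# Fragments roughly as a streaming ASR system might emit them.
fragments = iter(["what's the", "what's the weather", "what's the weather like today"])
print(incremental_responder(fragments))
```

A real version would also need turn-taking logic (deciding when a pause means the user is done, or when interjecting is acceptable), which is exactly the hard part the paragraph above points at.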

And for AI that sees the world, there's the challenge of visual information overload. Our eyes are incredibly efficient. We don't meticulously re-scan every detail of a static scene; we focus on what's new or different. It's a built-in filter. Why should a visual AI constantly process unchanging video frames? A smarter approach I thought of would be for the AI to primarily look for changes – has an object moved significantly? Is there new text to read? A lightweight system (YOLO + Mistral OCR) could run locally, maybe on the device itself, just comparing frames and only sending the truly new information to the LLM for deeper analysis. This feels like mimicking our own selective attention, making the system faster, less demanding, and better able to react quickly when something important does change.
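Here is a hedged sketch of just the frame-filtering step, using OpenCV's absdiff. The change threshold is an arbitrary assumption, and handle_new_information is a hypothetical hook standing in for the local detector/OCR pass and the eventual LLM call:

```python
import cv2
import numpy as np

CHANGE_RATIO = 0.02  # assumed threshold: fraction of pixels that must change

def frame_changed(prev_gray: np.ndarray, curr_gray: np.ndarray,
                  ratio: float = CHANGE_RATIO) -> bool:
    """Cheap per-frame filter: report whether enough pixels differ
    between consecutive grayscale frames to justify deeper analysis."""
    diff = cv2.absdiff(prev_gray, curr_gray)
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    return np.count_nonzero(mask) / mask.size > ratio

def handle_new_information(frame: np.ndarray) -> None:
    # Stub: in a real pipeline this is where a YOLO detector and an OCR model
    # would run locally, with only their results forwarded to the LLM.
    print("change detected; sending frame for deeper analysis")

def watch(camera_index: int = 0) -> None:
    cap = cv2.VideoCapture(camera_index)
    ok, frame = cap.read()
    if not ok:
        return
    prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if frame_changed(prev_gray, gray):
            handle_new_information(frame)
        prev_gray = gray
    cap.release()
```

Everything above the hook is cheap enough to run on-device at frame rate; the expensive models only ever see the small fraction of frames where something actually changed.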

These are not just ideas. I am currently working on some of them.