Request for product: voice-based dev environment
Here's my hacked-together, messy, voice-based dev environment:
Voice-driven loop with screenshotting, so the LLM in the loop can see what's in my terminal and editor. The prompt varies depending on what I'm trying to drive with this loop.
A few tool definitions that give read access to files and URLs.
A tool the LLM can send a block of output to; it generates keyboard events, so the LLM can drive any editor or terminal.
A separate process that watches a directory and constantly makes LLM-driven git commits (git autosave). Rough Python sketches of these pieces below.
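A minimal sketch of the screenshot half of the loop, assuming the OpenAI Python SDK with a vision-capable model (the model name, prompt, and 5-second interval are all placeholders; the real loop is driven by voice turns, not a timer):

```python
import base64
import io
import time

from PIL import ImageGrab      # pip install pillow
from openai import OpenAI      # pip install openai

client = OpenAI()

def screenshot_b64() -> str:
    # Grab the full screen and encode it as a base64 PNG for the model.
    img = ImageGrab.grab()
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

def ask_about_screen(prompt: str) -> str:
    # Send the current screen plus a task-specific prompt to the LLM.
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64()}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

while True:
    print(ask_about_screen("What's in my terminal and editor right now?"))
    time.sleep(5)
```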
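The read tools and the keyboard tool are each just a function; here's a sketch using pyautogui for the synthetic keystrokes (how you register these as tool definitions depends on your LLM client):

```python
import pathlib
import urllib.request

import pyautogui  # pip install pyautogui

def read_file(path: str) -> str:
    # Read-only access to local files.
    return pathlib.Path(path).read_text()

def read_url(url: str) -> str:
    # Read-only access to URLs (docs pages, CI logs, etc.).
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def send_keystrokes(text: str) -> None:
    # Generate real keyboard events, so the LLM can drive whatever
    # editor or terminal currently has focus.
    pyautogui.write(text, interval=0.01)
    pyautogui.press("enter")
```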
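And the git autosave piece, sketched with watchdog for the directory watching. In the real thing the commit message comes from an LLM summarizing the staged diff; summarize_diff below is a trivial stand-in:

```python
import subprocess
import time

from watchdog.events import FileSystemEventHandler  # pip install watchdog
from watchdog.observers import Observer

REPO = "."

def summarize_diff(diff: str) -> str:
    # Stand-in: the real version asks an LLM to describe the diff.
    first_line = diff.splitlines()[0] if diff else "autosave"
    return f"autosave: {first_line[:60]}"

def autosave_commit() -> None:
    subprocess.run(["git", "add", "-A"], cwd=REPO)
    staged = subprocess.run(["git", "diff", "--staged"],
                            cwd=REPO, capture_output=True, text=True).stdout
    if staged:
        subprocess.run(["git", "commit", "-m", summarize_diff(staged)], cwd=REPO)

class AutosaveHandler(FileSystemEventHandler):
    def on_any_event(self, event):
        # Ignore git's own churn; a real version would also debounce.
        if ".git" not in event.src_path:
            autosave_commit()

observer = Observer()
observer.schedule(AutosaveHandler(), REPO, recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```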
I have some pieces of this running most of the time. But I'm lazy, and doing other stuff, and I also try to use a variety of editors and tools, to see what's good lately. Which ... no stability, so my hacked-together stuff is always broken.
I don't want to replace @Windsurf / @Cursor / Claude code. A seriously good agent and expert-system dev toolkit is a lot of work.
What I want is a conversational voice layer that I can use with any dev environment, in the same way that I can use version control with any dev environment.
I don't have time to focus on this. But I can help and @Pipecat (our open-source project) has all the voice loop orchestration and model/service abstraction pieces. Who wants to build this?
Replies
So, a voice-based system that screenshots the IDE periodically and feeds the images into an LLM so it knows what's going on. Sounds cool, but what about the high cost of this (assuming you send the images directly to the LLM)? And if you use OCR instead, it will need to be really fast. How will it deal with, for example, seeing only some of the code on the screen and needing to scroll for the rest? Just some thoughts lol, sounds fun.
Daily.co
@whisk My really basic version could only see what code and terminal content was on the screen, and it was hit-or-miss whether the right stuff was in the context when needed! It's easy to imagine adding the visual capabilities to a system like Windsurf, Claude Code, or the new OpenAI coding agent, though. Then you'd have an "agent" that has full access to the repo you're working in plus the extra feedback from the screen capture.
Regarding cost: LLM inference costs keep dropping so fast that for a lot of things I think it makes sense to build as if costs are basically $0, because they're trending that way. Gemini 2.0 Flash, for example, is crazy cheap now.
Paige from Google made this argument in our voice AI meetup last night: https://www.youtube.com/live/39UsGXgufxA?feature=shared&t=1865
@kwindla 'Then you'd have an "agent" that has full access to the repo you're working in plus the extra feedback from the screen capture.' What can the screen capture give that the repo doesn't? Unless it's outside the IDE. I mean, the voice part is okay; it's the data retrieval I'm questioning. Feels like there are more ways than that.
Daily.co
@whisk I think you often have things you're doing that are hard to completely integrate into the IDE. For example, I usually have multiple terminal windows and web browsers open. Some of the web browsers are showing documentation, for example. I'm thinking of the screen as the "universal API" for getting as much context about the development environment as possible.
@kwindla I see
Product Hunt
Ooh, I like this idea. I've been using Wispr Flow and more recently @Aqua Voice to vibe code via transcription, but for more serious engineering, I feel like the screenshotting + voice loop could be a really powerful multi-app approach to having a coding agent.
Someone should build this and launch on Product Hunt!
Product Hunt
@rajiv_ayyangar I've been using Aqua as well, but for emails. It makes responding about 2x faster. Although I still type at the cafe so I don't sound like a psycho. :P
Daily.co
@rajiv_ayyangar @rrhoover I've become such a believer in voice interfaces that I've started to wonder when we're all just going to talk even at the cafe, because we don't really know how to (or want to) use our devices any other way. Like how, when you ride the subway in NYC now, there are people watching TikTok and YouTube without headphones.
Or, maybe more optimistically, we'll get some kind of subvocalization input device soon.
Product Hunt
@rrhoover @kwindla There are a few companies working on that. Actually, @Wispr Flow is pretty decent at a barely-audible-whisper sound level.
Been thinking about this ever since I sprained my wrist. I tried voice-typing code and it was a nightmare. Having an LLM-driven environment would’ve saved me.
Daily.co
@miklesh_pal A long time ago I had really bad wrist pain from typing too much. I used Dragon NaturallySpeaking, which I think was the best voice input at the time. It sort of, kind of, worked, but it was way slower than typing in terms of my overall productivity. I do think that now a well-designed voice-first programming environment could be faster than typing!
Deepgram
Hey @kwindla! I’m a PM at Deepgram, and my team is building a voice OS for developers. What you’re describing aligns closely with our vision! Would love to chat further and/or keep you updated on our progress :)