Mr. Bones was implemented as a distributed system of four specialized processes running on a Raspberry Pi, communicating through POSIX message queues and pipes to enable real-time conversation and physical animation.
System Architecture
The system is composed of four independent processes that work in concert:
Jaw Controller
This process handles the physical actuation of the animatronic's jaw. It opens a POSIX message queue at /jaw-control and listens for single-byte commands in a simple binary protocol:
0x01 → Start talking (jaw movement)
0x00 → Stop talking (jaw at rest)
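A minimal sketch of what that receive loop might look like, assuming the queue is created with one-byte messages; the servo calls are hypothetical stand-ins for the actual actuation code (build with -lrt):

```c
#include <fcntl.h>
#include <mqueue.h>
#include <stdint.h>
#include <stdio.h>

#define JAW_QUEUE "/jaw-control"

int main(void) {
    struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 1 };
    mqd_t q = mq_open(JAW_QUEUE, O_CREAT | O_RDONLY, 0644, &attr);
    if (q == (mqd_t)-1) { perror("mq_open"); return 1; }

    uint8_t cmd;
    for (;;) {
        /* Block until the speech process sends a one-byte command. */
        if (mq_receive(q, (char *)&cmd, sizeof cmd, NULL) == -1)
            continue;
        if (cmd == 0x01)
            ; /* start_jaw_motion();  hypothetical servo call */
        else if (cmd == 0x00)
            ; /* stop_jaw_motion();   hypothetical servo call */
    }
}
```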
Neck Controller
This process manages the neck's motion states. It listens on a message queue at /neck-control and responds to the following single-byte commands:
0x00 → Idle: glance around at random angles (with added noise) every 2-8 seconds
0x01 → Listening: face straight ahead, with slight movement every 5 seconds
0x02 → Thinking: look upward in thought, with subtle head movements
Speech Process
This is the central coordination process that uses Vosk for speech-to-text recognition. It manages a state machine with two primary states:
- Wake Word Detection: Listens continuously for "Hey Mr Bones"
- Conversation Mode: Captures user input after wake word detection
When speech is detected, it sends the transcribed text to the LLM process through a pipe and coordinates the physical response by sending the appropriate commands to the jaw and neck controllers. When the LLM responds, it synthesizes the reply with Piper text-to-speech and plays the audio through the speakers.
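A sketch of what that coordination step might look like, assuming the LLM child's stdin/stdout are already wired to llm_in/llm_out and both queues are open; speak_with_piper() is a hypothetical stand-in for Piper synthesis and playback, and the single blocking fgets() is a simplification of the token-by-token streaming described under Performance Optimization:

```c
#include <mqueue.h>
#include <stdint.h>
#include <stdio.h>

static void send_cmd(mqd_t q, uint8_t cmd) {
    /* One-byte commands, matching the jaw/neck protocols above. */
    mq_send(q, (const char *)&cmd, sizeof cmd, 0);
}

void handle_user_utterance(const char *text, FILE *llm_in, FILE *llm_out,
                           mqd_t jaw_q, mqd_t neck_q) {
    send_cmd(neck_q, 0x02);            /* Thinking pose while the LLM works */
    fprintf(llm_in, "%s\n", text);     /* Hand the transcript to the LLM */
    fflush(llm_in);

    char reply[4096];
    if (fgets(reply, sizeof reply, llm_out) != NULL) {
        send_cmd(jaw_q, 0x01);         /* Jaw moves while audio plays */
        /* speak_with_piper(reply);       hypothetical TTS + playback */
        send_cmd(jaw_q, 0x00);         /* Jaw back at rest */
    }
    send_cmd(neck_q, 0x01);            /* Return to listening pose */
}
```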
LLM Backend
Built using llama.cpp with calme-3.1-llamaloi-3b-Q6_K.gguf as the backing pretrained model, this process reads conversation input on stdin (redirected from the speech process via a pipe) and streams its response token-by-token on stdout. The model was chosen for its balance of response quality and resource usage, which fits the Raspberry Pi's limited CPU and memory.
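One plausible way to wire this up from the speech process's side is a classic fork/exec with two pipes. The binary name and flags below are assumptions, since llama.cpp's CLI varies by version:

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Spawns the LLM child process; on success, *in is the write end feeding
   its stdin and *out is the read end of its stdout token stream. */
pid_t spawn_llm(FILE **in, FILE **out) {
    int to_child[2], from_child[2];
    if (pipe(to_child) == -1 || pipe(from_child) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {
        dup2(to_child[0], STDIN_FILENO);    /* child reads prompts on stdin */
        dup2(from_child[1], STDOUT_FILENO); /* child streams tokens on stdout */
        close(to_child[1]);
        close(from_child[0]);
        /* Hypothetical invocation; the actual binary name and flags depend
           on the llama.cpp build being used. */
        execlp("./llama-cli", "llama-cli",
               "-m", "calme-3.1-llamaloi-3b-Q6_K.gguf", (char *)NULL);
        _exit(127);
    }
    close(to_child[0]);
    close(from_child[1]);
    *in  = fdopen(to_child[1], "w");
    *out = fdopen(from_child[0], "r");
    return pid;
}
```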
Performance Optimization
To achieve real-time conversation with minimal latency, the system employs several optimization strategies:
- Decoupled speech recognition and LLM generation into separate processes
- Token-by-token streaming from the LLM to enable overlapping speech synthesis (see the sketch after this list)
- Lightweight binary protocols for inter-process communication
- Predefined personality constraints to reduce LLM computation complexity
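As an illustration of the streaming point, the reader side can flush each completed sentence to TTS as soon as it arrives, rather than waiting for the full reply; speak_with_piper() is again a hypothetical stand-in for the Piper call:

```c
#include <stdio.h>

/* Reads the LLM's stdout one character at a time and hands each completed
   sentence to TTS immediately, so speech playback can begin while later
   tokens are still being generated. */
void stream_reply_to_tts(FILE *llm_out) {
    char buf[1024];
    size_t len = 0;
    int c;
    while ((c = fgetc(llm_out)) != EOF) {
        if (len < sizeof buf - 1)
            buf[len++] = (char)c;
        /* Flush at sentence boundaries. */
        if (c == '.' || c == '!' || c == '?' || c == '\n') {
            buf[len] = '\0';
            /* speak_with_piper(buf);  hypothetical synthesis + playback */
            len = 0;
        }
    }
}
```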
The complete system successfully ran on Halloween, engaging in real-time conversations with my friends and family. It was pretty awesome :)