# Simple Speech-To-Text on the '10 cents' CH32V003 Microcontroller
This program will read audio from an analog microphone
and distinguish between the spoken digits 'zero' to 'nine',
logging the results over a serial link.
I use a MAX4466 electret microphone amplifier board connected to
a CH32V003 development board, along with a WCH-LinkE adaptor for both programming
the chip and acting as a UART-to-USB converter to read the output.
## Compiling
The code is based on the excellent [ch32v003fun](https://github.com/cnlohr/ch32v003fun)
library, and should be compiled in a similar way to the examples there.
(i.e. this directory should be added to the 'examples/' directory
alongside the others)
## Pin Connections
|Link-E Adaptor|CH32V003 Dev Board|
| --- | --- |
| 3V3 | VCC |
| GND | GND |
| SWDIO/TMS | PD1 |
| RX | PD5 |

|MAX4466 Microphone|CH32V003 Dev Board|
| --- | --- |
|VCC | VCC |
| GND | GND |
| OUT | PD4 |
![test](ch32v003_with_max4466_mic.jpg)
## Accuracy
I'm getting about 90% accuracy identifying spoken digits with the code
as it stands. As with all R&D, there's always scope for improvement.
## Training / Development
In training/ there is a stripped-down ch32v003 program to just dump
8-bit raw audio samples over the serial link, and a desktop-linux program
to read that data and do training of spoken words, simulating the
integer-pipeline used in the main code for the audio processing.
In misc/ are C programs that generate:
- the FFT 'twiddles' integer sin/cos tables used in the FFT function
(twiddles_RES13.h)
- the FFT-to-Mel matrix (mel_mx.h)
- the discrete-cosine-transform matrix (dctm_20x8_8bit.h)
- the spoken-numbers codebook (codebook.h)
## Challenges
- Only 16K of storage and 2K of RAM are available, which leaves little room for lookup tables,
codebooks and matrices for processing (the binary is currently about 10K).
- The ADC on the CH32V003 is only 10-bit as opposed to the usual 16-bit
(or even 12-bit) minimum for good quality audio.
This is compensated for somewhat by 8x oversampling/averaging.
I can feed the sampled audio into the 'Whisper' STT application
and get accurate general transcription done, so the quality is certainly
'good enough'.
- No I2S support so using a higher-quality digital microphone is not an option.
- No floating-point support, all code must use integer/fixed-point math
- No hardware-multiply or hardware-divide - so audio processing routines have
to be lean and fast to run within the 48MHz / 100fps timing constraints
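To illustrate the integer-only constraint, here is a minimal sketch of a fixed-point multiply. It assumes 13 fractional bits, which is only my guess at what "RES13" in twiddles_RES13.h stands for; `FRAC_BITS`, `FIX_ONE` and `fix_mul` are hypothetical names, not taken from the project code. On a core without a multiply instruction, the compiler turns the `*` into a call to a software routine, which is why every multiply in the inner loops costs real time.

```c
#include <stdint.h>

/* Hypothetical fixed-point format: 13 fractional bits (1.0 == 8192). */
#define FRAC_BITS 13
#define FIX_ONE   (1 << FRAC_BITS)

/* Multiply a sample by a fixed-point coefficient, rounding to nearest.
 * Assumes a 16-bit sample and |coeff| <= FIX_ONE so the product fits in
 * 32 bits, and an arithmetic right shift for negative values (true for GCC). */
static int32_t fix_mul(int32_t sample, int32_t coeff)
{
    int32_t prod = sample * coeff;
    return (prod + (FIX_ONE >> 1)) >> FRAC_BITS;
}
```

For example, `fix_mul(1000, 4096)` scales a sample by 0.5.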
## How it works
The code uses traditional (albeit pared down) MFCC feature extraction,
and simply compares buffered tensors of samples against pre-recorded
spoken digits to get a 'best match'.
A timer is set up to generate an interrupt around 50,000 times per second. On receiving
an interrupt, the ADC is read and the next sample conversion is started.
8 consecutive samples are averaged to generate a ~6400 samples/sec audio stream.
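The 8x averaging step can be sketched roughly as below. This is an illustration under assumptions, not the project's actual interrupt handler: `decimate_push` and its state variables are hypothetical names, and the hardware-specific part (reading the ADC and restarting the conversion) is left out.

```c
#include <stdint.h>

#define OVERSAMPLE 8   /* raw ADC readings averaged per output sample */

static uint16_t acc;        /* running sum; 8 * 1023 fits easily in 16 bits */
static uint8_t  acc_count;  /* readings accumulated so far */

/* Feed one raw 10-bit ADC reading. Returns 1 and writes *out when an
 * averaged sample is ready, otherwise returns 0. */
static int decimate_push(uint16_t adc_reading, uint16_t *out)
{
    acc += adc_reading;
    if (++acc_count < OVERSAMPLE)
        return 0;
    *out = acc / OVERSAMPLE;   /* power-of-two divide, compiles to a shift */
    acc = 0;
    acc_count = 0;
    return 1;
}
```

Called once per timer interrupt, this produces one output sample for every 8 readings, so an interrupt rate of 51,200/sec would give exactly 6400 samples/sec.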
Every 64 samples (10ms), a 128-wide FFT of a buffer of the last 128 samples is
performed and 20 mel-scale frequency bins are calculated from that. The
mel bin energies are converted to log2-scale.
Finally an 8-bin cepstrum is calculated via a DCT of the 20 log-mel bins.
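The log2 and DCT steps might look something like the sketch below. Everything here is an assumption made for illustration: `log2_q4` is a crude integer log2 with 4 fractional bits (linear interpolation between powers of two), `dct_apply` takes the matrix as a flat row-major array of 8-bit coefficients, and `DCT_SHIFT` assumes the coefficients are scaled by 128. The real table lives in dctm_20x8_8bit.h and its layout and scaling may differ.

```c
#include <stdint.h>

#define N_MEL     20
#define N_CEPS    8
#define DCT_SHIFT 7   /* hypothetical: DCT coefficients scaled by 128 */

/* Approximate log2(x) in fixed point with 4 fractional bits.
 * Returns 0 for x == 0 to avoid a special case in the caller. */
static int32_t log2_q4(uint32_t x)
{
    if (x == 0)
        return 0;
    int32_t msb = 0;
    for (uint32_t t = x; t >>= 1; )
        msb++;                                   /* index of highest set bit */
    uint32_t frac = (msb >= 4) ? (x >> (msb - 4)) : (x << (4 - msb));
    return (msb << 4) | (int32_t)(frac & 0xF);   /* 4 bits below the MSB */
}

/* ceps[k] = sum over j of m[k][j] * logmel[j], scaled back down.
 * m is a flat row-major N_CEPS x N_MEL matrix of 8-bit coefficients. */
static void dct_apply(const int8_t *m, const int32_t *logmel, int32_t *ceps)
{
    for (int k = 0; k < N_CEPS; k++) {
        int32_t sum = 0;
        for (int j = 0; j < N_MEL; j++)
            sum += (int32_t)m[k * N_MEL + j] * logmel[j];
        ceps[k] = sum >> DCT_SHIFT;   /* arithmetic shift assumed for negatives */
    }
}
```

The matrix multiply is only 160 multiply-accumulates per frame, which is small next to the FFT.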
When the 'energy' of a frame (sum of all mel levels) is above a threshold, it is added to a 'word'
buffer, otherwise a count of 'silence' frames is increased.
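A rough sketch of that energy gate, under stated assumptions: the threshold, the buffer size, the number of silent frames that ends a word, and all the names (`word_buf_t`, `word_push`, `word_reset`) are hypothetical. Ignoring leading silence and resetting the silence count when speech resumes are also my guesses at sensible behaviour, not something confirmed by the description above.

```c
#include <stdint.h>
#include <string.h>

#define N_CEPS        8
#define MAX_FRAMES    100   /* hypothetical cap: 1 second of speech at 100fps */
#define ENERGY_THRESH 200   /* hypothetical threshold on the summed mel levels */
#define END_SILENCE   20    /* hypothetical: 200ms of silence ends a word */

typedef struct {
    int8_t  frames[MAX_FRAMES][N_CEPS];
    uint8_t n_frames;   /* speech frames stored so far */
    uint8_t silence;    /* consecutive silent frames since the last speech frame */
} word_buf_t;

static void word_reset(word_buf_t *wb)
{
    wb->n_frames = 0;
    wb->silence = 0;
}

/* Feed one frame. Returns 1 when a complete word is sitting in wb;
 * the caller should then match it and call word_reset(). */
static int word_push(word_buf_t *wb, const int8_t *ceps, int32_t energy)
{
    if (energy > ENERGY_THRESH) {
        if (wb->n_frames < MAX_FRAMES)
            memcpy(wb->frames[wb->n_frames++], ceps, N_CEPS);
        wb->silence = 0;
        return 0;
    }
    if (wb->n_frames == 0)
        return 0;                        /* still waiting for speech to start */
    return ++wb->silence >= END_SILENCE;
}
```

With these numbers the buffer alone takes 800 bytes, a large slice of the 2K RAM, so the real limits are likely tighter.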
When enough 'silence' frames have passed to signify the end of a spoken
sample, its length is warped to exactly 16 frames and compared to a
lookup table of previously stored word samples, and the closest match
is reported.
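The warp-and-match step could be sketched as follows. The specifics are assumptions for illustration: the warp here simply picks the nearest source frame (the real code may interpolate), the distance is a sum of absolute differences (chosen because it needs no multiplies), and `warp_frames`, `word_distance` and `best_match` are hypothetical names. All buffers are flat row-major arrays of 8 cepstrum values per frame.

```c
#include <stdint.h>

#define N_CEPS   8
#define WARP_LEN 16
#define WORD_SZ  (WARP_LEN * N_CEPS)

/* Resample n_in frames (n_in >= 1) to exactly WARP_LEN frames. */
static void warp_frames(const int8_t *in, int n_in, int8_t *out)
{
    for (int t = 0; t < WARP_LEN; t++) {
        int src = (t * n_in) / WARP_LEN;   /* nearest earlier source frame */
        for (int k = 0; k < N_CEPS; k++)
            out[t * N_CEPS + k] = in[src * N_CEPS + k];
    }
}

/* Sum of absolute differences between two warped words. */
static uint32_t word_distance(const int8_t *a, const int8_t *b)
{
    uint32_t d = 0;
    for (int i = 0; i < WORD_SZ; i++) {
        int diff = a[i] - b[i];
        d += (uint32_t)(diff < 0 ? -diff : diff);
    }
    return d;
}

/* Index of the closest codebook entry, or -1 if the codebook is empty. */
static int best_match(const int8_t *word, const int8_t *codebook, int n_entries)
{
    int best = -1;
    uint32_t best_d = UINT32_MAX;
    for (int e = 0; e < n_entries; e++) {
        uint32_t d = word_distance(word, codebook + e * WORD_SZ);
        if (d < best_d) {
            best_d = d;
            best = e;
        }
    }
    return best;
}
```

The returned index would then be mapped to a digit label; with several stored samples per digit, a small index-to-digit table is enough.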
## How it compares to more common speech-to-text systems
- Usually 16,000 samples/sec audio is used (high fidelity), however to
reduce the audio processing required, only 6400 samples/sec audio is used here.
This is still high enough quality for accurate transcription
(the top speech formants are at around 3200Hz, everything above that is redundant
in terms of recognition).
This makes each 10ms slice of audio exactly 64 samples.
- Heavy pre-emphasis is used (coefficient 1.0, i.e. subtract the previous sample from
the current one). This mostly removes spectral tilt, and completely removes
any DC level in the signal.
- 20ms frames are used, compared to the 25ms found elsewhere (Vosk, Whisper). The standard 10ms
step/100fps processing is maintained though. This allows a 128-sample FFT
to be used: the FFT algorithm is optimal at power-of-2 widths, and potentially would
allow for a 100% faster FFT by taking advantage of the symmetry of the algorithm.
The narrow frame size is compensated for by:
  - No windowing is used. Usually a Hann window is applied to the signal before the FFT;
    this step is skipped to avoid 'bookending' the narrow signal.
    Not windowing raises the noise floor of the FFT output, but noise-floor removal
    is applied to the output mel bins to compensate.
- Only 20 mel bins calculated. Usually 40 (Vosk) or even 80 mel-scale bins
(Whisper) are used these days.
- extreme (probably overly so) compression via DCT to 8 cepstrum bins to
minimise the final frame dimensions.
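The pre-emphasis described in the list above (coefficient 1.0) is just a first difference, y[n] = x[n] - x[n-1]. A minimal sketch, with `preemphasis` and `prev_sample` as hypothetical names:

```c
#include <stdint.h>

static int16_t prev_sample;   /* x[n-1], starts at 0 */

/* First-difference pre-emphasis: y[n] = x[n] - x[n-1].
 * A constant (DC) input produces an output of exactly 0. */
static int16_t preemphasis(int16_t sample)
{
    int16_t out = (int16_t)(sample - prev_sample);
    prev_sample = sample;
    return out;
}
```

Because there is no multiply at all, this is about as cheap as a filter gets, which suits a core without hardware multiply.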
## Applications
Maybe the basis for a low-complexity/low-power/low-cost/always-on wakeword engine?
## Future
New versions of the CH32 line of chips are due soon, improving the ADC to 12-bit and
adding hardware-multiply instructions, which would speed up the FFT and
matrix-multiply routines in the code significantly, and improve
accuracy in the processing.
## Why?
I dunno, I like a challenge. :grinning: Doing something traditionally associated with
high-end processing on a 10-cent low-end microcontroller is always going to be
a stretch.