Compressed sensing in speech signals

In communications technology, the technique of compressed sensing (CS) may be applied to the processing of speech signals under certain conditions. In particular, CS can be used to reconstruct a sparse vector from a small number of measurements, provided the signal can be represented in a sparse domain. "Sparse domain" refers to a domain in which only a few coefficients have non-zero values.^[1]

Theory

Suppose a signal ${x\in R^{N}}$ can be represented in a domain where only ${K}$ of its ${N}$ coefficients (where ${K\ll N}$) are non-zero; the signal is then said to be sparse in that domain. If the sparse domain of the signal is known, the sparse vector reconstructed by CS can be used to recover the original signal. CS can therefore be applied to a speech signal only if a sparse domain of the speech signal is known.

Consider a speech signal ${x}$ that can be represented in a domain ${\Psi }$ such that ${x=\Psi {\boldsymbol {\alpha }}}$, where the speech signal ${x\in R^{N}}$, the dictionary matrix ${\Psi \in R^{N\times N}}$ and the sparse coefficient vector ${{\boldsymbol {\alpha }}\in R^{N}}$. This speech signal is said to be sparse in the domain ${\Psi }$ if the number of significant (non-zero) coefficients in the sparse vector ${\boldsymbol {\alpha }}$ is ${K}$, where ${K\ll N}$.
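The relation ${x=\Psi {\boldsymbol {\alpha }}}$ can be illustrated with a small sketch. It assumes, purely for illustration, that the dictionary ${\Psi }$ is an orthonormal DCT-II basis; the text does not name a specific dictionary, and the helper names (`dct_dictionary`, `synthesize`, `analyze`, `count_significant`) are hypothetical. Plain Python is used to keep the example self-contained.

```python
import math

def dct_dictionary(n):
    """Return an n x n orthonormal DCT-II synthesis matrix (columns are atoms).

    Assumption: the DCT stands in for the unspecified dictionary Psi.
    """
    psi = [[0.0] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            scale = math.sqrt(1.0 / n) if col == 0 else math.sqrt(2.0 / n)
            psi[row][col] = scale * math.cos(math.pi * (2 * row + 1) * col / (2 * n))
    return psi

def synthesize(psi, alpha):
    """Compute x = Psi alpha."""
    return [sum(p * a for p, a in zip(row, alpha)) for row in psi]

def analyze(psi, x):
    """Compute alpha = Psi^T x (valid because Psi is orthonormal)."""
    n = len(psi)
    return [sum(psi[row][col] * x[row] for row in range(n)) for col in range(n)]

def count_significant(alpha, tol=1e-9):
    """Count coefficients whose magnitude exceeds a small tolerance."""
    return sum(1 for a in alpha if abs(a) > tol)

N = 64
alpha = [0.0] * N
alpha[3], alpha[10], alpha[25] = 1.0, -0.5, 0.25  # K = 3 non-zero coefficients

psi = dct_dictionary(N)
x = synthesize(psi, alpha)      # dense in time, sparse in the DCT domain
alpha_back = analyze(psi, x)    # recovers the K-sparse coefficient vector
print("significant coefficients:", count_significant(alpha_back), "of", N)
```

Real speech is only approximately sparse, so in practice the tolerance `tol` would be chosen relative to the frame energy rather than near machine precision.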

The observed signal ${x}$ is of dimension ${N\times 1}$. To reduce the complexity of solving for ${\boldsymbol {\alpha }}$ using CS, the speech signal is observed through a measurement matrix ${\Phi }$ such that

$$y=\Phi x\tag{1}$$

where ${y\in R^{M}}$ and the measurement matrix ${\Phi \in R^{M\times N}}$, with ${M\ll N}$.
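The measurement step of eq. 1 can be sketched as follows. The Gaussian entries with variance ${1/M}$, the sizes ${N=256}$ and ${M=64}$, and the sinusoid standing in for one speech frame are all assumptions made for illustration; the function names are hypothetical.

```python
import math
import random

def gaussian_measurement_matrix(m, n, seed=0):
    """Return an m x n matrix with i.i.d. N(0, 1/m) entries (an assumed choice of Phi)."""
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(m)]

def measure(phi, x):
    """Compute the compressed measurements y = Phi x."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]

N, M = 256, 64
# A sinusoid stands in for one frame of speech (illustrative only).
x = [math.sin(2 * math.pi * 5 * i / N) for i in range(N)]

phi = gaussian_measurement_matrix(M, N)
y = measure(phi, x)
print(len(x), "samples ->", len(y), "measurements")
```

Each measurement is a random linear combination of all ${N}$ samples, which is why far fewer than ${N}$ of them can still carry enough information about a sparse signal.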

The sparse decomposition problem for eq. 1 can be solved as a standard ${l_{1}}$ minimization problem:^[2]

$$\hat{\boldsymbol{\alpha}}=\arg\min_{\boldsymbol{\alpha}}\Vert\boldsymbol{\alpha}\Vert_{1}\quad\text{s.t.}\quad y=\Phi x=\Phi\Psi\boldsymbol{\alpha}=A\boldsymbol{\alpha},\quad\text{where }A=\Phi\Psi\tag{2}$$
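As a rough illustration of ${l_{1}}$-based recovery, the sketch below uses the iterative soft-thresholding algorithm (ISTA). Note that this is a swapped-in technique: ISTA minimizes the penalized form ${{\tfrac {1}{2}}\Vert y-A{\boldsymbol {\alpha }}\Vert _{2}^{2}+\lambda \Vert {\boldsymbol {\alpha }}\Vert _{1}}$ rather than the equality-constrained problem of eq. 2, which would normally be handed to a linear-programming or convex solver. The random matrix ${A}$, the value of ${\lambda }$, the iteration count and all function names are assumptions.

```python
import math
import random

def matvec(a, v):
    """Compute A v."""
    return [sum(p * q for p, q in zip(row, v)) for row in a]

def matvec_t(a, v):
    """Compute A^T v."""
    return [sum(a[i][j] * v[i] for i in range(len(a))) for j in range(len(a[0]))]

def soft_threshold(v, t):
    """Shrink v towards zero by t (the proximal operator of t * |v|)."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def objective(a, y, alpha, lam):
    """Penalized objective 0.5 * ||y - A alpha||^2 + lam * ||alpha||_1."""
    r = [yi - pi for yi, pi in zip(y, matvec(a, alpha))]
    return 0.5 * sum(ri * ri for ri in r) + lam * sum(abs(c) for c in alpha)

def ista(a, y, lam=0.01, iterations=500):
    # L is the squared Frobenius norm, an upper bound on the squared
    # spectral norm, so the step 1/L is safe (if conservative).
    lipschitz = sum(v * v for row in a for v in row)
    alpha = [0.0] * len(a[0])
    for _ in range(iterations):
        residual = [yi - pi for yi, pi in zip(y, matvec(a, alpha))]
        grad_step = matvec_t(a, residual)
        alpha = [soft_threshold(c + g / lipschitz, lam / lipschitz)
                 for c, g in zip(alpha, grad_step)]
    return alpha

rng = random.Random(1)
M, N = 20, 40
A = [[rng.gauss(0.0, 1.0 / math.sqrt(M)) for _ in range(N)] for _ in range(M)]
alpha_true = [0.0] * N
alpha_true[4], alpha_true[17], alpha_true[31] = 1.0, -0.8, 0.6  # K = 3

y = matvec(A, alpha_true)
alpha_hat = ista(A, y)
print("objective at zero:", objective(A, y, [0.0] * N, 0.01))
print("objective at estimate:", objective(A, y, alpha_hat, 0.01))
```

The conservative step size keeps the objective from increasing at any iteration, at the price of slow convergence; whether the estimate matches the true support depends on ${A}$, ${K}$ and ${\lambda }$.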

If the measurement matrix ${\Phi }$ satisfies the restricted isometry property (RIP) and is incoherent with the dictionary matrix ${\Psi }$,^[3] then the reconstructed signal is close to the original speech signal.
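Incoherence can be made concrete with a small computation. The sketch below assumes the commonly used definition ${\mu (\Phi ,\Psi )={\sqrt {N}}\max _{i,j}|\langle \varphi _{i},\psi _{j}\rangle |}$ over unit-norm rows ${\varphi _{i}}$ of ${\Phi }$ and unit-norm columns ${\psi _{j}}$ of ${\Psi }$; the text itself does not state a formula, and the DCT dictionary and function names are illustrative choices.

```python
import math

def dct_dictionary(n):
    """Orthonormal DCT-II synthesis matrix (an assumed stand-in for Psi)."""
    return [[(math.sqrt(1.0 / n) if col == 0 else math.sqrt(2.0 / n))
             * math.cos(math.pi * (2 * row + 1) * col / (2 * n))
             for col in range(n)] for row in range(n)]

def identity(n):
    """Identity matrix: each row samples a single time instant."""
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

def mutual_coherence(phi, psi):
    """sqrt(N) * max |<phi_i, psi_j>| over unit-normalized rows of phi and columns of psi."""
    n = len(psi)

    def unit(v):
        norm = math.sqrt(sum(t * t for t in v))
        return [t / norm for t in v]

    rows = [unit(r) for r in phi]
    cols = [unit([psi[r][c] for r in range(n)]) for c in range(n)]
    best = max(abs(sum(a * b for a, b in zip(r, c))) for r in rows for c in cols)
    return math.sqrt(n) * best

N = 16
mu_spike_dct = mutual_coherence(identity(N), dct_dictionary(N))  # low coherence
mu_same = mutual_coherence(identity(N), identity(N))             # maximal coherence
print("spikes vs DCT:", mu_spike_dct)
print("spikes vs spikes:", mu_same)
```

For a pair of orthonormal bases this quantity lies between 1 and ${\sqrt {N}}$; values near 1 indicate that each measurement spreads across many dictionary atoms, which is the favourable case for CS.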

Different types of measurement matrices, such as random matrices, can be used for speech signals.^[4]^[5] Estimating the sparsity of a speech signal is difficult because the signal varies greatly over time, and so does its sparsity. Ideally, the sparsity of the speech signal would be tracked over time at low computational cost. If this is not possible, a worst-case sparsity can be assumed for the given speech signal.
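One possible way to track sparsity over time and fall back on a worst case is sketched below. It assumes a working definition that is not from the text: the sparsity of a frame is the smallest number of DCT coefficients holding 99% of the frame energy. The frame length, the energy fraction, the synthetic two-frame signal and the function names are all hypothetical.

```python
import math

def dct_coefficients(frame):
    """Orthonormal DCT-II coefficients of one frame."""
    n = len(frame)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                                  for i, v in enumerate(frame)))
    return coeffs

def frame_sparsity(frame, energy_fraction=0.99):
    """Smallest number of DCT coefficients holding `energy_fraction` of the frame energy."""
    energies = sorted((c * c for c in dct_coefficients(frame)), reverse=True)
    total = sum(energies)
    if total == 0.0:
        return 0  # a silent frame needs no coefficients
    running = 0.0
    for count, energy in enumerate(energies, start=1):
        running += energy
        if running >= energy_fraction * total:
            return count
    return len(energies)

def sparsity_track(signal, frame_len):
    """Per-frame sparsity estimates over non-overlapping frames."""
    starts = range(0, len(signal) - frame_len + 1, frame_len)
    return [frame_sparsity(signal[s:s + frame_len]) for s in starts]

def atom(k, n):
    """A single DCT atom of length n, used to build a synthetic test signal."""
    return [math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)]

FRAME = 32
frame_a = atom(3, FRAME)                                             # one active atom
frame_b = [p + q for p, q in zip(atom(3, FRAME), atom(7, FRAME))]    # two active atoms
track = sparsity_track(frame_a + frame_b, FRAME)
worst_case = max(track)
print("per-frame sparsity:", track, "worst case:", worst_case)
```

Using `worst_case` as the design sparsity is simple but conservative: frames that are sparser than the worst case are then measured more densely than they strictly need.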

The sparse vector (${\hat {\boldsymbol {\alpha }}}$) for a given speech signal is reconstructed from as few measurements (${y}$) as possible using ${l_{1}}$ minimization.^[2] The original speech signal is then reconstructed from the calculated sparse vector ${\hat {\boldsymbol {\alpha }}}$ using the fixed dictionary matrix ${\Psi }$ as ${{\hat {x}}=\Psi {\hat {\boldsymbol {\alpha }}}}$.^[6]
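The final synthesis step ${{\hat {x}}=\Psi {\hat {\boldsymbol {\alpha }}}}$ can be sketched as below, together with a signal-to-noise ratio to judge the result. The SNR measure is an assumption (the text does not name a quality metric), as are the DCT dictionary and the deliberately imperfect estimate ${\hat {\boldsymbol {\alpha }}}$, which simply drops the weaker coefficient to mimic a solver error.

```python
import math

def dct_dictionary(n):
    """Orthonormal DCT-II synthesis matrix (an assumed stand-in for Psi)."""
    return [[(math.sqrt(1.0 / n) if col == 0 else math.sqrt(2.0 / n))
             * math.cos(math.pi * (2 * row + 1) * col / (2 * n))
             for col in range(n)] for row in range(n)]

def synthesize(psi, alpha):
    """Compute x = Psi alpha."""
    return [sum(p * a for p, a in zip(row, alpha)) for row in psi]

def snr_db(x, x_hat):
    """Reconstruction SNR in dB: 10 log10(||x||^2 / ||x - x_hat||^2)."""
    signal = sum(v * v for v in x)
    noise = sum((v - w) ** 2 for v, w in zip(x, x_hat))
    return math.inf if noise == 0.0 else 10.0 * math.log10(signal / noise)

N = 32
psi = dct_dictionary(N)

alpha = [0.0] * N
alpha[2], alpha[9] = 1.0, 0.5          # true sparse vector
x = synthesize(psi, alpha)             # original frame

alpha_hat = list(alpha)
alpha_hat[9] = 0.0                     # pretend the solver missed the weaker coefficient
x_hat = synthesize(psi, alpha_hat)     # reconstructed frame

quality = snr_db(x, x_hat)
print("reconstruction SNR (dB):", quality)
```

Because the assumed dictionary is orthonormal, the error energy in the signal equals the error energy in the coefficients, so any coefficient the solver misses shows up directly in the SNR.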

The dictionary matrix and the sparse vector have also been estimated jointly, in an iterative manner, from random measurements alone.^[7] The speech signal reconstructed from the estimated sparse vector and dictionary matrix closely matches the original signal. Other iterative approaches that calculate both the dictionary matrix and the speech signal from random measurements alone have also been developed.^[8]

Applications

The application of structured sparsity for joint speech localization-separation in reverberant acoustics has been investigated for multiparty speech recognition.^[9] Further applications of sparsity in speech processing are yet to be studied. The idea behind applying CS to speech signals is to formulate algorithms or methods that use only the random measurements (${y}$) to carry out various forms of application-based processing, such as speaker recognition and speech enhancement.^[10]

References

[1] Vidyasagar, M. (2019-12-03). An Introduction to Compressed Sensing. SIAM. ISBN 978-1-61197-612-0.
[2] Donoho D. (2006). "Compressed sensing". IEEE Transactions on Information Theory. 52 (4): 1289–1306. CiteSeerX 10.1.1.212.6447. doi:10.1109/TIT.2006.871582. PMID 17969013. S2CID 206737254.
[3] Candes E.; Romberg J.; Tao T. (2006). "Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information" (PDF). IEEE Transactions on Information Theory. 52 (2): 489. arXiv:math/0409186. doi:10.1109/TIT.2005.862083. S2CID 7033413.
[4] Zhang G.; Jiao S.; Xu X.; Wang L. (2010). "Compressed sensing and reconstruction with Bernoulli matrices". The 2010 IEEE International Conference on Information and Automation. pp. 455–460. doi:10.1109/ICINFA.2010.5512379. ISBN 978-1-4244-5701-4. S2CID 15886491.
[5] Li K.; Ling C.; Gan L. (2011). "Deterministic compressed-sensing matrices: Where Toeplitz meets Golay". 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3748–3751. doi:10.1109/ICASSP.2011.5947166. ISBN 978-1-4577-0538-0. S2CID 12289159.
[6] Christensen M.; Østergaard J.; Jensen S. (2009). "On compressed sensing and its application to speech and audio signals". 2009 Conference Record of the Forty-Third Asilomar Conference on Signals, Systems and Computers. pp. 356–360. doi:10.1109/ACSSC.2009.5469828. ISBN 978-1-4244-5825-7. S2CID 15151303.
[7] Raj C. S.; Sreenivas T. V. (2011). "Time-varying signal adaptive transform and IHT recovery of compressive sensed speech". Interspeech 2011. pp. 73–76. doi:10.21437/Interspeech.2011-19. S2CID 35813887.
[8] Chetupally S. R.; Sreenivas T. V. (2012). "Joint pitch-analysis formant-synthesis framework for CS recovery of speech". Interspeech: 946–949.
[9] Asaei A.; Bourlard H.; Cevher V. (2011). "Model-based Compressive Sensing for Multiparty Distant Speech Recognition". ICASSP: 4600–4603.
[10] Abrol Vinayak; Sharma Pulkit (2013). "Speech enhancement using compressed sensing". Interspeech 2013. pp. 3274–3278. doi:10.21437/Interspeech.2013-725.
