<?xml version="1.0" encoding="utf-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">INFORMATICA</journal-id>
<journal-title-group><journal-title>Informatica</journal-title></journal-title-group>
<issn pub-type="epub">1822-8844</issn>
<issn pub-type="ppub">0868-4952</issn>
<issn-l>0868-4952</issn-l>
<publisher>
<publisher-name>Vilnius University</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">INFOR398</article-id>
<article-id pub-id-type="doi">10.15388/20-INFOR398</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Research Article</subject></subj-group></article-categories>
<title-group>
<article-title>Voice Activation Systems for Embedded Devices: Systematic Literature Review</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Kolesau</surname><given-names>Aliaksei</given-names></name><xref ref-type="aff" rid="j_infor398_aff_001"/><bio>
<p><bold>A. Kolesau</bold> is a PhD student at Department of Information Technologies, Vilnius Gediminas Technical University. His research interests include machine learning and speech recognition.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Šešok</surname><given-names>Dmitrij</given-names></name><email xlink:href="dmitrij.sesok@vgtu.lt">dmitrij.sesok@vgtu.lt</email><xref ref-type="aff" rid="j_infor398_aff_001"/><xref ref-type="corresp" rid="cor1"/><bio>
<p><bold>D. Šešok</bold> is a professor at Department of Information Technologies, Vilnius Gediminas Technical University. His fields of interest are global optimization and machine learning. He has authored or co-authored around 40 papers.</p></bio>
</contrib>
<aff id="j_infor398_aff_001">Department of Information Technologies, <institution>Vilnius Gediminas Technical University</institution>, Saulėtekio al. 11, Vilnius LT-10223, <country>Lithuania</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>*</label>Corresponding author. </corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2020</year></pub-date>
<pub-date pub-type="epub"><day>23</day><month>3</month><year>2020</year></pub-date>
<volume>31</volume><issue>1</issue><fpage>65</fpage><lpage>88</lpage>
<history>
<date date-type="received"><month>1</month><year>2019</year></date>
<date date-type="accepted"><month>11</month><year>2019</year></date>
</history>
<permissions><copyright-statement>© 2020 Vilnius University</copyright-statement><copyright-year>2020</copyright-year><license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/"><license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>A large number of modern mobile devices, embedded devices and smart home devices are equipped with voice control. Automatic recognition of the entire audio stream, however, is undesirable for reasons of resource consumption and privacy. Therefore, most of these devices use a voice activation system, whose task is to find a word or phrase specified in advance (for example, <monospace>Ok, Google</monospace>) in the audio stream and to activate the voice request processing system when it is found. The voice activation system must have the following properties: high accuracy, the ability to work entirely on the device (without using remote servers), consumption of a small amount of resources (primarily CPU and RAM), robustness to noise and to the variability of speech, as well as a small delay between the pronunciation of the key phrase and the system activation. This work is a systematic literature review of voice activation systems that satisfy the above properties. We describe the principles of operation of various voice activation systems and the characteristic representation of sound in such systems, consider acoustic modelling in detail and, finally, describe the approaches used to assess model quality. In addition, we point out a number of open questions in this problem area.</p>
</abstract>
<kwd-group>
<label>Key words</label>
<kwd>voice activation</kwd>
<kwd>keyword spotter</kwd>
<kwd>hidden Markov models</kwd>
<kwd>acoustic model</kwd>
<kwd>neural networks</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="j_infor398_s_001">
<label>1</label>
<title>Introduction</title>
<p>The voice activation task has been attracting both researchers and industry for decades. Since it is difficult to formulate an algorithm that determines whether a code phrase has been uttered in an audio stream, it is not surprising that heuristic algorithms and machine learning methods have long been used for the voice activation problem.</p>
<p>The history of voice activation models has gone through several important stages in parallel with the solution of the more general problem of automatic speech recognition. We would like to highlight the following important moments: the beginning of the use of hidden Markov models back in 1989 (Rohlicek <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_054">1989</xref>), the use of neural networks since 1990 (Morgan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_047">1990</xref>, <xref ref-type="bibr" rid="j_infor398_ref_048">1991</xref>; Naylor <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_050">1992</xref>), the use of pattern matching approaches, in particular, dynamic time warping (Zeppenfeld and Waibel, <xref ref-type="bibr" rid="j_infor398_ref_085">1992</xref>), the optimization of loss functions specific to voice activation (as opposed to common metrics such as accuracy; this makes the system more attractive in terms of user experience) (Chang and Lippmann, <xref ref-type="bibr" rid="j_infor398_ref_005">1994</xref>; Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_067">2010</xref>), attempts to get rid of a garbage model (Junkawitsch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_028">1997</xref>), the construction of voice activation systems for non-English languages such as Chinese (Zheng <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_087">1999</xref>; Hao and Li, <xref ref-type="bibr" rid="j_infor398_ref_018">2002</xref>), Japanese (Ida and Yamasaki, <xref ref-type="bibr" rid="j_infor398_ref_023">1998</xref>) and Persian (Shokri <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_060">2011</xref>), the construction of discriminative systems (Keshet <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_030">2009</xref>; Tabibian <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_070">2011</xref>, <xref ref-type="bibr" rid="j_infor398_ref_071">2013</xref>), publications describing voice activation systems in mass products (Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_006">2014a</xref>; Gruenstein <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_016">2017</xref>; Guo <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_017">2018</xref>; Wu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_082">2018</xref>), as well as the publication of open datasets for comparing different approaches (Warden, <xref ref-type="bibr" rid="j_infor398_ref_077">2018</xref>).</p>
<p>Voice activation systems can be applied in various areas: telephony (Shokri <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_061">2013</xref>; Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_067">2010</xref>), crime analysis (Kavya and Karjigi, <xref ref-type="bibr" rid="j_infor398_ref_029">2014</xref>), assistance systems for emergency situations (Zhu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_088">2013</xref>), automated management of airports (Tabibian, <xref ref-type="bibr" rid="j_infor398_ref_069">2017</xref>) and, naturally, personal voice assistants built into mobile phones and home devices (Gruenstein <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_016">2017</xref>).</p>
<p>The problem of voice activation is closely related to the problems of automatic speech recognition (ASR) and spoken term detection. In ASR, the task is to find the most likely sequence of words spoken in the audio recording, whereas in voice activation we need to find only a predetermined set of words or to indicate that such words were not spoken. Of course, a system that solves the ASR problem can easily solve the voice activation problem as well, but at the moment most speech recognition systems consume an unacceptably large amount of resources for voice activation.</p>
<p>Spoken term detection is a search for a given phrase (and this phrase may vary from request to request) in a static set of audio data. In voice activation, the phrase is fixed, but the audio data arrives in real time. Therefore, offline methods, such as bidirectional neural networks or audio pre-indexing, can be used in spoken term detection but not in voice activation.</p>
<p>Despite the differences between these problems, approaches and ideas often overlap: for example, audio data representation, decoding methods or the architecture of acoustic models. Additional requirements may apply to voice activation systems: for example, responding only to a keyword that was addressed to the system, but not to the same keyword spoken in conversation (wake-up-word detection) (Këpuska and Klein, <xref ref-type="bibr" rid="j_infor398_ref_031">2009</xref>; Zhang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_086">2016</xref>), or responding only to a keyword spoken by a registered user (Gruenstein <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_016">2017</xref>; Manor and Greenberg, <xref ref-type="bibr" rid="j_infor398_ref_044">2017</xref>; Kurniawati <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_037">2012</xref>).</p>
<p>In this paper, we will focus primarily on voice activation systems that can be used in embedded systems, in particular, mobile phones. Such systems must satisfy the following properties:</p>
<list>
<list-item id="j_infor398_li_001">
<label>•</label>
<p>high recall in finding the keyword (to build a voice interface, one needs to be sure that the voice interaction can be started; with low recall, the user will have to start the interaction in a different way),</p>
</list-item>
<list-item id="j_infor398_li_002">
<label>•</label>
<p>a small number of false positives (since the voice activation system is always on, a large number of false positives is unacceptable: this causes a waste of device resources, distracts the user’s attention and potentially reduces security),</p>
</list-item>
<list-item id="j_infor398_li_003">
<label>•</label>
<p>the ability to work entirely on a device with limited resources (firstly, continuously forwarding audio data to remote servers is impossible due to prohibitively high requirements for resources and communication coverage, and secondly, it is undesirable from the point of view of user privacy),</p>
</list-item>
<list-item id="j_infor398_li_004">
<label>•</label>
<p>consumption of a small amount of resources (given the previous property, consuming a large amount of resources would lead to rapid battery depletion and slow operation of other processes),</p>
</list-item>
<list-item id="j_infor398_li_005">
<label>•</label>
<p>robustness to noise and to the variability of speech,</p>
</list-item>
<list-item id="j_infor398_li_006">
<label>•</label>
<p>a small delay between the utterance of the keyword and system activation.</p>
</list-item>
</list>
<p>We will call systems that satisfy these properties <bold>small-footprint keyword activation systems</bold>, similar to Chen <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_006">2014a</xref>). Thus, some papers that suggest the operation of the system in milder conditions (for example, not in real time) were omitted from the study.</p>
<p>There have been earlier reviews of voice activation systems (Bohac, <xref ref-type="bibr" rid="j_infor398_ref_004">2012</xref>; Rohlicek <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_055">1993</xref>; Morgan and Scofield, <xref ref-type="bibr" rid="j_infor398_ref_046">1991</xref>), but they contain some outdated information (due to the rapid development of the area). Also, as far as we know, our work is the first systematic literature review on the subject.</p>
<p>This work has the following structure. In Section <xref rid="j_infor398_s_002">2</xref>, we describe the structure of a typical voice activation system; this description helps to state the research questions which we aim to answer in this work. Next, in Sections <xref rid="j_infor398_s_003">3</xref>, <xref rid="j_infor398_s_004">4</xref>, <xref rid="j_infor398_s_005">5</xref>, <xref rid="j_infor398_s_006">6</xref>, and <xref rid="j_infor398_s_007">7</xref>, we provide the answers to these questions. In Section <xref rid="j_infor398_s_008">8</xref> we describe approaches that are difficult to relate to the typical system described in Section <xref rid="j_infor398_s_002">2</xref>. Finally, in Section <xref rid="j_infor398_s_009">9</xref>, we summarize the study and describe possible areas for further work.</p>
</sec>
<sec id="j_infor398_s_002">
<label>2</label>
<title>Structure of Voice Activation System</title>
<p>As described in Section <xref rid="j_infor398_s_001">1</xref>, voice activation systems have come a long way. One way to study and compare approaches is to provide the model of a system and to compare the individual components of the model. Most voice activation systems (especially modern ones) consist of the following parts:</p>
<list>
<list-item id="j_infor398_li_007">
<label>•</label>
<p><bold>feature extraction</bold> from audio data (to represent audio data in a format acceptable to machine learning models and to obtain input data that carries enough information to solve the problem),</p>
</list-item>
<list-item id="j_infor398_li_008">
<label>•</label>
<p>application of the <bold>acoustic model</bold> (a system that generally computes the probability of acoustic observations, which often comes down to computing <inline-formula id="j_infor398_ineq_001">
<alternatives><mml:math><mml:mi mathvariant="italic">P</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">u</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi mathvariant="italic">O</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$
P(u|O)$]]></tex-math></alternatives></inline-formula>, where <italic>u</italic> is an acoustic unit and <italic>O</italic> are acoustic observations),</p>
</list-item>
<list-item id="j_infor398_li_009">
<label>•</label>
<p><bold>decoding</bold>: the process of determining the state sequence with reference to the acoustic observations and the acoustic model, in order to determine whether the keyword has been uttered or not.</p>
</list-item>
</list>
<p>For example, Chen <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_006">2014a</xref>) describe voice activation systems that apply an acoustic model specified by deep neural network to extracted Log Mel-filterbank (feature extraction) and decide whether the keyword was uttered by smoothing deep neural network outputs and comparing them with a threshold (decoding).</p>
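<p>Smoothing of per-frame network outputs followed by a threshold comparison can be sketched as follows (a minimal illustration in Python; the function name, window sizes and threshold value are our own assumptions, not values taken from the cited paper):</p>

```python
import numpy as np

def smoothed_confidence(posteriors, w_smooth=30, w_max=100):
    """Smooth per-frame posteriors of the n keyword units over a
    window of w_smooth frames, then combine the per-unit maxima over
    a sliding window of w_max frames into a single confidence score
    (geometric mean over units)."""
    T, n = posteriors.shape
    smoothed = np.empty_like(posteriors)
    for t in range(T):
        lo = max(0, t - w_smooth + 1)
        smoothed[t] = posteriors[lo:t + 1].mean(axis=0)  # moving average
    confidences = np.empty(T)
    for t in range(T):
        lo = max(0, t - w_max + 1)
        # best smoothed score of each unit in the window, combined
        confidences[t] = np.prod(smoothed[lo:t + 1].max(axis=0)) ** (1.0 / n)
    return confidences

# The keyword is declared as spotted whenever the confidence exceeds
# a tuned threshold, e.g.: fired = smoothed_confidence(outputs) > 0.5
```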
<p>Of course, not all possible voice activation systems are well described by this scheme. For instance, in pattern-matching approaches it is hard to separate the acoustic model from feature extraction. Discriminative spotters are another example. We will discuss these and other systems in more detail in Section <xref rid="j_infor398_s_008">8</xref>. Nevertheless, even in these systems it is always possible to point out the feature representation of the audio or some kind of acoustic model.</p>
<p>This systematic literature review aims to summarize information available in studies about voice activation systems for embedded devices by answering the following research questions: 
<list>
<list-item id="j_infor398_li_010">
<label>1.</label>
<p>What acoustic features are used?</p>
</list-item>
<list-item id="j_infor398_li_011">
<label>2.</label>
<p>What types of acoustic model are used?</p>
</list-item>
<list-item id="j_infor398_li_012">
<label>3.</label>
<p>What acoustic units are used in acoustic modelling?</p>
</list-item>
<list-item id="j_infor398_li_013">
<label>4.</label>
<p>What types of decoder are used?</p>
</list-item>
<list-item id="j_infor398_li_014">
<label>5.</label>
<p>What metrics are used to evaluate systems’ quality?</p>
</list-item>
</list>
</p>
</sec>
<sec id="j_infor398_s_003">
<label>3</label>
<title>Feature Representation</title>
<p>Sound is a continuous physical phenomenon of mechanical vibration transmission in the form of an acoustic wave. However, most machine learning models do not accept continuous data as input. Thus, the extraction of features from the audio recording has two main goals:</p>
<list>
<list-item id="j_infor398_li_015">
<label>•</label>
<p>representing audio in a way that is suitable for machine learning methods,</p>
</list-item>
<list-item id="j_infor398_li_016">
<label>•</label>
<p>preserving as much of the information needed to solve the problem (i.e. finding keywords) as possible, while excluding as much task-irrelevant information as possible (“noise” such as background sounds or the variability of speech).</p>
</list-item>
</list>
<p>Most voice activation systems use an approach similar to speech recognition systems (Hinton <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_020">2012</xref>).</p>
<list>
<list-item id="j_infor398_li_017">
<label>1.</label>
<p>The original recording is segmented into possibly overlapping <bold>frames</bold>.</p>
</list-item>
<list-item id="j_infor398_li_018">
<label>2.</label>
<p>In each frame, a numerical vector that describes the behaviour of the sound at this time interval is computed (usually, this vector is computed using the discrete Fourier transform). Let’s say that this vector has dimension <inline-formula id="j_infor398_ineq_002">
<alternatives><mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">n</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">f</mml:mi></mml:mrow></mml:msub></mml:math><tex-math><![CDATA[$
{n_{\mathrm{f}}}$]]></tex-math></alternatives></inline-formula>.</p>
</list-item>
<list-item id="j_infor398_li_019">
<label>3.</label>
<p>The resulting numerical matrix of the size <inline-formula id="j_infor398_ineq_003">
<alternatives><mml:math><mml:mi mathvariant="italic">T</mml:mi><mml:mo>×</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="italic">n</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">f</mml:mi></mml:mrow></mml:msub></mml:math><tex-math><![CDATA[$
T\times {n_{\mathrm{f}}}$]]></tex-math></alternatives></inline-formula> is used as the result of feature extraction (where <italic>T</italic> is the number of frames).</p>
</list-item>
</list>
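<p>The three steps above can be sketched as follows (a minimal illustration in Python; the frame and hop sizes are illustrative, assuming 16 kHz audio, and the spectral vector here is simply the magnitude of a windowed DFT):</p>

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Cut the signal into overlapping frames (step 1) and compute one
    spectral vector per frame via the discrete Fourier transform
    (step 2). 400/160 samples correspond to 25/10 ms at 16 kHz.
    Returns a matrix of shape (T, n_f) (step 3)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        frames.append(np.abs(np.fft.rfft(frame)))  # n_f = frame_len//2 + 1
    return np.array(frames)
```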
<p>Thus the audio data can be viewed as a 2D-image or a time series. The specially selected transformation used in the second step is responsible for extracting the most discriminative features for the voice activation task.</p>
<p>Of course, not all the systems go this way. For example, Kumatani <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_036">2017</xref>) use raw waveform (without any selected transformations), and Lehtonen (<xref ref-type="bibr" rid="j_infor398_ref_039">2005</xref>) develops a specific digital signal processing pipeline.</p>
<p>Sometimes, feature quantization is used to increase the speed of operation, reduce consumption or for specific algorithms (Feng and Mazor, <xref ref-type="bibr" rid="j_infor398_ref_009">1992</xref>).</p>
<p><bold>Mel Frequency Cepstral Coefficients</bold> (MFCC) is the most frequently used feature type in the studied sources. It is calculated in the following way:</p>
<list>
<list-item id="j_infor398_li_020">
<label>1.</label>
<p>The audio is segmented into short frames (a popular choice is 25 ms frames taken every 10 ms).</p>
</list-item>
<list-item id="j_infor398_li_021">
<label>2.</label>
<p>For each frame the periodogram estimate of the power spectrum is computed. This is similar to the way the human cochlea processes information (different nerves fire signals depending on the frequency of the audio). To get the estimate, the Discrete Fourier Transform of each frame is first computed via: 
<disp-formula id="j_infor398_eq_001">
<alternatives><mml:math display="block"><mml:mtable displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi mathvariant="italic">S</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">j</mml:mi></mml:mrow></mml:msub><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">k</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo><mml:mo>=</mml:mo>
<mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mstyle displaystyle="true"><mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle></mml:mrow><mml:mrow><mml:mi mathvariant="italic">n</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="italic">N</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi mathvariant="italic">s</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">j</mml:mi></mml:mrow></mml:msub><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">n</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo><mml:mi mathvariant="italic">h</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">n</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo><mml:mo movablelimits="false">exp</mml:mo><mml:mfenced separators="" open="(" close=")"><mml:mrow><mml:mstyle displaystyle="true"><mml:mfrac><mml:mrow><mml:mo>−</mml:mo><mml:mn>2</mml:mn><mml:mi mathvariant="italic">i</mml:mi><mml:mi mathvariant="italic">π</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">N</mml:mi></mml:mrow></mml:mfrac></mml:mstyle><mml:mi mathvariant="italic">k</mml:mi><mml:mi mathvariant="italic">n</mml:mi></mml:mrow></mml:mfenced><mml:mo mathvariant="normal">,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math><tex-math><![CDATA[\[ {S_{j}}(k)={\sum \limits_{n=1}^{N}}{s_{j}}(n)h(n)\exp \left(\frac{-2i\pi }{N}kn\right),\]]]></tex-math></alternatives>
</disp-formula> 
where <italic>j</italic> is the frame number, <inline-formula id="j_infor398_ineq_004">
<alternatives><mml:math><mml:mn>1</mml:mn><mml:mo>⩽</mml:mo><mml:mi mathvariant="italic">k</mml:mi><mml:mo>⩽</mml:mo><mml:mi mathvariant="italic">K</mml:mi></mml:math><tex-math><![CDATA[$
1\leqslant k\leqslant K$]]></tex-math></alternatives></inline-formula>, <italic>K</italic> is the DFT length, <inline-formula id="j_infor398_ineq_005">
<alternatives><mml:math><mml:mi mathvariant="italic">h</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">n</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$
h(n)$]]></tex-math></alternatives></inline-formula> is an <italic>N</italic> sample long analysis window (e.g. Hamming window), <inline-formula id="j_infor398_ineq_006">
<alternatives><mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">s</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">j</mml:mi></mml:mrow></mml:msub><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">n</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$
{s_{j}}(n)$]]></tex-math></alternatives></inline-formula> is the <italic>n</italic>-th sample of the <italic>j</italic>-th frame. After that, the periodogram estimate is computed by: 
<disp-formula id="j_infor398_eq_002">
<alternatives><mml:math display="block"><mml:mtable displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi mathvariant="italic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">j</mml:mi></mml:mrow></mml:msub><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">k</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="italic">N</mml:mi></mml:mrow></mml:mfrac></mml:mstyle><mml:msup><mml:mrow><mml:mo fence="true" maxsize="1.19em" minsize="1.19em" stretchy="true">|</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="italic">S</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">j</mml:mi></mml:mrow></mml:msub><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">k</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo><mml:mo fence="true" maxsize="1.19em" minsize="1.19em" stretchy="true">|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math><tex-math><![CDATA[\[ {P_{j}}(k)=\frac{1}{N}{\big|{S_{j}}(k)\big|^{2}}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
</list-item>
<list-item id="j_infor398_li_022">
<label>3.</label>
<p>Apply the Mel filterbank to the power spectra, summing the energy in each filter. The Mel scale relates perceived frequency to actual frequency: the human ear is more sensitive to small changes at low frequencies than at higher ones. In order to convert a frequency <italic>f</italic> to the Mel scale, the following formula is used: 
<disp-formula id="j_infor398_eq_003">
<alternatives><mml:math display="block"><mml:mtable displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="italic">M</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">f</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>1125</mml:mn><mml:mo movablelimits="false">ln</mml:mo><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi mathvariant="italic">f</mml:mi><mml:mo mathvariant="normal" stretchy="false">/</mml:mo><mml:mn>700</mml:mn><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math><tex-math><![CDATA[\[ M(f)=1125\ln (1+f/700).\]]]></tex-math></alternatives>
</disp-formula>
</p>
</list-item>
<list-item id="j_infor398_li_023">
<label>4.</label>
<p>The logarithm of the filterbank energies is taken. This also relates to human perception: loudness does not change linearly with energy. The logarithm is a good approximation, and it also allows channel normalization to be performed by simple subtraction (e.g. cepstral mean normalization).</p>
</list-item>
<list-item id="j_infor398_li_024">
<label>5.</label>
<p>Discrete cosine transform is applied. This is done to decorrelate the filterbank energies which were computed with overlapping filters.</p>
</list-item>
</list>
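<p>The five steps can be sketched end-to-end as follows (a minimal Python illustration using the Mel formula above; the frame sizes, number of filters and number of kept coefficients are common defaults, not values prescribed by the cited sources):</p>

```python
import numpy as np

def hz_to_mel(f):
    return 1125.0 * np.log(1.0 + f / 700.0)  # formula from step 3

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1125.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced uniformly on the Mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                       n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def dct2(x, n_out):
    """DCT-II of a vector, keeping the first n_out coefficients."""
    N = x.shape[-1]
    n = np.arange(N)
    basis = np.cos(np.pi * np.outer(np.arange(n_out), (2 * n + 1) / (2.0 * N)))
    return x @ basis.T

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_filters=26, n_ceps=13):
    fb = mel_filterbank(n_filters, frame_len, sample_rate)
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window           # step 1
        periodogram = np.abs(np.fft.rfft(frame)) ** 2 / frame_len  # step 2
        energies = fb @ periodogram                                # step 3
        log_energies = np.log(energies + 1e-10)                    # step 4
        feats.append(dct2(log_energies, n_ceps))                   # step 5
    return np.array(feats)  # shape (T, n_ceps)
```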
<p>Although the vast majority of articles use the <bold>Log Mel-filterbank</bold> (fbank) or <bold>Mel Frequency Cepstral Coefficients</bold> or their derivatives, the question arises whether this approach is universal, i.e. suitable for all situations. It turns out that this is not the case: for example, during the development of a voice activation system for the Japanese language (Ida and Yamasaki, <xref ref-type="bibr" rid="j_infor398_ref_023">1998</xref>), prosodic information had to be used to achieve acceptable quality, as MFCC did not give sufficient results. This happens with some other languages, too (Zheng <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_087">1999</xref>).</p>
<p>Among the common techniques, one can use <bold>stacking</bold> (concatenation of feature vectors from the current and neighbouring frames) and the calculation of <bold>deltas</bold> or <bold>derivatives</bold> (i.e. the calculation of a discrete time derivative using features from neighbouring frames). Mean normalization and variance normalization are also often used; when applied in the <bold>cepstral</bold> domain, this transformation is usually abbreviated as <bold>cmvn</bold>.</p>
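<p>Deltas and cmvn can be sketched as follows (a minimal Python illustration; the regression-style delta weights below are one common choice, and toolkits differ in the exact formula):</p>

```python
import numpy as np

def add_deltas_and_cmvn(feats, context=2):
    """Append first-order deltas (a discrete time derivative estimated
    from neighbouring frames) to each feature vector, then apply mean
    and variance normalization per feature dimension (cmvn when the
    features are cepstral)."""
    T = len(feats)
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    # regression-style delta: sum_n n*(x[t+n] - x[t-n]) / (2*sum_n n^2)
    num = sum(n * (padded[context + n:T + context + n]
                   - padded[context - n:T + context - n])
              for n in range(1, context + 1))
    deltas = num / (2.0 * sum(n * n for n in range(1, context + 1)))
    out = np.hstack([feats, deltas])
    return (out - out.mean(axis=0)) / (out.std(axis=0) + 1e-10)
```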
<p>For a detailed description of the mentioned features, you can refer to the relevant articles or reviews (Giannakopoulos, <xref ref-type="bibr" rid="j_infor398_ref_012">2015</xref>). The visualization of some of the features for the phrase “Hello, world!” is shown in Fig. <xref rid="j_infor398_fig_001">1</xref> and is computed using the framework for speech recognition <monospace>kaldi</monospace> (Povey <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_052">2011</xref>).</p>
<fig id="j_infor398_fig_001">
<label>Fig. 1.</label>
<caption>
<p>Feature visualization for audio file with “Hello, world!” pronunciation.</p>
</caption>
<graphic xlink:href="infor398_g001.jpg"/>
</fig>
<p>The acoustic features used in studied sources are presented in Table <xref rid="j_infor398_tab_007">7</xref>, in Appendix <xref rid="j_infor398_s_010">A</xref>. The number of times these features were used in the sources is presented in Table <xref rid="j_infor398_tab_001">1</xref>.</p>
<table-wrap id="j_infor398_tab_001">
<label>Table 1</label>
<caption>
<p>The number of times acoustic features and transformations were used in the studied sources.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Acoustic features and transformations</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Number of sources</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Mel Frequency Cepstral Coefficients</td>
<td style="vertical-align: top; text-align: left">25</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Derivatives or deltas</td>
<td style="vertical-align: top; text-align: left">24</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Log Mel-filterbank</td>
<td style="vertical-align: top; text-align: left">9</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Mean/variance normalization</td>
<td style="vertical-align: top; text-align: left">8</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Linear predictive coding</td>
<td style="vertical-align: top; text-align: left">6</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Energy or log-energy</td>
<td style="vertical-align: top; text-align: left">5</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Fourier transform</td>
<td style="vertical-align: top; text-align: left">4</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Stacking, perceptual linear prediction</td>
<td style="vertical-align: top; text-align: left">3</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Gain normalization, prosodic information, linear discriminant analysis over MFCC autoregressive moving average, spectral entropy, spectral flatness burst degree, bisector frequency, formant frequencies, feature space Maximum Likelihood Linear Regression, raw waveform</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="j_infor398_s_004">
<label>4</label>
<title>Acoustic Model</title>
<p>The task of the acoustic model is to model acoustic properties of the selected acoustic unit. For example, an acoustic model can provide a probability distribution over the vectors of MFCC-features when a certain word is pronounced. Practically, the acoustic model is used to compute <inline-formula id="j_infor398_ineq_007">
<alternatives><mml:math><mml:mi mathvariant="italic">P</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">S</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi mathvariant="italic">u</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$
P(S|u)$]]></tex-math></alternatives></inline-formula>, where <italic>S</italic> is the sound and <italic>u</italic> is some acoustic unit.</p>
<p>Often it is more natural or easier to compute <inline-formula id="j_infor398_ineq_008">
<alternatives><mml:math><mml:mi mathvariant="italic">P</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">u</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi mathvariant="italic">S</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$
P(u|S)$]]></tex-math></alternatives></inline-formula>, and then get <inline-formula id="j_infor398_ineq_009">
<alternatives><mml:math><mml:mi mathvariant="italic">P</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">S</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi mathvariant="italic">u</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$
P(S|u)$]]></tex-math></alternatives></inline-formula> via Bayes’ theorem. This technique is used especially often in conjunction with Hidden Markov Models (HMM).</p>
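<p>As a minimal illustration (with hypothetical numbers, not taken from any reviewed source), the Bayes inversion can be sketched as:</p>

```python
def likelihood_from_posterior(posterior_u_given_s, prior_u, prior_s):
    """Bayes' theorem: P(S|u) = P(u|S) * P(S) / P(u)."""
    return posterior_u_given_s * prior_s / prior_u

# Hypothetical numbers: a model outputs P(u|S) = 0.9 for unit u,
# with unit prior P(u) = 0.3 and observation prior P(S) = 0.1.
print(likelihood_from_posterior(0.9, 0.3, 0.1))  # 0.9 * 0.1 / 0.3 = 0.3 (up to rounding)
```

<p>In hybrid HMM systems, P(S) is the same for every unit at a given frame, so it is usually dropped, leaving the so-called scaled likelihood P(u|S)/P(u).</p>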
<p>The most common acoustic model for voice activation is built as follows. The set of HMM states is logically divided into two parts: a part that represents the audio event of a keyword pronunciation and a <bold>garbage model</bold> (a model of the rest of the sound: noise, background speech, the actual voice request). Figure <xref rid="j_infor398_fig_002">2</xref> shows a typical HMM used in Amazon’s spotter for the keyword “Alexa”.</p>
<p>Each state of the model represents an acoustic unit (see Section <xref rid="j_infor398_s_005">5</xref> for details), for example, a phoneme. The model “says” that at each frame (see Section <xref rid="j_infor398_s_003">3</xref>) the acoustic environment is in one of the states of the HMM and generates a <bold>visible</bold> variable, for example, the vector of MFCC-features (or, more generally, sound). Each state has a probability distribution over the sound. Thus, when we receive an audio file, we know the sound and the probability distributions, but we do not know what state the model was in at each of the frames. However, for each possible sequence of states we can calculate its probability. By decoding (Section <xref rid="j_infor398_s_006">6</xref>), we can find the most probable sequence. If this sequence generated a keyword, then we can say that an activation occurred (there are other options for decoding and determining the activation).</p>
<fig id="j_infor398_fig_002">
<label>Fig. 2.</label>
<caption>
<p>Hidden Markov model example for Amazon’s keyword spotter (Guo <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_017">2018</xref>).</p>
</caption>
<graphic xlink:href="infor398_g002.jpg"/>
</fig>
<p>It is necessary to be able to calculate <inline-formula id="j_infor398_ineq_010">
<alternatives><mml:math><mml:mi mathvariant="italic">P</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">S</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi mathvariant="italic">s</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$
P(S|s)$]]></tex-math></alternatives></inline-formula> (<italic>s</italic> is an HMM state) to find the keyword. Such a calculation is called acoustic modelling. Gaussian mixture models (GMM) and neural networks are the most frequent choices for the acoustic model. Note that these choices coincide with the choices for acoustic models in automatic speech recognition systems. Before Hinton <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_020">2012</xref>), GMM acoustic models were considered state-of-the-art; after that publication they were almost completely replaced by neural networks.</p>
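<p>As a sketch of what a GMM acoustic model computes (a generic diagonal-covariance mixture, not the model of any particular reviewed system), the log-likelihood of one feature vector under a state’s GMM can be evaluated as:</p>

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM.

    weights[k], means[k][d], variances[k][d] describe mixture component k.
    """
    log_probs = []
    for w, mu, var in zip(weights, means, variances):
        # Log of one weighted diagonal Gaussian density.
        lp = math.log(w)
        for xd, md, vd in zip(x, mu, var):
            lp += -0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
        log_probs.append(lp)
    m = max(log_probs)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(lp - m) for lp in log_probs))

# One standard-normal component evaluated at 0: the familiar -0.5*log(2*pi).
print(round(gmm_log_likelihood([0.0], [1.0], [[0.0]], [[1.0]]), 4))  # -0.9189
```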
<p>Note that what to consider the acoustic model in the HMM-GMM setup is a question of definitions. One can either consider the GMM alone (i.e. the part which actually computes <inline-formula id="j_infor398_ineq_011">
<alternatives><mml:math><mml:mi mathvariant="italic">P</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">S</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi mathvariant="italic">s</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$
P(S|s)$]]></tex-math></alternatives></inline-formula>, recall that the HMM state often represents some acoustic unit) or the whole HMM, because it expresses <inline-formula id="j_infor398_ineq_012">
<alternatives><mml:math><mml:mi mathvariant="italic">P</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">S</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi mathvariant="italic">w</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$
P(S|w)$]]></tex-math></alternatives></inline-formula>, as in Zheng <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_087">1999</xref>) (<italic>w</italic> is a keyword).</p>
<p>A good acoustic model is key to a high-quality voice activation system. It is therefore not surprising that the calculations associated with the acoustic model usually take the biggest part of the voice activation system runtime, and many studies focus on speeding this part up. For example, Fernández-Marqués <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_010">2018</xref>) apply binary arithmetic (instead of floating-point arithmetic) in the model, while Sun <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_065">2017</xref>) and Szöke <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_067">2010</xref>) present a neural network architecture where, in each layer, the matrix multiplication of <inline-formula id="j_infor398_ineq_013">
<alternatives><mml:math><mml:mi mathvariant="italic">N</mml:mi><mml:mo>×</mml:mo><mml:mi mathvariant="italic">M</mml:mi></mml:math><tex-math><![CDATA[$
N\times M$]]></tex-math></alternatives></inline-formula> is replaced by the product of two matrices with sizes <inline-formula id="j_infor398_ineq_014">
<alternatives><mml:math><mml:mi mathvariant="italic">N</mml:mi><mml:mo>×</mml:mo><mml:mi mathvariant="italic">K</mml:mi></mml:math><tex-math><![CDATA[$
N\times K$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor398_ineq_015">
<alternatives><mml:math><mml:mi mathvariant="italic">K</mml:mi><mml:mo>×</mml:mo><mml:mi mathvariant="italic">M</mml:mi></mml:math><tex-math><![CDATA[$
K\times M$]]></tex-math></alternatives></inline-formula>, where <italic>K</italic> is much smaller than <italic>N</italic> and <italic>M</italic>. Thus, a large number of operations is saved, while little of the model’s expressive power is lost (with an appropriate training method).</p>
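<p>The saving from such a low-rank factorization is easy to quantify; the layer sizes below are hypothetical and chosen only for illustration:</p>

```python
def lowrank_params(n, m, k):
    """Parameter counts of a full N x M layer vs. its rank-K factorization
    into an N x K matrix times a K x M matrix."""
    return n * m, n * k + k * m

# Hypothetical layer sizes: a 1024 x 1024 layer factored with K = 64.
full, factored = lowrank_params(1024, 1024, 64)
print(full, factored)  # 1048576 131072, i.e. an 8x reduction
```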
<p>Another way to build a voice activation system is to avoid HMM entirely and instead calculate some (heuristically selected) value based on the outputs of the acoustic model. For a successful use of this approach, see Chen <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_006">2014a</xref>).</p>
<p>The acoustic models used in studied sources are presented in Table <xref rid="j_infor398_tab_008">8</xref>, in Appendix <xref rid="j_infor398_s_010">A</xref>. The number of times these models were used in the sources is presented in Table <xref rid="j_infor398_tab_002">2</xref>.</p>
<table-wrap id="j_infor398_tab_002">
<label>Table 2</label>
<caption>
<p>The number of times a specific acoustic model was used in studied sources.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Acoustic model</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Number of sources</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">GMM</td>
<td style="vertical-align: top; text-align: left">18</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Neural network</td>
<td style="vertical-align: top; text-align: left">10</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Time-delayed neural network</td>
<td style="vertical-align: top; text-align: left">4</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">RNN, gated RNN, LSTM, bidirectional LSTM</td>
<td style="vertical-align: top; text-align: left">2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Polynomial model, continuous density neural tree, mixture of central distance normal distributions, support vector machine, deep neural network with highway blocks, binary deep neural network, convolutional neural network</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="j_infor398_s_005">
<label>5</label>
<title>Acoustic Units</title>
<p>The choice of an elementary unit for acoustic modelling (acoustic unit) affects the resulting quality. A system developer faces the following tradeoff: the larger the unit is (e.g. a <bold>word</bold>), the more stable it is (meaning that the produced acoustic features have less variability), and accordingly it is easier to find such a pattern in the audio stream. However, such a system is not flexible.</p>
<p>If a smaller unit is chosen, for example a <bold>phoneme</bold>, we face the more difficult task of finding a pattern, but on the other hand we can build a system that finds an arbitrary word on top of a system that finds phonemes.</p>
<p>Sometimes this tradeoff is resolved by choosing <bold>syllables</bold> or, if syllables are difficult to define, <bold>parts of words</bold>.</p>
<p>Also, one can choose not a whole phoneme as a unit, but a <bold>part of the phoneme</bold> (for example, <monospace>the beginning of the phoneme A</monospace> or <monospace>the middle of the phoneme B</monospace>) or a context-dependent phoneme (for example, <monospace>phoneme A, going after phoneme B</monospace>). A phoneme without context is often called a <bold>monophone</bold>, and a context-dependent one is called a <bold>biphone</bold> (if the dependency is only on one side) or a <bold>triphone</bold> (if the dependency is on both the left and the right). There is also the possibility to combine these approaches and use a <bold>part of the context-dependent phoneme</bold>. In this case, the system will probably have impractically many units, so they are often clustered (by pronunciation) into clusters called <bold>senones</bold>.</p>
<p>We must note that the term senone does not have a strict definition. Some authors, like Yu and Deng (<xref ref-type="bibr" rid="j_infor398_ref_083">2014</xref>), define a senone as a tied (clustered) triphone state. Others, like the authors of the Janus Toolkit, call all acoustic units senones (Janus Toolkit Documentation, <xref ref-type="bibr" rid="j_infor398_ref_027">2019</xref>).</p>
<p>The solution of this tradeoff depends on the size of the training data (with little data it is much more difficult to build a whole-word model than a phoneme model), the choice of the acoustic model, the key phrase, and the language. As far as we know, there is currently no algorithm or set of rules prescribing which acoustic unit to choose under which conditions.</p>
<p>The acoustic units used in studied sources are presented in Table <xref rid="j_infor398_tab_006">6</xref> in Appendix <xref rid="j_infor398_s_010">A</xref>. The number of times these units were used in the sources is presented in Table <xref rid="j_infor398_tab_003">3</xref>.</p>
<table-wrap id="j_infor398_tab_003">
<label>Table 3</label>
<caption>
<p>The number of times a specific acoustic unit was used in studied sources.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Acoustic unit</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Number of sources</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Monophone</td>
<td style="vertical-align: top; text-align: left">19</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Whole word</td>
<td style="vertical-align: top; text-align: left">13</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Syllable</td>
<td style="vertical-align: top; text-align: left">5</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Letter, part of the word, part of the phoneme</td>
<td style="vertical-align: top; text-align: left">3</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Triphone</td>
<td style="vertical-align: top; text-align: left">2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">State unit (learnt “phoneme”), senone</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="j_infor398_s_006">
<label>6</label>
<title>Decoding</title>
<p>Applying the acoustic model to an audio stream yields values characterizing the probability that a particular acoustic unit was pronounced at a particular moment. From the resulting numeric series (one or several), the voice activation system needs to decide whether the keyword was uttered in the audio stream. To do this, different approaches to <bold>decoding</bold> are used.</p>
<p>In the simplest case, it is only necessary to compare the obtained number with a threshold value to make a decision. E.g. when the acoustic unit is the whole keyword, the decision can be made by comparing the computed probability with 0.5.</p>
<p>Smoothing is usually used to improve recognition quality in the case of comparison with a threshold (Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_006">2014a</xref>; Lehtonen, <xref ref-type="bibr" rid="j_infor398_ref_039">2005</xref>). The motivation for this technique is that the keyword is an acoustic event with a certain duration in time. Thus, an actual keyword utterance should generate high probabilities for multiple frames in a row, and by applying a smoothing function to the time series we avoid false positives caused by fluctuations of the acoustic model. Silaghi and Vargiya (<xref ref-type="bibr" rid="j_infor398_ref_063">2005</xref>) suggested an interesting variant of smoothing in which the probability of each phoneme is normalized by the probability of the least probable phoneme.</p>
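<p>A minimal sketch of such smoothing, in the spirit of the moving-average variant described above (the window length and posterior values are hypothetical):</p>

```python
def smooth_posteriors(posteriors, window):
    """Moving-average smoothing of a frame-level posterior series."""
    smoothed = []
    for t in range(len(posteriors)):
        start = max(0, t - window + 1)
        frame = posteriors[start:t + 1]
        smoothed.append(sum(frame) / len(frame))
    return smoothed

# A single-frame spike is damped, while a sustained run of high
# posteriors (a real keyword utterance) survives the smoothing.
spike = [0.0, 0.9, 0.0, 0.0]
print(max(smooth_posteriors(spike, 3)))  # 0.45: the spike is halved
```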
<p>In systems that use a comparison with a template utterance, Dynamic Time Warping (DTW) is often used. DTW is an analogue of the Levenshtein distance for numerical series. The motivation of this method is that the duration of the recorded pattern is likely to differ from the pronunciation in real conditions. Thus, we cannot compare two audio fragments directly; instead, one needs to “stretch” or “squeeze” certain intervals of the template over time. The DTW distance is usually computed with dynamic programming. For a more detailed description and various modifications, please refer to Zehetner <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_084">2014</xref>).</p>
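<p>The dynamic-programming computation of the DTW distance can be sketched as follows (a textbook formulation on toy 1-D series, not tied to any reviewed system):</p>

```python
def dtw_distance(a, b):
    """Dynamic-programming DTW distance between two 1-D series."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # "Stretch" (repeat a template frame), "squeeze" (skip one)
            # or advance both series in lockstep.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

# The "stretched" pronunciation [1, 2, 2, 3] still matches template [1, 2, 3].
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```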
<p>Decoding becomes more meaningful in the case of HMM. Indeed, in this formulation we need to solve a typical HMM problem: find the most probable sequence of hidden states (if this sequence corresponds to a keyphrase, then, in some approaches, it means activation) or find the total probability of passing through some sequences of states (for example, we may not care how many frames in a row the first phoneme of the phrase was pronounced, how many the second, and so on; only the order is important).</p>
<p>The Viterbi algorithm uses dynamic programming to find the most likely sequence of hidden states in a hidden Markov model given the observations. Naturally, this algorithm is widely used in works on HMM-based voice activation systems. Many authors explore a variety of approaches and heuristics to speed up the algorithm, adapt it to find sequences satisfying some additional properties, and so on. For example, Liu <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_043">2000</xref>) use various techniques of hypothesis pruning and rescoring probabilities using a bi-gram language model. In Zhu <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_088">2013</xref>), the possibility of applying the Viterbi algorithm on sliding windows of the audio stream is considered. Junkawitsch <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_028">1997</xref>) consider a modification of the Viterbi algorithm that approximates finding the optimal sequence with the highest probability normalized by the utterance length. Several additional modifications of the Viterbi algorithm are considered in Wilcox and Bush (<xref ref-type="bibr" rid="j_infor398_ref_078">1992</xref>).</p>
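<p>The basic (unmodified) Viterbi recursion can be sketched as follows; the two-state toy model below, with state 0 as garbage and state 1 as keyword, uses made-up probabilities:</p>

```python
import math

def viterbi(obs_loglikes, log_trans, log_init):
    """Most probable HMM state sequence via dynamic programming.

    obs_loglikes[t][s] = log P(observation at frame t | state s),
    log_trans[p][s] = log P(p -> s), log_init[s] = log P(start in s).
    """
    n_states = len(log_init)
    score = [log_init[s] + obs_loglikes[0][s] for s in range(n_states)]
    back = []
    for frame in obs_loglikes[1:]:
        new_score, ptr = [], []
        for s in range(n_states):
            prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            ptr.append(prev)
            new_score.append(score[prev] + log_trans[prev][s] + frame[s])
        back.append(ptr)
        score = new_score
    # Backtrace from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

L = math.log
obs = [[L(0.9), L(0.1)], [L(0.2), L(0.8)], [L(0.2), L(0.8)]]
trans = [[L(0.7), L(0.3)], [L(0.3), L(0.7)]]
init = [L(0.6), L(0.4)]
print(viterbi(obs, trans, init))  # [0, 1, 1]
```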
<p>In addition, Wilcox and Bush (<xref ref-type="bibr" rid="j_infor398_ref_078">1992</xref>) discuss how to use the forward–backward algorithm for quick estimation of probabilities needed in decoding.</p>
<p>We would also like to mention the standard technique of using HMM-derived probabilities and reducing decoding to a comparison with a threshold. This approach is conventionally called the <bold>likelihood ratio</bold>. Often, two HMMs are used: a <bold>speech model</bold> representing all the keyword pronunciations, and a <bold>garbage model</bold> representing all other audio events. In such systems, one can find the probability of passing through the garbage model and the probability of passing through the part with the keyword. The ratio of these two probabilities then expresses confidence in the presence of the key phrase in the audio stream, and this ratio is compared with a threshold in many voice activation systems. It is worth noting that finding the balance of coefficients in such models is a difficult task, which is usually solved by optimizing the parameters on a held-out data set.</p>
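<p>In log space the likelihood-ratio decision reduces to a subtraction; the scores and threshold below are hypothetical:</p>

```python
def likelihood_ratio_decision(keyword_loglike, garbage_loglike, threshold):
    """Fire when the keyword/garbage log-likelihood ratio exceeds a threshold."""
    return (keyword_loglike - garbage_loglike) > threshold

# Hypothetical log-likelihoods from the two HMM passes over the same window.
print(likelihood_ratio_decision(-120.0, -135.0, 10.0))  # True: log-ratio 15 > 10
print(likelihood_ratio_decision(-120.0, -125.0, 10.0))  # False: log-ratio 5
```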
<p>Some authors use completely different approaches to decoding. For example, Manor and Greenberg (<xref ref-type="bibr" rid="j_infor398_ref_044">2017</xref>) describe an application of fuzzy logic to decoding.</p>
<p>The decoding approaches used in studied sources are presented in Table <xref rid="j_infor398_tab_009">9</xref> in Appendix <xref rid="j_infor398_s_010">A</xref>. The number of times each approach was used in studied sources is presented in Table <xref rid="j_infor398_tab_004">4</xref>.</p>
<table-wrap id="j_infor398_tab_004">
<label>Table 4</label>
<caption>
<p>Number of times the specific approach to decoding was used in studied sources.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Decoding approach</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Number of sources</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Viterbi</td>
<td style="vertical-align: top; text-align: left">15</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Comparing to threshold</td>
<td style="vertical-align: top; text-align: left">11</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DTW</td>
<td style="vertical-align: top; text-align: left">5</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Forward–Backward algorithm, likelihood ratio</td>
<td style="vertical-align: top; text-align: left">2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Fuzzy logic</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="j_infor398_s_007">
<label>7</label>
<title>Quality Assessment</title>
<p>A large number of metrics can be used to compare different voice activation systems. These metrics can be grouped by the aspect of the system they measure:</p>
<list>
<list-item id="j_infor398_li_025">
<label>•</label>
<p>classification quality,</p>
</list-item>
<list-item id="j_infor398_li_026">
<label>•</label>
<p>operation speed,</p>
</list-item>
<list-item id="j_infor398_li_027">
<label>•</label>
<p>amount of used RAM and CPU.</p>
</list-item>
</list>
<p>Metrics for speed measurement are standard and non-specific to voice activation systems. The most commonly used are the real time factor (RTF) – the total processing time of the audio stream divided by the length of the stream, latency (the average delay between the keyword utterance and the response signal), and total processing time (this metric is less indicative than RTF).</p>
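<p>The RTF definition above is a single division; the timings in the example are hypothetical:</p>

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF: total processing time divided by the audio duration."""
    return processing_seconds / audio_seconds

# Hypothetical run: 60 s of audio processed in 12 s.
print(real_time_factor(12.0, 60.0))  # 0.2, i.e. five times faster than real time
```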
<p>For resource usage, the most popular measures are the amount of RAM used and the CPU load (as a percentage of a compute core). To improve both parameters, quantization of the acoustic model parameters is often applied (Fernández-Marqués <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_010">2018</xref>).</p>
<p>At the moment, however, there are no standard metrics for measuring classification quality. Moreover, similar metrics are unfortunately named differently in different sources. We think it would be beneficial to have a standardized set of metrics in this area.</p>
<p>The main problem is that the voice activation system must satisfy two opposite properties to work well: it must be sensitive enough to react to keyword utterances, and it must be robust enough not to react to sound events that are similar to the keyword but are not actual keywords. Any system can be made arbitrarily sensitive (reacting to every event) or arbitrarily robust (reacting to no events); the challenge is to choose the right balance between these two operating points. Therefore, to measure classification quality one must either use at least two metrics (for example, precision and recall) or one combined metric (for example, F1-score). In the second case, an unsuccessful choice of metric can lead to false conclusions, since there is no single correct balance between the importance of sensitivity and robustness.</p>
<p>The following metrics are often used to measure classification quality:</p>
<list>
<list-item id="j_infor398_li_028">
<label>•</label>
<p>detection rate (precision) is the number of correctly recognized keywords relative to the total number of accepted keywords,</p>
</list-item>
<list-item id="j_infor398_li_029">
<label>•</label>
<p>substitution rate is the number of mis-recognized keywords relative to the total number of accepted keywords,</p>
</list-item>
<list-item id="j_infor398_li_030">
<label>•</label>
<p>deletion rate (false reject rate, opposite to recall, miss rate) is the number of un-detected keywords relative to the total number of keywords,</p>
</list-item>
<list-item id="j_infor398_li_031">
<label>•</label>
<p>rejection rate is the number of keywords which are rejected relative to the total number of keywords (false reject rate – FRR),</p>
</list-item>
<list-item id="j_infor398_li_032">
<label>•</label>
<p>false alarm rate (FAR) is the number of false alarms (relative to the number of utterances without keyword; sometimes per keyword or per hour of speech),</p>
</list-item>
<list-item id="j_infor398_li_033">
<label>•</label>
<p>accuracy (recognition rate) is the number of correctly classified utterances relative to the total number of utterances,</p>
</list-item>
<list-item id="j_infor398_li_034">
<label>•</label>
<p>true positive rate (same as recall),</p>
</list-item>
<list-item id="j_infor398_li_035">
<label>•</label>
<p>true negative rate (opposite to FAR).</p>
</list-item>
</list>
<p>As can be seen, there is no commonly accepted pair of metrics; moreover, the same metric often goes by different names in different sources.</p>
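<p>Several of the metrics listed above reduce to simple ratios over confusion counts; a sketch with hypothetical counts:</p>

```python
def keyword_metrics(true_pos, false_pos, false_neg, true_neg):
    """Common voice-activation classification metrics from confusion counts."""
    total_keywords = true_pos + false_neg      # all actual keyword utterances
    total_negatives = false_pos + true_neg     # all non-keyword events
    return {
        "precision": true_pos / (true_pos + false_pos),
        "recall": true_pos / total_keywords,
        "false_reject_rate": false_neg / total_keywords,
        "false_alarm_rate": false_pos / total_negatives,
        "accuracy": (true_pos + true_neg) / (total_keywords + total_negatives),
    }

# Hypothetical test set: 100 keyword utterances, 900 non-keyword events.
m = keyword_metrics(true_pos=90, false_pos=5, false_neg=10, true_neg=895)
print(m["recall"], m["false_reject_rate"])  # 0.9 0.1
```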
<p>The figure of merit (FOM) is one of the most used metrics in voice activation research. FOM is the average rate of correct detections at <italic>k</italic> false activations per hour, averaged over natural numbers <italic>k</italic> from 1 to 10. This metric was especially often used until the 2010s. By modern standards, such rates of false positives per hour are unreasonably high, so FOM does not reflect the relevant operating modes of modern voice activation systems. Other common metrics are the equal error rate (the smallest value that both FAR and FRR can take at the same time) and ROC-AUC (the area under the ROC curve).</p>
<p>Some papers suggest more complex ways of measuring classification quality. For example, the <bold>discriminative error rate</bold> is introduced in Cuayáhuitl and Serridge (<xref ref-type="bibr" rid="j_infor398_ref_008">2002</xref>). In this metric, different errors (when the system asked the user for confirmation, or rejected the operation without confirmation) have different penalties. In our opinion, this approach is more suitable for quality assessment of systems in real product use.</p>
<p>It is hard to compare results from different works not only because different metrics are used, but also because the choice of the dataset and the keyword deeply affects the results. If two works use false alarms per hour to describe their system quality, but one uses a dataset of speech recordings and the other uses a dataset from real user devices (where speech may take 3–6 hours of each 24-hour recording), then these works would report completely different numbers even for the same voice activation system.</p>
<p>We think it is safe to assume that industry research provides the best, or close to the best, voice activation systems today because of the large amounts of audio data and computational resources available. Shan <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_059">2018</xref>) report a system with <inline-formula id="j_infor398_ineq_016">
<alternatives><mml:math><mml:mn>1.02</mml:mn><mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$
1.02\% $]]></tex-math></alternatives></inline-formula> FRR with 1 false alarm per hour. This model has 84,000 parameters. Raziel and Hyun-Jin (<xref ref-type="bibr" rid="j_infor398_ref_053">2018</xref>) claim that their “Ok Google” voice activation system has an FRR from <inline-formula id="j_infor398_ineq_017">
<alternatives><mml:math><mml:mn>0.87</mml:mn><mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$
0.87\% $]]></tex-math></alternatives></inline-formula> (clean non-accented utterances) to <inline-formula id="j_infor398_ineq_018">
<alternatives><mml:math><mml:mn>8.90</mml:mn><mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$
8.90\% $]]></tex-math></alternatives></inline-formula> (real user query logs) with 0.1 false alarms per hour using 700,000 parameters. Fernández-Marqués <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_010">2018</xref>) show that it is possible to create a competitive voice activation system that uses 15.8 kB of memory and performs 2 million operations per inference pass.</p>
<p>The metrics used in studied sources are presented in Table <xref rid="j_infor398_tab_010">10</xref> in Appendix <xref rid="j_infor398_s_010">A</xref>. The number of times these metrics were used in studied sources is presented in Table <xref rid="j_infor398_tab_005">5</xref>.</p>
<table-wrap id="j_infor398_tab_005">
<label>Table 5</label>
<caption>
<p>The number of times a specific metric was used in studied sources.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Metrics</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Number of sources</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">FOM</td>
<td style="vertical-align: top; text-align: left">20</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">False alarm rate</td>
<td style="vertical-align: top; text-align: left">12</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">ROC</td>
<td style="vertical-align: top; text-align: left">8</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">False reject rate, accuracy</td>
<td style="vertical-align: top; text-align: left">6</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">False alarm per kw per hour</td>
<td style="vertical-align: top; text-align: left">5</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Detection rate, recall</td>
<td style="vertical-align: top; text-align: left">4</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Custom, recognition rate, real time factor</td>
<td style="vertical-align: top; text-align: left">3</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Equal error rate, deletion rate, rejection rate, precision</td>
<td style="vertical-align: top; text-align: left">2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Insertion rate, discriminative error rate, substitution rate, true positive rate, false positive rate, miss rate, F1, latency, mean time between false alarms, processing time, misses, hits, RAM usage, flops, accuracy to size, accuracy to ops</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="j_infor398_s_008">
<label>8</label>
<title>Unconventional Approaches</title>
<p>Some approaches to the construction of voice activation systems are difficult to describe according to the classification proposed in Section <xref rid="j_infor398_s_002">2</xref>.</p>
<p>First of all, it is worth mentioning template-comparison approaches, for example those based on DTW. In such systems, the user first records one or several pronunciations of a keyword, and incoming sound fragments are then compared with these recordings; a triggering is announced if the selected similarity measure exceeds some prespecified threshold. The advantages of this approach include the simplicity of both learning (memorization) and operation. In addition, personalization is natural in this approach: indeed, one can argue that the recorded templates capture the specific features of the user's pronunciation, which allow the user to be distinguished from other users if an appropriate similarity metric is used. In practice, however, this approach is not very robust. The quality of its operation depends on how well the similarity measure is chosen and which features are used, and the task of eliminating all the noise and dissimilarity across environments by an appropriate choice of features and similarity measure has proven difficult. DTW is one way to calculate a measure of similarity between two time series, possibly of different lengths. Systems using such approaches are described in Morgan <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_048">1991</xref>), Naylor <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_050">1992</xref>), Zeppenfeld and Waibel (<xref ref-type="bibr" rid="j_infor398_ref_085">1992</xref>), Kosonocky and Mammone (<xref ref-type="bibr" rid="j_infor398_ref_035">1995</xref>), Kurniawati <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_037">2012</xref>). Zehetner <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_084">2014</xref>) discuss the different similarity metrics that can be used within the DTW framework. Szöke <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_068">2015</xref>) discuss the possibility of using DTW even in the case where a keyword can be subject to declensions, conjugations, or even word order permutations.</p>
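<p>The template comparison described above can be sketched as follows. This is a minimal textbook DTW, not the implementation of any cited system; the feature dimensionality and the threshold are arbitrary illustrative values:</p>

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    a (n x d) and b (m x d), which may have different lengths."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local Euclidean distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Template matching: trigger when the query is closer to the recorded
# template than a prespecified distance threshold (value is illustrative).
rng = np.random.default_rng(0)
template = rng.normal(size=(40, 13))                  # stored keyword features
query = template + 0.05 * rng.normal(size=(40, 13))   # same keyword, new take
threshold = 10.0
print(dtw_distance(template, query) < threshold)
```

Because the warping path may repeat or skip frames, a slower or faster pronunciation of the same keyword still aligns well against the template.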
<p>Another interesting approach is to model the presence (or absence) of keywords with the help of point processes and, in particular, Poisson processes. In such systems, the parameters of two process families are estimated for each selected feature: one for sound containing a keyword and one for sound without it. An interesting property of such systems is the ability to adjust these parameters during operation, thereby adapting to the channel, the user, and the usage scenario. For more information on this approach see Jansen and Niyogi (<xref ref-type="bibr" rid="j_infor398_ref_026">2009c</xref>), Jansen and Niyogi (<xref ref-type="bibr" rid="j_infor398_ref_025">2009b</xref>). Sadhu and Ghosh (<xref ref-type="bibr" rid="j_infor398_ref_057">2017</xref>) describe how to apply this approach in systems with limited resources using unsupervised online learning.</p>
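<p>The decision rule underlying such systems can be illustrated with a likelihood-ratio test between two families of Poisson processes. The event types, rates, and threshold below are invented for illustration and are not taken from the cited works:</p>

```python
import math

def poisson_loglik(counts, rates, duration):
    """Log-likelihood of observed event counts under independent
    Poisson processes with the given rates (events per second)."""
    ll = 0.0
    for c, r in zip(counts, rates):
        lam = r * duration
        ll += c * math.log(lam) - lam - math.lgamma(c + 1)
    return ll

# Hypothetical per-feature event rates, as would be estimated from data:
keyword_rates = [4.0, 3.0, 0.5]      # rates when the keyword is present
background_rates = [1.0, 1.0, 1.0]   # rates for background speech

def detect(counts, duration, threshold=0.0):
    """Fire when the log-likelihood ratio favours the keyword model."""
    llr = (poisson_loglik(counts, keyword_rates, duration)
           - poisson_loglik(counts, background_rates, duration))
    return llr > threshold

print(detect([8, 6, 1], 2.0))   # event counts typical of the keyword
print(detect([2, 2, 2], 2.0))   # event counts typical of background
```

Online adaptation, as in the cited works, would amount to re-estimating the two rate vectors from incoming data during operation.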
<p>Finally, we would like to mention <bold>discriminative keyword spotting</bold>, an approach introduced in Keshet <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_030">2009</xref>). In this approach, instead of using an HMM or a similar model, the audio track is embedded in a feature space. Then, a linear (or more complex) model in this space is trained to distinguish <italic>positive</italic> (with a keyword) and <italic>negative</italic> (without a keyword) examples. This allows the use of support vector machine (SVM)-like approaches to maximize the margin from the separating hyperplane. In addition, the training task in such a system can be formulated as maximizing the area under the ROC curve, which is one of the common metrics for assessing the quality of a voice activation system. Such systems require feature engineering, which can be both an advantage (one can easily embed prior knowledge) and a disadvantage (incorrect prior knowledge leads to poor quality; in addition, feature engineering is a complex manual process). Subsequent works develop this approach: Wöllmer <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_080">2009b</xref>) add the hidden layer of a bidirectional LSTM network as features, Tabibian <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_070">2011</xref>) use a genetic algorithm instead of a linear classifier, and Tabibian <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_072">2014</xref>) describe the use of the kernel trick within the framework of discriminative keyword spotting. A very detailed explanation of discriminative keyword spotting can also be found in Tabibian <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_071">2013</xref>; <xref ref-type="bibr" rid="j_infor398_ref_073">2016</xref>).</p>
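<p>A minimal sketch of the discriminative idea follows, on synthetic data and with a simple mean/std embedding standing in for the hand-crafted features of the cited works. The max-margin linear classifier is trained by stochastic subgradient descent on the hinge loss, an SVM-like objective:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(track):
    """Embed a variable-length feature sequence (frames x coefficients)
    into a fixed-dimensional space: per-coefficient mean and std."""
    return np.concatenate([track.mean(axis=0), track.std(axis=0)])

# Synthetic positive (keyword) and negative examples of varying length.
pos = [embed(rng.normal(1.0, 0.3, (int(rng.integers(30, 60)), 8)))
       for _ in range(50)]
neg = [embed(rng.normal(-1.0, 0.3, (int(rng.integers(30, 60)), 8)))
       for _ in range(50)]
X = np.vstack(pos + neg)
y = np.array([1] * 50 + [-1] * 50)

# Hinge-loss training with L2 regularization (illustrative hyperparameters).
w, b, lam, lr = np.zeros(X.shape[1]), 0.0, 0.01, 0.01
for epoch in range(200):
    for i in rng.permutation(len(y)):
        if y[i] * (X[i] @ w + b) < 1:        # margin violated: push apart
            w += lr * (y[i] * X[i] - lam * w)
            b += lr * y[i]
        else:                                # margin satisfied: only shrink w
            w -= lr * lam * w

accuracy = np.mean(np.sign(X @ w + b) == y)
print(accuracy)
```

Training to maximize the area under the ROC curve instead, as in the original work, would replace the per-example hinge loss with a ranking loss over positive/negative pairs.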
</sec>
<sec id="j_infor398_s_009">
<label>9</label>
<title>Conclusion</title>
<p>In this research, we have conducted a systematic literature review of voice activation systems. We proposed the structure of a typical voice activation system and considered the main approaches described in the literature for each of the modules of such a system.</p>
<p>Regarding the feature representation, most of the techniques are shared with automatic speech recognition. The majority of cited works use MFCC or Log Mel-filterbank features. In this area, we see a reduction of inductive bias over time: more and more recent papers, such as Raziel and Hyun-Jin (<xref ref-type="bibr" rid="j_infor398_ref_053">2018</xref>) or Myer and Tomar (<xref ref-type="bibr" rid="j_infor398_ref_049">2018</xref>), do not use the DCT step, probably because deep neural networks work reasonably well even with correlated features. We expect further simplification: using the raw waveform or some unsupervised approach such as the contrastive predictive coding of Oord <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_051">2018</xref>).</p>
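<p>For illustration, Log Mel-filterbank features for a single frame can be computed as follows. This is a standard recipe with assumed parameter values (16 kHz sampling rate, 512-point FFT, 40 filters), and it stops before the DCT step that would turn the features into MFCCs:</p>

```python
import numpy as np

def log_mel_filterbank(frame, sr=16000, n_fft=512, n_mels=40):
    """Log Mel-filterbank energies of one windowed frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Triangular filters spaced uniformly on the Mel scale up to sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, c):
            fbank[i, k] = (k - lo) / max(c - lo, 1)   # rising slope
        for k in range(c, hi):
            fbank[i, k] = (hi - k) / max(hi - c, 1)   # falling slope

    return np.log(fbank @ spectrum + 1e-10)           # floor avoids log(0)

frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)  # 25 ms, 440 Hz tone
feats = log_mel_filterbank(frame)
print(feats.shape)
```

Applying a DCT to `feats` and keeping the first dozen or so coefficients would yield the MFCC representation used by most of the cited works.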
<p>GMMs, widely used in acoustic modelling, are being replaced with different types of neural networks. We are not aware of any state-of-the-art solutions to the voice activation problem that do not use deep learning. One of the main questions in this area is how to apply neural networks under limited resources. Some possible answers are: applying quantization, using a special network topology such as the time-delayed neural network, or using a cascade of models that wakes up the more powerful and resource-consuming model only if the smaller model is activated.</p>
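<p>The cascade idea can be sketched as follows; the two stages and their thresholds are purely illustrative stand-ins for a small always-on network and a large verification network:</p>

```python
# Two-stage cascade: a cheap always-on detector screens every window,
# and the expensive detector runs only on windows the first stage accepts.
# Both "models" and their thresholds are illustrative placeholders.

def small_model(window):
    """Cheap first stage: a crude average-energy check on every window."""
    return sum(x * x for x in window) / len(window) > 0.5

def large_model(window):
    """Expensive second stage: invoked only after the small model fires."""
    return max(window) > 1.5

def cascade(window):
    # Short-circuit evaluation guarantees large_model() is never called
    # on windows rejected by small_model(), saving computation.
    return small_model(window) and large_model(window)

print(cascade([0.1, 0.2, 0.1]))   # rejected cheaply by the first stage
print(cascade([1.0, 2.0, 1.0]))   # passes both stages
```

Since most audio contains no keyword, the average per-window cost stays close to that of the small model alone while the large model sets the final accuracy.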
<p>At the moment, the most widely used systems use phonemes as acoustic units. Phonemes are stable enough to be reliably found in the audio stream and flexible enough to be used for the majority of (if not all) keywords.</p>
<p>We believe that voice activation research could greatly benefit from creating open datasets in order to compare different systems. Today it is complicated to compare different works because of different train and test data, different keywords, and sometimes different target metrics.</p>
<p>As a result of the literature review, we noticed that there are some questions to which the published sources give no clear answers. We would therefore like to focus on them and conduct research in the following areas: 
<list>
<list-item id="j_infor398_li_036">
<label>•</label>
<p><bold>Are there common acoustic features suitable for all languages? How do we determine which features are needed for a given language?</bold> Indeed, most works build voice activation systems for the English language, for which MFCC and Log Mel-filterbank features have proven themselves well. Researchers who built systems for other languages faced the necessity of using other features (Ida and Yamasaki, <xref ref-type="bibr" rid="j_infor398_ref_023">1998</xref>; Zheng <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_087">1999</xref>).</p>
</list-item>
<list-item id="j_infor398_li_037">
<label>•</label>
<p><bold>Does the quality of activation depend on which acoustic unit is used for the language?</bold> A similar question was investigated for Spanish in Cuayáhuitl and Serridge (<xref ref-type="bibr" rid="j_infor398_ref_008">2002</xref>) and for Chinese in Liu <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor398_ref_043">2000</xref>), but the problem of determining the most appropriate acoustic unit for an arbitrary language was not investigated.</p>
</list-item>
<list-item id="j_infor398_li_038">
<label>•</label>
<p><bold>Are there any criteria for choosing keywords to activate on?</bold> This question is important both for the practical application of a voice activation system and for an objective comparison of systems with each other. We noticed that the acoustic features and the length of the keyword have a significant impact on the quality of activation. For example, Jansen and Niyogi (<xref ref-type="bibr" rid="j_infor398_ref_024">2009a</xref>) show that there is a strong correlation between the quality of operation and the length of the keyword. However, it remains an open question which other properties of the key phrase are important for the good operation of the system. It would also be interesting to investigate whether there are any general rules for choosing a good keyword.</p>
</list-item>
</list>
</p>
</sec>
<sec id="j_infor398_s_010">
<label>A</label>
<title>Appendix</title>
<table-wrap id="j_infor398_tab_006">
<label>Table 6</label>
<caption>
<p>Acoustic units used in studied sources.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Acoustic unit</td>
<td style="vertical-align: top; text-align: justify; border-top: solid thin; border-bottom: solid thin">Sources</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Whole word</td>
<td style="vertical-align: top; text-align: justify">(Morgan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_047">1990</xref>; Rose and Paul, <xref ref-type="bibr" rid="j_infor398_ref_056">1990</xref>; Morgan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_048">1991</xref>; Naylor <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_050">1992</xref>; Rohlicek <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_055">1993</xref>; Cuayáhuitl and Serridge, <xref ref-type="bibr" rid="j_infor398_ref_008">2002</xref>; Baljekar <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_002">2014</xref>; Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_006">2014a</xref>; Zehetner <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_084">2014</xref>; Hou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_021">2016</xref>; Manor and Greenberg, <xref ref-type="bibr" rid="j_infor398_ref_044">2017</xref>; Fernández-Marqués <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_010">2018</xref>; Myer and Tomar, <xref ref-type="bibr" rid="j_infor398_ref_049">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Monophone</td>
<td style="vertical-align: top; text-align: justify">(Rose and Paul, <xref ref-type="bibr" rid="j_infor398_ref_056">1990</xref>; Rohlicek <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_055">1993</xref>; Cuayáhuitl and Serridge, <xref ref-type="bibr" rid="j_infor398_ref_008">2002</xref>; Heracleous and Shimizu, <xref ref-type="bibr" rid="j_infor398_ref_019">2003</xref>; Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_066">2005</xref>; Lehtonen, <xref ref-type="bibr" rid="j_infor398_ref_039">2005</xref>; Silaghi and Vargiya, <xref ref-type="bibr" rid="j_infor398_ref_063">2005</xref>; Wöllmer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_080">2009b</xref>; Jansen and Niyogi, <xref ref-type="bibr" rid="j_infor398_ref_024">2009a,c</xref>; Wöllmer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_079">2009a</xref>; Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_067">2010</xref>; Shokri <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_060">2011</xref>; Tabibian <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_070">2011</xref>; Hou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_021">2016</xref>; Kumatani <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_036">2017</xref>; Gruenstein <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_016">2017</xref>; Tabibian <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_074">2018</xref>; Myer and Tomar, <xref ref-type="bibr" rid="j_infor398_ref_049">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Triphone</td>
<td style="vertical-align: top; text-align: justify">(Rose and Paul, <xref ref-type="bibr" rid="j_infor398_ref_056">1990</xref>; Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_066">2005</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Part of the word</td>
<td style="vertical-align: top; text-align: justify">(Naylor <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_050">1992</xref>; Li and Wang, <xref ref-type="bibr" rid="j_infor398_ref_042">2014</xref>; Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_006">2014a</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">State unit</td>
<td style="vertical-align: top; text-align: justify">(Zeppenfeld and Waibel, <xref ref-type="bibr" rid="j_infor398_ref_085">1992</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Part of the phoneme</td>
<td style="vertical-align: top; text-align: justify">(Rohlicek <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_054">1989</xref>; Kosonocky and Mammone, <xref ref-type="bibr" rid="j_infor398_ref_035">1995</xref>; Leow <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_041">2012</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Syllable</td>
<td style="vertical-align: top; text-align: justify">(Klemm <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_033">1995</xref>; Zheng <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_087">1999</xref>; Liu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_043">2000</xref>; Cuayáhuitl and Serridge, <xref ref-type="bibr" rid="j_infor398_ref_008">2002</xref>; Hou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_021">2016</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Letter</td>
<td style="vertical-align: top; text-align: justify">(Hwang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_022">2015</xref>; Hou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_021">2016</xref>; Lengerich and Hannun, <xref ref-type="bibr" rid="j_infor398_ref_040">2016</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Senone</td>
<td style="vertical-align: top; text-align: justify; border-bottom: solid thin">(Ge and Yan, <xref ref-type="bibr" rid="j_infor398_ref_011">2017</xref>)</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_infor398_tab_007">
<label>Table 7</label>
<caption>
<p>Acoustic features used in studied sources.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Acoustic features and transformations</td>
<td style="vertical-align: top; text-align: justify; border-top: solid thin; border-bottom: solid thin">Sources</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">MFCC</td>
<td style="vertical-align: top; text-align: justify">(Rose and Paul, <xref ref-type="bibr" rid="j_infor398_ref_056">1990</xref>; Vroomen and Normandin, <xref ref-type="bibr" rid="j_infor398_ref_076">1992</xref>; Junkawitsch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_028">1997</xref>; Liu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_043">2000</xref>; Heracleous and Shimizu, <xref ref-type="bibr" rid="j_infor398_ref_019">2003</xref>; Khne <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_032">2004</xref>; Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_066">2005</xref>; Keshet <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_030">2009</xref>; Wöllmer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_080">2009b</xref>; Bahi and Benati, <xref ref-type="bibr" rid="j_infor398_ref_001">2009</xref>; Jansen and Niyogi, <xref ref-type="bibr" rid="j_infor398_ref_026">2009c</xref>; Vasilache and Vasilache, <xref ref-type="bibr" rid="j_infor398_ref_075">2009</xref>; Wöllmer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_079">2009a</xref>; Tabibian <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_070">2011</xref>; Leow <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_041">2012</xref>; Wöllmer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_081">2013</xref>; Shokri <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_061">2013</xref>; Zhu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_088">2013</xref>; Baljekar <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_002">2014</xref>; Sangeetha and Jothilakshmi, <xref ref-type="bibr" rid="j_infor398_ref_058">2014</xref>; Shokri <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_062">2014</xref>; Zehetner <italic>et al.</italic>, <xref ref-type="bibr" 
rid="j_infor398_ref_084">2014</xref>; Laszko, <xref ref-type="bibr" rid="j_infor398_ref_038">2016</xref>; Manor and Greenberg, <xref ref-type="bibr" rid="j_infor398_ref_044">2017</xref>; Fernández-Marqués <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_010">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Log Mel-filterbank</td>
<td style="vertical-align: top; text-align: justify">(Morgan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_047">1990</xref>, <xref ref-type="bibr" rid="j_infor398_ref_048">1991</xref>; Zeppenfeld and Waibel, <xref ref-type="bibr" rid="j_infor398_ref_085">1992</xref>; Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_006">2014a</xref>; Hwang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_022">2015</xref>; Hou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_021">2016</xref>; Gruenstein <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_016">2017</xref>; Sun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_065">2017</xref>; Myer and Tomar, <xref ref-type="bibr" rid="j_infor398_ref_049">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Fourier transform</td>
<td style="vertical-align: top; text-align: justify">(Morgan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_047">1990</xref>, <xref ref-type="bibr" rid="j_infor398_ref_048">1991</xref>; Zeppenfeld and Waibel, <xref ref-type="bibr" rid="j_infor398_ref_085">1992</xref>; Guo <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_017">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">LPC</td>
<td style="vertical-align: top; text-align: justify">(Gish <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_014">1990</xref>, <xref ref-type="bibr" rid="j_infor398_ref_015">1992</xref>; Gish and Ng, <xref ref-type="bibr" rid="j_infor398_ref_013">1993</xref>; Rohlicek <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_055">1993</xref>; Zheng <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_087">1999</xref>; Rohlicek <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_054">1989</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Derivatives or deltas</td>
<td style="vertical-align: top; text-align: justify">(Gish <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_014">1990</xref>; Rose and Paul, <xref ref-type="bibr" rid="j_infor398_ref_056">1990</xref>; Vroomen and Normandin, <xref ref-type="bibr" rid="j_infor398_ref_076">1992</xref>; Gish and Ng, <xref ref-type="bibr" rid="j_infor398_ref_013">1993</xref>; Junkawitsch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_028">1997</xref>; Liu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_043">2000</xref>; Heracleous and Shimizu, <xref ref-type="bibr" rid="j_infor398_ref_019">2003</xref>; Khne <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_032">2004</xref>; Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_066">2005</xref>; Keshet <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_030">2009</xref>; Wöllmer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_080">2009b</xref>; Jansen and Niyogi, <xref ref-type="bibr" rid="j_infor398_ref_026">2009c</xref>; Vasilache and Vasilache, <xref ref-type="bibr" rid="j_infor398_ref_075">2009</xref>; Wöllmer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_079">2009a</xref>; Tabibian <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_070">2011</xref>; Leow <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_041">2012</xref>; Wöllmer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_081">2013</xref>; Shokri <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_061">2013</xref>; Baljekar <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_002">2014</xref>; Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_006">2014a</xref>; Sangeetha and Jothilakshmi, <xref ref-type="bibr" rid="j_infor398_ref_058">2014</xref>; Shokri <italic>et al.</italic>, <xref ref-type="bibr" 
rid="j_infor398_ref_062">2014</xref>; Hwang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_022">2015</xref>; Ge and Yan, <xref ref-type="bibr" rid="j_infor398_ref_011">2017</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Energy or log-energy</td>
<td style="vertical-align: top; text-align: justify">(Vroomen and Normandin, <xref ref-type="bibr" rid="j_infor398_ref_076">1992</xref>; Heracleous and Shimizu, <xref ref-type="bibr" rid="j_infor398_ref_019">2003</xref>; Khne <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_032">2004</xref>; Wöllmer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_081">2013</xref>; Hwang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_022">2015</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Mean/variance normalization</td>
<td style="vertical-align: top; text-align: justify">(Gish <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_015">1992</xref>; Gish and Ng, <xref ref-type="bibr" rid="j_infor398_ref_013">1993</xref>; Rohlicek <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_055">1993</xref>; Jansen and Niyogi, <xref ref-type="bibr" rid="j_infor398_ref_026">2009c</xref>; Wöllmer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_079">2009a</xref>; Shokri <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_060">2011</xref>; Sangeetha and Jothilakshmi, <xref ref-type="bibr" rid="j_infor398_ref_058">2014</xref>; Myer and Tomar, <xref ref-type="bibr" rid="j_infor398_ref_049">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Gain normalization</td>
<td style="vertical-align: top; text-align: justify">(Shokri <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_060">2011</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Stacking</td>
<td style="vertical-align: top; text-align: justify">(Junkawitsch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_028">1997</xref>; Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_006">2014a</xref>; Fernández-Marqués <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_010">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">LDA over MFCC</td>
<td style="vertical-align: top; text-align: justify">(Junkawitsch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_028">1997</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Prosodic information</td>
<td style="vertical-align: top; text-align: justify">(Ida and Yamasaki, <xref ref-type="bibr" rid="j_infor398_ref_023">1998</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">PCP</td>
<td style="vertical-align: top; text-align: justify">(Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_066">2005</xref>; Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_006">2014a</xref>; Ge and Yan, <xref ref-type="bibr" rid="j_infor398_ref_011">2017</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">AMA</td>
<td style="vertical-align: top; text-align: justify">(Shokri <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_060">2011</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Spectral entropy, spectral flatness, burst degree, bisector frequency</td>
<td style="vertical-align: top; text-align: justify">(Tabibian <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_070">2011</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Formant frequencies</td>
<td style="vertical-align: top; text-align: justify">(Laszko, <xref ref-type="bibr" rid="j_infor398_ref_038">2016</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">f-MLLR</td>
<td style="vertical-align: top; text-align: justify">(Sadhu and Ghosh, <xref ref-type="bibr" rid="j_infor398_ref_057">2017</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Raw waveform</td>
<td style="vertical-align: top; text-align: justify; border-bottom: solid thin">(Kumatani <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_036">2017</xref>)</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_infor398_tab_008">
<label>Table 8</label>
<caption>
<p>Acoustic models used in studied sources.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: justify; border-top: solid thin; border-bottom: solid thin">Acoustic model</td>
<td style="vertical-align: top; text-align: justify; border-top: solid thin; border-bottom: solid thin">Sources</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: justify">NN</td>
<td style="vertical-align: top; text-align: justify">(Morgan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_047">1990</xref>; Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_066">2005</xref>; Lehtonen, <xref ref-type="bibr" rid="j_infor398_ref_039">2005</xref>; Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_067">2010</xref>; Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_007">2014b</xref>; Hou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_021">2016</xref>; Gruenstein <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_016">2017</xref>; Ge and Yan, <xref ref-type="bibr" rid="j_infor398_ref_011">2017</xref>; Wu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_082">2018</xref>; Myer and Tomar, <xref ref-type="bibr" rid="j_infor398_ref_049">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">GMM</td>
<td style="vertical-align: top; text-align: justify">(Rohlicek <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_054">1989</xref>; Rose and Paul, <xref ref-type="bibr" rid="j_infor398_ref_056">1990</xref>; Vroomen and Normandin, <xref ref-type="bibr" rid="j_infor398_ref_076">1992</xref>; Junkawitsch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_028">1997</xref>; Liu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_043">2000</xref>; Heracleous and Shimizu, <xref ref-type="bibr" rid="j_infor398_ref_019">2003</xref>; Khne <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_032">2004</xref>; Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_066">2005</xref>; Jansen and Niyogi, <xref ref-type="bibr" rid="j_infor398_ref_024">2009a,c</xref>; Vasilache and Vasilache, <xref ref-type="bibr" rid="j_infor398_ref_075">2009</xref>; Shokri <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_060">2011</xref>; Leow <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_041">2012</xref>; Zhu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_088">2013</xref>; Baljekar <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_002">2014</xref>; Li and Wang, <xref ref-type="bibr" rid="j_infor398_ref_042">2014</xref>; Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_006">2014a</xref>; Benisty <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_003">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">RNN</td>
<td style="vertical-align: top; text-align: justify">(Naylor <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_050">1992</xref>; Baljekar <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_002">2014</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Gated RNN</td>
<td style="vertical-align: top; text-align: justify">(Baljekar <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_002">2014</xref>; Hou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_021">2016</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">TDNN</td>
<td style="vertical-align: top; text-align: justify">(Zeppenfeld and Waibel, <xref ref-type="bibr" rid="j_infor398_ref_085">1992</xref>; Kumatani <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_036">2017</xref>; Sun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_065">2017</xref>; Myer and Tomar, <xref ref-type="bibr" rid="j_infor398_ref_049">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Polynomial model</td>
<td style="vertical-align: top; text-align: justify">(Gish and Ng, <xref ref-type="bibr" rid="j_infor398_ref_013">1993</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Continuous density neural tree</td>
<td style="vertical-align: top; text-align: justify">(Kosonocky and Mammone, <xref ref-type="bibr" rid="j_infor398_ref_035">1995</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Mixture of central distance normal distributions</td>
<td style="vertical-align: top; text-align: justify">(Zheng <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_087">1999</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">LSTM</td>
<td style="vertical-align: top; text-align: justify">(Hwang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_022">2015</xref>; Hou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_021">2016</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Bi-LSTM</td>
<td style="vertical-align: top; text-align: justify">(Wöllmer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_079">2009a</xref>; Zhang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_086">2016</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">SVM</td>
<td style="vertical-align: top; text-align: justify">(Tabibian <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_070">2011</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">DNN with highway blocks</td>
<td style="vertical-align: top; text-align: justify">(Guo <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_017">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Binary DNN</td>
<td style="vertical-align: top; text-align: justify">(Fernández-Marqués <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_010">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify; border-bottom: solid thin">CNN</td>
<td style="vertical-align: top; text-align: justify; border-bottom: solid thin">(Myer and Tomar, <xref ref-type="bibr" rid="j_infor398_ref_049">2018</xref>)</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_infor398_tab_009">
<label>Table 9</label>
<caption>
<p>The approaches to decoding used in studied sources.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: justify; border-top: solid thin; border-bottom: solid thin">Decoding approach</td>
<td style="vertical-align: top; text-align: justify; border-top: solid thin; border-bottom: solid thin">Sources</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: justify">Comparing to threshold</td>
<td style="vertical-align: top; text-align: justify">(Morgan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_047">1990</xref>; Naylor <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_050">1992</xref>; Junkawitsch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_028">1997</xref>; Keshet <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_030">2009</xref>; Wöllmer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_080">2009b,a</xref>; Li and Wang, <xref ref-type="bibr" rid="j_infor398_ref_042">2014</xref>; Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_006">2014a</xref>; Gruenstein <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_016">2017</xref>; Benisty <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_003">2018</xref>; Myer and Tomar, <xref ref-type="bibr" rid="j_infor398_ref_049">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Viterbi</td>
<td style="vertical-align: top; text-align: justify">(Rose and Paul, <xref ref-type="bibr" rid="j_infor398_ref_056">1990</xref>; Feng and Mazor, <xref ref-type="bibr" rid="j_infor398_ref_009">1992</xref>; Wilcox and Bush, <xref ref-type="bibr" rid="j_infor398_ref_078">1992</xref>; Rohlicek <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_055">1993</xref>; Knill and Young, <xref ref-type="bibr" rid="j_infor398_ref_034">1996</xref>; Junkawitsch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_028">1997</xref>; Zheng <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_087">1999</xref>; Liu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_043">2000</xref>; Vasilache and Vasilache, <xref ref-type="bibr" rid="j_infor398_ref_075">2009</xref>; Tabibian <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_070">2011</xref>; Leow <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_041">2012</xref>; Zhu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_088">2013</xref>; Kumatani <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_036">2017</xref>; Ge and Yan, <xref ref-type="bibr" rid="j_infor398_ref_011">2017</xref>; Sun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_065">2017</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Forward–Backward algorithm</td>
<td style="vertical-align: top; text-align: justify">(Wilcox and Bush, <xref ref-type="bibr" rid="j_infor398_ref_078">1992</xref>; Rohlicek <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_055">1993</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">DTW</td>
<td style="vertical-align: top; text-align: justify">(Zeppenfeld and Waibel, <xref ref-type="bibr" rid="j_infor398_ref_085">1992</xref>; Kosonocky and Mammone, <xref ref-type="bibr" rid="j_infor398_ref_035">1995</xref>; Kurniawati <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_037">2012</xref>; Zehetner <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_084">2014</xref>; Hou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_021">2016</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Likelihood ratio</td>
<td style="vertical-align: top; text-align: justify">(Jansen and Niyogi, <xref ref-type="bibr" rid="j_infor398_ref_026">2009c</xref>; Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_067">2010</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify; border-bottom: solid thin">Fuzzy logic</td>
<td style="vertical-align: top; text-align: justify; border-bottom: solid thin">(Manor and Greenberg, <xref ref-type="bibr" rid="j_infor398_ref_044">2017</xref>)</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_infor398_tab_010">
<label>Table 10</label>
<caption>
<p>Metrics used in the studied sources.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: justify; border-top: solid thin; border-bottom: solid thin">Metrics</td>
<td style="vertical-align: top; text-align: justify; border-top: solid thin; border-bottom: solid thin">Sources</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: justify">FOM</td>
<td style="vertical-align: top; text-align: justify">(Gish <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_014">1990</xref>; Rose and Paul, <xref ref-type="bibr" rid="j_infor398_ref_056">1990</xref>; Naylor <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_050">1992</xref>; Zeppenfeld and Waibel, <xref ref-type="bibr" rid="j_infor398_ref_085">1992</xref>; Chang and Lippmann, <xref ref-type="bibr" rid="j_infor398_ref_005">1994</xref>; Gish and Ng, <xref ref-type="bibr" rid="j_infor398_ref_013">1993</xref>; Rohlicek <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_055">1993</xref>; Knill and Young, <xref ref-type="bibr" rid="j_infor398_ref_034">1996</xref>; Junkawitsch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_028">1997</xref>; Zheng <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_087">1999</xref>; Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_066">2005</xref>; Lehtonen, <xref ref-type="bibr" rid="j_infor398_ref_039">2005</xref>; Jansen and Niyogi, <xref ref-type="bibr" rid="j_infor398_ref_024">2009a,c</xref>; Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_067">2010</xref>; Tabibian <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_070">2011</xref>; Bohac, <xref ref-type="bibr" rid="j_infor398_ref_004">2012</xref>; Sangeetha and Jothilakshmi, <xref ref-type="bibr" rid="j_infor398_ref_058">2014</xref>; Sadhu and Ghosh, <xref ref-type="bibr" rid="j_infor398_ref_057">2017</xref>; Tabibian <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_074">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">EER</td>
<td style="vertical-align: top; text-align: justify">(Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_067">2010</xref>; Bohac, <xref ref-type="bibr" rid="j_infor398_ref_004">2012</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Accuracy</td>
<td style="vertical-align: top; text-align: justify">(Morgan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_047">1990</xref>, <xref ref-type="bibr" rid="j_infor398_ref_048">1991</xref>; Ida and Yamasaki, <xref ref-type="bibr" rid="j_infor398_ref_023">1998</xref>; Ge and Yan, <xref ref-type="bibr" rid="j_infor398_ref_011">2017</xref>; Benisty <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_003">2018</xref>; Fernández-Marqués <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_010">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">FA/kw/h</td>
<td style="vertical-align: top; text-align: justify">(Rohlicek <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_054">1989</xref>; Vroomen and Normandin, <xref ref-type="bibr" rid="j_infor398_ref_076">1992</xref>; Feng and Mazor, <xref ref-type="bibr" rid="j_infor398_ref_009">1992</xref>; Leow <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_041">2012</xref>; Kavya and Karjigi, <xref ref-type="bibr" rid="j_infor398_ref_029">2014</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">ROC</td>
<td style="vertical-align: top; text-align: justify">(Marcus, <xref ref-type="bibr" rid="j_infor398_ref_045">1992</xref>; Siu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_064">1994</xref>; Keshet <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_030">2009</xref>; Wöllmer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_080">2009b</xref>, <xref ref-type="bibr" rid="j_infor398_ref_081">2013</xref>; Shokri <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_061">2013</xref>; Sadhu and Ghosh, <xref ref-type="bibr" rid="j_infor398_ref_057">2017</xref>; Kumatani <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_036">2017</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Detection rate</td>
<td style="vertical-align: top; text-align: justify">(Feng and Mazor, <xref ref-type="bibr" rid="j_infor398_ref_009">1992</xref>; Kühne <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_032">2004</xref>; Shokri <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_060">2011</xref>; Leow <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_041">2012</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Substitution rate</td>
<td style="vertical-align: top; text-align: justify">(Feng and Mazor, <xref ref-type="bibr" rid="j_infor398_ref_009">1992</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Deletion rate</td>
<td style="vertical-align: top; text-align: justify">(Feng and Mazor, <xref ref-type="bibr" rid="j_infor398_ref_009">1992</xref>; Kavya and Karjigi, <xref ref-type="bibr" rid="j_infor398_ref_029">2014</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Rejection rate</td>
<td style="vertical-align: top; text-align: justify">(Feng and Mazor, <xref ref-type="bibr" rid="j_infor398_ref_009">1992</xref>; Heracleous and Shimizu, <xref ref-type="bibr" rid="j_infor398_ref_019">2003</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Insertion rate</td>
<td style="vertical-align: top; text-align: justify">(Klemm <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_033">1995</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Recognition rate</td>
<td style="vertical-align: top; text-align: justify">(Liu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_043">2000</xref>; Heracleous and Shimizu, <xref ref-type="bibr" rid="j_infor398_ref_019">2003</xref>; Zhu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_088">2013</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Discriminative error rate</td>
<td style="vertical-align: top; text-align: justify">(Cuayáhuitl and Serridge, <xref ref-type="bibr" rid="j_infor398_ref_008">2002</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">FAR</td>
<td style="vertical-align: top; text-align: justify">(Kühne <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_032">2004</xref>; Shokri <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_060">2011</xref>; Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_006">2014a</xref>; Hou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_021">2016</xref>; Gruenstein <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_016">2017</xref>; Ge and Yan, <xref ref-type="bibr" rid="j_infor398_ref_011">2017</xref>; Sun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_065">2017</xref>; Tabibian <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_074">2018</xref>; Benisty <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_003">2018</xref>; Guo <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_017">2018</xref>; Wu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_082">2018</xref>; Myer and Tomar, <xref ref-type="bibr" rid="j_infor398_ref_049">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">FRR</td>
<td style="vertical-align: top; text-align: justify">(Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_006">2014a</xref>; Gruenstein <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_016">2017</xref>; Sun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_065">2017</xref>; Guo <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_017">2018</xref>; Wu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_082">2018</xref>; Myer and Tomar, <xref ref-type="bibr" rid="j_infor398_ref_049">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">RTF</td>
<td style="vertical-align: top; text-align: justify">(Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_066">2005</xref>; Bohac, <xref ref-type="bibr" rid="j_infor398_ref_004">2012</xref>; Tabibian <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_074">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">TPR, FPR</td>
<td style="vertical-align: top; text-align: justify">(Wöllmer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_079">2009a</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Miss rate</td>
<td style="vertical-align: top; text-align: justify">(Hou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_021">2016</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Recall</td>
<td style="vertical-align: top; text-align: justify">(Baljekar <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_002">2014</xref>; Li and Wang, <xref ref-type="bibr" rid="j_infor398_ref_042">2014</xref>; Zehetner <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_084">2014</xref>; Hwang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_022">2015</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Precision</td>
<td style="vertical-align: top; text-align: justify">(Zehetner <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_084">2014</xref>; Hwang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_022">2015</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">F1, latency</td>
<td style="vertical-align: top; text-align: justify">(Hwang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_022">2015</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Mean time between false alarms</td>
<td style="vertical-align: top; text-align: justify">(Baljekar <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_002">2014</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Processing time</td>
<td style="vertical-align: top; text-align: justify">(Li and Wang, <xref ref-type="bibr" rid="j_infor398_ref_042">2014</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">Misses, hits</td>
<td style="vertical-align: top; text-align: justify">(Li and Wang, <xref ref-type="bibr" rid="j_infor398_ref_042">2014</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify">RAM usage, flops, accuracy to size, accuracy to ops</td>
<td style="vertical-align: top; text-align: justify">(Fernández-Marqués <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_010">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: justify; border-bottom: solid thin">Custom</td>
<td style="vertical-align: top; text-align: justify; border-bottom: solid thin">(Marcus, <xref ref-type="bibr" rid="j_infor398_ref_045">1992</xref>; Silaghi and Vargiya, <xref ref-type="bibr" rid="j_infor398_ref_063">2005</xref>; Szöke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor398_ref_067">2010</xref>)</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</body>
<back>
<ref-list id="j_infor398_reflist_001">
<title>References</title>
<ref id="j_infor398_ref_001">
<mixed-citation publication-type="chapter"><string-name><surname>Bahi</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Benati</surname>, <given-names>N.</given-names></string-name> (<year>2009</year>). <chapter-title>A new keyword spotting approach</chapter-title>. In: <source>2009 International Conference on Multimedia Computing and Systems</source>, pp. <fpage>77</fpage>–<lpage>80</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_002">
<mixed-citation publication-type="chapter"><string-name><surname>Baljekar</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Lehman</surname>, <given-names>J.F.</given-names></string-name>, <string-name><surname>Singh</surname>, <given-names>R.</given-names></string-name> (<year>2014</year>). <chapter-title>Online word-spotting in continuous speech with recurrent neural networks</chapter-title>. In: <source>2014 IEEE Spoken Language Technology Workshop, SLT 2014</source>, <conf-loc>South Lake Tahoe, NV, USA, December 7–10, 2014</conf-loc>, pp. <fpage>536</fpage>–<lpage>541</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_003">
<mixed-citation publication-type="journal"><string-name><surname>Benisty</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Katz</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Crammer</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Malah</surname>, <given-names>D.</given-names></string-name> (<year>2018</year>). <article-title>Discriminative keyword spotting for limited-data applications</article-title>. <source>Speech Communication</source>, <volume>99</volume>, <fpage>1</fpage>–<lpage>11</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_004">
<mixed-citation publication-type="chapter"><string-name><surname>Bohac</surname>, <given-names>M.</given-names></string-name> (<year>2012</year>). <chapter-title>Performance comparison of several techniques to detect keywords in audio streams and audio scene</chapter-title>. In: <source>Proceedings ELMAR-2012</source>, pp. <fpage>215</fpage>–<lpage>218</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_005">
<mixed-citation publication-type="chapter"><string-name><surname>Chang</surname>, <given-names>E.I.</given-names></string-name>, <string-name><surname>Lippmann</surname>, <given-names>R.P.</given-names></string-name> (<year>1994</year>). <chapter-title>Figure of merit training for detection and spotting</chapter-title>. In: <string-name><surname>Cowan</surname>, <given-names>J.D.</given-names></string-name>, <string-name><surname>Tesauro</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Alspector</surname>, <given-names>J.</given-names></string-name> (Eds.), <source>Advances in Neural Information Processing Systems</source>, Vol. 6. <publisher-name>Morgan-Kaufmann</publisher-name>, pp. <fpage>1019</fpage>–<lpage>1026</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_006">
<mixed-citation publication-type="chapter"><string-name><surname>Chen</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Parada</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Heigold</surname>, <given-names>G.</given-names></string-name> (<year>2014</year>a). <chapter-title>Small-footprint keyword spotting using deep neural networks</chapter-title>. In: <source>IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014</source>, <conf-loc>Florence, Italy, May 4–9, 2014</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>4087</fpage>–<lpage>4091</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_007">
<mixed-citation publication-type="chapter"><string-name><surname>Chen</surname>, <given-names>N.F.</given-names></string-name>, <string-name><surname>Sivadas</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Lim</surname>, <given-names>B.P.</given-names></string-name>, <string-name><surname>Ngo</surname>, <given-names>H.G.</given-names></string-name>, <string-name><surname>Xu</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Pham</surname>, <given-names>V.T.</given-names></string-name>, <string-name><surname>Ma</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>H.</given-names></string-name> (<year>2014</year>b). <chapter-title>Strategies for Vietnamese keyword search</chapter-title>. In: <source>IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014</source>, <conf-loc>Florence, Italy, May 4–9, 2014</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>4121</fpage>–<lpage>4125</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_008">
<mixed-citation publication-type="chapter"><string-name><surname>Cuayáhuitl</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Serridge</surname>, <given-names>B.</given-names></string-name> (<year>2002</year>). <chapter-title>Out-of-vocabulary word modeling and rejection for Spanish keyword spotting systems</chapter-title>. In: <string-name><surname>Coello</surname>, <given-names>C.A.C.</given-names></string-name>, <string-name><surname>de Albornoz</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Sucar</surname>, <given-names>L.E.</given-names></string-name>, <string-name><surname>Battistutti</surname>, <given-names>O.C.</given-names></string-name> (Eds.), <source>MICAI 2002: Advances in Artificial Intelligence, Second Mexican International Conference on Artificial Intelligence</source>, <conf-loc>Merida, Yucatan, Mexico, April 22–26, 2002, Proceedings</conf-loc>, <series><italic>Lecture Notes in Computer Science</italic></series>, Vol. <volume>2313</volume>. <publisher-name>Springer</publisher-name>, pp. <fpage>156</fpage>–<lpage>165</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_009">
<mixed-citation publication-type="chapter"><string-name><surname>Feng</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Mazor</surname>, <given-names>B.</given-names></string-name> (<year>1992</year>). <chapter-title>Continuous word spotting for applications in telecommunications</chapter-title>. In: <source>The Second International Conference on Spoken Language Processing, ICSLP 1992</source>, <conf-loc>Banff, Alberta, Canada, October 13–16, 1992</conf-loc>, <publisher-name>ISCA</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_010">
<mixed-citation publication-type="chapter"><string-name><surname>Fernández-Marqués</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Tseng</surname>, <given-names>V.W.S.</given-names></string-name>, <string-name><surname>Bhattacharya</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Lane</surname>, <given-names>N.D.</given-names></string-name> (<year>2018</year>). <chapter-title>Deterministic binary filters for keyword spotting applications</chapter-title>. In: <string-name><surname>Ott</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Dressler</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Saroiu</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Dutta</surname>, <given-names>P.</given-names></string-name> (Eds.), <source>Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys 2018</source>, <conf-loc>Munich, Germany, June 10–15, 2018</conf-loc>. <publisher-name>ACM</publisher-name>, p. <fpage>529</fpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_011">
<mixed-citation publication-type="chapter"><string-name><surname>Ge</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Yan</surname>, <given-names>Y.</given-names></string-name> (<year>2017</year>). <chapter-title>Deep neural network based wake-up-word speech recognition with two-stage detection</chapter-title>. In: <source>2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017</source>, <conf-loc>New Orleans, LA, USA, March 5–9, 2017</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>2761</fpage>–<lpage>2765</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_012">
<mixed-citation publication-type="journal"><string-name><surname>Giannakopoulos</surname>, <given-names>T.</given-names></string-name> (<year>2015</year>). <article-title>pyAudioAnalysis: an open-source Python library for audio signal analysis</article-title>. <source>PloS One</source>, <volume>10</volume>(<issue>12</issue>).</mixed-citation>
</ref>
<ref id="j_infor398_ref_013">
<mixed-citation publication-type="chapter"><string-name><surname>Gish</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>K.</given-names></string-name> (<year>1993</year>). <chapter-title>A segmental speech model with applications to word spotting</chapter-title>. In: <source>1993 IEEE International Conference on Acoustics, Speech, and Signal Processing</source>, Vol. 2, pp. <fpage>447</fpage>–<lpage>450</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_014">
<mixed-citation publication-type="chapter"><string-name><surname>Gish</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Chow</surname>, <given-names>Y.L.</given-names></string-name>, <string-name><surname>Rohlicek</surname>, <given-names>J.R.</given-names></string-name> (<year>1990</year>). <chapter-title>Probabilistic vector mapping of noisy speech parameters for HMM word spotting</chapter-title>. In: <source>International Conference on Acoustics, Speech, and Signal Processing</source>, Vol. 1, pp. <fpage>117</fpage>–<lpage>120</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_015">
<mixed-citation publication-type="chapter"><string-name><surname>Gish</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Rohlicek</surname>, <given-names>J.R.</given-names></string-name> (<year>1992</year>). <chapter-title>Secondary processing using speech segments for an HMM word spotting system</chapter-title>. In: <source>The Second International Conference on Spoken Language Processing, ICSLP 1992</source>, <conf-loc>Banff, Alberta, Canada, October 13–16, 1992</conf-loc>, <publisher-name>ISCA</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_016">
<mixed-citation publication-type="other"><string-name><surname>Gruenstein</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Alvarez</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Thornton</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Ghodrat</surname>, <given-names>M.</given-names></string-name> (2017). A cascade architecture for keyword spotting on mobile devices. <italic>CoRR</italic>, <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1712.03603">abs/1712.03603</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_017">
<mixed-citation publication-type="chapter"><string-name><surname>Guo</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Kumatani</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Sun</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Raju</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Strom</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Mandal</surname>, <given-names>A.</given-names></string-name> (<year>2018</year>). <chapter-title>Time-delayed bottleneck highway networks using a DFT feature for keyword spotting</chapter-title>. In: <source>2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018</source>, <conf-loc>Calgary, AB, Canada, April 15–20, 2018</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>5489</fpage>–<lpage>5493</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_018">
<mixed-citation publication-type="journal"><string-name><surname>Hao</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>X.</given-names></string-name> (<year>2002</year>). <article-title>Word spotting based on a posterior measure of keyword confidence</article-title>. <source>Journal of Computer Science and Technology</source>, <volume>17</volume>(<issue>4</issue>), <fpage>491</fpage>–<lpage>497</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_019">
<mixed-citation publication-type="chapter"><string-name><surname>Heracleous</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Shimizu</surname>, <given-names>T.</given-names></string-name> (<year>2003</year>). <chapter-title>An efficient keyword spotting technique using a complementary language for filler models training</chapter-title>. In: <source>8th European Conference on Speech Communication and Technology, EUROSPEECH 2003 – INTERSPEECH 2003</source>, <conf-loc>Geneva, Switzerland, September 1–4, 2003</conf-loc>, <publisher-name>ISCA</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_020">
<mixed-citation publication-type="journal"><string-name><surname>Hinton</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Deng</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Yu</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Dahl</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Mohamed</surname>, <given-names>A.R.</given-names></string-name>, <string-name><surname>Jaitly</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Senior</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Vanhoucke</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Nguyen</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Kingsbury</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Sainath</surname>, <given-names>T.</given-names></string-name> (<year>2012</year>). <article-title>Deep neural networks for acoustic modeling in speech recognition</article-title>. <source>IEEE Signal Processing Magazine</source>, <volume>29</volume>, <fpage>82</fpage>–<lpage>97</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_021">
<mixed-citation publication-type="chapter"><string-name><surname>Hou</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Xie</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Fu</surname>, <given-names>Z.</given-names></string-name> (<year>2016</year>). <chapter-title>Investigating neural network based query-by-example keyword spotting approach for personalized wake-up word detection in Mandarin Chinese</chapter-title>. In: <source>10th International Symposium on Chinese Spoken Language Processing, ISCSLP 2016</source>, <conf-loc>Tianjin, China, October 17–20, 2016</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>1</fpage>–<lpage>5</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_022">
<mixed-citation publication-type="other"><string-name><surname>Hwang</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Sung</surname>, <given-names>W.</given-names></string-name> (2015). Online keyword spotting with a character-level recurrent neural network. <italic>CoRR</italic>, <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1512.08903">abs/1512.08903</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_023">
<mixed-citation publication-type="chapter"><string-name><surname>Ida</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Yamasaki</surname>, <given-names>R.</given-names></string-name> (<year>1998</year>). <chapter-title>An evaluation of keyword spotting performance utilizing false alarm rejection based on prosodic information</chapter-title>. In: <source>The 5th International Conference on Spoken Language Processing, Incorporating The 7th Australian International Speech Science and Technology Conference</source>, <conf-loc>Sydney Convention Centre, Sydney, Australia, 30th November–4th December 1998</conf-loc>, <publisher-name>ISCA</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_024">
<mixed-citation publication-type="other"><string-name><surname>Jansen</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Niyogi</surname>, <given-names>P.</given-names></string-name> (2009a). <italic>An experimental evaluation of keyword-filler hidden Markov models</italic>. Tech. Rep., Department of Computer Science, University of Chicago.</mixed-citation>
</ref>
<ref id="j_infor398_ref_025">
<mixed-citation publication-type="journal"><string-name><surname>Jansen</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Niyogi</surname>, <given-names>P.</given-names></string-name> (<year>2009</year>b). <article-title>Point process models for spotting keywords in continuous speech</article-title>. <source>IEEE Transactions on Audio, Speech, and Language Processing</source>, <volume>17</volume>(<issue>8</issue>), <fpage>1457</fpage>–<lpage>1470</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_026">
<mixed-citation publication-type="chapter"><string-name><surname>Jansen</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Niyogi</surname>, <given-names>P.</given-names></string-name> (<year>2009</year>c). <chapter-title>Robust keyword spotting with rapidly adapting point process models</chapter-title>. In: <source>INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association</source>, <conf-loc>Brighton, United Kingdom, September 6–10, 2009</conf-loc>. <publisher-name>ISCA</publisher-name>, pp. <fpage>2767</fpage>–<lpage>2770</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_027">
<mixed-citation publication-type="other"><source>Janus Toolkit Documentation</source> (2019). <ext-link ext-link-type="uri" xlink:href="http://www.cs.cmu.edu/~tanja/Lectures/JRTkDoc/OldDoc/senones/sn_main.html">http://www.cs.cmu.edu/~tanja/Lectures/JRTkDoc/OldDoc/senones/sn_main.html</ext-link>, Accessed 30th June 2019.</mixed-citation>
</ref>
<ref id="j_infor398_ref_028">
<mixed-citation publication-type="chapter"><string-name><surname>Junkawitsch</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Ruske</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Höge</surname>, <given-names>H.</given-names></string-name> (<year>1997</year>). <chapter-title>Efficient methods for detecting keywords in continuous speech</chapter-title>. In: <string-name><surname>Kokkinakis</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Fakotakis</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Dermatas</surname>, <given-names>E.</given-names></string-name> (Eds.), <source>Fifth European Conference on Speech Communication and Technology, EUROSPEECH 1997</source>, <conf-loc>Rhodes, Greece, September 22–25, 1997</conf-loc>, <publisher-name>ISCA</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_029">
<mixed-citation publication-type="chapter"><string-name><surname>Kavya</surname>, <given-names>H.P.</given-names></string-name>, <string-name><surname>Karjigi</surname>, <given-names>V.</given-names></string-name> (<year>2014</year>). <chapter-title>Sensitive keyword spotting for crime analysis</chapter-title>. In: <source>2014 IEEE National Conference on Communication, Signal Processing and Networking (NCCSN)</source>, pp. <fpage>1</fpage>–<lpage>6</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_030">
<mixed-citation publication-type="journal"><string-name><surname>Keshet</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Grangier</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>S.</given-names></string-name> (<year>2009</year>). <article-title>Discriminative keyword spotting</article-title>. <source>Speech Communication</source>, <volume>51</volume>(<issue>4</issue>), <fpage>317</fpage>–<lpage>329</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_031">
<mixed-citation publication-type="journal"><string-name><surname>Këpuska</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Klein</surname>, <given-names>T.</given-names></string-name> (<year>2009</year>). <article-title>A novel Wake-Up-Word speech recognition system, Wake-Up-Word recognition task, technology and evaluation</article-title>. <source>Nonlinear Analysis: Theory, Methods &amp; Applications</source>, <volume>71</volume>(<issue>12</issue>), <fpage>e2772</fpage>–<lpage>e2789</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_032">
<mixed-citation publication-type="chapter"><string-name><surname>Kühne</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Wolff</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Eichner</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Hoffmann</surname>, <given-names>R.</given-names></string-name> (<year>2004</year>). <chapter-title>Voice activation using prosodic features</chapter-title>. In: <source>INTERSPEECH 2004 – ICSLP, 8th International Conference on Spoken Language Processing</source>, <conf-loc>Jeju Island, Korea, October 4–8, 2004</conf-loc>, <publisher-name>ISCA</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_033">
<mixed-citation publication-type="chapter"><string-name><surname>Klemm</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Class</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Kilian</surname>, <given-names>U.</given-names></string-name> (<year>1995</year>). <chapter-title>Word- and phrase spotting with syllable-based garbage modelling</chapter-title>. In: <source>Fourth European Conference on Speech Communication and Technology, EUROSPEECH 1995</source>, <conf-loc>Madrid, Spain, September 18–21, 1995</conf-loc>, <publisher-name>ISCA</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_034">
<mixed-citation publication-type="chapter"><string-name><surname>Knill</surname>, <given-names>K.M.</given-names></string-name>, <string-name><surname>Young</surname>, <given-names>S.J.</given-names></string-name> (<year>1996</year>). <chapter-title>Fast implementation methods for Viterbi-based word-spotting</chapter-title>. In: <source>1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings</source>, Vol. 1, pp. <fpage>522</fpage>–<lpage>525</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_035">
<mixed-citation publication-type="chapter"><string-name><surname>Kosonocky</surname>, <given-names>S.V.</given-names></string-name>, <string-name><surname>Mammone</surname>, <given-names>R.J.</given-names></string-name> (<year>1995</year>). <chapter-title>A continuous density neural tree network word spotting system</chapter-title>. In: <source>1995 International Conference on Acoustics, Speech, and Signal Processing</source>, Vol. 1, pp. <fpage>305</fpage>–<lpage>308</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_036">
<mixed-citation publication-type="chapter"><string-name><surname>Kumatani</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Panchapagesan</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Kim</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Strom</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Tiwari</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Mandal</surname>, <given-names>A.</given-names></string-name> (<year>2017</year>). <chapter-title>Direct modeling of raw audio with DNNs for wake word detection</chapter-title>. In: <source>2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017</source>, <conf-loc>Okinawa, Japan, December 16–20, 2017</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>252</fpage>–<lpage>257</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_037">
<mixed-citation publication-type="chapter"><string-name><surname>Kurniawati</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Celetto</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Capovilla</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>George</surname>, <given-names>S.</given-names></string-name> (<year>2012</year>). <chapter-title>Personalized voice command systems in multi modal user interface</chapter-title>. In: <source>2012 IEEE International Conference on Emerging Signal Processing Applications, ESPA 2012</source>, <conf-loc>Las Vegas, NV, USA, January 12–14, 2012</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>45</fpage>–<lpage>47</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_038">
<mixed-citation publication-type="chapter"><string-name><surname>Laszko</surname>, <given-names>L.</given-names></string-name> (<year>2016</year>). <chapter-title>Using formant frequencies to word detection in recorded speech</chapter-title>. In: <string-name><surname>Ganzha</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Maciaszek</surname>, <given-names>L.A.</given-names></string-name>, <string-name><surname>Paprzycki</surname>, <given-names>M.</given-names></string-name> (Eds.), <source>Proceedings of the 2016 Federated Conference on Computer Science and Information Systems, FedCSIS 2016</source>, <conf-loc>Gdańsk, Poland, September 11–14, 2016</conf-loc>, pp. <fpage>797</fpage>–<lpage>801</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_039">
<mixed-citation publication-type="other"><string-name><surname>Lehtonen</surname>, <given-names>M.</given-names></string-name> (2005). <italic>Hierarchical approach for spotting keywords</italic>. Idiap-RR-41-2005, IDIAP.</mixed-citation>
</ref>
<ref id="j_infor398_ref_040">
<mixed-citation publication-type="other"><string-name><surname>Lengerich</surname>, <given-names>C.T.</given-names></string-name>, <string-name><surname>Hannun</surname>, <given-names>A.Y.</given-names></string-name> (2016). An End-to-End Architecture for Keyword Spotting and Voice Activity Detection. <italic>CoRR</italic>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1611.09405">abs/1611.09405</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_041">
<mixed-citation publication-type="chapter"><string-name><surname>Leow</surname>, <given-names>S.J.</given-names></string-name>, <string-name><surname>Lau</surname>, <given-names>T.S.</given-names></string-name>, <string-name><surname>Goh</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Peh</surname>, <given-names>H.M.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>T.K.</given-names></string-name>, <string-name><surname>Siniscalchi</surname>, <given-names>S.M.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>C.</given-names></string-name> (<year>2012</year>). <chapter-title>A new confidence measure combining hidden Markov models and artificial neural networks of phonemes for effective keyword spotting</chapter-title>. In: <source>8th International Symposium on Chinese Spoken Language Processing, ISCSLP 2012</source>, <conf-loc>Kowloon Tong, China, December 5–8, 2012</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>112</fpage>–<lpage>116</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_042">
<mixed-citation publication-type="chapter"><string-name><surname>Li</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>L.</given-names></string-name> (<year>2014</year>). <chapter-title>A novel coding scheme for keyword spotting</chapter-title>. In: <source>2014 Seventh International Symposium on Computational Intelligence and Design</source>, Vol. <volume>2</volume>, pp. <fpage>379</fpage>–<lpage>382</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_043">
<mixed-citation publication-type="journal"><string-name><surname>Liu</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Chiu</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Chang</surname>, <given-names>H.</given-names></string-name> (<year>2000</year>). <article-title>Design of vocabulary-independent Mandarin keyword spotters</article-title>. <source>IEEE Transactions on Speech and Audio Processing</source>, <volume>8</volume>(<issue>4</issue>), <fpage>483</fpage>–<lpage>487</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_044">
<mixed-citation publication-type="chapter"><string-name><surname>Manor</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Greenberg</surname>, <given-names>S.</given-names></string-name> (<year>2017</year>). <chapter-title>Voice trigger system using fuzzy logic</chapter-title>. In: <source>2017 International Conference on Circuits, System and Simulation (ICCSS)</source>, pp. <fpage>113</fpage>–<lpage>118</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_045">
<mixed-citation publication-type="chapter"><string-name><surname>Marcus</surname>, <given-names>J.N.</given-names></string-name> (<year>1992</year>). <chapter-title>A novel algorithm for HMM word spotting performance evaluation and error analysis</chapter-title>. In: <source>[Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing</source>, Vol. <volume>2</volume>, pp. <fpage>89</fpage>–<lpage>92</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_046">
<mixed-citation publication-type="book"><string-name><surname>Morgan</surname>, <given-names>D.P.</given-names></string-name>, <string-name><surname>Scofield</surname>, <given-names>C.L.</given-names></string-name> (<year>1991</year>). <source>Neural Networks and Speech Processing</source>. <publisher-name>Springer US</publisher-name>, <publisher-loc>Boston, MA</publisher-loc>, pp. <fpage>329</fpage>–<lpage>348</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_047">
<mixed-citation publication-type="chapter"><string-name><surname>Morgan</surname>, <given-names>D.P.</given-names></string-name>, <string-name><surname>Scofield</surname>, <given-names>C.L.</given-names></string-name>, <string-name><surname>Lorenzo</surname>, <given-names>T.M.</given-names></string-name>, <string-name><surname>Real</surname>, <given-names>E.C.</given-names></string-name>, <string-name><surname>Loconto</surname>, <given-names>D.P.</given-names></string-name> (<year>1990</year>). <chapter-title>A keyword spotter which incorporates neural networks for secondary processing</chapter-title>. In: <source>International Conference on Acoustics, Speech, and Signal Processing</source>, Vol. <volume>1</volume>, pp. <fpage>113</fpage>–<lpage>116</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_048">
<mixed-citation publication-type="chapter"><string-name><surname>Morgan</surname>, <given-names>D.P.</given-names></string-name>, <string-name><surname>Scofield</surname>, <given-names>C.L.</given-names></string-name>, <string-name><surname>Adcock</surname>, <given-names>J.E.</given-names></string-name> (<year>1991</year>). <chapter-title>Multiple neural network topologies applied to keyword spotting</chapter-title>. In: <source>[Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing</source>, Vol. <volume>1</volume>, pp. <fpage>313</fpage>–<lpage>316</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_049">
<mixed-citation publication-type="chapter"><string-name><surname>Myer</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Tomar</surname>, <given-names>V.S.</given-names></string-name> (<year>2018</year>). <chapter-title>Efficient keyword spotting using time delay neural networks</chapter-title>. In: <string-name><surname>Yegnanarayana</surname>, <given-names>B.</given-names></string-name> (Ed.), <source>Interspeech 2018, 19th Annual Conference of the International Speech Communication Association</source>, <conf-loc>Hyderabad, India, 2–6 September 2018</conf-loc>. <publisher-name>ISCA</publisher-name>, pp. <fpage>1264</fpage>–<lpage>1268</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_050">
<mixed-citation publication-type="chapter"><string-name><surname>Naylor</surname>, <given-names>J.A.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>W.Y.</given-names></string-name>, <string-name><surname>Nguyen</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>K.P.</given-names></string-name> (<year>1992</year>). <chapter-title>The application of neural networks to wordspotting</chapter-title>. In: <source>1992 Conference Record of the Twenty-Sixth Asilomar Conference on Signals, Systems Computers</source>, Vol. <volume>2</volume>, pp. <fpage>1081</fpage>–<lpage>1085</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_051">
<mixed-citation publication-type="other"><string-name><surname>Oord</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Vinyals</surname>, <given-names>O.</given-names></string-name> (2018). Representation learning with contrastive predictive coding. <italic>CoRR</italic>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1807.03748">abs/1807.03748</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_052">
<mixed-citation publication-type="chapter"><string-name><surname>Povey</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Ghoshal</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Boulianne</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Burget</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Glembek</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Goel</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Hannemann</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Motlicek</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Qian</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Schwarz</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Silovsky</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Stemmer</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Vesely</surname>, <given-names>K.</given-names></string-name> (<year>2011</year>). <chapter-title>The Kaldi speech recognition toolkit</chapter-title>. In: <source>IEEE 2011 Workshop on Automatic Speech Recognition and Understanding</source>. <publisher-name>IEEE Signal Processing Society</publisher-name>, <comment>IEEE Catalog No.: CFP11SRW-USB</comment>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_053">
<mixed-citation publication-type="other"><string-name><surname>Alvarez</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Park</surname>, <given-names>H.-J.</given-names></string-name> (2018). End-to-End Streaming Keyword Spotting. <italic>CoRR</italic>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1812.02802">abs/1812.02802</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_054">
<mixed-citation publication-type="chapter"><string-name><surname>Rohlicek</surname>, <given-names>J.R.</given-names></string-name>, <string-name><surname>Russell</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Roukos</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Gish</surname>, <given-names>H.</given-names></string-name> (<year>1989</year>). <chapter-title>Continuous hidden Markov modeling for speaker-independent word spotting</chapter-title>. In: <source>International Conference on Acoustics, Speech, and Signal Processing</source>, Vol. <volume>1</volume>, pp. <fpage>627</fpage>–<lpage>630</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_055">
<mixed-citation publication-type="chapter"><string-name><surname>Rohlicek</surname>, <given-names>J.R.</given-names></string-name>, <string-name><surname>Jeanrenaud</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Gish</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Musicus</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Siu</surname>, <given-names>M.</given-names></string-name> (<year>1993</year>). <chapter-title>Phonetic training and language modeling for word spotting</chapter-title>. In: <source>1993 IEEE International Conference on Acoustics, Speech, and Signal Processing</source>, Vol. <volume>2</volume>, pp. <fpage>459</fpage>–<lpage>462</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_056">
<mixed-citation publication-type="chapter"><string-name><surname>Rose</surname>, <given-names>R.C.</given-names></string-name>, <string-name><surname>Paul</surname>, <given-names>D.B.</given-names></string-name> (<year>1990</year>). <chapter-title>A hidden Markov model based keyword recognition system</chapter-title>. In: <source>International Conference on Acoustics, Speech, and Signal Processing</source>, Vol. <volume>1</volume>, pp. <fpage>129</fpage>–<lpage>132</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_057">
<mixed-citation publication-type="chapter"><string-name><surname>Sadhu</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Ghosh</surname>, <given-names>P.K.</given-names></string-name> (<year>2017</year>). <chapter-title>Low resource point process models for keyword spotting using unsupervised online learning</chapter-title>. In: <source>25th European Signal Processing Conference, EUSIPCO 2017</source>, <conf-loc>Kos, Greece, August 28–September 2, 2017</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>538</fpage>–<lpage>542</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_058">
<mixed-citation publication-type="journal"><string-name><surname>Sangeetha</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Jothilakshmi</surname>, <given-names>S.</given-names></string-name> (<year>2014</year>). <article-title>A novel spoken keyword spotting system using support vector machine</article-title>. <source>Engineering Applications of Artificial Intelligence</source>, <volume>36</volume>, <fpage>287</fpage>–<lpage>293</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_059">
<mixed-citation publication-type="chapter"><string-name><surname>Shan</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Xie</surname>, <given-names>L.</given-names></string-name> (<year>2018</year>). <chapter-title>Attention-based end-to-end models for small-footprint keyword spotting</chapter-title>. In: <source>Proc. Interspeech 2018</source>, pp. <fpage>2037</fpage>–<lpage>2041</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.21437/Interspeech.2018-1777" xlink:type="simple">https://doi.org/10.21437/Interspeech.2018-1777</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_060">
<mixed-citation publication-type="chapter"><string-name><surname>Shokri</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Tabibian</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Akbari</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Nasersharif</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Kabudian</surname>, <given-names>J.</given-names></string-name> (<year>2011</year>). <chapter-title>A robust keyword spotting system for Persian conversational telephone speech using feature and score normalization and ARMA filter</chapter-title>. In: <source>2011 IEEE GCC Conference and Exhibition (GCC)</source>, pp. <fpage>497</fpage>–<lpage>500</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_061">
<mixed-citation publication-type="chapter"><string-name><surname>Shokri</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Davarpour</surname>, <given-names>M.H.</given-names></string-name>, <string-name><surname>Akbari</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Nasersharif</surname>, <given-names>B.</given-names></string-name> (<year>2013</year>). <chapter-title>Detecting keywords in Persian conversational telephony speech using a discriminative English keyword spotter</chapter-title>. In: <source>IEEE International Symposium on Signal Processing and Information Technology</source>, <conf-loc>Athens, Greece, December 12–15, 2013</conf-loc>. <publisher-name>IEEE Computer Society</publisher-name>, pp. <fpage>272</fpage>–<lpage>276</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_062">
<mixed-citation publication-type="chapter"><string-name><surname>Shokri</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Davarpour</surname>, <given-names>M.H.</given-names></string-name>, <string-name><surname>Akbari</surname>, <given-names>A.</given-names></string-name> (<year>2014</year>). <chapter-title>Improving keyword detection rate using a set of rules to merge HMM-based and SVM-based keyword spotting results</chapter-title>. In: <source>2014 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2014</source>, <conf-loc>Delhi, India, September 24–27, 2014</conf-loc>, <publisher-name>IEEE</publisher-name>, pp. <fpage>1715</fpage>–<lpage>1718</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_063">
<mixed-citation publication-type="chapter"><string-name><surname>Silaghi</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Vargiya</surname>, <given-names>R.</given-names></string-name> (<year>2005</year>). <chapter-title>A new evaluation criteria for keyword spotting techniques and a new algorithm</chapter-title>. In: <source>INTERSPEECH 2005 – Eurospeech, 9th European Conference on Speech Communication and Technology</source>, <conf-loc>Lisbon, Portugal, September 4–8, 2005</conf-loc>. <publisher-name>ISCA</publisher-name>, pp. <fpage>1593</fpage>–<lpage>1596</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_064">
<mixed-citation publication-type="chapter"><string-name><surname>Siu</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Gish</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Rohlicek</surname>, <given-names>J.R.</given-names></string-name> (<year>1994</year>). <chapter-title>Predicting word spotting performance</chapter-title>. In: <source>The 3rd International Conference on Spoken Language Processing, ICSLP 1994</source>, <conf-loc>Yokohama, Japan, September 18–22, 1994</conf-loc>, <publisher-name>ISCA</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_065">
<mixed-citation publication-type="chapter"><string-name><surname>Sun</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Snyder</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Gao</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Nagaraja</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Rodehorst</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Panchapagesan</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Strom</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Matsoukas</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Vitaladevuni</surname>, <given-names>S.</given-names></string-name> (<year>2017</year>). <chapter-title>Compressed time delay neural network for small-footprint keyword spotting</chapter-title>. In: <string-name><surname>Lacerda</surname>, <given-names>F.</given-names></string-name> (Ed.), <source>Interspeech 2017, 18th Annual Conference of the International Speech Communication Association</source>, <conf-loc>Stockholm, Sweden, August 20–24, 2017</conf-loc>. <publisher-name>ISCA</publisher-name>, pp. <fpage>3607</fpage>–<lpage>3611</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_066">
<mixed-citation publication-type="chapter"><string-name><surname>Szöke</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Schwarz</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Matejka</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Burget</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Karafiát</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Cernocký</surname>, <given-names>J.</given-names></string-name> (<year>2005</year>). <chapter-title>Phoneme based acoustics keyword spotting in informal continuous speech</chapter-title>. In: <string-name><surname>Matousek</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Mautner</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Pavelka</surname>, <given-names>T.</given-names></string-name> (Eds.), <source>Text, Speech and Dialogue, 8th International Conference, TSD 2005</source>, <conf-loc>Karlovy Vary, Czech Republic</conf-loc>, <conf-date>September 12–15, 2005. Proceedings</conf-date>, <series>Lecture Notes in Computer Science</series>, Vol. <volume>3658</volume>. <publisher-name>Springer</publisher-name>, pp. <fpage>302</fpage>–<lpage>309</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_067">
<mixed-citation publication-type="chapter"><string-name><surname>Szöke</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Grézl</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Cernocký</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Fapso</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Cipr</surname>, <given-names>T.</given-names></string-name> (<year>2010</year>). <chapter-title>Acoustic keyword spotter - optimization from end-user perspective</chapter-title>. In: <string-name><surname>Hakkani-Tür</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Ostendorf</surname>, <given-names>M.</given-names></string-name> (Eds.), <source>IEEE Spoken Language Technology Workshop, SLT 2010</source>, <conf-loc>Berkeley, California, USA, December 12–15, 2010</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>189</fpage>–<lpage>193</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_068">
<mixed-citation publication-type="chapter"><string-name><surname>Szöke</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Skácel</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Burget</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Cernocký</surname>, <given-names>J.</given-names></string-name> (<year>2015</year>). <chapter-title>Coping with channel mismatch in Query-by-Example – but QUESST</chapter-title>. In: <source>2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015</source>, <conf-loc>South Brisbane, Queensland, Australia, April 19–24, 2015</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>5838</fpage>–<lpage>5842</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_069">
<mixed-citation publication-type="journal"><string-name><surname>Tabibian</surname>, <given-names>S.</given-names></string-name> (<year>2017</year>). <article-title>A voice command detection system for aerospace applications</article-title>. <source>International Journal of Speech Technology</source>, <volume>20</volume>(<issue>4</issue>), <fpage>1049</fpage>–<lpage>1061</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_070">
<mixed-citation publication-type="chapter"><string-name><surname>Tabibian</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Akbari</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Nasersharif</surname>, <given-names>B.</given-names></string-name> (<year>2011</year>). <chapter-title>An evolutionary based discriminative system for keyword spotting</chapter-title>. In: <source>2011 International Symposium on Artificial Intelligence and Signal Processing (AISP)</source>, pp. <fpage>83</fpage>–<lpage>88</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_071">
<mixed-citation publication-type="journal"><string-name><surname>Tabibian</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Akbari</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Nasersharif</surname>, <given-names>B.</given-names></string-name> (<year>2013</year>). <article-title>Keyword spotting using an evolutionary-based classifier and discriminative features</article-title>. <source>Engineering Applications of Artificial Intelligence</source>, <volume>26</volume>(<issue>7</issue>), <fpage>1660</fpage>–<lpage>1670</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_072">
<mixed-citation publication-type="journal"><string-name><surname>Tabibian</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Akbari</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Nasersharif</surname>, <given-names>B.</given-names></string-name> (<year>2014</year>). <article-title>Extension of a kernel-based classifier for discriminative spoken keyword spotting</article-title>. <source>Neural Processing Letters</source>, <volume>39</volume>(<issue>2</issue>), <fpage>195</fpage>–<lpage>218</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_073">
<mixed-citation publication-type="journal"><string-name><surname>Tabibian</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Akbari</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Nasersharif</surname>, <given-names>B.</given-names></string-name> (<year>2016</year>). <article-title>A fast hierarchical search algorithm for discriminative keyword spotting</article-title>. <source>Information Sciences</source>, <volume>336</volume>, <fpage>45</fpage>–<lpage>59</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_074">
<mixed-citation publication-type="journal"><string-name><surname>Tabibian</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Akbari</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Nasersharif</surname>, <given-names>B.</given-names></string-name> (<year>2018</year>). <article-title>Discriminative keyword spotting using triphones information and N-best search</article-title>. <source>Information Sciences</source>, <volume>423</volume>, <fpage>157</fpage>–<lpage>171</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_075">
<mixed-citation publication-type="chapter"><string-name><surname>Vasilache</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Vasilache</surname>, <given-names>A.</given-names></string-name> (<year>2009</year>). <chapter-title>Keyword spotting with duration constrained HMMs</chapter-title>. In: <source>17th European Signal Processing Conference, EUSIPCO 2009</source>, <conf-loc>Glasgow, Scotland, UK, August 24–28, 2009</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>1230</fpage>–<lpage>1234</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_076">
<mixed-citation publication-type="chapter"><string-name><surname>Vroomen</surname>, <given-names>L.C.</given-names></string-name>, <string-name><surname>Normandin</surname>, <given-names>Y.</given-names></string-name> (<year>1992</year>). <chapter-title>Robust speaker-independent hidden Markov model based word spotter</chapter-title>. In: <string-name><surname>Laface</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>De Mori</surname>, <given-names>R.</given-names></string-name> (Eds.), <source>Speech Recognition and Understanding</source>. <publisher-name>Springer Berlin Heidelberg</publisher-name>, <publisher-loc>Berlin, Heidelberg</publisher-loc>, pp. <fpage>95</fpage>–<lpage>100</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_077">
<mixed-citation publication-type="other"><string-name><surname>Warden</surname>, <given-names>P.</given-names></string-name> (<year>2018</year>). <article-title>Speech commands: a dataset for limited-vocabulary speech recognition</article-title>. <source>CoRR</source>. <comment> <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1804.03209">abs/1804.03209</ext-link></comment>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_078">
<mixed-citation publication-type="chapter"><string-name><surname>Wilcox</surname>, <given-names>L.D.</given-names></string-name>, <string-name><surname>Bush</surname>, <given-names>M.A.</given-names></string-name> (<year>1992</year>). <chapter-title>Training and search algorithms for an interactive wordspotting system</chapter-title>. In: <source>[Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing</source>, Vol. <volume>2</volume>, pp. <fpage>97</fpage>–<lpage>100</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_079">
<mixed-citation publication-type="chapter"><string-name><surname>Wöllmer</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Eyben</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Graves</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Schuller</surname>, <given-names>B.W.</given-names></string-name>, <string-name><surname>Rigoll</surname>, <given-names>G.</given-names></string-name> (<year>2009</year>a). <chapter-title>Improving keyword spotting with a tandem BLSTM-DBN architecture</chapter-title>. In: <string-name><surname>Solé-Casals</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Zaiats</surname>, <given-names>V.</given-names></string-name> (Eds.), <source>Advances in Nonlinear Speech Processing, International Conference on Nonlinear Speech Processing, NOLISP 2009</source>, <conf-loc>Vic, Spain, June 25–27, 2009. Revised Selected Papers</conf-loc>, <series>Lecture Notes in Computer Science</series>, Vol. <volume>5933</volume>. <publisher-name>Springer</publisher-name>, pp. <fpage>68</fpage>–<lpage>75</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_080">
<mixed-citation publication-type="chapter"><string-name><surname>Wöllmer</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Eyben</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Keshet</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Graves</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Schuller</surname>, <given-names>B.W.</given-names></string-name>, <string-name><surname>Rigoll</surname>, <given-names>G.</given-names></string-name> (<year>2009</year>b). <chapter-title>Robust discriminative keyword spotting for emotionally colored spontaneous speech using bidirectional LSTM networks</chapter-title>. In: <source>Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2009</source>, <conf-loc>19–24 April 2009, Taipei, Taiwan</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>3949</fpage>–<lpage>3952</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_081">
<mixed-citation publication-type="journal"><string-name><surname>Wöllmer</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Schuller</surname>, <given-names>B.W.</given-names></string-name>, <string-name><surname>Rigoll</surname>, <given-names>G.</given-names></string-name> (<year>2013</year>). <article-title>Keyword spotting exploiting long short-term memory</article-title>. <source>Speech Communication</source>, <volume>55</volume>(<issue>2</issue>), <fpage>252</fpage>–<lpage>265</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_082">
<mixed-citation publication-type="chapter"><string-name><surname>Wu</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Panchapagesan</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Sun</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Gu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Thomas</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Vitaladevuni</surname>, <given-names>S.N.P.</given-names></string-name>, <string-name><surname>Hoffmeister</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Mandal</surname>, <given-names>A.</given-names></string-name> (<year>2018</year>). <chapter-title>Monophone-based background modeling for two-stage on-device wake word detection</chapter-title>. In: <source>2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018</source>, <conf-loc>Calgary, AB, Canada, April 15–20, 2018</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>5494</fpage>–<lpage>5498</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_083">
<mixed-citation publication-type="book"><string-name><surname>Yu</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Deng</surname>, <given-names>L.</given-names></string-name> (<year>2014</year>). <source>Automatic Speech Recognition: A Deep Learning Approach</source>. <publisher-name>Springer</publisher-name>, <publisher-loc>London</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_084">
<mixed-citation publication-type="chapter"><string-name><surname>Zehetner</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Hagmüller</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Pernkopf</surname>, <given-names>F.</given-names></string-name> (<year>2014</year>). <chapter-title>Wake-up-word spotting for mobile systems</chapter-title>. In: <source>22nd European Signal Processing Conference, EUSIPCO 2014</source>, <conf-loc>Lisbon, Portugal, September 1–5, 2014</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>1472</fpage>–<lpage>1476</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_085">
<mixed-citation publication-type="chapter"><string-name><surname>Zeppenfeld</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Waibel</surname>, <given-names>A.H.</given-names></string-name> (<year>1992</year>). <chapter-title>A hybrid neural network, dynamic programming word spotter</chapter-title>. In: <source>ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing</source>, Vol. <volume>2</volume>, pp. <fpage>77</fpage>–<lpage>80</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_086">
<mixed-citation publication-type="chapter"><string-name><surname>Zhang</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Qin</surname>, <given-names>Y.</given-names></string-name> (<year>2016</year>). <chapter-title>Wake-up-word spotting using end-to-end deep neural network system</chapter-title>. In: <source>23rd International Conference on Pattern Recognition, ICPR 2016</source>, <conf-loc>Cancún, Mexico, December 4–8, 2016</conf-loc>. <publisher-name>IEEE</publisher-name>, pp. <fpage>2878</fpage>–<lpage>2883</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_087">
<mixed-citation publication-type="journal"><string-name><surname>Zheng</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Xu</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Mou</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Fang</surname>, <given-names>D.</given-names></string-name> (<year>1999</year>). <article-title>HarkMan – a vocabulary-independent keyword spotter for spontaneous Chinese speech</article-title>. <source>Journal of Computer Science and Technology</source>, <volume>14</volume>(<issue>1</issue>), <fpage>18</fpage>–<lpage>26</lpage>.</mixed-citation>
</ref>
<ref id="j_infor398_ref_088">
<mixed-citation publication-type="chapter"><string-name><surname>Zhu</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Kong</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>Zhou</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Xiong</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Zhu</surname>, <given-names>F.</given-names></string-name> (<year>2013</year>). <chapter-title>Sensitive keyword spotting for voice alarm systems</chapter-title>. In: <source>Proceedings of 2013 IEEE International Conference on Service Operations and Logistics, and Informatics</source>, pp. <fpage>350</fpage>–<lpage>353</lpage>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>