<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">INFORMATICA</journal-id>
<journal-title-group><journal-title>Informatica</journal-title></journal-title-group>
<issn pub-type="epub">1822-8844</issn><issn pub-type="ppub">0868-4952</issn><issn-l>0868-4952</issn-l>
<publisher>
<publisher-name>Vilnius University</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">INFOR620</article-id>
<article-id pub-id-type="doi">10.15388/26-INFOR620</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Research Article</subject></subj-group></article-categories>
<title-group>
<article-title>A Case Study of Artificial Neural Network Compression Methods for Resource-Constrained Multi-Label Classification</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-5425-7440</contrib-id>
<name><surname>Hołda</surname><given-names>Przemysław</given-names></name><email xlink:href="przemyslaw.holda@ibspan.waw.pl">przemyslaw.holda@ibspan.waw.pl</email><xref ref-type="aff" rid="j_infor620_aff_001"/><xref ref-type="corresp" rid="cor1">∗</xref><bio>
<p><bold>P. Hołda</bold> is a researcher at the Systems Research Institute of the Polish Academy of Sciences. He received his BSE and MSE degrees in computer science from Warsaw University of Technology. His primary research interests are modern artificial intelligence methods and agent systems.</p></bio>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-3763-2373</contrib-id>
<name><surname>Wasielewska-Michniewska</surname><given-names>Katarzyna</given-names></name><email xlink:href="katarzyna.wasielewska@ibspan.waw.pl">katarzyna.wasielewska@ibspan.waw.pl</email><xref ref-type="aff" rid="j_infor620_aff_001"/><bio>
<p><bold>K. Wasielewska-Michniewska</bold> is an assistant professor at the Systems Research Institute of the Polish Academy of Sciences. She received her MSE in computer science from Warsaw University of Technology and her PhD in computer science from the Polish Academy of Sciences. Her primary research interests are semantic technology and artificial intelligence.</p></bio>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-7714-4844</contrib-id>
<name><surname>Ganzha</surname><given-names>Maria</given-names></name><email xlink:href="maria.ganzha@ibspan.waw.pl">maria.ganzha@ibspan.waw.pl</email><xref ref-type="aff" rid="j_infor620_aff_001"/><bio>
<p><bold>M. Ganzha</bold> is an associate professor at Warsaw University of Technology. She received her MA and PhD in mathematics from Moscow University and her DSc in computer science from the Polish Academy of Sciences. Her primary research interests are agent-based technology, semantic technology, and machine learning.</p></bio>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-8069-2152</contrib-id>
<name><surname>Paprzycki</surname><given-names>Marcin</given-names></name><email xlink:href="marcin.paprzycki@ibspan.waw.pl">marcin.paprzycki@ibspan.waw.pl</email><xref ref-type="aff" rid="j_infor620_aff_001"/><bio>
<p><bold>M. Paprzycki</bold> is an associate professor at the Systems Research Institute of the Polish Academy of Sciences. He received his MS in mathematics from Adam Mickiewicz University, his PhD in mathematics from Southern Methodist University, and his DSc in mathematics from the Bulgarian Academy of Sciences. His primary research interests are parallel and distributed computing, Internet of Things, semantic technology, and agent systems.</p></bio>
</contrib>
<aff id="j_infor620_aff_001"><institution>Systems Research Institute, Polish Academy of Sciences</institution>, Warsaw, <country>Poland</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2026</year></pub-date><pub-date pub-type="epub"><day>11</day><month>2</month><year>2026</year></pub-date><volume>37</volume><issue>1</issue><fpage>87</fpage><lpage>107</lpage><history><date date-type="received"><month>7</month><year>2025</year></date><date date-type="accepted"><month>2</month><year>2026</year></date></history>
<permissions><copyright-statement>© 2026 Vilnius University</copyright-statement><copyright-year>2026</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>The proliferation of wearable healthcare devices has created a need to run artificial intelligence applications directly on these resource-constrained devices. Bringing computation closer to the data sources enables faster, localized decision-making and improves responsiveness and privacy. This contribution presents the results of an experimental evaluation of artificial neural network compression techniques, including quantization, structured pruning, and knowledge distillation, applied to multi-label classification of electrocardiogram (ECG) signals. The experiments were carried out on the PTB-XL dataset using three deep learning models, i.e. an LSTM-based recurrent neural network, a 1D convolutional neural network, and a 1D residual neural network. The results show how the compression methods impact model quality and highlight opportunities to reduce model size and accelerate inference, thereby enabling effective deployment on resource-constrained edge devices.</p>
</abstract>
<kwd-group>
<label>Key words</label>
<kwd>artificial neural networks</kwd>
<kwd>deep learning</kwd>
<kwd>quantization</kwd>
<kwd>structured pruning</kwd>
<kwd>knowledge distillation</kwd>
<kwd>ECG</kwd>
</kwd-group>
<funding-group><funding-statement>Work of Przemysław Hołda and Katarzyna Wasielewska-Michniewska was funded by the European Commission under the Horizon Europe project aerOS, grant No. 101069732.</funding-statement></funding-group>
</article-meta>
</front>
<body>
<sec id="j_infor620_s_001">
<label>1</label>
<title>Introduction</title>
<p>Resource-constrained environments refer to settings where limitations exist in computational power, memory, storage, or energy availability (Selvan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_032">2023</xref>). Examples include edge devices, such as wearable devices, IoT sensors, or embedded systems that require computational efficiency due to their restricted hardware capabilities or operational constraints, such as battery life or real-time processing needs. Constrained devices are common in mobile health applications, such as monitoring blood pressure and oxygen saturation, body temperature, breathing disorders, or heart conditions (Li <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_024">2024</xref>).</p>
<p>At the same time, current artificial intelligence applications are often based on artificial neural networks (NNs) and require significant resources. Therefore, they may not be deployable in such environments. Model compression techniques attempt to address these issues and may enable intelligent applications to run on mobile devices with limited resources.</p>
<p>In this contribution, electrocardiogram (ECG) signal analysis was used to study the effects of model compression. Electrocardiography (Mirvis and Goldberger, <xref ref-type="bibr" rid="j_infor620_ref_030">2001</xref>) is a fundamental non-invasive method of monitoring cardiac function. By placing electrodes on the patient’s skin, the ECG device captures the electrical signals of the heart that coordinate its behaviour during the cardiac cycle. ECG signals can be used, for example, to diagnose cardiovascular disease, detect sleep apnea, monitor heart rhythm, or to identify biometrics (Berkaya <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_001">2018</xref>). Currently, there is an increase in the popularity of portable devices (such as smartwatches or chest bands) that are capable of capturing the ECG without access to medical facilities (Safdar <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_031">2024</xref>), which allows for continuous monitoring of individuals’ health status.</p>
<p>Traditionally, physicians analyse ECG signals by evaluating wave patterns for abnormalities. Today, computational techniques, such as deep learning models, can be used to support diagnosis (Merdjanovska and Rashkovska, <xref ref-type="bibr" rid="j_infor620_ref_029">2022</xref>). However, due to limited processing power and battery life, portable ECG devices often rely on data transfer and cloud-based infrastructures for data processing (Khan Mamun and Elfouly, <xref ref-type="bibr" rid="j_infor620_ref_020">2023</xref>). Therefore, a key challenge is to adapt computationally demanding models to run on tiny devices (Khan Mamun and Elfouly, <xref ref-type="bibr" rid="j_infor620_ref_020">2023</xref>). Here, the benefits include: (i) improved data privacy, (ii) real-time inference on the device, and (iii) reliable access to diagnostic tools in areas of low connectivity (Hohman <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_017">2024</xref>). In this context, the potential of model compression techniques, namely quantization (Gholami <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_011">2022</xref>), structured pruning (Cheng <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_005">2024</xref>), and knowledge distillation (Hinton <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_015">2015</xref>; Gou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_013">2021</xref>), can be explored to assess their effectiveness in reducing the computational demands of NN models.</p>
<p>Before proceeding, let us make several conceptual remarks. Although modern mobile devices are equipped with increasingly powerful hardware, this progress is counterbalanced by the growing size of NN models, which are trained on larger datasets and require more resources for inference. Moreover, the usable computational power of such devices is further constrained, e.g. by limited battery capacity, a limitation that is not easy to overcome. Therefore, the focus of this contribution is to assess the effects of model compression techniques applied to popular NN architectures and to analyse the results that can be achieved. In doing so, it lays the foundation for further focused research. Separately, note that the ECG dataset serves as “sample data” that needs to be processed on resource-constrained devices, rather than being the focal point of this research.</p>
<p>Taking this into account, the key contributions of this work are as follows. (I) Three well-known deep learning model compression techniques were implemented, namely integer quantization, structured pruning, and response-based knowledge distillation suitable for multi-label classification. (II) The methods were applied to three popular NN architectures commonly used for ECG analysis: a 1D CNN (convolutional neural network) (Lecun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_022">1998</xref>), an LSTM-based RNN (recurrent neural network) (Elman, <xref ref-type="bibr" rid="j_infor620_ref_008">1990</xref>; Hochreiter and Schmidhuber, <xref ref-type="bibr" rid="j_infor620_ref_016">1997</xref>), and a 1D ResNet (residual neural network) (He <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_014">2016</xref>; Wang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_036">2017</xref>), to evaluate how the selected compression techniques improve resource utilization (measured in terms of inference speed and disk usage) and impact the quality of the results (measured by the macro-averaged AUROC score). (III) The tests were performed using one of the largest open ECG datasets, PTB-XL (Wagner <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_035">2020</xref>).</p>
</sec>
<sec id="j_infor620_s_002">
<label>2</label>
<title>Related Work</title>
<p>Various NN compression techniques can be found in the literature (Li <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_027">2023</xref>; Dantas <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_006">2024</xref>). Their goal is to deliver a more compact (and more efficient) NN model that closely preserves the predictive quality of the original model. Quantization (Gholami <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_011">2022</xref>) reduces the precision of numbers that represent the model parameters, typically from 32-bit floating-point numbers to 8-bit integers. Although it may degrade model quality, it can lower the energy use and improve the inference speed (Hubara <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_018">2016</xref>). Pruning (Cheng <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_005">2024</xref>) removes “less important parts” of NNs, to decrease their size and computational load. Unstructured pruning discards individual weights scattered throughout a neural network and thus requires specialized hardware and software to translate the introduced sparsity into faster computation. On the other hand, structured pruning removes weights in a systematic way (i.e. it eliminates entire neurons, channels, filters, or layers), resulting in regular NNs, and is thus more broadly applicable, because no dedicated hardware is needed to obtain a speed-up. Knowledge distillation (Hinton <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_015">2015</xref>; Gou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_013">2021</xref>) transfers knowledge from a “teacher” model to a “student” model by training the student model to mimic the teacher’s output. The technique is typically used to create a student model smaller than its teacher.</p>
<p>For analysing ECG, deep learning models such as CNNs, ResNets, or RNNs are often used (Khan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_019">2023</xref>; Boulif <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_002">2024</xref>; Safdar <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_031">2024</xref>), but their computational demands can hinder deployment on portable or battery-powered devices. In the joint context of ECG analysis and NN compression, in (Lee <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_023">2022</xref>), techniques such as pruning, quantization, and weight clustering were used to compress, e.g. ResNet, which detected arrhythmia based on ECG data, reducing the model’s size by a factor of 10000 with an accuracy loss of approximately 1%. The authors of (Chang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_004">2022</xref>) applied pruning and quantization to a CNN model for atrial fibrillation detection, achieving a 91-fold compression with an accuracy loss of roughly 1%. In Sepahvand and Abdali-Mohammadi (<xref ref-type="bibr" rid="j_infor620_ref_033">2022</xref>), knowledge distillation was applied in an arrhythmia classification task, obtaining a model approximately 262 times smaller, while losing less than 1% of accuracy.</p>
<p>Extensive PTB-XL benchmark results are available in Strodthoff <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor620_ref_034">2020</xref>), where various NN architectures (including CNNs, ResNets, and RNNs) are compared in ECG tasks ranging from predicting diagnostic statements to determining age and gender. However, the benchmark does not consider resource usage, which is a key concern for on-device ECG monitoring and analysis.</p>
<p>In this work, we conduct an experimental study of multiple compression strategies applied to the same ECG dataset. We systematically quantify how structured pruning, quantization, and knowledge distillation affect not only classification effectiveness but also model size and inference speed, showing the typical trade-offs involved in deploying deep ECG models on resource-constrained platforms.</p>
</sec>
<sec id="j_infor620_s_003" sec-type="methods">
<label>3</label>
<title>Methodology</title>
<sec id="j_infor620_s_004">
<label>3.1</label>
<title>Dataset</title>
<p>To perform the experiments, the PTB-XL dataset (Wagner <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_035">2020</xref>) was selected. It contained 21837 10-second, 12-lead ECG recordings collected from 18885 patients. The recordings were available at sampling frequencies of 100 Hz and 500 Hz. They were supplemented with metadata, including identifiers, patient and measurement information, detailed diagnostic data, signal information, and recommended folds that split the data into training, validation, and test sets. Here, the diagnostic labels were divided into five categories: NORM (normal), MI (myocardial infarction), CD (conduction disturbances), HYP (hypertrophy), and STTC (ST-T changes). These labels could co-occur for a single recording.</p>
<p>For the experiments conducted in the presented research, the ECG time-series data were prepared for multi-label classification in the following way. Overall, standard approaches were followed (Berkaya <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_001">2018</xref>). Recordings at the sampling frequency of 100 Hz were used. A fifth-order Butterworth high-pass filter (Butterworth, <xref ref-type="bibr" rid="j_infor620_ref_003">1930</xref>) was used to remove low-frequency noise below 0.5 Hz. Power-line noise at 50 Hz was reduced using a bidirectional moving average filter. Each of the twelve leads was normalized separately. The training set was used to calculate the values needed for normalization, and then the remaining data were normalized using the computed statistics. The splits into training, validation, and test sets were performed using the recommended folds included in the dataset. Each recording was assigned a binary label vector <italic>y</italic> of length 5, reflecting its membership in the diagnostic categories, as shown in (<xref rid="j_infor620_eq_001">1</xref>). 
<disp-formula id="j_infor620_eq_001">
<label>(1)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mover>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mtext>NORM</mml:mtext>
</mml:mrow>
</mml:mover>
<mml:mspace width="0.2778em"/>
<mml:mspace width="0.2778em"/>
<mml:mspace width="0.2778em"/><mml:mover>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mtext>MI</mml:mtext>
</mml:mrow>
</mml:mover>
<mml:mspace width="0.2778em"/>
<mml:mspace width="0.2778em"/>
<mml:mspace width="0.2778em"/>
<mml:mspace width="0.2778em"/><mml:mover>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mtext>CD</mml:mtext>
</mml:mrow>
</mml:mover>
<mml:mspace width="0.2778em"/>
<mml:mspace width="0.2778em"/>
<mml:mspace width="0.2778em"/>
<mml:mspace width="0.2778em"/><mml:mover>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mtext>HYP</mml:mtext>
</mml:mrow>
</mml:mover>
<mml:mspace width="0.2778em"/>
<mml:mspace width="0.2778em"/>
<mml:mspace width="0.2778em"/>
<mml:mspace width="0.2778em"/><mml:mover>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mtext>STTC</mml:mtext>
</mml:mrow>
</mml:mover>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ y=(\stackrel{\text{NORM}}{0/1}\hspace{0.2778em}\hspace{0.2778em}\hspace{0.2778em}\stackrel{\text{MI}}{0/1}\hspace{0.2778em}\hspace{0.2778em}\hspace{0.2778em}\hspace{0.2778em}\stackrel{\text{CD}}{0/1}\hspace{0.2778em}\hspace{0.2778em}\hspace{0.2778em}\hspace{0.2778em}\stackrel{\text{HYP}}{0/1}\hspace{0.2778em}\hspace{0.2778em}\hspace{0.2778em}\hspace{0.2778em}\stackrel{\text{STTC}}{0/1})\]]]></tex-math></alternatives>
</disp-formula>
</p>
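<p>The label vector in (<xref rid="j_infor620_eq_001">1</xref>) can be illustrated with a minimal sketch. The function name and the input format (a collection of category names per recording) are assumptions made for illustration only and are not taken from the original implementation.</p>
<preformat preformat-type="code"><![CDATA[
```python
# Order of the five diagnostic categories, as in equation (1).
CLASSES = ("NORM", "MI", "CD", "HYP", "STTC")


def encode_labels(categories):
    """Return a binary vector of length 5 for a collection of category names.

    Hypothetical helper: the name and input format are illustrative.
    """
    present = set(categories)
    unknown = present - set(CLASSES)
    if unknown:
        raise ValueError(f"unknown diagnostic category: {sorted(unknown)}")
    return [1 if name in present else 0 for name in CLASSES]


# A recording with two co-occurring labels.
example = encode_labels(["MI", "STTC"])
```
]]></preformat>
<p>Because the labels may co-occur, any combination of zeros and ones is a valid target, including the all-zero vector for a recording assigned to none of the five categories.</p>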
<p>In accordance with the recommendation provided by the dataset authors (Wagner <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_035">2020</xref>), the metric for assessing and comparing the models was the macro-averaged area under the receiver operating characteristic curve (AUROC) (Fawcett, <xref ref-type="bibr" rid="j_infor620_ref_009">2006</xref>), i.e. the score was calculated for each label and then averaged. This popular metric was selected because it measures the ability of a classifier to discriminate between classes in a threshold-independent manner, eliminating the need to select a specific decision threshold (either globally or per model), on which many other metrics depend. AUROC also has an intuitive probabilistic interpretation: its value corresponds to the probability that a classifier assigns a higher score to a randomly chosen positive instance than to a randomly chosen negative instance (Fernández <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_010">2018</xref>).</p>
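<p>The probabilistic interpretation suggests a direct, if inefficient, way to compute the metric. The following sketch is illustrative only: it counts positive-negative pairs explicitly, treats ties as one half (a common convention assumed here), and is not the implementation used in the experiments.</p>
<preformat preformat-type="code"><![CDATA[
```python
def auroc(labels, scores):
    """AUROC as P(score_pos > score_neg), counting ties as 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC is undefined without both classes")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auroc(y_true, y_score):
    """Compute AUROC separately for each label column, then average."""
    n_labels = len(y_true[0])
    per_label = [
        auroc([row[j] for row in y_true], [row[j] for row in y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels
```
]]></preformat>
<p>The pairwise loop is quadratic in the number of examples, so a rank-based implementation from an established library would be preferable for a dataset of the size of PTB-XL.</p>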
</sec>
<sec id="j_infor620_s_005">
<label>3.2</label>
<title>Neural Networks</title>
<p>The following describes the NNs used in the experiments. The choice of the NN architectures was guided by their popularity in ECG classification. First, the RNN and the CNN architectures were created specifically for the purposes of this work, with their full, uncompressed versions intentionally designed to be compact (thus affecting the maximum compression rate). Next, the specific ResNet architecture was selected due to the superior benchmark results reported in Strodthoff <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor620_ref_034">2020</xref>). The chosen architecture types are widely used, which supports the broader applicability of the gathered results. All NNs were implemented in PyTorch (version 2.2.2).</p>
<p>The RNN (Elman, <xref ref-type="bibr" rid="j_infor620_ref_008">1990</xref>) was designed to capture the temporal dependencies inherent in ECG waveforms. The structure of the implemented RNN included an LSTM layer (Hochreiter and Schmidhuber, <xref ref-type="bibr" rid="j_infor620_ref_016">1997</xref>) (with a hidden state dimension of 64), a layer normalization, a 1D max pooling layer (with both stride and kernel size equal to 125), a dropout layer (10%), a linear layer with 512 input features, and a sigmoid output activation. The 1D pooling layer divided the sequence of 1000 data points into eight segments (1.25 seconds each) and selected the maximum values. This reduced the data dimensionality and improved the results (by making the NN more robust to signal shifts). The model consisted of 22661 parameters. The Adam optimization algorithm was used, with an initial learning rate of <inline-formula id="j_infor620_ineq_001"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${10^{-3}}$]]></tex-math></alternatives></inline-formula> and a reduce-on-plateau scheduler that decreased the learning rate by half after every ten epochs without improvement, with a minimum learning rate of <inline-formula id="j_infor620_ineq_002"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${10^{-5}}$]]></tex-math></alternatives></inline-formula>. The batch size was 64.</p>
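<p>The reported size of the RNN can be cross-checked with simple arithmetic. The sketch below assumes, without confirmation from the text, a single unidirectional LSTM layer fed with the 12 leads, two bias vectors per gate (the convention of the PyTorch nn.LSTM module), a layer normalization with a learnable scale and shift over the 64 hidden features, and a linear layer with five outputs and a bias.</p>
<preformat preformat-type="code"><![CDATA[
```python
def lstm_params(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights, and two
    # bias vectors (assumed PyTorch-style parameterization).
    return 4 * (input_size * hidden_size
                + hidden_size * hidden_size
                + 2 * hidden_size)


def layer_norm_params(features):
    return 2 * features  # learnable scale and shift


def linear_params(in_features, out_features):
    return in_features * out_features + out_features  # weights and bias


LEADS, HIDDEN, SAMPLES, POOL, LABELS = 12, 64, 1000, 125, 5

segments = SAMPLES // POOL       # eight segments of 1.25 s at 100 Hz
linear_in = segments * HIDDEN    # input features of the linear layer
total = (lstm_params(LEADS, HIDDEN)
         + layer_norm_params(HIDDEN)
         + linear_params(linear_in, LABELS))
```
]]></preformat>
<p>Under these assumptions, the pooled sequence yields the 512 input features of the linear layer, and the total agrees with the 22661 parameters stated above.</p>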
<p>The CNN (Lecun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_022">1998</xref>) was created for hierarchical feature extraction from the input signals. The proposed CNN consisted of four convolutional blocks. Each block included a 1D convolution layer (kernel size of 3), a batch normalization layer, a ReLU activation function, and a 1D max pooling layer (kernel size of 3). The number of output channels, for each 1D convolution, was 32, 64, 96, and 32, respectively. The NN ended with a dropout layer (5%), a linear layer with 352 input features, and a sigmoid activation function. The model consisted of 37157 parameters. SGD with Nesterov momentum (<inline-formula id="j_infor620_ineq_003"><alternatives><mml:math>
<mml:mi mathvariant="italic">μ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0.995</mml:mn></mml:math><tex-math><![CDATA[$\mu =0.995$]]></tex-math></alternatives></inline-formula>) was used for training. The <inline-formula id="j_infor620_ineq_004"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula> penalty coefficient, <italic>λ</italic>, was experimentally set to 0.007. The learning rate was varied using a cosine annealing scheduler with a period of 30 epochs. The maximum learning rate was set at <inline-formula id="j_infor620_ineq_005"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${10^{-3}}$]]></tex-math></alternatives></inline-formula>, while the minimum was set at <inline-formula id="j_infor620_ineq_006"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${10^{-6}}$]]></tex-math></alternatives></inline-formula>. The learning rate was adjusted after each training epoch. The batch size was 64.</p>
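<p>The feature and parameter counts of the CNN can be traced in the same way. The sketch below assumes 12 input channels, convolutions without padding and without bias terms, and pooling with a stride equal to the kernel size. These settings are not stated in the text; they are assumptions chosen because they reproduce the reported figures.</p>
<preformat preformat-type="code"><![CDATA[
```python
def conv_out_len(length, kernel):
    return length - kernel + 1   # no padding, stride 1 (assumed)


def pool_out_len(length, kernel):
    return length // kernel      # stride equal to kernel size (assumed)


channels = [12, 32, 64, 96, 32]  # 12 input leads, then the four blocks
length, params = 1000, 0
for c_in, c_out in zip(channels, channels[1:]):
    params += c_in * c_out * 3   # convolution weights, no bias (assumed)
    params += 2 * c_out          # batch-normalization scale and shift
    length = pool_out_len(conv_out_len(length, 3), 3)

flat_features = channels[-1] * length  # channels times remaining time steps
params += flat_features * 5 + 5        # final linear layer with bias
```
]]></preformat>
<p>With these assumptions, the sequence length shrinks from 1000 to 11 time steps, giving the 352 input features of the linear layer and a total of 37157 parameters, in line with the description above.</p>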
<p>The ResNet architecture (He <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_014">2016</xref>) was designed to facilitate stable training of very deep networks. The implementation of the 1D ResNet was based on the approach described in (Wang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_036">2017</xref>), with the following modifications. All shortcut connections were changed to 1D convolutional layers, with a filter length of 1 (He <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_014">2016</xref>) (followed by a batch normalization) to enable pruning of all layers (Section <xref rid="j_infor620_s_007">3.4</xref>). The output of the average pooling layer was passed to a linear layer with 128 input features, and the NN ended with a sigmoid activation function. The model consisted of 500869 parameters. The Adam optimizer was used. Cosine annealing was used to schedule the learning rate, with an annealing period of five epochs, a maximum learning rate of <inline-formula id="j_infor620_ineq_007"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${10^{-3}}$]]></tex-math></alternatives></inline-formula>, and a minimum learning rate of <inline-formula id="j_infor620_ineq_008"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${10^{-6}}$]]></tex-math></alternatives></inline-formula>. The batch size was 128.</p>
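<p>For illustration, the shape of a cosine annealing schedule with the ResNet settings can be sketched as follows. The closed form is the standard cosine decay from the maximum to the minimum learning rate. Treating the five-epoch period as a single decay (with the rate held at the minimum afterwards) is an assumption, since the text does not state whether the schedule restarts.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def cosine_lr(epoch, period=5, lr_max=1e-3, lr_min=1e-6):
    """Learning rate after `epoch` scheduler steps of a single cosine decay."""
    t = min(epoch, period)  # hold the minimum once the period is over (assumed)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / period))


# Learning rates at epochs 0..5 for the ResNet settings.
schedule = [cosine_lr(e) for e in range(6)]
```
]]></preformat>
<p>The schedule starts at the maximum learning rate, decreases slowly at first, faster in the middle of the period, and flattens again near the minimum.</p>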
<p>To train the NNs for multi-label classification, the binary cross-entropy loss summed over the labels (Goodfellow <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_012">2016</xref>), shown in equation (<xref rid="j_infor620_eq_002">2</xref>), was used. 
<disp-formula id="j_infor620_eq_002">
<label>(2)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">E</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mo>−</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mo mathvariant="normal" fence="true" stretchy="false">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo movablelimits="false">log</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo movablelimits="false">log</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">]</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {\mathcal{L}_{CE}}=-{\sum \limits_{i=1}^{n}}\big[{y_{i}}\log ({p_{i}})+(1-{y_{i}})\log (1-{p_{i}})\big].\]]]></tex-math></alternatives>
</disp-formula> 
Here, <inline-formula id="j_infor620_ineq_009"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="double-struck">N</mml:mi></mml:math><tex-math><![CDATA[$n\in \mathbb{N}$]]></tex-math></alternatives></inline-formula> denotes the number of labels, <inline-formula id="j_infor620_ineq_010"><alternatives><mml:math>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$y\in {\{0,1\}^{n}}$]]></tex-math></alternatives></inline-formula> represents the binary label vector from the dataset (Section <xref rid="j_infor620_s_004">3.1</xref>), and <inline-formula id="j_infor620_ineq_011"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$p\in {(0;1)^{n}}$]]></tex-math></alternatives></inline-formula> is the vector of label occurrence scores returned by the model for the considered example.</p>
</sec>
<sec id="j_infor620_s_006">
<label>3.3</label>
<title>Quantization</title>
<p>The applied compression techniques are described next, starting with quantization. In this work, uniform post-training quantization (PTQ) (Gholami <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_011">2022</xref>), expressed in (<xref rid="j_infor620_eq_003">3</xref>), was used. By default, static quantization was applied because it offers faster inference. 
<disp-formula id="j_infor620_eq_003">
<label>(3)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo mathvariant="normal" fence="true" maxsize="2.03em" minsize="2.03em">(</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">b</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">z</mml:mi>
<mml:mo mathvariant="normal" fence="true" maxsize="2.03em" minsize="2.03em">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ q(x)=r\bigg(\frac{x}{\beta -\alpha }\big({2^{b}}-1\big)+z\bigg).\]]]></tex-math></alternatives>
</disp-formula> 
Here, <italic>x</italic> represents the value to be quantized, <inline-formula id="j_infor620_ineq_012"><alternatives><mml:math>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$q(x)$]]></tex-math></alternatives></inline-formula> denotes the quantized result, <italic>α</italic> and <italic>β</italic> define the range used for clipping values, <italic>b</italic> specifies the number of bits of the quantized values, <italic>z</italic> represents the zero point, and <italic>r</italic> rounds its argument to the nearest integer value.</p>
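<p>As a minimal illustration of (<xref rid="j_infor620_eq_003">3</xref>), the following Python sketch quantizes a single scalar. It is not the implementation used in the experiments: the function names, the explicit clipping step, the example range, and the choice of the zero point (the integer that maps <italic>α</italic> to 0) are assumptions made only for this example. The rounding function <italic>r</italic> is modelled with Python's built-in <monospace>round</monospace>, which resolves ties to the nearest even integer.</p>
<preformat preformat-type="code"><![CDATA[
```python
def quantize(x, alpha, beta, bits, zero_point):
    """Uniform quantization as in Eq. (3), with x first clipped to [alpha, beta]."""
    if beta <= alpha:
        raise ValueError("beta must be greater than alpha")
    clipped = min(max(x, alpha), beta)
    levels = 2 ** bits - 1
    # r(.) is modelled with round(); ties go to the nearest even integer.
    return round(clipped / (beta - alpha) * levels + zero_point)


def dequantize(q, alpha, beta, bits, zero_point):
    """Approximate inverse of quantize: map an integer back to a real value."""
    levels = 2 ** bits - 1
    return (q - zero_point) * (beta - alpha) / levels


# Hypothetical clipping range and bit width, chosen only for illustration.
ALPHA, BETA, BITS = -1.0, 3.0, 8
# Assumed zero point: the integer that makes ALPHA map to 0.
ZERO_POINT = -round(ALPHA / (BETA - ALPHA) * (2 ** BITS - 1))

if __name__ == "__main__":
    for value in (-1.0, 0.0, 1.0, 3.0, 10.0):
        q = quantize(value, ALPHA, BETA, BITS, ZERO_POINT)
        print(value, q, dequantize(q, ALPHA, BETA, BITS, ZERO_POINT))
```
]]></preformat>
<p>Under these assumptions, values outside the clipping range map to the extreme integer levels, and the round-trip error of a value inside the range is at most half a quantization step.</p>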
<p>Inference with the quantized models was run exclusively on a CPU. The quantization backend used a combination of the FBGEMM<xref ref-type="fn" rid="j_infor620_fn_001">1</xref><fn id="j_infor620_fn_001"><label><sup>1</sup></label>
<p><uri>https://github.com/pytorch/FBGEMM</uri></p></fn> and oneDNN<xref ref-type="fn" rid="j_infor620_fn_002">2</xref><fn id="j_infor620_fn_002"><label><sup>2</sup></label>
<p><uri>https://github.com/oneapi-src/oneDNN</uri></p></fn> libraries provided in PyTorch. Note that, in PyTorch, bias is not quantized.</p>
<p>The implementation utilized observers that collected information on floating-point tensors passing through the NN during calibration (Gholami <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_011">2022</xref>; Wu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_037">2020</xref>), and derived quantization parameters (<italic>α</italic>, <italic>β</italic>, <italic>z</italic> in (<xref rid="j_infor620_eq_003">3</xref>)) based on these statistics. For monitoring weights, PyTorch’s <monospace>PerChannelMinMaxObserver</monospace> was used. It identified the minimum and maximum values for each channel, along the specified axis, to derive quantization parameters. The weights followed a symmetric quantization scheme (Gholami <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_011">2022</xref>) with per-channel granularity (Wu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_037">2020</xref>). They were represented as 8-bit signed integers, with range <inline-formula id="j_infor620_ineq_013"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mo>−</mml:mo>
<mml:mn>127</mml:mn>
<mml:mo>;</mml:mo>
<mml:mn>127</mml:mn>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[-127;127]$]]></tex-math></alternatives></inline-formula> (sacrificing one value for symmetry).</p>
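<p>The effect of per-channel granularity can be sketched in plain Python. The example below is an illustration under stated assumptions rather than PyTorch's observer code: the function names and the weight values are hypothetical, and the scale of each channel is simply taken as its largest absolute weight divided by 127. A per-tensor variant is included only for comparison.</p>
<preformat preformat-type="code"><![CDATA[
```python
def _to_int8_symmetric(value, scale, qmax):
    """Scale, round, and clamp a single weight to [-qmax, qmax]."""
    return max(-qmax, min(qmax, round(value / scale)))


def quantize_per_channel_symmetric(weights, qmax=127):
    """Quantize each channel (row) with its own scale; the zero point is 0."""
    quantized, scales = [], []
    for channel in weights:
        max_abs = max(abs(w) for w in channel)
        # Assumed fallback scale of 1.0 for an all-zero channel.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.append([_to_int8_symmetric(w, scale, qmax) for w in channel])
    return quantized, scales


def quantize_per_tensor_symmetric(weights, qmax=127):
    """Quantize all channels with one shared scale (shown for comparison)."""
    max_abs = max(abs(w) for channel in weights for w in channel)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [[_to_int8_symmetric(w, scale, qmax) for w in channel]
                 for channel in weights]
    return quantized, scale


# Hypothetical weights: the second channel is much smaller in magnitude.
WEIGHTS = [[0.6, -1.0, 0.25], [0.03, 0.01, -0.04]]

if __name__ == "__main__":
    print(quantize_per_channel_symmetric(WEIGHTS)[0])
    print(quantize_per_tensor_symmetric(WEIGHTS)[0])
```
]]></preformat>
<p>With a shared scale, the small-magnitude channel collapses onto a handful of integer levels, whereas a separate scale per channel lets it use the whole symmetric range.</p>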
<p>The activation observer was PyTorch’s <monospace>HistogramObserver</monospace>, which minimized the error between the floating-point and the quantized data distributions. Layer activations used the asymmetric quantization scheme (Gholami <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_011">2022</xref>) with per-tensor granularity (Wu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_037">2020</xref>). Their quantized form was stored as 8-bit unsigned integers, although only 7 bits were used (range 0 to 127) to avoid the saturation (Li and Alvarez, <xref ref-type="bibr" rid="j_infor620_ref_026">2021</xref>) caused by the <monospace>VPMADDUBSW</monospace> instruction in the FBGEMM library implementation (AVX-512 VNNI instructions were not available on the CPU used). The only exception was the output sigmoid activation function, which used the full 8-bit range.</p>
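<p>The saturation issue can be illustrated with plain integer arithmetic. The sketch below models a single 16-bit output lane of a multiply-add that multiplies unsigned 8-bit values by signed 8-bit values, sums one adjacent pair of products, and saturates the sum to the signed 16-bit range, which is the documented behaviour of <monospace>VPMADDUBSW</monospace>. The helper name and the operand values are hypothetical, and the sketch does not reproduce the FBGEMM kernels.</p>
<preformat preformat-type="code"><![CDATA[
```python
INT16_MIN, INT16_MAX = -32768, 32767


def multiply_add_pair(act0, act1, w0, w1):
    """Model one 16-bit lane: u8*s8 + u8*s8, saturated to signed 16 bits.

    Returns a tuple (exact_sum, saturated_sum).
    """
    for act in (act0, act1):
        if not 0 <= act <= 255:
            raise ValueError("activations must fit in an unsigned 8-bit integer")
    for w in (w0, w1):
        if not -128 <= w <= 127:
            raise ValueError("weights must fit in a signed 8-bit integer")
    exact = act0 * w0 + act1 * w1
    saturated = max(INT16_MIN, min(INT16_MAX, exact))
    return exact, saturated


if __name__ == "__main__":
    # Full 8-bit activations: the pair sum exceeds the 16-bit range.
    print(multiply_add_pair(255, 255, 127, 127))
    # 7-bit activations with weights in [-127, 127]: the pair sum always fits.
    print(multiply_add_pair(127, 127, 127, 127))
    print(multiply_add_pair(127, 127, -127, -127))
```
]]></preformat>
<p>With activations limited to 0 to 127 and weights to −127 to 127, the magnitude of a pair sum is at most 2 · 127 · 127 = 32258, which fits in 16 bits, so no saturation can occur in this model.</p>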
<p>The LSTM layer was handled differently. PyTorch’s <monospace>MinMaxObserver</monospace> was used to determine the quantization parameters for the weights. It derived a single set of quantization parameters for all weights at once (using static quantization), which corresponds to per-tensor granularity. The remaining weight quantization configuration was the same as before. The LSTM activations were quantized dynamically (Gholami <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_011">2022</xref>) during inference, so their quantization parameters were computed from the values observed at that moment. The remaining activation quantization configuration was unchanged.</p>
<p>Additionally, operator fusion was used to combine the convolutional layer, the batch normalization layer, and the ReLU activation function (if it occurred in the sequence) into a single layer. Finally, 128 randomly selected examples from the validation dataset were used to calibrate all models.</p>
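<p>The arithmetic behind fusing a convolution with batch normalization can be sketched as follows. This is the standard folding identity, shown for a simplified 1×1 convolution applied at a single position, and is not necessarily how the fusion is implemented in PyTorch. All function and parameter names (<monospace>gamma</monospace> and <monospace>shift</monospace> for the batch normalization scale and offset, <monospace>mean</monospace> and <monospace>var</monospace> for its running statistics) and all numeric values are assumptions made for the example.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def conv_bn_relu(x, weight, bias, gamma, shift, mean, var, eps=1e-5):
    """Unfused reference: 1x1 convolution, batch normalization, then ReLU."""
    out = []
    for w_c, b_c, g, s, m, v in zip(weight, bias, gamma, shift, mean, var):
        y = sum(w * xi for w, xi in zip(w_c, x)) + b_c
        y = (y - m) / math.sqrt(v + eps) * g + s
        out.append(max(0.0, y))
    return out


def fold_batchnorm(weight, bias, gamma, shift, mean, var, eps=1e-5):
    """Fold the per-channel batch normalization into the weights and bias."""
    folded_w, folded_b = [], []
    for w_c, b_c, g, s, m, v in zip(weight, bias, gamma, shift, mean, var):
        factor = g / math.sqrt(v + eps)
        folded_w.append([w * factor for w in w_c])
        folded_b.append((b_c - m) * factor + s)
    return folded_w, folded_b


def fused_conv_relu(x, folded_w, folded_b):
    """Single fused layer: convolution with folded parameters, then ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(w_c, x)) + b_c)
            for w_c, b_c in zip(folded_w, folded_b)]


# Hypothetical parameters: two output channels, two input channels.
PARAMS = dict(weight=[[0.5, -1.0], [2.0, 0.25]], bias=[0.1, -0.2],
              gamma=[1.5, 0.8], shift=[0.05, -0.3],
              mean=[0.2, -0.1], var=[0.9, 1.6])
X = [1.0, 2.0]

if __name__ == "__main__":
    print(conv_bn_relu(X, **PARAMS))
    print(fused_conv_relu(X, *fold_batchnorm(**PARAMS)))
```
]]></preformat>
<p>Both paths produce the same outputs up to floating-point rounding, so the fused layer needs only one pass over the data and one set of quantization parameters for its output.</p>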
</sec>
<sec id="j_infor620_s_007">
<label>3.4</label>
<title>Pruning</title>
<p>A structured variant of pruning (Cheng <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_005">2024</xref>) was chosen because its acceleration benefits can be observed without specialized hardware. It was implemented for the NNs proposed in Section <xref rid="j_infor620_s_005">3.2</xref>.</p>
<p>The LSTM (Hochreiter and Schmidhuber, <xref ref-type="bibr" rid="j_infor620_ref_016">1997</xref>) layer consists of the input-hidden weights <inline-formula id="j_infor620_ineq_014"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${W_{xi}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor620_ineq_015"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${W_{xf}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor620_ineq_016"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${W_{xc}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor620_ineq_017"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mi mathvariant="italic">o</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${W_{xo}}$]]></tex-math></alternatives></inline-formula> <inline-formula id="j_infor620_ineq_018"><alternatives><mml:math>
<mml:mo stretchy="false">∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$\in {\mathbb{R}^{r\times d}}$]]></tex-math></alternatives></inline-formula> and the hidden-hidden weights <inline-formula id="j_infor620_ineq_019"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${W_{hi}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor620_ineq_020"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${W_{hf}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor620_ineq_021"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${W_{hc}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor620_ineq_022"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:mi mathvariant="italic">o</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${W_{ho}}$]]></tex-math></alternatives></inline-formula> <inline-formula id="j_infor620_ineq_023"><alternatives><mml:math>
<mml:mo stretchy="false">∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$\in {\mathbb{R}^{r\times r}}$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_infor620_ineq_024"><alternatives><mml:math>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="double-struck">N</mml:mi></mml:math><tex-math><![CDATA[$d\in \mathbb{N}$]]></tex-math></alternatives></inline-formula> is the number of input features and <inline-formula id="j_infor620_ineq_025"><alternatives><mml:math>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="double-struck">N</mml:mi></mml:math><tex-math><![CDATA[$r\in \mathbb{N}$]]></tex-math></alternatives></inline-formula> is the number of features in the hidden state. At time step <inline-formula id="j_infor620_ineq_026"><alternatives><mml:math>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="double-struck">N</mml:mi></mml:math><tex-math><![CDATA[$t\in \mathbb{N}$]]></tex-math></alternatives></inline-formula>, the weights are used to compute the input gate in (<xref rid="j_infor620_eq_004">4</xref>), the forget gate in (<xref rid="j_infor620_eq_005">5</xref>), the cell gate in (<xref rid="j_infor620_eq_006">6</xref>), and the output gate in (<xref rid="j_infor620_eq_007">7</xref>). Here, <inline-formula id="j_infor620_ineq_027"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo stretchy="false">∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${x^{(t)}}\in {\mathbb{R}^{d}}$]]></tex-math></alternatives></inline-formula> is the input vector at time <italic>t</italic>, and <inline-formula id="j_infor620_ineq_028"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo stretchy="false">∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${h^{(t-1)}}\in {\mathbb{R}^{r}}$]]></tex-math></alternatives></inline-formula> is the hidden state at time <inline-formula id="j_infor620_ineq_029"><alternatives><mml:math>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$t-1$]]></tex-math></alternatives></inline-formula>. For brevity, the bias terms (<italic>b</italic>) are omitted from the following considerations because, in the proposed solution, they follow the pruning pattern derived from the weights. <disp-formula-group id="j_infor620_dg_001">
<disp-formula id="j_infor620_eq_004">
<label>(4)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd">
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mtd>
<mml:mtd class="align-even">
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">σ</mml:mi>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}{i^{(t)}}& =\sigma \big({W_{xi}}{x^{(t)}}+{W_{hi}}{h^{(t-1)}}+{b_{xi}}+{b_{hi}}\big),\end{aligned}\]]]></tex-math></alternatives>
</disp-formula>
<disp-formula id="j_infor620_eq_005">
<label>(5)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd">
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mtd>
<mml:mtd class="align-even">
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">σ</mml:mi>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}{f^{(t)}}& =\sigma \big({W_{xf}}{x^{(t)}}+{W_{hf}}{h^{(t-1)}}+{b_{xf}}+{b_{hf}}\big),\end{aligned}\]]]></tex-math></alternatives>
</disp-formula>
<disp-formula id="j_infor620_eq_006">
<label>(6)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd">
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mtd>
<mml:mtd class="align-even">
<mml:mo>=</mml:mo>
<mml:mo movablelimits="false">tanh</mml:mo>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}{c^{(t)}}& =\tanh \big({W_{xc}}{x^{(t)}}+{W_{hc}}{h^{(t-1)}}+{b_{xc}}+{b_{hc}}\big),\end{aligned}\]]]></tex-math></alternatives>
</disp-formula>
<disp-formula id="j_infor620_eq_007">
<label>(7)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd">
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mtd>
<mml:mtd class="align-even">
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">σ</mml:mi>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mi mathvariant="italic">o</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:mi mathvariant="italic">o</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mi mathvariant="italic">o</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:mi mathvariant="italic">o</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}{o^{(t)}}& =\sigma \big({W_{xo}}{x^{(t)}}+{W_{ho}}{h^{(t-1)}}+{b_{xo}}+{b_{ho}}\big).\end{aligned}\]]]></tex-math></alternatives>
</disp-formula>
</disp-formula-group> Therefore, the weights related to the <italic>s</italic>-th neuron in the input, forget, cell, and output gates can be expressed as <inline-formula id="j_infor620_ineq_030"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
<mml:mo>×</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${W_{s}}\in {\mathbb{R}^{4\times (d+r)}}$]]></tex-math></alternatives></inline-formula> in (<xref rid="j_infor620_eq_008">8</xref>), where <inline-formula id="j_infor620_ineq_031"><alternatives><mml:math>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$s\in \{1,\dots ,r\}$]]></tex-math></alternatives></inline-formula>. 
<disp-formula id="j_infor620_eq_008">
<label>(8)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfenced separators="" open="(" close=")">
<mml:mrow>
<mml:mtable columnspacing="4.0pt 4.0pt 4.0pt 4.0pt 4.0pt" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" columnalign="center center center center center center">
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mo>…</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mo>…</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mo>…</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mo>…</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mo>…</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mo>…</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mo>…</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mo>…</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {W_{s}}=\left(\begin{array}{c@{\hskip4.0pt}c@{\hskip4.0pt}c@{\hskip4.0pt}c@{\hskip4.0pt}c@{\hskip4.0pt}c}{W_{x{i_{s1}}}}& \dots & {W_{x{i_{sd}}}}& {W_{h{i_{s1}}}}& \dots & {W_{h{i_{sr}}}}\\ {} {W_{x{f_{s1}}}}& \dots & {W_{x{f_{sd}}}}& {W_{h{f_{s1}}}}& \dots & {W_{h{f_{sr}}}}\\ {} {W_{x{c_{s1}}}}& \dots & {W_{x{c_{sd}}}}& {W_{h{c_{s1}}}}& \dots & {W_{h{c_{sr}}}}\\ {} {W_{x{o_{s1}}}}& \dots & {W_{x{o_{sd}}}}& {W_{h{o_{s1}}}}& \dots & {W_{h{o_{sr}}}}\end{array}\right).\]]]></tex-math></alternatives>
</disp-formula> 
For each <inline-formula id="j_infor620_ineq_032"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${W_{s}}$]]></tex-math></alternatives></inline-formula>, either the <inline-formula id="j_infor620_ineq_033"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula> or the <inline-formula id="j_infor620_ineq_034"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula> norm was calculated. The neurons with the lowest norm values were pruned (i.e. the weights and the biases associated with these neurons were removed).</p>
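<p>The neuron-ranking step described above can be illustrated with a minimal sketch. It assumes that the norm is taken entrywise over each per-neuron matrix (four gate rows and <italic>d</italic> + <italic>r</italic> columns) and that a fixed number of neurons is removed; the helper names <monospace>entrywise_norm</monospace> and <monospace>neurons_to_prune</monospace> and the toy weights are hypothetical and are not taken from the original implementation.</p>
<preformat>
```python
import math

def entrywise_norm(matrix, order):
    """L1 (order=1) or L2 (order=2) norm over all entries of a 2D list."""
    flat = [abs(v) for row in matrix for v in row]
    if order == 1:
        return sum(flat)
    if order == 2:
        return math.sqrt(sum(v * v for v in flat))
    raise ValueError("order must be 1 or 2")

def neurons_to_prune(neuron_matrices, n_prune, order=1):
    """Indices of the n_prune neurons whose matrix W_s has the lowest norm."""
    norms = [entrywise_norm(w_s, order) for w_s in neuron_matrices]
    ranked = sorted(range(len(norms)), key=lambda s: norms[s])
    return sorted(ranked[:n_prune])

# Toy layer (assumption): r = 3 neurons, d = 1 input feature, so each W_s
# has 4 gate rows (i, f, c, o) and d + r = 4 columns.
w = [
    [[1.0, -1.0, 1.0, -1.0]] * 4,
    [[0.1, 0.0, 0.0, 0.0]] * 4,
    [[0.5, 0.5, 0.5, 0.5]] * 4,
]
print(neurons_to_prune(w, n_prune=1))
```
</preformat>
<p>In this toy layer, the second neuron has by far the smallest weights, so it is the first candidate for removal under either norm.</p>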
<p>An approach similar to (Li <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_025">2017</xref>) was applied to prune the convolutional layers. Consider a 1D convolutional layer, where weights are stored in a 3D tensor of size <inline-formula id="j_infor620_ineq_035"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">K</mml:mi></mml:math><tex-math><![CDATA[$O\times I\times K$]]></tex-math></alternatives></inline-formula>. Here, <italic>O</italic> is the number of output channels, <italic>I</italic> is the number of input channels, and <italic>K</italic> is the length of the filter (the bias terms are again omitted for brevity). Layer output channels are formed independently, so let us consider a single output channel, a matrix <inline-formula id="j_infor620_ineq_036"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${W_{s}}$]]></tex-math></alternatives></inline-formula> of size <inline-formula id="j_infor620_ineq_037"><alternatives><mml:math>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">K</mml:mi></mml:math><tex-math><![CDATA[$I\times K$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_infor620_ineq_038"><alternatives><mml:math>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$s\in \{1,\dots ,O\}$]]></tex-math></alternatives></inline-formula>. Now, suppose that <inline-formula id="j_infor620_ineq_039"><alternatives><mml:math>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$u\in \{0,\dots ,I-1\}$]]></tex-math></alternatives></inline-formula> input channels become unavailable. Then, these missing channels carry no information. Hence, effectively, the matrix <inline-formula id="j_infor620_ineq_040"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${W_{s}}$]]></tex-math></alternatives></inline-formula> becomes <inline-formula id="j_infor620_ineq_041"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${W^{\prime }_{s}}$]]></tex-math></alternatives></inline-formula> of size <inline-formula id="j_infor620_ineq_042"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">K</mml:mi></mml:math><tex-math><![CDATA[$(I-u)\times K$]]></tex-math></alternatives></inline-formula>, without the rows corresponding to the unavailable channels. During pruning, <inline-formula id="j_infor620_ineq_043"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${W^{\prime }_{s}}$]]></tex-math></alternatives></inline-formula> were considered, so that input channels already removed in earlier layers did not influence the ranking (Li <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_025">2017</xref>). Similarly to the LSTM layer, for <inline-formula id="j_infor620_ineq_044"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${W^{\prime }_{s}}$]]></tex-math></alternatives></inline-formula>, either the <inline-formula id="j_infor620_ineq_045"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula> or the <inline-formula id="j_infor620_ineq_046"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula> norm was calculated. Then, the layer’s output channels with the smallest norm values were pruned. In the ResNet, the shortcut connections followed the pruning pattern of the last convolutional layer in their residual blocks.</p>
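<p>A minimal sketch of this channel ranking is given below. It assumes an entrywise norm over the reduced matrix of each output channel and represents the <italic>O</italic> × <italic>I</italic> × <italic>K</italic> weight tensor as nested lists; the function names and the toy weights are hypothetical and serve only as an illustration.</p>
<preformat>
```python
def channel_norms(weight, unavailable, order=1):
    """Norm of the reduced matrix W'_s for every output channel s.

    weight is an O x I x K nested list; unavailable is the set of input
    channel indices that were removed by pruning earlier layers.
    """
    norms = []
    for w_s in weight:  # w_s is the I x K matrix of one output channel
        kept_rows = [row for i, row in enumerate(w_s) if i not in unavailable]
        total = sum(abs(v) ** order for row in kept_rows for v in row)
        norms.append(total ** (1.0 / order))
    return norms

def output_channels_to_prune(weight, unavailable, n_prune, order=1):
    """Indices of the n_prune output channels with the lowest norm."""
    norms = channel_norms(weight, unavailable, order)
    ranked = sorted(range(len(norms)), key=lambda s: norms[s])
    return sorted(ranked[:n_prune])

# Toy layer (assumption): O = 2, I = 2, K = 2.
weight = [
    [[5.0, 5.0], [0.1, 0.1]],  # large weights only on input channel 0
    [[1.0, 1.0], [1.0, 1.0]],
]
# All input channels available.
print(output_channels_to_prune(weight, unavailable=set(), n_prune=1))
# Input channel 0 was pruned in an earlier layer.
print(output_channels_to_prune(weight, unavailable={0}, n_prune=1))
```
</preformat>
<p>The two calls select different output channels, which shows why the rows of unavailable input channels are dropped before the norm is computed.</p>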
<p>The pruning procedure started from the input layer and proceeded sequentially until the output layer. For the RNN model, pruning began with the LSTM layer. Based on the norm values, calculated for <inline-formula id="j_infor620_ineq_047"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${W_{s}}$]]></tex-math></alternatives></inline-formula> (<xref rid="j_infor620_eq_008">8</xref>), neurons were removed, and the pruning schema was passed to the rest of the NN. For the CNN and the ResNet, in a convolutional layer, after receiving information about the unavailable input channels, the selected norm of <inline-formula id="j_infor620_ineq_048"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${W^{\prime }_{s}}$]]></tex-math></alternatives></inline-formula> was computed, the output channels with the lowest norm values were removed, and the pruning schema was passed deeper into the NN. The remaining layers, such as the batch normalization and linear layers, reacted to the missing channels by pruning the corresponding weights and biases, in effect shrinking as well.</p>
<p>Ideally, the model reduction achieved with structured pruning should lower the computational load, e.g. as measured by simple hardware-agnostic proxy metrics such as FLOPs (floating-point operations), and thus make inference faster. However, in practice, structured pruning may lead to a suboptimal configuration of NN structures (Dong <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_007">2021</xref>; Liberis <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_028">2021</xref>). This suboptimality manifests itself as a failure to exploit the full potential of the underlying hardware and library implementations, e.g. highly optimized vectorized kernels, and can effectively lead to notable slowdowns.</p>
<p>During the experiments, pruning was performed in one or more rounds to achieve the same degree of structural reduction measured in the fraction of structures left in the final pruned model. The fraction of available structures pruned in a single layer in one round was expressed as <inline-formula id="j_infor620_ineq_049"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$p=1-{f^{1/r}}$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_infor620_ineq_050"><alternatives><mml:math>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="double-struck">N</mml:mi></mml:math><tex-math><![CDATA[$r\in \mathbb{N}$]]></tex-math></alternatives></inline-formula> denoted the number of pruning rounds and <inline-formula id="j_infor620_ineq_051"><alternatives><mml:math>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$f\in (0;1)$]]></tex-math></alternatives></inline-formula> was the target fraction of structures left after all rounds (note the difference between weights and structures). In the experiments, <italic>f</italic> was set to 75%, 50%, or 12.5%, and <italic>r</italic> to 1, 5, or 10.</p>
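<p>The per-round fraction follows from requiring that the fraction of structures kept in each round, 1 − <italic>p</italic>, multiplies to <italic>f</italic> over <italic>r</italic> rounds. The short sketch below evaluates the formula for the reported settings and checks this identity numerically; the function name <monospace>per_round_fraction</monospace> is hypothetical, and the rounding of structure counts to integers, which a real layer requires, is ignored.</p>
<preformat>
```python
def per_round_fraction(f, r):
    """Fraction p of the remaining structures pruned per round, p = 1 - f**(1/r)."""
    return 1.0 - f ** (1.0 / r)

for f in (0.75, 0.5, 0.125):
    for r in (1, 5, 10):
        p = per_round_fraction(f, r)
        left = (1.0 - p) ** r  # fraction of structures left after r rounds
        print(f"f={f:.3f} r={r:2d} p={p:.4f} left={left:.4f}")
```
</preformat>
<p>For example, reaching <italic>f</italic> = 50% in <italic>r</italic> = 5 rounds requires pruning about 12.9% of the remaining structures in each round.</p>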
<p>After each pruning round, the models were fine-tuned, using generally the same approach as for training the full (uncompressed) NNs, or using knowledge distillation (Section <xref rid="j_infor620_s_008">3.5</xref>). However, for the CNN model, the maximum learning rate was reduced to <inline-formula id="j_infor620_ineq_052"><alternatives><mml:math>
<mml:mn>5</mml:mn>
<mml:mo>·</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$5\cdot {10^{-4}}$]]></tex-math></alternatives></inline-formula> and the annealing period was changed to 20 epochs. The RNN and the ResNet configurations were unchanged.</p>
</sec>
<sec id="j_infor620_s_008">
<label>3.5</label>
<title>Knowledge Distillation</title>
<p>Overall, response-based knowledge distillation (Gou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_013">2021</xref>) was used, as it offers a simple and general approach. However, the typical scheme introduced in (Hinton <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_015">2015</xref>) suits multi-class rather than multi-label classification. To address this, the authors of (Yang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_038">2023</xref>) proposed applying response-based knowledge distillation with one teacher and one student to multi-label classification by decomposing the multi-label task into independent binary classification tasks. For each task, two two-element probability distributions (one for the student and one for the teacher) are formed from the probability of label occurrence and its complement, and the loss function is expressed as the divergence (<inline-formula id="j_infor620_ineq_053"><alternatives><mml:math>
<mml:mi mathvariant="script">D</mml:mi></mml:math><tex-math><![CDATA[$\mathcal{D}$]]></tex-math></alternatives></inline-formula>) between these distributions. This loss function, <inline-formula id="j_infor620_ineq_054"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathcal{L}_{MLD}}$]]></tex-math></alternatives></inline-formula> (Yang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_038">2023</xref>), is presented in (<xref rid="j_infor620_eq_009">9</xref>), where the superscript <italic>t</italic> denotes the teacher’s output and <italic>s</italic>, the student’s output. See (<xref rid="j_infor620_eq_002">2</xref>) for the remaining notation. 
<disp-formula id="j_infor620_eq_009">
<label>(9)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">(</mml:mo>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">[</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">]</mml:mo>
<mml:mspace width="0.1667em"/>
<mml:mo maxsize="1.19em" minsize="1.19em" stretchy="true">|</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" stretchy="true">|</mml:mo>
<mml:mspace width="0.1667em"/>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">[</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">]</mml:mo>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {\mathcal{L}_{MLD}}={\sum \limits_{i=1}^{n}}\mathcal{D}\big(\big[{p_{i}^{t}},1-{p_{i}^{t}}\big]\hspace{0.1667em}\big|\big|\hspace{0.1667em}\big[{p_{i}^{s}},1-{p_{i}^{s}}\big]\big).\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>Ultimately, the loss function followed expression (<xref rid="j_infor620_eq_010">10</xref>). In <inline-formula id="j_infor620_ineq_055"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">E</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathcal{L}_{CE}}$]]></tex-math></alternatives></inline-formula> (<xref rid="j_infor620_eq_002">2</xref>), <italic>p</italic> was set to be the student’s output, <inline-formula id="j_infor620_ineq_056"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${p^{s}}$]]></tex-math></alternatives></inline-formula>. As divergence, <inline-formula id="j_infor620_ineq_057"><alternatives><mml:math>
<mml:mi mathvariant="script">D</mml:mi></mml:math><tex-math><![CDATA[$\mathcal{D}$]]></tex-math></alternatives></inline-formula>, in (<xref rid="j_infor620_eq_009">9</xref>), the Kullback-Leibler divergence (Kullback and Leibler, <xref ref-type="bibr" rid="j_infor620_ref_021">1951</xref>) was used. For all models, the best value of the coefficient <italic>α</italic> was empirically determined to be 0.4. 
<disp-formula id="j_infor620_eq_010">
<label>(10)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">K</mml:mi>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">E</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {\mathcal{L}_{KD}}=\alpha {\mathcal{L}_{CE}}+(1-\alpha ){\mathcal{L}_{MLD}}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
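<p>For a single example, the combined loss can be sketched as follows. The sketch assumes that the cross-entropy term is the binary cross-entropy summed over the labels (its exact form is defined earlier in the article and is not reproduced here) and that the model outputs are already per-label probabilities; all function names are hypothetical, and the clipping constant <monospace>eps</monospace> is added only for numerical safety.</p>
<preformat>
```python
import math

def binary_kl(p_t, p_s, eps=1e-12):
    """Kullback-Leibler divergence between [p_t, 1 - p_t] and [p_s, 1 - p_s]."""
    p_t = min(max(p_t, eps), 1.0 - eps)
    p_s = min(max(p_s, eps), 1.0 - eps)
    return (p_t * math.log(p_t / p_s)
            + (1.0 - p_t) * math.log((1.0 - p_t) / (1.0 - p_s)))

def mld_loss(teacher_probs, student_probs):
    """Sum of the per-label binary divergences, teacher versus student."""
    return sum(binary_kl(pt, ps) for pt, ps in zip(teacher_probs, student_probs))

def bce_loss(targets, student_probs, eps=1e-12):
    """Binary cross-entropy summed over labels (assumed form of the CE term)."""
    total = 0.0
    for y, p in zip(targets, student_probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total

def kd_loss(targets, teacher_probs, student_probs, alpha=0.4):
    """alpha * CE + (1 - alpha) * MLD, with alpha = 0.4 as reported."""
    return (alpha * bce_loss(targets, student_probs)
            + (1.0 - alpha) * mld_loss(teacher_probs, student_probs))

print(kd_loss([1, 0, 1], [0.9, 0.2, 0.6], [0.7, 0.3, 0.5]))
```
</preformat>
<p>When the student reproduces the teacher’s probabilities exactly, the distillation term vanishes and only the weighted cross-entropy remains.</p>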
<p>During the experiments, the teacher was a full floating-point model with the same base architecture as the student, trained without knowledge distillation. The student was either a new instance of the uncompressed model, to see whether the approach could improve the base models’ results, or a pruned (smaller) model, to observe the impact during the fine-tuning stage.</p>
</sec>
<sec id="j_infor620_s_009">
<label>3.6</label>
<title>Additional Details of the Experimental Setup</title>
<p>For memory allocation, instead of <monospace>malloc</monospace>, <monospace>tcmalloc</monospace> was used for the RNN, and <monospace>jemalloc</monospace> for the CNN and the ResNet models, as they offered slightly faster performance.</p>
<p>The average inference time was calculated for the models running on the CPU (Section <xref rid="j_infor620_s_010">4</xref>). Although it may not correlate with execution time on other hardware, wall-clock time was selected because it shows the actual latency and captures various factors, such as memory access or kernel launch overheads. The number of threads available to PyTorch was limited to 1 to mitigate the influence of parallel execution, which was beyond the scope of this research.</p>
<p>Before the time measurements, the models were warmed up on 100 examples. The batch size was set to 1. Inference time was measured for 10000 examples for the CNN, and 2000 examples for the RNN and the ResNet, and the results were averaged. In PyTorch, the floating-point version of the RNN model can use an efficient implementation for the LSTM layer from the oneDNN library (formerly known as MKL-DNN and DNNL). Therefore, the floating-point RNN inference time was measured twice, with and without access to the library.</p>
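<p>The measurement protocol (warm-up, batch size of 1, averaging over many examples) can be sketched as below. This is an illustration under stated assumptions rather than the original benchmarking code: <monospace>mean_latency</monospace> and the stand-in workload are hypothetical, the total elapsed time is divided by the number of timed calls, and the model call is replaced by a plain Python function. In PyTorch, a single-thread limit can be set with <monospace>torch.set_num_threads(1)</monospace>, although the mechanism actually used is not stated here.</p>
<preformat>
```python
import time

def mean_latency(predict, examples, warmup=100):
    """Average wall-clock seconds per example after a warm-up phase."""
    timed = examples[warmup:]
    if not timed:
        raise ValueError("need more examples than warm-up iterations")
    for x in examples[:warmup]:  # warm-up calls are not timed
        predict(x)
    start = time.perf_counter()
    for x in timed:  # batch size 1: one example per call
        predict(x)
    return (time.perf_counter() - start) / len(timed)

# Stand-in workload (assumption); a real run would call the model here.
examples = [[0.1] * 64] * 2100  # 100 warm-up examples + 2000 timed examples
latency = mean_latency(lambda x: sum(v * v for v in x), examples)
print(f"{latency * 1e6:.2f} microseconds per example")
```
</preformat>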
<p>The models’ size was measured by saving PyTorch’s <monospace>state_dict</monospace> to disk and reading the size of the resulting file. Note that, in this representation, additional data are stored in the created file, such as information about tensors representing weights or quantization parameters. Therefore, the reported size reduction was smaller than the ideal theoretical reduction (i.e. a 4-fold reduction in the case of quantizing the model from 32-bit floating-point to 8-bit integer representation).</p>
</sec>
</sec>
<sec id="j_infor620_s_010">
<label>4</label>
<title>Experiments and Results</title>
<p>Across this section, the following notation is used. M<inline-formula id="j_infor620_ineq_058"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula> is a floating-point model; M<inline-formula id="j_infor620_ineq_059"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>Q</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{Q}}}$]]></tex-math></alternatives></inline-formula> is the quantized M<inline-formula id="j_infor620_ineq_060"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula>; M<inline-formula id="j_infor620_ineq_061"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula> is a floating-point model trained with knowledge distillation, where the full M<inline-formula id="j_infor620_ineq_062"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula> was the teacher; M<inline-formula id="j_infor620_ineq_063"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>QKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{QKD}}}$]]></tex-math></alternatives></inline-formula> is the quantized M<inline-formula id="j_infor620_ineq_064"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula>. The notation <inline-formula id="j_infor620_ineq_065"><alternatives><mml:math>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mi mathvariant="italic">n</mml:mi></mml:math><tex-math><![CDATA[$r/Ln$]]></tex-math></alternatives></inline-formula> means that the model results from pruning in a total of <italic>r</italic> rounds using the <inline-formula id="j_infor620_ineq_066"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mi mathvariant="italic">n</mml:mi></mml:math><tex-math><![CDATA[$Ln$]]></tex-math></alternatives></inline-formula> norm (<inline-formula id="j_infor620_ineq_067"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula> or <inline-formula id="j_infor620_ineq_068"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula>). The numerical results are reported as the mean value, with the standard deviation, presented as 0.abcd(xyz), which should be understood as 0.abcd ± 0.0xyz, e.g. 0.9186(013) stands for 0.9186 ± 0.0013. The term “parameters left” refers to the fraction of the weights and the biases left in the model after pruning.</p>
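<p>The compact notation for the mean and the standard deviation can be converted in both directions with a few lines of code. The sketch below assumes four decimal places for the mean and a standard deviation below 0.1, so that it fits into three digits; the function names <monospace>compact</monospace> and <monospace>expand</monospace> are hypothetical.</p>
<preformat>
```python
def compact(mean, std):
    """Format a mean and a standard deviation as 0.abcd(xyz)."""
    return f"{mean:.4f}({round(std * 10_000):03d})"

def expand(text):
    """Parse 0.abcd(xyz) back into the pair (mean, std)."""
    mean_part, std_part = text.rstrip(")").split("(")
    return float(mean_part), int(std_part) / 10_000

print(compact(0.9186, 0.0013))
print(expand("0.8755(091)"))
```
</preformat>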
<p>The experimental process was as follows. First, a basic model, M<inline-formula id="j_infor620_ineq_069"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula>, was trained. Next, a new model, M<inline-formula id="j_infor620_ineq_070"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula>, was trained using M<inline-formula id="j_infor620_ineq_071"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula> as the teacher in the knowledge distillation schema. After that, M<inline-formula id="j_infor620_ineq_072"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula>, or M<inline-formula id="j_infor620_ineq_073"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula>, was pruned in <italic>r</italic> rounds using the <inline-formula id="j_infor620_ineq_074"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula> or the <inline-formula id="j_infor620_ineq_075"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula> norm, with each round followed by fine-tuning (with knowledge distillation for M<inline-formula id="j_infor620_ineq_076"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula>). Each model eventually reached its target stage: for unpruned models, this was the end of the initial training, while for pruned models, it was the point at which the desired target fraction of structures left, <italic>f</italic> (Section <xref rid="j_infor620_s_007">3.4</xref>), was achieved. After reaching the target stage, each resulting model was quantized, producing M<inline-formula id="j_infor620_ineq_077"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>Q</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{Q}}}$]]></tex-math></alternatives></inline-formula> or M<inline-formula id="j_infor620_ineq_078"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>QKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{QKD}}}$]]></tex-math></alternatives></inline-formula>, and tested.</p>
<p>The experiments were run five times for each architecture. Tables <xref rid="j_infor620_tab_001">1</xref>, <xref rid="j_infor620_tab_002">2</xref>, and <xref rid="j_infor620_tab_003">3</xref> present the mean (from the five runs) macro-averaged AUROC in each target stage, achieved by the full model (100% of parameters) or by the best pruning approach. The highest score in each row is written in bold, and the highest score in the table is additionally underlined. Figures <xref rid="j_infor620_fig_001">1</xref>, <xref rid="j_infor620_fig_003">3</xref>, and <xref rid="j_infor620_fig_005">5</xref> (inference time) and Figs. <xref rid="j_infor620_fig_002">2</xref>, <xref rid="j_infor620_fig_004">4</xref>, and <xref rid="j_infor620_fig_006">6</xref> (model size) show the averaged measurements from the five runs, along with error bars representing the standard deviation. These measurements were performed for each intermediate model produced during the experiments.</p>
<p>To accelerate development and collect preliminary results, the experiments were conducted on the Intel i9-12900H CPU and the NVIDIA GeForce RTX 3080Ti Laptop GPU rather than on an ECG device (Section <xref rid="j_infor620_s_019">5.5</xref>). For consistency, the CPU scaling governor was set to performance mode and Intel Turbo Boost was enabled.</p>
<sec id="j_infor620_s_011">
<label>4.1</label>
<title>Recurrent Neural Network</title>
<table-wrap id="j_infor620_tab_001">
<label>Table 1</label>
<caption>
<p>RNN AUROC scores with pruning techniques.</p>
</caption>
<table>
<thead>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: left; border-top: solid thin; border-bottom: solid thin">Model</td>
<td colspan="4" style="vertical-align: top; text-align: center; border-top: solid thin; border-bottom: solid thin">Parameters left</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">100%</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">61.45%</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">31.94%</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">4.61%</td>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: left">M<inline-formula id="j_infor620_ineq_079"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">0.9186(013)</td>
<td style="vertical-align: top; text-align: center"><bold>0.9199</bold>(009)</td>
<td style="vertical-align: top; text-align: center">0.9186(005)</td>
<td style="vertical-align: top; text-align: center">0.8755(091)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">1/<inline-formula id="j_infor620_ineq_080"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">10/<inline-formula id="j_infor620_ineq_081"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">10/<inline-formula id="j_infor620_ineq_082"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: left">M<inline-formula id="j_infor620_ineq_083"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>Q</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{Q}}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center"><bold>0.9156</bold>(008)</td>
<td style="vertical-align: top; text-align: center">0.9155(010)</td>
<td style="vertical-align: top; text-align: center">0.9136(010)</td>
<td style="vertical-align: top; text-align: center">0.8697(101)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">1/<inline-formula id="j_infor620_ineq_084"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">10/<inline-formula id="j_infor620_ineq_085"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">10/<inline-formula id="j_infor620_ineq_086"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: left">M<inline-formula id="j_infor620_ineq_087"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">0.9207(005)</td>
<td style="vertical-align: top; text-align: center"><underline><bold>0.9208</bold></underline>(011)</td>
<td style="vertical-align: top; text-align: center">0.9195(021)</td>
<td style="vertical-align: top; text-align: center">0.8809(058)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">10/<inline-formula id="j_infor620_ineq_088"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">5/<inline-formula id="j_infor620_ineq_089"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">5/<inline-formula id="j_infor620_ineq_090"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: left; border-bottom: solid thin">M<inline-formula id="j_infor620_ineq_091"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>QKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{QKD}}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">0.9174(008)</td>
<td style="vertical-align: top; text-align: center"><bold>0.9178</bold>(008)</td>
<td style="vertical-align: top; text-align: center">0.9167(014)</td>
<td style="vertical-align: top; text-align: center">0.8724(070)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">10/<inline-formula id="j_infor620_ineq_092"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">10/<inline-formula id="j_infor620_ineq_093"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">1/<inline-formula id="j_infor620_ineq_094"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>What follows is a summary of the results obtained by the RNN during the experiments. First, we analyse the AUROC results in Table <xref rid="j_infor620_tab_001">1</xref>. Models M<inline-formula id="j_infor620_ineq_095"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula>, M<inline-formula id="j_infor620_ineq_096"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula>, and M<inline-formula id="j_infor620_ineq_097"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>QKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{QKD}}}$]]></tex-math></alternatives></inline-formula> achieved their best scores with 61.45% of the original parameters left, demonstrating that the applied pruning method (both with and without knowledge distillation) can improve the quality of the base model. This behaviour can be attributed to the regularizing effect of pruning (Hohman <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_017">2024</xref>). However, more aggressive pruning to 31.94% of parameters degraded quality compared with the 61.45% stage, and further reduction to 4.61% led to a poor fit of the models. Therefore, practitioners must carefully weigh the trade-off between model size and predictive quality when applying pruning.</p>
<p>The models fine-tuned with knowledge distillation (M<inline-formula id="j_infor620_ineq_098"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula>, M<inline-formula id="j_infor620_ineq_099"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>QKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{QKD}}}$]]></tex-math></alternatives></inline-formula>) outperformed their respective counterparts fine-tuned without knowledge distillation (M<inline-formula id="j_infor620_ineq_100"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula>, M<inline-formula id="j_infor620_ineq_101"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>Q</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{Q}}}$]]></tex-math></alternatives></inline-formula>) in terms of predictive quality across all pruning levels. Interestingly, for the models M<inline-formula id="j_infor620_ineq_102"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula> and M<inline-formula id="j_infor620_ineq_103"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>QKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{QKD}}}$]]></tex-math></alternatives></inline-formula> at 31.94% and 4.61% of parameters left, different pruning approaches yielded the highest scores (i.e. 5/<inline-formula id="j_infor620_ineq_104"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula> versus 10/<inline-formula id="j_infor620_ineq_105"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula>, and 5/<inline-formula id="j_infor620_ineq_106"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula> versus 1/<inline-formula id="j_infor620_ineq_107"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula>). This observation indicates that the best quantized model does not necessarily derive from the highest-scoring floating-point model.</p>
<fig id="j_infor620_fig_001">
<label>Fig. 1</label>
<caption>
<p>RNN inference time during pruning.</p>
</caption>
<graphic xlink:href="infor620_g001.jpg"/>
</fig>
<p>Next, Fig. <xref rid="j_infor620_fig_001">1</xref> presents the average inference time of both the floating-point (M<inline-formula id="j_infor620_ineq_108"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula>, M<inline-formula id="j_infor620_ineq_109"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula>) and the quantized (M<inline-formula id="j_infor620_ineq_110"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>Q</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{Q}}}$]]></tex-math></alternatives></inline-formula>, M<inline-formula id="j_infor620_ineq_111"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>QKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{QKD}}}$]]></tex-math></alternatives></inline-formula>) models during pruning. The floating-point models were timed both with and without access to the oneDNN library (Section <xref rid="j_infor620_s_009">3.6</xref>). As expected, the floating-point models without access to the oneDNN implementation were slower than their quantized counterparts. However, the floating-point models using the oneDNN library were significantly faster than both the quantized models and the floating-point models without the library, showing that quantization is not always advantageous in terms of latency. It has to be stressed that this conclusion is likely to be platform-specific. Additionally, during pruning, the absolute latency reductions of the two floating-point variants were similar to each other and larger than those observed for the quantized models. Because the models with access to the oneDNN library started from the lowest latency, they showed the biggest relative improvements.</p>
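<p>The relation between absolute and relative latency improvements can be illustrated with a short Python sketch. The latencies below are hypothetical round numbers chosen for illustration, not measurements from the experiments, and the function name is an assumption.</p>
<preformat preformat-type="code">
```python
def relative_improvement(latency_before, latency_after):
    """Fraction of the original latency removed by pruning."""
    return (latency_before - latency_after) / latency_before


# Hypothetical latencies in milliseconds; both models gain the same 2 ms.
slow_before, slow_after = 10.0, 8.0  # e.g. a model without an optimized backend
fast_before, fast_after = 4.0, 2.0   # e.g. a model with an optimized backend

print(f"slower baseline: {relative_improvement(slow_before, slow_after):.0%}")
print(f"faster baseline: {relative_improvement(fast_before, fast_after):.0%}")
```
</preformat>
<p>With these illustrative values, the same absolute gain of 2 ms corresponds to a 20% improvement for the slower baseline and a 50% improvement for the faster one.</p>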
<fig id="j_infor620_fig_002">
<label>Fig. 2</label>
<caption>
<p>RNN size during pruning.</p>
</caption>
<graphic xlink:href="infor620_g002.jpg"/>
</fig>
<p>During pruning, the size of the models decreased linearly; the same held for the CNN and the ResNet. The floating-point RNN models shrank from 93.677 KB to 7.149 KB, while the quantized models shrank from 31.194 KB to 7.770 KB. Therefore, the storage advantage of quantization gradually diminished as the models shrank, up to the point where the quantized NN actually consumed more disk space because of the additional overhead data stored on the disk (Section <xref rid="j_infor620_s_009">3.6</xref>). Figure <xref rid="j_infor620_fig_002">2</xref> visualizes these size measurements throughout pruning.</p>
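<p>The point at which the quantized RNN becomes larger than the floating-point one can be estimated with a simple linear interpolation between the reported end points. The sketch below is a rough estimate under two assumptions that are not stated in the measurements themselves: that size is linear in the fraction of parameters left, and that the reported sizes correspond to 100% and 4.61% of parameters. The function names are hypothetical.</p>
<preformat preformat-type="code">
```python
def linear_size(p, p_min, size_full, size_min):
    """Size in KB at fraction p of parameters left, interpolated linearly."""
    slope = (size_full - size_min) / (1.0 - p_min)
    return size_min + slope * (p - p_min)


def crossover_fraction(p_min, float_full, float_min, quant_full, quant_min):
    """Fraction of parameters left at which both interpolated sizes are equal."""
    slope_float = (float_full - float_min) / (1.0 - p_min)
    slope_quant = (quant_full - quant_min) / (1.0 - p_min)
    return p_min + (quant_min - float_min) / (slope_float - slope_quant)


P_MIN = 0.0461                       # smallest RNN stage: 4.61% of parameters left
FLOAT_KB = (93.677, 7.149)           # floating-point size at 100% and at 4.61%
QUANT_KB = (31.194, 7.770)           # quantized size at 100% and at 4.61%

p_cross = crossover_fraction(P_MIN, *FLOAT_KB, *QUANT_KB)
size_at_cross = linear_size(p_cross, P_MIN, *FLOAT_KB)
print(f"estimated crossover at {p_cross:.2%} of parameters left "
      f"(about {size_at_cross:.2f} KB)")
```
</preformat>
<p>Under these assumptions, the estimated crossover lies at roughly 5.5% of parameters left, that is, only near the end of pruning; the actual point depends on the measured curves in Fig. <xref rid="j_infor620_fig_002">2</xref>.</p>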
</sec>
<sec id="j_infor620_s_012">
<label>4.2</label>
<title>Convolutional Neural Network</title>
<table-wrap id="j_infor620_tab_002">
<label>Table 2</label>
<caption>
<p>CNN AUROC scores with pruning techniques.</p>
</caption>
<table>
<thead>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: left; border-top: solid thin; border-bottom: solid thin">Model</td>
<td colspan="4" style="vertical-align: top; text-align: center; border-top: solid thin; border-bottom: solid thin">Parameters left</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">100%</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">57.95%</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">27.27%</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">2.56%</td>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: left">M<inline-formula id="j_infor620_ineq_112"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center"><bold>0.9311</bold>(006)</td>
<td style="vertical-align: top; text-align: center">0.9309(003)</td>
<td style="vertical-align: top; text-align: center">0.9302(003)</td>
<td style="vertical-align: top; text-align: center">0.8984(031)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">10/<inline-formula id="j_infor620_ineq_113"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">10/<inline-formula id="j_infor620_ineq_114"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">5/<inline-formula id="j_infor620_ineq_115"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: left">M<inline-formula id="j_infor620_ineq_116"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>Q</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{Q}}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center"><bold>0.9253</bold>(028)</td>
<td style="vertical-align: top; text-align: center">0.9227(056)</td>
<td style="vertical-align: top; text-align: center">0.9216(036)</td>
<td style="vertical-align: top; text-align: center">0.8865(081)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">1/<inline-formula id="j_infor620_ineq_117"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">5/<inline-formula id="j_infor620_ineq_118"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">10/<inline-formula id="j_infor620_ineq_119"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: left">M<inline-formula id="j_infor620_ineq_120"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center"><underline><bold>0.9320</bold></underline>(007)</td>
<td style="vertical-align: top; text-align: center">0.9311(005)</td>
<td style="vertical-align: top; text-align: center">0.9299(011)</td>
<td style="vertical-align: top; text-align: center">0.9032(037)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">10/<inline-formula id="j_infor620_ineq_121"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">5/<inline-formula id="j_infor620_ineq_122"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">10/<inline-formula id="j_infor620_ineq_123"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: left; border-bottom: solid thin">M<inline-formula id="j_infor620_ineq_124"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>QKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{QKD}}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center"><bold>0.9268</bold>(019)</td>
<td style="vertical-align: top; text-align: center">0.9252(036)</td>
<td style="vertical-align: top; text-align: center">0.9242(016)</td>
<td style="vertical-align: top; text-align: center">0.8808(187)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">5/<inline-formula id="j_infor620_ineq_125"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">1/<inline-formula id="j_infor620_ineq_126"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">10/<inline-formula id="j_infor620_ineq_127"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Next, we discuss the results obtained by the second proposed model, the CNN. Table <xref rid="j_infor620_tab_002">2</xref> summarizes the AUROC scores obtained by the CNN at different target stages. In contrast to the patterns observed for the RNN (Section <xref rid="j_infor620_s_011">4.1</xref>) and the ResNet (Section <xref rid="j_infor620_s_013">4.3</xref>), the CNN models suffered a noticeable drop in AUROC scores once parameters were removed. The smallest degradation appeared in the models trained with knowledge distillation, suggesting that distillation helped the pruned networks retain part of the teacher’s representational power. However, the benefit was not uniform. At 27.27% of parameters left, the model fine-tuned with knowledge distillation, M<inline-formula id="j_infor620_ineq_128"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula>, was slightly worse than its non-distilled counterpart, M<inline-formula id="j_infor620_ineq_129"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula>. Furthermore, at a high sparsity level of 2.56% of parameters left, M<inline-formula id="j_infor620_ineq_130"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>Q</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{Q}}}$]]></tex-math></alternatives></inline-formula> was slightly better than M<inline-formula id="j_infor620_ineq_131"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>QKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{QKD}}}$]]></tex-math></alternatives></inline-formula>. Another aspect to note in the results is that different combinations of pruning rounds and norms led to the best-performing floating-point and quantized models at the same pruning level. For instance, at 57.95% of parameters left, the best results for the floating-point model, M<inline-formula id="j_infor620_ineq_132"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula>, were achieved with the 10/<inline-formula id="j_infor620_ineq_133"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula> pruning method, but the best results for the quantized model, M<inline-formula id="j_infor620_ineq_134"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>Q</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{Q}}}$]]></tex-math></alternatives></inline-formula>, were achieved after quantizing the floating-point model obtained with the 1/<inline-formula id="j_infor620_ineq_135"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula> pruning method. Therefore, one cannot assume that, among multiple floating-point models, the one with the best results will remain the best after quantization.</p>
<p>Although relatively small and compact, the CNN architecture used in the reported experiments performed well on the given (ECG) dataset. In fact, even when pruned so that only 27.27% of its original parameters remained, the CNN still outperformed the best RNN results. Even more encouraging is that the base M<inline-formula id="j_infor620_ineq_136"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula> CNN reached an AUROC score that surpassed the best single-model scores reported by Strodthoff <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor620_ref_034">2020</xref>) for the very same task. These findings suggest that a carefully designed, lightweight CNN can match (or even beat) larger and more complex models on ECG classification while using a fraction of the parameters, highlighting the importance of tailored architecture design.</p>
<fig id="j_infor620_fig_003">
<label>Fig. 3</label>
<caption>
<p>CNN inference time during pruning.</p>
</caption>
<graphic xlink:href="infor620_g003.jpg"/>
</fig>
<p>Next, Fig. <xref rid="j_infor620_fig_003">3</xref> illustrates the inference latency of the CNN variants in the experiments. Initially, the quantized models outperformed their floating-point counterparts, achieving lower inference times on the tested hardware. However, as pruning progressed, the quantized models became slower than the floating-point versions despite their reduced arithmetic precision, mainly because of the degraded performance of the quantized max pooling operation with fewer than 32 channels.</p>
<fig id="j_infor620_fig_004">
<label>Fig. 4</label>
<caption>
<p>CNN size during pruning.</p>
</caption>
<graphic xlink:href="infor620_g004.jpg"/>
</fig>
<p>The models’ on-disk size measurements are presented in Fig. <xref rid="j_infor620_fig_004">4</xref>. The original floating-point networks occupied 159.734 KB; the smallest models required only 13.174 KB of storage. The quantized networks started at 51.962 KB and were pruned down to 11.898 KB. Thus, even though the pruned quantized models incurred a latency penalty, they consistently maintained a smaller on-disk size compared to their floating-point equivalents.</p>
</sec>
<sec id="j_infor620_s_013">
<label>4.3</label>
<title>Residual Neural Network</title>
<table-wrap id="j_infor620_tab_003">
<label>Table 3</label>
<caption>
<p>ResNet AUROC scores with pruning techniques.</p>
</caption>
<table>
<thead>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: left; border-top: solid thin; border-bottom: solid thin">Model</td>
<td colspan="4" style="vertical-align: top; text-align: center; border-top: solid thin; border-bottom: solid thin">Parameters left</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">100%</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">56.60%</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">25.47%</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">1.77%</td>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: left">M<inline-formula id="j_infor620_ineq_137"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">0.9282(006)</td>
<td style="vertical-align: top; text-align: center"><bold>0.9283</bold>(010)</td>
<td style="vertical-align: top; text-align: center">0.9274(008)</td>
<td style="vertical-align: top; text-align: center">0.9228(011)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">1/<inline-formula id="j_infor620_ineq_138"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">1/<inline-formula id="j_infor620_ineq_139"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">10/<inline-formula id="j_infor620_ineq_140"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: left">M<inline-formula id="j_infor620_ineq_141"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>Q</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{Q}}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">0.9093(101)</td>
<td style="vertical-align: top; text-align: center"><bold>0.9107</bold>(065)</td>
<td style="vertical-align: top; text-align: center">0.9069(101)</td>
<td style="vertical-align: top; text-align: center">0.8936(103)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">1/<inline-formula id="j_infor620_ineq_142"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">1/<inline-formula id="j_infor620_ineq_143"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">1/<inline-formula id="j_infor620_ineq_144"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: left">M<inline-formula id="j_infor620_ineq_145"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">0.9292(004)</td>
<td style="vertical-align: top; text-align: center"><underline><bold>0.9297</bold></underline>(010)</td>
<td style="vertical-align: top; text-align: center">0.9294(009)</td>
<td style="vertical-align: top; text-align: center">0.9257(009)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">10/<inline-formula id="j_infor620_ineq_146"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">10/<inline-formula id="j_infor620_ineq_147"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">10/<inline-formula id="j_infor620_ineq_148"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: left; border-bottom: solid thin">M<inline-formula id="j_infor620_ineq_149"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>QKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{QKD}}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center"><bold>0.9193</bold>(047)</td>
<td style="vertical-align: top; text-align: center">0.9183(039)</td>
<td style="vertical-align: top; text-align: center">0.9122(082)</td>
<td style="vertical-align: top; text-align: center">0.9053(111)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">5/<inline-formula id="j_infor620_ineq_150"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">5/<inline-formula id="j_infor620_ineq_151"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">10/<inline-formula id="j_infor620_ineq_152"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>This section summarizes the AUROC results obtained by the proposed ResNet-based architecture (Table <xref rid="j_infor620_tab_003">3</xref>). Several insights emerge when we compare unpruned baselines, pruned variants, and models trained with knowledge distillation. It is also worth noting that the baseline ResNet contained significantly more parameters than the other two analysed NNs.</p>
<p>First, pruning positively impacted some of the scores. The removal of about half of the parameters yielded slight AUROC improvements for models M<inline-formula id="j_infor620_ineq_153"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula>, M<inline-formula id="j_infor620_ineq_154"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>Q</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{Q}}}$]]></tex-math></alternatives></inline-formula>, M<inline-formula id="j_infor620_ineq_155"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula>. Similar to the RNN, this increase can be attributed to the regularizing effect of pruning, which mitigates overfitting and encourages the network to focus on the most informative feature maps. In contrast, M<inline-formula id="j_infor620_ineq_156"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>QKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{QKD}}}$]]></tex-math></alternatives></inline-formula> experienced a slight drop in AUROC following a comparable pruning. Interestingly, it was observed that M<inline-formula id="j_infor620_ineq_157"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula> pruned in a single round to 56.60% and 25.47% of parameters left performed better than the same model pruned (and fine-tuned) over 5 or 10 rounds, where the additional rounds led to overfitting; this was not observed for M<inline-formula id="j_infor620_ineq_158"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula>. However, M<inline-formula id="j_infor620_ineq_159"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula> and M<inline-formula id="j_infor620_ineq_160"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula> behaved alike under much more aggressive pruning, to 1.77% of parameters left, where a gradual approach with more rounds offered superior results compared to faster pruning in fewer rounds. Moreover, across the experiments, the model variants obtained using knowledge distillation (M<inline-formula id="j_infor620_ineq_161"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>FKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{FKD}}}$]]></tex-math></alternatives></inline-formula> and M<inline-formula id="j_infor620_ineq_162"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>QKD</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{QKD}}}$]]></tex-math></alternatives></inline-formula>) consistently surpassed their non-distilled counterparts. This behaviour demonstrated that transferring label information during training not only enhances generalization but also protects the models from the adverse effects of parameter removal. Finally, it is worth noting that the unpruned baseline M<inline-formula id="j_infor620_ineq_163"><alternatives><mml:math>
<mml:msub>
<mml:mrow/>
<mml:mrow>
<mml:mtext>F</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${_{\text{F}}}$]]></tex-math></alternatives></inline-formula> achieved an AUROC score of 0.9282, closely matching the score of 0.930 reported for the same task using an analogous model (Strodthoff <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_034">2020</xref>). This near parity can be treated as a confirmation of the validity of the proposed implementation.</p>
<fig id="j_infor620_fig_005">
<label>Fig. 5</label>
<caption>
<p>ResNet inference time during pruning.</p>
</caption>
<graphic xlink:href="infor620_g005.jpg"/>
</fig>
<p>Next, Fig. <xref rid="j_infor620_fig_005">5</xref> shows how inference latency for ResNet models changed during the experiments. For the vast majority of pruning levels, the quantized models were noticeably faster than their 32-bit floating-point counterparts; the reduced arithmetic complexity translated directly into lower inference times. However, near the maximum level of pruning, when the remaining parameter count (and thus model size) of the ResNet was roughly on par with that of the analysed baseline CNN (both architectures were convolution-oriented), the trend flipped. In this heavily pruned regime, the quantized ResNet incurred slightly higher latency than the floating-point version, mirroring the pattern observed for the CNN in Fig. <xref rid="j_infor620_fig_003">3</xref>.</p>
<fig id="j_infor620_fig_006">
<label>Fig. 6</label>
<caption>
<p>ResNet size during pruning.</p>
</caption>
<graphic xlink:href="infor620_g006.jpg"/>
</fig>
<p>The floating-point models occupied 2.042902 MB unpruned and 0.065302 MB in their most compact, pruned form. As anticipated, quantization reduced these footprints further, and the quantized models were consistently smaller than their floating-point counterparts. Their size ranged from 0.553758 MB (100% of parameters) to 0.041374 MB (1.77% of parameters left). The trends are visualized in Fig. <xref rid="j_infor620_fig_006">6</xref>.</p>
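<p>As a quick arithmetic check on the sizes reported above, the compression ratios can be computed directly. The following is a minimal sketch: the four sizes are copied from the text, while the dictionary layout and the helper name <monospace>ratio</monospace> are illustrative choices and not part of the original experiments.</p>
<preformat preformat-type="code"><![CDATA[
```python
# Model sizes in MB, copied from the paragraph above (ResNet, Fig. 6).
SIZES_MB = {
    ("fp32", "unpruned"): 2.042902,
    ("fp32", "pruned"): 0.065302,   # 1.77% of parameters left
    ("int8", "unpruned"): 0.553758,
    ("int8", "pruned"): 0.041374,   # 1.77% of parameters left
}


def ratio(numerator_key, denominator_key):
    """Return how many times larger the first model is than the second."""
    return SIZES_MB[numerator_key] / SIZES_MB[denominator_key]


# Effect of quantization alone (ideal for 32-bit to 8-bit weights: 4x).
quant_unpruned = ratio(("fp32", "unpruned"), ("int8", "unpruned"))
quant_pruned = ratio(("fp32", "pruned"), ("int8", "pruned"))

# Effect of pruning alone, and of both techniques combined.
prune_fp32 = ratio(("fp32", "unpruned"), ("fp32", "pruned"))
combined = ratio(("fp32", "unpruned"), ("int8", "pruned"))

print(f"quantization, unpruned: {quant_unpruned:.2f}x")
print(f"quantization, pruned:   {quant_pruned:.2f}x")
print(f"pruning, fp32:          {prune_fp32:.2f}x")
print(f"pruning + quantization: {combined:.2f}x")
```
]]></preformat>
<p>With the reported sizes, the quantization ratio is roughly 3.7 for the unpruned model and roughly 1.6 for the most heavily pruned one, i.e. below the theoretical factor of four in both cases.</p>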
</sec>
</sec>
<sec id="j_infor620_s_014">
<label>5</label>
<title>Discussion</title>
<sec id="j_infor620_s_015">
<label>5.1</label>
<title>Neural Networks</title>
<p>The uncompressed, full-precision versions of each proposed NN achieved classification performance (measured by macro-averaged AUROC) that closely matched or slightly exceeded the benchmark results reported in Strodthoff <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor620_ref_034">2020</xref>). Moreover, these NN types exhibit different computational characteristics, ranging from the sequential processing of RNNs to the shortcut-connected layers of ResNets. Therefore, they were deemed good candidates for the subsequent experiments with the compression techniques, which were the focal point of this work.</p>
</sec>
<sec id="j_infor620_s_016">
<label>5.2</label>
<title>Quantization</title>
<p>In the conducted experiments, in all tested cases but one, quantization reduced the models’ storage footprint by representing the weights with lower-precision numbers, i.e. 8-bit integers instead of 32-bit floating-point parameters. In theory, this kind of quantization should result in a four-fold reduction in on-disk storage requirements, yet in practice, we observed smaller savings. Moreover, for the heavily pruned RNN, quantization was detrimental size-wise, i.e. it resulted in a bigger footprint than the full-precision model. Both the shortfall relative to the four-fold reduction and the RNN exception were traced back to the implementation details of PyTorch and the method used to measure the disk usage (Section <xref rid="j_infor620_s_009">3.6</xref>).</p>
<p>Despite making the models smaller and using a representation that theoretically offers faster computation with integer arithmetic, quantization had a variable effect on inference speed on the tested CPU: it did not always result in faster inference and depended heavily on the model architecture, the NN size, and the libraries used. Quantized RNN models were faster than their full-precision counterparts. However, using the oneDNN library for the floating-point models proved to be much faster and outperformed even the integer variants. Next, quantized CNN models experienced a slowdown at the beginning of the pruning. Although their inference speed kept improving afterwards, they remained slower than the full-precision models. In effect, during pruning, the floating-point CNNs were superior to the quantized models in terms of inference speed. The ResNet architecture, which was the largest network in the experiments, benefited the most from quantization. However, once it was pruned to a size comparable to that of the CNNs, a similar slowdown occurred, making the quantized models slower than their full-precision counterparts. These results underscore that the impact of quantization on speed is highly context-dependent. It is also important to note that the results could differ on other hardware, e.g. without a dedicated floating-point unit, or in other runtime environments, e.g. due to the specifics of quantized kernel implementations.</p>
<p>As expected, replacing 32-bit floating-point weights with quantized integers degraded the predictive quality of the tested models. However, the magnitude of the impact on the AUROC score depended on the NN architecture and size. The largest considered model, ResNet, showed the largest relative drop in score between the non-quantized and quantized models. At the same time, the smallest negative effect was observed for the RNN architecture, which had the smallest number of parameters. For the CNN architecture, the quantization penalty lay between those of the other two architectures. Nonetheless, in all cases, the loss in predictive quality was gradual rather than catastrophic. Potentially, more sophisticated calibration or quantization-aware training could help recover part of the gap.</p>
</sec>
<sec id="j_infor620_s_017">
<label>5.3</label>
<title>Pruning</title>
<p>Structured pruning was naturally beneficial for reducing the models’ disk usage because, in the proposed implementation, the pruned structures were actually removed from the network and not merely masked. As expected, the decrease followed a linear trend as a function of the number of parameters left in the model.</p>
<p>Beyond storage savings, structured pruning yielded improvements in inference latency across all architectures, although the magnitude and pattern of these improvements varied. Overall, floating-point models experienced more substantial speedups than their quantized counterparts, but the exact impact differed between the tested architectures. Across floating-point and 8-bit quantized RNNs (both with and without access to the oneDNN library), pruning reduced inference time, and the models followed a similar pattern. It should be noted that the quantized models were faster only than the floating-point models that did not use the oneDNN implementation. For the CNN architecture, the quantized models were faster only at the very beginning of the experiments. Moreover, the latency of the floating-point models decreased more steadily than that of the quantized ones. As the models became smaller, the more erratic latency curve of the quantized models diverged even further from that of the floating-point models, underlining the runtime overheads of the quantized kernel implementations. For the largest tested architecture, ResNet, the quantized models remained faster than the floating-point models during pruning until about 10% of the parameters were left. Beyond that point, the behaviour resembled the one described for the CNN architecture.</p>
<p>Additionally, an interesting side finding emerged when we examined pruning patterns whose remaining structures aligned with multiples of eight. NNs in which most structure counts were divisible by eight were faster than their slightly bigger or smaller versions, e.g. a layer with 32 output channels executed faster than a layer with 33 or 31 channels. This “hardware sweet spot” pointed to the phenomenon of a suboptimal configuration (Section <xref rid="j_infor620_s_007">3.4</xref>) and the need to carefully consider the target hardware and runtime implementation when choosing the pruning pattern to obtain the lowest latency.</p>
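<p>One way to act on this observation is to snap the number of channels kept in each layer to a multiple of eight. The sketch below is an assumption made for illustration only and is not the procedure used in the experiments; the function names <monospace>snap_to_multiple</monospace> and <monospace>aligned_pruning_plan</monospace>, the layer widths, and the keep fraction are all hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
def snap_to_multiple(channels, multiple=8):
    """Round a channel count to the nearest positive multiple of `multiple`.

    Ties are rounded up, and the result is never smaller than `multiple`.
    """
    if channels < 1 or multiple < 1:
        raise ValueError("channels and multiple must be positive integers")
    snapped = ((channels + multiple // 2) // multiple) * multiple
    return max(multiple, snapped)


def aligned_pruning_plan(layer_channels, keep_fraction, multiple=8):
    """Return per-layer channel counts after pruning, aligned to `multiple`.

    `layer_channels` holds the original output-channel count of each layer and
    `keep_fraction` is the share of channels to keep (0 < keep_fraction <= 1).
    A layer never ends up with more channels than it started with.
    """
    plan = []
    for original in layer_channels:
        target = max(1, round(original * keep_fraction))
        plan.append(min(original, snap_to_multiple(target, multiple)))
    return plan


# Hypothetical layer widths, for illustration only.
plan = aligned_pruning_plan([64, 128, 256], keep_fraction=0.26)
print(plan)
```
]]></preformat>
<p>For the hypothetical widths above, the plan keeps 16, 32, and 64 channels instead of the unaligned targets of 17, 33, and 67. Whether eight is the right multiple depends on the target hardware and runtime, so the value should be treated as a tunable parameter.</p>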
<p>Surprisingly, pruning could improve the results obtained by the models, which was observed for the RNN and the ResNet architectures. This behaviour could be attributed to the regularizing effect of pruning. It was not observed for the CNN, plausibly because that model was already regularized and quasi-optimally small, so further regularization had only adverse effects.</p>
<p>Finally, no significant differences were noted during the experiments between the <inline-formula id="j_infor620_ineq_164"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$L1$]]></tex-math></alternatives></inline-formula> and the <inline-formula id="j_infor620_ineq_165"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$L2$]]></tex-math></alternatives></inline-formula> methods used to choose the structures to be pruned. These observations were similar to the results reported in Li <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor620_ref_025">2017</xref>). Instead, the number of pruning rounds, each followed by fine-tuning, turned out to be a more important parameter. However, determining the exact impact of the number of pruning rounds remains an important direction for future work.</p>
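<p>To illustrate how norm-based selection of structures works in general, the following minimal sketch ranks flattened filters by their L1 or L2 norm and returns the indices with the smallest norms. It is a generic illustration under stated assumptions, not the implementation used in the experiments; the helper names and the toy filter values are invented for this example.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def l1_norm(weights):
    """Sum of absolute values of a filter's weights."""
    return sum(abs(w) for w in weights)


def l2_norm(weights):
    """Euclidean norm of a filter's weights."""
    return math.sqrt(sum(w * w for w in weights))


def filters_to_prune(filters, amount, norm):
    """Return indices of the `amount` filters with the smallest norm.

    `filters` is a list of flattened filters (each a list of floats).
    Ties are broken by the lower index, because `sorted` is stable.
    """
    ranked = sorted(range(len(filters)), key=lambda i: norm(filters[i]))
    return sorted(ranked[:amount])


# Toy filters, invented for illustration only.
toy_filters = [
    [0.9, -0.8, 0.7],     # large weights
    [0.1, -0.1, 0.05],    # small weights
    [0.6, 0.0, 0.0],      # one dominant weight
    [0.3, 0.3, 0.3],      # evenly spread weights
    [0.02, 0.01, -0.03],  # near-zero weights
]

by_l1 = filters_to_prune(toy_filters, amount=2, norm=l1_norm)
by_l2 = filters_to_prune(toy_filters, amount=2, norm=l2_norm)
by_l1_three = filters_to_prune(toy_filters, amount=3, norm=l1_norm)
by_l2_three = filters_to_prune(toy_filters, amount=3, norm=l2_norm)
print(by_l1, by_l2)
print(by_l1_three, by_l2_three)
```
]]></preformat>
<p>On this toy data, both criteria select the same two filters, and they differ only on the third, where one filter has a single dominant weight and the other has evenly spread weights. This kind of overlap is one plausible reason why the two criteria can lead to similar pruned models.</p>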
</sec>
<sec id="j_infor620_s_018">
<label>5.4</label>
<title>Knowledge Distillation</title>
<p>The knowledge distillation scheme utilized in the experiments had no direct impact on the model size or its inference latency. Instead, its primary benefit was improving the predictive quality of the base and pruned models (and therefore the quantized models), with only a few exceptions. In practical terms, this means that if a predefined quality acceptance threshold is present (e.g. a maximum allowable drop in accuracy), the use of knowledge distillation makes it possible to achieve more aggressive compression ratios (e.g. by applying pruning).</p>
<p>Our study deliberately focused on small networks of moderate depth. While response-based distillation, where the student mimics the output of the teacher, has proved to be effective in this context, prior work suggests that it may be less suitable for very deep NNs (Gou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor620_ref_013">2021</xref>). In such cases, one might need to resort to other distillation schemes, e.g. matching activations from intermediate layers. Because the experiments did not include very deep models, a thorough investigation of alternative distillation paradigms remains beyond the scope of this work.</p>
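<p>For readers unfamiliar with response-based distillation, the sketch below shows one common way to combine a hard-label term with a teacher-matching term for multi-label outputs. It is a minimal illustration and not necessarily the loss used in this study: the per-label sigmoid, the blending weight <monospace>alpha</monospace>, the <monospace>temperature</monospace> parameter, and all function names are assumptions.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def sigmoid(x):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(prob, target, eps=1e-12):
    """Cross-entropy between a predicted probability and a target in [0, 1]."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=1.0):
    """Blend a hard-label loss with a soft-target loss, averaged over labels.

    `alpha` weights the teacher term (0 means labels only). `temperature`
    softens both sets of logits before they are compared. Both parameters
    are assumptions of this sketch.
    """
    hard = 0.0
    soft = 0.0
    for s, t, y in zip(student_logits, teacher_logits, labels):
        hard += binary_cross_entropy(sigmoid(s), y)
        soft += binary_cross_entropy(sigmoid(s / temperature),
                                     sigmoid(t / temperature))
    return ((1.0 - alpha) * hard + alpha * soft) / len(labels)


# Toy logits and labels, invented for illustration only.
teacher = [2.0, -1.0, 0.5]
labels = [1, 0, 1]
loss_close = distillation_loss(teacher, teacher, labels)
loss_far = distillation_loss([-2.0, 1.0, -0.5], teacher, labels)
labels_only = distillation_loss(teacher, teacher, labels, alpha=0.0)
print(loss_close, loss_far, labels_only)
```
]]></preformat>
<p>A student whose outputs agree with the teacher receives a lower loss than one whose outputs contradict it, and setting <monospace>alpha</monospace> to zero recovers ordinary training on the labels alone.</p>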
</sec>
<sec id="j_infor620_s_019">
<label>5.5</label>
<title>Limitations</title>
<p>Although this study covers various aspects of neural compression methods, its shortcomings must be acknowledged. One significant limitation is the use of a single dataset, PTB-XL, which might bias the findings through the specifics of the data, such as the demographics, the acquisition methods, or the representation of conditions. Another constraint arises from the runtime environment used in the experiments. The inference latency results were obtained using a standard CPU, which may be too demanding for environments with strictly limited computing resources. Additionally, the experiments were performed using the full PyTorch runtime, which comes with requirements that may be unrealistic for tiny devices; for instance, it requires an operating system and involves dynamic memory allocation. Other machine learning runtimes can be better suited for small computing devices, e.g. LiteRT (TensorFlow Lite) for Microcontrollers.<xref ref-type="fn" rid="j_infor620_fn_003">3</xref><fn id="j_infor620_fn_003"><label><sup>3</sup></label>
<p><uri>https://ai.google.dev/edge/litert/microcontrollers</uri></p></fn> While these choices helped provide a clear and concise environment for experimentation, they might limit the representativeness of the results. Nevertheless, the obtained results provide a valuable proof of concept and motivation for further experimentation with methods aimed at optimizing and deploying NNs on strictly resource-constrained devices.</p>
</sec>
</sec>
<sec id="j_infor620_s_020">
<label>6</label>
<title>Conclusion</title>
<p>In this study, three neural network compression techniques – quantization, structured pruning, and knowledge distillation – were applied to three different NN architectures used for ECG signal analysis. These NN compression methods were chosen because they address challenges associated with the use of NNs on devices with strict resource constraints, such as limited storage space or low computational capability.</p>
<p>The results indicated that model compression is a multifaceted problem in which the choice of techniques, software stack, and underlying hardware all play an important role. Quantization delivered smaller models, but the actual reduction was lower than the theoretical maximum. Its impact on inference speed was mixed, underscoring that low-bit representations alone are not always sufficient to guarantee practical efficiency gains. Pruning consistently reduced the model footprint and, in most cases, improved inference speed. However, it also revealed a strong dependence on hardware characteristics and pruning patterns. In fact, a poorly chosen pruning pattern could noticeably slow down a model. Knowledge distillation complemented both methods by improving the quality of the floating-point models and, consequently, the quality of their quantized versions.</p>
<p>Overall, neural network compression for resource-constrained devices is not a simple “plug-and-play” solution but rather a design space where effective strategies must be planned with awareness of the target runtime platform.</p>
<p>Finally, the compression techniques used in this study are not limited to ECG signal analysis. They are broadly applicable in other areas, such as autonomous systems, computational photography, or smart homes. However, their impact on model quality and resource consumption may vary in different domains due to factors such as the type of data, task complexity, inference requirements, or hardware architecture. Therefore, careful adaptation of the methods is crucial to ensure that they achieve their full potential when utilized in other application scenarios.</p>
</sec>
</body>
<back>
<ref-list id="j_infor620_reflist_001">
<title>References</title>
<ref id="j_infor620_ref_001">
<mixed-citation publication-type="journal"><string-name><surname>Berkaya</surname>, <given-names>S.K.</given-names></string-name>, <string-name><surname>Uysal</surname>, <given-names>A.K.</given-names></string-name>, <string-name><surname>Gunal</surname>, <given-names>E.S.</given-names></string-name>, <string-name><surname>Ergin</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Gunal</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Gulmezoglu</surname>, <given-names>M.B.</given-names></string-name> (<year>2018</year>). <article-title>A survey on ECG analysis</article-title>. <source>Biomedical Signal Processing and Control</source>, <volume>43</volume>, <fpage>216</fpage>–<lpage>235</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_002">
<mixed-citation publication-type="chapter"><string-name><surname>Boulif</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Ananou</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Ouladsine</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Delliaux</surname>, <given-names>S.</given-names></string-name> (<year>2024</year>). <chapter-title>Focal-based deep learning model for automatic arrhythmia diagnosis</chapter-title>. In: <source>International Conference on Computational Science</source>. <publisher-name>Springer</publisher-name>, pp. <fpage>355</fpage>–<lpage>370</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_003">
<mixed-citation publication-type="journal"><string-name><surname>Butterworth</surname>, <given-names>S.</given-names></string-name> (<year>1930</year>). <article-title>On the theory of filter amplifiers</article-title>. <source>Wireless Engineer</source>, <volume>7</volume>(<issue>6</issue>), <fpage>536</fpage>–<lpage>541</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_004">
<mixed-citation publication-type="chapter"><string-name><surname>Chang</surname>, <given-names>X.Q.</given-names></string-name>, <string-name><surname>Chew</surname>, <given-names>A.F.</given-names></string-name>, <string-name><surname>Choong</surname>, <given-names>B.C.M.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Han</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>He</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Xiaolin</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Panicker</surname>, <given-names>R.C.</given-names></string-name>, <string-name><surname>John</surname>, <given-names>D.</given-names></string-name> (<year>2022</year>). <chapter-title>Atrial fibrillation detection using weight-pruned, log-quantised convolutional neural networks</chapter-title>. In: <source>2022 IEEE 13th Latin America Symposium on Circuits and System (LASCAS)</source>. <publisher-name>IEEE</publisher-name>, pp. <fpage>1</fpage>–<lpage>4</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_005">
<mixed-citation publication-type="journal"><string-name><surname>Cheng</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Shi</surname>, <given-names>J.Q.</given-names></string-name> (<year>2024</year>). <article-title>A survey on deep neural network pruning: taxonomy, comparison, analysis, and recommendations</article-title>. <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, <volume>46</volume>(<issue>12</issue>), <fpage>10558</fpage>–<lpage>10578</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_006">
<mixed-citation publication-type="journal"><string-name><surname>Dantas</surname>, <given-names>P.V.</given-names></string-name>, <string-name><surname>Da Silva</surname>, <given-names>W.S.</given-names></string-name>, <string-name><surname>Cordeiro</surname>, <given-names>L.C.</given-names></string-name>, <string-name><surname>Carvalho</surname>, <given-names>C.B.</given-names></string-name> (<year>2024</year>). <article-title>A comprehensive review of model compression techniques in machine learning</article-title>. <source>Applied Intelligence</source>, <volume>54</volume>(<issue>22</issue>), <fpage>11804</fpage>–<lpage>11844</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_007">
<mixed-citation publication-type="chapter"><string-name><surname>Dong</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Gao</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>Wawrzynek</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>So</surname>, <given-names>H.K.</given-names></string-name>, <string-name><surname>Keutzer</surname>, <given-names>K.</given-names></string-name> (<year>2021</year>). <chapter-title>Hao: Hardware-aware neural architecture optimization for efficient inference</chapter-title>. In: <source>2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)</source>. <publisher-name>IEEE</publisher-name>, pp. <fpage>50</fpage>–<lpage>59</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_008">
<mixed-citation publication-type="journal"><string-name><surname>Elman</surname>, <given-names>J.L.</given-names></string-name> (<year>1990</year>). <article-title>Finding structure in time</article-title>. <source>Cognitive Science</source>, <volume>14</volume>(<issue>2</issue>), <fpage>179</fpage>–<lpage>211</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_009">
<mixed-citation publication-type="journal"><string-name><surname>Fawcett</surname>, <given-names>T.</given-names></string-name> (<year>2006</year>). <article-title>An introduction to ROC analysis</article-title>. <source>Pattern Recognition Letters</source>, <volume>27</volume>(<issue>8</issue>), <fpage>861</fpage>–<lpage>874</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_010">
<mixed-citation publication-type="chapter"><string-name><surname>Fernández</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>García</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Galar</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Prati</surname>, <given-names>R.C.</given-names></string-name>, <string-name><surname>Krawczyk</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Herrera</surname>, <given-names>F.</given-names></string-name> (<year>2018</year>). <chapter-title>Performance measures</chapter-title>. In: <source>Learning from Imbalanced Data Sets</source>. <publisher-name>Springer International Publishing</publisher-name>, pp. <fpage>47</fpage>–<lpage>61</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_011">
<mixed-citation publication-type="chapter"><string-name><surname>Gholami</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Kim</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Dong</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Yao</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Mahoney</surname>, <given-names>M.W.</given-names></string-name>, <string-name><surname>Keutzer</surname>, <given-names>K.</given-names></string-name> (<year>2022</year>). <chapter-title>A survey of quantization methods for efficient neural network inference</chapter-title>. In: <source>Low-Power Computer Vision</source>. <publisher-name>Chapman and Hall/CRC</publisher-name>, pp. <fpage>291</fpage>–<lpage>326</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_012">
<mixed-citation publication-type="book"><string-name><surname>Goodfellow</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Courville</surname>, <given-names>A.</given-names></string-name> (<year>2016</year>). <source>Deep Learning</source>. <publisher-name>MIT Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_013">
<mixed-citation publication-type="journal"><string-name><surname>Gou</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Yu</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Maybank</surname>, <given-names>S.J.</given-names></string-name>, <string-name><surname>Tao</surname>, <given-names>D.</given-names></string-name> (<year>2021</year>). <article-title>Knowledge distillation: a survey</article-title>. <source>International Journal of Computer Vision</source>, <volume>129</volume>(<issue>6</issue>), <fpage>1789</fpage>–<lpage>1819</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_014">
<mixed-citation publication-type="chapter"><string-name><surname>He</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Ren</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Sun</surname>, <given-names>J.</given-names></string-name> (<year>2016</year>). <chapter-title>Deep residual learning for image recognition</chapter-title>. In: <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, pp. <fpage>770</fpage>–<lpage>778</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_015">
<mixed-citation publication-type="other"><string-name><surname>Hinton</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Vinyals</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Dean</surname>, <given-names>J.</given-names></string-name> (2015). <italic>Distilling the Knowledge in a Neural Network</italic>. arXiv:<ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1503.02531">1503.02531</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_016">
<mixed-citation publication-type="journal"><string-name><surname>Hochreiter</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Schmidhuber</surname>, <given-names>J.</given-names></string-name> (<year>1997</year>). <article-title>Long short-term memory</article-title>. <source>Neural Computation</source>, <volume>9</volume>(<issue>8</issue>), <fpage>1735</fpage>–<lpage>1780</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_017">
<mixed-citation publication-type="chapter"><string-name><surname>Hohman</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Kery</surname>, <given-names>M.B.</given-names></string-name>, <string-name><surname>Ren</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Moritz</surname>, <given-names>D.</given-names></string-name> (<year>2024</year>). <chapter-title>Model compression in practice: lessons learned from practitioners creating on-device machine learning experiences</chapter-title>. In: <source>Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems</source>, pp. <fpage>1</fpage>–<lpage>18</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_018">
<mixed-citation publication-type="chapter"><string-name><surname>Hubara</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Courbariaux</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Soudry</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>El-Yaniv</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name> (<year>2016</year>). <chapter-title>Binarized neural networks</chapter-title>. In: <source>NIPS’16: Proceedings of the 30th International Conference on Neural Information Processing Systems</source>, pp. <fpage>4114</fpage>–<lpage>4122</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_019">
<mixed-citation publication-type="journal"><string-name><surname>Khan</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Yu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Yuan</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Rehman</surname>, <given-names>A.U.</given-names></string-name> (<year>2023</year>). <article-title>ECG classification using 1-D convolutional deep residual neural network</article-title>. <source>PLOS One</source>, <volume>18</volume>(<issue>4</issue>), <fpage>e0284791</fpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_020">
<mixed-citation publication-type="journal"><string-name><surname>Khan Mamun</surname>, <given-names>M.M.R.</given-names></string-name>, <string-name><surname>Elfouly</surname>, <given-names>T.</given-names></string-name> (<year>2023</year>). <article-title>AI-enabled electrocardiogram analysis for disease diagnosis</article-title>. <source>Applied System Innovation</source>, <volume>6</volume>(<issue>5</issue>), <fpage>95</fpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_021">
<mixed-citation publication-type="journal"><string-name><surname>Kullback</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Leibler</surname>, <given-names>R.A.</given-names></string-name> (<year>1951</year>). <article-title>On information and sufficiency</article-title>. <source>The Annals of Mathematical Statistics</source>, <volume>22</volume>(<issue>1</issue>), <fpage>79</fpage>–<lpage>86</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_022">
<mixed-citation publication-type="journal"><string-name><surname>LeCun</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Bottou</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Haffner</surname>, <given-names>P.</given-names></string-name> (<year>1998</year>). <article-title>Gradient-based learning applied to document recognition</article-title>. <source>Proceedings of the IEEE</source>, <volume>86</volume>(<issue>11</issue>), <fpage>2278</fpage>–<lpage>2324</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_023">
<mixed-citation publication-type="journal"><string-name><surname>Lee</surname>, <given-names>K.-S.</given-names></string-name>, <string-name><surname>Park</surname>, <given-names>H.-J.</given-names></string-name>, <string-name><surname>Kim</surname>, <given-names>J.E.</given-names></string-name>, <string-name><surname>Kim</surname>, <given-names>H.J.</given-names></string-name>, <string-name><surname>Chon</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Kim</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Jang</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Kim</surname>, <given-names>J.-K.</given-names></string-name>, <string-name><surname>Jang</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Gil</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Son</surname>, <given-names>H.S.</given-names></string-name> (<year>2022</year>). <article-title>Compressed deep learning to classify arrhythmia in an embedded wearable device</article-title>. <source>Sensors</source>, <volume>22</volume>(<issue>5</issue>), <fpage>1776</fpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_024">
<mixed-citation publication-type="journal"><string-name><surname>Li</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Y.</given-names></string-name> (<year>2024</year>). <article-title>A review of IoT applications in healthcare</article-title>. <source>Neurocomputing</source>, <volume>565</volume>, <elocation-id>127017</elocation-id>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_025">
<mixed-citation publication-type="chapter"><string-name><surname>Li</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Kadav</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Durdanovic</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Samet</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Graf</surname>, <given-names>H.P.</given-names></string-name> (<year>2017</year>). <chapter-title>Pruning filters for efficient ConvNets</chapter-title>. In: <source>International Conference on Learning Representations</source>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_026">
<mixed-citation publication-type="other"><string-name><surname>Li</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Alvarez</surname>, <given-names>R.</given-names></string-name> (<year>2021</year>). <italic>On the Quantization of Recurrent Neural Networks</italic>. arXiv:<ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2101.05453">2101.05453</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_027">
<mixed-citation publication-type="journal"><string-name><surname>Li</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Meng</surname>, <given-names>L.</given-names></string-name> (<year>2023</year>). <article-title>Model compression for deep neural networks: a survey</article-title>. <source>Computers</source>, <volume>12</volume>(<issue>3</issue>), <fpage>60</fpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_028">
<mixed-citation publication-type="chapter"><string-name><surname>Liberis</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Dudziak</surname>, <given-names>Ł.</given-names></string-name>, <string-name><surname>Lane</surname>, <given-names>N.D.</given-names></string-name> (<year>2021</year>). <chapter-title><italic>μ</italic>NAS: constrained neural architecture search for microcontrollers</chapter-title>. In: <source>Proceedings of the 1st Workshop on Machine Learning and Systems</source>, pp. <fpage>70</fpage>–<lpage>79</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_029">
<mixed-citation publication-type="journal"><string-name><surname>Merdjanovska</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Rashkovska</surname>, <given-names>A.</given-names></string-name> (<year>2022</year>). <article-title>Comprehensive survey of computational ECG analysis: databases, methods and applications</article-title>. <source>Expert Systems with Applications</source>, <volume>203</volume>, <elocation-id>117206</elocation-id>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_030">
<mixed-citation publication-type="journal"><string-name><surname>Mirvis</surname>, <given-names>D.M.</given-names></string-name>, <string-name><surname>Goldberger</surname>, <given-names>A.L.</given-names></string-name> (<year>2001</year>). <article-title>Electrocardiography</article-title>. <source>Heart Disease</source>, <volume>1</volume>, <fpage>82</fpage>–<lpage>128</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_031">
<mixed-citation publication-type="journal"><string-name><surname>Safdar</surname>, <given-names>M.F.</given-names></string-name>, <string-name><surname>Nowak</surname>, <given-names>R.M.</given-names></string-name>, <string-name><surname>Pałka</surname>, <given-names>P.</given-names></string-name> (<year>2024</year>). <article-title>Pre-processing techniques and artificial intelligence algorithms for electrocardiogram (ECG) signals analysis: a comprehensive review</article-title>. <source>Computers in Biology and Medicine</source>, <volume>170</volume>, <elocation-id>107908</elocation-id>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_032">
<mixed-citation publication-type="chapter"><string-name><surname>Selvan</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Schön</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Dam</surname>, <given-names>E.B.</given-names></string-name> (<year>2023</year>). <chapter-title>Operating critical machine learning models in resource constrained regimes</chapter-title>. In: <source>International Conference on Medical Image Computing and Computer-Assisted Intervention</source>. <publisher-name>Springer</publisher-name>, pp. <fpage>325</fpage>–<lpage>335</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_033">
<mixed-citation publication-type="journal"><string-name><surname>Sepahvand</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Abdali-Mohammadi</surname>, <given-names>F.</given-names></string-name> (<year>2022</year>). <article-title>A novel method for reducing arrhythmia classification from 12-lead ECG signals to single-lead ECG with minimal loss of accuracy through teacher-student knowledge distillation</article-title>. <source>Information Sciences</source>, <volume>593</volume>, <fpage>64</fpage>–<lpage>77</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_034">
<mixed-citation publication-type="journal"><string-name><surname>Strodthoff</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Wagner</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Schaeffter</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Samek</surname>, <given-names>W.</given-names></string-name> (<year>2020</year>). <article-title>Deep learning for ECG analysis: benchmarks and insights from PTB-XL</article-title>. <source>IEEE Journal of Biomedical and Health Informatics</source>, <volume>25</volume>(<issue>5</issue>), <fpage>1519</fpage>–<lpage>1528</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_035">
<mixed-citation publication-type="journal"><string-name><surname>Wagner</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Strodthoff</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Bousseljot</surname>, <given-names>R.-D.</given-names></string-name>, <string-name><surname>Kreiseler</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Lunze</surname>, <given-names>F.I.</given-names></string-name>, <string-name><surname>Samek</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Schaeffter</surname>, <given-names>T.</given-names></string-name> (<year>2020</year>). <article-title>PTB-XL, a large publicly available electrocardiography dataset</article-title>. <source>Scientific Data</source>, <volume>7</volume>, <elocation-id>154</elocation-id>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_036">
<mixed-citation publication-type="chapter"><string-name><surname>Wang</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Yan</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Oates</surname>, <given-names>T.</given-names></string-name> (<year>2017</year>). <chapter-title>Time series classification from scratch with deep neural networks: a strong baseline</chapter-title>. In: <source>2017 International Joint Conference on Neural Networks (IJCNN)</source>. <publisher-name>IEEE</publisher-name>, pp. <fpage>1578</fpage>–<lpage>1585</lpage>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_037">
<mixed-citation publication-type="other"><string-name><surname>Wu</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Judd</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Isaev</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Micikevicius</surname>, <given-names>P.</given-names></string-name> (<year>2020</year>). <italic>Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation</italic>. arXiv:<ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2004.09602">2004.09602</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor620_ref_038">
<mixed-citation publication-type="chapter"><string-name><surname>Yang</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Xie</surname>, <given-names>M.-K.</given-names></string-name>, <string-name><surname>Zong</surname>, <given-names>C.-C.</given-names></string-name>, <string-name><surname>Feng</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Niu</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Sugiyama</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>S.-J.</given-names></string-name> (<year>2023</year>). <chapter-title>Multi-label knowledge distillation</chapter-title>. In: <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source>, pp. <fpage>17271</fpage>–<lpage>17280</lpage>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
