<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">INFORMATICA</journal-id>
<journal-title-group><journal-title>Informatica</journal-title></journal-title-group>
<issn pub-type="epub">1822-8844</issn><issn pub-type="ppub">0868-4952</issn><issn-l>0868-4952</issn-l>
<publisher>
<publisher-name>Vilnius University</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">INFOR473</article-id>
<article-id pub-id-type="doi">10.15388/22-INFOR473</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Research Article</subject></subj-group></article-categories>
<title-group>
<article-title>Approach for Multi-Label Text Data Class Verification and Adjustment Based on Self-Organizing Map and Latent Semantic Analysis</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Stefanovič</surname><given-names>Pavel</given-names></name><email xlink:href="pavel.stefanovic@vilniustech.lt">pavel.stefanovic@vilniustech.lt</email><xref ref-type="aff" rid="j_infor473_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref><bio>
<p><bold>P. Stefanovič</bold> received a PhD degree in computer science from the Institute of Mathematics and Informatics, Vilnius University, Lithuania, in 2015. He is currently employed as a researcher and associate professor at the Faculty of Fundamental Sciences, Vilnius Gediminas Technical University. His research interests include data mining methods, natural language pre-processing, machine learning methods, visualization of multidimensional data, data clustering methods. He is the author of 12 publications.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Kurasova</surname><given-names>Olga</given-names></name><email xlink:href="olga.kurasova@mif.vu.lt">olga.kurasova@mif.vu.lt</email><xref ref-type="aff" rid="j_infor473_aff_002">2</xref><bio>
<p><bold>O. Kurasova</bold> received a PhD degree in computer science from the Institute of Mathematics and Informatics, Vytautas Magnus University, Lithuania, in 2005. She is currently employed as a principal researcher and a professor at the Institute of Data Science and Digital Technologies, Vilnius University. Her research interests include data mining methods, optimization theory and applications, artificial intelligence, neural networks, visualization of multidimensional data, multiple criteria decision support, parallel computing, and image processing. She is the author of more than 80 scientific publications.</p></bio>
</contrib>
<aff id="j_infor473_aff_001"><label>1</label>Department of Information Systems, Faculty of Fundamental Sciences, <institution>Vilnius Gediminas Technical University</institution>, Saulėtekio al. 11, LT-10223, Vilnius, <country>Lithuania</country></aff>
<aff id="j_infor473_aff_002"><label>2</label>Institute of Data Science and Digital Technologies, <institution>Vilnius University</institution>, Akademijos str. 4, LT-08412, Vilnius, <country>Lithuania</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2022</year></pub-date><pub-date pub-type="epub"><day>10</day><month>1</month><year>2022</year></pub-date><volume>33</volume><issue>1</issue><fpage>109</fpage><lpage>130</lpage><history><date date-type="received"><month>6</month><year>2021</year></date><date date-type="accepted"><month>1</month><year>2022</year></date></history>
<permissions><copyright-statement>© 2022 Vilnius University</copyright-statement><copyright-year>2022</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>In this paper, a new approach has been proposed for multi-label text data class verification and adjustment. The approach helps to make semi-automated revisions of class assignments to improve the quality of the data. The data quality significantly influences the accuracy of the created models, for example, in classification tasks, and it can also be useful for other data analysis tasks. The proposed approach is based on the combination of a text similarity measure and two methods: latent semantic analysis and self-organizing map. First, the text data must be pre-processed by selecting various filters to clean the data of unnecessary and irrelevant information. Latent semantic analysis has been selected to reduce the dimensionality of the vectors that correspond to each text in the analysed data. The cosine similarity distance has been used to determine which of the multi-label text data classes should be changed or adjusted. The self-organizing map has been selected as the key method to detect similarity between text data and to make decisions on a new class assignment. The experimental investigation has been performed using newly collected multi-label text data. Financial news data in the Lithuanian language have been collected from four public websites and manually classified by experts into ten classes. Various parameters of the methods have been analysed, and their influence on the final results has been estimated. The final results are validated by experts. The research proved that the proposed approach can be helpful for verifying and adjusting multi-label text data classes. 82% of the assignments are correct when the data dimensionality is reduced to 40 using latent semantic analysis and the self-organizing map size is decreased from 40 to 5 in steps of 5.</p>
</abstract>
<kwd-group>
<label>Key words</label>
<kwd>multi-label text data</kwd>
<kwd>clustering</kwd>
<kwd>self-organizing map</kwd>
<kwd>latent semantic analysis</kwd>
<kwd>Lithuanian language</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="j_infor473_s_001">
<label>1</label>
<title>Introduction</title>
<p>Nowadays, the amount of information is growing at a very high rate, and systems store it in various formats. Most of the collected data are unstructured, which leads to various problems in preparing, processing, and analysing such data. One of the unstructured data types is text. There are many different tasks where text analysis is used, but it is usually applied in text data classification and clustering, semantic analysis, context analysis, etc. Many different classification algorithms are suitable for text data analysis, ranging from traditional classification algorithms (Joulin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor473_ref_013">2016</xref>) like decision trees, multinomial Naive Bayes (MNB), and support vector machines (SVM), to deep learning algorithms (Minaee <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor473_ref_023">2021</xref>) such as long short-term memory (LSTM) networks, convolutional neural networks, and the newest method, transformers (Khan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor473_ref_015">2021</xref>). Sentiment analysis is a branch of classification tasks in which text data are classified according to sentiment, usually positive, negative, or neutral. It is often applied in social network analysis, movie review and comment analysis, etc. (Bhuiyan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor473_ref_004">2017</xref>; Kharlamov <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor473_ref_016">2019</xref>). When performing text data clustering, it is possible to discover relationships or similarities between different texts, using clustering algorithms such as k-means, hierarchical clustering, and other semi-supervised clustering methods (Aggarwal and Zhai, <xref ref-type="bibr" rid="j_infor473_ref_001">2012</xref>). Context analysis is the highest level of text data analysis, in which the text data are not only categorized or classified, but the meaning of the text is also taken into account (Hernández-Alvarez and Gomez, <xref ref-type="bibr" rid="j_infor473_ref_010">2016</xref>).</p>
<p>In solving any task, the process starts with data selection. If the data are not correctly prepared or pre-processed, or other mistakes are involved, there is a high risk that the model will work improperly. Therefore, the data must be well prepared, i.e. the classes of the analysed data must be correctly assigned, the class assignments must not depend on which expert performed the labelling, and the classes must be unequivocally correct. Usually, when new data are collected, researchers, experts of a specific field, or other persons need to assign classes manually, so errors or inaccuracies can be introduced by mistake. There is always a possibility of a human mistake that later influences the model results. When text data are analysed, a data item can be assigned not only to one class but to several classes; for example, the number of classes can depend on the text length. Usually, a dominant topic or context of the analysed text is considered as its class. In this case, multi-label text data are obtained, so the analysis becomes more complex. There are various studies in which multi-label text data have been analysed using different techniques, but class adjustment and verification are not considered (Nanculef <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor473_ref_024">2014</xref>; Park and Lee, <xref ref-type="bibr" rid="j_infor473_ref_025">2008</xref>). The main problem of multi-label text data class verification and adjustment is deciding which class of a data item should be changed and which class should be assigned instead. The process in which a multi-label text data class is changed to another class is called a class adjustment.</p>
<p>In this paper, a new approach has been proposed based on latent semantic analysis (LSA) and self-organizing maps (SOM). The newly collected data from four leading financial news websites in Lithuania have been experimentally analysed (LFND, <xref ref-type="bibr" rid="j_infor473_ref_020">2021</xref>). Each data item is assigned to one or two classes at the same time. The collected data will be used in the future to train a machine learning model that will be able to assign the class for new input data. The obtained class will later be used to extract the full context of the text data. In one step of our proposed approach, SOM has been used. SOM has a problem dealing with high-dimensional data, so the dimensionality should be reduced. There are many dimensionality reduction methods (Blum <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor473_ref_006">2013</xref>), such as principal component analysis, multidimensional scaling, manifold learning, etc., but none of them take the context of the text data into account when the dimensionality is reduced. Thus, in our proposed approach, latent semantic analysis has been used to reduce the dimensionality of the data, as it is well suited to text data analysis. In some steps of the approach, we used the cosine similarity distance to calculate similarities between text data. Various parameters of the approach have been analysed to determine their influence on the final results. The validation of the results has been performed by experts, who reviewed all newly assigned or adjusted classes.</p>
<p>The novelty of the proposed approach is that it enables semi-automated adjustment of multi-label text data classes, which could lead to higher accuracy in classification tasks. The data quality significantly influences the final results of classification models, so it is always important to improve it. When analysing text data, it is difficult to determine unambiguously to which class a text data item belongs. In practice, text data can often be assigned to more than one class, so it becomes a difficult task even for highly qualified experts to decide which class is more important and should be chosen. In this research, the notion of data quality is related to the problem of detecting multi-label text data classes incorrectly assigned by researchers or experts in the data labelling process. Our proposed approach can be an assistant to experts and researchers who analyse newly collected data.</p>
<p>The paper is organized as follows. In Section <xref rid="j_infor473_s_002">2</xref>, the related works are reviewed. In Section <xref rid="j_infor473_s_003">3</xref>, all parts of the proposed approach are described. The data description and experimental investigation are given in Section <xref rid="j_infor473_s_008">4</xref>. Section <xref rid="j_infor473_s_012">5</xref> concludes the paper.</p>
</sec>
<sec id="j_infor473_s_002">
<label>2</label>
<title>Related Works</title>
<p>The performed literature analysis has shown that there are no well-known and widely used methods for multi-label text data adjustment. Usually, when analysing such data, studies focus on solving classification and clustering problems. In the research of Ueda and Saito (<xref ref-type="bibr" rid="j_infor473_ref_032">2003</xref>), a probabilistic generative model has been proposed to solve multi-class and multi-label text data clustering problems. In multi-class tasks, each data item is assigned to exactly one of the possible classes, while in the multi-label case, a data item can have more than one class. The authors’ solution is based on binary classifiers that decide whether a text must be assigned to a given class. In other words, many classifiers are created, each of which can assign one specific class. Later, the new text data are fed to all the trained classifiers, and each returns true or false for its class. All classes for which true is returned are considered the classes of the multi-label text data item. The problem with this method is that a huge number of classifiers has to be prepared and trained, so models for text data of the various fields are needed before the work can start.</p>
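The one-binary-classifier-per-class scheme described above (often called binary relevance) can be sketched as follows. The keyword-based classifiers and class names here are hypothetical stand-ins for real trained models, used only to show the control flow:

```python
# Binary relevance sketch: one binary classifier per class; a text
# receives every class whose classifier returns True. The keyword
# rules below are toy stand-ins for real trained binary classifiers.

def make_keyword_classifier(keywords):
    """Return a toy binary classifier that fires if any keyword occurs."""
    def classify(text):
        words = set(text.lower().split())
        return bool(words & keywords)
    return classify

# One classifier per class (assumed example classes).
classifiers = {
    "finance": make_keyword_classifier({"bank", "loan", "market"}),
    "sports":  make_keyword_classifier({"match", "goal", "team"}),
}

def predict_labels(text):
    """Feed the text to every binary classifier; collect the 'true' classes."""
    return sorted(cls for cls, clf in classifiers.items() if clf(text))

print(predict_labels("The bank raised the loan rate"))  # ['finance']
```

A text matching keywords of several classifiers receives all of their classes, which is exactly how a multi-label prediction emerges from many single-class models.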
<p>Hmeidi <italic>et al.</italic> in two of their publications (Ahmed <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor473_ref_002">2015</xref>; Hmeidi <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor473_ref_011">2016</xref>) used different strategies to analyse multi-label text data in the Arabic language. One of the methods used a lexicon-based approach for multi-label text data classification. The keywords most associated with each analysed data class have been extracted automatically from the training data, along with a threshold that was later used to determine whether each test text belongs to a certain class. In such a way, lexicon-based text data classification helps to match the vocabularies associated with each class in the lexicon with text vectors found in the text data and to classify the texts accordingly. Other research showed that a lexicon built in such a way can be a valuable factor in boosting the accuracy of unsupervised classification, especially when it is automated (Kim <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor473_ref_017">2014</xref>). The other method that the authors have used in their research is the so-called problem transformation method available in the MEKA system. It is a simple way to transform multi-label data into single-label data suitable for standard classification using various classification methods such as k-nearest neighbours, SVM, decision trees, etc.</p>
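One common problem transformation is the label powerset: each distinct set of labels becomes a single combined class, after which any standard single-label classifier applies. A minimal sketch (the texts and label sets below are invented for illustration, not the Arabic-language data used by the authors):

```python
# Label powerset transformation sketch: each distinct label set is
# mapped to one combined single-label class, so standard classifiers
# can be trained on the transformed data. Data below are illustrative.

def label_powerset(multilabel_data):
    """Map each (text, label set) pair to (text, combined class name)."""
    transformed = []
    for text, labels in multilabel_data:
        combined = "+".join(sorted(labels))  # {"politics","finance"} -> "finance+politics"
        transformed.append((text, combined))
    return transformed

data = [
    ("text one",   {"finance"}),
    ("text two",   {"finance", "politics"}),
    ("text three", {"politics", "finance"}),
]

single_label = label_powerset(data)
# Items two and three now share the same single class "finance+politics".
```

Sorting the labels before joining makes the combined class name order-independent, so identical label sets always map to the same class.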
<p>Another method that can be used for text analysis is latent Dirichlet allocation (LDA) (Blei <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor473_ref_005">2003</xref>). LDA is a model that discovers underlying topics in a collection of documents and infers word probabilities within topics. A user can select the desired number of topics as a parameter of the method, as well as the number of words that reflect these topics best. LDA is a completely unsupervised algorithm. Most importantly, LDA makes the explicit assumption that each word is generated from one underlying topic. In the publication of Ramage <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor473_ref_026">2009</xref>), an LDA-based approach has been proposed for multi-label text analysis in which user supervision has additionally been included. In the paper, the authors demonstrate the model’s effectiveness on tasks related to credit attribution within documents, including document visualizations and tag-specific snippet extraction.</p>
<p>In order to demonstrate the performance of the proposed approach, text data in the Lithuanian language have been examined. Various text pre-processing filters and models can be used in text analysis, but usually they fit easily only English texts; for less widespread languages, different problems arise and solutions have to be found to adapt the models. The spelling of the Lithuanian language is complicated because of the variety of word forms as well as the sentence structure. There are studies in which the Lithuanian language is analysed, but none of them analyse multi-label text data. Krilavičius <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor473_ref_019">2012</xref>) have presented a combined application of natural language processing and information retrieval for Lithuanian media analysis. It has been demonstrated that these combinations, with appropriate changes, can be successfully applied to Lithuanian media text data. Kapočiūtė-Dzikienė <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor473_ref_014">2019</xref>) tried to classify Lithuanian comments collected from news websites as positive, negative, or neutral. Conventional machine learning methods such as SVM and MNB, as well as deep learning methods, have been used. In other research, Štrimaitis <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor473_ref_031">2021</xref>) performed sentiment analysis of Lithuanian financial-context text data. The LSTM, SVM, and MNB algorithms have been used, with the highest accuracy obtained by the MNB classifier. All researchers used different variations of Lithuanian text pre-processing to achieve the best possible classification accuracy.</p>
<p>The performed analysis of related works has shown that the main focus of research is usually to find out which classification algorithm achieves the best accuracy on multi-label text data and to carry out a comparative analysis. The most commonly used technique is to create binary classifiers for each multi-label text data class, so that the model can predict to which classes a new data item belongs. The analysis also shows that there are not many studies performed on Lithuanian text data, especially multi-label text data. The specificity of the Lithuanian language requires slightly different data preparation, and this must be taken into account. Most scientific papers performing multi-label data classification do not analyse the quality of the data: data quality is unambiguously trusted and not questioned. It is obvious that the results of supervised learning models are highly dependent on the preparation of the data. Therefore, it is important to develop methods that could help improve the quality of multi-label text data and thus lead to better classification results.</p>
</sec>
<sec id="j_infor473_s_003">
<label>3</label>
<title>Multi-Label Text Data Class Verification and Adjustment</title>
<p>In this paper, an approach that can verify and adjust the classes of multi-label text data has been proposed. The concept of the proposed approach is presented in Fig. <xref rid="j_infor473_fig_001">1</xref>. The main parts are as follows: data pre-processing; finding the most commonly used words in each analysed data class; the usage of LSA to reduce the dimensionality of the vectors corresponding to text data items; the usage of SOM to discover similarities between text data; and rules for assignment to a new class.</p>
<fig id="j_infor473_fig_001">
<label>Fig. 1</label>
<caption>
<p>The proposed approach for multi-label text data class verification and adjustment.</p>
</caption>
<graphic xlink:href="infor473_g001.jpg"/>
</fig>
<p>Obviously, the results of this approach can depend on the selected methods and parameters, so an experimental investigation has been performed to confirm the usability of the approach; the obtained results are presented and discussed in Section <xref rid="j_infor473_s_008">4</xref>. To evaluate the quality of the approach, we cooperated with experts and asked them to verify the newly assigned classes of the multi-label text data. After the data have been verified by the experts, the new classes of the multi-label text data are obtained and the performance of the proposed approach is evaluated.</p>
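To make the pipeline concrete, the first two stages listed above (pre-processing and finding the most commonly used words per class) can be sketched as follows. The stopword list, the length filter, and the example classes are simplified assumptions for illustration, not the authors' exact Lithuanian-language setup:

```python
# Sketch of the first two pipeline stages: pre-processing with simple
# filters, then collecting the most frequent words of each class.
# Stopwords, filters, and classes are illustrative assumptions.
from collections import Counter

STOPWORDS = {"the", "a", "and", "of"}  # assumed filter list

def preprocess(text):
    """Lowercase the text and drop stopwords and very short tokens."""
    return [w for w in text.lower().split()
            if w not in STOPWORDS and len(w) > 2]

def top_words_per_class(labelled_texts, k=3):
    """Collect the k most frequent words for each class."""
    counters = {}
    for text, cls in labelled_texts:
        counters.setdefault(cls, Counter()).update(preprocess(text))
    return {cls: [w for w, _ in c.most_common(k)]
            for cls, c in counters.items()}

docs = [("the bank raised rates and the bank fell", "finance"),
        ("the team won the match", "sports")]
print(top_words_per_class(docs))
```

The per-class frequent-word lists produced here play the role of the class-representative vectors against which individual texts are later compared.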
<sec id="j_infor473_s_004">
<label>3.1</label>
<title>Latent Semantic Analysis</title>
<p>LSA is a model (Dumais, <xref ref-type="bibr" rid="j_infor473_ref_008">2004</xref>) that is often used in natural language processing tasks. Its advantage over other dimensionality reduction methods is that the dimensionality reduction takes the context of the text data into account. The main aim of the LSA model is to detect relationships between text data and the words they contain. LSA assumes that words that are close in meaning will occur in similar pieces of text, the so-called distributional hypothesis. Since the LSA model is also a dimensionality reduction method, it helps to reduce the dimensionality of huge text vectors. Suppose we have text data <inline-formula id="j_infor473_ineq_001"><alternatives><mml:math>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">N</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$X=\{{X_{1}},{X_{2}},\dots ,{X_{N}}\}$]]></tex-math></alternatives></inline-formula> and a bag of words is created. The bag of words is a list of words from all text data, excluding the words that do not satisfy the conditions defined by the various pre-processing filters. Each data item is described by the words obtained after the bag of words is created. According to the frequency of the words in the text data, a so-called text matrix is created: 
<disp-formula id="j_infor473_eq_001">
<label>(1)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mfenced separators="" open="(" close=")">
<mml:mrow>
<mml:mtable columnspacing="4.0pt 4.0pt 4.0pt" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" columnalign="center center center center">
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>11</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>12</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mo stretchy="false">⋯</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>21</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>22</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mo stretchy="false">⋯</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mo>⋮</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:mo>⋮</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:mo stretchy="false">⋱</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:mo>⋮</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mo stretchy="false">⋯</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \left(\begin{array}{c@{\hskip4.0pt}c@{\hskip4.0pt}c@{\hskip4.0pt}c}{x_{11}}& {x_{12}}& \cdots & {x_{1n}}\\ {} {x_{21}}& {x_{22}}& \cdots & {x_{2n}}\\ {} \vdots & \vdots & \ddots & \vdots \\ {} {x_{N1}}& {x_{N2}}& \cdots & {x_{Nn}}\end{array}\right),\]]]></tex-math></alternatives>
</disp-formula> 
here <inline-formula id="j_infor473_ineq_002"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{pl}}$]]></tex-math></alternatives></inline-formula> is the frequency of the <italic>l</italic>th word in the <italic>p</italic>th text, <inline-formula id="j_infor473_ineq_003"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi></mml:math><tex-math><![CDATA[$p=1,\dots ,N$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor473_ineq_004"><alternatives><mml:math>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi></mml:math><tex-math><![CDATA[$l=1,\dots ,n$]]></tex-math></alternatives></inline-formula>. <italic>N</italic> is the number of analysed texts, and <italic>n</italic> is the number of words in the bag of words. In the simplest case, the frequency value is equal to the number of times the word appears in the text. Usually, in the literature, the relative frequency is used; in this case, the word frequency in the text is divided by the total number of occurrences of the word over all the text data. Each row of the matrix (<xref rid="j_infor473_eq_001">1</xref>) is a text vector <inline-formula id="j_infor473_ineq_005"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo stretchy="false">∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${X^{\prime }_{p}}\in {R^{n}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor473_ineq_006"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi></mml:math><tex-math><![CDATA[$p=1,\dots ,N$]]></tex-math></alternatives></inline-formula>, corresponding to a text, i.e. its numerical expression. Given the text matrix, the mathematical technique called singular value decomposition is used in the LSA model to reduce the number of columns while preserving the similarity structure among the rows. In such a way, the dimensionality reduction is performed, and the new dimensionality <italic>D</italic> of the analysed data is obtained. Usually, two parameters influence the LSA model output: the number of expected dimensions and the exponent scaling the feature component strengths. The exponent scaling helps to highlight the more important words in the bag of words.</p>
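The truncated-SVD step at the heart of LSA can be sketched with numpy. The small term-frequency matrix and the target dimensionality <italic>D</italic> below are illustrative; real text matrices have thousands of bag-of-words columns:

```python
# LSA-style dimensionality reduction sketch: truncated SVD of the text
# matrix. The tiny term-frequency matrix and target dimensionality D
# are illustrative assumptions, not data from the paper.
import numpy as np

X = np.array([[2., 1., 0., 0.],   # N = 3 texts (rows)
              [1., 2., 0., 0.],   # n = 4 bag-of-words terms (columns)
              [0., 0., 1., 2.]])

D = 2                              # target dimensionality
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :D] * s[:D]       # each row: a text in the D-dim LSA space

print(X_reduced.shape)             # (3, 2): N texts, D dimensions
```

In this toy matrix, the first two texts share all their terms, so their reduced vectors nearly coincide, while the third text, which shares no terms with them, stays orthogonal to both in the reduced space.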
</sec>
<sec id="j_infor473_s_005">
<label>3.2</label>
<title>Text Similarity Measures</title>
<p>In the class assignment stage of our proposed approach, we used a similarity measure. The related works showed that the most common similarity measure used in text analysis is the cosine similarity distance. Other similarity measures, such as the Dice coefficient, overlap, etc. (Stefanovič <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor473_ref_030">2019</xref>), can also be used in text similarity detection. Preliminary research showed that there is no significant difference between using the cosine similarity and the Dice coefficient in our model. The cosine similarity distance between text vectors <inline-formula id="j_infor473_ineq_007"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${X^{\prime }_{p}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor473_ineq_008"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${Y^{\prime }}$]]></tex-math></alternatives></inline-formula> (in our case, <inline-formula id="j_infor473_ineq_009"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${Y^{\prime }}$]]></tex-math></alternatives></inline-formula> is the text vector that represents the most frequent words of each class) can be calculated using the formula (<xref rid="j_infor473_eq_002">2</xref>): 
<disp-formula id="j_infor473_eq_002">
<label>(2)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mo movablelimits="false">cos</mml:mo>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">(</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">)</mml:mo>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
</mml:msqrt>
<mml:mo>×</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="1em"/>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo movablelimits="false">…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \cos \big({X^{\prime }_{p}},{Y^{\prime }}\big)=\frac{{X^{\prime }_{p}}\times {Y^{\prime }}}{\sqrt{|{X^{\prime }_{p}}|}\times \sqrt{|{Y^{\prime }}|}},\hspace{1em}p=1,\dots ,N.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>Let us say that three words describe our selected and analysed data class best: “euros”, “lent”, “more”. The text data samples are given in Table <xref rid="j_infor473_tab_001">1</xref>. The text matrix (<xref rid="j_infor473_eq_001">1</xref>) with relative word frequencies is formed, and we want to find which of the texts <inline-formula id="j_infor473_ineq_010"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{1}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor473_ineq_011"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{2}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor473_ineq_012"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{3}}$]]></tex-math></alternatives></inline-formula> is most similar to our three words presented as a text vector <inline-formula id="j_infor473_ineq_013"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${Y^{\prime }}=(1,1,1)$]]></tex-math></alternatives></inline-formula> that reflect the class context.</p>
<table-wrap id="j_infor473_tab_001">
<label>Table 1</label>
<caption>
<p>The example of text data.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Texts</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Text content</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Text matrix</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Cosine similarity distance</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor473_ineq_014"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Today I found 50 euros</td>
<td rowspan="3" style="vertical-align: middle; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor473_ineq_015"><alternatives><mml:math>
<mml:mfenced separators="" open="(" close=")">
<mml:mrow>
<mml:mtable columnspacing="4.0pt 4.0pt" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" columnalign="center center center">
<mml:mtr>
<mml:mtd class="array">
<mml:mn>0.25</mml:mn>
</mml:mtd>
<mml:mtd class="array">
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd class="array">
<mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mn>0.5</mml:mn>
</mml:mtd>
<mml:mtd class="array">
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd class="array">
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mn>0.25</mml:mn>
</mml:mtd>
<mml:mtd class="array">
<mml:mn>1</mml:mn>
</mml:mtd>
<mml:mtd class="array">
<mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
</mml:math><tex-math><![CDATA[$\left(\begin{array}{c@{\hskip4.0pt}c@{\hskip4.0pt}c}0.25& 0& 0\\ {} 0.5& 0& 1\\ {} 0.25& 1& 0\end{array}\right)$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0.5774</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor473_ineq_016"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">100 euros is more than 50 euros</td>
<td style="vertical-align: top; text-align: left">0.7746</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor473_ineq_017"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">The man lent 100 euros to his friend</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.7001</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The cosine similarity distance is in the interval <inline-formula id="j_infor473_ineq_018"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[0,1]$]]></tex-math></alternatives></inline-formula>. If the cosine similarity value equals 1, the texts are identical, while a value of 0 means the two texts are completely different. As we can see in Table <xref rid="j_infor473_tab_001">1</xref>, the text most similar to our selected three words is <inline-formula id="j_infor473_ineq_019"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{2}}$]]></tex-math></alternatives></inline-formula>, with a cosine similarity distance equal to 0.7746. The word “euros” appears twice and the word “more” once. The text <inline-formula id="j_infor473_ineq_020"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{3}}$]]></tex-math></alternatives></inline-formula> is close to <inline-formula id="j_infor473_ineq_021"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{2}}$]]></tex-math></alternatives></inline-formula>, and the text <inline-formula id="j_infor473_ineq_022"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{1}}$]]></tex-math></alternatives></inline-formula> is the most dissimilar.</p>
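The Table 1 values can be reproduced with a short NumPy computation of formula (2), taking the denominator as the product of the Euclidean norms of the two vectors:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine similarity between two text vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Rows of the text matrix from Table 1 (words: "euros", "lent", "more").
X1 = np.array([0.25, 0.0, 0.0])
X2 = np.array([0.50, 0.0, 1.0])
X3 = np.array([0.25, 1.0, 0.0])
Y = np.array([1.0, 1.0, 1.0])   # class word vector Y'

for X in (X1, X2, X3):
    print(round(cosine_similarity(X, Y), 4))
# prints 0.5774, 0.7746, 0.7001
```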
</sec>
<sec id="j_infor473_s_006">
<label>3.3</label>
<title>Self-Organizing Map</title>
<p>Many methods are applied in data science, most of them to solve data classification and clustering tasks. In our research, we focus on data clustering. Various clustering methods can be used (Aggarwal and Zhai, <xref ref-type="bibr" rid="j_infor473_ref_001">2012</xref>), such as density-based clustering, hierarchical clustering, k-means, etc. In our proposed approach, we use SOM, an artificial neural network model proposed by Kohonen (<xref ref-type="bibr" rid="j_infor473_ref_018">2012</xref>). The main advantage of this method is that it not only clusters the data but also presents the results in a visual form that a researcher can interpret much more easily. The SOM visual form can be presented in various ways (Stefanovič and Kurasova, <xref ref-type="bibr" rid="j_infor473_ref_027">2011</xref>; Dzemyda and Kurasova, <xref ref-type="bibr" rid="j_infor473_ref_009">2002</xref>), but the main aim of SOM is to preserve the topology of multidimensional data when they are transformed into a lower-dimensional (usually two-dimensional) space. SOM can be applied in various fields such as data mining (López <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor473_ref_022">2019</xref>), text mining (Yoshioka and Dozono, <xref ref-type="bibr" rid="j_infor473_ref_034">2018</xref>), and even image analysis (Licen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor473_ref_021">2020</xref>; Aly and Almotairi, <xref ref-type="bibr" rid="j_infor473_ref_003">2020</xref>). SOM can be used to cluster, classify, and visualize data. A SOM is a set of nodes connected via a rectangular or hexagonal topology. The rectangular topology of SOM is presented in Fig. <xref rid="j_infor473_fig_002">2</xref>.</p>
<fig id="j_infor473_fig_002">
<label>Fig. 2</label>
<caption>
<p>Two-dimensional SOM (rectangular topology).</p>
</caption>
<graphic xlink:href="infor473_g002.jpg"/>
</fig>
<p>The set of weights forms a vector <inline-formula id="j_infor473_ineq_023"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{ij}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor473_ineq_024"><alternatives><mml:math>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$i=1,\dots ,{k_{a}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor473_ineq_025"><alternatives><mml:math>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">b</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$j=1,\dots ,{k_{b}}$]]></tex-math></alternatives></inline-formula> that is usually called a neuron or codebook vector, where <inline-formula id="j_infor473_ineq_026"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${k_{a}}$]]></tex-math></alternatives></inline-formula> is the number of columns, and <inline-formula id="j_infor473_ineq_027"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">b</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${k_{b}}$]]></tex-math></alternatives></inline-formula> is the number of rows of the SOM. All texts of the analysed data are given to the SOM as text vectors. The learning process of the SOM algorithm starts with the initialization of the components of the vectors <inline-formula id="j_infor473_ineq_028"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{ij}}$]]></tex-math></alternatives></inline-formula>, which can be initialized randomly, linearly, or by the principal components. At each learning step, an input vector <inline-formula id="j_infor473_ineq_029"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${X^{\prime }_{p}}$]]></tex-math></alternatives></inline-formula> is passed to the SOM. The vector <inline-formula id="j_infor473_ineq_030"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${X^{\prime }_{p}}$]]></tex-math></alternatives></inline-formula> is compared to all neurons <inline-formula id="j_infor473_ineq_031"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{ij}}$]]></tex-math></alternatives></inline-formula>. Usually, the Euclidean distance between this input vector <inline-formula id="j_infor473_ineq_032"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${X^{\prime }_{p}}$]]></tex-math></alternatives></inline-formula> and each neuron <inline-formula id="j_infor473_ineq_033"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{ij}}$]]></tex-math></alternatives></inline-formula> is calculated. The vector <inline-formula id="j_infor473_ineq_034"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{w}}$]]></tex-math></alternatives></inline-formula> with the minimal Euclidean distance to <inline-formula id="j_infor473_ineq_035"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${X^{\prime }_{p}}$]]></tex-math></alternatives></inline-formula> is designated as the neuron winner. The components of all neurons are adapted according to the learning rule: 
<disp-formula id="j_infor473_eq_003">
<label>(3)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">(</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {M_{ij}}(t+1)={M_{ij}}(t)+{h_{ij}^{w}}\big({X^{\prime }_{p}}-{M_{ij}}(t)\big),\]]]></tex-math></alternatives>
</disp-formula> 
here <italic>t</italic> is the iteration number, <inline-formula id="j_infor473_ineq_036"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${h_{ij}^{w}}$]]></tex-math></alternatives></inline-formula> is a neighbouring function, <italic>w</italic> is a pair of indices of the neuron winner of vector <inline-formula id="j_infor473_ineq_037"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${X^{\prime }_{p}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor473_ineq_038"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi></mml:math><tex-math><![CDATA[$p=1,\dots ,N$]]></tex-math></alternatives></inline-formula>. The learning is repeated until the maximum number of iterations is reached. Many SOM visualizations use colouring techniques to show distances on the map, i.e. how close the vectors of neighbouring cells are in the input space of the analysed data. The most popular one is based on the so-called unified distance matrix (u-matrix) (Ultsch and Siemon, <xref ref-type="bibr" rid="j_infor473_ref_033">1989</xref>). The SOM is coloured by the values of the u-matrix elements. If a greyscale is used, a dark colour between neurons corresponds to a large distance, while a light colour signifies that the codebook vectors are close to each other in the input space. Light areas can be thought of as clusters and dark areas as cluster separators.</p>
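A minimal sketch of SOM training with the learning rule (3), assuming a Gaussian neighbouring function, random initialization, and illustrative map size and learning parameters (none of these settings are taken from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

ka, kb, dim = 4, 4, 3            # map columns, rows, input dimensionality
M = rng.random((kb, ka, dim))    # codebook vectors M_ij, random initialization
X = rng.random((10, dim))        # analysed data as text vectors X'_p

# grid[i, j] = (i, j): positions of the neurons on the map
grid = np.stack(np.meshgrid(np.arange(kb), np.arange(ka), indexing="ij"), axis=-1)

def train(M, X, iterations=50, alpha=0.5, sigma=1.5):
    for t in range(iterations):
        decay = 1 - t / iterations
        for x in X:
            # neuron winner: minimal Euclidean distance to x
            d = np.linalg.norm(M - x, axis=-1)
            w = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighbouring function h_ij^w, shrinking over time
            g = np.exp(-np.sum((grid - np.array(w)) ** 2, axis=-1)
                       / (2 * (sigma * decay + 1e-3) ** 2))
            h = alpha * decay * g
            # learning rule (3): M_ij(t+1) = M_ij(t) + h_ij^w (x - M_ij(t))
            M = M + h[..., None] * (x - M)
    return M

M = train(M, X)
```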
</sec>
<sec id="j_infor473_s_007">
<label>3.4</label>
<title>The Proposed Approach</title>
<p>Suppose we have multi-label text data <inline-formula id="j_infor473_ineq_039"><alternatives><mml:math>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">N</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$X=\{{X_{1}},{X_{2}},\dots ,{X_{N}}\}$]]></tex-math></alternatives></inline-formula>, where <italic>N</italic> is the number of data items. At least some of the data items <inline-formula id="j_infor473_ineq_040"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{p}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor473_ineq_041"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi></mml:math><tex-math><![CDATA[$p=1,\dots ,N$]]></tex-math></alternatives></inline-formula> are assigned to more than one class (for example, some data items are assigned to one class, some to two classes). First of all, the data must be pre-processed to avoid artificial similarity between texts. Then, tokenization has to be performed. Tokenization separates text data into smaller units called tokens; in this research, tokens can be words, characters, punctuation marks, etc. Next, unnecessary tokens have to be removed from the texts. Various pre-processing filters can be applied, such as number removal, punctuation erasing, conversion to lower- or uppercase, token length filtering, stop word lists, etc. In our previous research (Stefanovič and Kurasova, <xref ref-type="bibr" rid="j_infor473_ref_028">2014</xref>), we analysed the influence of text data pre-processing filters on SOM clustering results. The experimental results showed that it is advisable to remove standalone numbers while keeping numbers inside words, for example, “Covid-19”. A case conversion filter also has to be chosen; it does not matter whether all tokens are converted to lowercase or uppercase. It is preferable to use a stemming algorithm, which reduces the number of words with the same meaning: for example, “accepted”, “accepting”, and “acceptable” are all reduced to the single word “accept”. The punctuation eraser filter must be used, too. One more important filter is the stop word list, which has to include commonly used words of the analysed language; it is also desirable to include specific words that are often used in the texts of the domain under analysis. To avoid a high number of unimportant words in the text data, it is suggested to apply a token length filter. According to previous research, the selected token length is usually not less than three characters.</p>
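The pre-processing steps described above can be sketched as follows; the tiny stop-word list is an illustrative stand-in, and a trivial suffix stripper stands in for a real stemming algorithm such as Porter's:

```python
import re

STOP_WORDS = {"the", "by", "will", "and"}   # illustrative stand-in list

def stem(token):
    # trivial suffix stripping; a real stemmer (e.g. Porter) should be used
    for suffix in ("able", "ing", "es", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text, min_len=3):
    # case conversion + tokenization; punctuation is dropped by the pattern,
    # hyphenated tokens such as "covid-19" are kept whole
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    # remove standalone numbers but keep numbers inside words
    tokens = [t for t in tokens if not t.replace("-", "").isdigit()]
    # stop-word removal and stemming
    tokens = [stem(t) for t in tokens if t not in STOP_WORDS]
    # token length filter (not less than three characters)
    return [t for t in tokens if len(t) >= min_len]

print(preprocess("Businesses affected by the COVID-19 pandemic will receive 500 euros"))
# ['business', 'affect', 'covid-19', 'pandemic', 'receive', 'euro']
```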
<p>Then, all analysed text data has to be split into the subsets <inline-formula id="j_infor473_ineq_042"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${S_{1}},{S_{2}},\dots ,{S_{C}}$]]></tex-math></alternatives></inline-formula>, one for each class, where <italic>C</italic> is the number of classes in the analysed data. If a text is assigned to more than one class, it is included in several subsets. For example, the text “Businesses affected by the COVID pandemic will receive financial support”, which belongs to the “Pandemic” and “Finance” classes, is included in two subsets. After the subsets are formed, the most frequent word lists <inline-formula id="j_infor473_ineq_043"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${W_{{S_{1}}}},{W_{{S_{2}}}},\dots ,{W_{{S_{C}}}}$]]></tex-math></alternatives></inline-formula> of each subset are found. In the next step, text vectors of the word lists <inline-formula id="j_infor473_ineq_044"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">z</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">z</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">z</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">z</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${W_{{S_{z}}}}=({w_{z1}},{w_{z2}},\dots ,{w_{zT}})$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor473_ineq_045"><alternatives><mml:math>
<mml:mi mathvariant="italic">z</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi></mml:math><tex-math><![CDATA[$z=1,\dots ,C$]]></tex-math></alternatives></inline-formula> are formed, where <italic>T</italic> is the selected number of words. The cosine similarity distance is calculated by Eq. (<xref rid="j_infor473_eq_002">2</xref>), and a distance matrix is formed: 
<disp-formula id="j_infor473_eq_004">
<label>(4)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mfenced separators="" open="(" close=")">
<mml:mrow>
<mml:mtable columnspacing="4.0pt 4.0pt 4.0pt" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" columnalign="center center center center">
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>11</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>12</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mo stretchy="false">⋯</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>21</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>22</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mo stretchy="false">⋯</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mo>⋮</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:mo>⋮</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:mo stretchy="false">⋱</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:mo>⋮</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mo stretchy="false">⋯</mml:mo>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \left(\begin{array}{c@{\hskip4.0pt}c@{\hskip4.0pt}c@{\hskip4.0pt}c}{d_{11}}& {d_{12}}& \cdots & {d_{1C}}\\ {} {d_{21}}& {d_{22}}& \cdots & {d_{2C}}\\ {} \vdots & \vdots & \ddots & \vdots \\ {} {d_{N1}}& {d_{N2}}& \cdots & {d_{NC}}\end{array}\right),\]]]></tex-math></alternatives>
</disp-formula> 
here <inline-formula id="j_infor473_ineq_046"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mi mathvariant="italic">z</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${d_{pz}}$]]></tex-math></alternatives></inline-formula> is a cosine similarity distance value between pre-processed text vectors <inline-formula id="j_infor473_ineq_047"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${X^{\prime }_{p}}$]]></tex-math></alternatives></inline-formula> and text vectors of words lists <inline-formula id="j_infor473_ineq_048"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">z</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${W_{{S_{z}}}}$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_infor473_ineq_049"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi></mml:math><tex-math><![CDATA[$p=1,\dots ,N$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor473_ineq_050"><alternatives><mml:math>
<mml:mi mathvariant="italic">z</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi></mml:math><tex-math><![CDATA[$z=1,\dots ,C$]]></tex-math></alternatives></inline-formula>. LSA is used to reduce the dimensionality of the pre-processed text vectors. Without LSA, the dimensionality of the text matrix is usually very high (it depends on the size of the analysed text data), so dimensionality reduction helps the SOM perform better. After the LSA model, the data is fed to the SOM. Various SOM parameters can be selected, but based on our previous research (Stefanovic and Kurasova, <xref ref-type="bibr" rid="j_infor473_ref_029">2014</xref>), we suggest using the Gaussian neighbourhood function and a linear learning rate. In the proposed approach, the SOM size is chosen by the researcher, starting from the largest size and reducing it to the smallest by a chosen step. Also, the parameter <italic>L</italic> has to be selected; it sets the limit at which a new class assignment is made in a SOM cell. For example, we chose <inline-formula id="j_infor473_ineq_051"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>80</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$L=80\% $]]></tex-math></alternatives></inline-formula>, and there is a SOM cell into which 10 data items fall: 8 items belong to the first class, 1 item to the second class, and 1 item to the third class. In this case, the dominant class is the first class, and a new class assignment is possible because the share of the dominant class (8/10 = 80%) satisfies the limit <italic>L</italic>, so the data items in this SOM cell are assigned to the dominant class. The pseudocode of the new class assignment is presented in Algorithm <xref rid="j_infor473_fig_003">1</xref>. The output of Algorithm <xref rid="j_infor473_fig_003">1</xref> is the adjusted text data classes. For simplicity, the pseudocode shows only the case of one or two classes. The proposed approach can also be applied when text data is assigned to more classes; in that case, just a few additional inspection conditions need to be added.</p>
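<p>The cosine similarity matrix of formula (4) can be sketched as follows; the tiny arrays are hypothetical stand-ins for the pre-processed text vectors and the class word-list vectors:</p>

```python
import numpy as np

def cosine_similarity_matrix(X, W):
    """Return the N x C matrix of similarities d_pz between document
    vectors X (N x m) and class word-list vectors W (C x m)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return Xn @ Wn.T

# Hypothetical two-dimensional example: two documents, two class word lists.
X = np.array([[1.0, 0.0], [0.0, 2.0]])
W = np.array([[1.0, 0.0], [1.0, 1.0]])
sims = cosine_similarity_matrix(X, W)
```

<p>Row <italic>p</italic>, column <italic>z</italic> of the result is the similarity between text <italic>p</italic> and the word list of class <italic>z</italic>; normalizing each row once lets the whole matrix be computed as a single dot product.</p>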
<fig id="j_infor473_fig_003">
<label>Algorithm 1</label>
<caption>
<p>New class assignment using SOM.</p>
</caption>
<graphic xlink:href="infor473_g003.jpg"/>
</fig>
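<p>The dominant-class rule described above (and formalized in Algorithm 1) can be sketched as follows; representing a SOM cell as a plain list of class labels is a simplification for illustration:</p>

```python
from collections import Counter

def reassign_cell(cell_labels, limit=0.8):
    """If the dominant class in a SOM cell reaches the limit L,
    assign every data item in the cell to that class."""
    if not cell_labels:
        return []
    dominant, count = Counter(cell_labels).most_common(1)[0]
    if count / len(cell_labels) >= limit:
        return [dominant] * len(cell_labels)
    return list(cell_labels)

# The example from the text: 10 items, 8 of class 1, one each of classes 2 and 3.
cell = [1, 1, 1, 1, 1, 1, 1, 1, 2, 3]
adjusted = reassign_cell(cell, limit=0.8)
```

<p>For the example cell, the share of class 1 is 8/10 = 80%, which meets the limit, so all 10 items are reassigned to class 1; a cell without a sufficiently dominant class is left unchanged.</p>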
</sec>
</sec>
<sec id="j_infor473_s_008">
<label>4</label>
<title>Experimental Investigation</title>
<sec id="j_infor473_s_009">
<label>4.1</label>
<title>Text Data Analysed</title>
<p>To perform the experimental investigation, a newly collected dataset has been used (LFND, <xref ref-type="bibr" rid="j_infor473_ref_020">2021</xref>). The data was collected from public Lithuanian financial news websites and stored in a database as texts. The analysed data is a set of texts <inline-formula id="j_infor473_ineq_052"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">N</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{1}},{X_{2}},\dots ,{X_{N}}$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_infor473_ineq_053"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>12484</mml:mn></mml:math><tex-math><![CDATA[$N=12484$]]></tex-math></alternatives></inline-formula>. In cooperation with a company whose main field is accounting and business management software development, with more than 30 years of experience, five experts from the financial department manually assigned all text data to 10 classes (Collective, Development, Finance, Industry, Innovation, International, Law enforcement, Pandemic, Politics, and Reliability). In the process of data class assignment, the rule was that each text data item could be assigned to no more than two classes. As mentioned before, the problem with manual class assignment is that every expert can interpret the text differently (the human factor). Another problem is that some texts can be assigned to more than two classes, so it is difficult to decide which classes should be the main ones. This imbalance can lead to inaccurate results in later steps, so it is important to find ways to solve this problem. For instance, consider the text “The pandemic has had many financial consequences around the world”. This sentence could be assigned to the classes “International”, “Pandemic”, and “Finance”.</p>
<fig id="j_infor473_fig_004">
<label>Fig. 3</label>
<caption>
<p>The token number distribution of unpre-processed text data.</p>
</caption>
<graphic xlink:href="infor473_g004.jpg"/>
</fig>
<p>The token number distribution of unpre-processed text data is presented in Fig. <xref rid="j_infor473_fig_004">3</xref>. As we can see in Fig. <xref rid="j_infor473_fig_004">3</xref>, the majority of texts are no longer than 54 tokens. There are 2490 texts whose number of tokens is from 54 to 105, and just 202 texts are longer than 105 tokens.</p>
<p>In this research, multi-label text data is analysed, thus some texts belong to one or two classes. Suppose we have a text that belongs to the “Pandemic” and “Collective” classes; this text is then considered as “Pandemic” and “Collective” at the same time. The data class distribution is presented in Fig. <xref rid="j_infor473_fig_005">4</xref> (if a data item has more than one class, it is counted in both classes). Because the data is collected from financial news websites, the majority of the texts belong to the class “Finance”. The number of data items from the other classes is similar, except that the classes “Industry”, “Development”, and “Collective” are larger. The smallest number of data items is from the class “Politics”.</p>
<p>There are 6025 texts with just one assigned class, and the remaining 6459 data items are assigned to two classes. The total number of tokens over all the unpre-processed data is equal to 438730 (59148 unique tokens); after pre-processing (the filters are described in Subsection <xref rid="j_infor473_s_007">3.4</xref>), the overall number of tokens is equal to 254615 (22730 unique tokens). The most frequent words of each data class are presented in Fig. <xref rid="j_infor473_fig_006">5</xref>, which allows us to find which words represent each class best.</p>
<fig id="j_infor473_fig_005">
<label>Fig. 4</label>
<caption>
<p>Distribution of data class.</p>
</caption>
<graphic xlink:href="infor473_g005.jpg"/>
</fig>
<fig id="j_infor473_fig_006">
<label>Fig. 5</label>
<caption>
<p>Word clouds of each class.</p>
</caption>
<graphic xlink:href="infor473_g006.jpg"/>
</fig>
</sec>
<sec id="j_infor473_s_010">
<label>4.2</label>
<title>Experimental Research Results and Validation</title>
<p>As mentioned before, first of all, the text data has to be pre-processed. In our experimental investigation, we used the following pre-processing filters: numbers were removed, tokens were converted to lower case, the Lithuanian language snowball stemming algorithm was applied (Jocas, <xref ref-type="bibr" rid="j_infor473_ref_012">2020</xref>), punctuation was erased, tokens shorter than three characters were removed, and the Lithuanian language stop words list was used. An example of the SOM using the analysed data is presented in Fig. <xref rid="j_infor473_fig_007">6</xref>. The Orange data mining tool has been used for the visual presentation of the SOM (Demšar <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor473_ref_007">2013</xref>). In this type of visualization, a circle shows just the class label of the majority of data items which fall into one SOM cell, so some data items from other classes can be in the same cell as well. In this example, the selected size of the SOM is equal to <inline-formula id="j_infor473_ineq_054"><alternatives><mml:math>
<mml:mn>10</mml:mn>
<mml:mo>×</mml:mo>
<mml:mn>10</mml:mn></mml:math><tex-math><![CDATA[$10\times 10$]]></tex-math></alternatives></inline-formula> (it can vary depending on the researcher's selection), but because of the u-matrix visualization, additional cells are included in the SOM, which are used to represent the distances between clusters. Darker colours mean that the distance is larger than in the lighter-coloured cells. As we can see, on the left side and in the top left corner of the SOM, the blue colour dominates, indicating that the majority of the data is from the class “Collective”. On the right side and in the top right corner, “Finance” class data items are placed. At the top of the SOM, the light blue circles represent the data items that belong to the “Law enforcement” class. All other class members are spread over the whole SOM, forming small clusters.</p>
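<p>For illustration, the pre-processing filters listed at the beginning of this subsection can be sketched as below; the three-word stop list is a placeholder for the full Lithuanian stop-word list, and the snowball stemming step is omitted from this sketch:</p>

```python
import re
import string

STOP_WORDS = {"ir", "kad", "yra"}  # placeholder for the Lithuanian stop-word list

def preprocess(text, stop_words=STOP_WORDS, min_len=3):
    """Apply the pre-processing filters: lowercase, remove numbers and
    punctuation, drop short tokens and stop words (stemming omitted)."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [t for t in text.split()
            if len(t) >= min_len and t not in stop_words]
```

<p>For example, a hypothetical input such as “Pandemija 2021 ir finansai!” would be reduced to the tokens “pandemija” and “finansai”.</p>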
<fig id="j_infor473_fig_007">
<label>Fig. 6</label>
<caption>
<p>Data presented in <inline-formula id="j_infor473_ineq_055"><alternatives><mml:math>
<mml:mn>10</mml:mn>
<mml:mo>×</mml:mo>
<mml:mn>10</mml:mn></mml:math><tex-math><![CDATA[$10\times 10$]]></tex-math></alternatives></inline-formula> SOM using u-matrix: a) coloured by the first class; b) coloured by the second class.</p>
</caption>
<graphic xlink:href="infor473_g007.jpg"/>
</fig>
<p>Many options can be selected in the approach, so according to our previous research (Stefanovic and Kurasova, <xref ref-type="bibr" rid="j_infor473_ref_029">2014</xref>), we choose the following SOM parameters by default: the SOM size is equal to <inline-formula id="j_infor473_ineq_056"><alternatives><mml:math>
<mml:mn>40</mml:mn>
<mml:mo>×</mml:mo>
<mml:mn>40</mml:mn></mml:math><tex-math><![CDATA[$40\times 40$]]></tex-math></alternatives></inline-formula>, and reduced by 5 until the SOM size is equal to <inline-formula id="j_infor473_ineq_057"><alternatives><mml:math>
<mml:mn>5</mml:mn>
<mml:mo>×</mml:mo>
<mml:mn>5</mml:mn></mml:math><tex-math><![CDATA[$5\times 5$]]></tex-math></alternatives></inline-formula>; the neighbourhood function is Gaussian and the learning rate is linear; the number of iterations equals 100; the initial SOM neurons are generated at random. In the class assignment part, we used the cosine similarity distance, and the most frequent word list of each class has 15 words. Preliminary research showed that words from different classes overlap when a high number of most frequent words is selected, so the final results can be worse. To determine which dimensionality should be selected as the output of the LSA model, research has been conducted in which the reduced dimensionality <italic>D</italic> varies from 10 to 50 in steps of 10, and the limit percent of new class assignment is <inline-formula id="j_infor473_ineq_058"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>90</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$L=90\% $]]></tex-math></alternatives></inline-formula>. The process described in Algorithm <xref rid="j_infor473_fig_003">1</xref> has been performed, and after all the steps, each data item has been assigned to adjusted classes. As mentioned before, the experts have been asked to review the new data class assignments and mark one of the tags described in Table <xref rid="j_infor473_tab_002">2</xref>. All assigned tags have been counted and used to evaluate the proposed approach.</p>
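<p>A minimal sketch of the LSA reduction step, using a plain truncated SVD (the concrete LSA implementation used here is not prescribed by the approach), together with the SOM size schedule described above:</p>

```python
import numpy as np

def lsa_reduce(term_matrix, D=40):
    """Reduce an N x m document-term matrix to N x D coordinates
    via truncated SVD, as LSA does before the data is fed to the SOM."""
    U, s, _ = np.linalg.svd(term_matrix, full_matrices=False)
    return U[:, :D] * s[:D]

# SOM size schedule: 40x40 down to 5x5, reduced by 5 at each step.
som_sizes = range(40, 4, -5)

# Hypothetical tiny matrix; the experiments vary D from 10 to 50 in steps of 10.
M = np.random.default_rng(0).random((6, 5))
Z = lsa_reduce(M, D=2)
```

<p>Each row of the reduced matrix is then used as the input vector of one text when training the SOM of each size in the schedule.</p>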
<table-wrap id="j_infor473_tab_002">
<label>Table 2</label>
<caption>
<p>Class assignments tags and their descriptions.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Tag name</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Tag description</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Accept</td>
<td style="vertical-align: top; text-align: justify">The “Accept” tag is used if the new class is unambiguously assigned correctly. For example, the text primarily belongs to one class, and the approach finds that an additional class has to be assigned. If the new class assignment is correct, the expert marks it as “Accept”. Similarly, if the text is primarily assigned to two classes and the approach changes one of them correctly, this tag is also used.</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Decline</td>
<td style="vertical-align: top; text-align: justify">“Decline” is marked if the approach assigns the class obviously incorrectly. For example, the text primarily belongs to the classes “Finance” and “Politics”. The proposed approach assigned a new class “Development” instead of the class “Finance”, but this class is incorrect, so the expert has to mark the tag “Decline”.</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Possible</td>
<td style="vertical-align: top; text-align: justify; border-bottom: solid thin">Suppose the text is primarily assigned to two classes – “Industry” and “Finance”. The proposed approach makes a new assignment, and as a result, the “Finance” class has been changed to the class “Innovation”. If the analysed text data can have more than two classes and the “Innovation” class is correct, the tag “Possible” has to be used. In this way, the assigned classes can be balanced artificially so that the class depends not on a human point of view but only on the words in the text.</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>First, primary research has been performed to find out how the SOM size influences the number of new assignments. For simplicity, when using LSA, the dimensionality is reduced to <inline-formula id="j_infor473_ineq_059"><alternatives><mml:math>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>10</mml:mn></mml:math><tex-math><![CDATA[$D=10$]]></tex-math></alternatives></inline-formula>, and the results are presented in Fig. <xref rid="j_infor473_fig_008">7</xref>. As we can see, in the beginning, when the SOM size is <inline-formula id="j_infor473_ineq_060"><alternatives><mml:math>
<mml:mn>40</mml:mn>
<mml:mo>×</mml:mo>
<mml:mn>40</mml:mn></mml:math><tex-math><![CDATA[$40\times 40$]]></tex-math></alternatives></inline-formula>, the number of “Accept” assignments is equal to 35. As the SOM size is reduced, the “Accept” number also decreases. The “Decline” and “Possible” curves are quite similar. When the SOM size is <inline-formula id="j_infor473_ineq_061"><alternatives><mml:math>
<mml:mn>30</mml:mn>
<mml:mo>×</mml:mo>
<mml:mn>30</mml:mn></mml:math><tex-math><![CDATA[$30\times 30$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_infor473_ineq_062"><alternatives><mml:math>
<mml:mn>25</mml:mn>
<mml:mo>×</mml:mo>
<mml:mn>25</mml:mn></mml:math><tex-math><![CDATA[$25\times 25$]]></tex-math></alternatives></inline-formula>, the number of “Decline” assignments is slightly larger.</p>
<fig id="j_infor473_fig_008">
<label>Fig. 7</label>
<caption>
<p>Class assignments reviewed by experts, <inline-formula id="j_infor473_ineq_063"><alternatives><mml:math>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>10</mml:mn></mml:math><tex-math><![CDATA[$D=10$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor473_ineq_064"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>90</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$L=90\% $]]></tex-math></alternatives></inline-formula>.</p>
</caption>
<graphic xlink:href="infor473_g008.jpg"/>
</fig>
<p>In this experimental investigation, we assume that new class assignments reviewed by experts and tagged as “Accept” or “Possible” are correct new class assignments, and the data is not corrupted. In this way, the correct assignment ratio can be calculated using the simple formula (<xref rid="j_infor473_eq_005">5</xref>). The ratio can be expressed as a percentage: 
<disp-formula id="j_infor473_eq_005">
<label>(5)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mtext mathvariant="italic">Correct assignment ratio</mml:mtext>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mtext mathvariant="italic">Accept</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext mathvariant="italic">Possible</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">Accept</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext mathvariant="italic">Decline</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext mathvariant="italic">Possible</mml:mtext>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \textit{Correct assignment ratio}=\frac{\textit{Accept}+\textit{Possible}}{\textit{Accept}+\textit{Decline}+\textit{Possible}}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
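<p>Formula (5) is straightforward to compute; the sketch below also evaluates it on the <italic>L</italic> = 90% tag counts reported later in this subsection (152 “Accept”, 31 “Decline”, 113 “Possible”):</p>

```python
def correct_assignment_ratio(accept, decline, possible):
    """Formula (5): share of expert-reviewed assignments tagged
    'Accept' or 'Possible', expressed as a percentage."""
    total = accept + decline + possible
    return 100.0 * (accept + possible) / total

ratio = correct_assignment_ratio(152, 31, 113)
```

<p>For these counts, 265 of 296 assignments are considered correct, i.e. a ratio of roughly 89.5%.</p>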
<fig id="j_infor473_fig_009">
<label>Fig. 8</label>
<caption>
<p>Class assignments reviewed by experts, <inline-formula id="j_infor473_ineq_065"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>90</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$L=90\% $]]></tex-math></alternatives></inline-formula>.</p>
</caption>
<graphic xlink:href="infor473_g009.jpg"/>
</fig>
<p>The same calculations have been performed using the approach with different reduced dimensionalities, and the overall correct assignment value is presented in Fig. <xref rid="j_infor473_fig_009">8</xref>. As we can see, using <inline-formula id="j_infor473_ineq_066"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>90</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$L=90\% $]]></tex-math></alternatives></inline-formula>, the highest number of assignments (296 assignments) is obtained when <inline-formula id="j_infor473_ineq_067"><alternatives><mml:math>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>40</mml:mn></mml:math><tex-math><![CDATA[$D=40$]]></tex-math></alternatives></inline-formula>. The highest correct assignment ratio is <inline-formula id="j_infor473_ineq_068"><alternatives><mml:math>
<mml:mn>89.61</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$89.61\% $]]></tex-math></alternatives></inline-formula>, obtained when <inline-formula id="j_infor473_ineq_069"><alternatives><mml:math>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[$D=50$]]></tex-math></alternatives></inline-formula>, but the results are not significantly different from the case when <inline-formula id="j_infor473_ineq_070"><alternatives><mml:math>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>40</mml:mn></mml:math><tex-math><![CDATA[$D=40$]]></tex-math></alternatives></inline-formula>. The worst result considering the number of assignments and correct assignment values is when the dimensionality equals 10. According to the obtained results, we select <inline-formula id="j_infor473_ineq_071"><alternatives><mml:math>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>40</mml:mn></mml:math><tex-math><![CDATA[$D=40$]]></tex-math></alternatives></inline-formula> in the following research, because of the highest number of assignments and almost the highest correct assignment ratio. To find out which limit <italic>L</italic> should be chosen, an experiment has been performed. The limit <italic>L</italic> has been changed from <inline-formula id="j_infor473_ineq_072"><alternatives><mml:math>
<mml:mn>95</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$95\% $]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_infor473_ineq_073"><alternatives><mml:math>
<mml:mn>70</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$70\% $]]></tex-math></alternatives></inline-formula> by step <inline-formula id="j_infor473_ineq_074"><alternatives><mml:math>
<mml:mn>5</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$5\% $]]></tex-math></alternatives></inline-formula>. The total number of new class assignments has been calculated. Also, a counter has been used to determine how many times the same text changed its class (once, twice, three times, or four times) over all steps of the SOM size reduction. The results are presented in Fig. <xref rid="j_infor473_fig_010">9</xref>.</p>
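<p>The counting of repeated class changes can be sketched as follows; representing the assignment history as a mapping from a text id to the list of classes over the reduction steps is a hypothetical simplification:</p>

```python
from collections import Counter

def change_counts(history):
    """Count how many times each text changed its class over all SOM
    size-reduction steps; history maps text id -> list of assigned classes."""
    changes = Counter()
    for text_id, classes in history.items():
        changes[text_id] = sum(1 for a, b in zip(classes, classes[1:]) if a != b)
    return changes

# Hypothetical history over four reduction steps.
hist = {"t1": ["Finance", "Finance", "Politics", "Politics"],
        "t2": ["Pandemic", "Finance", "Pandemic", "Finance"]}
```

<p>Here “t2” changes its class three times, the kind of continuous reassignment that makes low limits such as <italic>L</italic> = 70% unsuitable.</p>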
<fig id="j_infor473_fig_010">
<label>Fig. 9</label>
<caption>
<p>Dependence of the numbers of new class assignments on SOM size, <inline-formula id="j_infor473_ineq_075"><alternatives><mml:math>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>40</mml:mn></mml:math><tex-math><![CDATA[$D=40$]]></tex-math></alternatives></inline-formula>.</p>
</caption>
<graphic xlink:href="infor473_g010.jpg"/>
</fig>
<p>As we can see, when the limit is equal to <inline-formula id="j_infor473_ineq_076"><alternatives><mml:math>
<mml:mn>90</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$90\% $]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor473_ineq_077"><alternatives><mml:math>
<mml:mn>85</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$85\% $]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_infor473_ineq_078"><alternatives><mml:math>
<mml:mn>80</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$80\% $]]></tex-math></alternatives></inline-formula>, the number of assignments tends to decrease, so these limits are suitable. If the highest limit of <inline-formula id="j_infor473_ineq_079"><alternatives><mml:math>
<mml:mn>95</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$95\% $]]></tex-math></alternatives></inline-formula> is selected and the SOM size is <inline-formula id="j_infor473_ineq_080"><alternatives><mml:math>
<mml:mn>5</mml:mn>
<mml:mo>×</mml:mo>
<mml:mn>5</mml:mn></mml:math><tex-math><![CDATA[$5\times 5$]]></tex-math></alternatives></inline-formula>, the data is clustered too much, so the number of assignments increases. Obviously, the limits <inline-formula id="j_infor473_ineq_081"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>75</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$L=75\% $]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor473_ineq_082"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>70</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$L=70\% $]]></tex-math></alternatives></inline-formula> are not suitable for the analysed data because when the SOM size is <inline-formula id="j_infor473_ineq_083"><alternatives><mml:math>
<mml:mn>5</mml:mn>
<mml:mo>×</mml:mo>
<mml:mn>5</mml:mn></mml:math><tex-math><![CDATA[$5\times 5$]]></tex-math></alternatives></inline-formula>, the number of new assignments increases greatly. Also, over all steps of the SOM size reduction, some text data classes were changed even four times. The high number of new assignments indicates that data items are assigned to a new class in each step of the SOM size reduction. Such new assignments become pointless, because the text data class is continuously changing. A deeper analysis, for limits from <inline-formula id="j_infor473_ineq_084"><alternatives><mml:math>
<mml:mn>95</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$95\% $]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_infor473_ineq_085"><alternatives><mml:math>
<mml:mn>80</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$80\% $]]></tex-math></alternatives></inline-formula>, has been performed, and the results are presented in Fig. <xref rid="j_infor473_fig_011">10</xref>. The correct assignment ratio in each step has shown that when the limit <inline-formula id="j_infor473_ineq_086"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>95</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$L=95\% $]]></tex-math></alternatives></inline-formula>, only 5 of 70 assignments were “Decline”, the other assignments were “Accept” and “Possible”. If the limit <inline-formula id="j_infor473_ineq_087"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>90</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$L=90\% $]]></tex-math></alternatives></inline-formula>, there are 152 “Accept”, 31 “Decline”, and 113 “Possible” tags. Almost all correct assignment ratios are higher than <inline-formula id="j_infor473_ineq_088"><alternatives><mml:math>
<mml:mn>85</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$85\% $]]></tex-math></alternatives></inline-formula>, and lower only when the SOM size is equal to <inline-formula id="j_infor473_ineq_089"><alternatives><mml:math>
<mml:mn>35</mml:mn>
<mml:mo>×</mml:mo>
<mml:mn>35</mml:mn></mml:math><tex-math><![CDATA[$35\times 35$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_infor473_ineq_090"><alternatives><mml:math>
<mml:mn>30</mml:mn>
<mml:mo>×</mml:mo>
<mml:mn>30</mml:mn></mml:math><tex-math><![CDATA[$30\times 30$]]></tex-math></alternatives></inline-formula>. As we can see, when the limits are equal to <inline-formula id="j_infor473_ineq_091"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>85</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$L=85\% $]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor473_ineq_092"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>80</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$L=80\% $]]></tex-math></alternatives></inline-formula>, the correct assignment ratio is near <inline-formula id="j_infor473_ineq_093"><alternatives><mml:math>
<mml:mn>80</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$80\% $]]></tex-math></alternatives></inline-formula>; only when the SOM size is equal to <inline-formula id="j_infor473_ineq_094"><alternatives><mml:math>
<mml:mn>5</mml:mn>
<mml:mo>×</mml:mo>
<mml:mn>5</mml:mn></mml:math><tex-math><![CDATA[$5\times 5$]]></tex-math></alternatives></inline-formula>, the ratio decreases to <inline-formula id="j_infor473_ineq_095"><alternatives><mml:math>
<mml:mn>39.13</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$39.13\% $]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor473_ineq_096"><alternatives><mml:math>
<mml:mn>20.22</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$20.22\% $]]></tex-math></alternatives></inline-formula>, respectively.</p>
<fig id="j_infor473_fig_011">
<label>Fig. 10</label>
<caption>
<p>Class assignments reviewed by experts, <inline-formula id="j_infor473_ineq_097"><alternatives><mml:math>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>40</mml:mn></mml:math><tex-math><![CDATA[$D=40$]]></tex-math></alternatives></inline-formula>, where <italic>L</italic> is from <inline-formula id="j_infor473_ineq_098"><alternatives><mml:math>
<mml:mn>95</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$95\% $]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_infor473_ineq_099"><alternatives><mml:math>
<mml:mn>80</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$80\% $]]></tex-math></alternatives></inline-formula>.</p>
</caption>
<graphic xlink:href="infor473_g011.jpg"/>
</fig>
<p>A deeper analysis of the SOM size <inline-formula id="j_infor473_ineq_100"><alternatives><mml:math>
<mml:mn>5</mml:mn>
<mml:mo>×</mml:mo>
<mml:mn>5</mml:mn></mml:math><tex-math><![CDATA[$5\times 5$]]></tex-math></alternatives></inline-formula> has been performed to analyse why the correct assignment ratio decreases so significantly. As can be seen (Fig. <xref rid="j_infor473_fig_012">11</xref>), the highest number of “Decline” decisions occurs when trying to assign the “Law enforcement” class. The analysis of texts where a class was assigned incorrectly showed that one common reason influencing the results is the presence of certain ambiguous words in the Lithuanian texts. For example, the word “research” often appears in both law enforcement and innovation contexts: it can indicate criminal situations and law enforcement investigations, but it can also refer to scientific research in the “Innovation” class. One possible solution is to add such words to the stop word list, although they may still be informative in some situations.</p>
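As a minimal illustration of the stop-word solution mentioned above (the token list, base stop words, and function name are invented for the example; this is not the authors' pre-processing pipeline), a domain-ambiguous word such as “research” can simply be appended to the stop word list before the texts are vectorized:

```python
# Basic stop-word filtering; extending the list with domain-ambiguous
# words such as "research" removes them before LSA/SOM training.
base_stop_words = {"the", "a", "and"}   # illustrative base list
domain_ambiguous = {"research"}         # words shared across several classes

def filter_tokens(tokens, extra_stop_words=frozenset()):
    stop = base_stop_words | set(extra_stop_words)
    return [t for t in tokens if t.lower() not in stop]

tokens = ["The", "research", "covers", "criminal", "investigations"]
print(filter_tokens(tokens))                    # keeps "research"
print(filter_tokens(tokens, domain_ambiguous))  # drops "research"
```

The trade-off noted in the text applies: removing the word globally also discards it in contexts where it would have been a useful class indicator.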
<fig id="j_infor473_fig_012">
<label>Fig. 11</label>
<caption>
<p>The distribution of new class assignments, when the SOM size is <inline-formula id="j_infor473_ineq_101"><alternatives><mml:math>
<mml:mn>5</mml:mn>
<mml:mo>×</mml:mo>
<mml:mn>5</mml:mn></mml:math><tex-math><![CDATA[$5\times 5$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor473_ineq_102"><alternatives><mml:math>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>40</mml:mn></mml:math><tex-math><![CDATA[$D=40$]]></tex-math></alternatives></inline-formula>.</p>
</caption>
<graphic xlink:href="infor473_g012.jpg"/>
</fig>
<p>The correct assignment ratio over all steps of the SOM size reduction is presented in Fig. <xref rid="j_infor473_fig_013">12</xref>. As can be seen, with each reduction of the SOM size the correct assignment ratio gradually decreases, while the number of assignments increases. When the limit percent <inline-formula id="j_infor473_ineq_103"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>80</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$L=80\% $]]></tex-math></alternatives></inline-formula> is selected, the correct assignments ratio is equal to <inline-formula id="j_infor473_ineq_104"><alternatives><mml:math>
<mml:mn>76.46</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$76.46\% $]]></tex-math></alternatives></inline-formula>, and the number of assignments is 1546, which means that approximately <inline-formula id="j_infor473_ineq_105"><alternatives><mml:math>
<mml:mn>13</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$13\% $]]></tex-math></alternatives></inline-formula> of the data item classes have been changed. In this case, 364 of the 1546 multi-label text data classes have been assigned incorrectly. The rest of the assigned classes have been tagged as “Accept” (619) and “Possible” (563). The highest correct assignment ratio is obtained when the limit is equal to <inline-formula id="j_infor473_ineq_106"><alternatives><mml:math>
<mml:mn>95</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$95\% $]]></tex-math></alternatives></inline-formula>, but then the class of the text data has been changed only 70 times. The ratio value is high, so these class assignments can be considered highly reliable.</p>
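Treating “Accept” and “Possible” as correct, the reported ratio for the limit of 80% can be reproduced directly from the expert counts given above (a small illustrative calculation, not part of the authors' code):

```python
# Expert review counts for L = 80%, D = 40 (taken from the text above).
tags = {"Accept": 619, "Possible": 563, "Decline": 364}

total = sum(tags.values())                   # 1546 new assignments in total
correct = tags["Accept"] + tags["Possible"]  # assignments judged correct
ratio = 100 * correct / total
print(f"{ratio:.2f}%")  # 76.46%
```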
<fig id="j_infor473_fig_013">
<label>Fig. 12</label>
<caption>
<p>The correct assignment ratio over all steps of the proposed approach, <inline-formula id="j_infor473_ineq_107"><alternatives><mml:math>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>40</mml:mn></mml:math><tex-math><![CDATA[$D=40$]]></tex-math></alternatives></inline-formula>.</p>
</caption>
<graphic xlink:href="infor473_g013.jpg"/>
</fig>
<p>The experimental investigation has shown that the optimal limit percent is equal to <inline-formula id="j_infor473_ineq_108"><alternatives><mml:math>
<mml:mn>85</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$85\% $]]></tex-math></alternatives></inline-formula> because the correct assignments ratio is more than <inline-formula id="j_infor473_ineq_109"><alternatives><mml:math>
<mml:mn>82</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$82\% $]]></tex-math></alternatives></inline-formula>: just 155 of the 865 assignments were incorrect. The results of the proposed approach could best be improved by manually preparing keywords of the analysed data for each class. By selecting non-overlapping words between the different classes, the new class assignment ratio could be increased.</p>
</sec>
<sec id="j_infor473_s_011">
<label>4.3</label>
<title>Discussion</title>
<p>A comprehensive experimental investigation has been performed using one Lithuanian multi-label text dataset, and the usability of the proposed approach has been experimentally demonstrated. The analysed data has been chosen for the following reasons: the data size (to test the proposed approach on a larger amount of text data); the text data is not in English (English is usually well supported by various methods and is structurally simpler); the data must be multi-label (one data item belongs to more than one class). An experimental investigation of the same scale using other multi-label text data has not been performed because data with similar properties have not been found: the datasets in the various freely accessible repositories are usually small or artificially created. Primary research has shown that the chosen language does not significantly influence the obtained results; the concept of the model remains the same. Therefore, the proposed approach can be used to adjust multi-label text data classes in any language. There is also a limitation on how many classes one item of the analysed data needs to have.</p>
<p>When analysing data with other properties, for example, text data with more classes, a different language, or different text lengths, the parameters used in the proposed approach should be tuned to the specifics of the text data. For example, an English stemming algorithm should be used to analyse an English text, and if the texts are much longer, the minimum frequency of the words included may also be set higher than three, etc. Differently pre-processed data may affect the parameters of the LSA and SOM algorithms (the reduced data dimensionality may be higher, while the SOM size reduction may start from a smaller/larger SOM size). Each newly proposed approach has its limitations and threats, but the results of the experimental study are promising and will be used in future work on classification tasks.</p>
</sec>
</sec>
<sec id="j_infor473_s_012">
<label>5</label>
<title>Conclusions</title>
<p>Multi-label text class verification and adjustment is a complex task because many different factors can influence the final results, such as language specificity, the natural language pre-processing methods, and the model used to assign the classes. Nowadays, much research focuses on multi-label text data classification using conventional machine learning or deep learning algorithms, but little effort is made to improve the quality of the data itself. In text analysis, many human factors are involved in the preparation of the data, and when text data is labelled manually, errors or inconsistencies are sometimes unavoidable. Therefore, the main aim of the proposed approach is to improve the quality of the data.</p>
<p>The experimental investigation has proved that the proposed approach can be used for multi-label text class adjustment and verification. The main steps of the proposed approach are as follows: 1) data are pre-processed; 2) LSA is used to reduce the dimensionality of the data; 3) the most frequent words in the texts of each class are collected; 4) SOM is used to detect similarities between texts; 5) each data item is assigned to a class according to SOM and cosine similarity distance; 6) new classes are verified and adjusted by experts. The dimensionality reduction analysis using LSA has shown that the highest number of new assignments is made when the dimensionality is reduced to <inline-formula id="j_infor473_ineq_110"><alternatives><mml:math>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>40</mml:mn></mml:math><tex-math><![CDATA[$D=40$]]></tex-math></alternatives></inline-formula>. The SOM visualization obtained has shown the distribution of the analysed data on the map, as well as the relationships between the data items. Data items of different classes fall into some cells of the SOM, which shows the similarity between them, so appropriate decisions must be made when adjusting or verifying the class. The class has been changed according to the dominant class in the SOM cell, either by assigning the text data to an additional class or by changing one of the previous classes to a new one. To decide which class has to be changed, the cosine similarity distance was calculated. When the dominant class limit <inline-formula id="j_infor473_ineq_111"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>95</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$L=95\% $]]></tex-math></alternatives></inline-formula>, only 70 new assignments were made, and just 5 of them were incorrect. The results have been verified by experts. By decreasing the limit <italic>L</italic> of the dominant class, the number of assignments increases, but the correct class assignment ratio decreases. A total of 1546 new assignments have been made when the dominant class limit <inline-formula id="j_infor473_ineq_112"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>80</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$L=80\% $]]></tex-math></alternatives></inline-formula>, and the correct assignment ratio is equal to 76.46%.</p>
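Two of the numeric steps listed above, the LSA dimensionality reduction and the cosine similarity used to decide which class to change, can be sketched as follows (a minimal sketch on a synthetic document-term matrix; it is not the authors' implementation and omits pre-processing and SOM training):

```python
import numpy as np

def lsa_reduce(doc_term, d):
    """Latent semantic analysis via truncated SVD: project the documents
    onto the d strongest latent dimensions."""
    U, s, Vt = np.linalg.svd(doc_term, full_matrices=False)
    return U[:, :d] * s[:d]

def cosine_similarity(a, b):
    """Cosine similarity between two vectors in the reduced LSA space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
X = rng.random((100, 500))   # 100 documents, 500 terms (synthetic data)
Z = lsa_reduce(X, d=40)      # reduced representation, D = 40
print(Z.shape)               # (100, 40)
# Similarities such as this one drive the decision which class to change.
print(-1.0 <= cosine_similarity(Z[0], Z[1]) <= 1.0)
```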
<p>Deeper research has shown that the automatic extraction of the most frequent words from each class has both advantages and disadvantages. The advantage is that the researcher does not have to worry about the context of the data; on the other hand, the proposed approach would probably be more accurate with a word list for each class manually extracted and verified by experts. More detailed research should be performed in future work to test this hypothesis.</p>
</sec>
</body>
<back>
<ref-list id="j_infor473_reflist_001">
<title>References</title>
<ref id="j_infor473_ref_001">
<mixed-citation publication-type="chapter"><string-name><surname>Aggarwal</surname>, <given-names>C.C.</given-names></string-name>, <string-name><surname>Zhai</surname>, <given-names>C.</given-names></string-name> (<year>2012</year>). <chapter-title>A survey of text clustering algorithms</chapter-title>. In: <source>Mining Text Data</source>, pp. <fpage>77</fpage>–<lpage>128</lpage>. <publisher-name>Springer</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_002">
<mixed-citation publication-type="chapter"><string-name><surname>Ahmed</surname>, <given-names>N.A.</given-names></string-name>, <string-name><surname>Shehab</surname>, <given-names>M.A.</given-names></string-name>, <string-name><surname>Al-Ayyoub</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Hmeidi</surname>, <given-names>I.</given-names></string-name> (<year>2015</year>). <chapter-title>Scalable multi-label arabic text classification</chapter-title>. In: <source>2015 6th International Conference on Information and Communication Systems (ICICS)</source>, pp. <fpage>212</fpage>–<lpage>217</lpage>. <publisher-name>IEEE</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_003">
<mixed-citation publication-type="journal"><string-name><surname>Aly</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Almotairi</surname>, <given-names>S.</given-names></string-name> (<year>2020</year>). <article-title>Deep convolutional self-organizing map network for robust handwritten digit recognition</article-title>. <source>IEEE Access</source>, <volume>8</volume>, <fpage>107035</fpage>–<lpage>107045</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_004">
<mixed-citation publication-type="chapter"><string-name><surname>Bhuiyan</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Ara</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Bardhan</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Islam</surname>, <given-names>M.R.</given-names></string-name> (<year>2017</year>). <chapter-title>Retrieving YouTube video by sentiment analysis on user comment</chapter-title>. In: <source>2017 IEEE International Conference on Signal and Image Processing Applications (ICSIPA)</source>, pp. <fpage>474</fpage>–<lpage>478</lpage>. <publisher-name>IEEE</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_005">
<mixed-citation publication-type="journal"><string-name><surname>Blei</surname>, <given-names>D.M.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>A.Y.</given-names></string-name>, <string-name><surname>Jordan</surname>, <given-names>M.I.</given-names></string-name> (<year>2003</year>). <article-title>Latent dirichlet allocation</article-title>. <source>Journal of Machine Learning Research</source>, <volume>3</volume>, <fpage>993</fpage>–<lpage>1022</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_006">
<mixed-citation publication-type="journal"><string-name><surname>Blum</surname>, <given-names>M.G.</given-names></string-name>, <string-name><surname>Nunes</surname>, <given-names>M.A.</given-names></string-name>, <string-name><surname>Prangle</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Sisson</surname>, <given-names>S.A.</given-names></string-name>, <etal>et al.</etal> (<year>2013</year>). <article-title>A comparative review of dimension reduction methods in approximate Bayesian computation</article-title>. <source>Statistical Science</source>, <volume>28</volume>(<issue>2</issue>), <fpage>189</fpage>–<lpage>208</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_007">
<mixed-citation publication-type="journal"><string-name><surname>Demšar</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Curk</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Erjavec</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Gorup</surname>, <given-names>Č.</given-names></string-name>, <string-name><surname>Hočevar</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Milutinovič</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Možina</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Polajnar</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Toplak</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Starič</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Štajdohar</surname> <given-names>M.</given-names></string-name>, <string-name><surname>Umek</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Žagar</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Žbontar</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Žitnik</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Zupan</surname>, <given-names>B.</given-names></string-name> (<year>2013</year>). <article-title>Orange: data mining toolbox in Python</article-title>. <source>Journal of Machine Learning Research</source>, <volume>14</volume>(<issue>1</issue>), <fpage>2349</fpage>–<lpage>2353</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_008">
<mixed-citation publication-type="journal"><string-name><surname>Dumais</surname>, <given-names>S.T.</given-names></string-name> (<year>2004</year>). <article-title>Latent semantic analysis</article-title>. <source>Annual Review of Information Science and Technology</source>, <volume>38</volume>(<issue>1</issue>), <fpage>188</fpage>–<lpage>230</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_009">
<mixed-citation publication-type="journal"><string-name><surname>Dzemyda</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Kurasova</surname>, <given-names>O.</given-names></string-name> (<year>2002</year>). <article-title>Comparative analysis of the graphical result presentation in the SOM software</article-title>. <source>Informatica</source>, <volume>13</volume>(<issue>3</issue>), <fpage>275</fpage>–<lpage>286</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_010">
<mixed-citation publication-type="journal"><string-name><surname>Hernández-Alvarez</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Gomez</surname>, <given-names>J.M.</given-names></string-name> (<year>2016</year>). <article-title>Survey about citation context analysis: Tasks, techniques, and resources</article-title>. <source>Natural Language Engineering</source>, <volume>22</volume>(<issue>3</issue>), <fpage>327</fpage>–<lpage>349</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_011">
<mixed-citation publication-type="journal"><string-name><surname>Hmeidi</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Al-Ayyoub</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Mahyoub</surname>, <given-names>N.A.</given-names></string-name>, <string-name><surname>Shehab</surname>, <given-names>M.A.</given-names></string-name> (<year>2016</year>). <article-title>A lexicon based approach for classifying Arabic multi-labeled text</article-title>. <source>International Journal of Web Information Systems</source>, <volume>12</volume>(<issue>4</issue>), <fpage>504</fpage>–<lpage>532</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_012">
<mixed-citation publication-type="other"><string-name><surname>Jocas</surname>, <given-names>D.</given-names></string-name> (<year>2020</year>). <italic>Lithuanian Stemming Algorithm</italic>. <uri>https://snowballstem.org/algorithms/lithuanian/stemmer.html</uri>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_013">
<mixed-citation publication-type="other"><string-name><surname>Joulin</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Grave</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Bojanowski</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Mikolov</surname>, <given-names>T.</given-names></string-name> (<year>2016</year>). Bag of tricks for efficient text classification. arXiv preprint <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/arXiv:1607.01759">arXiv:1607.01759</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_014">
<mixed-citation publication-type="journal"><string-name><surname>Kapočiūtė-Dzikienė</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Damaševičius</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Woźniak</surname>, <given-names>M.</given-names></string-name> (<year>2019</year>). <article-title>Sentiment analysis of Lithuanian texts using traditional and deep learning approaches</article-title>. <source>Computers</source>, <volume>8</volume>(<issue>1</issue>), <fpage>4</fpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_015">
<mixed-citation publication-type="journal"><string-name><surname>Khan</surname>, <given-names>J.Y.</given-names></string-name>, <string-name><surname>Khondaker</surname>, <given-names>M.T.I.</given-names></string-name>, <string-name><surname>Afroz</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Uddin</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Iqbal</surname>, <given-names>A.</given-names></string-name> (<year>2021</year>). <article-title>A benchmark study of machine learning models for online fake news detection</article-title>. <source>Machine Learning with Applications</source>, <volume>4</volume>, <fpage>100032</fpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_016">
<mixed-citation publication-type="chapter"><string-name><surname>Kharlamov</surname>, <given-names>A.A.</given-names></string-name>, <string-name><surname>Orekhov</surname>, <given-names>A.V.</given-names></string-name>, <string-name><surname>Bodrunova</surname>, <given-names>S.S.</given-names></string-name>, <string-name><surname>Lyudkevich</surname>, <given-names>N.S.</given-names></string-name> (<year>2019</year>). <chapter-title>Social network sentiment analysis and message clustering</chapter-title>. In: <source>International Conference on Internet Science</source>, pp. <fpage>18</fpage>–<lpage>31</lpage>. <publisher-name>Springer</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_017">
<mixed-citation publication-type="journal"><string-name><surname>Kim</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Chung</surname>, <given-names>B.-s.</given-names></string-name>, <string-name><surname>Choi</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Jung</surname>, <given-names>J.-Y.</given-names></string-name>, <string-name><surname>Park</surname>, <given-names>J.</given-names></string-name> (<year>2014</year>). <article-title>Language independent semantic kernels for short-text classification</article-title>. <source>Expert Systems with Applications</source>, <volume>41</volume>(<issue>2</issue>), <fpage>735</fpage>–<lpage>743</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_018">
<mixed-citation publication-type="book"><string-name><surname>Kohonen</surname>, <given-names>T.</given-names></string-name> (<year>2012</year>). <source>Self-Organizing Maps</source>, Vol. <volume>30</volume>. <publisher-name>Springer Science &amp; Business Media</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_019">
<mixed-citation publication-type="chapter"><string-name><surname>Krilavičius</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Medelis</surname>, <given-names>Ž.</given-names></string-name>, <string-name><surname>Kapočiūtė-Dzikienė</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Žalandauskas</surname>, <given-names>T.</given-names></string-name> (<year>2012</year>). <chapter-title>News media analysis using focused crawl and natural language processing: case of Lithuanian news websites</chapter-title>. In: <source>International Conference on Information and Software Technologies</source>, pp. <fpage>48</fpage>–<lpage>61</lpage>. <publisher-name>Springer</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_020">
<mixed-citation publication-type="other"><string-name><surname>LFND</surname></string-name> (2021). <italic>Lithuanian Financial News Dataset (LFND) (multi-labeled)</italic>. <uri>https://www.kaggle.com/pavelstefanovi/lithuanian-financial-news-dataset-multilabeled</uri>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_021">
<mixed-citation publication-type="journal"><string-name><surname>Licen</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Di Gilio</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Palmisani</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Petraccone</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>de Gennaro</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Barbieri</surname>, <given-names>P.</given-names></string-name> (<year>2020</year>). <article-title>Pattern recognition and anomaly detection by self-organizing maps in a multi month e-nose survey at an industrial site</article-title>. <source>Sensors</source>, <volume>20</volume>(<issue>7</issue>), <fpage>1887</fpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_022">
<mixed-citation publication-type="journal"><string-name><surname>López</surname>, <given-names>A.U.</given-names></string-name>, <string-name><surname>Mateo</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Navío-Marco</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Martínez-Martínez</surname>, <given-names>J.M.</given-names></string-name>, <string-name><surname>Gómez-Sanchís</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Vila-Francés</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Serrano-López</surname>, <given-names>A.J.</given-names></string-name> (<year>2019</year>). <article-title>Analysis of computer user behavior, security incidents and fraud using self-organizing maps</article-title>. <source>Computers &amp; Security</source>, <volume>83</volume>, <fpage>38</fpage>–<lpage>51</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_023">
<mixed-citation publication-type="journal"><string-name><surname>Minaee</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Kalchbrenner</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Cambria</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Nikzad</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Chenaghlu</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Gao</surname>, <given-names>J.</given-names></string-name> (<year>2021</year>). <article-title>Deep learning–based text classification: a comprehensive review</article-title>. <source>ACM Computing Surveys (CSUR)</source>, <volume>54</volume>(<issue>3</issue>), <fpage>1</fpage>–<lpage>40</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_024">
<mixed-citation publication-type="journal"><string-name><surname>Nanculef</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Flaounas</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Cristianini</surname>, <given-names>N.</given-names></string-name> (<year>2014</year>). <article-title>Efficient classification of multi-labeled text streams by clashing</article-title>. <source>Expert Systems with Applications</source>, <volume>41</volume>(<issue>11</issue>), <fpage>5431</fpage>–<lpage>5450</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_025">
<mixed-citation publication-type="journal"><string-name><surname>Park</surname>, <given-names>C.H.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>M.</given-names></string-name> (<year>2008</year>). <article-title>On applying linear discriminant analysis for multi-labeled problems</article-title>. <source>Pattern Recognition Letters</source>, <volume>29</volume>(<issue>7</issue>), <fpage>878</fpage>–<lpage>887</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_026">
<mixed-citation publication-type="chapter"><string-name><surname>Ramage</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Hall</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Nallapati</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Manning</surname>, <given-names>C.D.</given-names></string-name> (<year>2009</year>). <chapter-title>Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora</chapter-title>. In: <source>Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing</source>, pp. <fpage>248</fpage>–<lpage>256</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_027">
<mixed-citation publication-type="journal"><string-name><surname>Stefanovič</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Kurasova</surname>, <given-names>O.</given-names></string-name> (<year>2011</year>). <article-title>Visual analysis of self-organizing maps</article-title>. <source>Nonlinear Analysis: Modelling and Control</source>, <volume>16</volume>(<issue>4</issue>), <fpage>488</fpage>–<lpage>504</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_028">
<mixed-citation publication-type="journal"><string-name><surname>Stefanovič</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Kurasova</surname>, <given-names>O.</given-names></string-name> (<year>2014</year>). <article-title>Creation of text document matrices and visualization by self-organizing map</article-title>. <source>Information Technology and Control</source>, <volume>43</volume>(<issue>1</issue>), <fpage>37</fpage>–<lpage>46</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_029">
<mixed-citation publication-type="journal"><string-name><surname>Stefanovic</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Kurasova</surname>, <given-names>O.</given-names></string-name> (<year>2014</year>). <article-title>Investigation on learning parameters of self-organizing maps</article-title>. <source>Baltic Journal of Modern Computing</source>, <volume>2</volume>(<issue>2</issue>), <fpage>45</fpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_030">
<mixed-citation publication-type="journal"><string-name><surname>Stefanovič</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Kurasova</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Štrimaitis</surname>, <given-names>R.</given-names></string-name> (<year>2019</year>). <article-title>The n-grams based text similarity detection approach using self-organizing maps and similarity measures</article-title>. <source>Applied Sciences</source>, <volume>9</volume>(<issue>9</issue>), <fpage>1870</fpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_031">
<mixed-citation publication-type="journal"><string-name><surname>Štrimaitis</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Stefanovič</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Ramanauskaitė</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Slotkienė</surname>, <given-names>A.</given-names></string-name> (<year>2021</year>). <article-title>Financial context news sentiment analysis for the Lithuanian language</article-title>. <source>Applied Sciences</source>, <volume>11</volume>(<issue>10</issue>), <fpage>4443</fpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_032">
<mixed-citation publication-type="chapter"><string-name><surname>Ueda</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Saito</surname>, <given-names>K.</given-names></string-name> (<year>2003</year>). <chapter-title>Parametric mixture models for multi-labeled text</chapter-title>. In: <source>Advances in Neural Information Processing Systems</source>, pp. <fpage>737</fpage>–<lpage>744</lpage>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_033">
<mixed-citation publication-type="book"><string-name><surname>Ultsch</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Siemon</surname>, <given-names>H.P.</given-names></string-name> (<year>1989</year>). <source>Exploratory Data Analysis: Using Kohonen Networks on Transputers</source>. <publisher-name>Univ., FB Informatik</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor473_ref_034">
<mixed-citation publication-type="journal"><string-name><surname>Yoshioka</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Dozono</surname>, <given-names>H.</given-names></string-name> (<year>2018</year>). <article-title>The classification of the documents based on Word2Vec and 2-layer self organizing maps</article-title>. <source>International Journal of Machine Learning and Computing</source>, <volume>8</volume>(<issue>3</issue>), <fpage>252</fpage>–<lpage>255</lpage>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
