<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">INFORMATICA</journal-id>
<journal-title-group><journal-title>Informatica</journal-title></journal-title-group>
<issn pub-type="epub">1822-8844</issn><issn pub-type="ppub">0868-4952</issn><issn-l>0868-4952</issn-l>
<publisher>
<publisher-name>Vilnius University</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">INFOR527</article-id>
<article-id pub-id-type="doi">10.15388/23-INFOR527</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Research Article</subject></subj-group></article-categories>
<title-group>
<article-title>Automatically Assembling a Custom-Built Training Corpus for Improving the Learning of In-Domain Word/Document Embeddings</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-1816-1377</contrib-id>
<name><surname>Blanco-Fernández</surname><given-names>Yolanda</given-names></name><email xlink:href="yolanda@det.uvigo.es">yolanda@det.uvigo.es</email><xref ref-type="aff" rid="j_infor527_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref><bio>
<p><bold>Y. Blanco-Fernández</bold> obtained her PhD in telecommunications engineering from the University of Vigo in 2007 and currently serves as an associate professor at the same institution. Her research focuses on semantic reasoning in personalization systems, wireless ad hoc networks for mobile devices, machine learning, deep learning models, and natural language processing. She has authored 50+ JCR-indexed journal articles, 13 book chapters, and presented 93 communications at international conferences. She has also advised 5 doctoral theses and has contributed to 30+ competitively funded projects, both nationally and internationally (including H2020 and FP7). She has also engaged in 4 technology transfer contracts. Since 2021, she has held the position of deputy director at the Research Center for Telecommunication Technologies (atlanTTic).</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Gil-Solla</surname><given-names>Alberto</given-names></name><email xlink:href="agil@det.uvigo.es">agil@det.uvigo.es</email><xref ref-type="aff" rid="j_infor527_aff_001">1</xref><bio>
<p><bold>A. Gil-Solla</bold> holds a degree in telecommunication engineering (1991) and earned his PhD in telecommunication (2000) from the University of Vigo. Currently, he holds the position of professor at the same institution, where he teaches in the Telecommunication programme. He has advised 5 PhD theses and supervised more than 20 undergraduate theses. He is a member of the Group of Services of the Information Society, which is part of the Department of Telematic Engineering at the University of Vigo. Throughout his career, he has been involved in over 40 national and international research projects, including FP7 and H2020 initiatives, with many of them being carried out in collaboration with industrial partners. His research interests focus on the design and development of intelligent systems for personalization of Internet and mobile applications. This includes automatic content recommendation, particularly utilizing Natural Language Processing techniques and other Machine Learning approaches involving neural networks. He has authored over 50 publications in journals indexed in the JCR, as well as more than 60 presentations at international conferences.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Pazos-Arias</surname><given-names>José J.</given-names></name><email xlink:href="jose@det.uvigo.es">jose@det.uvigo.es</email><xref ref-type="aff" rid="j_infor527_aff_001">1</xref><bio>
<p><bold>J.J. Pazos-Arias</bold> is a telecommunications engineer (1987) and holds a PhD in telecommunications engineering (1995) from the Polytechnic University of Madrid. He joined the University of Vigo in 1988 and has held the position of professor since 2009 in the Department of Telematics Engineering. Since June 2016, he has been a Numerary Academician of the Royal Academy of Sciences of Galicia. He co-authored 75 articles in JCR-indexed journals, contributed to 21 chapters in internationally recognized books, and presented over 150 communications at international congresses. Additionally, he has edited 2 books of research monographs and served as the principal investigator in 4 out of 5 projects of the National R&amp;D Plan in which he participated. He has also advised 9 doctoral theses. He has been involved in numerous projects funded through competitive calls. Over the past decade, he has participated or is currently participating in two projects of the EU H2020 program, one project of the 7th EU Framework Program, one project under the Erasmus+ program, 4 projects funded through competitive national calls in collaboration with European partners, several regional projects, and 13 collaborative projects with companies funded through competitive national calls. He assumed the role of principal investigator in many of these projects. In terms of transferring research results, he has contributed to more than 30 technology transfer contracts and 17 contracts for training courses. Additionally, he has been the person responsible for overseeing many of these activities. In 1995, he founded the Information Society Services Group (GSSI) and continues to serve as its head. This group has attained the classification of a Reference Group within the R&amp;D system of the Galician region, securing significant stable funding not tied to specific projects. He is also a member of the Research Center for Telecommunication Technologies (atlanTTic).</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Quisi-Peralta</surname><given-names>Diego</given-names></name><email xlink:href="dquisi@ups.edu.ec">dquisi@ups.edu.ec</email><xref ref-type="aff" rid="j_infor527_aff_002">2</xref><bio>
<p><bold>D. Quisi-Peralta</bold> received a degree in computer systems engineering from the Universidad Politécnica Salesiana (Ecuador) in 2013. Furthermore, he obtained a master’s degree in advanced computer technologies from the University of Castilla-La Mancha (Spain) in 2015, as well as a master’s in Strategic Management of Communication Technologies from the University of Cuenca (Ecuador) in 2017. Currently, he is a PhD student at the School of Telecommunications Engineering at the University of Vigo (Spain). He works as the director of the ICT department at Livingnet and serves as an external researcher for PUCE (Pontificia Universidad Católica del Ecuador) and UPS (Universidad Politécnica Salesiana). His research interests encompass the application of AI technologies, data mining, ontologies, large language models, computer vision, and application development.</p></bio>
</contrib>
<aff id="j_infor527_aff_001"><label>1</label>atlanTTic Research Center for Telecommunication Technologies, <institution>University of Vigo</institution>, <country>Spain</country></aff>
<aff id="j_infor527_aff_002"><label>2</label><institution>Universidad Politécnica Salesiana de Cuenca</institution>, <country>Ecuador</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2023</year></pub-date><pub-date pub-type="epub"><day>8</day><month>9</month><year>2023</year></pub-date><volume>34</volume><issue>3</issue><fpage>491</fpage><lpage>527</lpage><history><date date-type="received"><month>3</month><year>2023</year></date><date date-type="accepted"><month>8</month><year>2023</year></date></history>
<permissions><copyright-statement>© 2023 Vilnius University</copyright-statement><copyright-year>2023</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Embedding models represent words and documents as real-valued vectors learned from word co-occurrence statistics in large collections of unrelated texts. Crafting domain-specific embeddings is challenging because general corpora contain few occurrences of the specialised vocabulary of a given domain. Existing solutions retrain models on small, hand-gathered domain datasets, overlooking the potential of automatically gathering rich collections of in-domain texts. We exploit Named Entity Recognition and Doc2Vec to create an in-domain training corpus autonomously. Our experiments compare models learned from general and in-domain corpora, highlighting that domain-specific training attains the best outcome.</p>
</abstract>
<kwd-group>
<label>Key words</label>
<kwd>embedding models</kwd>
<kwd>Named Entity Recognition</kwd>
<kwd>Doc2Vec</kwd>
<kwd>ad hoc corpus</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="j_infor527_s_001">
<label>1</label>
<title>Introduction</title>
<p>The learning of word embeddings has gained momentum in many Natural Language Processing (NLP) applications, ranging from text document summarisation (Mohd <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_057">2020</xref>), fake news detection (Faustini and Covões, <xref ref-type="bibr" rid="j_infor527_ref_025">2017</xref>; Silva <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_074">2020</xref>) and term similarity measurement (Lastra <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_046">2019</xref>; Gali <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_027">2019</xref>) to sentiment classification (Rezaeinia <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_073">2019</xref>; Giatsoglou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_031">2017</xref>; Park <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_060">2021</xref>), edutainment (Blanco <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_015">2020</xref>), Named Entity Recognition (Turian <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_080">2010</xref>; Gutiérrez-Batista <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_034">2018</xref>), classification tasks (Jung <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_039">2022</xref>) and personalization systems (Valcarce <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_081">2019</xref>), just to name a few. The most popular methods consider a large corpus of texts and represent each word with a real-valued dense vector, which captures its meaning assuming that words sharing common contexts in the input corpus are semantically related to each other (and consequently their respective word vectors are close in the vector space) (Mikolov <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_056">2013b</xref>). 
Drawing inspiration from such word representations, in recent years document embeddings have emerged as a natural extension of word embeddings, by mapping variable-length documents (sentences, paragraphs or full documents) to vector representations. Their effectiveness has been remarkable in a wide diversity of tasks, such as text classification and sentiment analysis (Fu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_026">2018</xref>; Le and Mikolov, <xref ref-type="bibr" rid="j_infor527_ref_048">2014</xref>; Bansal and Srivastava, <xref ref-type="bibr" rid="j_infor527_ref_007">2019</xref>), multi-document summarisation (Lamsiyah <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_045">2021</xref>; Rani and Lobiyal, <xref ref-type="bibr" rid="j_infor527_ref_070">2022</xref>), forum question duplication (Lau and Baldwin, <xref ref-type="bibr" rid="j_infor527_ref_047">2016</xref>), document similarity (Dai <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_022">2020</xref>), sentence pair similarity (Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_019">2019</xref>), and even semantic relatedness and paraphrase detection (Logeswaran and Lee, <xref ref-type="bibr" rid="j_infor527_ref_051">2018</xref>).</p>
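<p>To make the notion of vector closeness concrete, the following sketch computes the cosine similarity commonly used to compare embeddings. It is plain Python over hand-picked, hypothetical 4-dimensional vectors; real models learn hundreds of dimensions from co-occurrence statistics:</p>

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: words sharing contexts end up with nearby vectors.
v_spain  = [0.9, 0.1, 0.30, 0.20]
v_madrid = [0.8, 0.2, 0.35, 0.25]
v_banana = [0.1, 0.9, 0.05, 0.70]

print(cosine_similarity(v_spain, v_madrid))  # high: related words
print(cosine_similarity(v_spain, v_banana))  # much lower: unrelated words
```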
<p>Mostly-adopted approaches to word and document embeddings leverage unsupervised learning methods from large collections of unlabelled documents that serve as training corpora in considering word-word co-occurrences. In the literature, commonly-used corpora compile a huge number of unrelated texts, such as a full collection of English Wikipedia, the Associated Press English news articles released from 2009 to 2015,<xref ref-type="fn" rid="j_infor527_fn_001">1</xref><fn id="j_infor527_fn_001"><label><sup>1</sup></label>
<p><uri>https://github.com/jhlau/doc2vec</uri></p></fn> or a dataset of high-quality English paragraphs containing over three billion words (Han <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_035">2013</xref>). Such collections lead to learning general-domain embedding models that do not perform well when working in a very specific domain (for example, on a particular historical event or a concrete medical discipline), whose common vocabulary is unlikely to be included in a generic corpus. As stated in Nooralahzadeh <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor527_ref_058">2018</xref>), “<italic>domain-specific terms are challenging for general domain embeddings since there are few statistical clues in the underlying corpora for these items</italic>”. This idea had also previously emerged from the results attained in Bollegala <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor527_ref_016">2015</xref>), Pilehvar and Collier (<xref ref-type="bibr" rid="j_infor527_ref_067">2016</xref>).</p>
<p>Bearing this limitation in mind, many researchers leveraged the findings of fields such as multi-task learning and transfer learning (Axelrod <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_006">2011</xref>) and adopted a mixed-domain training in two phases: First, a general domain corpus is used to train an embedding model, which is then trained incrementally with specialised documents that are related to the particular domain. Thanks to this continual training, the general knowledge of the first phase can be transferred to the second one in order to be exploited along with the lexical and semantic specificities in that domain (Liu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_050">2015</xref>; Xu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_084">2019</xref>). However, other works concluded that, in domains with abundant unlabelled texts, the domain-specific training is not improved with the transfer from general domains. As a running example in biomedicine, the authors of Gu <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor527_ref_033">2021</xref>) showed that “<italic>domain-specific training from scratch substantially outperforms continual pretraining of generic language models, thus demonstrating that the prevailing assumption in support of mixed-domain pretraining is not always applicable</italic>”.</p>
<p>The benefits of resorting to only a domain-specific training from scratch have also been confirmed in other works (Nooralahzadeh <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_058">2018</xref>; Kim <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_042">2018</xref>; Lau and Baldwin, <xref ref-type="bibr" rid="j_infor527_ref_047">2016</xref>). In particular, the authors of Nooralahzadeh <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor527_ref_058">2018</xref>) concluded that models learned from ad hoc corpora provide “<italic>better results than general domain models for a domain-specific benchmark</italic>”, demonstrating besides that “<italic>constructing domain-specific word embeddings is beneficial even with limited input data</italic>”. In practice, these approaches rely on small-sized ad hoc corpora, whose documents are gathered by hand and indiscriminately from publicly-available sources on the Internet (Chiu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_020">2016</xref>; Nooralahzadeh <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_058">2018</xref>; Gu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_033">2021</xref>; Cano and Morisio, <xref ref-type="bibr" rid="j_infor527_ref_017">2017</xref>). Specifically, to the best of the authors’ knowledge, existing approaches do not consider the relevance of a document (in the particular domain) when deciding whether or not that text should be included in the ad hoc corpus. This process is obviously costly and clearly unfeasible without automatic assistance, in view of the myriad of possible domains/topics (and the huge amount of available documents on each of them). 
However, according to the results achieved in Chiu <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor527_ref_020">2016</xref>), the relevance of the specialised texts chosen as training documents is a critical parameter in the learning of embedding models. Specifically, these researchers handled two ad hoc corpora, each including a different number of in-domain documents about biomedicine. Their results confirmed that the highest quality embedding models were learned from the smallest ad hoc corpus, proving that “<italic>bigger corpora do not necessarily produce better biomedical domain word embeddings</italic>”. In other words, disregarding the suitability of the considered documents in the particular domain may distort the training.</p>
<p>Taking into account the above conditions, the interest and main contributions of the proposed approach can be summarised as follows:</p>
<list>
<list-item id="j_infor527_li_001">
<label>•</label>
<p>Ad hoc corpora enable learning successful embedding models in very specific domains (e.g. medicine, history or chemical engineering, to name a few), where the huge public generic datasets that are usually adopted fail to accurately model the peculiarities (and the particular vocabulary) of these specialised domains.</p>
</list-item>
<list-item id="j_infor527_li_002">
<label>•</label>
<p>The approach described in the paper automatically builds such corpora retrieving large amounts of candidate training texts from Internet sources and incorporating only those that are relevant in the context under consideration. For that purpose, NER facilities (to identify named entities in the input text) and Doc2Vec models (to assess the relationship between the input text and each of the retrieved candidate documents) are exploited.</p>
</list-item>
<list-item id="j_infor527_li_003">
<label>•</label>
<p>The only input required in the approach is a fragment of text representative of the context, making human assistance during the creation of the ad hoc corpus unnecessary and allowing the approach to deal with any topic or domain, however specific it may be.</p>
</list-item>
<list-item id="j_infor527_li_004">
<label>•</label>
<p>The automatic procedure for building the resulting corpus greatly simplifies the hard work associated with traditional manual collection procedures, while providing more and better domain-specific documents.</p>
</list-item>
</list>
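<p>The contributions above can be condensed into a short, deliberately simplified sketch. The entity extractor and the similarity function are naive stand-ins for the NER facilities and Doc2Vec models actually exploited, and the seed text, candidate documents and threshold are hypothetical:</p>

```python
def extract_entities(text):
    """Naive stand-in for a real NER tool: keep capitalised words."""
    stop = {"The", "A", "An", "In", "On", "At"}
    return {w.strip(".,;") for w in text.split()
            if w[:1].isupper() and w.strip(".,;") not in stop}

def jaccard(a, b):
    """Toy stand-in for Doc2Vec similarity: word-set overlap in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def build_ad_hoc_corpus(seed_text, candidates, similarity=jaccard, threshold=0.5):
    """Keep only the candidate documents deemed relevant to the seed text.

    In the real approach, the extracted entities drive the retrieval of
    candidate texts from Internet sources, and a Doc2Vec model scores them.
    """
    entities = extract_entities(seed_text)
    corpus = [doc for doc in candidates if similarity(seed_text, doc) >= threshold]
    return entities, corpus

seed = "Leonidas led the Spartans at Thermopylae"
candidates = ["The Spartans fought at Thermopylae",
              "A recipe for banana bread"]
entities, corpus = build_ad_hoc_corpus(seed, candidates)
```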
<p>This paper is organized as follows: Section <xref rid="j_infor527_s_002">2</xref> explores relevant works within the context of our approach to automatic custom training corpus creation. Section <xref rid="j_infor527_s_006">3</xref> focuses on the procedure devised to gather a set of in-domain candidate texts from the Internet, using a particular history event (the Battle of Thermopylae between Greeks and Persians in 480 BC) to illustrate the approach. The mechanism to validate the selection of <italic>relevant</italic> candidates to be incorporated into the automatically-built tailor-made corpus is detailed in Section <xref rid="j_infor527_s_015">4</xref>. In this section, we also describe the tests conducted to evaluate the consistency of several Doc2Vec models, including a model trained on a general-domain corpus sourced from news articles of the Associated Press (AP), as well as in-domain models learned from ad hoc corpora. Finally, Section <xref rid="j_infor527_s_020">5</xref> concludes the paper and highlights further research directions.</p>
</sec>
<sec id="j_infor527_s_002">
<label>2</label>
<title>Related Work</title>
<p>Our literature review is organized into three main subsections. In Section <xref rid="j_infor527_s_003">2.1</xref>, the focus is on prominent embedding models that lay the foundation for our research, whose learning in specialised scopes requires domain-specific training documents. Since our research contributes to the automatic generation of such kind of ad hoc corpora, Section <xref rid="j_infor527_s_004">2.2</xref> describes relevant related approaches for constructing custom datasets, emphasizing the key distinctions from our procedure. Our algorithm for selecting in-domain training documents starts by identifying named entities in the input text that contextualizes the specific theme for building the ad hoc dataset. Since named entities can have multiple possible interpretations, accurately distinguishing the associated meaning for each entity is crucial in this process. To achieve this, the commonly-adopted approaches to named entity disambiguation will be thoroughly reviewed in Section <xref rid="j_infor527_s_005">2.3</xref>.</p>
<sec id="j_infor527_s_003">
<label>2.1</label>
<title>Embedding Models</title>
<p>The germ of learning language representations using models pre-trained on large collections of unlabelled texts springs from word embeddings such as Word2Vec (Mikolov <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_056">2013b</xref>) and GloVe (Pennington <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_064">2014</xref>). Word2Vec trains a model on the context of each word such that similar words have similar vector representations. Considering word-word co-occurrences, these embeddings capture semantics and meaning-related relationships (enabling, for instance, to detect that two words are similar or opposite, or that the pair of words <italic>Spain</italic> and <italic>Madrid</italic> has an analogous relation to <italic>Canada</italic> and <italic>Ottawa</italic>), along with syntax and grammar-related relationships (<italic>have</italic> and <italic>had</italic> are at the same level as <italic>are</italic> and <italic>were</italic>). Word2Vec is a feed-forward neural network that learns vectors to improve its predictive ability, offering two different models: CBOW (where the goal is to predict a word based on the words in its context) and Skip-Gram (where the aim is to predict surrounding words given an input word) (Khatua <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_040">2019</xref>). This model suffers from two main weaknesses, which are mainly related to the impossibility of (i) dealing with words that do not appear in the training corpus, and (ii) considering different meanings for the same word.</p>
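<p>As a rough illustration of the two training objectives (not of the network internals), the sketch below enumerates the (input, target) pairs each variant would be trained on; the toy sentence and window size are arbitrary choices of this example:</p>

```python
def training_pairs(sentence, window=2, mode="skipgram"):
    """List the (input, target) pairs for one sentence.

    Skip-Gram predicts each context word from the centre word, while
    CBOW predicts the centre word from its whole context window.
    """
    pairs = []
    for i, center in enumerate(sentence):
        context = [sentence[j]
                   for j in range(max(0, i - window),
                                  min(len(sentence), i + window + 1))
                   if j != i]
        if mode == "skipgram":
            pairs.extend((center, c) for c in context)
        else:  # CBOW
            pairs.append((tuple(context), center))
    return pairs

sent = ["greeks", "fought", "persians", "at", "thermopylae"]
print(training_pairs(sent, window=1, mode="skipgram"))
print(training_pairs(sent, window=1, mode="cbow"))
```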
<list>
<list-item id="j_infor527_li_005">
<label>•</label>
<p>Regarding the first limitation, this kind of approach is not able to embed Out-Of-Vocabulary (OOV) words unseen in the training corpus, which makes it impossible to deal with rare/unusual terms and misspellings. The embedding model FastText circumvents the OOV problem by working at character-n-gram level (Armand <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_005">2017</xref>).</p>
</list-item>
<list-item id="j_infor527_li_006">
<label>•</label>
<p>On the other hand, more sophisticated models have emerged in the literature which capture the contextualized meaning of words. In particular, models like ELMo (Peters <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_065">2018</xref>), GPT (Radford <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_069">2019</xref>) and BERT (Devlin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_023">2019</xref>) make it possible to learn contextual relationships among words. For instance, in the sentences “<italic>I have hit my head</italic>” and “<italic>The Head of school sent a message to the students</italic>” the meaning of the word <italic>head</italic> depends on its left context in the first sentence (<italic>I have hit my</italic>) and on the right context in the second one (<italic>of school sent a message to the students</italic>). Bearing this motivating example in mind, some approaches moved from fixed word embeddings to contextualized models that consider the sense of both the previous and following words.</p>
</list-item>
</list>
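<p>FastText's handling of OOV words can be illustrated through the character n-gram extraction it relies on. The helper below is a hypothetical sketch in plain Python, following the published scheme of padding each word with boundary markers before slicing it:</p>

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams in the FastText style: '<' and '>' mark the
    word boundaries so that prefixes and suffixes stay distinguishable."""
    padded = f"<{word}>"
    grams = {padded[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(padded) - n + 1)}
    grams.add(padded)  # FastText also keeps the full word as one unit
    return grams

# An unseen misspelling still shares most n-grams with the correct form,
# so it can be assigned a sensible vector instead of failing as OOV.
print(sorted(char_ngrams("where", 3, 3)))
```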
<p>The main differences between the most popular contextualized embeddings (BERT, ELMo and GPT) are related to architectural internals: While ELMo relies on a Long Short-Term Memory (LSTM) model, GPT, BERT and its variants resort to a Transformer-based architecture (Reimers and Gurevych, <xref ref-type="bibr" rid="j_infor527_ref_072">2019</xref>; Wang and Jay-Kuo, <xref ref-type="bibr" rid="j_infor527_ref_083">2020</xref>). Details of both architectures, which are out of the scope of this paper, can be found in Ethayarajh (<xref ref-type="bibr" rid="j_infor527_ref_024">2019</xref>). Before the emergence of the latest BERT-based document embedding models, the commonly adopted approach was Paragraph Vector (also called Doc2Vec), the natural extension of Word2Vec for learning vector representations for pieces of variable-length texts (sentences, paragraphs and documents) (Le and Mikolov, <xref ref-type="bibr" rid="j_infor527_ref_048">2014</xref>; Lau and Baldwin, <xref ref-type="bibr" rid="j_infor527_ref_047">2016</xref>; Kim <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_042">2018</xref>; Bhattacharya <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_014">2022</xref>). Similar to Word2Vec, Doc2Vec works with two different models: Distributed Memory version of Paragraph Vector (PV-DM) and Distributed Bag Of Words version of Paragraph Vector (PV-DBOW). In both models, the training learns a vector for the input text (called a paragraph vector in Le and Mikolov, <xref ref-type="bibr" rid="j_infor527_ref_048">2014</xref>), considering the Word2Vec embeddings of the individual words or not depending on the approach; the same procedure is replicated in the prediction phase to provide a vector representing the paragraph/document.</p>
<p>The results described in Kim <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor527_ref_042">2018</xref>) highlighted that Doc2Vec outperformed previous sentence embedding methods, ranging from simple approaches that use a weighted average of all the words in the document (Grefenstette <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_032">2013</xref>; Mikolov <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_055">2013a</xref>) to more sophisticated models like Skip-Thought (Kiros <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_043">2015</xref>) and Quick-Thoughts (Kiros <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_044">2018</xref>) that have attained good performance on diverse NLP tasks, such as semantic relatedness, paraphrase detection, image-sentence ranking and question-type classification (Kiros <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_044">2018</xref>). The main features of the above models are summarised in Table <xref rid="j_infor527_tab_001">1</xref>.</p>
<table-wrap id="j_infor527_tab_001">
<label>Table 1</label>
<caption>
<p>Embedding models defined in the literature and their main features.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Embedding model</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Word-level model</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Sentence/document level model</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Contextualized embeddings</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Architecture</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Word2Vec</td>
<td style="vertical-align: top; text-align: left">✓</td>
<td style="vertical-align: top; text-align: left">×</td>
<td style="vertical-align: top; text-align: left">×</td>
<td style="vertical-align: top; text-align: left">Feed-forward neural network</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">GloVe</td>
<td style="vertical-align: top; text-align: left">✓</td>
<td style="vertical-align: top; text-align: left">×</td>
<td style="vertical-align: top; text-align: left">×</td>
<td style="vertical-align: top; text-align: left">Count-based model</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">FastText</td>
<td style="vertical-align: top; text-align: left">✓</td>
<td style="vertical-align: top; text-align: left">×</td>
<td style="vertical-align: top; text-align: left">×</td>
<td style="vertical-align: top; text-align: left">Feed-forward neural network</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">ELMo</td>
<td style="vertical-align: top; text-align: left">✓</td>
<td style="vertical-align: top; text-align: left">×</td>
<td style="vertical-align: top; text-align: left">✓</td>
<td style="vertical-align: top; text-align: left">LSTM</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">GPT</td>
<td style="vertical-align: top; text-align: left">✓</td>
<td style="vertical-align: top; text-align: left">×</td>
<td style="vertical-align: top; text-align: left">✓</td>
<td style="vertical-align: top; text-align: left">Transformer</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">BERT</td>
<td style="vertical-align: top; text-align: left">✓</td>
<td style="vertical-align: top; text-align: left">×</td>
<td style="vertical-align: top; text-align: left">✓</td>
<td style="vertical-align: top; text-align: left">Transformer</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Doc2Vec</td>
<td style="vertical-align: top; text-align: left">✓</td>
<td style="vertical-align: top; text-align: left">✓</td>
<td style="vertical-align: top; text-align: left">✓</td>
<td style="vertical-align: top; text-align: left">Feed-forward neural network</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Skip-Thought</td>
<td style="vertical-align: top; text-align: left">✓</td>
<td style="vertical-align: top; text-align: left">✓</td>
<td style="vertical-align: top; text-align: left">✓</td>
<td style="vertical-align: top; text-align: left">Encoder-decoder</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Quick-Thoughts</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">✓</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">✓</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">✓</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Encoder-decoder</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As noted in Section <xref rid="j_infor527_s_001">1</xref>, many existing embedding models make it possible to fine-tune the training process on a new in-domain dataset for a concrete NLP task. However, as far as the authors of this paper know, no relevant approaches in the literature automatically assemble such a dataset as a custom-built training corpus containing a large number of relevant documents that capture meaningful domain-specific information. Such a corpus would lead to more accurate embedding representations, which could improve the performance of existing (word-level and document-level) models. In this regard, it should be noted that this paper is not about training existing models with an ad hoc corpus in order to compare their respective performances and assess their strengths. Instead, the goal of this research is to devise a semantics-driven mechanism to automatically collect an in-domain dataset and verify that it can lead to models with better performance than the ones trained with a generic corpus. To do so, the presented research considers a particular model (in this case, Doc2Vec) and a very specific application domain (a historical event, the Battle of Thermopylae, as will be described in the validation scenario presented in Section <xref rid="j_infor527_s_015">4</xref>).</p>
<p>Having tested this hypothesis with a Doc2Vec model, the viability and utility of the proposed semantics-driven mechanism are confirmed, and the door is open for its application in different scenarios involving other more sophisticated Transformer-based embeddings defined in the literature. This further goal is beyond the scope of the experimental validation presented in this paper, where the focus is on assessing the quality of a tailor-made training corpus rather than on quantifying the performance of the multiple models that could be learned from it.</p>
</sec>
<sec id="j_infor527_s_004">
<label>2.2</label>
<title>Automatic Creation of Corpora</title>
<p>The web has long been considered a mega corpus with the potential to uncover new information across diverse fields (Crystal, <xref ref-type="bibr" rid="j_infor527_ref_021">2011</xref>; Gatto, <xref ref-type="bibr" rid="j_infor527_ref_029">2014</xref>). Consequently, numerous works in the literature focus on processing web-based corpora to find, extract, or transform information. This trend has intensified with the explosive growth of NLP research over the past decade, leading to extensive efforts in gathering both labelled and raw text corpora for discovering linguistic regularities, generating embeddings, and performing downstream tasks like classification, sentiment analysis, summarisation, or Q&amp;A. Despite this importance, results on automating the generation of such corpora remain scarce: there is hardly any research work devoted to fully automatic corpus creation.</p>
<p>The corpus construction approaches from the web that can be found in the literature primarily aim to support research and professional training in linguistic and translation fields. Typically, their objective is to create corpora for gaining an overview of a given language. Thus, these approaches mainly involve exploring the web to study pages based on their language rather than their content, resulting in general corpora rather than specialised ones. The most prevalent tool in these initiatives is BootCat (Baroni and Bernardini, <xref ref-type="bibr" rid="j_infor527_ref_011">2004</xref>), followed by successive refinements known as the WebBootCat web application (Baroni <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_012">2006</xref>), later marketed as SketchEngine (Kilgarriff <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_041">2014</xref>). This family of tools relies on issuing a general set of search-engine queries involving domain-specific keywords to obtain focused collections of documents. The approach requires users to provide initial seeds (keywords) to begin the search. Subsequently, a large number of queries combining these seeds are issued against search engines like Google, Yahoo, or Bing, and the most relevant results are recovered to form the corpus.</p>
<p>However, these approaches primarily rely on web crawling followed by some linguistic processing to filter inadequate results. They work with literal keywords, querying for documents (web pages) containing those keywords, and repeat the process with the links contained in each page. They do not explore categories to classify the pages nor similarities among documents to establish their relevance for inclusion in the corpus or to trigger new searching processes or URL selection beyond those explicitly contained in the recovered documents. The text gathering procedure is somewhat coarse, and manual refinement is often necessary to guide these tools in the search for appropriate texts for the corpus to achieve a quality output. As a result, these projects lead to relatively small corpora, suitable for studying a language in a teaching environment but insufficient for training neural networks.</p>
<p>Similar approaches are used in projects creating corpora for linguistic and translation purposes, such as specialised health corpora (Symseridou, <xref ref-type="bibr" rid="j_infor527_ref_078">2018</xref>) for instructing translation professionals in required abilities like locating terms, studying collocations, grammar, and syntax. The same approach is followed in Castagnoli (<xref ref-type="bibr" rid="j_infor527_ref_018">2015</xref>) to build corpora for health, law, and cell phone scenarios, and in Lynn <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor527_ref_052">2015</xref>), aimed at constructing linguistic corpora for less commonly used languages (e.g. Irish). In all these cases, the core algorithm is based on web crawling, with language being the main driving constraint for selecting pages to add or continue searching.</p>
<p>The evolution of these approaches has been significantly influenced by the explosive growth of the web and subsequent restrictions imposed by search engines regarding massive querying of the web (Barbaresi, <xref ref-type="bibr" rid="j_infor527_ref_008">2013a</xref>). This led to the exploration of alternative ways to discover documents for the corpus, such as exploring social networks and blogging platforms (Barbaresi, <xref ref-type="bibr" rid="j_infor527_ref_009">2013b</xref>), the Open Directory Project and Wikipedia (Barbaresi, <xref ref-type="bibr" rid="j_infor527_ref_010">2014</xref>), or the Common Crawl platform, a free Internet crawling initiative (Smith <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_075">2013</xref>).</p>
<p>There are also automatic corpus-construction experiences centred not on the whole web but on closed repositories, aimed at filtering relevant documents for specific queries. These projects typically involve structured and well-known repositories, where the goal is to create information corpora restricted to characteristics specified by users, resembling more of a database query than web exploration. Their objectives focus on information analysis to guarantee compliance with restrictions rather than finding more similar texts from the current one. An example is Primpeli <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor527_ref_068">2019</xref>), which focuses on recovering text resources about e-commerce items from the WDC Product Data Corpus, extracted from the Common Crawl data repository. The compiled data, originally attached to product pages by e-commerce companies, forms an automatically created corpus used to group similar products in clusters. The quality of this corpus is confirmed by Peeters <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor527_ref_062">2020</xref>), Peeters and Bizer (<xref ref-type="bibr" rid="j_infor527_ref_061">2021</xref>), where embedding models were trained using such corpora. Zhang and Song (<xref ref-type="bibr" rid="j_infor527_ref_088">2022</xref>) utilizes information extracted from the same sources to feed various processes in the field of NLP, such as embedding generation or model training. Both scenarios share some similarities with ours, as a corpus is created for training Machine Learning models. However, the documents collected in their research come from a single fixed source, an already available corpus containing semantic annotations; the relevance of those documents to a given subject is not measured, and no new texts are discovered from the analysed ones.</p>
<p>Numerous other research works claim automatic corpus creation in Machine Learning settings. For example, Zhang <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor527_ref_087">2021</xref>) creates a new corpus from the Amazon product dataset to train a new BERT model; Abacha and Dina (<xref ref-type="bibr" rid="j_infor527_ref_001">2016</xref>) automatically creates a corpus of equivalent pairs of questions (Textual Entailments) from the National Library of Medicine’s database of questions (USA); Zanzotto and Pennacchiotti (<xref ref-type="bibr" rid="j_infor527_ref_086">2010</xref>) extracts pairs of entailments from Wikipedia by studying the successive historic revisions of some articles; and Zhou <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor527_ref_089">2022</xref>) builds a new corpus of Paraphrase Detection by refining existing corpora like the Stanford Natural Language Inference corpus (SNLI) and the Multi-Genre Natural Language Inference corpus (MNLI). However, to our knowledge, all these approaches focus on processing closed repositories to obtain a new corpus, with no exploration of the web (or large repositories like Wikipedia) to search for new unknown documents. Moreover, no examples exist of studying the relevance of documents for a given subject openly specified by the user; the working theme is already fixed at the creation of the project. In the crawling-based research line, new documents are discovered simply by crawling (with the language constraint), whereas in the closed-repository line no previously unknown document is discovered at all.</p>
<p>In summary, the literature includes several works on the automatic creation of corpora. On the one hand, there is an older research line centred on linguistic and translation fields, where approaches are relatively simple, exploring the web from user-provided seeds, and generally involving simple web crawlers primarily driven by the language of pages. On the other hand, more elaborate approaches are found in specific fields like health or e-commerce, as well as a significant number of cases also centred on Machine Learning model training. However, to our knowledge, all these approaches are focused on processing closed repositories to create a new corpus, with no exploration of the web (or large repositories like Wikipedia) to search for new unknown documents. In neither approach do examples exist of studying the relevance of documents for a given subject openly specified by the user, and no new documents are discovered beyond the initial corpus.</p>
</sec>
<sec id="j_infor527_s_005">
<label>2.3</label>
<title>Named Entity Disambiguation</title>
<p>In the literature, named entity disambiguation (NED) is commonly defined as the process of determining the precise meaning or sense of a named entity within a given context. These named entities can be identified by well-known named entity recognition tools like DBpedia-Spotlight (Mendes <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_054">2011</xref>). More specifically, the goal of NED is to resolve ambiguity by associating the named entity with a specific concept within a semantic knowledge base. Previous studies have addressed the challenge of entity disambiguation through the utilization of statistical methods and rule-based approaches. These works take into account the contextual words surrounding the target named entity during the disambiguation process. However, they often neglect the semantic nuances of words and lack generalizability since the rules are typically specific to certain domains (An <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_004">2020</xref>; Songa <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_076">2019</xref>).</p>
<p>Subsequently, more advanced mechanisms emerged, such as the methods based on entity features. In these approaches, when an entity possesses multiple interpretations, inconsistent entities are filtered out by assessing their semantic similarity. The disambiguation process considers the semantic attributes of the entity, the contextual information surrounding the entity, and even its frequency of occurrence in the processed text. Notably, these methods leverage contextual embedding models, which assign different vector representations to entities with the same spelling based on their specific meanings within each context. To achieve this, entity-feature-based disambiguation methods typically obtain the contextual embedding vector of the target entity. They then calculate the semantic distance between this vector and the embedding vectors of each candidate entity to effectively disambiguate and remove any ambiguous entities (Barrena <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_013">2015</xref>; Zwicklbauer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_090">2016</xref>). However, despite the efficacy of this approach, it disregards the structural characteristics of the knowledge base in which the target entity is situated, such as the interconnections between entities. Consequently, it fails to capture the global semantic features of each entity (Adjali <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_002">2020</xref>). Additionally, this disambiguation method requires a large training corpus to learn an embedding model.</p>
<p>To address this challenge, recent studies have turned to deep neural networks for entity disambiguation. Specifically, neural network-based approaches have gained popularity by incorporating the subgraph structure features of knowledge bases. These features are utilized as inputs to graph neural networks, enabling the disambiguation of entities within the knowledge base (Ma <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_053">2021</xref>). Various methods have been explored, including convolutional and recurrent neural networks, as well as LSTM networks, to disambiguate entities based on extracted associations among them (Geng <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_030">2021</xref>; Phan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_066">2017</xref>).</p>
<p>Transformer-based language models have also demonstrated significant promise in capturing complex linguistic knowledge, leading researchers to employ attention mechanisms to obtain contextual embedding vectors for each entity and consider coherence between entities for joint disambiguation (Ganea and Hofmann, <xref ref-type="bibr" rid="j_infor527_ref_028">2017</xref>). Furthermore, graph neural networks have been trained to acquire entity graph embeddings that encode global semantic features, subsequently transferred to statistical models to address entity ambiguity. While these approaches demonstrate potential in achieving human-level performance in entity disambiguation, they often require substantial amounts of training data and computational resources (Hu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_036">2020</xref>). Therefore, challenges persist in optimizing these models and reducing their reliance on in-domain training datasets, which may not always be readily available, especially in highly specific or specialised domains like those handled in our ad hoc corpus generation approach. In simple terms, these models are not suitable for our purposes because they require training with ad hoc corpora, which is precisely what our work seeks to achieve, that is, automatically gathering collections of in-domain texts that were previously absent in the literature.</p>
<p>While we acknowledge the positive outcomes achieved by existing approaches in entity disambiguation within recent literature, their complexity and requirements, such as domain-specific training datasets and high computational demands, surpass the needs of our ad hoc corpus generation algorithm. In contrast, we employ a simpler yet effective mechanism, as evidenced by the obtained results, to identify the right entities in the given initial text. Details will be given in Section <xref rid="j_infor527_s_008">3.2</xref>.</p>
</sec>
</sec>
<sec id="j_infor527_s_006">
<label>3</label>
<title>How to Build a Domain-Specific Training Corpus</title>
<p>The candidate documents to be incorporated into the automatically-built ad hoc corpus (for training the embedding models) are gathered from the Internet by a procedure that has been implemented in Python and made freely available in a GitHub repository (<uri>https://github.com/gssi-uvigo/Plethora</uri>). Specifically, the approach takes as input an initial text and retrieves Wikipedia articles that have a meaningful relationship with it through an algorithm that can be outlined as follows:</p>
<list>
<list-item id="j_infor527_li_007">
<label>•</label>
<p>First, the NER facilities provided by the DBpedia Spotlight tool are exploited to identify DBpedia named entities present in the input text (denoted as DB-SL entities). In addition, the approach searches for other semantically related entities that share some common features with these DB-SL entities (e.g. semantic topics and categories or wikicats). This step is addressed in Sections <xref rid="j_infor527_s_007">3.1</xref>, <xref rid="j_infor527_s_008">3.2</xref> and <xref rid="j_infor527_s_009">3.3</xref>.</p>
</list-item>
<list-item id="j_infor527_li_008">
<label>•</label>
<p>Next, the goal is to retrieve (and preprocess to remove irrelevant information) Wikipedia articles in which the previously identified entities are mentioned, as described in Sections <xref rid="j_infor527_s_010">3.4</xref> and <xref rid="j_infor527_s_011">3.5</xref>.</p>
</list-item>
<list-item id="j_infor527_li_009">
<label>•</label>
<p>Finally, the retrieved texts that are actually relevant (according to the relatedness measured between each of them and the input text) are incorporated into the ad hoc corpus. As detailed in Section <xref rid="j_infor527_s_012">3.6</xref>, this stage of the algorithm is driven by a semantic similarity metric based on a Doc2Vec embedding model.</p>
</list-item>
</list>
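The three stages above can be read as a single pipeline. The following sketch stubs each stage with toy logic and hypothetical helper names (none of these functions appear in the repository; they merely mirror the structure just described):

```python
def identify_entities(text):
    # Stage 1 (stub): in the real pipeline, DBpedia Spotlight maps
    # surface forms in the input text to DBpedia entity URIs.
    return {w for w in text.split() if w.istitle()}

def expand_entities(entities):
    # Stage 1 (stub): add further entities that share semantic
    # categories or wikicats with the ones already found.
    return entities | {e + "_related" for e in entities}

def retrieve_articles(entities):
    # Stage 2 (stub): fetch and preprocess the Wikipedia article in
    # which each entity is described.
    return {e: f"article about {e}" for e in entities}

def is_relevant(article, seed_text):
    # Stage 3 (stub): the real metric is a Doc2Vec-based semantic
    # similarity between the candidate article and the seed text.
    return True

def build_corpus(seed_text):
    entities = expand_entities(identify_entities(seed_text))
    articles = retrieve_articles(entities)
    return [a for a in articles.values() if is_relevant(a, seed_text)]

corpus = build_corpus("Leonidas fought at Thermopylae")
```

Each stub would be replaced by the corresponding component described in Sections 3.1 to 3.6.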
<p>Before delving into each phase of our algorithm, it is essential to justify the usage of Wikipedia in our research. Specifically, we prioritize retrieving texts from this source due to several compelling advantages: (i) Wikipedia serves as an extensive repository encompassing information about any subject, and it includes entries for relevant individuals, places, or events; (ii) DBpedia provides a wealth of semantic information about these entries, enhancing the depth and context of our analysis; and (iii) there is a well-established and reliable mechanism to follow links between these repositories, and even connect them to others, which facilitates the discovery and retrieval of new documents.</p>
<p>While our approach to constructing ad hoc corpora is equally effective for texts from Wikipedia or any other source, there are additional remarkable features of this information repository that make it particularly suitable: documents cover a wide range of topics and disciplines; these articles are written and reviewed by a committed community of volunteer contributors; and it is constantly updated, reflecting recent advances in different fields of knowledge. Further evidence supporting the quality, representativeness, and significance of the texts within this online encyclopedia is demonstrated by the use of large corpora of articles extracted from Wikipedia in Transformer architectures. These architectures have garnered remarkable achievements in the field of NLP by employing such corpora for pre-training their base models, enabling them to acquire extensive language knowledge and a broad contextual understanding. This initial pre-training phase primes the models before they are fine-tuned for specific NLP tasks, underscoring the significance and value of the texts extracted from Wikipedia in fostering the advancement of sophisticated language models.</p>
<sec id="j_infor527_s_007">
<label>3.1</label>
<title>Strategy and Sources of Information</title>
<p>As the aim is to build an ad hoc corpus, it is necessary to define some way of characterising the topic on which this tailor-made dataset should be based (i.e. a seed describing the context of interest). For that purpose, a short initial text is used, representative of the theme to which all the documents in the corpus should be more or less related. In other words, the goal is to search the Internet for documents with some kind of relationship to this initial text.</p>
<p>Throughout this paper, the following initial text will be used.</p><graphic xlink:href="infor527_g001.jpg"/>
<p>This 1926-character text (hereafter denoted as <inline-formula id="j_infor527_ineq_001"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>) is related to the Battle of Thermopylae between Greeks and Persians in 480 BC (Blanco <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_015">2020</xref>). That is, the aim is to compose a corpus related to the Greco-Persian wars, starting from a brief text about the second Persian invasion of Greece and, specifically, the famous and inspiring Battle of Thermopylae.</p>
<p>The approach conducted in this paper to discover documents leverages the Semantic Web and the Linked Open Data (LOD) (Oliveira <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_059">2017</xref>) initiatives, which form a global repository of interrelated knowledge with a multitude of structured data to study their relationships and obtain new information from the available one. Thus, the core of the procedure is based on identifying relevant entities in the initial text (e.g. people, locations, events…), discovering categories in which these entities are classified and, finally, gathering other entities also classified in those categories. For each entity discovered, its description can be retrieved from the LOD repositories, becoming a new candidate text to be included in the domain-specific corpus.</p>
<p>To delimit this work, the initial source of the data considered is Wikipedia and its structured counterpart, DBpedia (Lehmann <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_049">2012</xref>). Given the vast amount of information available in these repositories, we leverage the existing capability to freely query them through well-known endpoints by means of SPARQL (SPARQL Protocol And RDF Query Language) queries. In particular, SPARQL is a language explicitly designed for retrieving data stored in RDF format through queries to repositories like DBpedia. DBpedia is not an isolated information repository but allows establishing links to other well-known datasets to enhance query results, such as YAGO (Pellissier <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_063">2020</xref>) and WikiData (Ismayilov <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_037">2015</xref>), which are extensively used in this work.<xref ref-type="fn" rid="j_infor527_fn_002">2</xref><fn id="j_infor527_fn_002"><label><sup>2</sup></label>
<p>Of course, even though other sources of information could be easily explored through the appropriate study of their URL formats, it will be shown that the components and tools involved in this research are representative enough of the potentialities of this approach.</p></fn> In sum, SPARQL plays a crucial role in the Semantic Web and Linked Open Data (LOD) initiatives due to its remarkable capabilities in pattern searching and result filtering based on specified conditions, enabling efficient access to information within semantic repositories.</p>
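For illustration, the kind of SPARQL query used to retrieve an entity's description can be issued against DBpedia's public endpoint with only the Python standard library. The entity and endpoint below are examples, and the query assumes the endpoint's predefined `dbo:` prefix; the production code in the repository may differ:

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"  # public DBpedia endpoint

def abstract_query(entity_uri, lang="en"):
    # Build a SPARQL query asking for the dbo:abstract (textual
    # description) of a DBpedia entity, filtered by language.
    return f"""
    SELECT ?abstract WHERE {{
      <{entity_uri}> dbo:abstract ?abstract .
      FILTER (lang(?abstract) = "{lang}")
    }}"""

def run_query(query, endpoint=ENDPOINT):
    # Execute the query and parse the JSON results (needs network
    # access, so it is defined but not called here).
    params = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"})
    with urllib.request.urlopen(f"{endpoint}?{params}") as resp:
        return json.load(resp)

query = abstract_query("http://dbpedia.org/resource/Battle_of_Thermopylae")
```

Each abstract retrieved this way becomes a new candidate text for the domain-specific corpus.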
<p>Regarding the categories in which to classify the DB-SL entities identified in <inline-formula id="j_infor527_ineq_002"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>, it is interesting to highlight two properties that are frequently used in metadata descriptions to link Wikipedia pages to categories (and their members).</p>
<list>
<list-item id="j_infor527_li_010">
<label>•</label>
<p>On the one hand, the <monospace>dct:subject</monospace> property links pages to categories in the Wikipedia categorization system. Each category is usually composed of several words joined by an underscore character (<monospace>Traitors_in_history</monospace>, <monospace>People_of_the_Greco-Persian_Wars</monospace>) and may represent classifications by page contents or by administrative goals (<monospace>Wikipedia_administration_templates</monospace>, <monospace>Articles_with_broken_or_outdated_citations</monospace>…).</p>
</list-item>
<list-item id="j_infor527_li_011">
<label>•</label>
<p>On the other hand, through the <monospace>rdf:type</monospace> property, pages (and the entities they represent) are associated with YAGO wikicats, classes of the YAGO ontology reflecting the Wikipedia categorization system. The format of such wikicats is <monospace>WikicatW1W2</monospace><inline-formula id="j_infor527_ineq_003"><alternatives><mml:math>
<mml:mo>…</mml:mo></mml:math><tex-math><![CDATA[$\dots $]]></tex-math></alternatives></inline-formula><monospace>Wn</monospace>, that is, the <monospace>Wikicat</monospace> string followed by a set of concatenated words, each starting with an uppercase letter and continuing in lowercase (e.g. <monospace>BattlesInvolvingAthens</monospace>, <monospace>LocationsInGreekMythology</monospace>).</p>
</list-item>
</list>
<p>As will be described in the next sections, both properties are exploited in the paper as they are significant sources of information on the subject of a document, thus helping to discover new data and to assess its relevance.</p>
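Both properties can be fetched for an entity with a single SPARQL query. A hedged sketch follows (the entity is chosen for illustration, the `dct:` and `rdf:` prefixes are assumed to be predefined by the endpoint, and wikicats are recognised here by the <monospace>Wikicat</monospace> prefix in their YAGO class name):

```python
def categories_and_wikicats_query(entity_uri):
    # dct:subject links the page to its Wikipedia categories, while
    # rdf:type links it to YAGO classes; among the latter, wikicats
    # carry the "Wikicat" prefix in the local name of the class URI
    # (e.g. .../class/yago/WikicatBattlesInvolvingAthens).
    return f"""
    SELECT ?cat ?type WHERE {{
      OPTIONAL {{ <{entity_uri}> dct:subject ?cat . }}
      OPTIONAL {{
        <{entity_uri}> rdf:type ?type .
        FILTER (strstarts(strafter(str(?type), "class/yago/"), "Wikicat"))
      }}
    }}"""

query = categories_and_wikicats_query(
    "http://dbpedia.org/resource/Battle_of_Thermopylae")
```

The categories and wikicats returned by such a query are what drives the discovery of further related entities in the next stages.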
</sec>
<sec id="j_infor527_s_008">
<label>3.2</label>
<title>Identifying DBpedia Entities in the Initial Text</title>
<p>The identification of the relevant entities present in the initial text <inline-formula id="j_infor527_ineq_004"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> is based on the Named Entity Recognition capabilities provided by DBpedia Spotlight (DB-SL), which make it possible to move from raw text (strings referred to as surface forms) to structured data (the URLs of the corresponding DBpedia entities). Through several sophisticated procedures (such as entity detection, name resolution, candidate disambiguation…), DB-SL establishes the association between surface forms and DBpedia entities depending on the context: the same text can lead to different entities, and different surface forms can lead to the same entity.</p>
<p>To this aim, DB-SL provides both a web application interface available online and a well-known endpoint running an API to programmatically access the service remotely.<xref ref-type="fn" rid="j_infor527_fn_003">3</xref><fn id="j_infor527_fn_003"><label><sup>3</sup></label>
<p><uri>http://model.dbpedia-Spotlight.org/en/annotate</uri></p></fn> This last option is the most interesting one, since the aim of this work is to develop an automatic service that should be as autonomous as possible. However, the official service rejects bulk queries, as it is only provided for testing purposes; to speed up the execution, it is advisable to install a local copy of DBpedia-Spotlight using, for instance, a Docker image provided by its creators.<xref ref-type="fn" rid="j_infor527_fn_004">4</xref><fn id="j_infor527_fn_004"><label><sup>4</sup></label>
<p>Available at <uri>https://hub.docker.com/r/dbpedia/dbpedia-Spotlight</uri></p></fn></p>
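A minimal sketch of querying such a local DB-SL deployment is shown below. The endpoint URL and port are assumptions about how the Docker container is mapped locally, and the sample response only illustrates the shape of Spotlight's JSON output (a <monospace>Resources</monospace> list whose entries carry <monospace>@URI</monospace> and <monospace>@surfaceForm</monospace> fields):

```python
import json
import urllib.parse
import urllib.request

def annotate(text, endpoint="http://localhost:2222/rest/annotate",
             confidence=0.5):
    # Send the text to a DBpedia Spotlight instance and return the
    # parsed JSON annotation (needs a running service, so it is
    # defined but not called here).
    params = urllib.parse.urlencode(
        {"text": text, "confidence": confidence})
    req = urllib.request.Request(f"{endpoint}?{params}",
                                 headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def entity_set(annotation):
    # DE(T0): the set of DBpedia entity URIs found in the text.
    return {r["@URI"] for r in annotation.get("Resources", [])}

# Illustrative annotation in Spotlight's JSON shape, so that the
# extraction step can be shown without a running service:
sample = {"Resources": [
    {"@URI": "http://dbpedia.org/resource/Leonidas_I",
     "@surfaceForm": "Leonidas"},
    {"@URI": "http://dbpedia.org/resource/Thermopylae",
     "@surfaceForm": "Thermopylae"},
]}
entities = entity_set(sample)
```

The set returned by <monospace>entity_set</monospace> corresponds to the DE(T0) set formalised in Eq. (1) below.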
<p>So, the text <inline-formula id="j_infor527_ineq_005"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> is sent to the local DB-SL deployment to identify the relevant DBpedia entities contained in it. As a result, some DBpedia entities are retrieved, including the surface forms, links to DBpedia pages associated to them, and some additional information (e.g. the candidates for disambiguation and their rankings). The set of DBpedia entities obtained is denoted as <inline-formula id="j_infor527_ineq_006"><alternatives><mml:math>
<mml:mtext mathvariant="italic">DE</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{DE}({T_{0}})$]]></tex-math></alternatives></inline-formula> in Eq. (<xref rid="j_infor527_eq_001">1</xref>) (the mathematical notation adopted throughout the description of the approach can be found in Appendix <xref rid="j_infor527_app_001">A</xref>): 
<disp-formula id="j_infor527_eq_001">
<label>(1)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mtext mathvariant="italic">DE</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mspace width="2.5pt"/>
<mml:mtext>is a DBpedia entity present in</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mspace width="2.5pt"/>
<mml:mtext>according to DB-SL</mml:mtext>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \textit{DE}({T_{0}})=\{{e_{i}}/{e_{i}}\hspace{2.5pt}\text{is a DBpedia entity present in}\hspace{2.5pt}{T_{0}}\hspace{2.5pt}\text{according to DB-SL}\}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>In this example, <bold>18 entities</bold> were detected in the text <inline-formula id="j_infor527_ineq_007"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>.<xref ref-type="fn" rid="j_infor527_fn_005">5</xref><fn id="j_infor527_fn_005"><label><sup>5</sup></label>
<p>The numbers reported below refer to this default text. They change slightly over time, as Wikipedia pages are frequently added, removed, or modified, and new categories are constantly being created.</p></fn></p>
<p>Sometimes, DB-SL is unable to properly disambiguate candidates and returns some wrong entity associations. In the example, for instance, the <monospace>First_French_Empire</monospace> entity is incorrectly identified by DB-SL from the word <monospace>empire</monospace> in the initial text. In other cases, the identified entities denote abstract concepts (e.g. <monospace>Battle</monospace>) rather than real persons, locations, or events. Such cases can produce a huge amount of useless data that will be discarded later, but only after incurring a heavy computational load and disk usage that it is convenient to avoid. In these circumstances, a named entity disambiguation method becomes necessary.</p>
<p>Despite the notable performance of the existing named entity disambiguation approaches described in Section <xref rid="j_infor527_s_005">2.3</xref>, their complexity and requirements, such as domain-specific training datasets that are hard to find and high computational demands, exceed the needs of our ad hoc corpus generation algorithm. Relying on such custom in-domain datasets to disambiguate named entities would be impractical in a work like ours, which specifically aims to construct the tailor-made ad hoc corpora that are missing in the literature. In these circumstances, we have opted for a simpler yet effective mechanism to identify the correct entities in the initial text <inline-formula id="j_infor527_ineq_008"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>.</p>
<p>In particular, our mechanism leverages the semantic attributes, such as wikicats and subjects, associated with each entity in DBpedia and other linked repositories. Indeed, our approach specifically targets the identification of overlaps between the attributes of the target entity and those of other entities within <inline-formula id="j_infor527_ineq_009"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>. The underlying assumption is that the correct interpretation of a target named entity will be categorized under the same wikicats and subjects as the other named entities identified in <inline-formula id="j_infor527_ineq_010"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>. By considering these shared attributes, the mechanism increases the likelihood of accurately disambiguating the target entity within its given context.</p>
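To make the overlap criterion concrete, the following is a minimal sketch of such a disambiguation step, assuming each candidate interpretation and each context entity comes with its set of wikicats/subjects retrieved from DBpedia; the names (`disambiguate`, `attribute_overlap`) and the `min_overlap` threshold are illustrative, not part of the actual tool:

```python
def attribute_overlap(candidate_attrs, context_attrs):
    # Number of wikicats/subjects shared with the rest of the text.
    return len(candidate_attrs & context_attrs)

def disambiguate(candidates, context_entities, min_overlap=1):
    """Pick the candidate interpretation whose DBpedia attributes
    (wikicats and subjects) overlap most with those of the other
    entities detected in T0; return None if no candidate reaches
    the minimum overlap (the entity is then discarded)."""
    context_attrs = set()
    for attrs in context_entities.values():
        context_attrs |= set(attrs)
    best, best_score = None, 0
    for name, attrs in candidates.items():
        score = attribute_overlap(set(attrs), context_attrs)
        if score > best_score:
            best, best_score = name, score
    return best if best_score >= min_overlap else None
```

For instance, with context entities whose attributes revolve around the Greco-Persian Wars, a candidate sharing no attribute with them (such as a spurious `First_French_Empire`) is rejected, while a candidate sharing at least one wikicat/subject is kept.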
<p>Applying this procedure, some entities are discarded, such as <monospace>First_French_Empire</monospace> or <monospace>Battle</monospace>. Finally, after this filtering, only <bold>10 entities</bold> remained and were used in the following phases (<monospace>Themistocles</monospace>, <monospace>Ephialtes_of_Trachis</monospace>, <monospace>Thespiae</monospace>, <monospace>Battle_of_Artemisium</monospace>, <monospace>Battle_of_Marathon</monospace>, <monospace>Leonidas_I</monospace>, <monospace>Darius_I</monospace>, <monospace>Sparta</monospace>, <monospace>Xerxes_I</monospace>, <monospace>Battle_of_Thermopylae</monospace>). As can be seen, all of them have a strong relationship with the initial text. So, Eq. (<xref rid="j_infor527_eq_001">1</xref>) is refined, resulting in Eq. (<xref rid="j_infor527_eq_002">2</xref>): 
<disp-formula id="j_infor527_eq_002">
<label>(2)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd">
<mml:mtext mathvariant="italic">DE</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
</mml:mtd>
<mml:mtd class="align-even">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mspace width="2.5pt"/>
<mml:mtext>is a DBpedia entity present in</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mspace width="2.5pt"/>
<mml:mtext>according to DB-SL</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mo>∧</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd"/>
<mml:mtd class="align-even">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mspace width="2.5pt"/>
<mml:mtext>is related to other</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mspace width="2.5pt"/>
<mml:mtext>entities through wikicats/subjects</mml:mtext>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}\textit{DE}({T_{0}})=\{& {e_{i}}/{e_{i}}\hspace{2.5pt}\text{is a DBpedia entity present in}\hspace{2.5pt}{T_{0}}\hspace{2.5pt}\text{according to DB-SL}\hspace{2.5pt}\wedge \\ {} & {e_{i}}\hspace{2.5pt}\text{is related to other}\hspace{2.5pt}{T_{0}}\hspace{2.5pt}\text{entities through wikicats/subjects}\}.\end{aligned}\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>This approach has proven sufficient, since errors in the disambiguation process have only a limited impact on our algorithm. As explained throughout the paper, if we fail to identify an entity correctly, we consider documents tangentially related to <inline-formula id="j_infor527_ineq_011"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> as alternatives, but these documents are subsequently discarded in the following steps of the algorithm. Essentially, misidentifying an entity may cause a delay in constructing the ad hoc corpus (which is not critical since real-time requirements are absent), but it will not result in irrelevant documents (according to the specific theme of <inline-formula id="j_infor527_ineq_012"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>) being included within this custom dataset.</p>
</sec>
<sec id="j_infor527_s_009">
<label>3.3</label>
<title>Selecting All Relevant Wikicats that Characterise the Initial Text</title>
<p>With the goal of finding new documents that are significantly related to the initial text, <inline-formula id="j_infor527_ineq_013"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> is characterised by the set of wikicats linked to all the entities discovered in the previous step. As shown in Eq. (<xref rid="j_infor527_eq_003">3</xref>), the set of wikicats that characterise a given entity <italic>e</italic> is denoted as <inline-formula id="j_infor527_ineq_014"><alternatives><mml:math>
<mml:mtext mathvariant="italic">WK</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{WK}(e)$]]></tex-math></alternatives></inline-formula>. 
<disp-formula id="j_infor527_eq_003">
<label>(3)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mtext mathvariant="italic">WK</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mspace width="2.5pt"/>
<mml:mtext>is a wikicat associated to the entity</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \textit{WK}(e)=\{w{k_{j}}/w{k_{j}}\hspace{2.5pt}\text{is a wikicat associated to the entity}\hspace{2.5pt}e\}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>To carry out this characterisation process, it is necessary to analyse the <monospace>rdf:type</monospace> property of each entity <italic>e</italic>, since wikicats are the values of this property that belong to the <monospace>yago</monospace> namespace and start with the string “<monospace>Wikicat</monospace>”, as depicted in the next example: 
<disp-formula id="j_infor527_eq_004">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mtext mathvariant="monospace">entity</mml:mtext><mml:mover>
<mml:mo stretchy="true">→</mml:mo>
<mml:mrow>
<mml:mtext mathvariant="monospace">rdf:type</mml:mtext>
</mml:mrow>
</mml:mover>
<mml:mtext mathvariant="monospace">yago:Wikicat5th-centuryBCRulers</mml:mtext>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \texttt{entity}\xrightarrow{\texttt{rdf:type}}\texttt{yago:Wikicat5th-centuryBCRulers}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>This <monospace>rdf:type</monospace> property is returned by DB-SL, but most of the time incompletely, so it is necessary to resort to the original information repository to collect extended structural descriptions of the identified entities, including the categories to which they belong. In particular, the well-known properties <monospace>rdf:type</monospace> and <monospace>dct:subject</monospace> are used to discover the categories with which a given entity is associated. As an example, note the SPARQL query launched against the well-known DBpedia endpoint<xref ref-type="fn" rid="j_infor527_fn_006">6</xref><fn id="j_infor527_fn_006"><label><sup>6</sup></label>
<p><uri>https://dbpedia.org/sparql</uri></p></fn> to complete the information related to the entity <monospace>Leonidas_I</monospace>:</p><graphic xlink:href="infor527_g002.jpg"/>
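Since the query itself is shown above only as an image, the snippet below is a plausible sketch (not the tool's exact code) of how such a query could be assembled, and of how wikicats could then be filtered out of the returned <monospace>rdf:type</monospace> values, assuming the usual DBpedia yago class namespace:

```python
YAGO_NS = "http://dbpedia.org/class/yago/"  # assumed yago namespace URI

def build_entity_description_query(entity):
    # Shape of a SPARQL query sent to https://dbpedia.org/sparql to
    # complete an entity's description with rdf:type and dct:subject.
    return f"""PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?type ?subject WHERE {{
  <http://dbpedia.org/resource/{entity}> rdf:type ?type .
  OPTIONAL {{ <http://dbpedia.org/resource/{entity}> dct:subject ?subject . }}
}}"""

def extract_wikicats(type_uris):
    # Keep only rdf:type values in the yago namespace whose local
    # name starts with "Wikicat", as described in the text.
    return [uri[len(YAGO_NS):] for uri in type_uris
            if uri.startswith(YAGO_NS + "Wikicat")]
```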
<p>Simple wikicats consisting of a single word are eliminated – e.g. <monospace>WikicatKings</monospace> or <monospace>WikicatBattle</monospace> – as they mostly lead to very general concepts, not sufficiently related to the initial text <inline-formula id="j_infor527_ineq_015"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>. As mentioned before, these cases are likely to yield a large number of URLs that would later be ranked low and discarded, introducing unnecessarily large computational requirements. Any relevant URLs reached from such wikicats are also likely to be obtained from other, more significant wikicats. So, Eq. (<xref rid="j_infor527_eq_003">3</xref>) is refined, resulting in Eq. (<xref rid="j_infor527_eq_005">4</xref>): 
<disp-formula id="j_infor527_eq_005">
<label>(4)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mtext mathvariant="italic">WK</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mspace width="2.5pt"/>
<mml:mtext>is a “not simple” wikicat associated to the entity</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \textit{WK}(e)=\{w{k_{j}}/w{k_{j}}\hspace{2.5pt}\text{is a ``not simple'' wikicat associated to the entity}\hspace{2.5pt}e\}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
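One plausible reading of the “simple wikicat” filter in Eq. (4), treating the CamelCase remainder after the <monospace>Wikicat</monospace> prefix as a sequence of words, can be sketched as follows (the splitting heuristic is an illustrative assumption, not necessarily the tool's exact rule):

```python
import re

def is_simple_wikicat(wikicat):
    # Strip the "Wikicat" prefix and split the CamelCase remainder
    # into words; a single-word wikicat (e.g. WikicatKings) is
    # considered too general and is discarded.
    name = wikicat[len("Wikicat"):] if wikicat.startswith("Wikicat") else wikicat
    words = re.findall(r"[A-Z][^A-Z]*|[^A-Z]+", name)
    return len(words) <= 1

def filter_wikicats(wikicats):
    # WK(e) after the refinement in Eq. (4): keep "not simple" wikicats.
    return [wk for wk in wikicats if not is_simple_wikicat(wk)]
```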
<p>Next, the set <inline-formula id="j_infor527_ineq_016"><alternatives><mml:math>
<mml:mtext mathvariant="italic">WK</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{WK}({T_{0}})$]]></tex-math></alternatives></inline-formula> is created as the aggregation of wikicats (removing duplicates) coming from the different entities contained in <inline-formula id="j_infor527_ineq_017"><alternatives><mml:math>
<mml:mtext mathvariant="italic">DE</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{DE}({T_{0}})$]]></tex-math></alternatives></inline-formula>. This is the set of relevant wikicats that characterise <inline-formula id="j_infor527_ineq_018"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>, as shown in Eq. (<xref rid="j_infor527_eq_006">5</xref>): 
<disp-formula id="j_infor527_eq_006">
<label>(5)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mtext mathvariant="italic">WK</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">⋃</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mtext mathvariant="italic">WK</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="1em"/>
<mml:mo>∀</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mtext mathvariant="italic">DE</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \textit{WK}({T_{0}})=\bigcup \limits_{i}\textit{WK}({e_{i}}),\hspace{1em}\forall {e_{i}}\in \textit{DE}({T_{0}}).\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>In the example, <bold>42 different wikicats</bold> were found to be associated with the <bold>10 entities</bold> identified in the input text.</p>
<p>Sometimes, depending on the entities found by DB-SL, a large number of wikicats are collected in this phase. To keep the search focused, at this point the user may discard any of those wikicats that are not meaningful in the target context (see Fig. <xref rid="j_infor527_fig_001">1</xref>), keeping only the selected set of wikicats for the following steps.<xref ref-type="fn" rid="j_infor527_fn_007">7</xref><fn id="j_infor527_fn_007"><label><sup>7</sup></label>
<p>This is just an optimization to speed up the process.</p></fn></p>
<fig id="j_infor527_fig_001">
<label>Fig. 1</label>
<caption>
<p>Snapshot of the corpus builder tool developed to explore and identify wikicats that are relevant in the context of the initial text <inline-formula id="j_infor527_ineq_019"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> (available at <monospace>https://github.com/gssi-uvigo/Plethora</monospace>).</p>
</caption>
<graphic xlink:href="infor527_g003.jpg"/>
</fig>
</sec>
<sec id="j_infor527_s_010">
<label>3.4</label>
<title>Discovering New URLs Associated to the Relevant Wikicats</title>
<p>So far, a number of entities have been identified in <inline-formula id="j_infor527_ineq_020"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> and some of their properties (wikicats and subjects) have been obtained. Now, it is time to exploit the rich capabilities of the LOD infrastructure to perform the reverse operation, that is, to collect objects (identified by URLs) that meet some requirements. For this purpose, well-known repositories will be explored to discover web pages characterised with the same tags that describe the initial text <inline-formula id="j_infor527_ineq_021"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>.</p>
<p>First, for each wikicat in <inline-formula id="j_infor527_ineq_022"><alternatives><mml:math>
<mml:mtext mathvariant="italic">WK</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{WK}({T_{0}})$]]></tex-math></alternatives></inline-formula>, the DBpedia repository is queried to gather all the known DBpedia entities that are associated with it (denoted as <inline-formula id="j_infor527_ineq_023"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">U</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">DB</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${U_{\textit{DB}}}({w_{k}})$]]></tex-math></alternatives></inline-formula> in Eq. (<xref rid="j_infor527_eq_007">6</xref>)). As each DBpedia entity is linked to a Wikipedia page, the set of Wikipedia URLs about entities tagged with that wikicat is also retrieved in this process. That is: 
<disp-formula id="j_infor527_eq_007">
<label>(6)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">U</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">DB</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mspace width="2.5pt"/>
<mml:mtext>is a URL tagged with wikicat</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mspace width="2.5pt"/>
<mml:mtext>according to DBpedia</mml:mtext>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {U_{\textit{DB}}}({w_{k}})=\{{u_{j}}/{u_{j}}\hspace{2.5pt}\text{is a URL tagged with wikicat}\hspace{2.5pt}{w_{k}}\hspace{2.5pt}\text{according to DBpedia}\}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>To fetch this set of URLs, the following SPARQL query is sent to the well-known DBpedia endpoint<xref ref-type="fn" rid="j_infor527_fn_008">8</xref><fn id="j_infor527_fn_008"><label><sup>8</sup></label>
<p><uri>https://dbpedia.org/sparql</uri></p></fn> (with <monospace>Wikicat5th-centuryBCRulers</monospace> being the wikicat searched in this example):</p><graphic xlink:href="infor527_g004.jpg"/>
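As the query above is shown only as an image, the following is a hedged reconstruction of its likely shape: it asks DBpedia for the Wikipedia page of every entity typed with the given wikicat, reached through <monospace>foaf:isPrimaryTopicOf</monospace> (the inverse of the <monospace>primaryTopic</monospace> relationship discussed in the text); the exact query used by the tool may differ:

```python
def build_wikicat_urls_query(wikicat):
    # Likely shape of the DBpedia query behind Eq. (6): the Wikipedia
    # page of each entity typed with the given wikicat, reached
    # through foaf:isPrimaryTopicOf (the inverse of primaryTopic).
    return f"""PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX yago: <http://dbpedia.org/class/yago/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?entity ?page WHERE {{
  ?entity rdf:type yago:{wikicat} .
  ?entity foaf:isPrimaryTopicOf ?page .
}}"""
```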
<p>The approach considers the <monospace>primaryTopic</monospace> property, as this relationship is the one that leads directly to the Wikipedia page corresponding to the DBpedia entity (if any). If this property is not defined, the URL is discarded, as it does not correspond to a Wikipedia page (the interest lies not in the URL of the DBpedia entity but in the text of its corresponding Wikipedia page). This text is the training document that will be added to the ad hoc corpus if it is sufficiently related to the initial text.</p>
<p>In addition, Wikidata (Vrandečić and Krötzsch, <xref ref-type="bibr" rid="j_infor527_ref_082">2014</xref>; Yoo and Jeong, <xref ref-type="bibr" rid="j_infor527_ref_085">2020</xref>) is also queried to gather all the Wikipedia pages related to the components of the wikicat name. Wikidata is the central repository of structured information for all the projects of the Wikimedia Foundation, storing more than 92 million data items (text, images, dates, …) accessible through SPARQL queries. Like Wikipedia, Wikidata relies on crowdsourced data acquisition, being freely editable by people or programs, not only in its contents but also in its data structure. To fetch this second set of URLs, the following SPARQL query is sent to the well-known Wikidata endpoint<xref ref-type="fn" rid="j_infor527_fn_009">9</xref><fn id="j_infor527_fn_009"><label><sup>9</sup></label>
<p><uri>https://query.wikidata.org/sparql</uri></p></fn> (this time with <monospace>Greco-PersianWars</monospace> as the wikicat searched in the query):</p><graphic xlink:href="infor527_g005.jpg"/>
<p>This query provides a second set of URLs (denoted as <inline-formula id="j_infor527_ineq_024"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">U</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">WK</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${U_{\textit{WK}}}({w_{k}})$]]></tex-math></alternatives></inline-formula> in Eq. (<xref rid="j_infor527_eq_008">7</xref>)), usually larger than the first one, although composed of less reliable documents. 
<disp-formula id="j_infor527_eq_008">
<label>(7)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">U</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">WK</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mspace width="2.5pt"/>
<mml:mtext>is related to the wikicat</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mspace width="2.5pt"/>
<mml:mtext>according to Wikidata</mml:mtext>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {U_{\textit{WK}}}({w_{k}})=\{{u_{j}}/{u_{j}}\hspace{2.5pt}\text{is related to the wikicat}\hspace{2.5pt}{w_{k}}\hspace{2.5pt}\text{according to Wikidata}\}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>The second query makes it possible to collect some interesting documents, tightly related to the application scenario, that for some reason have not been tagged by users with the set of characterising wikicats (perhaps they are tagged with some similar wikicat that was not retrieved in the first phase, e.g. <monospace>NavalBattlesInvolvingGreece</monospace>). In any case, less reliable documents will obtain a low rank in the following steps and will be discarded.</p>
<p>Finally, both sets of URLs (represented in Eqs. (<xref rid="j_infor527_eq_007">6</xref>) and (<xref rid="j_infor527_eq_008">7</xref>)) are joined (removing duplicates) resulting in Eq. (<xref rid="j_infor527_eq_009">8</xref>): 
<disp-formula id="j_infor527_eq_009">
<label>(8)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">U</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">DW</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">U</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">DB</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>∪</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">U</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">WK</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {U_{\textit{DW}}}({w_{k}})={U_{\textit{DB}}}({w_{k}})\cup {U_{\textit{WK}}}({w_{k}}).\]]]></tex-math></alternatives>
</disp-formula>
</p>
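The set operations in Eq. (8) reduce to a duplicate-free union per wikicat, which can then be aggregated over all relevant wikicats; a trivial sketch with hypothetical names, over in-memory collections of URLs:

```python
def u_dw(u_db, u_wk):
    # Eq. (8): join the DBpedia and Wikidata URL sets for one wikicat,
    # removing duplicates.
    return set(u_db) | set(u_wk)

def u_t0(u_dw_per_wikicat):
    # Every URL associated with at least one relevant wikicat, i.e.
    # the union of U_DW(wk) over all wikicats in WK(T0).
    urls = set()
    for wk_urls in u_dw_per_wikicat.values():
        urls |= set(wk_urls)
    return urls
```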
<p>At this point, <inline-formula id="j_infor527_ineq_025"><alternatives><mml:math>
<mml:mi mathvariant="italic">U</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$U({T_{0}})$]]></tex-math></alternatives></inline-formula> is defined as the set of URLs that are associated (in DBpedia or Wikidata) with some wikicat included in <inline-formula id="j_infor527_ineq_026"><alternatives><mml:math>
<mml:mtext mathvariant="italic">WK</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{WK}({T_{0}})$]]></tex-math></alternatives></inline-formula> (that is, URLs of pages that may have a strong relationship with <inline-formula id="j_infor527_ineq_027"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>), as shown in Eq. (<xref rid="j_infor527_eq_010">9</xref>): 
<disp-formula id="j_infor527_eq_010">
<label>(9)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">U</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mo>∃</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mtext mathvariant="italic">WK</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mspace width="2.5pt"/>
<mml:mtext>such that</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∈</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">U</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">}</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ U({T_{0}})=\big\{{u_{j}}/\exists {w_{k}}\in \textit{WK}({T_{0}})\hspace{2.5pt}\text{such that}\hspace{2.5pt}{u_{j}}\in {U_{DW}}({w_{k}})\big\}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>In the example, <bold>90735 different URLs</bold> were identified from the <bold>42 wikicats</bold> collected. All of them corresponded to Wikipedia pages that were likely related to the context defined by the initial text <inline-formula id="j_infor527_ineq_028"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>.</p>
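The URL-gathering step of Eqs. (8) and (9) can be sketched in Python. This is a minimal illustration, assuming hypothetical lookup tables urls_dbpedia and urls_wikidata that map each wikicat to the URLs associated with it in DBpedia and Wikidata, respectively (in the actual pipeline these come from SPARQL queries against both knowledge graphs):

```python
def gather_urls(wikicats, urls_dbpedia, urls_wikidata):
    """Build U(T0): the union, over every wikicat of T0, of the URLs
    linked to that wikicat in DBpedia or Wikidata."""
    u_t0 = set()
    for wk in wikicats:
        # U_DW(wk) = U_DB(wk) ∪ U_WK(wk), as in Eq. (8)
        u_dw = set(urls_dbpedia.get(wk, [])) | set(urls_wikidata.get(wk, []))
        u_t0 |= u_dw  # accumulate into U(T0), as in Eq. (9)
    return u_t0

# Toy example with illustrative identifiers
dbp = {"Battles": ["u1", "u2"]}
wkd = {"Battles": ["u2", "u3"]}
print(sorted(gather_urls(["Battles"], dbp, wkd)))  # ['u1', 'u2', 'u3']
```

Note that duplicates are removed for free by the set union, mirroring the set semantics of the equations.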
</sec>
<sec id="j_infor527_s_011">
<label>3.5</label>
<title>Fetching and Cleaning Discovered URLs</title>
<p>Every URL <inline-formula id="j_infor527_ineq_029"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${u_{j}}$]]></tex-math></alternatives></inline-formula> included in <inline-formula id="j_infor527_ineq_030"><alternatives><mml:math>
<mml:mi mathvariant="italic">U</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$U({T_{0}})$]]></tex-math></alternatives></inline-formula> is now downloaded and cleaned (markup, styling, references, graphical items, etc., are removed) to become a candidate text to be included in the corpus.<xref ref-type="fn" rid="j_infor527_fn_010">10</xref><fn id="j_infor527_fn_010"><label><sup>10</sup></label>
<p>The Beautiful Soup Python library has been used for this purpose.</p></fn> In fact, only documents with a minimum text length (currently 300 bytes) are considered as candidate texts, since shorter texts usually denote meaningless pages that are unlikely to contain DBpedia entities (for instance, disambiguation pages). In this regard, note that in the example considered in the paper, <bold>83919 texts</bold> were downloaded whose length was above the mentioned threshold. At this point, several thousand documents whose content is likely to be similar, to some extent, to the initial text <inline-formula id="j_infor527_ineq_031"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> are available, all of them candidates to be incorporated into the ad hoc corpus pursued in the approach. This set of candidate texts is denoted as <inline-formula id="j_infor527_ineq_032"><alternatives><mml:math>
<mml:mtext mathvariant="italic">CT</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{CT}({T_{0}})$]]></tex-math></alternatives></inline-formula> in Eq. (<xref rid="j_infor527_eq_011">10</xref>): 
<disp-formula id="j_infor527_eq_011">
<label>(10)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mtext mathvariant="italic">CT</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mspace width="2.5pt"/>
<mml:mtext>is the cleaned text of some</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">U</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">}</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \textit{CT}({T_{0}})=\big\{{\textit{CT}_{j}}/{\textit{CT}_{j}}\hspace{2.5pt}\text{is the cleaned text of some}\hspace{2.5pt}{u_{j}}\in U({T_{0}})\big\}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
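The fetch-and-clean step that produces CT(T0) can be sketched with standard-library tools. This is a simplified stand-in (the actual pipeline uses the Beautiful Soup library); MIN_LENGTH and the helper names are illustrative:

```python
from html.parser import HTMLParser

MIN_LENGTH = 300  # bytes; shorter pages are likely meaningless (e.g. disambiguation)

class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page, dropping the markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def clean(html):
    """Return the visible text of an HTML page with whitespace normalized.
    (A full cleaner, as in the paper, would also drop styling, references
    and graphical items.)"""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split())

def candidate_texts(pages):
    """CT(T0): cleaned texts of the fetched pages above the length
    threshold, as in Eq. (10)."""
    cleaned = (clean(h) for h in pages)
    return [t for t in cleaned if len(t.encode("utf-8")) >= MIN_LENGTH]
```

For example, `clean("<p>Hello <b>world</b></p>")` yields `"Hello world"`, and a page whose cleaned text falls below 300 bytes is simply discarded.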
<p>Of course, a large number of these documents might have only a tangential relationship to the initial text (e.g. a battle involving Athens but belonging to a different historical period). It is therefore necessary to measure their similarity to the particular context, rank them according to that metric, and discard the irrelevant ones.</p>
</sec>
<sec id="j_infor527_s_012">
<label>3.6</label>
<title>Assessing Relevance of Each Candidate Text</title>
<p>By simple visual inspection, it is easy to notice that a significant number of the documents obtained in the previous phase do not have a strong enough relationship to the proposed domain, and for that reason they should not be incorporated into a domain-specific corpus. Of course, the retrieved wikicats can be quite specific (e.g. <monospace>Greco-PersianWars</monospace>), leading to documents that are likely to belong to the theme of the initial text. But they can also be wide-spectrum, mixing related URLs with unrelated ones (e.g. <monospace>PeopleFromAthens</monospace> leads both to <monospace>Themistocles</monospace> – leader of the Greeks in the scenario under consideration – and to <monospace>Queen Sofia of Spain</monospace>, currently alive and probably irrelevant in the context of the Battle of Thermopylae).</p>
<p>To detect and discard these uninteresting documents, the similarity between the initial text <inline-formula id="j_infor527_ineq_033"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> and each of the candidate texts <inline-formula id="j_infor527_ineq_034"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{j}}$]]></tex-math></alternatives></inline-formula> is measured. Depending on these values, the relationship between <inline-formula id="j_infor527_ineq_035"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{j}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_036"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> can be relevant (and thus the document is incorporated into the ad-hoc corpus) or irrelevant (the document is discarded), as depicted in Eq. (<xref rid="j_infor527_eq_012">11</xref>): 
<disp-formula id="j_infor527_eq_012">
<label>(11)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mtext mathvariant="italic">Corpus</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mtext mathvariant="italic">CT</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>∧</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mspace width="2.5pt"/>
<mml:mtext>is</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mtext mathvariant="italic">relevant</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mtext>according to</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mtext mathvariant="italic">similarity</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \textit{Corpus}({T_{0}})=\{{T_{j}}/{T_{j}}\in \textit{CT}({T_{0}})\wedge {T_{j}}\hspace{2.5pt}\text{is}\hspace{2.5pt}\textit{relevant}\hspace{2.5pt}\text{according to}\hspace{2.5pt}\textit{similarity}({T_{j,}}{T_{0}})\}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>The different <italic>similarity</italic> metrics that have been taken into account to detect the relationship between each of the candidate texts and <inline-formula id="j_infor527_ineq_037"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> are explained in Section <xref rid="j_infor527_s_013">3.6.1</xref>. The way in which the best metric has been identified is detailed in Section <xref rid="j_infor527_s_014">3.6.2</xref>. Finally, the <italic>relevance</italic> criteria considered to evaluate the candidate texts are discussed in Section <xref rid="j_infor527_s_015">4</xref>.</p>
<sec id="j_infor527_s_013">
<label>3.6.1</label>
<title>How to Measure Similarity Between Each Candidate <inline-formula id="j_infor527_ineq_038"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{j}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_039"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula></title>
<p>When it comes to detecting resemblance between each candidate text and the input one (<inline-formula id="j_infor527_ineq_040"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>), part of the significant information bound to them (that has been discovered by the procedures described in previous sections) can be leveraged, such as their common wikicats and subjects. Besides, it is also possible to resort to approaches defined in the literature – based on using word-level and sentence-level embeddings – that have attained good performance for measuring semantic similarity between texts (Le and Mikolov, <xref ref-type="bibr" rid="j_infor527_ref_048">2014</xref>; Lau and Baldwin, <xref ref-type="bibr" rid="j_infor527_ref_047">2016</xref>; Fu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_026">2018</xref>; Gali <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_027">2019</xref>; Dai <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_022">2020</xref>; Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_019">2019</xref>). Details of the four metrics adopted in this work and the results of the experiments justifying the adoption of a Doc2Vec-based solution are presented next. 
<list>
<list-item id="j_infor527_li_012">
<label>1.</label>
<p><bold>Wikicats Jaccard similarity (</bold><inline-formula id="j_infor527_ineq_041"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="bold-italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textbf{\textit{Sim}}_{W}}$]]></tex-math></alternatives></inline-formula><bold>).</bold></p>
<p>This metric measures the similarity between the initial text <inline-formula id="j_infor527_ineq_042"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> and the candidate one <inline-formula id="j_infor527_ineq_043"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{c}}$]]></tex-math></alternatives></inline-formula> according to the overlap between the wikicats that characterise each of them. That is:</p><p><graphic xlink:href="infor527_g006.jpg"/></p>
</list-item>
<list-item id="j_infor527_li_013">
<label>2.</label>
<p><bold>Subjects Jaccard similarity (</bold><inline-formula id="j_infor527_ineq_044"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="bold-italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textbf{\textit{Sim}}_{S}}$]]></tex-math></alternatives></inline-formula><bold>).</bold></p>
<p>This metric is analogous to the previous one, but it uses the common subjects between <inline-formula id="j_infor527_ineq_045"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_046"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{c}}$]]></tex-math></alternatives></inline-formula>, instead of wikicats.</p>
</list-item>
<list-item id="j_infor527_li_014">
<label>3.</label>
<p><bold>spaCy similarity (</bold><inline-formula id="j_infor527_ineq_047"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="bold-italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textbf{\textit{Sim}}_{C}}$]]></tex-math></alternatives></inline-formula><bold>)</bold>.</p>
<p>The computation of similarity values by means of this metric is driven by spaCy,<xref ref-type="fn" rid="j_infor527_fn_011">11</xref><fn id="j_infor527_fn_011"><label><sup>11</sup></label>
<p><ext-link ext-link-type="uri" xlink:href="http://www.spacy.io">www.spacy.io</ext-link></p></fn> a Python package for natural language processing. As usual in similar packages, it provides functions for tokenizing texts, removing stopwords and punctuation, classifying words according to grammatical categories, and so on. In addition, it implements mechanisms to assign a vector to each word. For this task, it uses algorithms like GloVe or Word2Vec (a variant of the latter by default) to assign a word-embedding vector to each word of the vocabulary.<xref ref-type="fn" rid="j_infor527_fn_012">12</xref><fn id="j_infor527_fn_012"><label><sup>12</sup></label>
<p>Note that several English vocabularies can be loaded at startup, from small to large ones.</p></fn> Naturally, it also provides functions to measure similarity between words by comparing their vectors through the traditional cosine similarity.</p>
<p>Building directly on this, spaCy also provides a simple mechanism to measure similarity between texts: it generates a vector for each text (the average of the vectors of its words) and computes the cosine similarity between the text vectors.</p>
</list-item>
<list-item id="j_infor527_li_015">
<label>4.</label>
<p><bold>Doc2Vec similarity (</bold><inline-formula id="j_infor527_ineq_048"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="bold-italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="bold-italic">AP</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textbf{\textit{Sim}}_{\textbf{\textit{AP}}}}$]]></tex-math></alternatives></inline-formula><bold>)</bold>.</p>
<p>Some works in the literature have shown that Doc2Vec performs robustly in measuring document similarity when trained using large external corpora (Lau and Baldwin, <xref ref-type="bibr" rid="j_infor527_ref_047">2016</xref>; Dai <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_022">2020</xref>). Bearing these results in mind, this embedding model has been explored to select the most similar candidate documents to the initial text <inline-formula id="j_infor527_ineq_049"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>. Since the aim is to build an ad hoc corpus, at this point there was no dataset to learn a custom-trained Doc2Vec model. Therefore, a model pre-trained on some publicly available generic corpus has been adopted. In particular, this paper uses the well-known Doc2Vec model trained on a large corpus of news from the Associated Press (AP).<xref ref-type="fn" rid="j_infor527_fn_013">13</xref><fn id="j_infor527_fn_013"><label><sup>13</sup></label>
<p><uri>https://github.com/shreyanse081/gensim_Doc-Word2Vec</uri></p></fn></p>
<p>Using this model and the Gensim implementation of the Doc2Vec algorithm, the approach obtained the characteristic vector of each candidate text <inline-formula id="j_infor527_ineq_050"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{c}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_051"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>, and then computed the cosine similarity between these vectors (and thus between the corresponding documents).<xref ref-type="fn" rid="j_infor527_fn_014">14</xref><fn id="j_infor527_fn_014"><label><sup>14</sup></label>
<p>The tutorial available at <uri>https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html</uri> provides examples of how to use Gensim to load a corpus, train a Doc2Vec model using that dataset and infer the corresponding document vector.</p></fn> For the training process, a simple pre-processing has been carried out on every text using the Gensim API: tokenize to obtain a list of words, lowercase them, remove punctuation, and remove stop-words.<xref ref-type="fn" rid="j_infor527_fn_015">15</xref><fn id="j_infor527_fn_015"><label><sup>15</sup></label>
<p>Tests of keeping and removing stop-words before training showed that better results were obtained by removing them.</p></fn></p>
</list-item>
</list>
</p>
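As an illustration of the metrics above, the two Jaccard-based similarities and the cosine step shared by the spaCy- and Doc2Vec-based ones can be sketched as follows. This is a minimal sketch: in the Doc2Vec case the vectors would be obtained with Gensim's infer_vector, and in the spaCy case from the averaged word vectors; here plain Python lists stand in for both:

```python
import math

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| between two sets: wikicats for Sim_W,
    subjects for Sim_S."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine similarity between two embedding vectors, as used by
    both the spaCy-based (Sim_C) and Doc2Vec-based (Sim_AP) metrics."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

print(jaccard({"GrecoPersianWars", "Battles"}, {"Battles"}))  # 0.5
```

The four metrics thus differ only in what they compare: discrete sets of wikicats or subjects in the first two cases, dense document vectors in the last two.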
</sec>
<sec id="j_infor527_s_014">
<label>3.6.2</label>
<title>How to Select the Best Similarity Metric in the Approach</title>
<p>In order to decide which of the four metrics described in the previous section (<inline-formula id="j_infor527_ineq_052"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{W}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor527_ineq_053"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{S}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor527_ineq_054"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{C}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_055"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{\textit{AP}}}$]]></tex-math></alternatives></inline-formula>) is the best for the intended purpose, a supervised approach will be followed to measure how well they can identify the similarity between <inline-formula id="j_infor527_ineq_056"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> and a set of documents recognized as highly similar.</p>
<list>
<list-item id="j_infor527_li_016">
<label>•</label>
<p>First, the similarity between the initial text and each candidate text will be computed using each one of the similarity metrics. This allows the set of candidate texts in <inline-formula id="j_infor527_ineq_057"><alternatives><mml:math>
<mml:mtext mathvariant="italic">CT</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{CT}({T_{0}})$]]></tex-math></alternatives></inline-formula> (Eq. (<xref rid="j_infor527_eq_011">10</xref>)) to be sorted in decreasing order according to their similarity value <inline-formula id="j_infor527_ineq_058"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{x}}$]]></tex-math></alternatives></inline-formula> with respect to <inline-formula id="j_infor527_ineq_059"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>, that is, each <inline-formula id="j_infor527_ineq_060"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{x}}$]]></tex-math></alternatives></inline-formula> will lead to a different <inline-formula id="j_infor527_ineq_061"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{x}}({T_{0}})$]]></tex-math></alternatives></inline-formula>. In this regard, note that the starting point is the input text <inline-formula id="j_infor527_ineq_062"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> containing several DBpedia entities detected by DBpedia Spotlight (<inline-formula id="j_infor527_ineq_063"><alternatives><mml:math>
<mml:mtext mathvariant="italic">DE</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{DE}({T_{0}})$]]></tex-math></alternatives></inline-formula> in Eq. (<xref rid="j_infor527_eq_002">2</xref>)). As such entities represent Wikipedia pages, their corresponding cleaned texts (included as candidates in <inline-formula id="j_infor527_ineq_064"><alternatives><mml:math>
<mml:mtext mathvariant="italic">CT</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{CT}({T_{0}})$]]></tex-math></alternatives></inline-formula>) are also available, and they will serve as the testing set, since all of them are clearly texts closely related to the subject of <inline-formula id="j_infor527_ineq_065"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>.</p>
</list-item>
<list-item id="j_infor527_li_017">
<label>•</label>
<p>Each entity <inline-formula id="j_infor527_ineq_066"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${e_{i}}$]]></tex-math></alternatives></inline-formula> appearing in <inline-formula id="j_infor527_ineq_067"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> will be located in a different position <inline-formula id="j_infor527_ineq_068"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${P_{{x_{i}}}}$]]></tex-math></alternatives></inline-formula> in each <inline-formula id="j_infor527_ineq_069"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{x}}({T_{0}})$]]></tex-math></alternatives></inline-formula> (so that the higher the similarity, the lower the position). Thus, the best similarity metric <inline-formula id="j_infor527_ineq_070"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{x}}$]]></tex-math></alternatives></inline-formula> is the one that locates all the <inline-formula id="j_infor527_ineq_071"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>’s entities in the lowest possible positions within the set <inline-formula id="j_infor527_ineq_072"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{x}}({T_{0}})$]]></tex-math></alternatives></inline-formula>.<xref ref-type="fn" rid="j_infor527_fn_016">16</xref><fn id="j_infor527_fn_016"><label><sup>16</sup></label>
<p>Currently, the approach considers only entities of types Person, Location and Event (the 10 entities identified in the example text meet this requirement).</p></fn> In particular, the approach selects the similarity metric with the lowest average position over all the entities of <inline-formula id="j_infor527_ineq_073"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> (denoted as <inline-formula id="j_infor527_ineq_074"><alternatives><mml:math>
<mml:mtext mathvariant="italic">Avrg</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{Avrg}({P_{x}})$]]></tex-math></alternatives></inline-formula>). The results of this procedure are shown in the snapshot of the developed tool depicted in Fig. <xref rid="j_infor527_fig_002">2</xref>.</p>
</list-item>
</list>
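The metric-selection criterion just described can be sketched in a few lines of Python. The metric names and entity positions below are hypothetical, chosen only to illustrate how the lowest average position picks the winning metric; they are not the paper's actual data.

```python
def best_metric(positions_by_metric):
    """Return the metric with the lowest average entity position
    (the Avrg(P_x) criterion), together with all the averages."""
    averages = {m: sum(p) / len(p) for m, p in positions_by_metric.items()}
    winner = min(averages, key=averages.get)
    return winner, averages

# Hypothetical positions of T0's entities under each similarity metric.
positions = {
    "Sim_W": [120, 300, 95],
    "Sim_S": [80, 250, 110],
    "Sim_C": [200, 180, 400],
    "Sim_AP": [60, 150, 90],
}
winner, averages = best_metric(positions)
print(winner)  # Sim_AP
```

Lower positions mean the entities' own Wikipedia pages were found among the most similar candidates, so the minimum average identifies the best-performing metric.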
<fig id="j_infor527_fig_002">
<label>Fig. 2</label>
<caption>
<p>Interface of the corpus builder tool developed by the authors of the paper, which evaluates the similarity between the initial text <inline-formula id="j_infor527_ineq_075"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> and the selected candidate texts through the four metrics considered in the approach (<inline-formula id="j_infor527_ineq_076"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{W}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor527_ineq_077"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{S}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor527_ineq_078"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{C}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_079"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{\textit{AP}}}$]]></tex-math></alternatives></inline-formula>).</p>
</caption>
<graphic xlink:href="infor527_g007.jpg"/>
</fig>
<p>Returning to the example illustrated throughout this section, Table <xref rid="j_infor527_tab_002">2</xref> shows the positions occupied by the 10 DB-SL entities (discovered in <inline-formula id="j_infor527_ineq_080"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>) in the ordered sets that have been obtained using the four metrics considered in the tests (<inline-formula id="j_infor527_ineq_081"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{W}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor527_ineq_082"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{S}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor527_ineq_083"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{C}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_084"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{\textit{AP}}}$]]></tex-math></alternatives></inline-formula>). Note that <inline-formula id="j_infor527_ineq_085"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{\textit{AP}}}$]]></tex-math></alternatives></inline-formula> is the Doc2Vec similarity metric computed with the AP pre-trained model.</p>
<table-wrap id="j_infor527_tab_002">
<label>Table 2</label>
<caption>
<p>Positions occupied by the 10 DB-SL entities (discovered in the initial text <inline-formula id="j_infor527_ineq_086"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>) in a set of candidate documents that have been ordered (as per their similarity to that text) using each of the four metrics considered in the approach. The best metric is the one that finds the entities in the lowest positions in the ordered set, that is, in the documents most similar to the input text (<inline-formula id="j_infor527_ineq_087"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{\textit{AP}}}$]]></tex-math></alternatives></inline-formula> in this case).</p>
</caption>
<graphic xlink:href="infor527_g008.jpg"/>
</table-wrap>
<p>These data are not deterministic for the Doc2Vec-based similarity <inline-formula id="j_infor527_ineq_088"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{\textit{AP}}}$]]></tex-math></alternatives></inline-formula>, because the Doc2Vec model provides slightly different similarity values for any pair of documents in every execution. This is expected behaviour, as inferring a vector for a new document is not a deterministic process but an iterative one, with some randomization introduced by the negative-sampling feature of the training process.<xref ref-type="fn" rid="j_infor527_fn_017">17</xref><fn id="j_infor527_fn_017"><label><sup>17</sup></label>
<p>Note that, unlike Word2Vec vectors for words in the vocabulary, no vector is stored in the model for new documents, so such a vector must be computed on the fly, replicating the training process.</p></fn> These slight variations are usually unimportant when estimating a single similarity but, in the example illustrated in the paper, 83919 documents are sorted, and such variations significantly affect the final order among them (and the positions of the entities).</p>
<p>To address this issue, and to compare Doc2Vec fairly with the other similarity metrics, the proposed algorithm (outlined in Algorithm 1) computes the results of 5 different executions and uses the average position for every entity. As depicted in Table <xref rid="j_infor527_tab_002">2</xref>, the Doc2Vec-driven metric was found to be the best according to the defined criterion (with an average position of 100.3). In light of these results, the Doc2Vec similarity metric (denoted by <inline-formula id="j_infor527_ineq_089"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{\textit{AP}}}$]]></tex-math></alternatives></inline-formula>) is adopted to order the candidate texts in <inline-formula id="j_infor527_ineq_090"><alternatives><mml:math>
<mml:mtext mathvariant="italic">CT</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{CT}({T_{0}})$]]></tex-math></alternatives></inline-formula>, hereafter denoted as <inline-formula id="j_infor527_ineq_091"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{\textit{AP}}}({T_{0}})$]]></tex-math></alternatives></inline-formula>.</p><graphic xlink:href="infor527_g009.jpg"/>
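The averaging step used to stabilize the stochastic Doc2Vec rankings can be sketched as follows. The entity names and positions are hypothetical; in the real pipeline each ranking would come from one execution of the Doc2Vec-based ordering of the candidate set.

```python
from statistics import mean

def average_entity_positions(runs):
    """Average, over several executions, the position at which each
    entity's page appears in the similarity-ordered candidate set."""
    return {e: mean(run[e] for run in runs) for e in runs[0]}

# Hypothetical positions of two entities across 5 executions.
runs = [
    {"Themistocles": 3, "Salamis": 12},
    {"Themistocles": 5, "Salamis": 10},
    {"Themistocles": 4, "Salamis": 11},
    {"Themistocles": 6, "Salamis": 9},
    {"Themistocles": 2, "Salamis": 13},
]
avg = average_entity_positions(runs)
print(avg)  # {'Themistocles': 4, 'Salamis': 11}
```

Averaging over several runs damps the run-to-run noise introduced by vector inference, so the per-entity positions become stable enough to compare metrics.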
<p>As depicted in the algorithm sketched above, a subset of the candidate documents occupying the top positions in <inline-formula id="j_infor527_ineq_092"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{\textit{AP}}}({T_{0}})$]]></tex-math></alternatives></inline-formula> should be selected for incorporation into the ad hoc corpus. As might be expected, there is no sharp boundary marking which candidates should be included in the corpus and which should not: the differences among candidates on either side of any threshold are negligible. Of course, adding more candidates helps to achieve a more stable embeddings model, but the additional documents become less and less similar to the initial text. It also seems clear that, in the end, any consistency measure can be influenced by the final application in which the corpus is used.</p>
<p>What can be done, however, is to analyse different scenarios and study their results according to a common criterion. This makes it possible to assess the performance of the models obtained by training on different subsets of the <inline-formula id="j_infor527_ineq_093"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{\textit{AP}}}({T_{0}})$]]></tex-math></alternatives></inline-formula>, and to observe how the selected size affects them. This is the criterion adopted in the next section to validate the proposed approach for selecting the documents that will finally be incorporated into the tailor-made corpus.</p>
</sec>
</sec>
</sec>
<sec id="j_infor527_s_015">
<label>4</label>
<title>Validating the Selection of In-Domain Documents for the Corpus</title>
<p>The starting point is the collection of candidate documents that have been gathered from Wikipedia and ordered according to their similarity to <inline-formula id="j_infor527_ineq_094"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>. Recall that such ordered set was denoted as <inline-formula id="j_infor527_ineq_095"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{\textit{AP}}}({T_{0}})$]]></tex-math></alternatives></inline-formula> as it was the result of using the best metric (<inline-formula id="j_infor527_ineq_096"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{\textit{AP}}}$]]></tex-math></alternatives></inline-formula>), which is based on the Doc2Vec model pre-trained on the AP news collection. The documents in this set are ordered from most to least similar to <inline-formula id="j_infor527_ineq_097"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>, but it is likely that not all of them are strongly enough related to the context, because they have been collected from wikicats that may be only tangentially related to the initial text. Therefore, only a subset of these documents needs to be considered: the texts occupying the first positions of the ordered set should be incorporated into the ad hoc corpus being pursued. As there are no clear guidelines on this threshold (the similarity results are certainly quite continuous over the 83919 computed values), different thresholds were chosen so that the resulting corpora could be used in some scenario and the quality of the results evaluated. In this regard, it was decided to select the top percentages of the set <inline-formula id="j_infor527_ineq_098"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{\textit{AP}}}({T_{0}})$]]></tex-math></alternatives></inline-formula>, from 1% to 10%, and to train the corresponding Doc2Vec models on this set of ad hoc corpora (of sizes 839, 1678, …). The consistency of these models (denoted as <inline-formula id="j_infor527_ineq_099"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{1}}$]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_infor527_ineq_100"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{10}}$]]></tex-math></alternatives></inline-formula>) was then analysed individually (Section <xref rid="j_infor527_s_016">4.1</xref>), and the models were compared with each other and with the generic AP model (Section <xref rid="j_infor527_s_017">4.2</xref>).</p>
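The construction of the ten training corpora can be sketched as prefix slices of the similarity-ordered candidate list. The document names below are placeholders; only the set size (83919) and the resulting subset sizes come from the example in the text.

```python
def percentage_subsets(ordered_docs, max_pct=10):
    """Top-p% prefixes (p = 1..max_pct) of a similarity-ordered document
    list, one prefix per ad hoc corpus to be trained on."""
    n = len(ordered_docs)
    return {p: ordered_docs[: n * p // 100] for p in range(1, max_pct + 1)}

# 83919 placeholder names standing in for the ordered candidate documents.
docs = [f"doc{i}" for i in range(83919)]
subsets = percentage_subsets(docs)
print(len(subsets[1]), len(subsets[2]), len(subsets[10]))  # 839 1678 8391
```

Each prefix would then be fed to a separate Doc2Vec training run to obtain the models compared in the following subsections.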
<p>This comparison should serve to answer a relevant outstanding question: “<italic>are any of these ad hoc models better than the generic AP model?</italic>” Note that this is the cornerstone of the research described in the paper, which aims to devise a procedure to build an ad hoc corpus (composed of documents significantly related to a given application scenario), as there are several references in the literature stating that the model derived from such a corpus should be better than a model trained on a generic corpus of documents (not particularly related to the scenario under consideration). Therefore, evidence in this respect should be provided.</p>
<sec id="j_infor527_s_016">
<label>4.1</label>
<title>Consistency Tests for Evaluating Embedding Models Learned from Ad Hoc Corpora</title>
<p>One simple consistency test to check the resulting models is to compute the self-rank of each training document when searching for similar documents. That is, for each of the training files, the model is asked for the <italic>N</italic> most similar documents, as if that file were new and not already in the model. Obviously, the model should select the file itself as the most similar (1-<italic>rank</italic>), that is, the first in the list of most similar documents.<xref ref-type="fn" rid="j_infor527_fn_018">18</xref><fn id="j_infor527_fn_018"><label><sup>18</sup></label>
<p>See <uri>https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html</uri> for details.</p></fn> But sometimes the model makes a mistake and finds another document that is even more similar. The fewer the mistakes, the better the model.</p>
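The self-rank check can be expressed independently of any embedding library. In the actual test, the vector of each training document is re-inferred by the stochastic Doc2Vec model rather than read back exactly, which is why occasional mistakes occur; in this sketch a deterministic cosine similarity over toy vectors stands in for the model, so the score is perfect by construction.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def one_rank_percentage(vectors, similarity):
    """Percentage of documents whose own vector ranks first when the
    whole collection is sorted by similarity to it (the 1-rank test)."""
    hits = 0
    for i, v in enumerate(vectors):
        best = max(range(len(vectors)), key=lambda j: similarity(v, vectors[j]))
        hits += best == i
    return 100.0 * hits / len(vectors)

# Toy document vectors; exact self-vectors give a perfect score here.
vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
score = one_rank_percentage(vecs, cosine)
print(score)  # 100.0
```

Replacing `vectors[j]` in the inner loop with a freshly inferred vector would reproduce the stochastic setting of the actual test, where scores slightly below 100% appear.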
<p>Table <xref rid="j_infor527_tab_003">3</xref> depicts the results for the 10 Doc2Vec models (which have been trained on the 10 ad hoc corpora), where the <inline-formula id="j_infor527_ineq_101"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Rank</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Rank}_{i}}$]]></tex-math></alternatives></inline-formula> row shows the percentage of training documents that achieved a 1-<inline-formula id="j_infor527_ineq_102"><alternatives><mml:math>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">k</mml:mi></mml:math><tex-math><![CDATA[$rank$]]></tex-math></alternatives></inline-formula> for each <inline-formula id="j_infor527_ineq_103"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{i}}$]]></tex-math></alternatives></inline-formula> model. The resulting figures, all close to 100%, confirm that the behaviour of all the learned ad hoc models is consistent.</p>
<table-wrap id="j_infor527_tab_003">
<label>Table 3</label>
<caption>
<p>1-<italic>rank</italic> results obtained for 10 ad hoc Doc2Vec models generated in the approach (<inline-formula id="j_infor527_ineq_104"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{1}}$]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_infor527_ineq_105"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{10}}$]]></tex-math></alternatives></inline-formula>). The results confirm that, given a training document, these models were almost always correct in identifying that document as the most similar to itself (on average, this was true for 98.5% of the training documents).</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_106"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{i}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">1</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">2</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">3</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">4</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">5</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">6</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">7</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">8</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">9</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">10</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_107"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Rank</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Rank}_{i}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">94.5</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">98.9</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">99.2</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">99</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">98.9</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">98.8</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">98.8</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">99</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">98.9</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">98.8</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Another simple consistency check for the models consists of observing how well they discriminate between similar and dissimilar documents. To this end, a simple experiment was conducted in which two lists of documents in <inline-formula id="j_infor527_ineq_108"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{\textit{AP}}}({T_{0}})$]]></tex-math></alternatives></inline-formula> were selected:</p>
<list>
<list-item id="j_infor527_li_018">
<label>•</label>
<p><inline-formula id="j_infor527_ineq_109"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${L_{1}}$]]></tex-math></alternatives></inline-formula>: the 100 most similar documents to the initial text, according to the <inline-formula id="j_infor527_ineq_110"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{\textit{AP}}}$]]></tex-math></alternatives></inline-formula> similarity metric.</p>
</list-item>
<list-item id="j_infor527_li_019">
<label>•</label>
<p><inline-formula id="j_infor527_ineq_111"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${L_{2}}$]]></tex-math></alternatives></inline-formula>: the 100 least similar documents to the initial text, as per the same metric.<xref ref-type="fn" rid="j_infor527_fn_019">19</xref><fn id="j_infor527_fn_019"><label><sup>19</sup></label>
<p>Actually, only documents larger than 3KB were included in each of the lists because very small documents were of little significance for testing purposes.</p></fn></p>
</list-item>
</list>
<p>Each of these 200 documents was divided into two parts of equal size, and several similarity values were computed for each one of the 10 ad hoc Doc2Vec models:</p>
<list>
<list-item id="j_infor527_li_020">
<label>•</label>
<p><inline-formula id="j_infor527_ineq_112"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${S_{1}}$]]></tex-math></alternatives></inline-formula>: similarity values between both parts of each file in <inline-formula id="j_infor527_ineq_113"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${L_{1}}$]]></tex-math></alternatives></inline-formula> (<bold>these values should be high</bold> as the document is similar to <inline-formula id="j_infor527_ineq_114"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> and both parts are about the same theme). A list of 100 similarity values was obtained for each model.</p>
</list-item>
<list-item id="j_infor527_li_021">
<label>•</label>
<p><inline-formula id="j_infor527_ineq_115"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${S_{2}}$]]></tex-math></alternatives></inline-formula>: similarity values between both parts of each file in <inline-formula id="j_infor527_ineq_116"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${L_{2}}$]]></tex-math></alternatives></inline-formula> (<bold>these values should also be high</bold>, as both parts are about the same theme). A list of 100 values was computed for each model.</p>
</list-item>
<list-item id="j_infor527_li_022">
<label>•</label>
<p><inline-formula id="j_infor527_ineq_117"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${S_{3}}$]]></tex-math></alternatives></inline-formula>: similarity values between the first part of each file in <inline-formula id="j_infor527_ineq_118"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${L_{1}}$]]></tex-math></alternatives></inline-formula> and the first part of the file occupying the same index in <inline-formula id="j_infor527_ineq_119"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${L_{2}}$]]></tex-math></alternatives></inline-formula> (<bold>these values should be low</bold>, as most of the time they are unrelated documents, as can be seen from the earlier example on <monospace>Themistocles</monospace> and <monospace>Queen Sofia of Spain</monospace>).</p>
</list-item>
</list>
<p>The average of the similarities <inline-formula id="j_infor527_ineq_120"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${S_{1}}$]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_infor527_ineq_121"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${S_{3}}$]]></tex-math></alternatives></inline-formula> was finally computed for each model <inline-formula id="j_infor527_ineq_122"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{i}}$]]></tex-math></alternatives></inline-formula>, as depicted in Table <xref rid="j_infor527_tab_004">4</xref>. Once again, the results confirm that the 10 ad hoc models behave exactly as expected: similarities between the two parts of the same document are extremely high, while cross-similarities between unrelated documents are low.</p>
<table-wrap id="j_infor527_tab_004">
<label>Table 4</label>
<caption>
<p>Similarity values that our 10 Doc2Vec models have measured between documents that are related to <inline-formula id="j_infor527_ineq_123"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> to different extents. The results confirm that the 10 ad hoc models are able to detect high similarity between strongly related documents (values close to 1 included in the rows corresponding to the average of <inline-formula id="j_infor527_ineq_124"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${S_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_125"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${S_{2}}$]]></tex-math></alternatives></inline-formula>), and low similarity between unrelated texts (very low values referring to the average of <inline-formula id="j_infor527_ineq_126"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${S_{3}}$]]></tex-math></alternatives></inline-formula>).</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_127"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{i}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">1</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">2</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">3</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">4</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">5</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">6</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">7</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">8</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">9</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">10</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_128"><alternatives><mml:math>
<mml:mtext mathvariant="italic">Avg</mml:mtext></mml:math><tex-math><![CDATA[$\textit{Avg}$]]></tex-math></alternatives></inline-formula> <inline-formula id="j_infor527_ineq_129"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${S_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0.93</td>
<td style="vertical-align: top; text-align: left">0.93</td>
<td style="vertical-align: top; text-align: left">0.93</td>
<td style="vertical-align: top; text-align: left">0.93</td>
<td style="vertical-align: top; text-align: left">0.94</td>
<td style="vertical-align: top; text-align: left">0.93</td>
<td style="vertical-align: top; text-align: left">0.94</td>
<td style="vertical-align: top; text-align: left">0.94</td>
<td style="vertical-align: top; text-align: left">0.94</td>
<td style="vertical-align: top; text-align: left">0.93</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_130"><alternatives><mml:math>
<mml:mtext mathvariant="italic">Avg</mml:mtext></mml:math><tex-math><![CDATA[$\textit{Avg}$]]></tex-math></alternatives></inline-formula> <inline-formula id="j_infor527_ineq_131"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${S_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0.83</td>
<td style="vertical-align: top; text-align: left">0.84</td>
<td style="vertical-align: top; text-align: left">0.83</td>
<td style="vertical-align: top; text-align: left">0.84</td>
<td style="vertical-align: top; text-align: left">0.84</td>
<td style="vertical-align: top; text-align: left">0.83</td>
<td style="vertical-align: top; text-align: left">0.84</td>
<td style="vertical-align: top; text-align: left">0.84</td>
<td style="vertical-align: top; text-align: left">0.84</td>
<td style="vertical-align: top; text-align: left">0.85</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_132"><alternatives><mml:math>
<mml:mtext mathvariant="italic">Avg</mml:mtext></mml:math><tex-math><![CDATA[$\textit{Avg}$]]></tex-math></alternatives></inline-formula> <inline-formula id="j_infor527_ineq_133"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${S_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.18</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.19</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.19</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.19</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.18</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.21</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.21</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.18</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.19</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.22</td>
</tr>
</tbody>
</table>
</table-wrap>
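<p>The consistency check summarised in Table <xref rid="j_infor527_tab_004">4</xref> boils down to averaging cosine similarities over three lists of document pairs. The following Python sketch illustrates that computation; it assumes the two halves of every file have already been mapped to embedding vectors (e.g. via Doc2Vec's <monospace>infer_vector</monospace>), and the toy vectors at the end are placeholders rather than actual experimental data.</p>

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def avg(values):
    return sum(values) / len(values)

def consistency_scores(l1_halves, l2_halves):
    """Each argument is a list of (part1_vec, part2_vec) pairs, aligned
    by file index.  Returns the averages of:
      S1: both halves of each L1 file (should be high),
      S2: both halves of each L2 file (should be high),
      S3: first halves of same-index L1/L2 files (should be low)."""
    s1 = [cosine(p1, p2) for p1, p2 in l1_halves]
    s2 = [cosine(p1, p2) for p1, p2 in l2_halves]
    s3 = [cosine(a[0], b[0]) for a, b in zip(l1_halves, l2_halves)]
    return avg(s1), avg(s2), avg(s3)

# Placeholder vectors: halves of the same file point in similar
# directions, halves of unrelated files in nearly orthogonal ones.
l1 = [([1.0, 0.1], [0.9, 0.2])]
l2 = [([0.1, 1.0], [0.2, 0.9])]
avg_s1, avg_s2, avg_s3 = consistency_scores(l1, l2)
```

<p>With real Doc2Vec vectors, the three averages should reproduce the pattern of Table <xref rid="j_infor527_tab_004">4</xref>: high values for the first two lists, low values for the cross list.</p>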
</sec>
<sec id="j_infor527_s_017">
<label>4.2</label>
<title>Comparison to the Doc2Vec-AP Model</title>
<p>The previous tests showed that the models generated in this approach work well, but they did not establish which one is the best, nor whether they outperform the generic Doc2Vec AP model. To shed light on this issue, the performance of the models must be evaluated in a concrete application scenario. The methodology adopted in the validation and the discussion of the results obtained are detailed in Section <xref rid="j_infor527_s_018">4.2.1</xref> and Section <xref rid="j_infor527_s_019">4.2.2</xref>, respectively.</p>
<sec id="j_infor527_s_018">
<label>4.2.1</label>
<title>Experimental Methodology</title>
<p>The proposed validation scenario is inspired by the procedure that was adopted to select the similarity metric based on the Doc2Vec AP model (<inline-formula id="j_infor527_ineq_134"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Sim}_{\textit{AP}}}$]]></tex-math></alternatives></inline-formula>) as the best one among the 4 possibilities explored in Section <xref rid="j_infor527_s_014">3.6.2</xref>. In particular, the procedure is organized as follows:</p>
<list>
<list-item id="j_infor527_li_023">
<label>1.</label>
<p>First, the <inline-formula id="j_infor527_ineq_135"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{1}}$]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_infor527_ineq_136"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{10}}$]]></tex-math></alternatives></inline-formula> models were used to measure the Doc2Vec similarity among <inline-formula id="j_infor527_ineq_137"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> and each candidate text (included in the set <inline-formula id="j_infor527_ineq_138"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{\textit{AP}}}({T_{0}})$]]></tex-math></alternatives></inline-formula>). This led to 10 sets whose documents were arranged in decreasing order according to the similarity values measured with these models (denoted as <inline-formula id="j_infor527_ineq_139"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{1}}({T_{0}})$]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_infor527_ineq_140"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{10}}({T_{0}})$]]></tex-math></alternatives></inline-formula>).</p>
</list-item>
<list-item id="j_infor527_li_024">
<label>2.</label>
<p>Next, the focus was put on the positions of the <inline-formula id="j_infor527_ineq_141"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>’s entities in the ordered sets. These position values were averaged to obtain a consistency indicator that allowed the 10 ad hoc models to be compared with each other and with the generic <inline-formula id="j_infor527_ineq_142"><alternatives><mml:math>
<mml:mtext mathvariant="italic">AP</mml:mtext></mml:math><tex-math><![CDATA[$\textit{AP}$]]></tex-math></alternatives></inline-formula> one.</p>
<p><table-wrap id="j_infor527_tab_005">
<label>Table 5</label>
<caption>
<p>Set of 18 DB-SL entities considered in the experimental validation in the context of the Battle of Thermopylae.</p>
</caption>
<table>
<thead>
<tr>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">The 10 DB-SL entities identified directly from the initial text <inline-formula id="j_infor527_ineq_143"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Events</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">People</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Places</td>
</tr>
</tbody><tbody>
<tr>
<td style="vertical-align: top; text-align: left"><monospace>Battle_of_Thermopylae</monospace></td>
<td style="vertical-align: top; text-align: left"><monospace>Leonidas_I</monospace></td>
<td style="vertical-align: top; text-align: left"><monospace>Thespiae</monospace></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><monospace>Battle_of_Artemisium</monospace></td>
<td style="vertical-align: top; text-align: left"><monospace>Xerxes_I</monospace></td>
<td style="vertical-align: top; text-align: left"><monospace>Sparta</monospace></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><monospace>Battle_of_Marathon</monospace></td>
<td style="vertical-align: top; text-align: left"><monospace>Themistocles</monospace></td>
<td style="vertical-align: top; text-align: left"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: left"><monospace>Ephialtes_of_Trachis</monospace></td>
<td style="vertical-align: top; text-align: left"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><monospace>Darius_I</monospace></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<td colspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">The 8 new DB-SL entities about the Battle of Thermopylae that are not present in <inline-formula id="j_infor527_ineq_144"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Events</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">People</td>
</tr>
</tbody><tbody>
<tr>
<td style="vertical-align: top; text-align: left"><monospace>Battle_of_Salamis</monospace></td>
<td style="vertical-align: top; text-align: left"><monospace>Mardonius</monospace></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><monospace>Battle_of_Mycale</monospace></td>
<td style="vertical-align: top; text-align: left"><monospace>Hydarnes</monospace></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><monospace>Battle_of_Plataea</monospace></td>
<td style="vertical-align: top; text-align: left"><monospace>Hydarnes_II</monospace></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: left"><monospace>Immortals_(Achaemenid_Empire)</monospace></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><monospace>Herodotus</monospace></td>
</tr>
</tbody>
</table>
</table-wrap></p>
</list-item>
<list-item id="j_infor527_li_025">
<label>3.</label>
<p>For a more robust comparison in the application scenario linked to the <italic>Battle of Thermopylae</italic>, the list of 10 DB-SL entities initially discovered was extended with new entities that were actually significant in the context of the second Persian invasion of Greece (but were not mentioned in the initial text <inline-formula id="j_infor527_ineq_145"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>), as depicted in Table <xref rid="j_infor527_tab_005">5</xref>. This way, a very descriptive set of 18 DBpedia entities that characterised the context under consideration was defined. Of course, all of them were included in the set of candidates collected in the initial phases described in Section <xref rid="j_infor527_s_006">3</xref>. Moreover, their similarity to <inline-formula id="j_infor527_ineq_146"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> could be calculated with the Doc2Vec algorithm using the testing models <inline-formula id="j_infor527_ineq_147"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{1}}$]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_infor527_ineq_148"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{10}}$]]></tex-math></alternatives></inline-formula>, and they were also present in the aforementioned ordered sets <inline-formula id="j_infor527_ineq_149"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{x}}({T_{0}})$]]></tex-math></alternatives></inline-formula> calculated from these models (with <inline-formula id="j_infor527_ineq_150"><alternatives><mml:math>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>10</mml:mn>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$x\in [1,10]$]]></tex-math></alternatives></inline-formula>).</p>
</list-item>
<list-item id="j_infor527_li_026">
<label>4.</label>
<p>Lastly, the positions of those 18 DBpedia entities were searched in the sets <inline-formula id="j_infor527_ineq_151"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{1}}({T_{0}})$]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_infor527_ineq_152"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{10}}({T_{0}})$]]></tex-math></alternatives></inline-formula>. A typical run of the results obtained with both the 10 ad hoc models and the generic AP one is depicted in Table <xref rid="j_infor527_tab_006">6</xref> (recall that the results change slightly from run to run due to the inherent randomness of the Doc2Vec algorithm).</p>
</list-item>
</list>
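<p>The ranking and position-lookup steps of the procedure above can be sketched in a few lines of Python; the candidate identifiers and similarity values below are purely illustrative and are not taken from the actual candidate sets.</p>

```python
def rank_candidates(similarities):
    """similarities: dict mapping candidate document id to its Doc2Vec
    similarity with T0.  Returns ids sorted by decreasing similarity,
    i.e. the ordering of a set CT_i(T0)."""
    return sorted(similarities, key=similarities.get, reverse=True)

def entity_positions(ordering, entities):
    """1-based positions of the tracked DB-SL entities in an ordering."""
    index = {doc: pos for pos, doc in enumerate(ordering, start=1)}
    return {e: index[e] for e in entities if e in index}

# Illustrative similarities produced by one model M_i (not real data).
sims = {"Leonidas_I": 0.91, "Battle_of_Plataea": 0.88,
        "Herodotus": 0.42, "Queen_Sofia_of_Spain": 0.08}
order = rank_candidates(sims)
positions = entity_positions(order, ["Leonidas_I", "Battle_of_Plataea"])
avg_position = sum(positions.values()) / len(positions)
```

<p>Averaging the positions of the tracked entities in each ordering yields the consistency indicator used to compare the models with each other.</p>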
<table-wrap id="j_infor527_tab_006">
<label>Table 6</label>
<caption>
<p>Positions occupied by the 18 DB-SL entities of <inline-formula id="j_infor527_ineq_153"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> within a set of candidate texts that have been ordered (as per their similarity with <inline-formula id="j_infor527_ineq_154"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>) considering the generic AP model and our 10 in-domain Doc2Vec models.</p>
</caption>
<graphic xlink:href="infor527_g010.jpg"/>
</table-wrap>
</sec>
<sec id="j_infor527_s_019">
<label>4.2.2</label>
<title>Discussion on Experimental Results</title>
<p>As shown in Table <xref rid="j_infor527_tab_006">6</xref>, most of the entities occupy relevant positions in all of the studied orderings. This was consistently the case, even though the actual numbers change across different trainings on the same subset of candidates. As mentioned before, the Doc2Vec training algorithm involves some randomness in its steps (for instance, some data is randomly discarded to accelerate convergence without significantly influencing the final results). This is unimportant when computing a single similarity, as a value of 0.865 instead of 0.863 makes no practical difference. However, when ordering 83919 documents, these small differences can shift every entity slightly up or down. For instance, the <monospace>Battle of Plataea</monospace> (the final battle of the second Persian invasion of Greece) moves between positions 1 and 6, all of which are highly relevant rankings within an ordering of 83919 documents.</p>
<p>In addition, some discordant values have been highlighted with a gray background in Table <xref rid="j_infor527_tab_006">6</xref>. They are clearly outliers that must be taken into account to conduct a more appropriate analysis. In this regard, note that word embeddings are machine learning techniques that can be influenced by many factors, sometimes leading to unexpected results. For example, a key figure of the historical event under consideration may be described in the candidate text (retrieved from Wikipedia) in a way that is not rich enough for these methods to yield the expected similarity values. This is the case for the candidate text of the entity <monospace>Herodotus</monospace>: although he is the main source of information about the <monospace>Battle of Thermopylae</monospace> (hence his mention in the candidate document), this battle is in reality only a small part of his work as a historian. Experiments have also confirmed the influence of other aspects. In particular, even for the same set of texts, several training parameters can greatly affect the resulting model; with different training sets, these differences can be even more pronounced.</p>
<p>For the above reasons, it is natural that outliers appear: documents that should have been ranked high, but are not. This happens both with the small ad hoc corpora and with the huge generic AP corpus. However, as shown in Table <xref rid="j_infor527_tab_006">6</xref>, these outliers are not the same in all cases (although they are quite similar across the ad hoc corpora). Therefore, in order to compare the ad hoc models with each other and with the AP model appropriately, it is convenient to remove such outliers, since the overall quality of a model should be assessed on the majority of its results and not be skewed by a small number of irregular items.</p>
<p>This way, the Z-score (Jiang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_038">2009</xref>; Aggarwal, <xref ref-type="bibr" rid="j_infor527_ref_003">2017</xref>) and the IQR (Tukey, <xref ref-type="bibr" rid="j_infor527_ref_079">1977</xref>; Sunitha <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_077">2014</xref>) methods were adopted to identify outliers. Both methods confirmed that the highlighted values are discordant with the rest. Only values above position 400 were analysed, since low values may be discordant from a mathematical point of view but still significant in terms of similarity.<xref ref-type="fn" rid="j_infor527_fn_020">20</xref><fn id="j_infor527_fn_020"><label><sup>20</sup></label>
<p>For example, in the series 1, 2, 3, 10, the last value would be an outlier from a purely mathematical point of view, but when talking about positions within a set of 83919 elements it is clearly a low position that must be understood as a relevant value in any similarity ranking.</p></fn></p>
<p>When removing discordant outliers, it should be borne in mind that averages are being compared, so the same number of elements must be included in each calculation. Since outliers occupy the highest positions in any ranking, simply discarding them from their individual rankings would lead to results that are not directly comparable. Therefore, the different rankings were examined to find the one with the most outliers (3, in AP), and that amount was removed from all models; consequently, the 15 entities ranked in the lowest positions were considered for every model. It is worth noting that this is the best-case scenario for the AP model: the 3 highest entities were removed from its ordering, thus reducing its mean, so any other model that still outperforms AP does so under the conditions most favourable to AP.</p>
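<p>The outlier filtering and equal-size trimming described above can be sketched as follows. The sample positions are illustrative, and the Z-score threshold of 2.5 (relaxed from the textbook value of 3, which is out of reach for samples of this size) is an assumption; the position-400 floor mirrors the criterion stated in the text.</p>

```python
import statistics

def zscore_outliers(positions, threshold=2.5, floor=400):
    """Flag positions whose Z-score exceeds `threshold`, but only above
    `floor`: low positions stay relevant even if statistically discordant."""
    mu = statistics.mean(positions)
    sigma = statistics.pstdev(positions)
    if sigma == 0:
        return []
    return [p for p in positions
            if p > floor and (p - mu) / sigma > threshold]

def iqr_outliers(positions, k=1.5, floor=400):
    """Flag positions beyond Q3 + k*IQR (Tukey's rule), above `floor`."""
    q1, _, q3 = statistics.quantiles(positions, n=4)
    upper = q3 + k * (q3 - q1)
    return [p for p in positions if p > floor and p > upper]

def trim_equally(rankings, n_drop):
    """Drop the `n_drop` highest (worst) positions from every model's
    list, so that the resulting averages remain directly comparable."""
    return {m: sorted(pos)[: len(pos) - n_drop]
            for m, pos in rankings.items()}

positions = [1, 2, 3, 5, 8, 12, 20, 35, 4100]
bad_iqr = iqr_outliers(positions)
```

<p>Applying both criteria to each model's list of entity positions and then trimming every list by the largest outlier count keeps the per-model averages on an equal footing, as done in the comparison above.</p>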
<p>The results obtained with the 10 ad hoc models learned by taking as training datasets different percentages (from 1% to 10%) of the candidate texts in <inline-formula id="j_infor527_ineq_155"><alternatives><mml:math>
<mml:mtext mathvariant="italic">CT</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{CT}({T_{0}})$]]></tex-math></alternatives></inline-formula> are shown in the top row of Table <xref rid="j_infor527_tab_007">7</xref>, together with the outcome of the existing AP Doc2Vec model. Note that the results are averaged over several runs, considering only the entities found in the 15 lowest positions for each model.</p>
<table-wrap id="j_infor527_tab_007">
<label>Table 7</label>
<caption>
<p>Average positions occupied by the 15 best ranked entities, considering both the AP model and 20 Doc2Vec in-domain models learned by taking between 1% and 20% of the initially-selected candidate documents. The results confirm that the <inline-formula id="j_infor527_ineq_156"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{5}}$]]></tex-math></alternatives></inline-formula> model obtains the best outcome as it allows finding the 15 entities in the candidate documents that have been selected as the most similar to the input text (i.e. those occupying the lowest positions in the set <inline-formula id="j_infor527_ineq_157"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{5}}({T_{0}})$]]></tex-math></alternatives></inline-formula>).</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_158"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{i}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_159"><alternatives><mml:math>
<mml:mtext mathvariant="italic">AP</mml:mtext></mml:math><tex-math><![CDATA[$\textit{AP}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_160"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_161"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_162"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_163"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{4}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_164"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{5}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_165"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{6}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_166"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>7</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{7}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_167"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>8</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{8}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_168"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>9</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{9}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_169"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{10}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_170"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Avrg</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Avrg}_{i}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_171"><alternatives><mml:math>
<mml:mtext mathvariant="bold">78</mml:mtext></mml:math><tex-math><![CDATA[$\textbf{78}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">58</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">46</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">52</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">47</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_172"><alternatives><mml:math>
<mml:mtext mathvariant="bold">43</mml:mtext></mml:math><tex-math><![CDATA[$\textbf{43}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">44</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">46</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">55</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">58</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_173"><alternatives><mml:math>
<mml:mtext mathvariant="bold">83</mml:mtext></mml:math><tex-math><![CDATA[$\textbf{83}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_174"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{i}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_175"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>11</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{11}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_176"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>12</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{12}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_177"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>13</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{13}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_178"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>14</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{14}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_179"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>15</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{15}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_180"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>16</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{16}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_181"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>17</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{17}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_182"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>18</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{18}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_183"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>19</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{19}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_184"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>20</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{20}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_185"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Avrg</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{Avrg}_{i}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">68</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">60</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">61</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">74</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">66</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">75</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_186"><alternatives><mml:math>
<mml:mtext mathvariant="bold">93</mml:mtext></mml:math><tex-math><![CDATA[$\textbf{93}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_187"><alternatives><mml:math>
<mml:mtext mathvariant="bold">105</mml:mtext></mml:math><tex-math><![CDATA[$\textbf{105}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_188"><alternatives><mml:math>
<mml:mtext mathvariant="bold">92</mml:mtext></mml:math><tex-math><![CDATA[$\textbf{92}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_189"><alternatives><mml:math>
<mml:mtext mathvariant="bold">90</mml:mtext></mml:math><tex-math><![CDATA[$\textbf{90}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Based on the results, when compared to the 78-score<xref ref-type="fn" rid="j_infor527_fn_021">21</xref><fn id="j_infor527_fn_021"><label><sup>21</sup></label>
<p>Recall that this score means that 78 is the average of the positions occupied by the 15 DB-SL entities in the ordered set of candidate texts. The lower the average, the better, as this means that the relevant entities have been identified by the model as very similar to the initial text.</p></fn> of the <inline-formula id="j_infor527_ineq_190"><alternatives><mml:math>
<mml:mtext mathvariant="italic">AP</mml:mtext></mml:math><tex-math><![CDATA[$\textit{AP}$]]></tex-math></alternatives></inline-formula> model, the ad hoc models show excellent average positions for percentages ranging from 1% to 10%. This outcome is reasonable because when training with a small percentage of candidate documents, the selected texts are on the topic of the initial one (i.e. they are very similar to <inline-formula id="j_infor527_ineq_191"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>). Consequently, the models learned from these ad hoc corpora are effective at measuring similarities between <inline-formula id="j_infor527_ineq_192"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> and any of these documents. Particularly, as shown in Table <xref rid="j_infor527_tab_007">7</xref>, the ad hoc models <inline-formula id="j_infor527_ineq_193"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{1}}$]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_infor527_ineq_194"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>9</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{9}}$]]></tex-math></alternatives></inline-formula> outperform the generic <inline-formula id="j_infor527_ineq_195"><alternatives><mml:math>
<mml:mtext mathvariant="italic">AP</mml:mtext></mml:math><tex-math><![CDATA[$\textit{AP}$]]></tex-math></alternatives></inline-formula> model. This can be observed from the consistently lower average positions measured by each of these models, which are always significantly below 78. Notably, among them, model <inline-formula id="j_infor527_ineq_196"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{5}}$]]></tex-math></alternatives></inline-formula> achieves the best results.</p>
<p>To examine the evolution of the consistency across a larger number of models derived from different percentages of <inline-formula id="j_infor527_ineq_197"><alternatives><mml:math>
<mml:mtext mathvariant="italic">CT</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{CT}({T_{0}})$]]></tex-math></alternatives></inline-formula>, we replicated the above procedure for values ranging from 11% to 20%, as shown in the bottom row of Table <xref rid="j_infor527_tab_007">7</xref>. The results indicate that the quality of the learned ad hoc models starts to degrade as more documents are included in the training. This weakening is attributed to the additional documents being less related to the topic of <inline-formula id="j_infor527_ineq_198"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>, leading to the resulting model being less focused on the target scenario.</p>
<p>Based on Table <xref rid="j_infor527_tab_007">7</xref>, it can be observed that models trained with high percentages (17% to 20%) of candidate documents perform less effectively than the generic model <inline-formula id="j_infor527_ineq_199"><alternatives><mml:math>
<mml:mtext mathvariant="italic">AP</mml:mtext></mml:math><tex-math><![CDATA[$\textit{AP}$]]></tex-math></alternatives></inline-formula>. The inclusion of a large number of out-of-domain training texts leads to the development of generic models. Consequently, models <inline-formula id="j_infor527_ineq_200"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>17</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{17}}$]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_infor527_ineq_201"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>20</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{20}}$]]></tex-math></alternatives></inline-formula> are inferior to the <inline-formula id="j_infor527_ineq_202"><alternatives><mml:math>
<mml:mtext mathvariant="italic">AP</mml:mtext></mml:math><tex-math><![CDATA[$\textit{AP}$]]></tex-math></alternatives></inline-formula> model, as the latter has been trained with a more extensive document set, making it better suited for detecting similarities between two generic texts rather than being specifically tailored to the topic of <inline-formula id="j_infor527_ineq_203"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>.</p>
<p>To sum up, the results presented in Table <xref rid="j_infor527_tab_007">7</xref> confirm that the initial ad hoc models significantly outperform the generic AP model. For the domain studied in this paper, a corpus containing only 5% of the candidate texts yielded the best results with an average position of 43, as opposed to the AP model’s 78-score. This implies that, based on the criteria employed in this study, such ad hoc models (which have been constructed fully automatically, without requiring human contributions except for optimizing corpus creation) are more effective than the generic AP model in identifying the documents most relevant to a specific context defined in the initial text <inline-formula id="j_infor527_ineq_204"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>.</p>
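<p>The evaluation criterion behind Table <xref rid="j_infor527_tab_007">7</xref> can be sketched as follows: compute the average of the positions that the relevant (DB-SL) entities occupy in the list of candidate documents ranked by similarity to the input text, where lower averages indicate better models. The function and document identifiers below are illustrative, not taken from the authors' code.</p>

```python
# Sketch of the average-position metric reported in Table 7: the mean rank
# occupied by the relevant documents in the similarity-ordered candidate list.
# Lower is better (relevant documents ranked near the top of the list).

def average_position(ranked_docs, relevant_docs):
    """ranked_docs: candidate doc ids ordered from most to least similar to T0.
    relevant_docs: ids of the documents describing the relevant entities."""
    positions = [ranked_docs.index(d) + 1 for d in relevant_docs]
    return sum(positions) / len(positions)

# Toy example: 3 relevant documents among 10 candidates.
ranking = ["d3", "d7", "d1", "d9", "d2", "d5", "d8", "d0", "d4", "d6"]
relevant = ["d3", "d1", "d2"]
print(average_position(ranking, relevant))  # (1 + 3 + 5) / 3 = 3.0
```

<p>Under this metric, a score of 43 (model <inline-formula id="j_infor527_ineq_204b"><alternatives><mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">M</mml:mi></mml:mrow><mml:mrow><mml:mn>5</mml:mn></mml:mrow></mml:msub></mml:math><tex-math><![CDATA[${M_{5}}$]]></tex-math></alternatives></inline-formula>) means the relevant entities sit, on average, far closer to the top of the ranking than under the generic model's 78-score.</p>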
</sec>
</sec>
</sec>
<sec id="j_infor527_s_020">
<label>5</label>
<title>Conclusions and Further Work</title>
<p>In this paper, an automatic procedure based on the Linked Open Data infrastructure has been proposed that makes it easy to obtain ad hoc corpora (from a user-specified short input text) which benefit existing word-level and document-level embedding models. So far, such models have been fine-tuned on small collections of in-domain documents to improve their performance. These documents are often compiled manually, without assessing in any way the relevance of each text to the particular domain. In contrast, the approach described in this paper automatically gathers numerous in-domain training texts by relying on NER tools and state-of-the-art embedding models in order to guarantee a meaningful relationship between each candidate training document and the initial text.</p>
<p>On the one hand, DBpedia Spotlight is used to recognize DBpedia named entities in the initial text and drive the process of building an initial ad hoc corpus. On the other hand, Doc2Vec models make it possible to identify new relevant in-domain texts (to be incorporated into the ad hoc training dataset). This way, the final tailor-made corpus brings together a large number of meaningful and precise information sources, which lead to learning high-quality domain-specific embeddings.</p>
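<p>The document-selection step described above can be sketched as a similarity filter: given an embedding of the input text and embeddings of the candidate documents (in the actual system these come from a Doc2Vec model), only candidates sufficiently similar to the input are added to the ad hoc corpus. The vectors, document names and threshold below are invented for illustration.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_in_domain(t0_vec, candidates, threshold=0.5):
    """Keep only candidate documents whose embedding is close enough to T0's."""
    return [doc for doc, vec in candidates.items() if cosine(t0_vec, vec) >= threshold]

# Toy embeddings: one on-topic and one off-topic candidate document.
t0 = [0.9, 0.1, 0.0]
cands = {
    "doc_on_topic": [0.8, 0.2, 0.1],   # close to T0 -> kept in the corpus
    "doc_off_topic": [0.0, 0.1, 0.9],  # far from T0 -> discarded
}
print(select_in_domain(t0, cands))  # ['doc_on_topic']
```

<p>In practice the embeddings would be produced by the Doc2Vec model trained so far, so the corpus and the model can be refined iteratively.</p>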
<p>These dense vector representations accurately model domain peculiarities, which is especially critical for exploiting the language representation capabilities of embedding models in very particular fields (e.g. medicine, history or mechanical engineering, just to name a few). These fields are not adequately covered either by the huge publicly available generic training datasets or by the small hand-collected domain datasets adopted in some existing approaches. Unlike these works, our approach is able to build, without human assistance, a training corpus on any topic and domain that can be exploited by existing models, requiring only a variable-length piece of text as input.</p>
<p>In line with this, well-known models (like Word2Vec and GloVe) could take advantage of this approach to mitigate the Out-Of-Vocabulary (OOV) problem, which stems from the usage of generic training corpora. This limitation has traditionally been alleviated in other models (like FastText) by working at the subword level, which introduces substantial computational costs and memory requirements (Armand <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_005">2017</xref>). In contrast, our approach makes it possible to embed in-domain words that would be rare or unusual in a publicly available generic corpus, and that therefore cannot be learned from general-domain datasets or even from imprecise/incomplete domain-specific datasets.</p>
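<p>A minimal illustration of the OOV problem: domain terms absent from a generic training corpus simply receive no embedding in word-level models such as Word2Vec or GloVe, whereas an in-domain corpus covers them. The two toy corpora below are invented for the example.</p>

```python
# Toy illustration of vocabulary coverage: terms missing from the training
# corpus are out-of-vocabulary (OOV) and get no embedding at all.

generic_corpus = "the cat sat on the mat and looked at the dog".split()
in_domain_corpus = "angioplasty restores blood flow after a stent is placed".split()
query_terms = {"angioplasty", "stent", "dog"}

oov_generic = query_terms - set(generic_corpus)      # domain terms are OOV
oov_in_domain = query_terms - set(in_domain_corpus)  # only the generic term is OOV
print(sorted(oov_generic))    # ['angioplasty', 'stent']
print(sorted(oov_in_domain))  # ['dog']
```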
<p>Our procedure for custom corpus construction shows several advantages over the approaches presented in the related work section that address similar objectives: the approach described here is more autonomous, being less dependent on user feedback to guarantee the quality of its outputs; the relevance of each document to the given subject is measured before including it in the corpus; and the search process is based on a large open repository where previously unknown documents can be discovered, analysed and included in the corpus, besides being used to trigger new search processes, thus leading to larger custom corpora composed of thousands of documents.</p>
<p>As a limitation, the paper concentrates on evaluating the performance of the proposed method using Doc2Vec embeddings, without considering other Transformer-based models. Despite acknowledging the potential and remarkable results achieved by sophisticated Transformer architectures like BERT, such approaches have been omitted because certain characteristics of the available models do not align well with the specific objectives of our validation. In particular, our experimental validation has demonstrated that the custom-built in-domain corpus outperforms a generic training dataset within the context of the specific embedding model used (Doc2Vec in this case). To achieve this, we trained a Doc2Vec model from scratch using the tailored collection and compared it to a generic model. However, training a BERT model from scratch is not feasible for us due to its prohibitive computational requirements. Instead, it is common practice to start with a model pre-trained on a massive collection of generic information (referred to as the base model) and then fine-tune it for specific NLP tasks and custom corpora.</p>
<p>Given that our approach aimed to compare a model trained from scratch on the ad hoc collection against one trained on a generic collection, we found the use of Doc2Vec more suitable for the purposes of our research. This choice was driven by the need for an equitable comparison between the general and ad hoc approaches, which is made possible by comparing models trained from scratch with Doc2Vec. The experimental validation has tested the research hypothesis considered in the approach and has demonstrated that the automatically-built in-domain corpus performs better than a generic training dataset in the context of a particular embedding model (Doc2Vec in this case). In fact, replicating or even improving this behaviour with other recent and sophisticated Transformer-based embedding models (such as GPT, BERT and their multiple variants) would not invalidate the results obtained in this work with Doc2Vec. Apart from the aforementioned reasons, Doc2Vec presents two additional compelling advantages: first, its good performance against related document-level models (Bhattacharya <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_014">2022</xref>; Kim <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_042">2018</xref>; Grefenstette <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_032">2013</xref>; Mikolov <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_055">2013a</xref>; Kiros <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor527_ref_043">2015</xref>, <xref ref-type="bibr" rid="j_infor527_ref_044">2018</xref>); and second, the existence of a mature implementation through GenSim (Rehürek and Sojka, <xref ref-type="bibr" rid="j_infor527_ref_071">2010</xref>), which allowed us to train our own models from scratch and evaluate the effect of the training corpus on the quality of the resulting models (rather than simply using pre-trained models learned from inaccessible documents).</p>
<p>Regarding further work, having experimentally validated that in-domain corpora improve on generic training datasets in a very specific domain, our short-term research plans are to explore the performance of models that are first learned from a generic corpus and then fine-tuned on a collection of in-domain texts (automatically retrieved by the proposed algorithm). The goal of these experiments is to incorporate a diverse array of models, encompassing advanced Transformer-based approaches like BERT, along with the numerous other models that are continually emerging in the literature.</p>
</sec>
</body>
<back>
<app-group> 
<app id="j_infor527_app_001"><label>A</label>
<title>Mathematical Notation Adopted</title>
<table-wrap id="j_infor527_tab_008">
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Notation</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Meaning</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Equations</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_205"><alternatives><mml:math>
<mml:mtext mathvariant="italic">DE</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{DE}({T_{0}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Set of DBpedia entities that DBpedia Spotlight identifies in the input text <inline-formula id="j_infor527_ineq_206"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>, which share common Wikicats and/or subjects.</td>
<td style="vertical-align: top; text-align: left">(1) and (2)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_207"><alternatives><mml:math>
<mml:mtext mathvariant="italic">WK</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{WK}(e)$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Set of relevant Wikicats that characterise a given entity <italic>e</italic>.</td>
<td style="vertical-align: top; text-align: left">(3) and (4)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_208"><alternatives><mml:math>
<mml:mtext mathvariant="italic">WK</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{WK}({T_{0}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Set of relevant Wikicats that characterise the input text <inline-formula id="j_infor527_ineq_209"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>.</td>
<td style="vertical-align: top; text-align: left">(5)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_210"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">U</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">DB</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${U_{\textit{DB}}}({w_{k}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Set of URLs of pages dealing with entities tagged with the <inline-formula id="j_infor527_ineq_211"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${w_{k}}$]]></tex-math></alternatives></inline-formula> wikicat in DBpedia.</td>
<td style="vertical-align: top; text-align: left">(6)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_212"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">U</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">WK</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${U_{\textit{WK}}}({w_{k}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Set of URLs of pages dealing with entities tagged with the <inline-formula id="j_infor527_ineq_213"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${w_{k}}$]]></tex-math></alternatives></inline-formula> wikicat in Wikidata.</td>
<td style="vertical-align: top; text-align: left">(7)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_214"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">U</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${U_{DW}}({w_{k}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Union of the sets <inline-formula id="j_infor527_ineq_215"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">U</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">DB</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${U_{\textit{DB}}}({w_{k}})$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_216"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">U</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">WK</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${U_{\textit{WK}}}({w_{k}})$]]></tex-math></alternatives></inline-formula>.</td>
<td style="vertical-align: top; text-align: left">(8)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_217"><alternatives><mml:math>
<mml:mi mathvariant="italic">U</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$U({T_{0}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Set of URLs that are associated with some wikicat included in <inline-formula id="j_infor527_ineq_218"><alternatives><mml:math>
<mml:mtext mathvariant="italic">WK</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{WK}({T_{0}})$]]></tex-math></alternatives></inline-formula>.</td>
<td style="vertical-align: top; text-align: left">(9)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_219"><alternatives><mml:math>
<mml:mtext mathvariant="italic">CT</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{CT}({T_{0}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Set of documents related to <inline-formula id="j_infor527_ineq_220"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> to some extent, which are candidates for incorporation into the ad hoc training corpus.</td>
<td style="vertical-align: top; text-align: left">(10)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_221"><alternatives><mml:math>
<mml:mtext mathvariant="italic">Corpus</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{Corpus}({T_{0}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Custom-built training corpus that includes only the domain-specific documents in <inline-formula id="j_infor527_ineq_222"><alternatives><mml:math>
<mml:mtext mathvariant="italic">CT</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{CT}({T_{0}})$]]></tex-math></alternatives></inline-formula> that are significantly related to <inline-formula id="j_infor527_ineq_223"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>.</td>
<td style="vertical-align: top; text-align: left">(11)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_224"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{Sim}_{W}}({T_{c}},{T_{0}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Semantic similarity metric based on the common wikicats identified between the candidate text <inline-formula id="j_infor527_ineq_225"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{c}}$]]></tex-math></alternatives></inline-formula> and the input text <inline-formula id="j_infor527_ineq_226"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>.</td>
<td style="vertical-align: top; text-align: left">3.6.1 and 3.6.2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_227"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{Sim}_{S}}({T_{c}},{T_{0}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Semantic similarity metric based on the common subjects identified between <inline-formula id="j_infor527_ineq_228"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{c}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_229"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>.</td>
<td style="vertical-align: top; text-align: left">3.6.1 and 3.6.2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_230"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{Sim}_{C}}({T_{c}},{T_{0}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Semantic similarity metric between <inline-formula id="j_infor527_ineq_231"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{c}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_232"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> measured by the spaCy Python package.</td>
<td style="vertical-align: top; text-align: left">3.6.1 and 3.6.2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_233"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{Sim}_{\textit{AP}}}({T_{c}},{T_{0}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Semantic similarity metric between <inline-formula id="j_infor527_ineq_234"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{c}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_235"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula> measured by the existing AP Doc2Vec model (which has been trained with a generic collection of Associated Press news).</td>
<td style="vertical-align: top; text-align: left">3.6.1 and 3.6.2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_236"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{W}}({T_{0}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Set resulting from sorting <inline-formula id="j_infor527_ineq_237"><alternatives><mml:math>
<mml:mtext mathvariant="italic">CT</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{CT}({T_{0}})$]]></tex-math></alternatives></inline-formula> in decreasing order as per the similarity values measured (between each <inline-formula id="j_infor527_ineq_238"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{c}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_239"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>) by the metric <inline-formula id="j_infor527_ineq_240"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">W</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{Sim}_{W}}({T_{c}},{T_{0}})$]]></tex-math></alternatives></inline-formula>.</td>
<td style="vertical-align: top; text-align: left">3.6.2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_241"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{S}}({T_{0}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Set resulting from sorting <inline-formula id="j_infor527_ineq_242"><alternatives><mml:math>
<mml:mtext mathvariant="italic">CT</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{CT}({T_{0}})$]]></tex-math></alternatives></inline-formula> in decreasing order as per the similarity values measured (between each <inline-formula id="j_infor527_ineq_243"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{c}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_244"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>) by the metric <inline-formula id="j_infor527_ineq_245"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{Sim}_{S}}({T_{c}},{T_{0}})$]]></tex-math></alternatives></inline-formula>.</td>
<td style="vertical-align: top; text-align: left">3.6.2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_246"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{C}}({T_{0}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Set resulting from sorting <inline-formula id="j_infor527_ineq_247"><alternatives><mml:math>
<mml:mtext mathvariant="italic">CT</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{CT}({T_{0}})$]]></tex-math></alternatives></inline-formula> in decreasing order as per the similarity values measured (between each <inline-formula id="j_infor527_ineq_248"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{c}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_249"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>) by the metric <inline-formula id="j_infor527_ineq_250"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{Sim}_{C}}({T_{c}},{T_{0}})$]]></tex-math></alternatives></inline-formula>.</td>
<td style="vertical-align: top; text-align: left">3.6.2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_251"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{\textit{AP}}}({T_{0}})$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Set resulting from sorting <inline-formula id="j_infor527_ineq_252"><alternatives><mml:math>
<mml:mtext mathvariant="italic">CT</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textit{CT}({T_{0}})$]]></tex-math></alternatives></inline-formula> in decreasing order as per the similarity values measured (between each <inline-formula id="j_infor527_ineq_253"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{c}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor527_ineq_254"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${T_{0}}$]]></tex-math></alternatives></inline-formula>) by the metric <inline-formula id="j_infor527_ineq_255"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Sim</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{Sim}_{\textit{AP}}}({T_{c}},{T_{0}})$]]></tex-math></alternatives></inline-formula>.</td>
<td style="vertical-align: top; text-align: left">3.6.2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor527_ineq_256"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{i}}$]]></tex-math></alternatives></inline-formula> with <inline-formula id="j_infor527_ineq_257"><alternatives><mml:math>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>20</mml:mn>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$i\in [1,20]$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Doc2Vec model learned from an ad hoc training corpus including i% (from 1% in <inline-formula id="j_infor527_ineq_258"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{1}}$]]></tex-math></alternatives></inline-formula> to 20% in <inline-formula id="j_infor527_ineq_259"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>20</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{20}}$]]></tex-math></alternatives></inline-formula>) of the candidate documents in <inline-formula id="j_infor527_ineq_260"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{\textit{AP}}}({T_{0}})$]]></tex-math></alternatives></inline-formula>.</td>
<td style="vertical-align: top; text-align: left">4.1 and 4.2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor527_ineq_261"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{i}}({T_{0}})$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor527_ineq_262"><alternatives><mml:math>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>20</mml:mn>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$i\in [1,20]$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Set resulting from sorting <inline-formula id="j_infor527_ineq_263"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">CT</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AP</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textit{CT}_{\textit{AP}}}({T_{0}})$]]></tex-math></alternatives></inline-formula> by the Doc2Vec ad hoc model <inline-formula id="j_infor527_ineq_264"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{i}}$]]></tex-math></alternatives></inline-formula>.</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">4.2.1 and 4.2.2</td>
</tr>
</tbody>
</table>
</table-wrap>
</app></app-group>
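<p>The ranking step that the notation table describes (computing a similarity between each candidate text <italic>T<sub>c</sub></italic> and the input text <italic>T<sub>0</sub></italic>, then sorting the candidate set in decreasing order of that similarity) can be sketched in Python. This is a minimal illustration, not the paper's implementation: it models the wikicat-based metric as Jaccard overlap of wikicat sets, whereas the paper's exact normalization may differ, and the function names and the <code>wikicats</code> field are hypothetical.</p>

```python
def sim_w(wikicats_tc, wikicats_t0):
    """Wikicat-based similarity between two texts, sketched as the
    Jaccard overlap of their wikicat sets.

    Assumption: Sim_W(Tc, T0) is driven by the wikicats the two texts
    share; the normalization used in the paper may differ.
    """
    if not wikicats_tc or not wikicats_t0:
        return 0.0
    common = wikicats_tc & wikicats_t0
    total = wikicats_tc | wikicats_t0
    return len(common) / len(total)


def ct_w(candidates, t0_wikicats):
    """CT_W(T0): the candidate set sorted in decreasing order of the
    similarity measured between each Tc and T0 by sim_w."""
    return sorted(candidates,
                  key=lambda tc: sim_w(tc["wikicats"], t0_wikicats),
                  reverse=True)
```

<p>The same sorting pattern yields the sets <italic>CT<sub>S</sub></italic>, <italic>CT<sub>C</sub></italic> and <italic>CT<sub>AP</sub></italic> by swapping in the subject-based, spaCy-based or AP Doc2Vec similarity in place of <code>sim_w</code>.</p>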
<ref-list id="j_infor527_reflist_001">
<title>References</title>
<ref id="j_infor527_ref_001">
<mixed-citation publication-type="journal"><string-name><surname>Abacha</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Demner-Fushman</surname>, <given-names>D.</given-names></string-name> (<year>2016</year>). <article-title>Recognizing question entailment for medical question answering</article-title>. <source>AMIA Annual Symposium Proceedings</source>, <volume>2016</volume>, <fpage>310</fpage>–<lpage>318</lpage>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_002">
<mixed-citation publication-type="chapter"><string-name><surname>Adjali</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Besancon</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Ferret</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Le Borgne</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Grau</surname>, <given-names>B.</given-names></string-name> (<year>2020</year>). <chapter-title>Multimodal entity linking for tweets</chapter-title>. In: <source>Proceedings of the 42nd European Conference on Advanced Information Retrieval</source>, <publisher-loc>Lisbon, Portugal</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_003">
<mixed-citation publication-type="book"><string-name><surname>Aggarwal</surname>, <given-names>C.C.</given-names></string-name> (<year>2017</year>). <source>Outlier Analysis</source>. <publisher-name>Springer</publisher-name>, <publisher-loc>New York</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_004">
<mixed-citation publication-type="journal"><string-name><surname>An</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>H.</given-names></string-name> (<year>2020</year>). <article-title>Error detection in a large-scale lexical taxonomy</article-title>. <source>Information</source>, <volume>11</volume>(<issue>2</issue>). <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.3390/info11020097" xlink:type="simple">https://doi.org/10.3390/info11020097</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_005">
<mixed-citation publication-type="chapter"><string-name><surname>Joulin</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Grave</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Bojanowski</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Mikolov</surname>, <given-names>T.</given-names></string-name> (<year>2017</year>). <chapter-title>Bag of tricks for efficient text classification</chapter-title>. In: <source>Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics</source>. <publisher-name>Association for Computational Linguistics (ACL)</publisher-name>, <publisher-loc>Valencia, Spain</publisher-loc>, pp. <fpage>427</fpage>–<lpage>431</lpage>. <uri>https://www.aclweb.org/anthology/E17-2068</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_006">
<mixed-citation publication-type="chapter"><string-name><surname>Axelrod</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>He</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Gao</surname>, <given-names>J.</given-names></string-name> (<year>2011</year>). <chapter-title>Domain adaptation via pseudo in-domain data selection</chapter-title>. In: <source>Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>. <publisher-name>Association for Computational Linguistics (ACL)</publisher-name>, <publisher-loc>Edinburgh, Scotland</publisher-loc>, pp. <fpage>355</fpage>–<lpage>362</lpage>. <uri>https://www.aclweb.org/anthology/D11-1033</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_007">
<mixed-citation publication-type="journal"><string-name><surname>Bansal</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Srivastava</surname>, <given-names>S.</given-names></string-name> (<year>2019</year>). <article-title>Hybrid attribute based sentiment classification of online reviews for consumer intelligence</article-title>. <source>Applied Intelligence</source>, <volume>49</volume>, <fpage>137</fpage>–<lpage>149</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s10489-018-1299-7" xlink:type="simple">https://doi.org/10.1007/s10489-018-1299-7</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_008">
<mixed-citation publication-type="chapter"><string-name><surname>Barbaresi</surname>, <given-names>A.</given-names></string-name> (<year>2013</year>a). <chapter-title>Challenges in web corpus construction for low-resource languages in a post-BootCaT world</chapter-title>. In: <source>Proceedings of the 6th Language &amp; Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics</source>, <publisher-loc>Poznan, Poland</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_009">
<mixed-citation publication-type="chapter"><string-name><surname>Barbaresi</surname>, <given-names>A.</given-names></string-name> (<year>2013</year>b). <chapter-title>Crawling microblogging services to gather language-classified URLs. Workflow and case study</chapter-title>. In: <source>Proceedings of the ACL Student Research Workshop</source>, <publisher-loc>Sofia, Bulgaria</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_010">
<mixed-citation publication-type="chapter"><string-name><surname>Barbaresi</surname>, <given-names>A.</given-names></string-name> (<year>2014</year>). <chapter-title>Finding viable seed URLs for web corpora: a scouting approach and comparative study of available resources</chapter-title>. In: <source>Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics</source>, <publisher-loc>Gothenburg, Sweden</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_011">
<mixed-citation publication-type="chapter"><string-name><surname>Baroni</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Bernardini</surname>, <given-names>S.</given-names></string-name> (<year>2004</year>). <chapter-title>BootCaT: bootstrapping corpora and terms from the web</chapter-title>. In: <source>Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC)</source>, <publisher-loc>Lisbon, Portugal</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_012">
<mixed-citation publication-type="chapter"><string-name><surname>Baroni</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Kilgarriff</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Pomikalek</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Rychly</surname>, <given-names>P.</given-names></string-name> (<year>2006</year>). <chapter-title>WebBootCaT: a web tool for instant corpora</chapter-title>. In: <source>Proceedings of the 12th EURALEX International Congress</source>, <publisher-loc>Torino, Italy</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_013">
<mixed-citation publication-type="chapter"><string-name><surname>Barrena</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Soroa</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Agirre</surname>, <given-names>E.</given-names></string-name> (<year>2015</year>). <chapter-title>Combining mention context and hyperlinks from Wikipedia for named entity disambiguation</chapter-title>. In: <source>Proceedings of the 4th Joint Conference on Lexical and Computational Semantics</source>, <publisher-loc>Denver, USA</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_014">
<mixed-citation publication-type="journal"><string-name><surname>Bhattacharya</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Ghosh</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Pal</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Ghosh</surname>, <given-names>S.</given-names></string-name> (<year>2022</year>). <article-title>Legal case document similarity: You need both network and text</article-title>. <source>Information Processing &amp; Management</source>, <volume>59</volume>(<issue>6</issue>). <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.ipm.2022.103069" xlink:type="simple">https://doi.org/10.1016/j.ipm.2022.103069</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_015">
<mixed-citation publication-type="journal"><string-name><surname>Blanco</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Gil-Solla</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Pazos-Arias</surname>, <given-names>J.J.</given-names></string-name>, <string-name><surname>Ramos-Cabrer</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Daif</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>López-Nores</surname>, <given-names>M.</given-names></string-name> (<year>2020</year>). <article-title>Distracting users as per their knowledge: combining linked open data and word embeddings to enhance history learning</article-title>. <source>Expert Systems with Applications</source>, <volume>143</volume>, <fpage>1</fpage>–<lpage>16</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2019.113051" xlink:type="simple">https://doi.org/10.1016/j.eswa.2019.113051</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_016">
<mixed-citation publication-type="chapter"><string-name><surname>Bollegala</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Maehara</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Kawarabayashi</surname>, <given-names>K.</given-names></string-name> (<year>2015</year>). <chapter-title>Learning word representations from relational graphs</chapter-title>. In: <source>Proceedings of the 29th AAAI Conference on Artificial Intelligence</source>. <publisher-name>AAAI Press</publisher-name>, <publisher-loc>Austin, Texas, USA</publisher-loc>, pp. <fpage>2146</fpage>–<lpage>2152</lpage>. <uri>https://arxiv.org/pdf/1412.2378.pdf</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_017">
<mixed-citation publication-type="chapter"><string-name><surname>Cano</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Morisio</surname>, <given-names>M.</given-names></string-name> (<year>2017</year>). <chapter-title>Quality of word embeddings on sentiment analysis tasks</chapter-title>. In: <source>Proceedings of the 22nd International Conference on Natural Language &amp; Information Systems</source>, <publisher-loc>Liege, Belgium</publisher-loc>, pp. <fpage>1</fpage>–<lpage>8</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/978-3-319-59569-6_42" xlink:type="simple">https://doi.org/10.1007/978-3-319-59569-6_42</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_018">
<mixed-citation publication-type="chapter"><string-name><surname>Castagnoli</surname>, <given-names>S.</given-names></string-name> (<year>2015</year>). <chapter-title>Using the Web as a source of LSP corpora in the terminology classroom</chapter-title>. In: <source>Wacky! Working Papers on the Web as Corpus</source>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_019">
<mixed-citation publication-type="chapter"><string-name><surname>Chen</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>Peng</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Lu</surname>, <given-names>Z.</given-names></string-name> (<year>2019</year>). <chapter-title>BioSentVec: creating sentence embeddings for biomedical texts</chapter-title>. In: <source>Proceedings of the IEEE International Conference on Healthcare Informatics</source>. <publisher-name>IEEE</publisher-name>, <publisher-loc>Xi'an, China</publisher-loc>, pp. <fpage>1</fpage>–<lpage>5</lpage>. <uri>https://arxiv.org/abs/1810.09302</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_020">
<mixed-citation publication-type="chapter"><string-name><surname>Chiu</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Crichton</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Korhonen</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Pyysalo</surname>, <given-names>S.</given-names></string-name> (<year>2016</year>). <chapter-title>How to train good word embeddings for biomedical NLP</chapter-title>. In: <source>Proceedings of the 15th Workshop on Biomedical Natural Language Processing</source>. <publisher-name>Association for Computational Linguistics (ACL)</publisher-name>, <publisher-loc>Berlin, Germany</publisher-loc>, pp. <fpage>166</fpage>–<lpage>174</lpage>. <uri>https://www.aclweb.org/anthology/W16-2922</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_021">
<mixed-citation publication-type="book"><string-name><surname>Crystal</surname>, <given-names>D.</given-names></string-name> (<year>2011</year>). <source>Internet Linguistics</source>. <publisher-name>Routledge</publisher-name>, <publisher-loc>London</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_022">
<mixed-citation publication-type="other"><string-name><surname>Dai</surname>, <given-names>A.M.</given-names></string-name>, <string-name><surname>Olah</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Le</surname>, <given-names>Q.V.</given-names></string-name> (<year>2015</year>). Document embedding with Paragraph Vectors. <uri>https://arxiv.org/abs/1507.07998</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_023">
<mixed-citation publication-type="chapter"><string-name><surname>Devlin</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Chang</surname>, <given-names>M.W.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Toutanova</surname>, <given-names>K.</given-names></string-name> (<year>2019</year>). <chapter-title>BERT: pre-training of deep bidirectional transformers for language understanding</chapter-title>. In: <source>Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>. <publisher-name>Association for Computational Linguistics (ACL)</publisher-name>, <publisher-loc>Minneapolis, USA</publisher-loc>, pp. <fpage>4171</fpage>–<lpage>4186</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.18653/v1/N19-1423" xlink:type="simple">https://doi.org/10.18653/v1/N19-1423</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_024">
<mixed-citation publication-type="chapter"><string-name><surname>Ethayarajh</surname>, <given-names>K.</given-names></string-name> (<year>2019</year>). <chapter-title>How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings</chapter-title>. In: <source>Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing</source>. <publisher-name>Association for Computational Linguistics (ACL)</publisher-name>, <publisher-loc>Hong Kong, China</publisher-loc>, pp. <fpage>55</fpage>–<lpage>65</lpage>. <uri>https://arxiv.org/abs/1909.00512v1</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_025">
<mixed-citation publication-type="journal"><string-name><surname>Faustini</surname>, <given-names>P.H.A.</given-names></string-name>, <string-name><surname>Covões</surname>, <given-names>T.F.</given-names></string-name> (<year>2020</year>). <article-title>Fake news detection in multiple platforms and languages</article-title>. <source>Expert Systems with Applications</source>, <volume>158</volume>, <fpage>1</fpage>–<lpage>9</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2020.113503" xlink:type="simple">https://doi.org/10.1016/j.eswa.2020.113503</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_026">
<mixed-citation publication-type="journal"><string-name><surname>Fu</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Qu</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Lu</surname>, <given-names>L.</given-names></string-name> (<year>2018</year>). <article-title>Bag of meta-words: a novel method to represent document for the sentiment classification</article-title>. <source>Expert Systems with Applications</source>, <volume>113</volume>, <fpage>33</fpage>–<lpage>43</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2018.06.052" xlink:type="simple">https://doi.org/10.1016/j.eswa.2018.06.052</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_027">
<mixed-citation publication-type="journal"><string-name><surname>Gali</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Mariescu-Istodor</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Hostettler</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Franti</surname>, <given-names>P.</given-names></string-name> (<year>2019</year>). <article-title>Framework for syntactic string similarity measures</article-title>. <source>Expert Systems with Applications</source>, <volume>129</volume>(<issue>1</issue>), <fpage>169</fpage>–<lpage>185</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2019.03.048" xlink:type="simple">https://doi.org/10.1016/j.eswa.2019.03.048</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_028">
<mixed-citation publication-type="chapter"><string-name><surname>Ganea</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Hofmann</surname>, <given-names>T.</given-names></string-name> (<year>2017</year>). <chapter-title>Deep joint entity disambiguation with local neural attention</chapter-title>. In: <source>Proceedings of the 17th Conference on Empirical Methods in Natural Language Processing</source>, <publisher-loc>Copenhagen, Denmark</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_029">
<mixed-citation publication-type="book"><string-name><surname>Gatto</surname>, <given-names>M.</given-names></string-name> (<year>2014</year>). <source>Web as Corpus: Theory and Practice</source>. <publisher-name>A&amp;C Black</publisher-name>, <publisher-loc>London</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_030">
<mixed-citation publication-type="journal"><string-name><surname>Geng</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Han</surname>, <given-names>Y.</given-names></string-name> (<year>2021</year>). <article-title>Joint entity and relation extraction model based on rich semantics</article-title>. <source>Neurocomputing</source>, <volume>429</volume>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.neucom.2020.12.037" xlink:type="simple">https://doi.org/10.1016/j.neucom.2020.12.037</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_031">
<mixed-citation publication-type="journal"><string-name><surname>Giatsoglou</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Vozalis</surname>, <given-names>M.G.</given-names></string-name>, <string-name><surname>Diamantaras</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Chatzisavvas</surname>, <given-names>K.</given-names></string-name> (<year>2017</year>). <article-title>Sentiment analysis leveraging emotions and word embeddings</article-title>. <source>Expert Systems with Applications</source>, <volume>69</volume>(<issue>1</issue>), <fpage>214</fpage>–<lpage>224</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2016.10.043" xlink:type="simple">https://doi.org/10.1016/j.eswa.2016.10.043</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_032">
<mixed-citation publication-type="chapter"><string-name><surname>Grefenstette</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Dinu</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Sadrzadeh</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Baroni</surname>, <given-names>M.</given-names></string-name> (<year>2013</year>). <chapter-title>Multi-step regression learning for compositional distributional semantics</chapter-title>. In: <source>Proceedings of the 10th International Conference on Computational Semantics (IWCS)</source>, <publisher-loc>Potsdam, Germany</publisher-loc>, pp. <fpage>131</fpage>–<lpage>142</lpage>. <uri>https://www.aclweb.org/anthology/W13-0112</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_033">
<mixed-citation publication-type="journal"><string-name><surname>Gu</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Tinn</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Cheng</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Lucas</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Usuyama</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Naumann</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Gao</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Poon</surname>, <given-names>H.</given-names></string-name> (<year>2021</year>). <article-title>Domain-specific language model pretraining for biomedical natural language processing</article-title>. <source>ACM Transactions on Computing for Healthcare</source>, <volume>3</volume>(<issue>1</issue>). <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1145/3458754" xlink:type="simple">https://doi.org/10.1145/3458754</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_034">
<mixed-citation publication-type="journal"><string-name><surname>Gutiérrez-Batista</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Campaña</surname>, <given-names>J.R.</given-names></string-name>, <string-name><surname>Vila</surname>, <given-names>M.A.</given-names></string-name>, <string-name><surname>Martin-Bautista</surname>, <given-names>M.</given-names></string-name> (<year>2018</year>). <article-title>An ontology-based framework for automatic topic detection in multilingual environments</article-title>. <source>International Journal of Intelligent Systems</source>, <volume>33</volume>, <fpage>1459</fpage>–<lpage>1475</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1002/int.21986" xlink:type="simple">https://doi.org/10.1002/int.21986</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_035">
<mixed-citation publication-type="chapter"><string-name><surname>Han</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Kashyap</surname>, <given-names>A.L.</given-names></string-name>, <string-name><surname>Finin</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Mayfield</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Weese</surname>, <given-names>J.</given-names></string-name> (<year>2013</year>). <chapter-title>UMBC_EBIQUITY-CORE: semantic textual similarity systems</chapter-title>. In: <source>Proceedings of the 2nd Joint Conference on Lexical and Computational Semantics</source>. <publisher-name>Association for Computational Linguistics</publisher-name>, <publisher-loc>Atlanta, USA</publisher-loc>, pp. <fpage>44</fpage>–<lpage>52</lpage>. <uri>https://aclanthology.org/S13-1005</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_036">
<mixed-citation publication-type="journal"><string-name><surname>Hu</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Ding</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Shi</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Shao</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>S.</given-names></string-name> (<year>2020</year>). <article-title>Graph neural entity disambiguation</article-title>. <source>Knowledge-Based Systems</source>, <volume>195</volume>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.knosys.2020.105620" xlink:type="simple">https://doi.org/10.1016/j.knosys.2020.105620</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_037">
<mixed-citation publication-type="journal"><string-name><surname>Ismayilov</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Kontokostas</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Auer</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Lehmann</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Hellmann</surname>, <given-names>S.</given-names></string-name> (<year>2015</year>). <article-title>Wikidata through the eyes of DBpedia</article-title>. <source>Semantic Web</source>, <volume>9</volume>(<issue>4</issue>), <fpage>1</fpage>–<lpage>11</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.3233/SW-170277" xlink:type="simple">https://doi.org/10.3233/SW-170277</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_038">
<mixed-citation publication-type="journal"><string-name><surname>Jiang</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Sui</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Cao</surname>, <given-names>C.</given-names></string-name> (<year>2009</year>). <article-title>Some issues about outlier detection in rough set theory</article-title>. <source>Expert Systems with Applications</source>, <volume>36</volume>(<issue>3</issue>), <fpage>4680</fpage>–<lpage>4687</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2008.06.019" xlink:type="simple">https://doi.org/10.1016/j.eswa.2008.06.019</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_039">
<mixed-citation publication-type="journal"><string-name><surname>Jung</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Shing</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>S.</given-names></string-name> (<year>2022</year>). <article-title>Impact of preprocessing and word embeddings on extreme multi-label patent classification tasks</article-title>. <source>Applied Intelligence</source>, <volume>53</volume>, <fpage>4047</fpage>–<lpage>4062</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s10489-022-03655-5" xlink:type="simple">https://doi.org/10.1007/s10489-022-03655-5</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_040">
<mixed-citation publication-type="journal"><string-name><surname>Khatua</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Khatua</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Cambria</surname>, <given-names>E.</given-names></string-name> (<year>2019</year>). <article-title>A tale of two epidemics: Contextual Word2Vec for classifying Twitter streams during outbreaks</article-title>. <source>Information Processing &amp; Management</source>, <volume>56</volume>(<issue>1</issue>), <fpage>247</fpage>–<lpage>257</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.ipm.2018.10.010" xlink:type="simple">https://doi.org/10.1016/j.ipm.2018.10.010</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_041">
<mixed-citation publication-type="chapter"><string-name><surname>Kilgarriff</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Reddy</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Pomikalek</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Avinesh</surname>, <given-names>P.</given-names></string-name> (<year>2010</year>). <chapter-title>A corpus factory for many languages</chapter-title>. In: <source>Proceedings of the 7th International Conference on Language Resources and Evaluation</source>, <publisher-loc>Malta</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_042">
<mixed-citation publication-type="journal"><string-name><surname>Kim</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Seo</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Cho</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Kang</surname>, <given-names>P.</given-names></string-name> (<year>2018</year>). <article-title>Multi-co-training for document classification using various document representations: TFIDF, LDA, and Doc2Vec</article-title>. <source>Information Sciences</source>, <volume>477</volume>, <fpage>15</fpage>–<lpage>29</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.ins.2018.10.006" xlink:type="simple">https://doi.org/10.1016/j.ins.2018.10.006</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_043">
<mixed-citation publication-type="chapter"><string-name><surname>Kiros</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Zhu</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Salakhutdinov</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Zemel</surname>, <given-names>R.S.</given-names></string-name>, <string-name><surname>Torralba</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Urtasun</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Fidler</surname>, <given-names>S.</given-names></string-name> (<year>2015</year>). <chapter-title>Skip-Thought vectors</chapter-title>. In: <source>Proceedings of the Neural Information Processing Systems Conference</source>, pp. <fpage>1</fpage>–<lpage>11</lpage>. <uri>http://arxiv.org/abs/1506.06726</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_044">
<mixed-citation publication-type="chapter"><string-name><surname>Logeswaran</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>H.</given-names></string-name> (<year>2018</year>). <chapter-title>An efficient framework for learning sentence representations</chapter-title>. In: <source>Proceedings of the 6th International Conference on Learning Representations</source>, <publisher-loc>Vancouver, Canada</publisher-loc>. <uri>https://arxiv.org/abs/1803.02893</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_045">
<mixed-citation publication-type="journal"><string-name><surname>Lamsiyah</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Mahdaouy</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Espinasse</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Alaoui</surname>, <given-names>S.</given-names></string-name> (<year>2021</year>). <article-title>An unsupervised method for extractive multi-document summarization based on centroid approach and sentence embeddings</article-title>. <source>Expert Systems with Applications</source>, <volume>167</volume>(<issue>1</issue>), <fpage>1</fpage>–<lpage>16</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2020.114152" xlink:type="simple">https://doi.org/10.1016/j.eswa.2020.114152</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_046">
<mixed-citation publication-type="journal"><string-name><surname>Lastra</surname>, <given-names>J.J.</given-names></string-name>, <string-name><surname>Goikoetxea</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Hadj Taieb</surname>, <given-names>M.A.</given-names></string-name>, <string-name><surname>Garcia-Serrano</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Ben Aouicha</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Agirre</surname>, <given-names>E.</given-names></string-name> (<year>2019</year>). <article-title>A reproducible survey on word embeddings and ontology-based methods for word similarity: linear combinations outperform the state of the art</article-title>. <source>Engineering Applications of Artificial Intelligence</source>, <volume>85</volume>, <fpage>645</fpage>–<lpage>665</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.engappai.2019.07.010" xlink:type="simple">https://doi.org/10.1016/j.engappai.2019.07.010</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_047">
<mixed-citation publication-type="chapter"><string-name><surname>Lau</surname>, <given-names>J.H.</given-names></string-name>, <string-name><surname>Baldwin</surname>, <given-names>T.</given-names></string-name> (<year>2016</year>). <chapter-title>An empirical evaluation of Doc2Vec with practical insights into document embedding generation</chapter-title>. In: <source>Proceedings of the 1st Workshop on Representation Learning for NLP</source>. <publisher-name>Association for Computational Linguistics (ACL)</publisher-name>, <publisher-loc>Berlin, Germany</publisher-loc>, pp. <fpage>78</fpage>–<lpage>86</lpage>. <uri>https://www.aclweb.org/anthology/W16-1609</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_048">
<mixed-citation publication-type="chapter"><string-name><surname>Le</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>Mikolov</surname>, <given-names>T.</given-names></string-name> (<year>2014</year>). <chapter-title>Distributed representations of sentences and documents</chapter-title>. In: <source>Proceedings of the 31st International Conference on Machine Learning (ICML)</source>. <publisher-name>PMLR</publisher-name>, <publisher-loc>Beijing, China</publisher-loc>, pp. <fpage>1188</fpage>–<lpage>1196</lpage>. <uri>https://arxiv.org/abs/1405.4053</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_049">
<mixed-citation publication-type="journal"><string-name><surname>Lehmann</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Isele</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Jakob</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Jentzsch</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Kontokostas</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Mendes</surname>, <given-names>P.N.</given-names></string-name>, <string-name><surname>Bizer</surname>, <given-names>C.</given-names></string-name> (<year>2012</year>). <article-title>DBpedia: a large-scale, multilingual knowledge base extracted from Wikipedia</article-title>. <source>Semantic Web</source>, <volume>6</volume>(<issue>2</issue>), <fpage>167</fpage>–<lpage>195</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.3233/SW-140134" xlink:type="simple">https://doi.org/10.3233/SW-140134</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_050">
<mixed-citation publication-type="chapter"><string-name><surname>Liu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Gao</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>He</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Deng</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Duh</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>Y.Y.</given-names></string-name> (<year>2015</year>). <chapter-title>Representation learning using multi-task deep neural networks for semantic classification and information retrieval</chapter-title>. In: <source>Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>. <publisher-name>Association for Computational Linguistics (ACL)</publisher-name>, <publisher-loc>Denver, USA</publisher-loc>, pp. <fpage>912</fpage>–<lpage>921</lpage>. <ext-link ext-link-type="uri" xlink:href="https://www.aclweb.org/anthology/N15-1092">https://www.aclweb.org/anthology/N15-1092</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_051">
<mixed-citation publication-type="chapter"><string-name><surname>Logeswaran</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>H.</given-names></string-name> (<year>2018</year>). <chapter-title>An efficient framework for learning sentence representations</chapter-title>. In: <source>Proceedings of the 6th International Conference on Learning Representations</source>, <publisher-loc>Vancouver, Canada</publisher-loc>, pp. <fpage>1</fpage>–<lpage>16</lpage>. <uri>https://arxiv.org/abs/1803.02893v1</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_052">
<mixed-citation publication-type="chapter"><string-name><surname>Lynn</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Scannell</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Maguire</surname>, <given-names>E.</given-names></string-name> (<year>2015</year>). <chapter-title>Minority language twitter: part-of-speech tagging and analysis of Irish tweets</chapter-title>. In: <source>Proceedings of the Workshop on Noisy User-generated Text</source>, <publisher-loc>Beijing, China</publisher-loc>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.18653/v1/W15-4301" xlink:type="simple">https://doi.org/10.18653/v1/W15-4301</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_053">
<mixed-citation publication-type="journal"><string-name><surname>Ma</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Zhu</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name> (<year>2021</year>). <article-title>A knowledge graph entity disambiguation method based on entity-relationship embedding and graph structure embedding</article-title>. <source>Computational Intelligence and Neuroscience</source>, <volume>2021</volume>, <elocation-id>2878189</elocation-id>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1155/2021/2878189" xlink:type="simple">https://doi.org/10.1155/2021/2878189</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_054">
<mixed-citation publication-type="chapter"><string-name><surname>Mendes</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Jakob</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>García-Silva</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Bizer</surname>, <given-names>C.</given-names></string-name> (<year>2011</year>). <chapter-title>DBpedia Spotlight: shedding light on the web of documents</chapter-title>. In: <source>Proceedings of the 7th International Conference on Semantic Systems</source>, <publisher-loc>Graz, Austria</publisher-loc>, pp. <fpage>1</fpage>–<lpage>8</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1145/2063518.2063519" xlink:type="simple">https://doi.org/10.1145/2063518.2063519</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_055">
<mixed-citation publication-type="chapter"><string-name><surname>Mikolov</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Sutskever</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Corrado</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Dean</surname>, <given-names>J.</given-names></string-name> (<year>2013</year>a). <chapter-title>Distributed representations of words and phrases and their compositionality</chapter-title>. In: <source>Proceedings of the 27th Conference on Neural Information Processing Systems</source>, <publisher-loc>Lake Tahoe, USA</publisher-loc>, pp. <fpage>1</fpage>–<lpage>11</lpage>. <uri>https://arxiv.org/abs/1310.4546</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_056">
<mixed-citation publication-type="chapter"><string-name><surname>Mikolov</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Corrado</surname>, <given-names>G.S.</given-names></string-name>, <string-name><surname>Dean</surname>, <given-names>J.</given-names></string-name> (<year>2013</year>b). <chapter-title>Efficient estimation of word representations in vector space</chapter-title>. In: <source>Proceedings of the International Conference on Learning Representations</source>, <publisher-loc>Scottsdale, USA</publisher-loc>, pp. <fpage>1</fpage>–<lpage>12</lpage>. <uri>https://arxiv.org/abs/1301.3781v3</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_057">
<mixed-citation publication-type="journal"><string-name><surname>Mohd</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Jan</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Shah</surname>, <given-names>M.</given-names></string-name> (<year>2020</year>). <article-title>Text document summarization using word embeddings</article-title>. <source>Expert Systems with Applications</source>, <volume>143</volume>, <elocation-id>112958</elocation-id>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2019.112958" xlink:type="simple">https://doi.org/10.1016/j.eswa.2019.112958</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_058">
<mixed-citation publication-type="chapter"><string-name><surname>Nooralahzadeh</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Øvrelid</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Lønning</surname>, <given-names>J.T.</given-names></string-name> (<year>2018</year>). <chapter-title>Evaluation of domain-specific word embeddings using knowledge resources</chapter-title>. In: <source>Proceedings of the 11th International Conference on Language Resources and Evaluation</source>. <publisher-name>European Language Resources Association (ELRA)</publisher-name>, <publisher-loc>Miyazaki, Japan</publisher-loc>, pp. <fpage>1438</fpage>–<lpage>1445</lpage>. <ext-link ext-link-type="uri" xlink:href="https://www.aclweb.org/anthology/L18-1228" xlink:type="simple">https://www.aclweb.org/anthology/L18-1228</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_059">
<mixed-citation publication-type="journal"><string-name><surname>Oliveira</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Delgado</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Assaife</surname>, <given-names>A.C.</given-names></string-name> (<year>2017</year>). <article-title>A recommendation approach for consuming linked open data</article-title>. <source>Expert Systems with Applications</source>, <volume>72</volume>, <fpage>407</fpage>–<lpage>420</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2016.10.037" xlink:type="simple">https://doi.org/10.1016/j.eswa.2016.10.037</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_060">
<mixed-citation publication-type="journal"><string-name><surname>Park</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Cho</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Park</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Shin</surname>, <given-names>H.</given-names></string-name> (<year>2021</year>). <article-title>Customer sentiment analysis with more sensibility</article-title>. <source>Engineering Applications of Artificial Intelligence</source>, <volume>104</volume>, <elocation-id>104356</elocation-id>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.engappai.2021.104356" xlink:type="simple">https://doi.org/10.1016/j.engappai.2021.104356</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_061">
<mixed-citation publication-type="journal"><string-name><surname>Peeters</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Bizer</surname>, <given-names>C.</given-names></string-name> (<year>2021</year>). <article-title>Dual-objective fine-tuning of BERT for entity matching</article-title>. <source>Proceedings of the VLDB Endowment</source>, <volume>14</volume>(<issue>10</issue>), <fpage>1913</fpage>–<lpage>1921</lpage>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_062">
<mixed-citation publication-type="chapter"><string-name><surname>Peeters</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Primpeli</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Wichtlhuber</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Bizer</surname>, <given-names>C.</given-names></string-name> (<year>2020</year>). <chapter-title>Using schema.org annotations for training and maintaining product matchers</chapter-title>. In: <source>Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics</source>, <publisher-loc>Biarritz, France</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_063">
<mixed-citation publication-type="chapter"><string-name><surname>Pellissier Tanon</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Weikum</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Suchanek</surname>, <given-names>F.</given-names></string-name> (<year>2020</year>). <chapter-title>YAGO 4: a reasonable knowledge base</chapter-title>. In: <source>Proceedings of the 17th International Conference on the Semantic Web (ESWC)</source>, Vol. <volume>12123</volume>. <publisher-name>Springer</publisher-name>, <publisher-loc>Crete, Greece</publisher-loc>, pp. <fpage>583</fpage>–<lpage>596</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/978-3-030-49461-2_34" xlink:type="simple">https://doi.org/10.1007/978-3-030-49461-2_34</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_064">
<mixed-citation publication-type="chapter"><string-name><surname>Pennington</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Socher</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Manning</surname>, <given-names>C.D.</given-names></string-name> (<year>2014</year>). <chapter-title>GloVe: global vectors for word representation</chapter-title>. In: <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>, <publisher-loc>Doha, Qatar</publisher-loc>, pp. <fpage>1532</fpage>–<lpage>1543</lpage>. <ext-link ext-link-type="uri" xlink:href="http://www.aclweb.org/anthology/D14-1162" xlink:type="simple">http://www.aclweb.org/anthology/D14-1162</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_065">
<mixed-citation publication-type="chapter"><string-name><surname>Peters</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Neumann</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Iyyer</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Gardner</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Clark</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Zettlemoyer</surname>, <given-names>L.</given-names></string-name> (<year>2018</year>). <chapter-title>Deep contextualized word representations</chapter-title>. In: <source>Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>. <publisher-name>Association for Computational Linguistics (ACL)</publisher-name>, <publisher-loc>New Orleans, USA</publisher-loc>, pp. <fpage>2227</fpage>–<lpage>2237</lpage>. <uri>https://arxiv.org/abs/1802.05365v2</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_066">
<mixed-citation publication-type="chapter"><string-name><surname>Phan</surname>, <given-names>M.C.</given-names></string-name>, <string-name><surname>Sun</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Tay</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Han</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>C.</given-names></string-name> (<year>2017</year>). <chapter-title>NeuPL: attention-based semantic matching and pair-linking for entity disambiguation</chapter-title>. In: <source>Proceedings of the 2017 ACM Conference on Information and Knowledge Management</source>, <publisher-loc>Singapore</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_067">
<mixed-citation publication-type="chapter"><string-name><surname>Pilehvar</surname>, <given-names>M.T.</given-names></string-name>, <string-name><surname>Collier</surname>, <given-names>N.</given-names></string-name> (<year>2016</year>). <chapter-title>Improved semantic representation for domain-specific entities</chapter-title>. In: <source>Proceedings of the 15th Workshop on Biomedical Natural Language Processing</source>. <publisher-name>Association for Computational Linguistics (ACL)</publisher-name>, <publisher-loc>Berlin, Germany</publisher-loc>, pp. <fpage>12</fpage>–<lpage>16</lpage>. <ext-link ext-link-type="uri" xlink:href="https://aclanthology.org/W16-2902/" xlink:type="simple">https://aclanthology.org/W16-2902/</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_068">
<mixed-citation publication-type="chapter"><string-name><surname>Primpeli</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Peeters</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Bizer</surname>, <given-names>C.</given-names></string-name> (<year>2019</year>). <chapter-title>The WDC training dataset and gold standard for large-scale product matching</chapter-title>. In: <source>Proceedings of the 2019 World Wide Web Conference</source>, <publisher-loc>San Francisco, USA</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_069">
<mixed-citation publication-type="other"><string-name><surname>Radford</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Child</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Luan</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Amodei</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Sutskever</surname>, <given-names>I.</given-names></string-name> (2019). Language models are unsupervised multitask learners. <italic>OpenAI Technical Report</italic>, 1–24. <uri>https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_070">
<mixed-citation publication-type="journal"><string-name><surname>Rani</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Lobiyal</surname>, <given-names>D.K.</given-names></string-name> (<year>2022</year>). <article-title>Document vector embedding based extractive text summarization system for Hindi and English text</article-title>. <source>Applied Intelligence</source>, <volume>52</volume>, <fpage>9353</fpage>–<lpage>9372</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s10489-021-02871-9" xlink:type="simple">https://doi.org/10.1007/s10489-021-02871-9</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_071">
<mixed-citation publication-type="chapter"><string-name><surname>Řehůřek</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Sojka</surname>, <given-names>P.</given-names></string-name> (<year>2010</year>). <chapter-title>Software framework for topic modelling with large corpora</chapter-title>. In: <source>Proceedings of the LREC Workshop on New Challenges for NLP Frameworks</source>. <publisher-name>ELRA</publisher-name>, <publisher-loc>Valletta, Malta</publisher-loc>, pp. <fpage>45</fpage>–<lpage>50</lpage>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_072">
<mixed-citation publication-type="chapter"><string-name><surname>Reimers</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Gurevych</surname>, <given-names>I.</given-names></string-name> (<year>2019</year>). <chapter-title>Sentence-BERT: sentence embeddings using siamese BERT-networks</chapter-title>. In: <source>Proceedings of Conference on Empirical Methods in Natural Language Processing</source>. <publisher-name>Association for Computational Linguistics (ACL)</publisher-name>, <publisher-loc>Hong Kong, China</publisher-loc>, pp. <fpage>3982</fpage>–<lpage>3992</lpage>. <uri>https://arxiv.org/abs/1908.10084</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_073">
<mixed-citation publication-type="journal"><string-name><surname>Rezaeinia</surname>, <given-names>S.M.</given-names></string-name>, <string-name><surname>Rahmani</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Ghodsi</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Veisi</surname>, <given-names>H.</given-names></string-name> (<year>2019</year>). <article-title>Sentiment analysis based on improved pre-trained word embeddings</article-title>. <source>Expert Systems with Applications</source>, <volume>117</volume>, <fpage>139</fpage>–<lpage>147</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2018.08.044" xlink:type="simple">https://doi.org/10.1016/j.eswa.2018.08.044</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_074">
<mixed-citation publication-type="journal"><string-name><surname>Silva</surname>, <given-names>R.M.</given-names></string-name>, <string-name><surname>Santos</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Almeida</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Pardo</surname>, <given-names>T.</given-names></string-name> (<year>2020</year>). <article-title>Towards automatically filtering fake news in Portuguese</article-title>. <source>Expert Systems with Applications</source>, <volume>146</volume>, <elocation-id>113199</elocation-id>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2020.113199" xlink:type="simple">https://doi.org/10.1016/j.eswa.2020.113199</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_075">
<mixed-citation publication-type="chapter"><string-name><surname>Smith</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Plamada</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Koehn</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Callison-Burch</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Lopez</surname>, <given-names>A.</given-names></string-name> (<year>2013</year>). <chapter-title>Dirt cheap web-scale parallel text from the Common Crawl</chapter-title>. In: <source>Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics</source>, <publisher-loc>Sofia, Bulgaria</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_076">
<mixed-citation publication-type="journal"><string-name><surname>Song</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Min</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Da-Xiong</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Feng</surname>, <given-names>W.Z.</given-names></string-name>, <string-name><surname>Shu</surname>, <given-names>C.</given-names></string-name> (<year>2019</year>). <article-title>Research on text error detection and repair method based on online learning community</article-title>. <source>Procedia Computer Science</source>, <volume>154</volume>, <fpage>13</fpage>–<lpage>19</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.procs.2019.06.004" xlink:type="simple">https://doi.org/10.1016/j.procs.2019.06.004</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_077">
<mixed-citation publication-type="journal"><string-name><surname>Sunitha</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>BalRaju</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Sasikiran</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Venkat Ramana</surname>, <given-names>E.</given-names></string-name> (<year>2014</year>). <article-title>Automatic outlier identification in data mining using IQR in real-time data</article-title>. <source>International Journal of Advanced Research in Computer and Communication Engineering</source>, <volume>3</volume>(<issue>6</issue>), <fpage>1</fpage>–<lpage>10</lpage>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_078">
<mixed-citation publication-type="journal"><string-name><surname>Symseridou</surname>, <given-names>E.</given-names></string-name> (<year>2018</year>). <article-title>The web as a corpus and for building corpora in the teaching of specialised translation</article-title>. <source>FITISPOS International Journal</source>, <volume>5</volume>(<issue>1</issue>), <fpage>60</fpage>–<lpage>82</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.37536/FITISPos-IJ.2018.5.1.160" xlink:type="simple">https://doi.org/10.37536/FITISPos-IJ.2018.5.1.160</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_079">
<mixed-citation publication-type="book"><string-name><surname>Tukey</surname>, <given-names>J.W.</given-names></string-name> (<year>1977</year>). <source>Exploratory Data Analysis</source>. <publisher-name>Addison-Wesley</publisher-name>, <publisher-loc>Reading, MA</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_080">
<mixed-citation publication-type="chapter"><string-name><surname>Turian</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Ratinov</surname>, <given-names>L.A.</given-names></string-name>, <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name> (<year>2010</year>). <chapter-title>Word representations: a simple and general method for semi-supervised learning</chapter-title>. In: <source>Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics</source>. <publisher-name>Association for Computational Linguistics (ACL)</publisher-name>, <publisher-loc>Uppsala, Sweden</publisher-loc>, pp. <fpage>384</fpage>–<lpage>394</lpage>. <uri>https://www.aclweb.org/anthology/P10-1040</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_081">
<mixed-citation publication-type="journal"><string-name><surname>Valcarce</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Landin</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Parapar</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Barreiro</surname>, <given-names>A.</given-names></string-name> (<year>2019</year>). <article-title>Collaborative filtering embeddings for memory-based recommender systems</article-title>. <source>Engineering Applications of Artificial Intelligence</source>, <volume>85</volume>, <fpage>347</fpage>–<lpage>356</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.engappai.2019.06.020" xlink:type="simple">https://doi.org/10.1016/j.engappai.2019.06.020</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_082">
<mixed-citation publication-type="journal"><string-name><surname>Vrandečić</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Krötzsch</surname>, <given-names>M.</given-names></string-name> (<year>2014</year>). <article-title>Wikidata: a free collaborative knowledgebase</article-title>. <source>Communications of the ACM</source>, <volume>57</volume>(<issue>10</issue>), <fpage>78</fpage>–<lpage>85</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1145/2629489" xlink:type="simple">https://doi.org/10.1145/2629489</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_083">
<mixed-citation publication-type="other"><string-name><surname>Wang</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Kuo</surname>, <given-names>C.-C.J.</given-names></string-name> (2020). SBERT-WK: a sentence embedding method by dissecting BERT-based word models. <uri>https://arxiv.org/abs/2002.06652</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_084">
<mixed-citation publication-type="chapter"><string-name><surname>Xu</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Shen</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Gao</surname>, <given-names>J.</given-names></string-name> (<year>2019</year>). <chapter-title>Multi-task learning with sample re-weighting for machine reading comprehension</chapter-title>. In: <source>Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>. <publisher-name>Association for Computational Linguistics (ACL)</publisher-name>, <publisher-loc>Minneapolis, USA</publisher-loc>, pp. <fpage>2644</fpage>–<lpage>2655</lpage>. <uri>https://arxiv.org/abs/1809.06963v3</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_085">
<mixed-citation publication-type="journal"><string-name><surname>Yoo</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Jeong</surname>, <given-names>O.</given-names></string-name> (<year>2020</year>). <article-title>Automating the expansion of a knowledge graph</article-title>. <source>Expert Systems with Applications</source>, <volume>141</volume>, <elocation-id>112965</elocation-id>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2019.112965" xlink:type="simple">https://doi.org/10.1016/j.eswa.2019.112965</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_086">
<mixed-citation publication-type="chapter"><string-name><surname>Zanzotto</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Pennacchiotti</surname>, <given-names>M.</given-names></string-name> (<year>2010</year>). <chapter-title>Expanding textual entailment corpora from Wikipedia using co-training</chapter-title>. In: <source>Proceedings of the 2nd Workshop on Collaboratively Constructed Semantic Resources</source>, <publisher-loc>Beijing, China</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_087">
<mixed-citation publication-type="other"><string-name><surname>Zhang</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Yuan</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Zhuang</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Xiong</surname>, <given-names>H.</given-names></string-name> (2021). E-BERT: adapting BERT to e-commerce with adaptive hybrid masking and neighbor product reconstruction. <uri>https://arxiv.org/pdf/2009.02835</uri>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_088">
<mixed-citation publication-type="journal"><string-name><surname>Zhang</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Song</surname>, <given-names>X.</given-names></string-name> (<year>2022</year>). <article-title>An exploratory study on utilising the web of linked data for product data mining</article-title>. <source>SN Computer Science</source>, <volume>4</volume>(<issue>15</issue>). <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s42979-022-01415-3" xlink:type="simple">https://doi.org/10.1007/s42979-022-01415-3</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_089">
<mixed-citation publication-type="journal"><string-name><surname>Zhou</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Hu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Chung</surname>, <given-names>V.</given-names></string-name> (<year>2022</year>). <article-title>Automatic construction of fine-grained paraphrase corpora system using language inference model</article-title>. <source>Applied Sciences</source>, <volume>12</volume>(<issue>1</issue>), <elocation-id>499</elocation-id>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.3390/app12010499" xlink:type="simple">https://doi.org/10.3390/app12010499</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor527_ref_090">
<mixed-citation publication-type="chapter"><string-name><surname>Zwicklbauer</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Seifert</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Granitzer</surname>, <given-names>M.</given-names></string-name> (<year>2016</year>). <chapter-title>DoSeR: a knowledge-base-agnostic framework for entity disambiguation using semantic embeddings</chapter-title>. In: <source>Proceedings of the 13th International Conference on Semantic Web Latest Advances and New Domains</source>, <publisher-loc>Heraklion, Crete</publisher-loc>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
