<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">INFORMATICA</journal-id>
<journal-title-group><journal-title>Informatica</journal-title></journal-title-group>
<issn pub-type="epub">1822-8844</issn><issn pub-type="ppub">0868-4952</issn><issn-l>0868-4952</issn-l>
<publisher>
<publisher-name>Vilnius University</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">INFOR517</article-id>
<article-id pub-id-type="doi">10.15388/23-INFOR517</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Research Article</subject></subj-group></article-categories>
<title-group>
<article-title>Deriving Homogeneous Subsets from Gene Sets by Exploiting the Gene Ontology</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Stier</surname><given-names>Quirin</given-names></name><xref ref-type="aff" rid="j_infor517_aff_001"/><bio>
<p><bold>Q. Stier</bold> received his bachelor’s degree in mathematics at the University of Erlangen in 2017 and his master’s degree in data science at the University of Marburg in 2021. His master’s thesis investigated time series forecasting using wavelet analysis, comparing it to popular current state-of-the-art methods. Currently, he is pursuing a PhD in artificial intelligence focusing on interpretable techniques applicable for human-in-the-loop processes at the University of Marburg.</p></bio>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-9542-5543</contrib-id>
<name><surname>Thrun</surname><given-names>Michael C.</given-names></name><email xlink:href="mthrun@informatik.uni-marburg.de">mthrun@informatik.uni-marburg.de</email><xref ref-type="aff" rid="j_infor517_aff_001"/><xref ref-type="corresp" rid="cor1">∗</xref><bio>
<p><bold>Priv.-Doz. Dr. habil. M.C. Thrun</bold> received his diploma in physics (2014) and his doctorate in data science (2017) at the Philipps-University Marburg under the chair of Databionics Prof. Dr. habil. Alfred H.G. Ultsch. Afterwards, he worked for almost two years as a Big Data Scientist for an international manufacturer. He is the author of the book “Projection-Based Clustering through Self-Organization and Swarm Intelligence”. His team specializes in explainable artificial intelligence, predicting time series and knowledge discovery using methods borrowed from nature. Additionally, they are researching the topic of recognizing and explaining diseases. In 2022, he received his habilitation in informatics at the Philipps-University Marburg with a thesis about explainable artificial intelligence and a colloquium about reinforcement learning in praxis. Currently, Thrun holds a position for lecturing on databionic methods of artificial intelligence, time series analysis and knowledge discovery in the Data Science program at the Philipps University of Marburg.</p></bio>
</contrib>
<aff id="j_infor517_aff_001">Faculty of Mathematics and Computer Science, <institution>University of Marburg</institution>, <country>Germany</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2023</year></pub-date><pub-date pub-type="epub"><day>22</day><month>5</month><year>2023</year></pub-date><volume>34</volume><issue>2</issue><fpage>357</fpage><lpage>386</lpage><history><date date-type="received"><month>6</month><year>2022</year></date><date date-type="accepted"><month>5</month><year>2023</year></date></history>
<permissions><copyright-statement>© 2023 Vilnius University</copyright-statement><copyright-year>2023</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>The Gene Ontology (GO) knowledge base provides a standardized vocabulary of GO terms for describing gene functions and attributes. It consists of three directed acyclic graphs which represent the hierarchical structure of relationships between GO terms. GO terms enable the organization of genes based on their functional attributes by annotating genes to specific GO terms. We propose an information-retrieval derived distance between genes by using their annotations. Four gene sets with causal associations were examined by employing our proposed methodology. As a result, the discovered homogeneous subsets of these gene sets are semantically related, in contrast to comparable works. The relevance of the found clusters can be described with the help of ChatGPT by asking for their biological meaning. The R package BIDistances, readily available on CRAN, empowers researchers to effortlessly calculate the distance for any given gene set.</p>
</abstract>
<kwd-group>
<label>Key words</label>
<kwd>gene ontology</kwd>
<kwd>gene analysis</kwd>
<kwd>cluster analysis</kwd>
<kwd>knowledge base</kwd>
<kwd>ChatGPT</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="j_infor517_s_001">
<label>1</label>
<title>Introduction</title>
<p>The analysis of gene expression profiles in biological materials has become routine in molecular biomedical research, including drug development. Sets of genes emerge from various sources, including microarray analyses of Taub <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor517_ref_048">1983</xref>), next-generation sequencing analysis of Mardis (<xref ref-type="bibr" rid="j_infor517_ref_032">2008</xref>), or topical searches in databases (Lötsch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_030">2013</xref>; Ultsch and Lötsch, <xref ref-type="bibr" rid="j_infor517_ref_066">2014</xref>). In addition, their functional interpretation is an active research topic in biomedical informatics (Tarca <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_046">2013</xref>). Working solutions focus on computational analyses of molecular interaction networks (Alm and Arkin, <xref ref-type="bibr" rid="j_infor517_ref_003">2003</xref>; Barabási and Oltvai, <xref ref-type="bibr" rid="j_infor517_ref_006">2004</xref>) or enrichment analyses identifying over-represented functional knowledge base derived categories annotated to a particular gene set in comparison to a random gene set (Subramanian <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_044">2005</xref>).</p>
<p>Available methods mainly aim to provide functional descriptions of single gene sets. However, only limited similarity analyses are performed on gene sets and these are often restricted to the comparative interpretation of parallel analyses or assessments of gene set intersections. Methods assessing gene similarity remain mainly on the single gene level, such as Resnik (<xref ref-type="bibr" rid="j_infor517_ref_039">1999</xref>) similarity. We propose to utilize a swarm intelligence-based method for identifying homogeneous structures within gene sets, which enables functional interpretations and is also amenable to comparative analyses. The method uses a distance measure which is proposed in this work and motivated by information retrieval (van Rijsbergen, <xref ref-type="bibr" rid="j_infor517_ref_069">1979</xref>). This distance measure is immediately usable for functional comparisons between (sub)sets of genes to establish groups within a larger set that share biological functions or find functionally similar sets of genes from other sources.</p>
<p>The present work pursued the hypothesis that sets of genes may be grouped, exploiting a knowledge base on a computational functional genomics basis by applying projection-based cluster analysis (Thrun and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_060">2020b</xref>). For data consisting of gene sets, the concept of how cluster structures should be defined is unknown. Thus, conventional projection or clustering algorithms are unsuitable because global criteria predefine the structures they seek (Ultsch and Lötsch, <xref ref-type="bibr" rid="j_infor517_ref_067">2017</xref>; Thrun, <xref ref-type="bibr" rid="j_infor517_ref_049">2018</xref>). If a global criterion is given, it follows that an implicit definition of the structures in data exists, and the bias is the difference between this definition and the existing structures (Thrun, <xref ref-type="bibr" rid="j_infor517_ref_051">2021a</xref>). For example, the global criterion of Partitioning Around Medoids (PAM) (Kaufman and Rousseeuw, <xref ref-type="bibr" rid="j_infor517_ref_022">1990</xref>) is used in the gene clustering approach of Acharya <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor517_ref_001">2017</xref>).</p>
<p>Linear projection methods are not able to detect nonlinear entangled structures (Thrun and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_060">2020b</xref>). In particular, Principal Component Analysis maximizes variance, which in cluster analysis benchmarks (Thrun and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_059">2020a</xref>) and applications (López-García <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_028">2020</xref>) tended to be rather disadvantageous. Instead, we chose a projection-based clustering method called Databionic Swarm (DBS) which, after extensive benchmarking, was shown to be able to simultaneously detect more complex structures in data (Thrun, <xref ref-type="bibr" rid="j_infor517_ref_051">2021a</xref>) and verify if structures in data exist at all (Thrun and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_059">2020a</xref>, <xref ref-type="bibr" rid="j_infor517_ref_060">2020b</xref>). In principle, the choice of underlying focusing projection method is interchangeable as long as it tries to preserve neighbourhoods non-linearly and allows a distance matrix as an input. As a negative example, multidimensional scaling has an objective function that tries to preserve all distance relations (Shepard, <xref ref-type="bibr" rid="j_infor517_ref_042">1980</xref>), which is not advisable for this task. Therefore, DBS is selected here as the projection method because, instead of using a global criterion, DBS exploits self-organization and emergence (Thrun and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_061">2021</xref>). Consequently, DBS can find homogeneous structures in data of any shape instead of being restricted to specific structures in data (Thrun and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_061">2021</xref>).</p>
<p>Since different clustering techniques may discover very different structures in data or no structure at all (Thrun, <xref ref-type="bibr" rid="j_infor517_ref_051">2021a</xref>; Lötsch and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_029">2020</xref>), we are using two techniques to verify the structures found with our method. The structures we are looking for are based on the distance measure and define natural clusters (Duda <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_013">2000</xref>). Natural clusters are defined by groups of datapoints, which possess small distances among datapoints from their own group (intracluster distance) and large distances to datapoints of other groups (intercluster distance).</p>
<p>The following two approaches visualize the datapoints based on the similarity measure defined here and can indicate natural clusters. First, heat maps are used to visualize the high-dimensional distances (Wilkinson and Friendly, <xref ref-type="bibr" rid="j_infor517_ref_071">2009</xref>). By grouping the variables in the heat map according to the clustering, structures can be visually validated, since the colour of the matrix blocks on the diagonal with size corresponding to the cluster sizes should be clearly separable from the neighbouring blocks. Second, topographic maps based on the U-matrix approach visualize the similarity between datapoints (Thrun and Lerch, <xref ref-type="bibr" rid="j_infor517_ref_056">2016</xref>). In brief, a topographic map forms a Voronoi cell around each projected datapoint. Neighbouring Voronoi cells are connected, resulting in a Delaunay graph (Toussaint, <xref ref-type="bibr" rid="j_infor517_ref_065">1980</xref>). This Delaunay graph can be weighted with the input distances. A dendrogram can be derived from the Delaunay graph and can be used for clustering either by visual means or with a previously known number of clusters (Thrun and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_061">2021</xref>). The U-matrix approach makes it possible to add a third dimension to the two-dimensional projection. By using the distances derived from the Delaunay graph weighted with the input distances, the neighbourhood of each datapoint can be evaluated as more or less similar. Such a similarity evaluation can be represented by a landscape with a colour transition analogous to geographic maps. Datapoints which have low distances to their neighbours contribute low values to the height of the landscape around them, resulting in valleys, whereas datapoints with high distances to their neighbours contribute high values to the landscape height and thus result in mountainous areas. 
Clearly distinguishable clusters thus result in two neighbouring valleys with a clear mountain wall separating the clusters. Since the definition of natural clusters requires structures based on distance, we further investigate the distribution of the distances with a focus on intra- and intercluster distances.</p>
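<p>The distinction between intra- and intercluster distances can be illustrated with a minimal sketch. It assumes a symmetric distance matrix and one cluster label per datapoint; the function name <monospace>split_distances</monospace> and the toy data are hypothetical and not part of the published method or of the BIDistances package.</p>
<preformat>
```python
from itertools import combinations
from statistics import mean


def split_distances(dist, labels):
    """Split pairwise distances into intracluster and intercluster lists.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- cluster label of each datapoint, in matrix order
    """
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            intra.append(dist[i][j])   # both points in the same cluster
        else:
            inter.append(dist[i][j])   # points in different clusters
    return intra, inter


# Hypothetical toy data: four points on a line, two per cluster.
points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
dist = [[abs(a - b) for b in points] for a in points]

intra, inter = split_distances(dist, labels)
print("mean intracluster distance:", mean(intra))
print("mean intercluster distance:", mean(inter))
```
</preformat>
<p>For natural clusters, the intracluster distances would be expected to be small relative to the intercluster distances, so comparing the two distributions gives a simple check on a proposed clustering.</p>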
<p>To address this hypothesis, a feature matrix accessible to functional clustering is created by assigning each gene its functional annotations in the GO database (Ashburner <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_004">2000</xref>). The feature matrix rows comprise the genes in the set, and the columns are defined by the GO terms to which the genes are annotated. Each gene can be annotated to multiple GO terms. Each GO term can belong to one of the three named ontologies. An annotation of a gene to a GO term is a statement about the function of that particular gene. Each annotation includes an evidence code to indicate how the annotation to a particular term is supported (see <uri>http://geneontology.org/docs/guide-go-evidence-codes/</uri>). Each element of the matrix counts the occurrence, i.e. the number of times a specific gene is annotated to a specific GO term depending on the various possible evidence codes. Term frequency-inverse document frequency (tf-idf) statistics are calculated based on the feature matrix. The tf-idf is an information retrieval technique serving to rank the relevance of the terms (Rajaraman and Ullman, <xref ref-type="bibr" rid="j_infor517_ref_038">2011</xref>) here associated with the genes. The absolute distance between the tf-idf values of each pair of genes is here defined as the distance matrix and then used in unsupervised machine learning, implemented as the swarm intelligence of the DBS (Thrun and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_061">2021</xref>). DBS identifies semantically related genes within a gene set by grouping them based on biological knowledge contained in the GO. The resulting functional and homogeneous structures in sets of genes are visualized using the topographic map of the U-matrix (Thrun and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_059">2020a</xref>). 
The analysis is performed on gene sets causally associated with pain and the chronification of pain (Ultsch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_068">2016</xref>), hearing loss (GeneTestingRegistry, <xref ref-type="bibr" rid="j_infor517_ref_015">2018</xref>), cancer (Sondka <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_043">2018</xref>), and drug addiction (Li <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_025">2008</xref>), showing distinctive and homogeneous knowledge-based structures.</p>
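<p>The construction of the feature matrix and of the distance matrix described above can be sketched as follows. This is a minimal illustration under stated assumptions: the gene names, term names and scores are hypothetical placeholders, the functions <monospace>feature_matrix</monospace> and <monospace>absolute_distances</monospace> are not taken from the BIDistances package, and the per-gene scores merely stand in for the tf-idf values.</p>
<preformat>
```python
from collections import Counter

# Hypothetical annotations: (gene, GO term, evidence code) triples.
annotations = [
    ("geneA", "term1", "IDA"),
    ("geneA", "term1", "IEA"),
    ("geneA", "term2", "IDA"),
    ("geneB", "term2", "IMP"),
]


def feature_matrix(annotations):
    """Count how often each gene is annotated to each GO term."""
    counts = Counter((gene, term) for gene, term, _evidence in annotations)
    genes = sorted({gene for gene, _ in counts})
    terms = sorted({term for _, term in counts})
    # Rows are genes, columns are GO terms; missing pairs count as zero.
    matrix = [[counts[(g, t)] for t in terms] for g in genes]
    return genes, terms, matrix


def absolute_distances(scores):
    """Pairwise absolute differences between one score per gene."""
    return [[abs(a - b) for b in scores] for a in scores]


genes, terms, matrix = feature_matrix(annotations)
print(genes, terms, matrix)

# Placeholder per-gene scores standing in for the tf-idf values.
print(absolute_distances([0.5, 2.0]))
```
</preformat>
<p>Because the distance is an absolute difference of one value per gene, the resulting matrix is symmetric with a zero diagonal, which is the form of input a distance-based projection method requires.</p>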
<p>This work is an extended version of the manuscript initially presented at World’CIST 2022 (Thrun, <xref ref-type="bibr" rid="j_infor517_ref_055">2022c</xref>). The structure of the paper is as follows. After a related work section, the methodology is introduced in Section <xref rid="j_infor517_s_003">3</xref>. The first part of the results section evaluates the proposed distance measure and shows that all four gene sets are expected to have cluster structures if this distance measure is used. In the second part of the results section, the structure analysis and clustering are presented and evaluated. Finally, a discussion of the results and a conclusion follow.</p>
</sec>
<sec id="j_infor517_s_002">
<label>2</label>
<title>Related Works</title>
<p>Lippmann states that methods for selecting a subset of genes can be divided into six categories based on the underlying models: Filter, Wrapper, Hybrid, Embedded, Ensemble, and Integrated (Lippmann, <xref ref-type="bibr" rid="j_infor517_ref_026">2020</xref>; Saeys <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_041">2007</xref>; Grasnick <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_016">2018</xref>). Filtering methods select genes based only on the intrinsic properties of the data using a search procedure (Lippmann, <xref ref-type="bibr" rid="j_infor517_ref_026">2020</xref>). Wrappers first apply a search procedure to generate different subsets of the total set and then apply a learning algorithm to all the subsets found, using their performance as a quality criterion and selecting the optimal subset of genes (Lippmann, <xref ref-type="bibr" rid="j_infor517_ref_026">2020</xref>). For example, Tang <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor517_ref_045">2007</xref>) apply clustering approaches to microarray gene expression data and afterwards create a connection to a gene annotation to obtain meaningful results. As a machine learning method applied to gene expression data, it is categorized as a wrapper model. Tasoulis <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor517_ref_047">2006</xref>) deploy an evolutionary algorithm to detect gene subsets depending on a neural network’s classification performance, which also makes it a wrapper model.</p>
<p>Hybrid gene selection methods are combinations of Wrapper and Filter methods that attempt to exploit the good properties of both methods (Jović <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_021">2015</xref>). In embedded gene selection, the optimal subset of genes is already selected during the execution of the learning algorithm (Hira and Gillies, <xref ref-type="bibr" rid="j_infor517_ref_018">2015</xref>; Jović <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_021">2015</xref>), making embedded methods, like wrappers, highly dependent on the learning algorithm and not directly transferable to other gene selection problems (Lippmann, <xref ref-type="bibr" rid="j_infor517_ref_026">2020</xref>). Instead of a single gene selection method, ensemble methods use multiple gene selection methods and result in the subset that produces the best results in most methods (Lippmann, <xref ref-type="bibr" rid="j_infor517_ref_026">2020</xref>). Integrative gene selection uses domain knowledge from external knowledge bases, such as KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways, to select genes (Lippmann, <xref ref-type="bibr" rid="j_infor517_ref_026">2020</xref>). For example, Jin and Lu (<xref ref-type="bibr" rid="j_infor517_ref_019">2010</xref>) use differences in the word usage profile of GO terms, which allows subsets of genes to be identified based on information bottleneck methods. As a method applied to data derived from properties of GO terms, which are part of a knowledge base, it can be categorized as an integrated model. In another example, the method of Wolting <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor517_ref_072">2006</xref>) measures graph similarity between GO annotations to cluster subsets of proteins and thus can also be grouped as an integrated model.</p>
<p>Typical methods within the six categories use expression data of the genes to evaluate the genes of subsets quantitatively (Lippmann, <xref ref-type="bibr" rid="j_infor517_ref_026">2020</xref>). However, if no expression data are available for a gene set, these methods cannot be used (Lippmann, <xref ref-type="bibr" rid="j_infor517_ref_026">2020</xref>).</p>
<p>The most similar approach, by Acharya <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor517_ref_001">2017</xref>), describes a procedure for which no expression data are required for the selection of genes. An overrepresentation analysis is performed for a given set of genes. The overrepresentation analysis is a statistical approach to estimate how likely a GO term is observed in the given gene set in contrast to pure chance (Backes <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_005">2007</xref>; Lippmann, <xref ref-type="bibr" rid="j_infor517_ref_026">2020</xref>). Thus, the overrepresentation analysis retrieves GO terms in which genes of a given set are annotated significantly more or less than expected. For each significant GO term, the information content is calculated based on the position of the GO term and the number of its descendants in the DAG resulting from the overrepresentation analysis. A matrix describing the GO annotations is created. Its entries are zero unless a gene in the respective row is annotated to the GO term of the respective column. If the gene is annotated to the GO term, the matrix contains the structural information content of the corresponding GO term. After computing the Euclidean distances (or, alternatively, the Manhattan or cosine distances) based on this matrix, the genes are clustered with the PAM algorithm (Acharya <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_001">2017</xref>). A silhouette plot proposed by Rousseeuw (<xref ref-type="bibr" rid="j_infor517_ref_040">1987</xref>) is used to determine the optimal number of clusters. In Acharya <italic>et al.</italic>, the subset selection by cluster analysis is evaluated using the average Silhouette index, Dunn (<xref ref-type="bibr" rid="j_infor517_ref_014">1974</xref>) index, and Davies-Bouldin index (Davies and Bouldin, <xref ref-type="bibr" rid="j_infor517_ref_012">1979</xref>). 
The approach from Wolting <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor517_ref_072">2006</xref>) uses all three categories of the GO, namely biological process, cellular component, and molecular function, which differs from the methodology applied in this work.</p>
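<p>The overrepresentation analysis mentioned above can be illustrated with a small sketch. The cited works do not state here which test they use; a hypergeometric test is assumed as one common choice, and only the case of over-representation (not under-representation) is covered. The function name <monospace>hypergeom_upper_tail</monospace> and all numbers are hypothetical.</p>
<preformat>
```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """Probability of seeing at least k annotated genes by chance.

    N -- number of genes in the background
    K -- background genes annotated to the GO term
    n -- number of genes in the gene set
    k -- gene-set genes annotated to the GO term
    """
    total = comb(N, n)
    upper = min(n, K)
    # Sum the hypergeometric probabilities for k, k + 1, ..., upper.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical numbers: 20 background genes, 5 carry the term,
# and 3 of the 4 genes in the set carry it as well.
p_value = hypergeom_upper_tail(k=3, n=4, K=5, N=20)
print(p_value)
```
</preformat>
<p>A small probability suggests that the GO term occurs in the gene set more often than expected by chance; in practice, such values would additionally be corrected for multiple testing across all GO terms.</p>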
</sec>
<sec id="j_infor517_s_003" sec-type="methods">
<label>3</label>
<title>Methods</title>
<p>Prior work did not focus on the semantic value of the GO to identify gene subsets. Furthermore, no open-source code was available to apply these methods out of the box. Therefore, we provide the R package BIDistances, available on CRAN, to obtain the results with our proposed method (<uri>https://CRAN.R-project.org/package=BIDistances</uri>).</p>
<p>The identification of gene subsets which are semantically related is explained in the following. The feature vector associated with each gene was obtained from the biological functions in which the individual gene product is involved. They were queried for each gene from the GO knowledge base (<uri>http://www.geneontology.org/</uri>) (Ashburner <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_004">2000</xref>). The GO can be accessed, for example, with R via the Bioconductor package <uri>https://bioconductor.org/packages/GO.db/</uri> and can be searched for “biological processes”, “cellular components” and “molecular functions”. The GO is a knowledge base consisting of GO terms, which captures the knowledge about the functions of genes using a controlled and clearly defined vocabulary of GO terms annotated to specific genes (Camon <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_010">2003</xref>, <xref ref-type="bibr" rid="j_infor517_ref_011">2004</xref>). A set of genes <italic>G</italic> consists of multiple gene IDs <inline-formula id="j_infor517_ineq_001"><alternatives><mml:math>
<mml:mi mathvariant="italic">g</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">G</mml:mi></mml:math><tex-math><![CDATA[$g\in G$]]></tex-math></alternatives></inline-formula>. In practice, a gene set can be a set of sequence identifiers, i.e. a simple series of digits assigned consecutively to each sequence record processed by the National Center for Biotechnology Information (NCBI) for a specific use case provided by an expert. Each gene can be annotated to multiple GO terms. Furthermore, each GO term can belong to one of the three named ontologies. Here, all ontologies were used to retrieve GO terms. Let <italic>G</italic> be a set of genes, then we assume that gene subsets <inline-formula id="j_infor517_ineq_002"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${G_{i}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor517_ineq_003"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">⊂</mml:mo>
<mml:mi mathvariant="italic">G</mml:mi></mml:math><tex-math><![CDATA[${G_{j}}\subset G$]]></tex-math></alternatives></inline-formula> are disjoint, denoted by <inline-formula id="j_infor517_ineq_004"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>∩</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>∅</mml:mi></mml:math><tex-math><![CDATA[${G_{i}}\cap {G_{j}}=\varnothing $]]></tex-math></alternatives></inline-formula>.</p>
<p>To accommodate the perception that a GO term that appears only in some genes seems to provide a more appropriate and specific description of a gene than a general term that occurs in almost every gene, the gene versus biological process matrix was weighted using the term frequency-inverse document frequency (tf-idf) (Jones, <xref ref-type="bibr" rid="j_infor517_ref_020">1972</xref>). Inverse document frequency was developed in an information retrieval context and consists of a numerical statistic aimed at reflecting how important a word is to a document in a collection (Rajaraman and Ullman, <xref ref-type="bibr" rid="j_infor517_ref_038">2011</xref>), where it seeks to use the most relevant information for document identification. In the present context of gene set comparison, the inverse document frequency and term frequency were calculated regarding the documents as represented by the gene set <italic>G</italic> and the terms represented by the GO term set <italic>T</italic>. In general, the term frequency (tf) depends on the number of occurrences of the term in the document, although there are various ways to define tf (Manning <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_031">2008</xref>). In this work, GO terms <inline-formula id="j_infor517_ineq_005"><alternatives><mml:math>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">T</mml:mi></mml:math><tex-math><![CDATA[$t\in T$]]></tex-math></alternatives></inline-formula> represent documents in the information retrieval sense, and the terms are the genes <italic>g</italic> of a set <italic>G</italic>. The frequency of one gene <italic>g</italic> from the gene set <italic>G</italic> is computed with an aggregation function (mean, sum, <inline-formula id="j_infor517_ineq_006"><alternatives><mml:math>
<mml:mo>…</mml:mo></mml:math><tex-math><![CDATA[$\dots $]]></tex-math></alternatives></inline-formula>) denoted as <italic>f</italic> over all GO terms <inline-formula id="j_infor517_ineq_007"><alternatives><mml:math>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">T</mml:mi></mml:math><tex-math><![CDATA[$t\in T$]]></tex-math></alternatives></inline-formula> the gene is annotated to. To calculate tf, the resulting value is divided by the maximum value observed over all genes in the set, i.e. 
<disp-formula id="j_infor517_eq_001">
<label>(1)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="normal">tf</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">g</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">g</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false">max</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">g</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">g</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \mathrm{tf}(g)=\frac{f({g_{t}})}{{\max _{g\in G}}\{f({g_{t}})\}}.\]]]></tex-math></alternatives>
</disp-formula> 
For simplification, we used the augmented frequency with the aggregation function mean for manually curated genes. For each gene, the inverse document frequency (idf) in equation (<xref rid="j_infor517_eq_002">2</xref>) is the logarithm of a ratio: the number <italic>N</italic> of GO terms to which any gene <italic>g</italic> of the set is annotated, divided by the number <inline-formula id="j_infor517_ineq_008"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">g</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$n(g)$]]></tex-math></alternatives></inline-formula> of GO terms to which the specific gene <italic>g</italic> is annotated. The ratio is shifted by 1 before the logarithm is applied, which ensures that the resulting idf is greater than zero: 
<disp-formula id="j_infor517_eq_002">
<label>(2)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="normal">idf</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">g</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo movablelimits="false">log</mml:mo>
<mml:mo mathvariant="normal" fence="true" maxsize="2.03em" minsize="2.03em">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">g</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo mathvariant="normal" fence="true" maxsize="2.03em" minsize="2.03em">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \mathrm{idf}(g)=\log \bigg(1+\frac{N}{n(g)}\bigg).\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>Finally, the term frequency-inverse document frequency <italic>F</italic> is given as the product of the term frequency and the inverse document frequency, i.e. 
<disp-formula id="j_infor517_eq_003">
<label>(3)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">F</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">g</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">tf</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">g</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>·</mml:mo>
<mml:mi mathvariant="normal">idf</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">g</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ F(g)=\mathrm{tf}(g)\cdot \mathrm{idf}(g).\]]]></tex-math></alternatives>
</disp-formula> 
In equation (<xref rid="j_infor517_eq_003">3</xref>), <italic>F</italic> reduces the weight of genes that occur very frequently among the GO terms and increases the weight of genes that occur rarely. Thus, a gene annotated to only a few GO terms is more informative than one annotated to almost every GO term.</p>
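<p>As an illustration of equations (<xref rid="j_infor517_eq_002">2</xref>) and (<xref rid="j_infor517_eq_003">3</xref>), the following minimal Python sketch computes the idf and <italic>F</italic> for a toy annotation table. It is not the implementation of the BIDistances package. The input layout (a mapping from gene to a set of GO term identifiers), the function names, the natural logarithm, and the placeholder tf values are assumptions made only for this example.</p>
<preformat><![CDATA[
```python
import math


def idf_per_gene(annotations):
    """Equation (2): idf = log(1 + N / n(g)) for every annotated gene.

    annotations: dict mapping a gene to the set of GO terms it is annotated to
    (hypothetical input layout chosen for this sketch).
    """
    # N: number of GO terms to which at least one gene of the set is annotated
    n_total = len(set().union(*annotations.values()))
    # n(g): number of GO terms to which the specific gene g is annotated
    return {
        gene: math.log(1.0 + n_total / len(terms))
        for gene, terms in annotations.items()
        if terms  # genes without any annotation are skipped
    }


def tf_idf(tf, idf):
    """Equation (3): F(g) = tf(g) * idf(g)."""
    return {gene: tf[gene] * idf[gene] for gene in idf if gene in tf}


# Toy data, not taken from the gene sets of the paper
annotations = {
    "geneA": {"GO:1", "GO:2", "GO:3", "GO:4"},
    "geneB": {"GO:1"},
    "geneC": {"GO:2", "GO:3"},
}
tf = {"geneA": 1.0, "geneB": 1.0, "geneC": 1.0}  # placeholder tf values
idf = idf_per_gene(annotations)
F = tf_idf(tf, idf)
```
]]></preformat>
<p>With equal tf values, the gene annotated to a single GO term (geneB) receives the largest weight and the gene annotated to all four GO terms (geneA) the smallest, which mirrors the weighting behaviour described for <italic>F</italic>.</p>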
<p>The distance <italic>D</italic> between two genes <italic>i</italic> and <italic>j</italic> is defined in equation (<xref rid="j_infor517_eq_004">4</xref>) as the absolute difference of their <italic>F</italic> values from equation (<xref rid="j_infor517_eq_003">3</xref>): 
<disp-formula id="j_infor517_eq_004">
<label>(4)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" stretchy="true">|</mml:mo>
<mml:mi mathvariant="italic">F</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">F</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" stretchy="true">|</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ D(i,j)=\big|F(i)-F(j)\big|.\]]]></tex-math></alternatives>
</disp-formula> 
This work will show that the distribution of the distance <italic>D</italic> is multimodal for all four investigated sets of genes, which indicates that knowledge-based structures exist and can be exploited (Thrun, <xref ref-type="bibr" rid="j_infor517_ref_052">2021b</xref>). The existence of clusters is verified by statistical testing and structure analysis using the topographic map and heatmap (see sections below). Subsequent cluster analysis will yield homogeneous subsets of genes for each gene set. The specific genes in each subset are listed in SI <xref rid="j_infor517_s_010">A</xref> (Tables 1–4). The distance is currently available as an R package on GitHub (<uri>https://github.com/Mthrun/BIDistances/</uri>) and CRAN (<uri>https://CRAN.R-project.org/package=BIDistances</uri>).</p>
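<p>Equation (<xref rid="j_infor517_eq_004">4</xref>) can be sketched in a few lines of Python. The function name and the <italic>F</italic> values below are illustrative assumptions and do not come from the investigated gene sets.</p>
<preformat><![CDATA[
```python
def distance_matrix(f_values):
    """Equation (4): D[i][j] = |F(i) - F(j)| for a list of tf-idf values."""
    return [[abs(fi - fj) for fj in f_values] for fi in f_values]


# Illustrative F values for four genes (hypothetical numbers)
F = [0.35, 0.40, 1.10, 1.10]
D = distance_matrix(F)
```
]]></preformat>
<p>By construction, the matrix is symmetric with a zero diagonal. Two different genes with identical <italic>F</italic> values, such as the last two genes in the example, have distance zero.</p>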
<sec id="j_infor517_s_004">
<label>3.1</label>
<title>Identification of Homogeneous Groups in Gene Sets</title>
<p>Identification of homogeneous groups of semantically related genes is performed using unsupervised machine learning (Murphy, <xref ref-type="bibr" rid="j_infor517_ref_034">2012</xref>) implemented as the swarm intelligence of the DBS (Thrun and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_061">2021</xref>). The DBS is a flexible and robust clustering framework that consists of three independent modules: swarm-based projection, high-dimensional data visualization (Thrun and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_059">2020a</xref>), and representation-guided clustering. The first module is the parameter-free projection method Pswarm, which exploits concepts of self-organization, emergence, and game theory using swarm intelligence. Pswarm uses either a data matrix or a given distance or distance measure as input.</p>
<p>The intelligent agents of Pswarm operate on a toroid grid, where positions are coded into polar coordinates to allow for the precise definition of their movement, neighbourhood function, and annealing scheme. The size of the grid and, in contrast to other (focusing) projection methods, the annealing scheme do not require any parameters to be set. During learning, each agent moves across the grid or stays in its current position in the search for the most potent scent emitted by other agents. Hence, agents search for other agents carrying data with the most similar features to themselves with a data-driven decreasing search radius. The movement of every agent is modelled using a game theory approach, and the radius decreases only if a Nash equilibrium (Nash Jr., <xref ref-type="bibr" rid="j_infor517_ref_036">1950</xref>) is found. After the self-organization of agents is finished, the output of the Pswarm algorithm is a scatter plot of projected points representing a folding of the high-dimensional data space. The second module is a parameter-free high-dimensional data visualization technique called the topographic map (Thrun and Lerch, <xref ref-type="bibr" rid="j_infor517_ref_056">2016</xref>). It uses the generalized U-matrix computed on the projected points and visualizes the folding of the high-dimensional space, i.e. how well the two-dimensional similarities between projected points represent high-dimensional distances. Moreover, the topographic map enables the estimation of the number of clusters, if any cluster tendency exists. The third module offers a clustering method that can be verified by the visualization and vice versa. The complete method is applied to four gene sets described in Table <xref rid="j_infor517_tab_001">1</xref>. It is accessible as the R package “DatabionicSwarm” on CRAN (<uri>https://CRAN.R-project.org/package=DatabionicSwarm</uri>). 
For each gene, the GO knowledge base was accessed to identify all GO terms associated with this gene, resulting in a feature matrix of genes versus GO terms. This feature matrix is used to compute the distance in equation (<xref rid="j_infor517_eq_004">4</xref>).</p>
<p>Searching for multimodality in distance distributions can be reasonable if no prior knowledge about the data is available (Thrun, <xref ref-type="bibr" rid="j_infor517_ref_052">2021b</xref>). Under the assumption that distance-based structures are sought, this approach makes it possible to assess whether a distance is appropriate and to evaluate clustering solutions using Gaussian mixture models (GMMs) of distance distributions. Multimodality in the distance distribution indicates modes of intrapartition distances and interpartition distances. If a distance distribution is multimodal, the GMM provides the hypothesis that intra-cluster distances are represented mostly by the left-most mode and do not overlap with the right-most mode of the full distance distribution (Thrun, <xref ref-type="bibr" rid="j_infor517_ref_052">2021b</xref>).</p>
</sec>
<sec id="j_infor517_s_005">
<label>3.2</label>
<title>Validation of Homogeneous Structures in Comparison to Related Work</title>
<p>The validation is performed with the topographic map (Thrun and Lerch, <xref ref-type="bibr" rid="j_infor517_ref_056">2016</xref>; Thrun and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_060">2020b</xref>; Thrun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_064">2021</xref>), cluster heatmaps (Wilkinson and Friendly, <xref ref-type="bibr" rid="j_infor517_ref_071">2009</xref>), dendrograms of hierarchical clustering methodology defined in Thrun (<xref ref-type="bibr" rid="j_infor517_ref_054">2022b</xref>), and one unsupervised quality measure provided in the FCPS package available on CRAN (Thrun and Stier, <xref ref-type="bibr" rid="j_infor517_ref_057">2021</xref>) as well as through distance distributions (Thrun, <xref ref-type="bibr" rid="j_infor517_ref_052">2021b</xref>).</p>
<p>Acharya <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor517_ref_001">2017</xref>) proposed to find subgroups of genes by applying PAM. PAM was combined with three conventional distance measures (Euclidean, Manhattan, and cosine distance) to obtain groups of semantically related genes (Acharya <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_001">2017</xref>). Hence, PAM is compared with the methodology proposed here. In both cases, the proposed tf-idf distance measure is used. Acharya <italic>et al.</italic> evaluated their results by the average Silhouette index, Dunn index, and Davies-Bouldin index. However, the Silhouette index evaluates only if spherical cluster structures exist in datasets (Thrun, <xref ref-type="bibr" rid="j_infor517_ref_051">2021a</xref>). It is not investigated here if the Dunn index is applicable because it requires the distance measure to be a metric (see Thrun, <xref ref-type="bibr" rid="j_infor517_ref_052">2021b</xref> for discussion). Hence, the corresponding values of the Davies-Bouldin index (Davies and Bouldin, <xref ref-type="bibr" rid="j_infor517_ref_012">1979</xref>) are reported. The best clustering scheme minimizes the Davies-Bouldin index because it is defined as a function of the ratio of the within-cluster scatter to the between-cluster separation (Davies and Bouldin, <xref ref-type="bibr" rid="j_infor517_ref_012">1979</xref>). The Davies-Bouldin index and PAM clustering are provided by the FCPS package, available as an R package on CRAN (Thrun and Stier, <xref ref-type="bibr" rid="j_infor517_ref_057">2021</xref>).</p>
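<p>To make the ratio of within-cluster scatter to between-cluster separation concrete, the following Python sketch computes a Davies-Bouldin index for one-dimensional values such as <italic>F</italic>. It is a textbook-style variant written for illustration, with cluster means as centroids and mean absolute deviation as scatter. It is an assumption that this matches the reported setting, and the FCPS implementation may differ in detail. The function name and the toy values are hypothetical.</p>
<preformat><![CDATA[
```python
from statistics import fmean


def davies_bouldin_1d(values, labels):
    """Davies-Bouldin index for scalar values and one cluster label per value."""
    clusters = {}
    for value, label in zip(values, labels):
        clusters.setdefault(label, []).append(value)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    # Centroid and within-cluster scatter (mean absolute deviation) per cluster
    centroid = {c: fmean(vs) for c, vs in clusters.items()}
    scatter = {
        c: fmean([abs(v - centroid[c]) for v in vs]) for c, vs in clusters.items()
    }

    # For each cluster, take the worst ratio (scatter_i + scatter_j) / separation_ij.
    # Clusters with identical centroids would cause a division by zero.
    worst = []
    for i in clusters:
        ratios = [
            (scatter[i] + scatter[j]) / abs(centroid[i] - centroid[j])
            for j in clusters
            if j != i
        ]
        worst.append(max(ratios))
    return fmean(worst)


# Toy values (hypothetical): two compact, well separated groups
values = [0.0, 0.2, 1.0, 1.2]
db_good = davies_bouldin_1d(values, [1, 1, 2, 2])  # matches the structure
db_bad = davies_bouldin_1d(values, [1, 2, 1, 2])   # mixes the two groups
```
]]></preformat>
<p>In this toy example the labelling that matches the two groups yields the lower index, in line with the reading that lower Davies-Bouldin values indicate more homogeneous structures.</p>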
<p>The topographic map visualizes the high-dimensional structures of data points represented by genes here. The topographic map is visualized with so-called hypsometric tints (Thrun and Lerch, <xref ref-type="bibr" rid="j_infor517_ref_056">2016</xref>). Hypsometric tints are surface colours that represent ranges of elevation, which are combined with a specific colour scale. The colour scale is chosen to display various valleys, ridges, and basins: blue colours indicate small distances (sea level) between genes, green and brown colours indicate middle distances (low hills) between genes, and shades of white colours indicate vast distances between genes (high mountains covered with snow and ice). Valleys and basins represent homogeneous groups of genes, and the watersheds of hills and mountains represent the borders between the groups in a gene set. In this 3D landscape, the borders of the visualization are cyclically connected with a periodicity. Each point in the topographic map represents a gene coloured by its assigned group using DBS. Here the interest lies in distance-based structures in data. Therefore, heatmaps are provided in which the clustering Cls orders the distances <inline-formula id="j_infor517_ineq_009"><alternatives><mml:math>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$D(i,j)$]]></tex-math></alternatives></inline-formula> in equation (<xref rid="j_infor517_eq_004">4</xref>), with blue to yellow colours indicating low distances and orange to red colours indicating large distances; the colour scale is depicted in a legend on the right. Each group is depicted on the axis by “Cls x”. If the colouring of the map’s ordered pixels indicates that the intracluster distances are smaller than the intercluster distances, then the structures are homogeneous in the sense described above. By applying Gaussian mixture modelling to the distance distribution, a specific Bayesian hypothesis can be stated about the range in which the intra-cluster distances should mainly lie (Thrun, <xref ref-type="bibr" rid="j_infor517_ref_052">2021b</xref>).</p>
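<p>The ordering step behind such a heatmap, and the comparison of intracluster with intercluster distances, can be sketched as follows. The function names, the toy distance matrix, and the clustering vector are assumptions for illustration. Plotting is omitted.</p>
<preformat><![CDATA[
```python
def order_by_clustering(D, cls):
    """Reorder a distance matrix so that genes of the same cluster are adjacent."""
    order = sorted(range(len(cls)), key=lambda k: cls[k])  # stable sort by label
    ordered = [[D[a][b] for b in order] for a in order]
    return order, ordered


def mean_intra_inter(D, cls):
    """Mean distance within clusters and between clusters (upper triangle only)."""
    intra, inter = [], []
    n = len(cls)
    for a in range(n):
        for b in range(a + 1, n):
            (intra if cls[a] == cls[b] else inter).append(D[a][b])
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Toy example: four genes with hypothetical F values and a clustering Cls
F = [0.10, 0.90, 0.15, 1.00]
D = [[abs(fi - fj) for fj in F] for fi in F]
cls = [1, 2, 1, 2]

order, D_ordered = order_by_clustering(D, cls)
intra_mean, inter_mean = mean_intra_inter(D, cls)
```
]]></preformat>
<p>After reordering, the blocks along the diagonal of the matrix hold the intracluster distances. In the toy example their mean is clearly smaller than the mean intercluster distance, which is the pattern the heatmaps are inspected for.</p>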
</sec>
<sec id="j_infor517_s_006">
<label>3.3</label>
<title>Retrieving Meaningful Descriptions</title>
<p>One way of obtaining meaningful explanations of the clusters is to use expert knowledge. Either an expert can be asked directly or the answers can be looked up, which can be quite cumbersome. Recent developments have produced chatbots that answer given questions (Brown <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_009">2020</xref>). Such methods use large amounts of data from the web and models with billions of parameters (Brown <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_009">2020</xref>). ChatGPT accepts any kind of natural language input and answers only in natural language. In this manner, questions about patterns and context regarding a given set of references, such as the NCBI numbers, can be posed to ChatGPT. Currently, there is no way to verify the answers automatically (Lewkowycz <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_024">2022</xref>).</p>
</sec>
</sec>
<sec id="j_infor517_s_007">
<label>4</label>
<title>Results</title>
<table-wrap id="j_infor517_tab_001">
<label>Table 1</label>
<caption>
<p>The table presents the number of items (#) of genes <inline-formula id="j_infor517_ineq_010"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">g</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${g_{t}}$]]></tex-math></alternatives></inline-formula> and GO terms <italic>t</italic>, and the Davies-Bouldin index values (DB) for the gene sets associated with pain and the chronification of pain (Ultsch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_068">2016</xref>), hearing loss (GeneTestingRegistry, <xref ref-type="bibr" rid="j_infor517_ref_015">2018</xref>), cancer (Sondka <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_043">2018</xref>), and drug addiction (Li <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_025">2008</xref>). Only genes <inline-formula id="j_infor517_ineq_011"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">g</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${g_{t}}$]]></tex-math></alternatives></inline-formula> that are annotated to specific GO terms <italic>t</italic> within one of the three ontologies (Ont.) biological process (1), molecular function (2) and cellular component (3) are considered. No classification vector is available for the gene sets. Lower values of the Davies-Bouldin index indicate more homogeneous structures.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Name of gene set</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"># <inline-formula id="j_infor517_ineq_012"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">g</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${g_{t}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"># <italic>t</italic> in Ont. <inline-formula id="j_infor517_ineq_013"><alternatives><mml:math>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>+</mml:mo>
<mml:mn>3</mml:mn></mml:math><tex-math><![CDATA[$1+2+3$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"># <italic>t</italic> in Ont. 1</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"># <italic>t</italic> in Ont. 2</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"># <italic>t</italic> in Ont. 3</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"># groups (+ outlier groups)</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">DB for DBS</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">DB for PAM</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Hearing Loss</td>
<td style="vertical-align: top; text-align: left">109</td>
<td style="vertical-align: top; text-align: left">829</td>
<td style="vertical-align: top; text-align: left">540</td>
<td style="vertical-align: top; text-align: left">153</td>
<td style="vertical-align: top; text-align: left">136</td>
<td style="vertical-align: top; text-align: left">3(+1)</td>
<td style="vertical-align: top; text-align: left">0.53</td>
<td style="vertical-align: top; text-align: left">0.76</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Pain</td>
<td style="vertical-align: top; text-align: left">528</td>
<td style="vertical-align: top; text-align: left">3137</td>
<td style="vertical-align: top; text-align: left">2208</td>
<td style="vertical-align: top; text-align: left">642</td>
<td style="vertical-align: top; text-align: left">287</td>
<td style="vertical-align: top; text-align: left">3(+2)</td>
<td style="vertical-align: top; text-align: left">0.59</td>
<td style="vertical-align: top; text-align: left">0.63</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Cancer</td>
<td style="vertical-align: top; text-align: left">696</td>
<td style="vertical-align: top; text-align: left">4283</td>
<td style="vertical-align: top; text-align: left">3002</td>
<td style="vertical-align: top; text-align: left">775</td>
<td style="vertical-align: top; text-align: left">506</td>
<td style="vertical-align: top; text-align: left">3(+1)</td>
<td style="vertical-align: top; text-align: left">0.60</td>
<td style="vertical-align: top; text-align: left">0.72</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Drug Addiction</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">381</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">3107</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2140</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">586</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">381</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">3(+2)</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.51</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.60</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The proposed distance measure is evaluated on four gene sets, each given as a list of NCBI numbers (see Table <xref rid="j_infor517_tab_001">1</xref>): genes associated with hearing loss (109), pain (528), cancer (696) and drug addiction (381). The four data sets with further relevant information are available on Zenodo: 10.5281/zenodo.7706192. The distances are computed based on equation (<xref rid="j_infor517_eq_004">4</xref>), resulting in a distance matrix. The distance feature <inline-formula id="j_infor517_ineq_014"><alternatives><mml:math>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mi mathvariant="italic">f</mml:mi></mml:math><tex-math><![CDATA[$df$]]></tex-math></alternatives></inline-formula> is defined as the vector with the elements of the upper triangle of the distance matrix (Thrun, <xref ref-type="bibr" rid="j_infor517_ref_052">2021b</xref>).</p>
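<p>Extracting the distance feature from a distance matrix can be sketched as below. It is assumed here that the upper triangle excludes the diagonal, so that each pair of genes contributes exactly one distance. The function name and the toy matrix are hypothetical.</p>
<preformat><![CDATA[
```python
from itertools import combinations


def distance_feature(D):
    """Return the entries above the diagonal of a square distance matrix."""
    n = len(D)
    return [D[i][j] for i, j in combinations(range(n), 2)]


# Toy 3 x 3 distance matrix (hypothetical values)
D = [
    [0.00, 0.05, 0.75],
    [0.05, 0.00, 0.70],
    [0.75, 0.70, 0.00],
]
df = distance_feature(D)
```
]]></preformat>
<p>Under this assumption, a set of <italic>n</italic> genes yields <italic>n</italic>(<italic>n</italic> − 1)/2 distances, for instance 5886 for the 109 genes of the hearing loss set.</p>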
<fig id="j_infor517_fig_001">
<label>Fig. 1</label>
<caption>
<p>(a) Gaussian mixture model of the distance distribution of genes associated with hearing loss (GeneTestingRegistry, <xref ref-type="bibr" rid="j_infor517_ref_015">2018</xref>) and (b) QQ-plot with paired quantiles of Data on <italic>y</italic> axis and the Gaussian mixture model on <italic>x</italic> axis. The Gaussian mixture model (left) shows the three distance components indicating distance-based structures and the QQ-plot (right) validates the Gaussian mixture model as appropriate based on the match between blue dots and red line for most of the plot.</p>
</caption>
<graphic xlink:href="infor517_g001.jpg"/>
</fig>
<fig id="j_infor517_fig_002">
<label>Fig. 2</label>
<caption>
<p>(a) Gaussian mixture model of the distance distribution of genes associated with pain (Ultsch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_068">2016</xref>) and (b) QQ-plot with paired quantiles of Data on <italic>y</italic> axis and the Gaussian mixture model on <italic>x</italic> axis. The Gaussian mixture model (left) shows the three distance components indicating distance-based structures and the QQ-plot (right) validates the Gaussian mixture model as appropriate based on the match between blue dots and red line for most of the plot.</p>
</caption>
<graphic xlink:href="infor517_g002.jpg"/>
</fig>
<p>In the first part, the proposed distance measure is evaluated. In the second part, structure and cluster analysis of the distances is performed. For the first part of this section, four figures (Figs. <xref rid="j_infor517_fig_001">1</xref>–<xref rid="j_infor517_fig_004">4</xref>) are presented. Each shows on its left side the Gaussian mixture model for the distances of the knowledge-based structures and on its right side the QQ-plot of the estimated distance distribution against the Gaussian mixture model for model evaluation. In the Gaussian mixture model visualization, the black line represents the density estimation, the blue lines represent the three components of the Gaussian mixture model, and the red line represents the superposition of the blue components. In the QQ-plots, the blue dots represent the pairings of the quantiles of the model and the estimated data distribution. The red line indicates where the quantiles would need to lie in order to yield an optimal match of both distributions. The density of the distances of each dataset is estimated by the procedure described in Thrun (<xref ref-type="bibr" rid="j_infor517_ref_052">2021b</xref>).</p>
<fig id="j_infor517_fig_003">
<label>Fig. 3</label>
<caption>
<p>(a) Gaussian mixture model of the distance distribution of genes associated with cancer (Sondka <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_043">2018</xref>) and (b) QQ-plot with paired quantiles of Data on <italic>y</italic> axis and the Gaussian mixture model on <italic>x</italic> axis. The Gaussian mixture model (left) shows the three distance components indicating distance-based structures and the QQ-plot (right) validates the Gaussian mixture model as appropriate based on the match between blue dots and red line for most of the plot.</p>
</caption>
<graphic xlink:href="infor517_g003.jpg"/>
</fig>
<fig id="j_infor517_fig_004">
<label>Fig. 4</label>
<caption>
<p>(a) Gaussian mixture model of the distance distribution of genes associated with drug addiction (Li <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_025">2008</xref>) and (b) QQ-plot with paired quantiles of Data on <italic>y</italic> axis and the Gaussian mixture model on <italic>x</italic> axis. The Gaussian mixture model (left) shows the three distance components indicating distance-based structures and the QQ-plot (right) validates the Gaussian mixture model as appropriate based on the match between blue dots and red line for most of the plot.</p>
</caption>
<graphic xlink:href="infor517_g004.jpg"/>
</fig>
<p>We used three components in each Gaussian mixture model to partition the distances into three groups: small, intermediate and large distances. We chose the variances such that the overlap of neighbouring Gaussian mixture model components was minimal. These values then served as initial values for an Expectation-Maximization algorithm, which optimized the Gaussian mixture model to a local optimum. To select a final model that can be considered valid, we applied Occam’s razor and chose the simplest model that sufficiently explains the data (Blumer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_008">1987</xref>). Each model was verified by QQ-plots (Michael, <xref ref-type="bibr" rid="j_infor517_ref_033">1983</xref>; Thrun and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_058">2015</xref>). To assess the fine details of the distribution, we used appropriate tools following Thrun <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor517_ref_062">2020a</xref>). For the distance distributions in Figs. <xref rid="j_infor517_fig_001">1</xref>–<xref rid="j_infor517_fig_004">4</xref> and <xref rid="j_infor517_fig_008">8</xref>, multimodality is visible in the estimated probability density functions, and the dip tests verify the hypothesis (Hartigan and Hartigan, <xref ref-type="bibr" rid="j_infor517_ref_017">1985</xref>).</p>
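<p>The general idea of refining initial values with Expectation-Maximization can be sketched in plain Python for a one-dimensional mixture with three components. This is a generic EM loop under stated assumptions, not the fitting procedure used for the figures: the data are synthetic, the initial values are picked by hand rather than by the variance criterion described above, and all names are hypothetical.</p>
<preformat><![CDATA[
```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_gmm_1d(data, means, sigmas, weights, iterations=100, min_sigma=1e-3):
    """Refine initial GMM parameters with EM; converges to a local optimum."""
    means, sigmas, weights = list(means), list(sigmas), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: responsibility of every component for every data point
        resp = []
        for x in data:
            p = [weights[c] * normal_pdf(x, means[c], sigmas[c]) for c in range(k)]
            total = sum(p) or 1e-300  # guard against numerical underflow
            resp.append([v / total for v in p])
        # M-step: update weight, mean and standard deviation per component
        for c in range(k):
            nc = sum(r[c] for r in resp)
            if nc == 0.0:
                continue  # component received no mass; keep its parameters
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / nc
            sigmas[c] = max(math.sqrt(var), min_sigma)
            weights[c] = nc / len(data)
    return means, sigmas, weights


# Synthetic "distances": small, intermediate and large values (not real data)
rng = random.Random(0)
true_means = [0.2, 1.0, 2.0]
data = [rng.gauss(m, 0.1) for m in true_means for _ in range(200)]

means, sigmas, weights = em_gmm_1d(
    data, means=[0.0, 1.2, 2.5], sigmas=[0.3, 0.3, 0.3], weights=[1 / 3] * 3
)
```
]]></preformat>
<p>Because EM only reaches a local optimum, the result depends on the initial values, which is why a careful initialization and a subsequent check such as a QQ-plot are useful.</p>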
<p>Each Gaussian mixture model consists of three components representing different underlying structures. From left to right, the three Gaussian mixture model components can be interpreted as models of one intracluster (small distances) and two intercluster (medium and large distances) distance distributions (cf. Thrun, <xref ref-type="bibr" rid="j_infor517_ref_052">2021b</xref>). Hartigan’s dip test yields <italic>p</italic>-values &lt; 0.01 in all four cases, indicating multimodal distance distributions and therefore the existence of cluster structures (clusterability) (Adolfsson <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_002">2019</xref>; Thrun, <xref ref-type="bibr" rid="j_infor517_ref_050">2020</xref>). The Root Mean Square Deviation (RMS) of the Gaussian mixture models is 0.2206, 0.1896, 0.1703 and 0.1875 in the order of Figs. <xref rid="j_infor517_fig_001">1</xref>–<xref rid="j_infor517_fig_004">4</xref>. The chi-squared test does not reject the null hypothesis that the estimated data distribution does not differ significantly from the Gaussian mixture model. An investigation of the distances of datapoints within each cluster for all four datasets shows that all or most of the intracluster distances lie beneath the Bayesian boundary of their respective first Gaussian mixture model component. More specifically, all intracluster distances of the first two datasets lie within the Bayesian boundary of their respective first Gaussian mixture model component. For the third dataset, the same holds except for clusters 1 and 3, for which 30% and &lt;1% of the intracluster distances, respectively, lie above the Bayesian boundary. For the fourth dataset, 6% of the intracluster distances of cluster 1 lie above the Bayesian boundary.</p>
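<p>A Bayesian boundary between two neighbouring mixture components can be understood as the point where their weighted densities are equal. The sketch below locates such a point by bisection and reports the share of distances above it. It is an illustrative reading of the boundary under stated assumptions, not necessarily the exact computation used for the reported percentages. The function names and the numbers are hypothetical.</p>
<preformat><![CDATA[
```python
import math


def log_weighted_pdf(x, weight, mu, sigma):
    # Logarithm of weight * N(x | mu, sigma) without the constant sqrt(2*pi),
    # which cancels when two components are compared.
    return math.log(weight) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def bayes_boundary(w1, mu1, s1, w2, mu2, s2, tol=1e-10):
    """Point between mu1 < mu2 where both weighted densities are equal."""

    def diff(x):
        return log_weighted_pdf(x, w1, mu1, s1) - log_weighted_pdf(x, w2, mu2, s2)

    lo, hi = mu1, mu2
    if diff(lo) <= 0 or diff(hi) >= 0:
        raise ValueError("no sign change between the two means")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if diff(mid) > 0:
            lo = mid  # component 1 still dominates at mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def share_above(distances, boundary):
    """Fraction of distances that lie above the boundary."""
    return sum(d > boundary for d in distances) / len(distances)


# Two components with equal weight and spread: the boundary is the midpoint
boundary = bayes_boundary(0.5, 0.0, 1.0, 0.5, 2.0, 1.0)
fraction = share_above([0.1, 0.2, 0.3, 1.5], boundary)
```
]]></preformat>
<p>Applied to the intracluster distances of one cluster and the boundary between the first and second component, <monospace>share_above</monospace> gives the kind of percentage quoted above for the clusters that exceed the boundary.</p>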
<p>In the second part of the results section, the presented figures (Figs. <xref rid="j_infor517_fig_005">5</xref>, <xref rid="j_infor517_fig_006">6</xref>(a)–<xref rid="j_infor517_fig_006">6</xref>(d)) first visualize the knowledge-based structures with topographic maps and then verify the found structures analytically with heatmaps of the distance matrices, which were ordered by the classifications obtained with the DBS clustering. The supplementary information provides two further techniques supporting the findings of the topographic map, namely the Mirrored-Density plot (SI <xref rid="j_infor517_s_010">A</xref>) and the dendrograms (SI <xref rid="j_infor517_s_011">B</xref>). The Mirrored-Density plots (see Figs. <xref rid="j_infor517_fig_007">7</xref>(a)–<xref rid="j_infor517_fig_007">7</xref>(d) in SI <xref rid="j_infor517_s_010">A</xref>) show the distance distribution for the complete dataset, each cluster recognized by the DBS and the remaining noise. The vertical lines indicate the Bayesian borders resulting from the GMM, which partition the distances into their previously determined groups (intracluster distances and medium and large intercluster distances). The dendrograms (see Figs. <xref rid="j_infor517_fig_008">8</xref>(a)–<xref rid="j_infor517_fig_008">8</xref>(d) in SI <xref rid="j_infor517_s_011">B</xref>) show the ultrametric proportions of the distance measure representing structures from high-dimensional data (Murtagh, <xref ref-type="bibr" rid="j_infor517_ref_035">2004</xref>). The colouring of the datapoints is based on the clusters recognized by the DBS.</p>
<fig id="j_infor517_fig_005">
<label>Fig. 5</label>
<caption>
<p>Knowledge-based structures of associated gene sets indicate homogeneous groups in the topographic maps: hearing loss (GeneTestingRegistry, <xref ref-type="bibr" rid="j_infor517_ref_015">2018</xref>) (left top) – three groups and five outliers, pain (Ultsch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_068">2016</xref>) (right top) – five groups, cancer (Sondka <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_043">2018</xref>) (left bottom) – three groups and one outlier and drug addiction (Li <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_025">2008</xref>) (right bottom) – three groups and two groups of outliers.</p>
</caption>
<graphic xlink:href="infor517_g005.jpg"/><graphic xlink:href="infor517_g006.jpg"/><graphic xlink:href="infor517_g007.jpg"/><graphic xlink:href="infor517_g008.jpg"/>
</fig>
<fig id="j_infor517_fig_006">
<label>Fig. 6</label>
<caption>
<p>Knowledge-based structures of associated gene sets indicate homogeneous groups verified by heatmaps. The heatmaps show the value of the proposed distance measure.</p>
</caption>
<graphic xlink:href="infor517_g009.jpg"/><graphic xlink:href="infor517_g010.jpg"/><graphic xlink:href="infor517_g011.jpg"/><graphic xlink:href="infor517_g012.jpg"/>
</fig>
<p>The topographic maps of Fig. <xref rid="j_infor517_fig_005">5</xref> show the structures found within the data. The data were obtained from the GO terms restricted to those genes that an expert associated with certain causes. The hearing loss gene set (GeneTestingRegistry, <xref ref-type="bibr" rid="j_infor517_ref_015">2018</xref>) on the left top shows homogeneous groups of points in magenta, red and black. An outlier group of points in yellow is distinctly visible. Additionally, the matching heatmap in Fig. <xref rid="j_infor517_fig_006">6</xref>(a) indicates that the distance between the red and black groups is lower than the distance between the yellow group and all other groups. Because of the low distances within the groups, all groups are homogeneous. The topographic map of Fig. <xref rid="j_infor517_fig_005">5</xref> on the top right outlines the knowledge-based structures of the set of pain genes (Ultsch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_068">2016</xref>). As before, it is visible that the five groups are homogeneous. Distinctively, two outlier groups of genes represented by points in black and red are depicted. Figure <xref rid="j_infor517_fig_005">5</xref> on the left bottom shows the knowledge-based structures of the cancer gene set (Sondka <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_043">2018</xref>) with genes of the first group coloured as points in yellow, the second group in black, the third group in red, and one outlier in magenta. The heatmap in Fig. <xref rid="j_infor517_fig_006">6</xref>(c) agrees with this identification of groups. 
Figure <xref rid="j_infor517_fig_005">5</xref> on the right bottom shows the topographic map of the set of drug addiction genes (Li <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_025">2008</xref>), with one group of genes depicted as points in magenta, a second group in yellow, a third group in red, and outliers in a volcano forming two subgroups in black and green. The heatmap in Fig. <xref rid="j_infor517_fig_006">6</xref>(d) agrees in general with this assessment but differs in one detail: the topographic map indicates that the magenta, yellow and red groups have a low distance in between, whereas the heatmap indicates that the yellow and red groups of genes have a relatively high distance in between (Cluster No. 2 vs. Cluster No. 4). This shows the ability of the topographic map to investigate how well high-dimensional neighbourhoods are preserved in the low-dimensional planar space of the projected points, in contrast to the heatmap, which only visualizes intra- and inter-cluster distances. In this case, the heatmap indicates that the intra-cluster distances are smaller than usual, meaning that the clusters are close to each other. The topographic map, however, shows that the clusters can still be separated, which is verified by the figures of the intra-cluster distributions (SI <xref rid="j_infor517_s_010">A</xref>) and the dendrograms (SI <xref rid="j_infor517_s_011">B</xref>). Table <xref rid="j_infor517_tab_001">1</xref> lists the number of groups identified by DBS; outliers are defined by the visualization as points that lie in a volcano. In Fig. <xref rid="j_infor517_fig_005">5</xref>, the outliers are depicted by yellow points in the left top, by red and green points in the right top, by magenta points in the left bottom, and by black and green points in the right bottom. 
Furthermore, the Davies-Bouldin indices are presented for DBS in comparison to PAM, which is used in Acharya <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor517_ref_001">2017</xref>). The specific genes in each subset of each gene set are listed in Table <xref rid="j_infor517_tab_001">1</xref>.</p>
<p>We propose to identify meaningful descriptions of the found gene subsets with ChatGPT. As a test, we asked ChatGPT (see SI <xref rid="j_infor517_s_012">C</xref> for details) about patterns in the dataset regarding hearing loss (GeneTestingRegistry, <xref ref-type="bibr" rid="j_infor517_ref_015">2018</xref>). The biological functions of each subset of hearing loss genes are summarized in Fig. <xref rid="j_infor517_fig_007">7</xref>. The figure uses the colours of the points of the groups presented in the topographic map of Fig. <xref rid="j_infor517_fig_005">5</xref>(a). Each subset of hearing loss genes has a specific set of functions.</p>
<fig id="j_infor517_fig_007">
<label>Fig. 7</label>
<caption>
<p>Biological description retrieved from ChatGPT for the hearing loss dataset (GeneTestingRegistry, <xref ref-type="bibr" rid="j_infor517_ref_015">2018</xref>). In a clockwise manner starting from red, there are four colours representing four clusters. Cluster No. 1 (magenta) describes age-related genes. Cluster No. 2 (yellow) describes genes associated with autosomal recessive non-syndromic hearing loss. Cluster No. 3 (black tones) describes genes associated with the formation of gap junctions in the inner ear, syndromic hearing loss, progressive hearing loss and autosomal dominant and autosomal recessive forms of non-syndromic hearing loss. Cluster No. 4 (red tones) describes genes which are involved in the formation of stereocilia bundles or the maintenance of the epithelial integrity of the inner ear and in ion transport, actin cytoskeleton dynamics, protein synthesis, intracellular transport, and extracellular matrix formation in the inner ear. Mutations or variants in these genes can lead to hearing loss.</p>
</caption>
<graphic xlink:href="infor517_g013.jpg"/>
</fig>
</sec>
<sec id="j_infor517_s_008">
<label>5</label>
<title>Discussion</title>
<p>This work shows the identification of gene subsets based on a clustering approach with the help of the GO and the term frequency-inverse document frequency (tf-idf). The data were generated with the tf-idf concept based on the gene vs. term matrix extracted from the GO. The methodology is applied to four gene sets with causal associations. The unsupervised algorithm DBS uses knowledge-based structures to extract homogeneous groups. In addition, the groups are verified by heatmaps and Gaussian mixture models of distances. Because the distance measure is constructed on the tf-idf, genes in the same group also share similar biological functions. The existence of similar and dissimilar groups is modelled as three components consisting of low, intermediate and high distances in the Gaussian mixture model of intra- vs. intercluster distances. These knowledge-based structures can be used to define semantically related subspaces which can reduce the high-dimensional gene space in successive gene expression analysis.</p>
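<p>To illustrate the idea of weighting a gene vs. term matrix with tf-idf before computing distances, a minimal sketch is given below. It is not the distance implemented in BIDistances: it assumes the textbook tf-idf weighting (term frequency within a gene multiplied by the logarithm of the inverse document frequency across genes) and, as a further assumption, the Euclidean distance between the weighted rows. The gene and term labels are hypothetical toy values.</p>
<preformat><![CDATA[
```python
import math


def tfidf_rows(annotations):
    """Return (terms, rows): tf-idf weighted rows of a gene vs. term matrix.

    annotations maps a gene label to the list of GO term labels annotated
    to it. Textbook tf-idf is assumed here, not the paper's exact formula.
    """
    genes = sorted(annotations)
    terms = sorted({t for ts in annotations.values() for t in ts})
    n_genes = len(genes)
    # Document frequency: number of genes annotated with each term.
    df = {t: sum(1 for g in genes if t in annotations[g]) for t in terms}
    rows = {}
    for g in genes:
        total = len(annotations[g])
        row = []
        for t in terms:
            tf = annotations[g].count(t) / total if total else 0.0
            idf = math.log(n_genes / df[t])
            row.append(tf * idf)
        rows[g] = row
    return terms, rows


def euclidean(u, v):
    """Euclidean distance between two equally long weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy annotations (labels are placeholders, not real GO ids).
toy = {
    "geneA": ["term1", "term2"],
    "geneB": ["term1", "term2"],
    "geneC": ["term1", "term3"],
}
terms, rows = tfidf_rows(toy)
d_ab = euclidean(rows["geneA"], rows["geneB"])  # identical annotations
d_ac = euclidean(rows["geneA"], rows["geneC"])  # differ in one term
```
]]></preformat>
<p>In this toy example, a term shared by all genes receives zero weight and therefore does not contribute to any distance, whereas rarer terms dominate. This is the property that lets genes with similar specific annotations end up close to each other.</p>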
<p>The error of algorithms applied for the identification of homogeneous structures can generally be defined as the sum of the variance, bias, and noise components. The bias is defined as the difference between the structures within the data and the algorithm's ability to reproduce these structures. In case a global clustering criterion is defined, the bias is the difference between the criterion's definition and the given structures within the data. The variance is the result of the varying outcome of a stochastic algorithm across multiple trials. Small or zero variance means high reproducibility (see Thrun, <xref ref-type="bibr" rid="j_infor517_ref_051">2021a</xref> for details). In prior work, PAM in combination with conventional distances was proposed by Acharya <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor517_ref_001">2017</xref>) to group semantically related genes. However, applying PAM with conventional distances has disadvantages because the structures within the data are unknown and may not necessarily be spherical. The values of the Davies-Bouldin index are lower for DBS than for PAM for the four gene sets (see Table <xref rid="j_infor517_tab_001">1</xref>), which indicates that the structures identified by DBS are more homogeneous. Consequently, it is highly likely that the found structures within the gene sets are not of spherical character. Though the algorithm DBS showed a lower bias for various types of structures evaluated in prior works (Thrun and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_061">2021</xref>), it can yield higher variance than PAM (Thrun, <xref ref-type="bibr" rid="j_infor517_ref_051">2021a</xref>). The result in Fig. <xref rid="j_infor517_fig_005">5</xref> (left top), for example, switches between five and seven groups, sometimes splitting a group into subgroups, although the main structures depicted remain stable. 
Finally, the clusterability (Adolfsson <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_002">2019</xref>) of the datasets was not investigated in prior work (Acharya <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_001">2017</xref>). Thus, clustering could result in arbitrary subsets of genes, since cluster algorithms can be optimized on specific data even if no cluster structures exist within the data at all (Thrun, <xref ref-type="bibr" rid="j_infor517_ref_051">2021a</xref>).</p>
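<p>Why a lower Davies-Bouldin index indicates more homogeneous and better separated groups can be seen from a small sketch of the textbook definition (Davies and Bouldin, <xref ref-type="bibr" rid="j_infor517_ref_012">1979</xref>). The sketch assumes points in Euclidean space with centroids as cluster representatives; how the values in Table <xref rid="j_infor517_tab_001">1</xref> were computed on the proposed distance may differ. The two toy partitions are hypothetical.</p>
<preformat><![CDATA[
```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def dist(u, v):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (lists of points).

    Assumes at least two clusters with distinct centroids.
    """
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance of the members to their own centroid.
    scatter = [sum(dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    worst = []
    for i in range(k):
        # Similarity of cluster i to every other cluster j.
        ratios = [(scatter[i] + scatter[j]) / dist(cents[i], cents[j])
                  for j in range(k) if j != i]
        worst.append(max(ratios))
    return sum(worst) / k


# Hypothetical toy partitions: tight and far apart vs. spread and close.
compact = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
loose = [[[0.0, 0.0], [0.0, 6.0]], [[4.0, 0.0], [4.0, 6.0]]]
db_compact = davies_bouldin(compact)
db_loose = davies_bouldin(loose)
```
]]></preformat>
<p>The index relates the within-cluster scatter to the distance between cluster representatives, so the compact, well-separated toy partition receives the lower value.</p>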
<p>Extensive prior works showed that by using the topographic map in combination with DBS the probability is high that meaningful structures in data instead of noise are identified (Thrun, <xref ref-type="bibr" rid="j_infor517_ref_051">2021a</xref>; Thrun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_064">2021</xref>; Thrun and Ultsch, <xref ref-type="bibr" rid="j_infor517_ref_061">2021</xref>, <xref ref-type="bibr" rid="j_infor517_ref_059">2020a</xref>, <xref ref-type="bibr" rid="j_infor517_ref_060">2020b</xref>; Thrun, <xref ref-type="bibr" rid="j_infor517_ref_055">2022c</xref>, <xref ref-type="bibr" rid="j_infor517_ref_053">2022a</xref>). Further verification by the heatmaps (Fig. <xref rid="j_infor517_fig_006">6</xref>) and the quality measures in Table <xref rid="j_infor517_tab_001">1</xref>, as well as by the intracluster distance distributions (Fig. <xref rid="j_infor517_fig_008">8</xref>), confirms this for the datasets investigated here. In theory, we could use any preferable focusing projection method that can represent structures of high-dimensional data. The resulting projection can always be visualized with the generalized U-matrix technique as a topographic map in order to show which neighbourhoods of the high-dimensional distances were preserved by the projection method (Thrun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_063">2020b</xref>). This visualization technique represents the data structures intuitively, where the colour and contour lines match the convolution of the high-dimensional space exactly (Thrun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_063">2020b</xref>). The advantage of the DBS is the ability to use the distance matrix directly. 
As we integrated GO terms for the definition of dissimilarity and identified distance-based structures, these structures are meaningful because they integrate the knowledge stored in the gene ontology (GO). Of note, finding distance-based structures in data does not necessarily mean that the clusters are meaningful to the domain expert. For the set of hearing loss genes, the identified subsets were meaningful because the AI system ChatGPT identified specific biological functions that differed between subsets (see Fig. <xref rid="j_infor517_fig_007">7</xref>). In the current beta version of ChatGPT, we were limited in the number of input and output characters, which is the reason that the other three gene sets could not be investigated with the AI system. The meaningfulness of the subsets by means of AI systems such as ChatGPT will be investigated in future works.</p>
<p>Alternatively, the meaning of these subsets of genes could be explored via text mining. For example, a web-based biomedical named entity recognition system for PubMed abstracts like PubTator (Wei <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_070">2013</xref>) can be used. PubTator can use the genes as tags in PubMed abstracts and can be accessed via its RESTful API. Genes that PubTator does not tag can be searched in the Medical Subject Headings (MeSH) (Lipscomb, <xref ref-type="bibr" rid="j_infor517_ref_027">2000</xref>), a hierarchically organized medical vocabulary thesaurus used to index articles for PubMed. The articles are curated by NLM and indexed with several related MeSH terms; every MeSH term has a unique id and hierarchical categories. After that, either latent semantic analysis (Landauer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_023">1998</xref>) or latent Dirichlet allocation (LDA) (Blei <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_007">2003</xref>) could be used to extract topics per group of genes. These topics could then serve either to evaluate the meaningfulness of the groups or to apply supervised learning to a combination of topic model vectors and the clustering (Phan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_037">2008</xref>) in order to predict genes in the given knowledge-based structure that are not given in the prior gene set (see Table <xref rid="j_infor517_tab_001">1</xref>).</p>
</sec>
<sec id="j_infor517_s_009">
<label>6</label>
<title>Conclusion</title>
<p>This work introduces a novel distance measure for identifying gene subsets. Prior work mainly focused on the use of gene expression data in order to retrieve gene subsets. Here, the grouping is purely based on the semantic knowledge stored in the gene ontology (GO). Based on the GO, a distance measure between genes is derived using the term frequency-inverse document frequency (tf-idf). By using the proposed distance, we show that a specific gene set can be investigated analytically and that applying cluster analysis can reveal meaningful structures. In this work, an unbiased cluster algorithm named DBS is used to find groups within four gene sets. Analytical tools like Gaussian mixture modelling were further used to verify the found structures based on the new distance measure. The distance is accessible in the R package BIDistances available on CRAN (<uri>https://CRAN.R-project.org/package=BIDistances</uri>). The four use cases show promising results for applying the proposed method in the future.</p>
</sec>
<sec id="j_infor517_s_010">
<label>A</label>
<title>Supplementary A</title>
<fig id="j_infor517_fig_008">
<label>Fig. 8</label>
<caption>
<p>Mirrored-Density plots (Thrun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_062">2020a</xref>) showing the distributions of the distance measure for the complete dataset, the resulting clusters and the noise.</p>
</caption>
<graphic xlink:href="infor517_g014.jpg"/><graphic xlink:href="infor517_g015.jpg"/><graphic xlink:href="infor517_g016.jpg"/><graphic xlink:href="infor517_g017.jpg"/>
</fig>
</sec>
<sec id="j_infor517_s_011">
<label>B</label>
<title>Supplementary B</title>
<fig id="j_infor517_fig_009">
<label>Fig. 9</label>
<caption>
<p>Dendrograms of the distance measure for the four use case datasets. The dendrogram visualizes the ultrametric property of the distance measure, which is able to represent proximity and hierarchical structures of high-dimensional data (Murtagh, <xref ref-type="bibr" rid="j_infor517_ref_035">2004</xref>). In the dendrogram, close datapoints or clusters of datapoints are connected with each other, creating a connection between two points on the <italic>x</italic> axis with small height on the <italic>y</italic> axis. The further away two datapoints or clusters of datapoints are, the greater the height of the connection on the <italic>y</italic> axis.</p>
</caption>
<graphic xlink:href="infor517_g018.jpg"/><graphic xlink:href="infor517_g019.jpg"/><graphic xlink:href="infor517_g020.jpg"/><graphic xlink:href="infor517_g021.jpg"/>
</fig>
</sec>
<sec id="j_infor517_s_012">
<label>C</label>
<title>Supplementary C</title>
<p>We conducted a search for meaningful descriptions of the dataset hearing loss (GeneTestingRegistry, <xref ref-type="bibr" rid="j_infor517_ref_015">2018</xref>) by means of text mining through the usage of an AI system called ChatGPT. Although the approach is currently restricted by the number of characters of input and output, in future, it should be possible for any set or subset of genes. For that purpose, we filtered the NCBI numbers according to the final clustering and asked ChatGPT (Brown <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor517_ref_009">2020</xref>) questions regarding the pattern for this set of genes as follows: Search for patterns for the following genes listed by NCBI numbers in context of hearing loss: <inline-formula id="j_infor517_ineq_015"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">NCBI</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">NCBI</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">NCBI</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{NCBI}_{1}},{\textit{NCBI}_{2}},\dots ,{\textit{NCBI}_{d}}$]]></tex-math></alternatives></inline-formula> and summarized the pattern. <italic>d</italic> denotes the respective number of genes in the selected cluster. We could identify different groupings of biological meaning correlating with the found clustering.</p>
<p><bold>Class 1</bold></p>
<p>Genes from class 1 are mostly associated with age-related hearing loss.</p>
<p><bold>Class 2</bold></p>
<p>Genes from class 2 are mostly associated with autosomal recessive non-syndromic hearing loss.</p>
<p><bold>Class 3</bold></p>
<p>Genes from class 3 are associated with the formation of gap junctions in the inner ear, syndromic hearing loss, progressive hearing loss and autosomal dominant and autosomal recessive forms of non-syndromic hearing loss.</p>
<p><bold>Class 4</bold></p>
<p>Genes from class 4 are involved in the formation of stereocilia bundles or the maintenance of the epithelial integrity of the inner ear and in ion transport, actin cytoskeleton dynamics, protein synthesis, intracellular transport, and extracellular matrix formation in the inner ear. Mutations or variants in these genes can lead to hearing loss.</p>
</sec>
</body>
<back>
<ack id="j_infor517_ack_001">
<title>Acknowledgements</title>
<p>We thank Luca Brinkmann for creating the BIDistances package, into which the distance proposed here is integrated.</p></ack>
<ref-list id="j_infor517_reflist_001">
<title>References</title>
<ref id="j_infor517_ref_001">
<mixed-citation publication-type="journal"><string-name><surname>Acharya</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Saha</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Nikhil</surname>, <given-names>N.</given-names></string-name> (<year>2017</year>). <article-title>Unsupervised gene selection using biological knowledge: application in sample clustering</article-title>. <source>BMC Bioinformatics</source>, <volume>18</volume>(<issue>1</issue>), <fpage>1</fpage>–<lpage>13</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_002">
<mixed-citation publication-type="journal"><string-name><surname>Adolfsson</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Ackerman</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Brownstein</surname>, <given-names>N.C.</given-names></string-name> (<year>2019</year>). <article-title>To cluster, or not to cluster: an analysis of clusterability methods</article-title>. <source>Pattern Recognition</source>, <volume>88</volume>, <fpage>13</fpage>–<lpage>26</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_003">
<mixed-citation publication-type="journal"><string-name><surname>Alm</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Arkin</surname>, <given-names>A.P.</given-names></string-name> (<year>2003</year>). <article-title>Biological networks</article-title>. <source>Current Opinion in Structural Biology</source>, <volume>13</volume>(<issue>2</issue>), <fpage>193</fpage>–<lpage>202</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_004">
<mixed-citation publication-type="journal"><string-name><surname>Ashburner</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Ball</surname>, <given-names>C.A.</given-names></string-name>, <string-name><surname>Blake</surname>, <given-names>J.A.</given-names></string-name>, <string-name><surname>Botstein</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Butler</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Cherry</surname>, <given-names>J.M.</given-names></string-name>, <string-name><surname>Davis</surname>, <given-names>A.P.</given-names></string-name>, <string-name><surname>Dolinski</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Dwight</surname>, <given-names>S.S.</given-names></string-name>, <string-name><surname>Eppig</surname>, <given-names>J.T.</given-names></string-name>, <string-name><surname>Harris</surname>, <given-names>M.A.</given-names></string-name>, <string-name><surname>Hill</surname>, <given-names>D.P.</given-names></string-name>, <string-name><surname>Issel-Tarver</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Kasarskis</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Lewis</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Matese</surname>, <given-names>J.C.</given-names></string-name>, <string-name><surname>Richardson</surname>, <given-names>J.E.</given-names></string-name>, <string-name><surname>Ringwald</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Rubin</surname>, <given-names>G.M.</given-names></string-name>, <string-name><surname>Sherlock</surname>, <given-names>G.</given-names></string-name> (<year>2000</year>). <article-title>Gene ontology: tool for the unification of biology</article-title>. 
<source>Nature Genetics</source>, <volume>25</volume>(<issue>1</issue>), <fpage>25</fpage>–<lpage>29</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_005">
<mixed-citation publication-type="journal"><string-name><surname>Backes</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Keller</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Kuentzer</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Kneissl</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Comtesse</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Elnakady</surname>, <given-names>Y.A.</given-names></string-name>, <string-name><surname>Müller</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Meese</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Lenhof</surname>, <given-names>H.-P.</given-names></string-name> (<year>2007</year>). <article-title>GeneTrail—advanced gene set enrichment analysis</article-title>. <source>Nucleic Acids Research</source>, <volume>35</volume>, <fpage>186</fpage>–<lpage>192</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1093/nar/gkm323" xlink:type="simple">https://doi.org/10.1093/nar/gkm323</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_006">
<mixed-citation publication-type="journal"><string-name><surname>Barabási</surname>, <given-names>A.-L.</given-names></string-name>, <string-name><surname>Oltvai</surname>, <given-names>Z.N.</given-names></string-name> (<year>2004</year>). <article-title>Network biology: understanding the cell’s functional organization</article-title>. <source>Nature Reviews Genetics</source>, <volume>5</volume>(<issue>2</issue>), <fpage>101</fpage>–<lpage>113</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_007">
<mixed-citation publication-type="journal"><string-name><surname>Blei</surname>, <given-names>D.M.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>A.Y.</given-names></string-name>, <string-name><surname>Jordan</surname>, <given-names>M.I.</given-names></string-name> (<year>2003</year>). <article-title>Latent Dirichlet allocation</article-title>. <source>Journal of Machine Learning Research</source>, <volume>3</volume>(<issue>Jan</issue>), <fpage>993</fpage>–<lpage>1022</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_008">
<mixed-citation publication-type="journal"><string-name><surname>Blumer</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Ehrenfeucht</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Haussler</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Warmuth</surname>, <given-names>M.K.</given-names></string-name> (<year>1987</year>). <article-title>Occam’s razor</article-title>. <source>Information Processing Letters</source>, <volume>24</volume>(<issue>6</issue>), <fpage>377</fpage>–<lpage>380</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_009">
<mixed-citation publication-type="journal"><string-name><surname>Brown</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Mann</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Ryder</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Subbiah</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Kaplan</surname>, <given-names>J.D.</given-names></string-name>, <string-name><surname>Dhariwal</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Neelakantan</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Shyam</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Sastry</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Askell</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Agarwal</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Herbert-Voss</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Krueger</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Henighan</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Child</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Ramesh</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Ziegler</surname>, <given-names>D.M.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Winter</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Hesse</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Sigler</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Litwin</surname>, 
<given-names>M.</given-names></string-name>, <string-name><surname>Gray</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Chess</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Clark</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Berner</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>McCandlish</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Radford</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Sutskever</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Amodei</surname>, <given-names>D.</given-names></string-name> (<year>2020</year>). <article-title>Language models are few-shot learners</article-title>. <source>Advances in Neural Information Processing Systems</source>, <volume>33</volume>, <fpage>1877</fpage>–<lpage>1901</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_010">
<mixed-citation publication-type="journal"><string-name><surname>Camon</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Magrane</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Barrell</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Binns</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Fleischmann</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Kersey</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Mulder</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Oinn</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Maslen</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Cox</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Apweiler</surname>, <given-names>R.</given-names></string-name> (<year>2003</year>). <article-title>The gene ontology annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro</article-title>. <source>Genome Research</source>, <volume>13</volume>(<issue>4</issue>), <fpage>662</fpage>–<lpage>672</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_011">
<mixed-citation publication-type="journal"><string-name><surname>Camon</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Magrane</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Barrell</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Dimmer</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Maslen</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Binns</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Harte</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Lopez</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Apweiler</surname>, <given-names>R.</given-names></string-name> (<year>2004</year>). <article-title>The gene ontology annotation (GOA) database: sharing knowledge in uniprot with gene ontology</article-title>. <source>Nucleic Acids Research</source>, <volume>32</volume>(<issue>Database issue</issue>), <fpage>262</fpage>–<lpage>266</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_012">
<mixed-citation publication-type="journal"><string-name><surname>Davies</surname>, <given-names>D.L.</given-names></string-name>, <string-name><surname>Bouldin</surname>, <given-names>D.W.</given-names></string-name> (<year>1979</year>). <article-title>A cluster separation measure</article-title>. <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, <volume>1</volume>(<issue>2</issue>), <fpage>224</fpage>–<lpage>227</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_013">
<mixed-citation publication-type="book"><string-name><surname>Duda</surname>, <given-names>R.O.</given-names></string-name>, <string-name><surname>Hart</surname>, <given-names>P.E.</given-names></string-name>, <string-name><surname>Stork</surname>, <given-names>D.G.</given-names></string-name> (<year>2000</year>). <source>Pattern Classification</source>. <publisher-name>John Wiley &amp; Sons</publisher-name>, <publisher-loc>New York, NY</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_014">
<mixed-citation publication-type="journal"><string-name><surname>Dunn</surname>, <given-names>J.C.</given-names></string-name> (<year>1974</year>). <article-title>Well-separated clusters and optimal fuzzy partitions</article-title>. <source>Journal of Cybernetics</source>, <volume>4</volume>(<issue>1</issue>), <fpage>95</fpage>–<lpage>104</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_015">
<mixed-citation publication-type="other"><string-name><surname>GeneTestingRegistry</surname></string-name> (2018). OtoGenome Test for Hearing Loss. Retrieved 2017. <uri>https://www.ncbi.nlm.nih.gov/gtr/tests/509148/</uri>. Online: accessed 24 June 2022.</mixed-citation>
</ref>
<ref id="j_infor517_ref_016">
<mixed-citation publication-type="chapter"><string-name><surname>Grasnick</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Perscheid</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Uflacker</surname>, <given-names>M.</given-names></string-name> (<year>2018</year>). <chapter-title>A framework for the automatic combination and evaluation of gene selection methods</chapter-title>. In: <source>International Conference on Practical Applications of Computational Biology &amp; Bioinformatics</source>. <publisher-name>Springer</publisher-name>, pp. <fpage>166</fpage>–<lpage>174</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_017">
<mixed-citation publication-type="journal"><string-name><surname>Hartigan</surname>, <given-names>J.A.</given-names></string-name>, <string-name><surname>Hartigan</surname>, <given-names>P.M.</given-names></string-name> (<year>1985</year>). <article-title>The dip test of unimodality</article-title>. <source>The Annals of Statistics</source>, <volume>13</volume>(<issue>1</issue>), <fpage>70</fpage>–<lpage>84</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_018">
<mixed-citation publication-type="other"><string-name><surname>Hira</surname>, <given-names>Z.M.</given-names></string-name>, <string-name><surname>Gillies</surname>, <given-names>D.F.</given-names></string-name> (2015). A review of feature selection and feature extraction methods applied on microarray data. <italic>Advances in Bioinformatics</italic>, 2015. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1155/2015/198363" xlink:type="simple">https://doi.org/10.1155/2015/198363</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_019">
<mixed-citation publication-type="journal"><string-name><surname>Jin</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Lu</surname>, <given-names>X.</given-names></string-name> (<year>2010</year>). <article-title>Identifying informative subsets of the Gene Ontology with information bottleneck methods</article-title>. <source>Bioinformatics</source>, <volume>26</volume>(<issue>19</issue>), <fpage>2445</fpage>–<lpage>2451</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_020">
<mixed-citation publication-type="journal"><string-name><surname>Jones</surname>, <given-names>K.S.</given-names></string-name> (<year>1972</year>). <article-title>A statistical interpretation of term specificity and its application in retrieval</article-title>. <source>Journal of Documentation</source>, <volume>28</volume>(<issue>1</issue>), <fpage>11</fpage>–<lpage>21</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_021">
<mixed-citation publication-type="chapter"><string-name><surname>Jović</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Brkić</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Bogunović</surname>, <given-names>N.</given-names></string-name> (<year>2015</year>). <chapter-title>A review of feature selection methods with applications</chapter-title>. In: <source>2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)</source>, <conf-loc>Opatija, Croatia</conf-loc>, pp. <fpage>1200</fpage>–<lpage>1205</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/MIPRO.2015.7160458" xlink:type="simple">https://doi.org/10.1109/MIPRO.2015.7160458</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_022">
<mixed-citation publication-type="chapter"><string-name><surname>Kaufman</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Rousseeuw</surname>, <given-names>P.J.</given-names></string-name> (<year>1990</year>). <chapter-title>Partitioning around medoids (program PAM)</chapter-title>. In: <source>Finding Groups in Data: An Introduction to Cluster Analysis</source>, <volume>344</volume>, <fpage>68</fpage>–<lpage>125</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_023">
<mixed-citation publication-type="journal"><string-name><surname>Landauer</surname>, <given-names>T.K.</given-names></string-name>, <string-name><surname>Foltz</surname>, <given-names>P.W.</given-names></string-name>, <string-name><surname>Laham</surname>, <given-names>D.</given-names></string-name> (<year>1998</year>). <article-title>An introduction to latent semantic analysis</article-title>. <source>Discourse Processes</source>, <volume>25</volume>(<issue>2–3</issue>), <fpage>259</fpage>–<lpage>284</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_024">
<mixed-citation publication-type="other"><string-name><surname>Lewkowycz</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Andreassen</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Dohan</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Dyer</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Michalewski</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Ramasesh</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Slone</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Anil</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Schlag</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Gutman-Solo</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Neyshabur</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Gur-Ari</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Misra</surname>, <given-names>V.</given-names></string-name> (2022). Solving quantitative reasoning problems with language models. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.48550/arXiv.2206.14858" xlink:type="simple">https://doi.org/10.48550/arXiv.2206.14858</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_025">
<mixed-citation publication-type="journal"><string-name><surname>Li</surname>, <given-names>C.-Y.</given-names></string-name>, <string-name><surname>Mao</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Wei</surname>, <given-names>L.</given-names></string-name> (<year>2008</year>). <article-title>Genes and (common) pathways underlying drug addiction</article-title>. <source>PLoS Computational Biology</source>, <volume>4</volume>(<issue>1</issue>), <fpage>2</fpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_026">
<mixed-citation publication-type="other"><string-name><surname>Lippmann</surname>, <given-names>C.</given-names></string-name> (2020). <italic>Function-Preserving, Integrative Gene Selection: A Method for Reducing Disease-Related Gene Sets to Their Key Components</italic>. PhD thesis, Philipps-Universität Marburg.</mixed-citation>
</ref>
<ref id="j_infor517_ref_027">
<mixed-citation publication-type="journal"><string-name><surname>Lipscomb</surname>, <given-names>C.E.</given-names></string-name> (<year>2000</year>). <article-title>Medical subject headings (MeSH)</article-title>. <source>Bulletin of the Medical Library Association</source>, <volume>88</volume>(<issue>3</issue>), <fpage>265</fpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_028">
<mixed-citation publication-type="journal"><string-name><surname>López-García</surname>, <given-names>P.A.</given-names></string-name>, <string-name><surname>Argote</surname>, <given-names>D.L.</given-names></string-name>, <string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name> (<year>2020</year>). <article-title>Projection-based classification of chemical groups for provenance analysis of archaeological materials</article-title>. <source>IEEE Access</source>, <volume>8</volume>, <fpage>152439</fpage>–<lpage>152451</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_029">
<mixed-citation publication-type="journal"><string-name><surname>Lötsch</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Ultsch</surname>, <given-names>A.</given-names></string-name> (<year>2020</year>). <article-title>Current projection methods-induced biases at subgroup detection for machine-learning based data-analysis of biomedical data</article-title>. <source>International Journal of Molecular Sciences</source>, <volume>21</volume>(<issue>79</issue>), <fpage>1</fpage>–<lpage>13</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_030">
<mixed-citation publication-type="journal"><string-name><surname>Lötsch</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Doehring</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Mogil</surname>, <given-names>J.S.</given-names></string-name>, <string-name><surname>Arndt</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Geisslinger</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Ultsch</surname>, <given-names>A.</given-names></string-name> (<year>2013</year>). <article-title>Functional genomics of pain in analgesic drug development and therapy</article-title>. <source>Pharmacology &amp; Therapeutics</source>, <volume>139</volume>(<issue>1</issue>), <fpage>60</fpage>–<lpage>70</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_031">
<mixed-citation publication-type="book"><string-name><surname>Manning</surname>, <given-names>C.D.</given-names></string-name>, <string-name><surname>Raghavan</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Schütze</surname>, <given-names>H.</given-names></string-name> (<year>2008</year>). <source>Introduction to Information Retrieval</source>. <publisher-name>Cambridge University Press</publisher-name>, <isbn>0521865719</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1017/CBO9780511809071.007" xlink:type="simple">https://doi.org/10.1017/CBO9780511809071.007</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_032">
<mixed-citation publication-type="journal"><string-name><surname>Mardis</surname>, <given-names>E.R.</given-names></string-name> (<year>2008</year>). <article-title>The impact of next-generation sequencing technology on genetics</article-title>. <source>Trends in Genetics</source>, <volume>24</volume>(<issue>3</issue>), <fpage>133</fpage>–<lpage>141</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_033">
<mixed-citation publication-type="journal"><string-name><surname>Michael</surname>, <given-names>J.R.</given-names></string-name> (<year>1983</year>). <article-title>The stabilized probability plot</article-title>. <source>Biometrika</source>, <volume>70</volume>(<issue>1</issue>), <fpage>11</fpage>–<lpage>17</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_034">
<mixed-citation publication-type="book"><string-name><surname>Murphy</surname>, <given-names>K.P.</given-names></string-name> (<year>2012</year>). <source>Machine Learning: A Probabilistic Perspective</source>. <publisher-name>MIT Press</publisher-name>, <isbn>0262304325</isbn>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_035">
<mixed-citation publication-type="journal"><string-name><surname>Murtagh</surname>, <given-names>F.</given-names></string-name> (<year>2004</year>). <article-title>On ultrametricity, data coding, and computation</article-title>. <source>Journal of Classification</source>, <volume>21</volume>(<issue>2</issue>), <fpage>167</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s00357-004-0015-y" xlink:type="simple">https://doi.org/10.1007/s00357-004-0015-y</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_036">
<mixed-citation publication-type="journal"><string-name><surname>Nash</surname> <suffix>Jr.</suffix>, <given-names>J.F.</given-names></string-name> (<year>1950</year>). <article-title>Equilibrium points in n-person games</article-title>. <source>Proceedings of the National Academy of Sciences</source>, <volume>36</volume>(<issue>1</issue>), <fpage>48</fpage>–<lpage>49</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_037">
<mixed-citation publication-type="chapter"><string-name><surname>Phan</surname>, <given-names>X.-H.</given-names></string-name>, <string-name><surname>Nguyen</surname>, <given-names>L.-M.</given-names></string-name>, <string-name><surname>Horiguchi</surname>, <given-names>S.</given-names></string-name> (<year>2008</year>). <chapter-title>Learning to classify short and sparse text &amp; web with hidden topics from large-scale data collections</chapter-title>. In: <source>Proceedings of the 17th International Conference on World Wide Web</source>, pp. <fpage>91</fpage>–<lpage>100</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_038">
<mixed-citation publication-type="book"><string-name><surname>Rajaraman</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Ullman</surname>, <given-names>J.D.</given-names></string-name> (<year>2011</year>). <source>Mining of Massive Datasets</source>. <publisher-name>Cambridge University Press</publisher-name>, <isbn>1107015359</isbn>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_039">
<mixed-citation publication-type="journal"><string-name><surname>Resnik</surname>, <given-names>P.</given-names></string-name> (<year>1999</year>). <article-title>Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language</article-title>. <source>Journal of Artificial Intelligence Research</source>, <volume>11</volume>, <fpage>95</fpage>–<lpage>130</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_040">
<mixed-citation publication-type="journal"><string-name><surname>Rousseeuw</surname>, <given-names>P.J.</given-names></string-name> (<year>1987</year>). <article-title>Silhouettes: a graphical aid to the interpretation and validation of cluster analysis</article-title>. <source>Journal of Computational and Applied Mathematics</source>, <volume>20</volume>, <fpage>53</fpage>–<lpage>65</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_041">
<mixed-citation publication-type="journal"><string-name><surname>Saeys</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Inza</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Larrañaga</surname>, <given-names>P.</given-names></string-name> (<year>2007</year>). <article-title>A review of feature selection techniques in bioinformatics</article-title>. <source>Bioinformatics</source>, <volume>23</volume>(<issue>19</issue>), <fpage>2507</fpage>–<lpage>2517</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_042">
<mixed-citation publication-type="journal"><string-name><surname>Shepard</surname>, <given-names>R.N.</given-names></string-name> (<year>1980</year>). <article-title>Multidimensional scaling, tree-fitting, and clustering</article-title>. <source>Science</source>, <volume>210</volume>(<issue>4468</issue>), <fpage>390</fpage>–<lpage>398</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_043">
<mixed-citation publication-type="journal"><string-name><surname>Sondka</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Bamford</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Cole</surname>, <given-names>C.G.</given-names></string-name>, <string-name><surname>Ward</surname>, <given-names>S.A.</given-names></string-name>, <string-name><surname>Dunham</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Forbes</surname>, <given-names>S.A.</given-names></string-name> (<year>2018</year>). <article-title>The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers</article-title>. <source>Nature Reviews Cancer</source>, <volume>18</volume>, <fpage>696</fpage>–<lpage>705</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1038/s41568-018-0060-1" xlink:type="simple">https://doi.org/10.1038/s41568-018-0060-1</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_044">
<mixed-citation publication-type="journal"><string-name><surname>Subramanian</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Tamayo</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Mootha</surname>, <given-names>V.K.</given-names></string-name>, <string-name><surname>Mukherjee</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Ebert</surname>, <given-names>B.L.</given-names></string-name>, <string-name><surname>Gillette</surname>, <given-names>M.A.</given-names></string-name>, <string-name><surname>Paulovich</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Pomeroy</surname>, <given-names>S.L.</given-names></string-name>, <string-name><surname>Golub</surname>, <given-names>T.R.</given-names></string-name>, <string-name><surname>Lander</surname>, <given-names>E.S.</given-names></string-name>, <string-name><surname>Mesirov</surname>, <given-names>J.P.</given-names></string-name> (<year>2005</year>). <article-title>Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles</article-title>. <source>Proceedings of the National Academy of Sciences (PNAS)</source>, <volume>102</volume>(<issue>43</issue>), <fpage>15545</fpage>–<lpage>15550</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_045">
<mixed-citation publication-type="journal"><string-name><surname>Tang</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Y.-Q.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>Z.</given-names></string-name> (<year>2007</year>). <article-title>Development of two-stage SVM-RFE gene selection strategy for microarray expression data analysis</article-title>. <source>IEEE/ACM Transactions on Computational Biology and Bioinformatics</source>, <volume>4</volume>(<issue>3</issue>), <fpage>365</fpage>–<lpage>381</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_046">
<mixed-citation publication-type="journal"><string-name><surname>Tarca</surname>, <given-names>A.L.</given-names></string-name>, <string-name><surname>Bhatti</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Romero</surname>, <given-names>R.</given-names></string-name> (<year>2013</year>). <article-title>A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity</article-title>. <source>PLoS One</source>, <volume>8</volume>(<issue>11</issue>), <fpage>79217</fpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_047">
<mixed-citation publication-type="chapter"><string-name><surname>Tasoulis</surname>, <given-names>D.K.</given-names></string-name>, <string-name><surname>Plagianakos</surname>, <given-names>V.P.</given-names></string-name>, <string-name><surname>Vrahatis</surname>, <given-names>M.N.</given-names></string-name> (<year>2006</year>). <chapter-title>Differential evolution algorithms for finding predictive gene subsets in microarray data</chapter-title>. In: <source>IFIP International Conference on Artificial Intelligence Applications and Innovations</source>. <publisher-name>Springer</publisher-name>, pp. <fpage>484</fpage>–<lpage>491</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_048">
<mixed-citation publication-type="journal"><string-name><surname>Taub</surname>, <given-names>F.E.</given-names></string-name>, <string-name><surname>DeLeo</surname>, <given-names>J.M.</given-names></string-name>, <string-name><surname>Thompson</surname>, <given-names>E.B.</given-names></string-name> (<year>1983</year>). <article-title>Sequential comparative hybridizations analyzed by computerized image processing can identify and quantitate regulated RNAs</article-title>. <source>DNA</source>, <volume>2</volume>(<issue>4</issue>), <fpage>309</fpage>–<lpage>327</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_049">
<mixed-citation publication-type="book"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name> (<year>2018</year>). <source>Projection-Based Clustering through Self-Organization and Swarm Intelligence: Combining Cluster Analysis with the Visualization of High-Dimensional Data</source>. <publisher-name>Springer</publisher-name>, <isbn>978-3658205393</isbn>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_050">
<mixed-citation publication-type="chapter"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name> (<year>2020</year>). <chapter-title>Improving the sensitivity of statistical testing for clusterability with mirrored-density plots</chapter-title>. In: <source>Machine Learning Methods in Visualisation for Big Data</source>, pp. <fpage>19</fpage>–<lpage>23</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_051">
<mixed-citation publication-type="journal"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name> (<year>2021</year>a). <article-title>Distance-based clustering challenges for unbiased benchmarking studies</article-title>. <source>Scientific Reports</source>, <volume>11</volume>(<issue>1</issue>), <fpage>1</fpage>–<lpage>12</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_052">
<mixed-citation publication-type="journal"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name> (<year>2021</year>b). <article-title>The exploitation of distance distributions for clustering</article-title>. <source>International Journal of Computational Intelligence and Applications</source>, <volume>20</volume>(<issue>03</issue>), <fpage>2150016</fpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_053">
<mixed-citation publication-type="journal"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name> (<year>2022</year>a). <article-title>Exploiting distance-based structures in data using an explainable AI for stock picking</article-title>. <source>MDPI Information</source>, <volume>13</volume>(<issue>2</issue>), <fpage>51</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.3390/info13020051" xlink:type="simple">https://doi.org/10.3390/info13020051</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_054">
<mixed-citation publication-type="journal"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name> (<year>2022</year>b). <article-title>Identification of explainable structures in data with a human-in-the-loop</article-title>. <source>German Journal of Artificial Intelligence (Künstl. Intell.)</source>, <volume>36</volume>, <fpage>297</fpage>–<lpage>301</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s13218-022-00782-6" xlink:type="simple">https://doi.org/10.1007/s13218-022-00782-6</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_055">
<mixed-citation publication-type="chapter"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name> (<year>2022</year>c). <chapter-title>Knowledge-based identification of homogeneous structures in gene sets</chapter-title>. In: <source>World Conference on Information Systems and Technologies</source>, <publisher-name>Springer</publisher-name>, pp. <fpage>81</fpage>–<lpage>90</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_056">
<mixed-citation publication-type="chapter"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name>, <string-name><surname>Lerch</surname>, <given-names>F.</given-names></string-name> (<year>2016</year>). <chapter-title>Visualization and 3D printing of multivariate data of biomarkers</chapter-title>. In: <source>WSCG 2016 – 24th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision 2016</source>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_057">
<mixed-citation publication-type="journal"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name>, <string-name><surname>Stier</surname>, <given-names>Q.</given-names></string-name> (<year>2021</year>). <article-title>Fundamental clustering algorithms suite</article-title>. <source>SoftwareX</source>, <volume>13</volume>, <fpage>100642</fpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_058">
<mixed-citation publication-type="chapter"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name>, <string-name><surname>Ultsch</surname>, <given-names>A.</given-names></string-name> (<year>2015</year>). <chapter-title>Models of income distributions for knowledge discovery</chapter-title>. In: <source>European Conference on Data Analysis (ECDA)</source>. <publisher-name>University of Essex</publisher-name>, <publisher-loc>Colchester</publisher-loc>, pp. <fpage>136</fpage>–<lpage>137</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.13140/RG.2.1.4463.0244" xlink:type="simple">https://doi.org/10.13140/RG.2.1.4463.0244</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_059">
<mixed-citation publication-type="journal"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name>, <string-name><surname>Ultsch</surname>, <given-names>A.</given-names></string-name> (<year>2020</year>a). <article-title>Uncovering high-dimensional structures of projections from dimensionality reduction methods</article-title>. <source>MethodsX</source>, <volume>7</volume>, <fpage>101093</fpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_060">
<mixed-citation publication-type="journal"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name>, <string-name><surname>Ultsch</surname>, <given-names>A.</given-names></string-name> (<year>2020</year>b). <article-title>Using projection-based clustering to find distance-and density-based clusters in high-dimensional data</article-title>. <source>Journal of Classification</source>, <volume>38</volume>(<issue>2</issue>), <fpage>280</fpage>–<lpage>312</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_061">
<mixed-citation publication-type="journal"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name>, <string-name><surname>Ultsch</surname>, <given-names>A.</given-names></string-name> (<year>2021</year>). <article-title>Swarm intelligence for self-organized clustering</article-title>. <source>Artificial Intelligence</source>, <volume>290</volume>, <fpage>103237</fpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_062">
<mixed-citation publication-type="journal"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name>, <string-name><surname>Gehlert</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Ultsch</surname>, <given-names>A.</given-names></string-name> (<year>2020</year>a). <article-title>Analyzing the fine structure of distributions</article-title>. <source>PLoS One</source>, <volume>15</volume>(<issue>10</issue>), <fpage>1</fpage>–<lpage>66</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1371/journal.pone.0238835" xlink:type="simple">https://doi.org/10.1371/journal.pone.0238835</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_063">
<mixed-citation publication-type="chapter"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name>, <string-name><surname>Pape</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Ultsch</surname>, <given-names>A.</given-names></string-name> (<year>2020</year>b). <chapter-title>Interactive machine learning tool for clustering in visual analytics</chapter-title>. In: <source>7th IEEE International Conference on Data Science and Advanced Analytics</source>, <conf-loc>DSAA 2020, Sydney, NSW, Australia, 2020</conf-loc>, pp. <fpage>479</fpage>–<lpage>487</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/DSAA49011.2020.00062" xlink:type="simple">https://doi.org/10.1109/DSAA49011.2020.00062</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_064">
<mixed-citation publication-type="journal"><string-name><surname>Thrun</surname>, <given-names>M.C.</given-names></string-name>, <string-name><surname>Pape</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Ultsch</surname>, <given-names>A.</given-names></string-name> (<year>2021</year>). <article-title>Conventional displays of structures in data compared with Interactive Projection-Based Clustering (IPBC)</article-title>. <source>International Journal of Data Science and Analytics</source>, <volume>12</volume>(<issue>3</issue>), <fpage>249</fpage>–<lpage>271</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s41060-021-00264-2" xlink:type="simple">https://doi.org/10.1007/s41060-021-00264-2</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_065">
<mixed-citation publication-type="journal"><string-name><surname>Toussaint</surname>, <given-names>G.T.</given-names></string-name> (<year>1980</year>). <article-title>The relative neighbourhood graph of a finite planar set</article-title>. <source>Pattern Recognition</source>, <volume>12</volume>(<issue>4</issue>), <fpage>261</fpage>–<lpage>268</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_066">
<mixed-citation publication-type="journal"><string-name><surname>Ultsch</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Lötsch</surname>, <given-names>J.</given-names></string-name> (<year>2014</year>). <article-title>What do all the (human) micro-RNAs do?</article-title> <source>BMC Genomics</source>, <volume>15</volume>(<issue>1</issue>), <fpage>1</fpage>–<lpage>12</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_067">
<mixed-citation publication-type="journal"><string-name><surname>Ultsch</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Lötsch</surname>, <given-names>J.</given-names></string-name> (<year>2017</year>). <article-title>Machine-learned cluster identification in high-dimensional data</article-title>. <source>Journal of Biomedical Informatics</source>, <volume>66</volume>, <fpage>95</fpage>–<lpage>104</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_068">
<mixed-citation publication-type="journal"><string-name><surname>Ultsch</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Kringel</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Kalso</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Mogil</surname>, <given-names>J.S.</given-names></string-name>, <string-name><surname>Lötsch</surname>, <given-names>J.</given-names></string-name> (<year>2016</year>). <article-title>A data science approach to candidate gene selection of pain regarded as a process of learning and neural plasticity</article-title>. <source>Pain</source>, <volume>157</volume>(<issue>12</issue>), <fpage>2747</fpage>–<lpage>2757</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_069">
<mixed-citation publication-type="book"><string-name><surname>van Rijsbergen</surname>, <given-names>C.J.</given-names></string-name> (<year>1979</year>). <source>Information Retrieval</source>, <publisher-name>Butterworth</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_070">
<mixed-citation publication-type="journal"><string-name><surname>Wei</surname>, <given-names>C.-H.</given-names></string-name>, <string-name><surname>Kao</surname>, <given-names>H.-Y.</given-names></string-name>, <string-name><surname>Lu</surname>, <given-names>Z.</given-names></string-name> (<year>2013</year>). <article-title>PubTator: a web-based text mining tool for assisting biocuration</article-title>. <source>Nucleic Acids Research</source>, <volume>41</volume>(<issue>W1</issue>), <fpage>W518</fpage>–<lpage>W522</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_071">
<mixed-citation publication-type="journal"><string-name><surname>Wilkinson</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Friendly</surname>, <given-names>M.</given-names></string-name> (<year>2009</year>). <article-title>The history of the cluster heat map</article-title>. <source>The American Statistician</source>, <volume>63</volume>(<issue>2</issue>), <fpage>179</fpage>–<lpage>184</lpage>.</mixed-citation>
</ref>
<ref id="j_infor517_ref_072">
<mixed-citation publication-type="journal"><string-name><surname>Wolting</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>McGlade</surname>, <given-names>C.J.</given-names></string-name>, <string-name><surname>Tritchler</surname>, <given-names>D.</given-names></string-name> (<year>2006</year>). <article-title>Cluster analysis of protein array results via similarity of Gene Ontology annotation</article-title>. <source>BMC Bioinformatics</source>, <volume>7</volume>(<issue>1</issue>), <fpage>1</fpage>–<lpage>13</lpage>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
