<?xml version="1.0" encoding="utf-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">INFORMATICA</journal-id>
<journal-title-group><journal-title>Informatica</journal-title></journal-title-group>
<issn pub-type="epub">1822-8844</issn><issn pub-type="ppub">0868-4952</issn><issn-l>0868-4952</issn-l>
<publisher>
<publisher-name>Vilnius University</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">INFO1136</article-id>
<article-id pub-id-type="doi">10.15388/Informatica.2017.122</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Research Article</subject></subj-group></article-categories>
<title-group>
<article-title>ADvaNCE – Efficient and Scalable Approximate Density-Based Clustering Based on Hashing</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Li</surname><given-names>Tianrun</given-names></name><xref ref-type="aff" rid="j_info1136_aff_001">1</xref><bio>
<p><bold>T. Li</bold> is currently pursuing a PhD in databases and data mining at the University of Wisconsin.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Heinis</surname><given-names>Thomas</given-names></name><email xlink:href="t.heinis@imperial.ac.uk">t.heinis@imperial.ac.uk</email><xref ref-type="aff" rid="j_info1136_aff_002">2</xref><xref ref-type="corresp" rid="cor1">∗</xref><bio>
<p><bold>T. Heinis</bold> is currently a lecturer at Imperial College London. The research of his group focuses on the topics around data management in general and scientific databases, high-performance data analytics as well as approximate computing in particular.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Luk</surname><given-names>Wayne</given-names></name><xref ref-type="aff" rid="j_info1136_aff_002">2</xref><bio>
<p><bold>W. Luk</bold> is a full-time professor at Imperial College London. He leads the Custom Computing Research Group at Imperial, focusing on research on reconfigurable hardware. He is a fellow of the Royal Academy of Engineering, IEEE and the British Computer Society. He is also a senior advisor of Maxeler and an Honorary Fellowship Advisor of the Croucher Foundation.</p></bio>
</contrib>
<aff id="j_info1136_aff_001"><label>1</label><institution>Tsinghua University</institution>, <country>China</country></aff>
<aff id="j_info1136_aff_002"><label>2</label><institution>Imperial College London</institution>, <country>United Kingdom</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2017</year></pub-date><pub-date pub-type="epub"><day>1</day><month>1</month><year>2017</year></pub-date><volume>28</volume><issue>1</issue><fpage>105</fpage><lpage>130</lpage><history><date date-type="received"><month>9</month><year>2016</year></date><date date-type="accepted"><month>2</month><year>2017</year></date></history>
<permissions><copyright-statement>© 2017 Vilnius University</copyright-statement><copyright-year>2017</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Analysing massive amounts of data and extracting value from it has become key across different disciplines. As the amounts of data grow rapidly, current approaches for data analysis are no longer efficient. This is particularly true for clustering algorithms where distance calculations between pairs of points dominate overall time: the more data points are in the dataset, the bigger the share of time needed for distance calculations.</p>
<p>Crucially, however, the data analysis and clustering process is rarely straightforward: parameters first need to be determined and tuned. Entirely accurate results are thus rarely needed and we can instead sacrifice a little precision of the final result to accelerate the computation. In this paper we develop ADvaNCE, a new approach based on approximating DBSCAN. More specifically, we propose two measures to reduce the distance calculation overhead and to consequently approximate DBSCAN: (1) locality sensitive hashing to approximate and speed up distance calculations and (2) representative point selection to reduce the number of distance calculations.</p>
<p>The experiments show that the resulting clustering algorithm is more scalable than the state of the art as datasets become bigger. Compared with the most recent approximation technique for DBSCAN, our approach is in general one order of magnitude faster (up to 30× in our experiments) as the size of the datasets increases.</p>
</abstract>
<kwd-group>
<label>Key words</label>
<kwd>analytics</kwd>
<kwd>clustering</kwd>
<kwd>high-dimensional data</kwd>
<kwd>approximate computing</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="j_info1136_s_001">
<label>1</label>
<title>Introduction</title>
<p>Unlocking the value in the masses of data stored and available to us has become a primary concern. Medical data, banking data, shopping data and others are all analysed in great detail to find patterns, to classify behaviour or phenomena and finally to predict behaviour and progression. These analyses are ultimately hoped to predict the behaviour of customers, to increase sales in marketing (Chen <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1136_ref_008">1996</xref>), to optimize diagnostic tools in medicine, to detect disease earlier, to optimize medical treatments for a better outcome (Adaszewski <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1136_ref_001">2013</xref>; Venetis <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1136_ref_017">2015</xref>) and so on.</p>
<p>Generating insightful knowledge from data, however, is not efficient today. While there exists a plethora of tools and algorithms for the analysis and clustering of data, most of them have a considerable complexity, meaning they will not scale well to the massive datasets we see today. They are very efficient on the small datasets we dealt with until recently but as the amounts of data grow rapidly, these algorithms do not scale and fail to provide a fast analysis.</p>
<p>At the same time, data analysis is rarely a straightforward process. All clustering algorithms require the tuning of parameters, e.g. the number of clusters in <italic>k</italic>-means, the maximum distance (<italic>ϵ</italic>) in DBSCAN, etc., and the data preparation process frequently needs to be repeated, for example, to adjust the level of aggregation. Clearly, throughout the process of iteratively improving the analysis, the results need not be exact. Instead, by sacrificing a little precision, we can substantially accelerate each analysis and thus the overall process.</p>
<p>More formally, the problem we address is density-based clustering, i.e. given a set of points <italic>P</italic> in <italic>d</italic>-dimensional space <inline-formula id="j_info1136_ineq_001"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\mathbb{R}^{d}}$]]></tex-math></alternatives></inline-formula> (typically with a very high <italic>d</italic>), the goal is to group points into clusters, i.e. to divide the points into dense areas separated by sparse areas. Existing approaches generally differ in how dense and sparse areas are defined, but they all share the property that distance calculations account for the majority of the execution time. By reducing the number of distance calculations and approximating the remaining ones, we can trade accuracy for efficiency and scalability.</p>
<p>In this paper we propose ADvaNCE (<bold>A</bold>pproximate <bold>D</bold>e<bold>N</bold>sity-Based <bold>C</bold>lust<bold>E</bold>ring) (Li <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1136_ref_014">2016</xref>), a novel approximation approach for DBSCAN, arguably the most prominent density-based clustering approach used today. Our approach reduces the overhead of calculating distances with two measures. First, we use locality sensitive hashing (LSH) to approximate each distance calculation and thus accelerate it. Clearly, this step in itself already introduces approximation. Second, we approximate the dataset by only retaining representative points which summarize its structure. Doing so reduces the number of points and consequently also the number of distance calculations. Like locality sensitive hashing, this measure approximates the result as well.</p>
<p>The resulting approach outperforms classic DBSCAN by orders of magnitude while sacrificing accuracy on the order of only 0.001%. More importantly, ADvaNCE outperforms state-of-the-art approximate DBSCAN approaches by up to 30×. It scales well with increasing dataset size for small <italic>ϵ</italic>’s and consistently outperforms the state of the art.</p>
<p>The remainder of this paper is structured as follows. In Section <xref rid="j_info1136_s_002">2</xref> we discuss in detail DBSCAN along with its known optimizations but also its challenges. We then move on to give an overview of ADvaNCE in Section <xref rid="j_info1136_s_006">3</xref> and discuss in more detail the approximate acceleration based on locality sensitive hashing in Section <xref rid="j_info1136_s_009">4.2</xref> and the approximation based on representative points in Section <xref rid="j_info1136_s_010">5</xref>. To understand the benefits and limitations of ADvaNCE we first analyse it analytically in Section <xref rid="j_info1136_s_011">6</xref> and then compare ADvaNCE’s efficiency to related approaches in Section <xref rid="j_info1136_s_014">7</xref>. We then analyse its performance in detail in Section <xref rid="j_info1136_s_023">8</xref>. We discuss the state-of-the-art of clustering algorithms (approximate and exact) in Section <xref rid="j_info1136_s_026">9</xref> and finally draw conclusions in Section <xref rid="j_info1136_s_029">10</xref>.</p>
</sec>
<sec id="j_info1136_s_002">
<label>2</label>
<title>DBSCAN and Its Challenges</title>
<p>DBSCAN is arguably the most renowned density-based clustering algorithm. We first give an overview of DBSCAN, present a first optimization and then motivate our approach with an experiment.</p>
<sec id="j_info1136_s_003">
<label>2.1</label>
<title>DBSCAN Recap</title>
<p>DBSCAN is a density based clustering algorithm that takes two parameters <italic>ϵ</italic> and <inline-formula id="j_info1136_ineq_002"><alternatives><mml:math>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula> to determine what areas of the datasets are dense/sparse and finally to compute the clustering result. For every input point <italic>p</italic>, if the neighbourhood within radius <italic>ϵ</italic> contains at least <inline-formula id="j_info1136_ineq_003"><alternatives><mml:math>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula> points, we call point <italic>p</italic> a core point.</p><statement id="j_info1136_stat_001"><label>Definition 1.</label>
<p>Point <italic>p</italic> is density reachable from point <italic>q</italic> if there exists a sequence of points <inline-formula id="j_info1136_ineq_004"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{1}},{p_{2}},{p_{3}},\dots ,{p_{n}}$]]></tex-math></alternatives></inline-formula> such that either: 
<list>
<list-item id="j_info1136_li_001">
<label>1.</label>
<p><inline-formula id="j_info1136_ineq_005"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi></mml:math><tex-math><![CDATA[${p_{1}}=p$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1136_ineq_006"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi></mml:math><tex-math><![CDATA[${p_{n}}=q$]]></tex-math></alternatives></inline-formula>;</p>
</list-item>
<list-item id="j_info1136_li_002">
<label>2.</label>
<p><inline-formula id="j_info1136_ineq_007"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{1}},{p_{2}},{p_{3}},\dots ,{p_{n-1}}$]]></tex-math></alternatives></inline-formula> are core points;</p>
</list-item>
<list-item id="j_info1136_li_003">
<label>3.</label>
<p><inline-formula id="j_info1136_ineq_008"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{i+1}}$]]></tex-math></alternatives></inline-formula> is in the neighbourhood of <inline-formula id="j_info1136_ineq_009"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{i}}$]]></tex-math></alternatives></inline-formula> with radius <italic>ϵ</italic>.</p>
</list-item>
</list>
</p></statement>
<p>Points that are not core points themselves but are density-reachable from a core point are called border points. A cluster defined by DBSCAN starts from a single core point and gradually adds all points density-reachable from any point in the cluster; this chain effect lets the cluster grow until it is complete. Points that are neither core points nor border points are called noise points and do not belong to any cluster.</p>
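To make the definitions concrete, the following is a minimal, hypothetical Python sketch (our illustration, not the paper's implementation) that classifies each point as core, border or noise using a naive O(n²) neighbourhood search; `eps` and `min_pts` stand in for <italic>ϵ</italic> and minPts:

```python
import math

def classify(points, eps, min_pts):
    """Naive O(n^2) DBSCAN point classification: core, border or noise."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # A point is a core point if its eps-neighbourhood (including
    # the point itself) holds at least min_pts points.
    neighbours = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, ns in enumerate(neighbours) if len(ns) >= min_pts}

    labels = {}
    for i in range(len(points)):
        if i in core:
            labels[i] = "core"
        elif any(j in core for j in neighbours[i]):
            labels[i] = "border"  # density-reachable from a core point
        else:
            labels[i] = "noise"
    return labels
```

This quadratic scan is exactly the cost that the grid-based optimization in the next section avoids.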
</sec>
<sec id="j_info1136_s_004">
<label>2.2</label>
<title>Grid-Based Optimization</title>
<p>Naive implementations of the DBSCAN algorithm (Ester <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1136_ref_011">1996</xref>) compute distances between all pairs of data points in the dataset in <inline-formula id="j_info1136_ineq_010"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O({n^{2}})$]]></tex-math></alternatives></inline-formula>. A first proposed optimization uses a KD-Tree (Bentley, <xref ref-type="bibr" rid="j_info1136_ref_005">1975</xref>) to reduce the number of distance calculations between data points.</p>
<fig id="j_info1136_fig_001">
<label>Fig. 1</label>
<caption>
<p>Neighbouring cells of <italic>c</italic> are in red.</p>
</caption>
<graphic xlink:href="info1136_g001.jpg"/>
</fig>
<p>At the centre of ADvaNCE is a grid-based optimization (Gunawan, <xref ref-type="bibr" rid="j_info1136_ref_013">2013</xref>) of DBSCAN that further reduces the distance calculations. The grid has the same dimensionality as the points in the dataset. With this grid we can considerably reduce the number of distance calculations: given a point <italic>p</italic>, we only have to search a fixed number of cells around <italic>p</italic> to find other points within distance <italic>ϵ</italic> of <italic>p</italic>. More precisely, given the centre cell <italic>c</italic> in which <italic>p</italic> is located, we only have to search the cells <inline-formula id="j_info1136_ineq_011"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${c^{\prime }}$]]></tex-math></alternatives></inline-formula> with <inline-formula id="j_info1136_ineq_012"><alternatives><mml:math>
<mml:mi mathvariant="italic">dist</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi></mml:math><tex-math><![CDATA[$\mathit{dist}(c,{c^{\prime }})<\epsilon $]]></tex-math></alternatives></inline-formula>, i.e. the neighbouring cells <inline-formula id="j_info1136_ineq_013"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(c)$]]></tex-math></alternatives></inline-formula> (red in Fig. <xref rid="j_info1136_fig_001">1</xref>).</p>
<p>The grid uses uniform spacing of the cells in all dimensions, making assignment of points to cells and related calculations straightforward. With the grid, the algorithm has four distinct phases.</p>
<p><bold>Grid construction:</bold> in the assignment phase the algorithm iterates over all points and maps each point to the grid cell that encloses it. To eventually reduce the number of distance calculations, we choose the grid cell width as <inline-formula id="j_info1136_ineq_014"><alternatives><mml:math><mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">ϵ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mstyle></mml:math><tex-math><![CDATA[$\frac{\epsilon }{\sqrt{D}}$]]></tex-math></alternatives></inline-formula> with <italic>D</italic> as the dimension, guaranteeing that all points within a cell are by definition within distance <italic>ϵ</italic> of each other and thereby form a cluster.</p>
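A hypothetical sketch of this assignment phase (function names are our own): with cell width <italic>ϵ</italic>/√D, mapping a point to its enclosing cell reduces to a per-coordinate floor division.

```python
import math

def cell_of(point, eps):
    """Map a point to its grid cell index; with cell width eps/sqrt(D)
    the cell diagonal is exactly eps."""
    width = eps / math.sqrt(len(point))
    return tuple(math.floor(x / width) for x in point)

def build_grid(points, eps):
    """Assignment phase: group all points by their enclosing cell."""
    grid = {}
    for p in points:
        grid.setdefault(cell_of(p, eps), []).append(p)
    return grid
```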
<p><bold>Determining core points:</bold> the goal of this phase is to determine core points. For this phase the choice of grid cell width used in the first phase is crucial. As mentioned previously, if the grid cell width in the first phase is set to <inline-formula id="j_info1136_ineq_015"><alternatives><mml:math><mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">ϵ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mstyle></mml:math><tex-math><![CDATA[$\frac{\epsilon }{\sqrt{D}}$]]></tex-math></alternatives></inline-formula>, then the diagonal of the cell is <italic>ϵ</italic> and consequently the distance between any two points in the same cell is at most <italic>ϵ</italic>. If a cell <italic>c</italic> contains at least <inline-formula id="j_info1136_ineq_016"><alternatives><mml:math>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula> points, all the points inside cell <italic>c</italic> are consequently core points.</p>
<p><bold>Merge clusters:</bold> in this phase we initially consider each non-empty cell as a small cluster. If two core points in two different cells are within a distance of less than <italic>ϵ</italic> of each other, these two points belong to the same cluster and the two clusters need to be merged.</p>
<p>To compute all clusters, the optimization generates a graph <inline-formula id="j_info1136_ineq_017"><alternatives><mml:math>
<mml:mi mathvariant="italic">G</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">V</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$G=(V,E)$]]></tex-math></alternatives></inline-formula> where <italic>V</italic> represents all non-empty cells. Given two cells <inline-formula id="j_info1136_ineq_018"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${c_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1136_ineq_019"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${c_{2}}$]]></tex-math></alternatives></inline-formula>, <italic>E</italic> contains an edge between <inline-formula id="j_info1136_ineq_020"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${c_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1136_ineq_021"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${c_{2}}$]]></tex-math></alternatives></inline-formula> if there exist core points <inline-formula id="j_info1136_ineq_022"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{1}}$]]></tex-math></alternatives></inline-formula> in <inline-formula id="j_info1136_ineq_023"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${c_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1136_ineq_024"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{2}}$]]></tex-math></alternatives></inline-formula> in <inline-formula id="j_info1136_ineq_025"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${c_{2}}$]]></tex-math></alternatives></inline-formula> with <inline-formula id="j_info1136_ineq_026"><alternatives><mml:math>
<mml:mi mathvariant="italic">dist</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi></mml:math><tex-math><![CDATA[$\mathit{dist}({p_{1}},{p_{2}})<\epsilon $]]></tex-math></alternatives></inline-formula>. The result of the merge phase is consequently the set of connected components of graph <italic>G</italic>, which represent the clusters.</p>
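The merge phase thus amounts to computing connected components of G. A minimal union-find sketch (our illustration; the cell identifiers and the edge list are assumed as given):

```python
def connected_components(cells, edges):
    """Merge phase: union-find over non-empty cells; each resulting
    connected component of G is one cluster."""
    parent = {c: c for c in cells}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    # An edge (c1, c2) exists iff the two cells hold core points
    # p1, p2 with dist(p1, p2) < eps; here edges are taken as input.
    for c1, c2 in edges:
        parent[find(c1)] = find(c2)

    components = {}
    for c in cells:
        components.setdefault(find(c), []).append(c)
    return list(components.values())
```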
<p><bold>Determine border points:</bold> given a non-core point <italic>p</italic> in cell <italic>c</italic> (by definition a cell with fewer than <inline-formula id="j_info1136_ineq_027"><alternatives><mml:math>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula> points), we check all core points in the neighbouring cells <inline-formula id="j_info1136_ineq_028"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(c)$]]></tex-math></alternatives></inline-formula>, and find the core point <italic>q</italic> with the minimum distance to <italic>p</italic>. Then <italic>p</italic> is a border point and belongs to the cluster of core point <italic>q</italic>. If there are no core points in <inline-formula id="j_info1136_ineq_029"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(c)$]]></tex-math></alternatives></inline-formula>, then <italic>p</italic> is a noise point and does not belong to any cluster.</p>
</sec>
<sec id="j_info1136_s_005">
<label>2.3</label>
<title>DBSCAN Challenges</title>
<p>A considerable challenge of DBSCAN (and other clustering approaches) is the number of distance calculations and the time needed for them. Even for the optimized DBSCAN algorithm (Gunawan, <xref ref-type="bibr" rid="j_info1136_ref_013">2013</xref>) the number of distance calculations is substantial. To make matters worse, the number grows rapidly with increasing dataset size.</p>
<p>We demonstrate this problem with a motivating experiment where we increase the size of the dataset from 0.125 million to 1 million points. The datasets used have five dimensions and are randomly generated (Gan and Tao, <xref ref-type="bibr" rid="j_info1136_ref_012">2015</xref>). For DBSCAN we set the parameters to <inline-formula id="j_info1136_ineq_030"><alternatives><mml:math>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>5000</mml:mn></mml:math><tex-math><![CDATA[$\epsilon =5000$]]></tex-math></alternatives></inline-formula> and the minimum number of points to 100 (Gan and Tao, <xref ref-type="bibr" rid="j_info1136_ref_012">2015</xref>). We repeat the experiment for each dataset three times and report the average.</p>
<p>As the results in Fig. <xref rid="j_info1136_fig_002">2</xref> show, the time needed for distance calculations in DBSCAN starts at a few seconds for a very small dataset of 125,000 points, and its share of the total execution time grows rapidly to more than 60% for a dataset of 1 million points. Clearly, the share of the overall time spent on distance calculations is substantial and grows rapidly. As we will show in the remainder of this paper, we speed up DBSCAN considerably and enable it to scale by using approximation in two respects. First, we use approximation to reduce the total number of distance calculations, and second, we use approximation to accelerate the remaining distance calculations. As we will show, only a little precision has to be sacrificed by ADvaNCE to accelerate the clustering process considerably.</p>
<fig id="j_info1136_fig_002">
<label>Fig. 2</label>
<caption>
<p>Share of overall time needed for distance calculations.</p>
</caption>
<graphic xlink:href="info1136_g002.jpg"/>
</fig>
</sec>
</sec>
<sec id="j_info1136_s_006">
<label>3</label>
<title>ADvaNCE Overview</title>
<p>ADvaNCE builds on the key insight that for large multi-dimensional datasets the time needed to find clusters is primarily driven by the number of distance computations. While previous work (Gunawan, <xref ref-type="bibr" rid="j_info1136_ref_013">2013</xref>) has substantially reduced this number, it is still considerable and grows rapidly for increasingly big datasets. By approximating the cluster computation in a grid-based DBSCAN (Gunawan, <xref ref-type="bibr" rid="j_info1136_ref_013">2013</xref>) implementation we considerably accelerate the process.</p>
<p>The first approximation reduces the number of distance calculations or, more precisely, the number of neighbouring cells that need to be checked, thereby speeding up clustering. Using locality sensitive hashing (LSH) (Andoni and Indyk, <xref ref-type="bibr" rid="j_info1136_ref_002">2008</xref>) we can approximate distances between points and efficiently establish whether two points are within distance <italic>ϵ</italic> of each other.</p>
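As an illustration of the idea, here is a sketch of one p-stable random-projection hash family in the style of Andoni and Indyk (2008); the function names and the bucket width `w` are our own assumptions, not the paper's code. Points hashing to the same bucket under several independent functions are likely, but not guaranteed, to be close:

```python
import math
import random

def make_hash(dim, w, seed=0):
    """One p-stable LSH function h(p) = floor((a·p + b) / w), where a is
    a random Gaussian vector, b a random offset and w the bucket width."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)

    def h(p):
        return math.floor((sum(ai * xi for ai, xi in zip(a, p)) + b) / w)

    return h

def signature(p, hashes):
    """Concatenate several independent hashes; equal signatures suggest
    (but do not guarantee) that two points are close relative to w."""
    return tuple(h(p) for h in hashes)
```

Because the hash is approximate in both directions, a clustering built on it can miss or add merges, which is precisely the accuracy that ADvaNCE trades for speed.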
<p>A second approximation reduces the points considered in each cell. To decide whether or not two grid cells, each containing more than <inline-formula id="j_info1136_ineq_031"><alternatives><mml:math>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula> points and thus containing only core points, should be joined, not all points need to be considered. To obtain an approximate result it is sufficient to primarily take into account points close to the border of the cell: if the two cells each contain a point within distance <italic>ϵ</italic> of the other, no further points need to be tested.</p>
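A hypothetical sketch of this selection (the `margin` parameter and all names are our own illustration, not the paper's strategy): per cell, retain only points lying within some margin of the cell boundary, since interior points are unlikely to decide whether two cells merge.

```python
def representatives(cell_points, cell_min, width, margin):
    """Keep only the points within `margin` of the cell boundary; only
    those are plausibly within eps of a point in a neighbouring cell."""
    reps = []
    for p in cell_points:
        near_border = any(
            (x - lo) < margin or (lo + width - x) < margin
            for x, lo in zip(p, cell_min)
        )
        if near_border:
            reps.append(p)
    return reps
```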
<p>In particular, the first optimization approximately and efficiently merges cells containing at least <inline-formula id="j_info1136_ineq_032"><alternatives><mml:math>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula> points (which are consequently all core points). It does, however, only join a subset of neighbouring cells at a time, so several iterations are needed to join all neighbouring cells of core points. ADvaNCE is consequently run iteratively until the result converges, i.e. until it no longer changes between two iterations (the convergence criterion is set as a parameter).</p>
</sec>
<sec id="j_info1136_s_007">
<label>4</label>
<title>ADvaNCE: Approximate Neighbourhood</title>
<p>While the basic grid-based algorithm is very efficient in two dimensions (<inline-formula id="j_info1136_ineq_033"><alternatives><mml:math>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$D=2$]]></tex-math></alternatives></inline-formula>) when the width of each cell is set to <inline-formula id="j_info1136_ineq_034"><alternatives><mml:math><mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">ϵ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mstyle></mml:math><tex-math><![CDATA[$\frac{\epsilon }{\sqrt{D}}$]]></tex-math></alternatives></inline-formula> (so that the diagonal of each cell is <italic>ϵ</italic>), in higher dimensions the number of cells in <inline-formula id="j_info1136_ineq_035"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(c)$]]></tex-math></alternatives></inline-formula> grows considerably. For example, in five dimensions the cell width becomes very small at <inline-formula id="j_info1136_ineq_036"><alternatives><mml:math>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msqrt></mml:math><tex-math><![CDATA[$\epsilon /\sqrt{5}$]]></tex-math></alternatives></inline-formula>, meaning that three cells need to be searched in each direction along every dimension and <inline-formula id="j_info1136_ineq_037"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(c)$]]></tex-math></alternatives></inline-formula> thus contains <inline-formula id="j_info1136_ineq_038"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>7</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${7^{5}}$]]></tex-math></alternatives></inline-formula> cells. Merging cells into clusters will be very time consuming as all of <inline-formula id="j_info1136_ineq_039"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(c)$]]></tex-math></alternatives></inline-formula> (<inline-formula id="j_info1136_ineq_040"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>7</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${7^{5}}$]]></tex-math></alternatives></inline-formula> cells in five dimensions) has to be searched for every core point.</p>
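<p>The growth above can be checked with a short calculation. As a rough sketch (an upper bound that ignores corner cells an exact implementation could prune), with cell width <italic>ϵ</italic>/√<italic>D</italic> the number of cells to search per point is (2⌈√<italic>D</italic>⌉ + 1)<sup>D</sup>:</p>

```python
import math

def neighbour_cell_count(d):
    # Cells up to ceil(sqrt(d)) steps away in each direction may hold
    # points within eps of the centre cell when the cell width is eps/sqrt(d).
    per_direction = math.ceil(math.sqrt(d))
    return (2 * per_direction + 1) ** d

print(neighbour_cell_count(2))  # 25 cells in two dimensions
print(neighbour_cell_count(5))  # 16807 = 7^5 cells in five dimensions
```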
<p>In the following we first discuss the basic concept of approximating the neighbourhood and then present the design of the algorithm, which (a) approximates the neighbourhood and (b) clusters based on this approximation.</p>
<sec id="j_info1136_s_008">
<label>4.1</label>
<title>Using Locality Sensitive Hashing to Approximate</title>
<p>As discussed previously, searching in the neighbouring cells <inline-formula id="j_info1136_ineq_041"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(c)$]]></tex-math></alternatives></inline-formula> is crucial to the execution of grid-based DBSCAN (Gunawan, <xref ref-type="bibr" rid="j_info1136_ref_013">2013</xref>) as it is an integral part of every step (except when constructing the grid). The number of cells in <inline-formula id="j_info1136_ineq_042"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(c)$]]></tex-math></alternatives></inline-formula>, however, grows exponentially with the number of dimensions. The basic idea of our approximate algorithm is to find an approximate neighbour for each core point using locality sensitive hashing instead of searching the potentially large number of cells in <inline-formula id="j_info1136_ineq_043"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(c)$]]></tex-math></alternatives></inline-formula>. We then use these approximate neighbours to complete DBSCAN with controllable precision.</p>
<p>In the context of ADvaNCE we use locality sensitive hashing (LSH, Andoni and Indyk (<xref ref-type="bibr" rid="j_info1136_ref_002">2008</xref>)) to approximate and reduce the neighbourhood that needs to be considered. The LSH functions <italic>H</italic> we use are random hyperplane projection functions. More precisely, given two points <inline-formula id="j_info1136_ineq_044"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi></mml:math><tex-math><![CDATA[$p,q$]]></tex-math></alternatives></inline-formula> in dimension <inline-formula id="j_info1136_ineq_045"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${D_{1}}$]]></tex-math></alternatives></inline-formula> (the dimension of the input data) we use <italic>H</italic>, a set of random hyperplane projection functions, to project <italic>p</italic> and <italic>q</italic> into dimension <inline-formula id="j_info1136_ineq_046"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${D_{2}}$]]></tex-math></alternatives></inline-formula>, thereby approximating the distance between them. With this, we can assert the following about the approximate distance between <italic>p</italic> and <italic>q</italic>:</p>
<list>
<list-item id="j_info1136_li_004">
<label>1.</label>
<p>If <inline-formula id="j_info1136_ineq_047"><alternatives><mml:math>
<mml:mi mathvariant="italic">dist</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi></mml:math><tex-math><![CDATA[$\mathit{dist}(p,q)<\epsilon $]]></tex-math></alternatives></inline-formula>, then for every function <inline-formula id="j_info1136_ineq_048"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">H</mml:mi></mml:math><tex-math><![CDATA[${H_{i}}\in H$]]></tex-math></alternatives></inline-formula>, we have <inline-formula id="j_info1136_ineq_049"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi></mml:math><tex-math><![CDATA[${H_{i}}(p)-{H_{i}}(q)<\epsilon $]]></tex-math></alternatives></inline-formula>;</p>
</list-item>
<list-item id="j_info1136_li_005">
<label>2.</label>
<p>If <inline-formula id="j_info1136_ineq_050"><alternatives><mml:math>
<mml:mi mathvariant="italic">dist</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi></mml:math><tex-math><![CDATA[$\mathit{dist}(p,q)>\epsilon $]]></tex-math></alternatives></inline-formula>, then for every function <inline-formula id="j_info1136_ineq_051"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">H</mml:mi></mml:math><tex-math><![CDATA[${H_{i}}\in H$]]></tex-math></alternatives></inline-formula>, there is a considerable chance that <inline-formula id="j_info1136_ineq_052"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi></mml:math><tex-math><![CDATA[${H_{i}}(p)-{H_{i}}(q)>\epsilon $]]></tex-math></alternatives></inline-formula> and there is a small possibility that <inline-formula id="j_info1136_ineq_053"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi></mml:math><tex-math><![CDATA[${H_{i}}(p)-{H_{i}}(q)<\epsilon $]]></tex-math></alternatives></inline-formula>.</p>
</list-item>
</list>
<p>With this we consequently can construct a new grid <inline-formula id="j_info1136_ineq_054"><alternatives><mml:math>
<mml:mi mathvariant="italic">NG</mml:mi></mml:math><tex-math><![CDATA[$\mathit{NG}$]]></tex-math></alternatives></inline-formula> in <inline-formula id="j_info1136_ineq_055"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${D_{2}}$]]></tex-math></alternatives></inline-formula> with <inline-formula id="j_info1136_ineq_056"><alternatives><mml:math>
<mml:mi mathvariant="italic">cellWidth</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi></mml:math><tex-math><![CDATA[$\mathit{cellWidth}=\epsilon $]]></tex-math></alternatives></inline-formula> and assign each point to the corresponding cell. Given a point <italic>p</italic>, we define the cell <italic>c</italic> in the new grid <inline-formula id="j_info1136_ineq_057"><alternatives><mml:math>
<mml:mi mathvariant="italic">NG</mml:mi></mml:math><tex-math><![CDATA[$\mathit{NG}$]]></tex-math></alternatives></inline-formula> that contains <italic>p</italic> as the approximate neighbour of <italic>p</italic>, denoted by <inline-formula id="j_info1136_ineq_058"><alternatives><mml:math>
<mml:mi mathvariant="italic">AN</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{AN}(p)$]]></tex-math></alternatives></inline-formula>; all points in <italic>c</italic> then serve as the approximate neighbours of <italic>p</italic>. The algorithm is shown in pseudocode in Algorithm <xref rid="j_info1136_fig_003">1</xref>.</p>
<fig id="j_info1136_fig_003">
<label>Algorithm 1:</label>
<caption>
<p>HashAndAssign.</p>
</caption>
<graphic xlink:href="info1136_g003.jpg"/>
</fig>
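<p>A minimal sketch of the idea behind <italic>HashAndAssign</italic> (function and variable names here are ours, not from the pseudocode): project the points with random unit-norm hyperplane normals and bucket the projections into a grid with cell width <italic>ϵ</italic>. Each point's bucket plays the role of its approximate neighbourhood <italic>AN</italic>(<italic>p</italic>).</p>

```python
import numpy as np

def hash_and_assign(points, eps, d2=8, rng=None):
    # Project D1-dimensional points onto d2 random unit-norm hyperplane
    # normals; for a unit normal u, |(p - q) . u| never exceeds ||p - q||,
    # mirroring property 1 of the hash functions H.
    rng = np.random.default_rng() if rng is None else rng
    planes = rng.normal(size=(points.shape[1], d2))
    planes /= np.linalg.norm(planes, axis=0)
    projected = points @ planes
    # Grid NG with cellWidth = eps: bucket each point by its cell index.
    cells = np.floor(projected / eps).astype(int)
    ng = {}
    for idx, key in enumerate(map(tuple, cells)):
        ng.setdefault(key, []).append(idx)
    return ng, planes
```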
<p>Given our approach to approximation, it is possible that points that are within the <italic>ϵ</italic> neighbour sphere <inline-formula id="j_info1136_ineq_059"><alternatives><mml:math>
<mml:mi mathvariant="italic">B</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$B(p,\epsilon )$]]></tex-math></alternatives></inline-formula> cannot be found in a neighbouring cell in <inline-formula id="j_info1136_ineq_060"><alternatives><mml:math>
<mml:mi mathvariant="italic">NG</mml:mi></mml:math><tex-math><![CDATA[$\mathit{NG}$]]></tex-math></alternatives></inline-formula> due to the approximation. In the grid <inline-formula id="j_info1136_ineq_061"><alternatives><mml:math>
<mml:mi mathvariant="italic">NG</mml:mi></mml:math><tex-math><![CDATA[$\mathit{NG}$]]></tex-math></alternatives></inline-formula> in <inline-formula id="j_info1136_ineq_062"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${D_{2}}$]]></tex-math></alternatives></inline-formula> space <inline-formula id="j_info1136_ineq_063"><alternatives><mml:math>
<mml:mi mathvariant="italic">B</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$B(p,\epsilon )$]]></tex-math></alternatives></inline-formula> may be split across several cells. We address this issue by iterating and accumulating the hashing results to obtain a good approximation of the true <italic>ϵ</italic> neighbours.</p>
</sec>
<sec id="j_info1136_s_009">
<label>4.2</label>
<title>ADvaNCE-LSH – Approximate, LSH-Based DBSCAN</title>
<p>Using the approximate neighbours in <inline-formula id="j_info1136_ineq_064"><alternatives><mml:math>
<mml:mi mathvariant="italic">NG</mml:mi></mml:math><tex-math><![CDATA[$\mathit{NG}$]]></tex-math></alternatives></inline-formula> generated with locality sensitive hashing, we propose ADvaNCE-LSH, our approximate grid-based DBSCAN algorithm based on LSH as described in Andoni and Indyk (<xref ref-type="bibr" rid="j_info1136_ref_002">2008</xref>). The hash functions we use follow a <italic>p</italic>-stable distribution (Andoni and Indyk, <xref ref-type="bibr" rid="j_info1136_ref_002">2008</xref>). Independent of the dimensionality of the input data, we project into 8-dimensional space. An overview in pseudocode is given in Algorithm <xref rid="j_info1136_fig_004">2</xref>.</p>
<fig id="j_info1136_fig_004">
<label>Algorithm 2:</label>
<caption>
<p>ADvaNCE-LSH – approximate, LSH-based DBSCAN.</p>
</caption>
<graphic xlink:href="info1136_g004.jpg"/>
</fig>
<p>The <inline-formula id="j_info1136_ineq_065"><alternatives><mml:math>
<mml:mi mathvariant="italic">Construct</mml:mi></mml:math><tex-math><![CDATA[$\mathit{Construct}$]]></tex-math></alternatives></inline-formula> <inline-formula id="j_info1136_ineq_066"><alternatives><mml:math>
<mml:mi mathvariant="italic">Grid</mml:mi></mml:math><tex-math><![CDATA[$\mathit{Grid}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1136_ineq_067"><alternatives><mml:math>
<mml:mi mathvariant="italic">Determine</mml:mi></mml:math><tex-math><![CDATA[$\mathit{Determine}$]]></tex-math></alternatives></inline-formula> <inline-formula id="j_info1136_ineq_068"><alternatives><mml:math>
<mml:mi mathvariant="italic">Core</mml:mi></mml:math><tex-math><![CDATA[$\mathit{Core}$]]></tex-math></alternatives></inline-formula> <inline-formula id="j_info1136_ineq_069"><alternatives><mml:math>
<mml:mi mathvariant="italic">Points</mml:mi></mml:math><tex-math><![CDATA[$\mathit{Points}$]]></tex-math></alternatives></inline-formula> functions are very similar to the functions explained for the basic grid-based DBSCAN implementation. In this version we only mark the points as core points when the cell contains more than <inline-formula id="j_info1136_ineq_070"><alternatives><mml:math>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula> points and leave it to <italic>Determine Core Points Hash</italic> to determine which of the remaining points are core points.</p>
<p>The condition for the while loop in Algorithm <xref rid="j_info1136_fig_004">2</xref> can be set to a fixed number of iterations. Alternatively, and to control the precision of the result better, we can use a heuristic termination condition that counts the number of core points or the number of clusters and stops when the accuracy no longer improves, i.e. when there is no more change between two iterations.</p>
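<p>The heuristic termination condition can be sketched as follows (a hypothetical helper, not the paper's code): the caller supplies a callable that runs one iteration and reports the current number of clusters or core points, and iteration stops as soon as the count is unchanged between two successive iterations, with a fixed iteration cap as a fallback.</p>

```python
def run_until_converged(step, max_iter=50):
    # `step` runs one approximate clustering iteration and returns the
    # current number of clusters (or core points).
    prev = None
    for i in range(max_iter):
        count = step()
        if count == prev:
            return i + 1  # converged: no change between two iterations
        prev = count
    return max_iter
```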
<p>The functions <italic>Determine Core Points Hash</italic> and <italic>Determine Border Points Hash</italic> are the approximate versions of <italic>Determine Core Points</italic> and <italic>Determine Border Points</italic> described previously. More precisely, in <italic>Determine Core Points Hash</italic>, for every non-core point <italic>p</italic>, we count the number of points in <inline-formula id="j_info1136_ineq_071"><alternatives><mml:math>
<mml:mi mathvariant="italic">AN</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{AN}(p)$]]></tex-math></alternatives></inline-formula> within <italic>ϵ</italic> of point <italic>p</italic>. Although <inline-formula id="j_info1136_ineq_072"><alternatives><mml:math>
<mml:mi mathvariant="italic">AN</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{AN}(p)$]]></tex-math></alternatives></inline-formula> is not equal to <inline-formula id="j_info1136_ineq_073"><alternatives><mml:math>
<mml:mi mathvariant="italic">B</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$B(p,\epsilon )$]]></tex-math></alternatives></inline-formula>, we can converge to it through iterations. The accumulation of <inline-formula id="j_info1136_ineq_074"><alternatives><mml:math>
<mml:mi mathvariant="italic">AN</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{AN}(p)$]]></tex-math></alternatives></inline-formula> in successive iterations will converge to <inline-formula id="j_info1136_ineq_075"><alternatives><mml:math>
<mml:mi mathvariant="italic">B</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$B(p,\epsilon )$]]></tex-math></alternatives></inline-formula> and we can gradually compute the full result. In <italic>Determine Border Points Hash</italic> we find the nearest core point for every non-core point <italic>p</italic> in <inline-formula id="j_info1136_ineq_076"><alternatives><mml:math>
<mml:mi mathvariant="italic">AN</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{AN}(p)$]]></tex-math></alternatives></inline-formula> generated by LSH. This step can also be repeated several times to improve the accuracy.</p>
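<p>The following sketch illustrates <italic>Determine Core Points Hash</italic> under our own (hypothetical) data layout: for each point we scan only its cell in <italic>NG</italic>, i.e. <italic>AN</italic>(<italic>p</italic>), and accumulate the true <italic>ϵ</italic>-neighbours found so far, so that repeated hashing lets the accumulated set converge to <italic>B</italic>(<italic>p</italic>, <italic>ϵ</italic>).</p>

```python
import numpy as np
from collections import defaultdict

def determine_core_points_hash(points, ng, eps, min_pts, neigh):
    # `neigh` maps point index to the set of eps-neighbours accumulated
    # over all hashing iterations so far (pass a defaultdict(set)).
    core = set()
    for members in ng.values():
        arr = points[members]
        for pos, i in enumerate(members):
            dists = np.linalg.norm(arr - arr[pos], axis=1)
            for j, d in zip(members, dists):
                if j != i and eps > d:
                    neigh[i].add(j)  # accumulated AN(p) converges to B(p, eps)
            if len(neigh[i]) + 1 >= min_pts:  # the point itself counts too
                core.add(i)
    return core
```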
<p>The function <italic>Merge Clusters Hash</italic> is the approximate version of <italic>Merge Clusters</italic> discussed previously. If two core points <italic>p</italic> and <italic>q</italic> are in the same cell in <inline-formula id="j_info1136_ineq_077"><alternatives><mml:math>
<mml:mi mathvariant="italic">NG</mml:mi></mml:math><tex-math><![CDATA[$\mathit{NG}$]]></tex-math></alternatives></inline-formula> and their distance is less than <italic>ϵ</italic>, the two small clusters that contain <italic>p</italic> and <italic>q</italic> are merged. The algorithm is illustrated in Algorithm <xref rid="j_info1136_fig_005">3</xref>.</p>
<fig id="j_info1136_fig_005">
<label>Algorithm 3:</label>
<caption>
<p>Approximate cluster merging.</p>
</caption>
<graphic xlink:href="info1136_g005.jpg"/>
</fig>
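<p>A sketch of the merging step (union-find stands in for the paper's cluster bookkeeping; names are ours): two core points that share a cell of <italic>NG</italic> and are truly within <italic>ϵ</italic> pull their clusters together, and the break merges gradually rather than all at once.</p>

```python
import numpy as np

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def merge_clusters_hash(points, ng, core, eps, uf):
    # Merge the clusters of core points that share a cell in NG and whose
    # true distance is below eps; break after the first merge per point so
    # clusters are joined gradually over several iterations.
    for members in ng.values():
        cores = [i for i in members if i in core]
        for a in cores:
            for b in cores:
                if b > a and eps > np.linalg.norm(points[a] - points[b]):
                    uf.union(a, b)
                    break
    return uf
```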
<p>In Algorithm <xref rid="j_info1136_fig_005">3</xref> we merge the clusters in <inline-formula id="j_info1136_ineq_078"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${D_{1}}$]]></tex-math></alternatives></inline-formula> using the hashing result in <inline-formula id="j_info1136_ineq_079"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${D_{2}}$]]></tex-math></alternatives></inline-formula> space. Crucially, we add a break in the for-loop because we do not merge all possible clusters in one go but instead use a number of iterations to gradually merge the clusters and terminate when we meet the condition of the while loop in the main algorithm (Algorithm <xref rid="j_info1136_fig_004">2</xref>).</p>
</sec>
</sec>
<sec id="j_info1136_s_010">
<label>5</label>
<title>ADvaNCE: Representative Points Approximation</title>
<p>Given the basic approximation discussed before, we can further accelerate ADvaNCE-LSH, our grid-based DBSCAN featuring locality sensitive hashing. In particular, for very dense datasets the number of points in a cell that need to be compared to points in neighbouring cells is considerable, leading to substantial overhead. This creates a bottleneck in the main iteration described previously, as we have to iterate through all points in a cell and determine their relationship to points in neighbouring cells.</p>
<p>To address this issue we have designed a further optimization. Given <inline-formula id="j_info1136_ineq_080"><alternatives><mml:math>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula>, the minimum number of points required to form a cluster, we set the maximum number of points in each cell to be <inline-formula id="j_info1136_ineq_081"><alternatives><mml:math>
<mml:mi mathvariant="italic">maxNum</mml:mi></mml:math><tex-math><![CDATA[$\mathit{maxNum}$]]></tex-math></alternatives></inline-formula> where <inline-formula id="j_info1136_ineq_082"><alternatives><mml:math>
<mml:mi mathvariant="italic">maxNum</mml:mi>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{maxNum}>\mathit{minPts}$]]></tex-math></alternatives></inline-formula> and ignore the other points in the main iteration.</p>
<p>For every cell <italic>c</italic> in the original grid <italic>G</italic> there are three cases:</p>
<list>
<list-item id="j_info1136_li_006">
<label>1.</label>
<p>If <inline-formula id="j_info1136_ineq_083"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula>, then we cannot (yet) determine whether the points in <italic>c</italic> are core points and <italic>c</italic> is thus not affected by this approximation;</p>
</list-item>
<list-item id="j_info1136_li_007">
<label>2.</label>
<p>If <inline-formula id="j_info1136_ineq_084"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>⩾</mml:mo>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$|c|\geqslant \mathit{minPts}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1136_ineq_085"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>⩽</mml:mo>
<mml:mi mathvariant="italic">maxNum</mml:mi></mml:math><tex-math><![CDATA[$|c|\leqslant \mathit{maxNum}$]]></tex-math></alternatives></inline-formula>, then all points in <italic>c</italic> are core points and this cell will also not be affected by this approximation;</p>
</list-item>
<list-item id="j_info1136_li_008">
<label>3.</label>
<p>If <inline-formula id="j_info1136_ineq_086"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:mi mathvariant="italic">maxNum</mml:mi></mml:math><tex-math><![CDATA[$|c|>\mathit{maxNum}$]]></tex-math></alternatives></inline-formula>, then it follows that <inline-formula id="j_info1136_ineq_087"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$|c|>\mathit{minPts}$]]></tex-math></alternatives></inline-formula> and all points in <italic>c</italic> are core points. All points in <italic>c</italic> belong to the same cluster so we only need to identify which cluster this cell <italic>c</italic> belongs to during the main iteration. Although we ignore <inline-formula id="j_info1136_ineq_088"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">maxNum</mml:mi></mml:math><tex-math><![CDATA[$|c|-\mathit{maxNum}$]]></tex-math></alternatives></inline-formula> points we can still determine which cluster they belong to.</p>
</list-item>
</list>
<p>Key to the approach is to choose <inline-formula id="j_info1136_ineq_089"><alternatives><mml:math>
<mml:mi mathvariant="italic">maxNum</mml:mi></mml:math><tex-math><![CDATA[$\mathit{maxNum}$]]></tex-math></alternatives></inline-formula> points such that we do not lose information (or at least can limit the information lost).</p>
<p>Using Algorithm <xref rid="j_info1136_fig_006">4</xref> we select up to <inline-formula id="j_info1136_ineq_090"><alternatives><mml:math>
<mml:mi mathvariant="italic">maxNum</mml:mi></mml:math><tex-math><![CDATA[$\mathit{maxNum}$]]></tex-math></alternatives></inline-formula> points near the border of each cell; in the <italic>Merge Clusters Hash</italic> function we then calculate distances only between these points to determine the relationship between different cells. The points on the border of each cell are the most useful in determining which cells should be merged, while the points near the centre of the cell contribute little and can potentially be ignored. Consequently, Algorithm <xref rid="j_info1136_fig_006">4</xref> first calculates the geometric centre of all points and takes the <inline-formula id="j_info1136_ineq_091"><alternatives><mml:math>
<mml:mi mathvariant="italic">maxNum</mml:mi></mml:math><tex-math><![CDATA[$\mathit{maxNum}$]]></tex-math></alternatives></inline-formula> points furthest away from the geometric centre.</p>
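<p>The selection of representative points can be sketched as follows (our own minimal version of the idea, with hypothetical names): compute the geometric centre of the points in a cell and keep only the <italic>maxNum</italic> points furthest from it.</p>

```python
import numpy as np

def reduce_cell_points(cell_points, max_num):
    # Keep only the max_num points furthest from the cell's geometric
    # centre: border points decide merges, centre points add little.
    if max_num >= len(cell_points):
        return cell_points
    centre = cell_points.mean(axis=0)
    dist = np.linalg.norm(cell_points - centre, axis=1)
    keep = np.argsort(dist)[::-1][:max_num]  # furthest first
    return cell_points[keep]
```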
<fig id="j_info1136_fig_006">
<label>Algorithm 4:</label>
<caption>
<p>Reducing points in a cell.</p>
</caption>
<graphic xlink:href="info1136_g006.jpg"/>
</fig>
<fig id="j_info1136_fig_007">
<label>Fig. 3</label>
<caption>
<p>Neighbouring cells of <italic>c</italic> in red.</p>
</caption>
<graphic xlink:href="info1136_g007.jpg"/>
</fig>
<p>To illustrate the algorithm we randomly generate 400 points in 4 cells and set the <inline-formula id="j_info1136_ineq_092"><alternatives><mml:math>
<mml:mi mathvariant="italic">maxNum</mml:mi></mml:math><tex-math><![CDATA[$\mathit{maxNum}$]]></tex-math></alternatives></inline-formula> in each cell to 50, obtaining the result in Fig. <xref rid="j_info1136_fig_007">3</xref>: the small red points in the centre of each cell are ignored by the <italic>Merge Clusters Hash</italic> algorithm, while the large blue points near the border are still considered.</p>
</sec>
<sec id="j_info1136_s_011">
<label>6</label>
<title>ADvaNCE: Analytical Analysis</title>
<p>To use ADvaNCE it is crucial to appreciate its benefits and to understand its limitations. In the following we therefore first analyse the accuracy of the algorithm and then reason about its time complexity.</p>
<sec id="j_info1136_s_012">
<label>6.1</label>
<title>Algorithm Accuracy</title>
<p>As ADvaNCE only approximates the DBSCAN result, it is of course crucial to analyse its accuracy. In the functions <italic>Determine CorePoints Hash</italic> and <italic>Merge Clusters Hash</italic> we use the approximate neighbours <inline-formula id="j_info1136_ineq_093"><alternatives><mml:math>
<mml:mi mathvariant="italic">AN</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{AN}(p)$]]></tex-math></alternatives></inline-formula> based on locality sensitive hashing instead of the actual <italic>ϵ</italic> neighbours <inline-formula id="j_info1136_ineq_094"><alternatives><mml:math>
<mml:mi mathvariant="italic">B</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$B(p,\epsilon )$]]></tex-math></alternatives></inline-formula>. This may lead to a decreased accuracy in the following scenarios:</p>
<list>
<list-item id="j_info1136_li_009">
<label>1.</label>
<p>Given a point <italic>p</italic>, the number of points in <inline-formula id="j_info1136_ineq_095"><alternatives><mml:math>
<mml:mi mathvariant="italic">AN</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{AN}(p)$]]></tex-math></alternatives></inline-formula> is less than <inline-formula id="j_info1136_ineq_096"><alternatives><mml:math>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula>, but <inline-formula id="j_info1136_ineq_097"><alternatives><mml:math>
<mml:mi mathvariant="italic">B</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$B(p,\epsilon )$]]></tex-math></alternatives></inline-formula> has more than <inline-formula id="j_info1136_ineq_098"><alternatives><mml:math>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula> points. A core point <italic>p</italic> may therefore not be identified as a core point in the current iteration;</p>
</list-item>
<list-item id="j_info1136_li_010">
<label>2.</label>
<p>Given cells <inline-formula id="j_info1136_ineq_099"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${c_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1136_ineq_100"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${c_{2}}$]]></tex-math></alternatives></inline-formula>, core points <inline-formula id="j_info1136_ineq_101"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∈</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{1}}\in {c_{1}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1136_ineq_102"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∈</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{2}}\in {c_{2}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1136_ineq_103"><alternatives><mml:math>
<mml:mi mathvariant="italic">dist</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi></mml:math><tex-math><![CDATA[$\mathit{dist}({p_{1}},{p_{2}})<\epsilon $]]></tex-math></alternatives></inline-formula>, but <inline-formula id="j_info1136_ineq_104"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1136_ineq_105"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{2}}$]]></tex-math></alternatives></inline-formula> are not in the same bucket after locality sensitive hashing. The two cells, although they should be merged, may therefore not be merged in the current iteration.</p>
</list-item>
</list>
<p>In both scenarios, the merging of clusters or the identification of core points may not occur. Crucial to ADvaNCE, however, is that it merges clusters over multiple iterations. As the number of iterations grows, ADvaNCE gradually determines more core points and merges more of the clusters that need to be merged. The impact of these two scenarios on the accuracy consequently decreases monotonically, and the algorithm converges depending on the stopping criterion, which is either a fixed number of iterations or no change (in clusters or core points) in the last iteration.</p>
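<p>The iteration scheme and its two stopping criteria can be sketched as a simple driver loop; <monospace>run_one_iteration</monospace> is a hypothetical stand-in for one hashing-and-merging pass, and <monospace>state</monospace> for the current set of clusters and core points.</p>

```python
def iterate_until_converged(state, run_one_iteration, max_iterations=20):
    """Repeat hashing-and-merging passes until a stop criterion fires:
    either a fixed number of iterations, or no change in clusters or
    core points between two consecutive iterations."""
    for i in range(max_iterations):
        new_state = run_one_iteration(state)
        if new_state == state:          # nothing merged, no new core points
            return state, i + 1         # converged early
        state = new_state
    return state, max_iterations        # fixed-iteration stop
```
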
<p>Also note that we still perform distance calculations for points in <inline-formula id="j_info1136_ineq_106"><alternatives><mml:math>
<mml:mi mathvariant="italic">AN</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{AN}(p)$]]></tex-math></alternatives></inline-formula> to select the points that are truly in <inline-formula id="j_info1136_ineq_107"><alternatives><mml:math>
<mml:mi mathvariant="italic">B</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$B(p,\epsilon )$]]></tex-math></alternatives></inline-formula> and use these points to determine core points or merge clusters. We can therefore rule out the following two cases in ADvaNCE:</p>
<list>
<list-item id="j_info1136_li_011">
<label>1.</label>
<p>Point <italic>p</italic> is identified as a core point although it actually is not;</p>
</list-item>
<list-item id="j_info1136_li_012">
<label>2.</label>
<p>Two cells are merged although they should not be.</p>
</list-item>
</list>
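<p>The exact verification step described above can be sketched as follows, assuming Euclidean distance; <monospace>true_eps_neighbours</monospace> is an illustrative name, with <monospace>approx_neighbours</monospace> standing for the LSH candidate set AN(p).</p>

```python
import math

def true_eps_neighbours(p, approx_neighbours, eps):
    """Filter the LSH candidate set AN(p) down to the true B(p, eps).

    Every candidate is verified with an exact distance computation, so
    a point can never be wrongly promoted to core point and two cells
    can never be wrongly merged.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [q for q in approx_neighbours if dist(p, q) <= eps]
```
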
<p>In conclusion, the accuracy of ADvaNCE increases monotonically with the number of iterations, and an excessive number of iterations therefore cannot adversely affect the accuracy.</p>
</sec>
<sec id="j_info1136_s_013">
<label>6.2</label>
<title>Time Complexity</title>
<p>The most time-consuming part of ADvaNCE is the iterated hashing and merging. In the following we thus analyse the time complexity of each iteration of ADvaNCE. Since <italic>Determine Core Points Hash</italic> and <italic>Merge Clusters Hash</italic> perform a linear search in the approximate neighbourhood <inline-formula id="j_info1136_ineq_108"><alternatives><mml:math>
<mml:mi mathvariant="italic">AN</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{AN}(p)$]]></tex-math></alternatives></inline-formula>, the expected size of <inline-formula id="j_info1136_ineq_109"><alternatives><mml:math>
<mml:mi mathvariant="italic">AN</mml:mi></mml:math><tex-math><![CDATA[$\mathit{AN}$]]></tex-math></alternatives></inline-formula>, which is also the expected number of collisions for each query point in locality sensitive hashing, is the key to determining the running time of each iteration. We first discuss characteristics of level-<italic>i</italic> cells (number and distance between them) and the <italic>p</italic>-stable hashing functions used (probability of collision) before we reason about the time complexity.</p>
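<p>The <italic>p</italic>-stable LSH scheme referred to here can be sketched with the standard 2-stable (Gaussian) family, where each hash is h(v) = ⌊(a·v + b)/r⌋ and k hashes are concatenated; <monospace>make_pstable_hash</monospace> is a hypothetical helper, not the paper's implementation.</p>

```python
import numpy as np

def make_pstable_hash(d, r, k, rng=None):
    """Build g(v) = (h_1(v), ..., h_k(v)) with h(v) = floor((a.v + b) / r).

    Each a is drawn from the 2-stable (Gaussian) distribution, so points
    that are close under the Euclidean metric collide with high probability.
    """
    rng = rng or np.random.default_rng(0)
    A = rng.standard_normal((k, d))      # one Gaussian projection per h_i
    b = rng.uniform(0.0, r, size=k)      # random offsets in [0, r)
    def g(v):
        return tuple(np.floor((A @ np.asarray(v, float) + b) / r).astype(int))
    return g
```

Points with an identical bucket tuple land in the same hash bucket and become approximate neighbours of each other.
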
<p><bold>Level-</bold><inline-formula id="j_info1136_ineq_110"><alternatives><mml:math>
<mml:mi mathvariant="bold-italic">i</mml:mi></mml:math><tex-math><![CDATA[$\boldsymbol{i}$]]></tex-math></alternatives></inline-formula> <bold>cell characteristics:</bold> let <italic>P</italic> be the set of points in <inline-formula id="j_info1136_ineq_111"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\mathbb{R}^{d}}$]]></tex-math></alternatives></inline-formula>, where the number of dimensions <italic>d</italic> is a constant, and construct the original grid <italic>G</italic> over all points in <italic>P</italic>. We start by defining <italic>level-i cells</italic>.</p><statement id="j_info1136_stat_002"><label>Definition 2.</label>
<p><bold>Level-</bold><inline-formula id="j_info1136_ineq_112"><alternatives><mml:math>
<mml:mi mathvariant="bold-italic">i</mml:mi></mml:math><tex-math><![CDATA[$\boldsymbol{i}$]]></tex-math></alternatives></inline-formula> <bold>cells:</bold> For a point <italic>q</italic>, let <italic>c</italic> be the cell in <italic>G</italic> that contains <italic>q</italic>. The cells directly adjacent to <italic>c</italic> are level-0 cells, and the outer cells directly adjacent to the level-<italic>i</italic> cells are level-<inline-formula id="j_info1136_ineq_113"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(i+1)$]]></tex-math></alternatives></inline-formula> cells.</p></statement>
<p>An example of level-<italic>i</italic> cells in a 2D grid is shown in Fig. <xref rid="j_info1136_fig_008">4</xref>. Relative to the red cell, the yellow cells are level-0 cells, the green cells are level-1 cells, etc. The number of level-<italic>i</italic> cells <inline-formula id="j_info1136_ineq_114"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${N_{i}}$]]></tex-math></alternatives></inline-formula> is 
<disp-formula id="j_info1136_eq_001">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>−</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {N_{i}}={(2i+3)^{d}}-{(2i+1)^{d}}=O\big({i^{d-1}}\big).\]]]></tex-math></alternatives>
</disp-formula> 
For every point <italic>x</italic> in the level-<italic>i</italic> cells, the minimum distance between <italic>x</italic> and the query point <italic>q</italic>, <inline-formula id="j_info1136_ineq_115"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{i}}$]]></tex-math></alternatives></inline-formula> is 
<disp-formula id="j_info1136_eq_002">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">cellWidth</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>×</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">ϵ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {M_{i}}=i\times \mathit{cellWidth}=i\times \frac{\epsilon }{\sqrt{d}}.\]]]></tex-math></alternatives>
</disp-formula> 
The maximum number of points in a level-<italic>i</italic> cell is the constant <inline-formula id="j_info1136_ineq_116"><alternatives><mml:math>
<mml:mi mathvariant="italic">maxNum</mml:mi></mml:math><tex-math><![CDATA[$\mathit{maxNum}$]]></tex-math></alternatives></inline-formula>, as discussed in the context of the representative points approximation.</p>
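<p>The two level-<italic>i</italic> quantities above can be computed directly; a minimal sketch with illustrative function names:</p>

```python
import math

def level_cell_count(i, d):
    """N_i = (2i+3)^d - (2i+1)^d, the number of level-i cells."""
    return (2 * i + 3) ** d - (2 * i + 1) ** d

def min_level_distance(i, eps, d):
    """M_i = i * cellWidth = i * eps / sqrt(d), the minimum distance
    between the query point and any point in a level-i cell."""
    return i * eps / math.sqrt(d)
```

For example, in 2D the eight cells surrounding <italic>c</italic> are the level-0 cells, matching N_0 = 3^2 - 1^2 = 8.
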
<p><bold>Hash function characteristics:</bold> if <inline-formula id="j_info1136_ineq_117"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${f_{p}}(t)$]]></tex-math></alternatives></inline-formula> is the probability density function of the absolute value of the <italic>p</italic>-stable distribution, then for two vectors <inline-formula id="j_info1136_ineq_118"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${v_{1}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1136_ineq_119"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${v_{2}}$]]></tex-math></alternatives></inline-formula> and with <inline-formula id="j_info1136_ineq_120"><alternatives><mml:math>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">‖</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">‖</mml:mo></mml:math><tex-math><![CDATA[$c=\| {v_{1}}-{v_{2}}\| $]]></tex-math></alternatives></inline-formula>, the probability of collision <inline-formula id="j_info1136_ineq_121"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$p(c)$]]></tex-math></alternatives></inline-formula> is: 
<disp-formula id="j_info1136_eq_003">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">b</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">b</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">v</mml:mi>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">b</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">v</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∫</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
</mml:msubsup><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">(</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">)</mml:mo>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">)</mml:mo>
<mml:mspace width="0.1667em"/>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ p(c)=P{r_{a,b}}\big[{h_{a,b}}(v1)={h_{a,b}}(v2)\big]={\int _{0}^{r}}\frac{1}{c}{f_{p}}\bigg(\frac{t}{c}\bigg)\bigg(1-\frac{t}{r}\bigg)\hspace{0.1667em}dt\]]]></tex-math></alternatives>
</disp-formula> 
where <italic>a</italic>, <italic>b</italic> and <italic>r</italic> are parameters of the <italic>p</italic>-stable LSH functions. To make the hash more selective, we can also concatenate <italic>k</italic> functions <italic>h</italic> from the family <italic>H</italic> and redefine the hash function as <inline-formula id="j_info1136_ineq_122"><alternatives><mml:math>
<mml:mi mathvariant="italic">g</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">v</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">v</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">v</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">v</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$g(v)={h_{1}}(v),{h_{2}}(v),\dots ,{h_{k}}(v)$]]></tex-math></alternatives></inline-formula>. In this case, the probability of collision <inline-formula id="j_info1136_ineq_123"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${p^{k}}(c)$]]></tex-math></alternatives></inline-formula> is as follows: 
<disp-formula id="j_info1136_eq_004">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true">[</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∫</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
</mml:msubsup><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">(</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">)</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">)</mml:mo>
<mml:mspace width="0.1667em"/>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true">]</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {p^{k}}(c)={\bigg[{\int _{0}^{r}}\frac{1}{c}{f_{p}}\bigg(\frac{t}{c}\bigg)\bigg(1-\frac{t}{r}\bigg)\hspace{0.1667em}dt\bigg]^{k}}.\]]]></tex-math></alternatives>
</disp-formula> 
With <italic>p</italic> (the collision probability function defined by the <italic>p</italic>-stable LSH) and <italic>k</italic> (the number of concatenated hash functions), the expected number of collisions for a query point <italic>q</italic> then is: 
<disp-formula id="j_info1136_eq_005">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="normal">#</mml:mi>
<mml:mi mathvariant="italic">Collision</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:mo stretchy="false">‖</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo stretchy="false">‖</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ E[\mathrm{\# }\mathit{Collision}]=\sum \limits_{x\in D}{p^{k}}\big(\| q-x\| \big)\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_info1136_ineq_124"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mo stretchy="false">‖</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo stretchy="false">‖</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${p^{k}}(\| q-x\| )$]]></tex-math></alternatives></inline-formula> is the probability of collision for two points at distance <inline-formula id="j_info1136_ineq_125"><alternatives><mml:math>
<mml:mo stretchy="false">‖</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo stretchy="false">‖</mml:mo></mml:math><tex-math><![CDATA[$\| q-x\| $]]></tex-math></alternatives></inline-formula> and <italic>D</italic> is the dataset.</p>
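<p>The collision probability and the expected collision count can be evaluated numerically; the sketch below assumes the 2-stable (Gaussian) case, where f_p is the density of the absolute value of a standard normal, and uses simple midpoint integration. The function names are illustrative, not the paper's code.</p>

```python
import math

def collision_prob(c, r, k=1, steps=2000):
    """p(c)^k for 2-stable LSH: numerically integrate
    (1/c) * f(t/c) * (1 - t/r) over t in [0, r], where f is the
    density of the absolute value of a standard Gaussian."""
    if c == 0:
        return 1.0
    f = lambda t: math.sqrt(2.0 / math.pi) * math.exp(-t * t / 2.0)
    h = r / steps
    p = sum((1.0 / c) * f((j + 0.5) * h / c) * (1 - (j + 0.5) * h / r) * h
            for j in range(steps))
    return p ** k

def expected_collisions(q, data, r, k):
    """E[#Collision] = sum over x in D of p(||q - x||)^k."""
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    return sum(collision_prob(dist(q, x), r, k) for x in data)
```

As expected, the collision probability decreases with distance and with the number of concatenated hash functions.
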
<fig id="j_info1136_fig_008">
<label>Fig. 4</label>
<caption>
<p>Cells neighbouring the red cell are level-0 cells.</p>
</caption>
<graphic xlink:href="info1136_g008.jpg"/>
</fig>
<p><bold>Time complexity analysis:</bold> since we divide the whole dataset into cells of different levels, we can rewrite the number of collisions in terms of level-<italic>i</italic> cells. To obtain an upper bound on the number of collisions, we assume that all points in the level-<italic>i</italic> cells are at distance <inline-formula id="j_info1136_ineq_126"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${M_{i}}$]]></tex-math></alternatives></inline-formula> from the query point, the minimum distance they can have; since the collision probability decreases with distance, this yields an upper bound. 
<disp-formula id="j_info1136_eq_006">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd">
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="normal">#</mml:mi>
<mml:mtext mathvariant="italic">Collision</mml:mtext>
<mml:mo fence="true" stretchy="false">]</mml:mo>
<mml:mo>=</mml:mo>
</mml:mtd>
<mml:mtd class="align-even">
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:mo stretchy="false">‖</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo stretchy="false">‖</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd">
<mml:mo>⩽</mml:mo>
</mml:mtd>
<mml:mtd class="align-even">
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi mathvariant="italic">maxNum</mml:mi>
<mml:mo>×</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd">
<mml:mo>⩽</mml:mo>
</mml:mtd>
<mml:mtd class="align-even">
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi mathvariant="italic">maxNum</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">(</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">ϵ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">)</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd">
<mml:mo>⩽</mml:mo>
</mml:mtd>
<mml:mtd class="align-even">
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi mathvariant="italic">maxNum</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">(</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">ϵ</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd">
<mml:mo>=</mml:mo>
</mml:mtd>
<mml:mtd class="align-even">
<mml:mi mathvariant="italic">maxNum</mml:mi>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">(</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">ϵ</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>×</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:munderover><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}E[\mathrm{\# }\textit{Collision}]=& \sum \limits_{x\in D}{p^{k}}\big(\| q-x\| \big)\\ {} \leqslant & {\sum \limits_{i=0}^{\infty }}\mathit{maxNum}\times {N_{i}}\times {p^{k}}({M_{i}})\\ {} \leqslant & {\sum \limits_{i=0}^{\infty }}\mathit{maxNum}\times O\big({i^{d-1}}\big)\times {p^{k}}\bigg(\frac{i\epsilon }{\sqrt{d}}\bigg)\\ {} \leqslant & {\sum \limits_{i=0}^{\infty }}\mathit{maxNum}\times O\big({i^{d-1}}\big)\times {\bigg(\frac{\sqrt{d}}{i\epsilon }\bigg)^{k}}\\ {} =& \mathit{maxNum}\times {\bigg(\frac{\sqrt{d}}{\epsilon }\bigg)^{k}}\times {\sum \limits_{i=0}^{\infty }}\frac{1}{O({i^{k-d+1}})}.\end{aligned}\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>The expected number of collisions is proportional to an infinite series of <italic>i</italic>. Given any <italic>d</italic> (the dimension of the input data) we can always find <inline-formula id="j_info1136_ineq_127"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>⩾</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$k\geqslant (d+1)$]]></tex-math></alternatives></inline-formula> so that the infinite series in <italic>i</italic> converges to a constant. The expected number of collisions can then be expressed by the following function, where <italic>C</italic> is a constant independent of the size of the dataset: 
<disp-formula id="j_info1136_eq_007">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">maxNum</mml:mi>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">(</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">ϵ</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ =\mathit{maxNum}\times {\bigg(\frac{\sqrt{d}}{\epsilon }\bigg)^{k}}\times C.\]]]></tex-math></alternatives>
</disp-formula>
</p>
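To make the convergence argument explicit (our summary, not part of the original derivation): after dropping the constants, the remaining series over i is a p-series, which converges precisely when its exponent exceeds 1:

```latex
\sum_{i=1}^{\infty}\frac{1}{i^{\,k-d+1}}<\infty
\quad\Longleftrightarrow\quad k-d+1>1
\quad\Longleftrightarrow\quad k\geqslant d+1\quad(k\ \text{integer}).
```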
<p>The expected size of the approximate neighbour <inline-formula id="j_info1136_ineq_128"><alternatives><mml:math>
<mml:mi mathvariant="italic">AN</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{AN}(p)$]]></tex-math></alternatives></inline-formula> therefore is <inline-formula id="j_info1136_ineq_129"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(1)$]]></tex-math></alternatives></inline-formula>. In the functions <italic>Determine Core Point Hash</italic> and <italic>Merge Clusters Hash</italic> we perform a linear search in <italic>AN</italic> for every point, making the time complexity <inline-formula id="j_info1136_ineq_130"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(n)$]]></tex-math></alternatives></inline-formula>. Also, the hashing process itself is <inline-formula id="j_info1136_ineq_131"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(n)$]]></tex-math></alternatives></inline-formula> and so the time complexity of one whole iteration consequently is <inline-formula id="j_info1136_ineq_132"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(n)$]]></tex-math></alternatives></inline-formula>.</p>
</sec>
</sec>
<sec id="j_info1136_s_014">
<label>7</label>
<title>Experimental Evaluation</title>
<p>In this section we first describe the experimental setup and then perform a sensitivity analysis on synthetic datasets, changing one workload variable at a time to understand the performance. To demonstrate the benefits of our approach we also use real-world datasets. Before we conclude we analyse our algorithm in depth through a breakdown of its performance.</p>
<sec id="j_info1136_s_015">
<label>7.1</label>
<title>Experimental Setup</title>
<p>The experiments are run on an Ubuntu Linux (kernel 2.6) machine equipped with Intel(R) Xeon(R) E5-2640 CPUs running at 2.5 GHz, with 64 KB L1, 256 KB L2 and 12 MB L3 caches and 8 GB RAM at 1333 MHz. The storage consists of 2 SAS disks of 300 GB capacity each. Storage, however, is only used once to read the data into memory. In the following we discuss the software setup as well as the configuration details.</p>
</sec>
<sec id="j_info1136_s_016">
<label>7.2</label>
<title>Software Setup and Datasets</title>
<p>We compare our algorithm against the most recent approximate DBSCAN implementation available, <italic>ρ</italic>-approximate DBSCAN, and set <italic>ρ</italic> to 0.001 as recommended (Gan and Tao, <xref ref-type="bibr" rid="j_info1136_ref_012">2015</xref>). We use two different versions of our approach, the first using the hashing approximation only (ADvaNCE-LSH) and the second additionally using the representative point reduction (ADvaNCE). Crucially, the stop criterion for ADvaNCE (for both versions) is set so that it achieves the same accuracy as <italic>ρ</italic>-approximate DBSCAN. Our approach is configured to use eight <italic>p</italic>-stable hashing functions and iterates until convergence, i.e. until no more clusters are merged.</p>
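For illustration, the p-stable hashing scheme can be sketched as below: the standard 2-stable (Gaussian) construction hashes a vector v to floor((a · v + b) / w) in each of k components. This is a generic sketch, not the paper's implementation; all names and parameter values are ours.

```python
import math
import random

def make_pstable_hash(dim, w, k, seed=0):
    """Build a k-component p-stable (Gaussian) LSH function.

    Each component computes floor((a . v + b) / w) with a drawn from a
    standard normal (2-stable) distribution and b uniform in [0, w).
    All parameter names are illustrative, not taken from ADvaNCE.
    """
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = rng.uniform(0.0, w)
        components.append((a, b))

    def h(v):
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)
            for a, b in components
        )

    return h

# eight hash components, as in the configuration described above
h = make_pstable_hash(dim=9, w=5_000.0, k=8, seed=42)
p = [100.0] * 9
```

Nearby points are likely to agree in many components; the number of components k trades collision probability against precision.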
<p>Each algorithm uses a single CPU core to ensure a fair comparison. Our approach is written in C++ while for Gan and Tao (<xref ref-type="bibr" rid="j_info1136_ref_012">2015</xref>) we use the executable provided. All implementations are compiled using g++ with -O3.</p>
<p>All experiments are repeated five times and the execution time is averaged.</p>
<p>We use the same synthetic datasets as in Gan and Tao (<xref ref-type="bibr" rid="j_info1136_ref_012">2015</xref>) defined with a dimensionality <italic>d</italic>, a restart probability <inline-formula id="j_info1136_ineq_133"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ρ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">restart</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\rho _{\mathit{restart}}}$]]></tex-math></alternatives></inline-formula>, a target cardinality and a noise percentage <inline-formula id="j_info1136_ineq_134"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ρ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">noise</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\rho _{\mathit{noise}}}$]]></tex-math></alternatives></inline-formula>. The datasets are generated using a random walk: the walker moves a distance of <inline-formula id="j_info1136_ineq_135"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">shift</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${r_{\mathit{shift}}}$]]></tex-math></alternatives></inline-formula> in a random direction every <italic>c</italic> steps. In each step, with probability <inline-formula id="j_info1136_ineq_136"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ρ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">restart</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\rho _{\mathit{restart}}}$]]></tex-math></alternatives></inline-formula>, the walker restarts by jumping to a random location in the data space and resets the step counter to <italic>c</italic>. Regardless of whether a restart happened, the walker then produces a point uniformly at random in the ball of radius 100 centred at its current location. We repeat a total of <inline-formula id="j_info1136_ineq_137"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>×</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ρ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">noise</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$n\times (1-{\rho _{\mathit{noise}}})$]]></tex-math></alternatives></inline-formula> steps generating the same number of points and add <inline-formula id="j_info1136_ineq_138"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>×</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ρ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">noise</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$n\times {\rho _{\mathit{noise}}}$]]></tex-math></alternatives></inline-formula> uniformly distributed noise points. Put simply, we start to generate a cluster and in every step, with probability <inline-formula id="j_info1136_ineq_139"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ρ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">restart</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\rho _{\mathit{restart}}}$]]></tex-math></alternatives></inline-formula>, start to generate a new cluster.</p>
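Our reading of the generator above can be sketched as follows. The values of r_shift and c are illustrative defaults, and a bounding cube stands in for the exact ball for brevity; this is not the original generator of Gan and Tao (2015).

```python
import random

def random_walk_dataset(n, d, p_restart, p_noise, r_shift=50.0, c=100,
                        domain=10_000.0, ball=100.0, seed=0):
    # Sketch of the random-walk generator described above; r_shift, c
    # and the ball radius are illustrative, and a bounding cube stands
    # in for the exact ball for brevity.
    rng = random.Random(seed)
    n_cluster = round(n * (1 - p_noise))
    points = []
    centre = [rng.uniform(0, domain) for _ in range(d)]
    steps_left = c
    for _ in range(n_cluster):
        if rng.random() < p_restart:
            # restart: jump to a random location, reset the step counter
            centre = [rng.uniform(0, domain) for _ in range(d)]
            steps_left = c
        if steps_left == 0:
            # every c steps, shift the centre by r_shift in a random direction
            direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
            norm = sum(x * x for x in direction) ** 0.5 or 1.0
            centre = [ci + r_shift * di / norm
                      for ci, di in zip(centre, direction)]
            steps_left = c
        steps_left -= 1
        # emit a point near the current centre
        points.append([ci + rng.uniform(-ball, ball) for ci in centre])
    for _ in range(n - n_cluster):
        # uniformly distributed noise points
        points.append([rng.uniform(0, domain) for _ in range(d)])
    return points

data = random_walk_dataset(n=1000, d=5, p_restart=0.001, p_noise=0.05)
```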
<p>We also use two real datasets (Bache and Lichman, <xref ref-type="bibr" rid="j_info1136_ref_004">2013</xref>): the <italic>Household</italic> dataset in 7D as well as the <italic>KDD Cup ’99</italic> dataset (after PCA on the attributes) in 9D. The <italic>Household</italic> dataset measures electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years resulting in 2’049’280 data points. The <italic>KDD Cup ’99</italic> dataset consists of 4’898’431 data points representing detected network intrusion events.</p>
<p>All data is normalized to the domain [0, 10’000] in all dimensions.</p>
</sec>
<sec id="j_info1136_s_017">
<label>7.3</label>
<title>Accuracy Metric</title>
<p>To measure the accuracy of the approximate result we compare it to the exact result. We use the established omega-index to measure the quality of the clustering result (Collins and Dent, <xref ref-type="bibr" rid="j_info1136_ref_009">1988</xref>; Murray <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1136_ref_015">2012</xref>). The omega-index assesses the similarity of two clustering results as the ratio of consistent pairs of points, i.e. pairs that are placed in the same cluster in both clustering results or in different clusters in both. We compare the exact result obtained with the grid-based DBSCAN algorithm with either version of ADvaNCE. An omega-index of 100% consequently means ADvaNCE’s result is precise and all values below that indicate how much the approximate result deviates from the precise result.</p>
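The pair-counting idea behind the metric can be sketched as below: count the pairs on which the two clusterings agree, i.e. the pair is together in both results or separated in both. This simple sketch covers disjoint clusterings only (where the omega-index coincides with the Rand index); the function name is ours.

```python
from itertools import combinations

def pairwise_agreement(labels_a, labels_b):
    """Fraction of point pairs on which two (disjoint) clusterings
    agree: the pair is either in the same cluster in both results or
    in different clusters in both."""
    n = len(labels_a)
    assert n == len(labels_b)
    agree = 0
    total = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b   # True counts as 1
        total += 1
    return agree / total

# identical clusterings (up to relabelling) agree on every pair
score = pairwise_agreement([0, 0, 1, 1], [0, 0, 1, 1])
```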
</sec>
<sec id="j_info1136_s_018">
<label>7.4</label>
<title>Synthetic Data</title>
<p>The synthetic datasets we use are generated as described previously. The DBSCAN parameter <inline-formula id="j_info1136_ineq_140"><alternatives><mml:math>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula> is set to 100 (Gan and Tao, <xref ref-type="bibr" rid="j_info1136_ref_012">2015</xref>). <inline-formula id="j_info1136_ineq_141"><alternatives><mml:math>
<mml:mi mathvariant="italic">maxNum</mml:mi></mml:math><tex-math><![CDATA[$\mathit{maxNum}$]]></tex-math></alternatives></inline-formula> for the representative points approximation is set to <inline-formula id="j_info1136_ineq_142"><alternatives><mml:math>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">minPts</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
</mml:msqrt></mml:math><tex-math><![CDATA[$\sqrt{\mathit{minPts}\times M}$]]></tex-math></alternatives></inline-formula> where <italic>M</italic> is the maximum points per cell.</p>
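The cap is thus the geometric mean of the density threshold minPts and the fullest cell M; a minimal sketch (the function name is ours):

```python
import math

def max_num(min_pts, m):
    # Cap on the number of representative points kept per cell:
    # sqrt(minPts x M), the geometric mean of the density threshold
    # and the fullest cell.
    return math.sqrt(min_pts * m)

# with minPts = 100 and a fullest cell of 10'000 points,
# at most 1'000 representatives are kept per cell
cap = max_num(100, 10_000)
```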
<sec id="j_info1136_s_019">
<label>7.4.1</label>
<title>Increasing Dataset Size</title>
<p>In this first experiment we compare the execution time of our approach (both optimizations) with the most recent DBSCAN approximation technique. We cannot compare it with a basic, exact DBSCAN as it takes too long to execute, but we infer from experiments that the exact DBSCAN implementation (Ester <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1136_ref_011">1996</xref>) is slower than <italic>ρ</italic>-approximate DBSCAN (Gan and Tao, <xref ref-type="bibr" rid="j_info1136_ref_012">2015</xref>). We execute the experiment in 5, 7, and 9 dimensions, increase the number of data points from 100’000 to 10 million and set <italic>ϵ</italic> to 5’000.</p>
<fig id="j_info1136_fig_009">
<label>Fig. 5</label>
<caption>
<p>Average execution time for experiments in 5, 7 and 9 dimensions with increasing dataset size.</p>
</caption>
<graphic xlink:href="info1136_g009.jpg"/>
</fig>
<p>As can be seen in Fig. <xref rid="j_info1136_fig_009">5</xref>, our approach using both approximations (ADvaNCE) outperforms <italic>ρ</italic>-approximate DBSCAN (Gan and Tao, <xref ref-type="bibr" rid="j_info1136_ref_012">2015</xref>) consistently and up to <inline-formula id="j_info1136_ineq_143"><alternatives><mml:math>
<mml:mn>100</mml:mn>
<mml:mo>×</mml:mo></mml:math><tex-math><![CDATA[$100\times $]]></tex-math></alternatives></inline-formula> (for 9 dimensions). As the number of dimensions increases, the execution time of <italic>ρ</italic>-approximate DBSCAN grows. Our approach, on the other hand, shows consistent performance and improves execution time with an increase in the number of dimensions. This is because approximation by hashing works particularly well in higher dimensions. Overall our approach scales better with an increasing number of points in the dataset.</p>
<p>The gap between the two versions of our approach, ADvaNCE-LSH and ADvaNCE narrows as the number of dimensions grows. This effect can be attributed to hash-based approximation improving with an increasing dimensionality. As a consequence, the relative contribution of the representative points approximation shrinks and does not improve the result considerably for higher dimensions.</p>
<p>In Fig. <xref rid="j_info1136_fig_010">6</xref> we measure the number of distance calculations instead of the execution time for the grid-based DBSCAN implementation as well as both versions of ADvaNCE (we cannot measure it for <italic>ρ</italic>-approximate DBSCAN as we only have access to the executable). Given the considerable execution time of measuring the number of distance calculations, we only do so for 0.2 to 1.6 million data points. Comparing the trend of the number of distance calculations with that of the execution time in Fig. <xref rid="j_info1136_fig_009">5</xref>, we can see that they have a similar shape, which clearly indicates that the number of distance calculations is indeed the major contributor to the overall execution time.</p>
<fig id="j_info1136_fig_010">
<label>Fig. 6</label>
<caption>
<p>Number of distance calculations with increasing dataset size.</p>
</caption>
<graphic xlink:href="info1136_g010.jpg"/>
</fig>
</sec>
<sec id="j_info1136_s_020">
<label>7.4.2</label>
<title>Increasing Epsilon</title>
<p>In a next experiment we compare the approaches with an increasing <italic>ϵ</italic> ranging from 5’000 to 50’000. Here too the basic, exact DBSCAN version takes too long to execute; we cannot include it in the results and consequently only compare ADvaNCE with <italic>ρ</italic>-approximate DBSCAN.</p>
<fig id="j_info1136_fig_011">
<label>Fig. 7</label>
<caption>
<p>Average execution time for experiments in 5, 7 and 9 dimensions with increasing <italic>ϵ</italic>.</p>
</caption>
<graphic xlink:href="info1136_g011.jpg"/>
</fig>
<p>As the results in Fig. <xref rid="j_info1136_fig_011">7</xref> show, for an increasing <italic>ϵ</italic> the execution time of <italic>ρ</italic>-approximate DBSCAN decreases at first but then grows again (as can be seen in the experiment for five dimensions). Our approach based on hashing alone does not perform well as its execution time grows very rapidly: the problem with using hashing to compute distance calculations is that the precision degrades as <italic>ϵ</italic> grows. Adding the representative point approximation, however, again reduces the execution time considerably, to the point where ADvaNCE again outperforms <italic>ρ</italic>-approximate DBSCAN.</p>
<p>For a very big <italic>ϵ</italic> (which is rarely used in real clustering applications), however, <italic>ρ</italic>-approximate DBSCAN outperforms ADvaNCE. This effect is due to the dataset rather than the algorithm. As outlined previously, the data points are normalized to an interval of [0, 10’000] in all dimensions. In these experiments, however, <italic>ϵ</italic> finally grows to 50’000 and, for example, in 5D the cell width is almost 23’000, resulting in only 5 cells in each dimension. With so few cells ADvaNCE cannot considerably reduce the number of distance calculations.</p>
</sec>
<sec id="j_info1136_s_021">
<label>7.4.3</label>
<title>Increasing Dimensions</title>
<p>We also test the performance, i.e. the execution time, of our approach with an increasing number of dimensions (beyond 9 dimensions) and compare it with <italic>ρ</italic>-approximate DBSCAN. As we increase the dimensionality, we fix the remaining parameters: we use a synthetic dataset with three million data points and an <italic>ϵ</italic> of 5’000.</p>
<fig id="j_info1136_fig_012">
<label>Fig. 8</label>
<caption>
<p>Average execution time for experiments with an increasing number of dimensions.</p>
</caption>
<graphic xlink:href="info1136_g012.jpg"/>
</fig>
<p>As the result in Fig. <xref rid="j_info1136_fig_012">8</xref> shows, although concave down and appearing to converge, the execution time of <italic>ρ</italic>-approximate DBSCAN grows faster than that of both ADvaNCE versions, resulting in a speedup of more than 10× for the largest dimensionality tested.</p>
<p>The inaccuracy LSH introduces grows with the number of dimensions of the input data. As our results show, however, the impact of this effect is minimal in the experiments we carried out.</p>
<p>Here too, the relative difference between the two versions of ADvaNCE indicates that the higher the dimensionality, the more the hashing approximation contributes to the reduction in execution time, as the benefit of the representative points approximation diminishes with increasing dimensionality. Towards the largest number of dimensions tested, the execution time curves of <italic>ρ</italic>-approximate DBSCAN and our approach are parallel, with ADvaNCE still outperforming the state of the art by more than 10×.</p>
<p>The results in Fig. <xref rid="j_info1136_fig_013">9</xref> show a similar picture. In this experiment we measure the number of distance calculations needed for an increasing number of dimensions for the same experimental setup. As the results show, the trend in both curves is the same as for the execution time in Fig. <xref rid="j_info1136_fig_012">8</xref> and consequently the number of distance calculations is indeed what drives the execution time.</p>
<fig id="j_info1136_fig_013">
<label>Fig. 9</label>
<caption>
<p>Number of distance calculations with an increasing number of dimensions.</p>
</caption>
<graphic xlink:href="info1136_g013.jpg"/>
</fig>
</sec>
</sec>
<sec id="j_info1136_s_022">
<label>7.5</label>
<title>Real World Datasets</title>
<p>As a final litmus test of the overall execution time of ADvaNCE, we also compare it against <italic>ρ</italic>-approximate DBSCAN on real world data. More specifically, we execute ADvaNCE and <italic>ρ</italic>-approximate DBSCAN on the <italic>KDD cup ’99</italic> dataset as well as on the <italic>Household</italic> dataset (Bache and Lichman, <xref ref-type="bibr" rid="j_info1136_ref_004">2013</xref>). Given that the number of points per cell is very unbalanced in the real world datasets, we set <inline-formula id="j_info1136_ineq_144"><alternatives><mml:math>
<mml:mi mathvariant="italic">maxNum</mml:mi></mml:math><tex-math><![CDATA[$\mathit{maxNum}$]]></tex-math></alternatives></inline-formula> for the representative approximation to <inline-formula id="j_info1136_ineq_145"><alternatives><mml:math>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
</mml:msqrt>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">M</mml:mi></mml:math><tex-math><![CDATA[$\sqrt{A}\times M$]]></tex-math></alternatives></inline-formula> where <italic>M</italic> is the maximum number of points in a cell and <italic>A</italic> is the average number of points per cell.</p>
<p>As the results in Fig. <xref rid="j_info1136_fig_014">10</xref> show, ADvaNCE generally outperforms <italic>ρ</italic>-approximate DBSCAN. As we have already seen on the synthetic data, the ADvaNCE version which also features the representative points approximation further improves the efficiency and in general improves execution time by a factor of 5. Interestingly, for the <italic>KDD cup ’99</italic> data the trend both versions of ADvaNCE exhibit is similar to that of <italic>ρ</italic>-approximate DBSCAN, but ADvaNCE is still 10× faster. For the <italic>Household</italic> dataset both versions of ADvaNCE scale substantially better than <italic>ρ</italic>-approximate DBSCAN: the execution time of <italic>ρ</italic>-approximate DBSCAN grows very fast while that of ADvaNCE stays nearly flat.</p>
<fig id="j_info1136_fig_014">
<label>Fig. 10</label>
<caption>
<p>Average execution time for experiments on real data with increasing epsilon.</p>
</caption>
<graphic xlink:href="info1136_g014.jpg"/>
</fig>
<p>In a next experiment we measure the execution time on the same real world datasets but we increase the size of the datasets. More precisely, we start with a tenth of the dataset and then increase it until we arrive at their full size.</p>
<p>As the results in Fig. <xref rid="j_info1136_fig_015">11</xref> show, ADvaNCE also outperforms <italic>ρ</italic>-approximate DBSCAN by more than a factor of 10. The trends are similar but, as the dataset size increases, ADvaNCE gains an edge over ADvaNCE-LSH and the gap between the two grows.</p>
<fig id="j_info1136_fig_015">
<label>Fig. 11</label>
<caption>
<p>Average execution time for experiments on real data with increasing dataset size.</p>
</caption>
<graphic xlink:href="info1136_g015.jpg"/>
</fig>
</sec>
</sec>
<sec id="j_info1136_s_023">
<label>8</label>
<title>In-Depth Analysis</title>
<p>Crucial to understanding the benefits of ADvaNCE is analysing its behaviour in detail. In the following we thus first analyse the impact of increasing dimensionality and <italic>ϵ</italic> on the number of points (and thus on the execution time) and then further analyse ADvaNCE through a breakdown of the execution time.</p>
<sec id="j_info1136_s_024">
<label>8.1</label>
<title>Number of Representative Points</title>
<p>As the dimensionality grows, the maximum and the average number of points in each cell decrease (because the cells shrink: <inline-formula id="j_info1136_ineq_146"><alternatives><mml:math>
<mml:mi mathvariant="italic">cellWidth</mml:mi>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">ϵ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mstyle></mml:math><tex-math><![CDATA[$\mathit{cellWidth}=\frac{\epsilon }{\sqrt{d}}$]]></tex-math></alternatives></inline-formula> and because there are more directions to spread the data). Fig. <xref rid="j_info1136_fig_016">12</xref> confirms this by measuring the maximum as well as average number of points per cell for an increasing number of dimensions. Consequently, ADvaNCE will perform well for an increasing number of dimensions even if only ADvaNCE-LSH, the version which does not curb the number of points per cell, is used.</p>
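A minimal sketch of the relationship (the function name is ours): a cell width of ϵ/√d makes the cell diagonal exactly ϵ, so any two points in one cell are within ϵ of each other, and the width shrinks as the dimensionality grows.

```python
import math

def cell_width(eps, d):
    # Side length eps / sqrt(d): the diagonal of a d-dimensional cube
    # with this side is exactly eps, so two points sharing a cell are
    # guaranteed to be within distance eps.
    return eps / math.sqrt(d)

width_9d = cell_width(5_000, 9)   # cells get narrower as d grows
```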
<fig id="j_info1136_fig_016">
<label>Fig. 12</label>
<caption>
<p>Maximum and average number of points per cell for increasing dimensions.</p>
</caption>
<graphic xlink:href="info1136_g016.jpg"/>
</fig>
<p>For an increasing <italic>ϵ</italic> the picture is different as Fig. <xref rid="j_info1136_fig_017">13</xref> shows. In this experiment we measure the maximum (<italic>Max Num</italic>), the average (<italic>Avg Num</italic>) as well as the reduced number (<italic>Redu Num</italic>) of points ADvaNCE considers. The reduced number of points is determined based on the maximum number of points in a cell <italic>M</italic> and <inline-formula id="j_info1136_ineq_147"><alternatives><mml:math>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula> as follows: <inline-formula id="j_info1136_ineq_148"><alternatives><mml:math>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">minPts</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">M</mml:mi>
</mml:mrow>
</mml:msqrt></mml:math><tex-math><![CDATA[$\sqrt{\mathit{minPts}\times M}$]]></tex-math></alternatives></inline-formula>.</p>
<p>The results clearly show that as <italic>ϵ</italic> grows, the size of the cells grows and so does the number of points in each cell. The number of points per cell, however, is unbalanced: some cells contain a very large number of points while others contain only a few. The second optimization of ADvaNCE, representative points, effectively addresses this by reducing the number of points considered (while sacrificing only little accuracy).</p>
<fig id="j_info1136_fig_017">
<label>Fig. 13</label>
<caption>
<p>Maximum, average and reduced number of points per cell for a growing <italic>ϵ</italic>.</p>
</caption>
<graphic xlink:href="info1136_g017.jpg"/>
</fig>
<fig id="j_info1136_fig_018">
<label>Fig. 14</label>
<caption>
<p>Breakdown of the execution time of ADvaNCE.</p>
</caption>
<graphic xlink:href="info1136_g018.jpg"/>
</fig>
</sec>
<sec id="j_info1136_s_025">
<label>8.2</label>
<title>Algorithm Breakdown</title>
<p>To understand the behaviour of ADvaNCE, we analyse the performance of its major building blocks, <italic>Determine Core Point LSH</italic> and <italic>Merge Clusters LSH</italic>, in the main iteration of our algorithm for an increasing <italic>ϵ</italic>. As the results in Fig. <xref rid="j_info1136_fig_018">14</xref> (left) show, if we only use ADvaNCE-LSH (i.e. ADvaNCE without the representative points approximation), the time for determining core points increases considerably as <italic>ϵ</italic> grows, from less than 0.1 s to more than 100 s. This is because in this step, for every point not yet known to be a core point, we have to retrieve all points in its approximate neighbourhood to test whether it is a core point. As <italic>ϵ</italic> grows larger, the number of points in the approximate neighbourhood grows substantially and this step becomes the bottleneck. If we limit the number of points in each cell with the representative points approximation, however, the execution time is curbed.</p>
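<p>The core point test can be sketched with a simple grid of cells of width <italic>ϵ</italic>/√<italic>d</italic>; the following Python illustration is a hedged brute-force stand-in (the helper names and the exhaustive enumeration of neighbouring cells are ours, not the LSH scheme itself):</p>

```python
import math
from collections import defaultdict
from itertools import product

def build_grid(points, eps):
    """Hash points into cells of width eps/sqrt(d) (sketch)."""
    d = len(points[0])
    w = eps / math.sqrt(d)
    grid = defaultdict(list)
    for p in points:
        grid[tuple(int(math.floor(c / w)) for c in p)].append(p)
    return grid, w

def is_core(p, grid, w, eps, min_pts):
    """Count neighbours of p within eps by scanning nearby cells only.

    A point within eps differs by at most eps in every coordinate,
    i.e. by at most sqrt(d) cell widths, so the scan radius is bounded
    by ceil(sqrt(d)) cells per axis."""
    d = len(p)
    key = tuple(int(math.floor(c / w)) for c in p)
    r = int(math.ceil(math.sqrt(d)))
    count = 0
    for off in product(range(-r, r + 1), repeat=d):
        cell = tuple(k + o for k, o in zip(key, off))
        for q in grid.get(cell, []):
            if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps ** 2:
                count += 1
    return count >= min_pts
```

<p>The cost of <monospace>is_core</monospace> is proportional to the number of points in the scanned cells, which is exactly why capping the points per cell via representative points curbs the execution time of this step.</p>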
<p>As Fig. <xref rid="j_info1136_fig_018">14</xref> (right) shows, if we only use the hashing algorithm and do not limit the number of points, the time for <italic>Merge Clusters LSH</italic> also increases as <italic>ϵ</italic> grows. The increase, however, is moderate, going from less than 1 second to 10 seconds. This is because in this step we only need to find one core point in the approximate neighbourhood of every point and can then merge the cells. As <italic>ϵ</italic> grows bigger, the number of points in the approximate neighbourhood also grows, but since a single core point suffices, the time of this step does not increase substantially. If we limit the number of points in each cell to the representative points, the execution time of this step remains essentially the same.</p>
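<p>Merging cells once a connecting core point is found is a classic union-find task; the following minimal sketch is illustrative only (the class and its use here are our own, not the paper's implementation):</p>

```python
class UnionFind:
    """Union-find over cell ids, used to merge cells into clusters."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb
```

<p>As soon as one core point in cell <italic>c</italic><sub>2</sub> is found within <italic>ϵ</italic> of a core point in cell <italic>c</italic><sub>1</sub>, a single <monospace>union(c1, c2)</monospace> merges the two cells' clusters and no further neighbours of that point need to be examined.</p>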
<p>Assessing the execution time of these two functions, we see that if we only use ADvaNCE-LSH, the execution time increases as <italic>ϵ</italic> grows, mainly because of the time spent on <italic>Determine Core Points</italic>. With both optimizations used in ADvaNCE, however, the execution time remains almost the same (growing only slightly) for an increasing <italic>ϵ</italic>, thanks to the representative points approximation.</p>
</sec>
</sec>
<sec id="j_info1136_s_026">
<label>9</label>
<title>Related Work</title>
<p>In the discussion of related work we first give a brief overview of exact density-based clustering approaches before discussing the approximate algorithms in more detail.</p>
<sec id="j_info1136_s_027">
<label>9.1</label>
<title>Exact Approaches</title>
<p>Arguably the most renowned density-based clustering approach is DBSCAN (Ester <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1136_ref_011">1996</xref>) itself. Given a set of points in a typically high-dimensional space, it groups together points that are closely packed (points with many nearby neighbours), marking as outliers points that lie alone in sparse regions. More formally, DBSCAN takes parameters <italic>ϵ</italic> and <inline-formula id="j_info1136_ineq_149"><alternatives><mml:math>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula> and identifies core points (points with at least <inline-formula id="j_info1136_ineq_150"><alternatives><mml:math>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula> neighbours within radius <italic>ϵ</italic>). DBSCAN then assigns core points within distance <italic>ϵ</italic> of each other (along with the non-core points within distance <italic>ϵ</italic> of a core point) to the same cluster. All other points are noise. DBSCAN can efficiently separate dense and sparse regions but cannot deal well with varying density.</p>
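<p>To make these definitions concrete, a minimal exact DBSCAN sketch in Python, with brute-force neighbourhood queries for illustration only (real implementations use a spatial index for the <italic>ϵ</italic>-range queries):</p>

```python
def dbscan(points, eps, min_pts):
    """Minimal exact DBSCAN: label each point with a cluster id or -1 (noise)."""
    n = len(points)

    def neighbours(i):
        # eps-range query, brute force: all points within distance eps of i
        return [j for j in range(n)
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps ** 2]

    labels = [None] * n          # None = unvisited, -1 = noise
    cid = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbours(i)
        if len(nb) < min_pts:
            labels[i] = -1        # noise (may still join a cluster as border point)
            continue
        cid += 1                  # i is a core point: start a new cluster
        labels[i] = cid
        seeds = list(nb)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid   # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cid
            nb_j = neighbours(j)
            if len(nb_j) >= min_pts:
                seeds.extend(nb_j)  # j is also a core point: keep expanding
    return labels
```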
<p>OPTICS (Ankerst <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1136_ref_003">1999</xref>; Breunig <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1136_ref_007">1999</xref>) addresses this shortcoming of DBSCAN. Instead of using a fixed distance predicate, OPTICS uses <italic>ϵ</italic> only as a maximum distance and adapts the effective distance in different areas of the dataset, so that clusters can also be identified in areas of varying density.</p>
</sec>
<sec id="j_info1136_s_028">
<label>9.2</label>
<title>Approximate Approaches</title>
<p>More importantly for putting our work in context, several approximate DBSCAN algorithms have also been devised in recent years.</p>
<p>ADBSCAN (Yeganeh <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1136_ref_019">2009</xref>) approaches the problem of accelerating DBSCAN by reducing the number of region queries needed. The approach is based on the observation that clusters are discovered by executing range queries on the dataset, and it consequently tries to minimize the number of range queries. To do so, it defines skeletal points as the minimum set of points at which range queries (with radius <italic>ϵ</italic>) need to be executed to capture all clusters. Finding the skeletal points is NP-complete and ADBSCAN thus essentially proposes how to approximate the set of skeletal points with a genetic algorithm. Because the skeletal points are only approximated, the result of the clustering is, of course, approximate as well.</p>
<p>IDBSCAN (Borah and Bhattacharyya, <xref ref-type="bibr" rid="j_info1136_ref_006">2004</xref>) works similarly: it approximates and thus accelerates DBSCAN by reducing the number of range (neighbourhood) queries needed. Instead of executing a query for all points in the vicinity of a core point, IDBSCAN only samples the neighbourhood and executes queries on a subset of the points. To do so, the neighbourhood of a core point is partitioned into quadrants and a range query is only executed for the points in randomly chosen quadrants. The speed-up achieved reaches a factor of six on the datasets tested. However, no guarantees about the results can be made.</p>
<p>l-DBSCAN (Viswanath and Pinkesh, <xref ref-type="bibr" rid="j_info1136_ref_018">2006</xref>) uses the concept of <italic>leaders</italic> to accelerate clustering. Leaders are a concise but approximate representation of the patterns in the dataset. On a first level, l-DBSCAN clusters the leaders instead of the full dataset, thereby accelerating the process. In a second step it replaces the leaders with the actual points from the dataset. With this process l-DBSCAN outperforms the precise version of DBSCAN by a factor of at most two.</p>
<p><italic>ρ</italic>-approximate DBSCAN (Gan and Tao, <xref ref-type="bibr" rid="j_info1136_ref_012">2015</xref>) guarantees the result of its clustering to be between the exact result of DBSCAN with (<inline-formula id="j_info1136_ineq_151"><alternatives><mml:math>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\epsilon ,\mathit{minPts}$]]></tex-math></alternatives></inline-formula>) and (<inline-formula id="j_info1136_ineq_152"><alternatives><mml:math>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo>×</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">ρ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\epsilon \times (1+\rho ),\mathit{minPts}$]]></tex-math></alternatives></inline-formula>). The approximation comes with a substantial speed-up and scalability: <italic>ρ</italic>-approximate DBSCAN is proven to run in <inline-formula id="j_info1136_ineq_153"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(n)$]]></tex-math></alternatives></inline-formula>. The approach accomplishes this by using a grid approach where data points are assigned to a grid with cell width <inline-formula id="j_info1136_ineq_154"><alternatives><mml:math><mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">ϵ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mstyle></mml:math><tex-math><![CDATA[$\frac{\epsilon }{\sqrt{d}}$]]></tex-math></alternatives></inline-formula>. For an exact result the algorithm has to connect/combine all pairs of cells <inline-formula id="j_info1136_ineq_155"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${c_{1}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1136_ineq_156"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${c_{2}}$]]></tex-math></alternatives></inline-formula> that (a) contain at least <inline-formula id="j_info1136_ineq_157"><alternatives><mml:math>
<mml:mi mathvariant="italic">minPts</mml:mi></mml:math><tex-math><![CDATA[$\mathit{minPts}$]]></tex-math></alternatives></inline-formula> points and (b) contain two points (<inline-formula id="j_info1136_ineq_158"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∈</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{1}}\in {c_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1136_ineq_159"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∈</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{2}}\in {c_{2}}$]]></tex-math></alternatives></inline-formula>) within distance <italic>ϵ</italic> of each other. This problem is known as the bichromatic closest pair (BCP) problem and solving it exactly is exceedingly costly. However, solving BCP approximately (to within <inline-formula id="j_info1136_ineq_160"><alternatives><mml:math>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo>×</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">ρ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\epsilon \times (1+\rho )$]]></tex-math></alternatives></inline-formula>) can be accomplished in linear time. Clearly, the main benefits of this approach are the proven and firm theoretical bounds on the error/approximation as well as on the execution time.</p>
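<p>The acceptance criterion of the approximate BCP test can be illustrated as follows; this brute-force Python stand-in is for exposition only (the actual algorithm achieves amortized constant time per cell pair, which this sketch does not attempt):</p>

```python
def approx_bcp_connects(cell_a, cell_b, eps, rho):
    """Report two cells as connected if some cross-cell pair of points
    lies within eps * (1 + rho).

    The rho-approximate guarantee allows (but does not require)
    connecting pairs whose distance falls between eps and
    eps * (1 + rho); this sketch simply connects all of them."""
    thr2 = (eps * (1 + rho)) ** 2
    return any(sum((a - b) ** 2 for a, b in zip(p, q)) <= thr2
               for p in cell_a for q in cell_b)
```

<p>Two cells whose closest cross pair is at distance 1.05<italic>ϵ</italic>, for instance, are connected for <italic>ρ</italic> = 0.1 but not for <italic>ρ</italic> = 0.01, which is exactly the slack the algorithm trades for linear running time.</p>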
<p>Pardicle (Patwary <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1136_ref_016">2014</xref>) approximates DBSCAN in a supercomputing environment. It accelerates the basic DBSCAN approach through parallelization: after partitioning the multi-dimensional dataset into contiguous chunks, DBSCAN finds clusters in each chunk in parallel on different CPUs/cores of the supercomputing infrastructure. To account for clusters that span several chunks (analysed independently and in parallel on different cores), Pardicle replicates the border (of width <italic>ϵ</italic>) of a chunk and copies it to adjacent chunks, i.e. to other cores in the supercomputer. Doing so in essence makes the chunks overlap but also ensures correctness of the result. To reduce the overhead in terms of replication and distance calculations, the border is not computed exactly but sampled. This approximates the results: the result within each chunk is accurate, but the merging of clusters between adjacent chunks may fail.</p>
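<p>The border replication can be sketched in one dimension as follows; a hedged illustration (the split positions and the one-dimensional simplification are ours, and the real system samples the border rather than copying it wholesale):</p>

```python
def partition_with_halo(points, splits, eps):
    """Partition points into contiguous chunks along the first coordinate
    and replicate an eps-wide halo into each neighbouring chunk, so that
    per-chunk clustering sees every neighbour of its own points."""
    bounds = zip([float("-inf")] + splits, splits + [float("inf")])
    return [[p for p in points if lo - eps < p[0] <= hi + eps]
            for lo, hi in bounds]
```

<p>Points close to a split position end up in both adjacent chunks, which is what allows clusters spanning a chunk boundary to be stitched together afterwards.</p>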
</sec>
</sec>
<sec id="j_info1136_s_029">
<label>10</label>
<title>Conclusions</title>
<p>In this paper, we develop ADvaNCE, a novel approach for approximate density-based clustering. ADvaNCE builds on a grid-based optimization of DBSCAN and uses two approximations to accelerate the clustering process and make it scalable. Understanding that distance calculations are the major contributor to the execution time of DBSCAN, ADvaNCE accelerates clustering by (a) approximating distance calculations and (b) reducing the number of distance calculations required, thereby further approximating the result. To achieve the necessary level of accuracy, ADvaNCE iterates over intermediate results to monotonically increase the accuracy.</p>
<p>Approximate clustering results are oftentimes sufficient, as related work has established (Ankerst <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1136_ref_003">1999</xref>; Gan and Tao, <xref ref-type="bibr" rid="j_info1136_ref_012">2015</xref>). Compared to the most recent state-of-the-art algorithm in approximate density-based clustering (Gan and Tao, <xref ref-type="bibr" rid="j_info1136_ref_012">2015</xref>), ADvaNCE achieves the same accuracy but does so more than one order of magnitude faster (30×) in the best case. With our experimental evaluation we also show that ADvaNCE executes faster when varying parameters like <italic>ϵ</italic>, the dataset size or the number of dimensions. In general, ADvaNCE scales better than the state of the art.</p>
</sec>
</body>
<back>
<ref-list id="j_info1136_reflist_001">
<title>References</title>
<ref id="j_info1136_ref_001">
<mixed-citation publication-type="journal"><string-name><surname>Adaszewski</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Dukart</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Kherif</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Frackowiak</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Draganski</surname>, <given-names>B.</given-names></string-name> (<year>2013</year>). <article-title>How early can we predict Alzheimer’s disease using computational anatomy?</article-title> <source>Neurobiology of Aging</source>, <volume>34</volume>(<issue>12</issue>), <fpage>2815</fpage>–<lpage>2826</lpage>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_002">
<mixed-citation publication-type="journal"><string-name><surname>Andoni</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Indyk</surname>, <given-names>P.</given-names></string-name> (<year>2008</year>). <article-title>Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions</article-title>. <source>Communications of the ACM</source>, <volume>51</volume>(<issue>1</issue>), <fpage>117</fpage>–<lpage>122</lpage>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_003">
<mixed-citation publication-type="chapter"><string-name><surname>Ankerst</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Breunig</surname>, <given-names>M.M.</given-names></string-name>, <string-name><surname>Kriegel</surname>, <given-names>H.P.</given-names></string-name>, <string-name><surname>Sander</surname>, <given-names>J.</given-names></string-name> (<year>1999</year>). <chapter-title>OPTICS: ordering points to identify the clustering structure</chapter-title>. In: <source>Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD’99</source>. <publisher-name>ACM</publisher-name>, <publisher-loc>New York</publisher-loc>, pp. <fpage>49</fpage>–<lpage>60</lpage>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_004">
<mixed-citation publication-type="book"><string-name><surname>Bache</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Lichman</surname>, <given-names>M.</given-names></string-name> (<year>2013</year>). <source>UCI Machine Learning Repository</source>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_005">
<mixed-citation publication-type="journal"><string-name><surname>Bentley</surname>, <given-names>J.L.</given-names></string-name> (<year>1975</year>). <article-title>Multidimensional binary search trees used for associative searching</article-title>. <source>Communications of the ACM</source>, <volume>18</volume>(<issue>9</issue>), <fpage>509</fpage>–<lpage>517</lpage>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_006">
<mixed-citation publication-type="chapter"><string-name><surname>Borah</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Bhattacharyya</surname>, <given-names>D.K.</given-names></string-name> (<year>2004</year>). <chapter-title>An improved sampling-based DBSCAN for large spatial databases</chapter-title>. In: <source>Proceedings of International Conference on Intelligent Sensing and Information Processing, 2004</source>, pp. <fpage>92</fpage>–<lpage>96</lpage>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_007">
<mixed-citation publication-type="chapter"><string-name><surname>Breunig</surname>, <given-names>M.M.</given-names></string-name>, <string-name><surname>Kriegel</surname>, <given-names>H.P.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>R.T.</given-names></string-name>, <string-name><surname>Sander</surname>, <given-names>J.</given-names></string-name> (<year>1999</year>). <chapter-title>OPTICS-OF: identifying local outliers</chapter-title>. In: <source>Proceedings of the Third European Conference on Principles of Data Mining and Knowledge Discovery. PKDD ’99</source>. <publisher-name>Springer-Verlag</publisher-name>, <publisher-loc>London, UK</publisher-loc>, pp. <fpage>262</fpage>–<lpage>270</lpage>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_008">
<mixed-citation publication-type="journal"><string-name><surname>Chen</surname>, <given-names>M.S.</given-names></string-name>, <string-name><surname>Han</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Yu</surname>, <given-names>P.S.</given-names></string-name> (<year>1996</year>). <article-title>Data mining: an overview from a database perspective</article-title>. <source>IEEE Transactions on Knowledge and Data Engineering</source>, <volume>8</volume>(<issue>6</issue>), <fpage>866</fpage>–<lpage>883</lpage>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_009">
<mixed-citation publication-type="journal"><string-name><surname>Collins</surname>, <given-names>L.M.</given-names></string-name>, <string-name><surname>Dent</surname>, <given-names>C.W.</given-names></string-name> (<year>1988</year>). <article-title>Omega: a general formulation of the rand index of cluster recovery suitable for non-disjoint solutions</article-title>. <source>Multivariate Behavioral Research</source>, <volume>23</volume>(<issue>2</issue>), <fpage>231</fpage>–<lpage>242</lpage>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_010">
<mixed-citation publication-type="chapter"><string-name><surname>Datar</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Immorlica</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Indyk</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Mirrokni</surname>, <given-names>V.S.</given-names></string-name> (<year>2004</year>). <chapter-title>Locality-sensitive hashing scheme based on <italic>p</italic>-stable distributions</chapter-title>. In: <source>Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG’04</source>. <publisher-name>ACM</publisher-name>, <publisher-loc>New York</publisher-loc>, pp. <fpage>253</fpage>–<lpage>262</lpage>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_011">
<mixed-citation publication-type="chapter"><string-name><surname>Ester</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Kriegel</surname>, <given-names>H.P.</given-names></string-name>, <string-name><surname>Sander</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Xu</surname>, <given-names>X.</given-names></string-name> (<year>1996</year>). <chapter-title>A density-based algorithm for discovering clusters in large spatial databases with noise</chapter-title>. In: <source>Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96</source>, pp. <fpage>226</fpage>–<lpage>231</lpage>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_012">
<mixed-citation publication-type="chapter"><string-name><surname>Gan</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Tao</surname>, <given-names>Y.</given-names></string-name> (<year>2015</year>). <chapter-title>DBSCAN revisited: mis-claim, un-fixability, and approximation</chapter-title>. In: <source>Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15</source>, pp. <fpage>519</fpage>–<lpage>530</lpage>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_013">
<mixed-citation publication-type="other"><string-name><surname>Gunawan</surname>, <given-names>A.</given-names></string-name> (<year>2013</year>). <source>A Faster Algorithm for DBSCAN</source>. Master’s thesis, Technical University of Eindhoven.</mixed-citation>
</ref>
<ref id="j_info1136_ref_014">
<mixed-citation publication-type="chapter"><string-name><surname>Li</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Heinis</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Luk</surname>, <given-names>W.</given-names></string-name> (<year>2016</year>). <chapter-title>Hashing-based approximate DBSCAN</chapter-title>. In: <source>Proceedings of the 20th European Conference on Advances in Databases and Information Systems, ADBIS’16</source>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_015">
<mixed-citation publication-type="chapter"><string-name><surname>Murray</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Carenini</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>R.</given-names></string-name> (<year>2012</year>). <chapter-title>Using the omega index for evaluating abstractive community detection</chapter-title>. In: <source>Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization</source>, pp. <fpage>10</fpage>–<lpage>18</lpage>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_016">
<mixed-citation publication-type="chapter"><string-name><surname>Patwary</surname>, <given-names>M.M.A.</given-names></string-name>, <string-name><surname>Satish</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Sundaram</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Manne</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Habib</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Dubey</surname>, <given-names>P.</given-names></string-name> (<year>2014</year>). <chapter-title>Pardicle: parallel approximate density-based clustering</chapter-title>. In: <source>SC14: International Conference for High Performance Computing, Networking, Storage and Analysis</source>, pp. <fpage>560</fpage>–<lpage>571</lpage>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_017">
<mixed-citation publication-type="chapter"><string-name><surname>Venetis</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Ailamaki</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Heinis</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Karpathiotakis</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Kherif</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Mitelpunkt</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Vassalos</surname>, <given-names>V.</given-names></string-name> (<year>2015</year>). <chapter-title>Towards the identification of disease signatures</chapter-title>. In: <source>Proceedings of the 8th International Conference on Brain Informatics and Health BIH</source>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_018">
<mixed-citation publication-type="chapter"><string-name><surname>Viswanath</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Pinkesh</surname>, <given-names>R.</given-names></string-name> (<year>2006</year>). <chapter-title>l-DBSCAN: a fast hybrid density based clustering method</chapter-title>. In: <source>18th International Conference on Pattern Recognition, ICPR’06</source>. Vol. <volume>1</volume>, pp. <fpage>912</fpage>–<lpage>915</lpage>.</mixed-citation>
</ref>
<ref id="j_info1136_ref_019">
<mixed-citation publication-type="chapter"><string-name><surname>Yeganeh</surname>, <given-names>S.H.</given-names></string-name>, <string-name><surname>Habibi</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Abolhassani</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Tehrani</surname>, <given-names>M.A.</given-names></string-name>, <string-name><surname>Esmaelnezhad</surname>, <given-names>J.</given-names></string-name> (<year>2009</year>). <chapter-title>An approximation algorithm for finding skeletal points for density based clustering approaches</chapter-title>. In: <source>2009 IEEE Symposium on Computational Intelligence and Data Mining</source>, pp. <fpage>403</fpage>–<lpage>410</lpage>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>