<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">INFORMATICA</journal-id>
<journal-title-group><journal-title>Informatica</journal-title></journal-title-group>
<issn pub-type="epub">1822-8844</issn><issn pub-type="ppub">0868-4952</issn><issn-l>0868-4952</issn-l>
<publisher>
<publisher-name>Vilnius University</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">INFOR621</article-id>
<article-id pub-id-type="doi">10.15388/26-INFOR621</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Research Article</subject></subj-group></article-categories>
<title-group>
<article-title>Context-Aware Word-Based Forward Coding</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-4826-5265</contrib-id>
<name><surname>Zavadskyi</surname><given-names>Igor</given-names></name><email xlink:href="ihorzavadskyi@knu.ua">ihorzavadskyi@knu.ua</email><xref ref-type="aff" rid="j_infor621_aff_001">1</xref><bio>
<p><bold>I. Zavadskyi</bold>, PhD, DSc, professor at the Department of Mathematical Informatics, Faculty of Computer Science and Cybernetics, Taras Shevchenko National University of Kyiv, Ukraine. He received his PhD in computer science from the National Academy of Sciences of Ukraine in 2001 and his DSc degree in 2020. He is the author of more than 100 research papers and 25 textbooks and instructional manuals in computer science. His current research interests include information and coding theory, data compression, database management systems, and e-learning technologies. Prof. Zavadskyi is a member of the programme committee of the <italic>Data Compression Conference</italic> (DCC).</p></bio>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-9478-3303</contrib-id>
<name><surname>Klein</surname><given-names>Shmuel T.</given-names></name><email xlink:href="tomi@cs.biu.ac.il">tomi@cs.biu.ac.il</email><xref ref-type="aff" rid="j_infor621_aff_002">2</xref><bio>
<p><bold>S.T. Klein</bold> received a BSc in mathematics and statistics from the Hebrew University of Jerusalem in 1978, an MSc degree in applied mathematics, and a PhD in computer science, both from the Weizmann Institute of Science in Rehovot, Israel, in 1981 and 1988, respectively. He then spent three years as a visiting assistant professor at the University of Chicago, and returned in 1990 to Israel, where he has been with the Department of Computer Science of Bar Ilan University in Ramat Gan, Israel.</p>
<p>Prof. Klein’s research focuses on text processing and, in particular, data compression. He has worked on two of the largest full-text information retrieval systems: the Responsa Project at Bar Ilan and the Trésor de la Langue Française at the University of Chicago. He has authored two books, more than a hundred refereed journal and conference papers, and thirteen patents. He is currently a professor emeritus at Bar Ilan University, where he has been the chair of the Computer Science Department for several years.</p></bio>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-2320-9064</contrib-id>
<name><surname>Shapira</surname><given-names>Dana</given-names></name><email xlink:href="shapird@g.ariel.ac.il">shapird@g.ariel.ac.il</email><xref ref-type="aff" rid="j_infor621_aff_003">3</xref><xref ref-type="corresp" rid="cor1">∗</xref><bio>
<p><bold>D. Shapira</bold> received the BSc (cum laude), MSc (magna cum laude), and PhD degrees in mathematics and computer science from Bar Ilan University, Israel, in 1993, 1996 and 2001, respectively. She was then a postdoctoral fellow with Brandeis University, Waltham, MA, USA. She is currently a professor at Ariel University, Israel. She has coauthored over 50 refereed articles in scientific journals and over 70 in conference proceedings, and has been granted 4 patents. Her research interests include data compression, algorithm development and analysis and scheduling theory. She has served as the head of the Department of Computer Sciences, and is currently the head of the MSc Degree Program. Prof. Shapira has been a member of the programme committee of the <italic>Data Compression Conference</italic> (DCC) since 2011.</p></bio>
</contrib>
<aff id="j_infor621_aff_001"><label>1</label>Department of Computer Science and Cybernetics, <institution>Taras Shevchenko National University of Kyiv</institution>, Kyiv, <country>Ukraine</country></aff>
<aff id="j_infor621_aff_002"><label>2</label>Department of Computer Science, <institution>Bar Ilan University</institution>, Ramat Gan 52900, <country>Israel</country></aff>
<aff id="j_infor621_aff_003"><label>3</label>Department of Computer Science, Data Science and Artificial Intelligence Research Center, <institution>Ariel University</institution>, Ariel 40700, <country>Israel</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2026</year></pub-date><pub-date pub-type="epub"><day>6</day><month>3</month><year>2026</year></pub-date><volume>37</volume><issue>1</issue><fpage>229</fpage><lpage>249</lpage><history><date date-type="received"><month>6</month><year>2025</year></date><date date-type="accepted"><month>2</month><year>2026</year></date></history>
<permissions><copyright-statement>© 2026 Vilnius University</copyright-statement><copyright-year>2026</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Forward-looking coding has recently been introduced as a source modelling paradigm that exploits predictions of forthcoming symbols. In this paper, we extend this methodology to word-level alphabets, enabling improved compression performance for large and variable-length symbol sets. We present a space-efficient scheme for encoding header information, with particular emphasis on the accurate representation of symbol frequency distributions. In addition, we propose an alternative ordering strategy for word-based dictionaries that leverages the adaptive nature of forward-looking compression. We further show how these techniques can be integrated with a word-based <italic>Prediction by Partial Matching</italic> model of order one, while avoiding the zero-frequency problem. Experimental results confirm the effectiveness of the proposed approach across multiple datasets.</p>
</abstract>
<kwd-group>
<label>Key words</label>
<kwd>data compression</kwd>
<kwd>adaptive coding</kwd>
<kwd>arithmetic coding</kwd>
<kwd>entropy</kwd>
<kwd>forward coding</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="j_infor621_s_001">
<label>1</label>
<title>Introduction</title>
<p>Lossless data compression constitutes a fundamental paradigm in information theory, enabling the reduction of the required size for data representation, while guaranteeing the exact reconstruction of the original input file. Unlike lossy methodologies, which approximate data to enhance compression ratios, lossless algorithms operate by identifying and eliminating redundancy without introducing distortion. Consequently, lossless data encoding remains a foundational component of modern storage and transmission.</p>
<p>The utility of lossless data compression has evolved beyond sequential archival storage to encompass random access functionality through the use of compact data structures, see Navarro (<xref ref-type="bibr" rid="j_infor621_ref_020">2016</xref>). By organizing compressed data into navigable forms, such as wavelet trees (see Grossi <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor621_ref_015">2003</xref> and Baruch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor621_ref_005">2020</xref>) or compressed suffix arrays (see Grossi and Vitter, <xref ref-type="bibr" rid="j_infor621_ref_014">2005</xref>), these data structures enable direct access and querying of the underlying information without the computational overhead of full decompression (e.g. Baruch <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor621_ref_004">2017</xref>, <xref ref-type="bibr" rid="j_infor621_ref_005">2020</xref>). This capability allows massive datasets to reside entirely within faster levels of the memory hierarchy, permitting algorithms to operate directly over the compressed representation.</p>
<p>The entropy of a memory-less source <italic>X</italic> is defined by the well-known Shannon formula <inline-formula id="j_infor621_ineq_001"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="normal">Σ</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo movablelimits="false">log</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${H_{0}}(X)=-{\textstyle\sum _{s\in \Sigma }}p(s)\log p(s)$]]></tex-math></alternatives></inline-formula>, where Σ denotes the alphabet and <inline-formula id="j_infor621_ineq_002"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$p(s)$]]></tex-math></alternatives></inline-formula> is the probability of the symbol <italic>s</italic> appearance. This quantity is widely used as a theoretical lower bound on the efficiency of <italic>semi-static</italic> zero-order compression methods, that is, compression schemes that rely solely on symbol frequencies, which are assumed to be constant throughout the entire file. If applicable, one can discard parts of the data to achieve compression ratios beyond the <inline-formula id="j_infor621_ineq_003"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${H_{0}}$]]></tex-math></alternatives></inline-formula> value, which results in a lossy compression process. Another way to overcome the Shannon limit without compromising on data loss is to design a more accurate model that reflects the actual data source and yields a lower entropy, e.g. to design an enhanced probabilistic modelling of the next encoded element. This implies the potential for further storage savings.</p>
<p>Common <italic>adaptive</italic> compression methods, like the dynamic Huffman algorithm due to Vitter (<xref ref-type="bibr" rid="j_infor621_ref_022">1987</xref>), update the model based on the data encountered in the processed portion of the input file. To predict the probability of the next element being encoded at the current position in the file, the model analyses the statistics of the elements that have appeared up to that point. However, in situations where the precise occurrence count of each element throughout the entire file is known, an alternative approach called <italic>Forward-Looking</italic> Adaptive Compression, proposed by Klein <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor621_ref_017">2020</xref>), can be utilized. This method adjusts the dynamic model based on predictions of what is yet to come, effectively peering into the future. This contrasts with traditional <italic>Backward-Looking</italic> dynamic methods, which develop their current model from past observations of what has already happened. Hence, forward-looking compression bridges semi-static and adaptive approaches, combining the advantages of both. The net encoding produced by forward-looking has been proven to be more efficient than a static (Huffman, <xref ref-type="bibr" rid="j_infor621_ref_016">1952</xref>) code, well known to be optimal, by at least <inline-formula id="j_infor621_ineq_004"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="normal">Σ</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$|\Sigma |-1$]]></tex-math></alternatives></inline-formula> bits, where <inline-formula id="j_infor621_ineq_005"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="normal">Σ</mml:mi>
<mml:mo stretchy="false">|</mml:mo></mml:math><tex-math><![CDATA[$|\Sigma |$]]></tex-math></alternatives></inline-formula> refers to the size of the alphabet (Klein <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor621_ref_017">2020</xref>, Theorem 2). This upper bound of the net encoding is smaller than all known dynamic variants to date.</p>
<p>We use the term “alphabet” in a broader interpretation than just individual letters. It includes various types of substrings of different lengths. The crucial factor is that a precise procedure should be followed to parse the input text into a well-defined sequence of elements within the alphabet. When dealing with natural language texts, using <italic>words</italic> as the elements of the alphabet often leads to significantly better compression than using single characters alone, see Moffat (<xref ref-type="bibr" rid="j_infor621_ref_019">1989</xref>). The first contribution of this paper is to extend the forward-looking encoding method to alphabets of variable-length strings, instead of just characters. This involves increasing the size of the alphabet, which results in further savings to the net encoding due to the upper bound performance of the forward method being a factor of the alphabet size.</p>
<p>In <italic>Information Retrieval</italic>, it is often assumed that the list of distinct words is necessary for the retrieval process and thus already stored. Therefore, implementing the word-based compression technique incurs no additional overhead except for the list of word frequencies. However, storing exact symbol frequencies for the entire alphabet in the header may incur significant overhead, making the method impractical, especially for large alphabets considered here. This drawback is even more significant when the list of distinct words does not exist in advance, e.g. in file archiving.</p>
<p>An enhanced forward-looking approach that further improves the net encoding has been proposed in Fruchtman <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor621_ref_012">2022</xref>). It assigns increased weights to positions in the file that are closer to the currently processed element, rather than treating all positions equally. Yet, the overhead arising from the encoded file’s prelude is even more expensive than that of forward-looking and led to the adoption of the Backward-Looking heuristic approach in Fruchtman <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor621_ref_011">2021</xref>), which compromises on provable optimality guarantees. The current paper’s second contribution is an alternative solution for encoding headers of statistical methods that include the exact frequencies of the alphabet elements. It allows us to compress such a header into a tiny portion of the archived text.</p>
<p>A popular method of compressing a sequence of distinct words is by ordering them lexicographically and applying the <italic>Prefix Omission Method</italic> (<sans-serif>POM</sans-serif>) by Bratley and Choueka (<xref ref-type="bibr" rid="j_infor621_ref_006">1982</xref>). Our third contribution improves the compression ratio by replacing some low-frequency words in the text with meta-characters of higher frequency. This is made possible by using an alternative order for word alphabets. Experiments show that our method provides a competitive alternative to dictionary compression using specialized techniques such as <sans-serif>POM</sans-serif>.</p>
<p>The compression technique known as <italic>Prediction by Partial Matching</italic> (<sans-serif>PPM</sans-serif>) by Cleary and Witten (<xref ref-type="bibr" rid="j_infor621_ref_008">1984</xref>) enhances the compression efficiency by exploiting dependencies in the data via contextual models. <sans-serif>PPM</sans-serif> considers a variable-length history of symbols to capture dependencies in the data and makes more precise predictions about the following character based on the context. Our fourth contribution is a <sans-serif>PPM</sans-serif> of order-1 algorithm for a word-based alphabet, which ‘looks into the future’, unlike the work of Avrunin <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor621_ref_003">2022</xref>), which combines future-based forward-looking with the past-based <sans-serif>PPM</sans-serif>. The most interesting theoretical contribution of the proposed algorithm that combines <sans-serif>PPM</sans-serif> with forward-looking compression is that, in contrast with traditional backward <sans-serif>PPM</sans-serif>, there is no zero-frequency problem in the forward compression, as the probability of the special ‘escape’ symbol, which is used to indicate a shorter context, can be calculated precisely.</p>
<p>The following summarizes the contributions of this paper.</p>
<list>
<list-item id="j_infor621_li_001">
<label>1.</label>
<p>Extending the forward-looking encoding method to big alphabets, including variable-length strings, rather than just characters, leveraging the larger alphabet to amplify forward-looking gains.</p>
</list-item>
<list-item id="j_infor621_li_002">
<label>2.</label>
<p>A compact header encoding method for storing word frequencies.</p>
</list-item>
<list-item id="j_infor621_li_003">
<label>3.</label>
<p>Improving compression by selectively mapping low-frequency words to higher-frequency meta-characters.</p>
</list-item>
<list-item id="j_infor621_li_004">
<label>4.</label>
<p>A forward-looking order-1 word-based <sans-serif>PPM</sans-serif> adaptation.</p>
</list-item>
</list>
<p>This work is an extended version of a paper that has been presented in 2024 at the Data Compression Conference, DCC, and appeared in its proceedings in Zavadskyi <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor621_ref_024">2024</xref>). The additional contributions over those in the conference paper are:</p>
<list>
<list-item id="j_infor621_li_005">
<label>1.</label>
<p>The algorithms in Section <xref rid="j_infor621_s_004">4</xref> have been simplified.</p>
</list-item>
<list-item id="j_infor621_li_006">
<label>2.</label>
<p>In the newly added Section <xref rid="j_infor621_s_005">5</xref>, the context-aware word-based forward-looking paradigm has been adapted to the <sans-serif>PPM</sans-serif> compression algorithm.</p>
</list-item>
<list-item id="j_infor621_li_007">
<label>3.</label>
<p>The experimental results presented in Section <xref rid="j_infor621_s_006">6</xref> have been extended.</p>
</list-item>
</list>
<p>Our paper is organized as follows. Section <xref rid="j_infor621_s_002">2</xref> recalls the details of forward-looking compression. Section <xref rid="j_infor621_s_003">3</xref> presents an approach for encoding frequency data, which constitutes a significant part of the compressed file’s header overhead. Section <xref rid="j_infor621_s_004">4</xref> provides the word-based forward-looking algorithm featured with the low-frequency words replacement technique, Section <xref rid="j_infor621_s_005">5</xref> suggests a word-based <sans-serif>PPM</sans-serif> forward-looking algorithm, Section <xref rid="j_infor621_s_006">6</xref> presents experimental results, and finally, Section <xref rid="j_infor621_s_009">7</xref> concludes the paper.</p>
</sec>
<sec id="j_infor621_s_002">
<label>2</label>
<title>Forward-Looking Encoding</title>
<p>As mentioned above, traditional adaptive compression algorithms concentrate solely on the item currently processed, increasing its frequency after each occurrence, and thereby potentially leading to shorter codewords for future use. These savings may, however, result in the elongation of certain other codewords. An alternative adaptive <italic>forward-looking</italic> approach has been proposed, assuming the exact frequencies of each element throughout the entire file are available, possibly by a pre-scan of the source file. It enhances the conventional ‘self-centered’ behaviour by adopting a more ‘collaborative’ strategy. In this approach, the frequency of the currently processed item is <italic>decremented</italic>, even if this results in the associated codeword becoming longer. This operation, however, has the potential to reduce the average codeword length for the remaining elements, resulting in more efficient overall space utilization.</p>
<p>The <italic>forward-looking</italic> paradigm can be applied to any statistical encoding algorithm. In Klein <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor621_ref_017">2020</xref>), the forward-looking method is presented in the context of Huffman coding, whereas in Fruchtman <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor621_ref_013">2023</xref>), the forward-looking algorithm also considers arithmetic coding. The latter work provides <italic>bidirectional</italic> adaptive compression, taking both past and future into account, and provably improves on the net encoding of the forward-looking variant.</p>
<fig id="j_infor621_fig_001">
<label>Algorithm 1</label>
<caption>
<p><sc>Generic forward-looking encoding</sc></p>
</caption>
<graphic xlink:href="infor621_g001.jpg"/>
</fig>
<p>Algorithm <xref rid="j_infor621_fig_001">1</xref> presents the generic <italic>forward-looking</italic> modelling method suited to all statistical encodings. As any semi-static compression method, this algorithm requires two passes over the text. In line 1, the text <italic>T</italic> is scanned in a preprocessing stage in order to gather the underlying statistics, storing the frequency <inline-formula id="j_infor621_ineq_006"><alternatives><mml:math>
<mml:mtext mathvariant="sans-serif">freq</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">σ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textsf{freq}(\sigma )$]]></tex-math></alternatives></inline-formula> for each element <italic>σ</italic> of the alphabet Σ. In line 2, these statistics are transmitted to the decoder as part of the compressed file’s prelude, and are used as the model by both the encoder and decoder (line 3). In the processing stage, the text <inline-formula id="j_infor621_ineq_007"><alternatives><mml:math>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$T={x_{1}},\dots ,{x_{n}}$]]></tex-math></alternatives></inline-formula> is rescanned. The element <inline-formula id="j_infor621_ineq_008"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{i}}$]]></tex-math></alternatives></inline-formula> is encoded, its frequency is decremented by 1 and the model is updated. In particular, in case the weight of <italic>σ</italic> becomes 0, the corresponding element is removed from the model. For simplicity, from now onward, we assume that the ‘<sans-serif>update the model</sans-serif>’ statement also removes zero-frequency elements from the model. The generic <sans-serif>Forward-Looking-Decode</sans-serif> algorithm is straightforward and therefore omitted. The time complexity of both preprocessing and encoding passes of Algorithm <xref rid="j_infor621_fig_001">1</xref> is <inline-formula id="j_infor621_ineq_009"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(n)$]]></tex-math></alternatives></inline-formula>, where <italic>n</italic> is the text length. Similarly, the decoding process executes in <inline-formula id="j_infor621_ineq_010"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(n)$]]></tex-math></alternatives></inline-formula> time, as it requires a single pass to parse the frequency header and reconstruct the original text <italic>T</italic> from the compressed stream.</p>
</sec>
<sec id="j_infor621_s_003">
<label>3</label>
<title>Word Frequency List Compression</title>
<p>As is known, the word frequencies in a natural language text may be modelled by Zipf’s law, see Zipf (<xref ref-type="bibr" rid="j_infor621_ref_025">1949</xref>). It implies that in a large enough text, the frequency of the most frequent word is much larger than the frequency of the second frequent word, which, in turn, is much larger than the frequency of the following word and so on. The decay being logarithmic; at the bottom of a table of frequencies sorted by non-ascending order, there are many words of frequency 1, many of frequency 2, etc. Considering this specificity, we propose the following compact representation for frequency lists.</p>
<p>Split the list into two parts according to an integer parameter <italic>t</italic>. For the frequencies greater than <italic>t</italic>, apply delta encoding; for all natural numbers less than or equal to <italic>t</italic>, store the number of words having a frequency equal to this number. This way, we get a sequence consisting of many small numbers, which can be efficiently encoded using some universal code, e.g. one of those defined by Elias (<xref ref-type="bibr" rid="j_infor621_ref_009">1975</xref>) or the <italic>Reverse Multi-Delimiter</italic> (RMD) codes defined in Zavadskyi and Anisimov (<xref ref-type="bibr" rid="j_infor621_ref_023">2020</xref>). Given a list of unique words in the text in non-increasing frequency order, we can easily reconstruct the frequencies of all words from that compressed number sequence.</p>
<p>Let <inline-formula id="j_infor621_ineq_011"><alternatives><mml:math>
<mml:mi mathvariant="script">E</mml:mi></mml:math><tex-math><![CDATA[$\mathcal{E}$]]></tex-math></alternatives></inline-formula> denote an encoding function employing any universal code. Algorithm <xref rid="j_infor621_fig_002">2</xref> describes the compression of a non-increasing input frequency list <inline-formula id="j_infor621_ineq_012"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${f_{1}},\dots ,{f_{m}}$]]></tex-math></alternatives></inline-formula>, given a threshold <italic>t</italic>. The dot sign · is used for concatenation. The threshold <italic>t</italic> can be adjusted to minimize the space occupied by the encoded sequence <inline-formula id="j_infor621_ineq_013"><alternatives><mml:math>
<mml:mi mathvariant="script">L</mml:mi></mml:math><tex-math><![CDATA[$\mathcal{L}$]]></tex-math></alternatives></inline-formula>. Note that changing <italic>t</italic> a little requires recalculating only a few values of <inline-formula id="j_infor621_ineq_014"><alternatives><mml:math>
<mml:mi mathvariant="script">L</mml:mi></mml:math><tex-math><![CDATA[$\mathcal{L}$]]></tex-math></alternatives></inline-formula>, which simplifies the search. Our experiments show that the optimal value <italic>t</italic> is close to the fixed point of the frequencies <inline-formula id="j_infor621_ineq_015"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${f_{i}}$]]></tex-math></alternatives></inline-formula>, that is, the index <italic>i</italic> such that <inline-formula id="j_infor621_ineq_016"><alternatives><mml:math>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">≃</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$i\simeq {f_{i}}$]]></tex-math></alternatives></inline-formula>, which can be a good starting position to search for this optimal value. The symmetrical decoding algorithm, <sans-serif>Freq-List-Decode</sans-serif>, is straightforward.</p>
<fig id="j_infor621_fig_002">
<label>Algorithm 2</label>
<caption>
<p><sc>Frequency List Encoding</sc></p>
</caption>
<graphic xlink:href="infor621_g002.jpg"/>
</fig>
<p>As an example, let us consider <italic>Alice in Wonderland</italic> by Lewis Carroll. The words <italic>w</italic> with frequencies <sans-serif>freq(</sans-serif><italic>w</italic><sans-serif>)</sans-serif> greater than 49 are given in Table <xref rid="j_infor621_tab_001">1</xref>. The optimal threshold value is <inline-formula id="j_infor621_ineq_017"><alternatives><mml:math>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>55</mml:mn></mml:math><tex-math><![CDATA[$t=55$]]></tex-math></alternatives></inline-formula>. The upper part of the table refers to words with frequencies larger than 55, while the lower part of the table refers to those with frequencies 55 or smaller. The values being encoded are shown in rows <italic>δ</italic> and <sans-serif>mult(</sans-serif><italic>f</italic><sans-serif>)</sans-serif>. In the upper part of the table the differences, <italic>δ</italic>, between consecutive original values are calculated, starting with the maximum value, 1505, in this example. In the lower part of the table, we store <inline-formula id="j_infor621_ineq_018"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">mult</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathsf{mult}(f)$]]></tex-math></alternatives></inline-formula> – the multiplicity of the frequency <italic>f</italic>, which is the number of words having the frequency <italic>f</italic>.</p>
<table-wrap id="j_infor621_tab_001">
<label>Table 1</label>
<caption>
<p>Alice in Wonderland, word frequencies.</p>
</caption>
<graphic xlink:href="infor621_g003.jpg"/>
</table-wrap>
<p>In total, for 5 311 unique words in this text, there are only 112 encoded elements in the sequence <inline-formula id="j_infor621_ineq_019"><alternatives><mml:math>
<mml:mi mathvariant="script">L</mml:mi></mml:math><tex-math><![CDATA[$\mathcal{L}$]]></tex-math></alternatives></inline-formula> compressed into 81 bytes using one of the RMD codes, <inline-formula id="j_infor621_ineq_020"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mtext>-</mml:mtext>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${R_{2\text{-}\infty }}$]]></tex-math></alternatives></inline-formula> (see Zavadskyi and Anisimov, <xref ref-type="bibr" rid="j_infor621_ref_023">2020</xref>). In comparison, the sequence of frequencies of words in the lexicographically sorted dictionary of this text can not be compressed to less than 1 900 bytes with the <sans-serif>bzip2</sans-serif> archiver, and to less than 370 bytes, if the dictionary elements are sorted in order of descending frequencies.</p>
<p>Since the frequency list is given in a sorted order, Algorithm <xref rid="j_infor621_fig_002">2</xref> requires <inline-formula id="j_infor621_ineq_021"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(m)$]]></tex-math></alternatives></inline-formula> time, and <inline-formula id="j_infor621_ineq_022"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(n+m)=O(n)$]]></tex-math></alternatives></inline-formula> time together with the forward encoding Algorithm <xref rid="j_infor621_fig_001">1</xref>, where <italic>m</italic> is the number of distinct frequencies of symbols, and <italic>n</italic> is the length of the text. Our method of frequency processing is faster than POM, as it does not require lexicographical dictionary sorting and time-consuming string processing.</p>
</sec>
<sec id="j_infor621_s_004">
<label>4</label>
<title>A Word-Based Forward-Looking Algorithm</title>
<p>The order of the words in the dictionary contains information that can be used during the compression. We showed above that sorting the dictionary elements by non-increasing frequency helped us reduce the size of the list of compressed frequency values. In addition to sorting the dictionary by frequencies, let us sort the words of the same frequency in the order of the leftmost (first) occurrence in the text. Then, if the encoder replaces all leftmost occurrences of words of the same frequency <italic>f</italic> with some special <italic>meta-character</italic> <inline-formula id="j_infor621_ineq_023"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">MC</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathsf{MC}(f)$]]></tex-math></alternatives></inline-formula> (different for every frequency <italic>f</italic>), the decoder can replace such symbols with words of the respective frequencies, taking them from the dictionary one by one. If the frequency of the meta-character is larger than <italic>f</italic>, this technique may enhance the compression performance, as it replaces some terms of frequency <italic>f</italic> with more frequent ones.</p>
<p>Let us consider the following example text to clarify this definition: 
<disp-formula id="j_infor621_eq_001">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd"/>
<mml:mtd class="align-even">
<mml:mtext mathvariant="monospace">The dog is dog, and the cat is cat, and mice, oh</mml:mtext>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd"/>
<mml:mtd class="align-even">
<mml:mtext mathvariant="monospace">all the mice, know that</mml:mtext>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}& \texttt{The dog is dog, and the cat is cat, and mice, oh},\\ {} & \texttt{all the mice, know that},\end{aligned}\]]]></tex-math></alternatives>
</disp-formula> 
not taking into account capitalization and punctuation marks. The dictionary of words, sorted as described above, is the following (frequencies are given in brackets): 
<disp-formula id="j_infor621_eq_002">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd"/>
<mml:mtd class="align-even">
<mml:mtext mathvariant="monospace">the</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mtext mathvariant="monospace">dog</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mtext mathvariant="monospace">is</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mtext mathvariant="monospace">and</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mtext mathvariant="monospace">cat</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mtext mathvariant="monospace">mice</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mtext mathvariant="monospace">oh</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mtext mathvariant="monospace">all</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd"/>
<mml:mtd class="align-even">
<mml:mtext mathvariant="monospace">know</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mtext mathvariant="monospace">that</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}& \texttt{the}(3),\texttt{dog}(2),\texttt{is}(2),\texttt{and}(2),\texttt{cat}(2),\texttt{mice}(2),\texttt{oh}(1),\texttt{all}(1),\\ {} & \texttt{know}(1),\texttt{that}(1).\end{aligned}\]]]></tex-math></alternatives>
</disp-formula> 
We obtain four words of frequency 1, five words of frequency 2, and a single word of frequency 3. We suggest extending the alphabet of words to include the meta characters <inline-formula id="j_infor621_ineq_024"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mi mathvariant="sans-serif">MC</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="sans-serif">MC</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{\mathsf{MC}(1),\mathsf{MC}(2)\}$]]></tex-math></alternatives></inline-formula> with frequencies 4 and 5, respectively. <sans-serif>MC</sans-serif>(1) replaces all words of frequency 1, while <sans-serif>MC</sans-serif>(2) replaces all leftmost occurrences of words of frequency 2. As a result, the encoder receives the following preprocessed text: 
<disp-formula id="j_infor621_eq_003">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd"/>
<mml:mtd class="align-even">
<mml:mtext mathvariant="monospace">the</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mtext mathvariant="sans-serif">MC</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mspace width="2.5pt"/>
<mml:mtext mathvariant="sans-serif">MC</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mspace width="2.5pt"/>
<mml:mtext mathvariant="monospace">dog</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mtext mathvariant="sans-serif">MC</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mspace width="2.5pt"/>
<mml:mtext mathvariant="monospace">the</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mtext mathvariant="sans-serif">MC</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mspace width="2.5pt"/>
<mml:mtext mathvariant="monospace">is cat</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mtext mathvariant="monospace">and</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mtext mathvariant="sans-serif">MC</mml:mtext>
<mml:mtext mathvariant="monospace">(2)</mml:mtext>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd"/>
<mml:mtd class="align-even">
<mml:mtext mathvariant="sans-serif">MC</mml:mtext>
<mml:mtext mathvariant="monospace">(1)</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mtext mathvariant="sans-serif">MC</mml:mtext>
<mml:mtext mathvariant="monospace">(1) the mice</mml:mtext>
<mml:mspace width="5pt"/>
<mml:mtext mathvariant="sans-serif">MC</mml:mtext>
<mml:mtext mathvariant="monospace">(1)</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mtext mathvariant="sans-serif">MC</mml:mtext>
<mml:mtext mathvariant="monospace">(1)</mml:mtext>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}& \texttt{the}\hspace{2.5pt}\textsf{MC}(2)\hspace{2.5pt}\textsf{MC}(2)\hspace{2.5pt}\texttt{dog}\hspace{2.5pt}\textsf{MC}(2)\hspace{2.5pt}\texttt{the}\hspace{2.5pt}\textsf{MC}(2)\hspace{2.5pt}\texttt{is cat}\hspace{2.5pt}\texttt{and}\hspace{2.5pt}\textsf{MC}\texttt{(2)}\\ {} & \textsf{MC}\texttt{(1)}\hspace{2.5pt}\textsf{MC}\texttt{(1) the mice}\hspace{5pt}\textsf{MC}\texttt{(1)}\hspace{2.5pt}\textsf{MC}\texttt{(1)}.\end{aligned}\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>The decoder, reading the first meta-character <sans-serif>MC</sans-serif>(2), starts processing the list of words of frequency 2 in the dictionary and outputs <monospace>dog</monospace>. The second meta-character <sans-serif>MC</sans-serif>(2) corresponds to the next word in the list, i.e. <monospace>is</monospace>, and so on. Note that using <sans-serif>MC</sans-serif>(1) eliminates all words with frequency 1, while <sans-serif>MC</sans-serif>(2) reduces all frequencies 2 by 1. Thus, the alphabet and corresponding probabilities used by the encoder are 
<disp-formula id="j_infor621_eq_004">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd"/>
<mml:mtd class="align-even">
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">{</mml:mo>
<mml:mtext mathvariant="monospace">the</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>17</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mtext mathvariant="monospace">dog</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>17</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mtext mathvariant="monospace">is</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>17</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mtext mathvariant="monospace">and</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>17</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mtext mathvariant="monospace">cat</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>17</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mtext mathvariant="monospace">mice</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>17</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd"/>
<mml:mtd class="align-even">
<mml:mspace width="1em"/>
<mml:mtext mathvariant="sans-serif">MC</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mspace width="2.5pt"/>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>17</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mtext mathvariant="sans-serif">MC</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mspace width="2.5pt"/>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>17</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">}</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}& \big\{\texttt{the}\hspace{2.5pt}(3/17),\texttt{dog}\hspace{2.5pt}(1/17),\texttt{is}\hspace{2.5pt}(1/17),\texttt{and}\hspace{2.5pt}(1/17),\texttt{cat}\hspace{2.5pt}(1/17),\texttt{mice}\hspace{2.5pt}(1/17),\\ {} & \hspace{1em}\textsf{MC}(1)\hspace{2.5pt}(4/17),\textsf{MC}(2)\hspace{2.5pt}(5/17)\big\}.\end{aligned}\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>To evaluate the savings of this replacement method, we calculate the estimated size of the text as the sum of the <italic>information-contents</italic>, <inline-formula id="j_infor621_ineq_025"><alternatives><mml:math>
<mml:mo>−</mml:mo>
<mml:mo movablelimits="false">log</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$-\log {p_{i}}$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_infor621_ineq_026"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{i}}$]]></tex-math></alternatives></inline-formula> is the probability of occurrence of the <italic>i</italic>-th word, of all the words in the text: for the obtained text this equals 45.12 bits, while for the original text, the sum is 54.73 bits. This estimated size may be achieved if arithmetic coding is used as a compression method.</p>
<p>In fact, the proposed replacement method is a generalization of the idea presented by Adiego <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor621_ref_001">2009</xref>) of using a special character to encode singleton words. It can be applied to any encoding where frequencies are known in advance, either static (as shown in the example) or adaptive forward encoding, implemented in Algorithm <xref rid="j_infor621_fig_003">3</xref>. But before formalizing the encoding algorithm, let us derive the criterion of expedience of the replacement of words with meta-characters in the general case of static compression.</p><statement id="j_infor621_stat_001"><label>Theorem 1.</label>
<p><italic>Let f be some frequency with multiplicity k, that is, the text T contains k symbols of frequency f. Let</italic> <inline-formula id="j_infor621_ineq_027"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${T^{\prime }}$]]></tex-math></alternatives></inline-formula> <italic>be the text obtained from T by replacing each leftmost occurrence of a symbol having frequency f with a meta-character</italic> <inline-formula id="j_infor621_ineq_028"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">MC</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathsf{MC}(f)$]]></tex-math></alternatives></inline-formula><italic>. Then,</italic> 
<disp-formula id="j_infor621_eq_005">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">)</mml:mo>
<mml:mspace width="1em"/>
<mml:mtext>if and only if</mml:mtext>
<mml:mspace width="1em"/>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {H_{0}}(T)\gt {H_{0}}\big({T^{\prime }}\big)\hspace{1em}\text{if and only if}\hspace{1em}k{(f-1)^{f-1}}\gt {f^{f}}.\]]]></tex-math></alternatives>
</disp-formula>
</p></statement><statement id="j_infor621_stat_002"><label>Proof.</label>
<p>The meta-character <inline-formula id="j_infor621_ineq_029"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">MC</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathsf{MC}(f)$]]></tex-math></alternatives></inline-formula> has frequency <italic>k</italic> and probability <inline-formula id="j_infor621_ineq_030"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi></mml:math><tex-math><![CDATA[$k/n$]]></tex-math></alternatives></inline-formula>, where <italic>n</italic> is the number of symbols in the text <italic>T</italic>. As the compression performance of arithmetic coding can be approximated by the entropy, we express the space usage of an individual element <italic>s</italic> with probability <inline-formula id="j_infor621_ineq_031"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$p(s)$]]></tex-math></alternatives></inline-formula> by its <italic>information content</italic>, <inline-formula id="j_infor621_ineq_032"><alternatives><mml:math>
<mml:mo>−</mml:mo>
<mml:mo movablelimits="false">log</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$-\log p(s)$]]></tex-math></alternatives></inline-formula>. Thus, replacing each leftmost occurrence of a symbol with its meta-character <italic>saves</italic> 
<disp-formula id="j_infor621_eq_006">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false">log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false">log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false">log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {\log _{2}}\frac{k}{n}-{\log _{2}}\frac{f}{n}={\log _{2}}\frac{k}{f}\]]]></tex-math></alternatives>
</disp-formula> 
bits of space, or <inline-formula id="j_infor621_ineq_033"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false">log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$k{\log _{2}}(k/f)$]]></tex-math></alternatives></inline-formula> bits in total. However, if <inline-formula id="j_infor621_ineq_034"><alternatives><mml:math>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$f\gt 1$]]></tex-math></alternatives></inline-formula>, all <inline-formula id="j_infor621_ineq_035"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$k(f-1)$]]></tex-math></alternatives></inline-formula> non-leftmost occurrences of symbols of frequency <italic>f</italic> become less frequent by 1. Thus, the replacement causes increasing the length of the code of these symbols by 
<disp-formula id="j_infor621_eq_007">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal" fence="true" maxsize="2.03em" minsize="2.03em">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false">log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false">log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo mathvariant="normal" fence="true" maxsize="2.03em" minsize="2.03em">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false">log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ k(f-1)\bigg({\log _{2}}\frac{f}{n}-{\log _{2}}\frac{f-1}{n}\bigg)=k(f-1){\log _{2}}\frac{f}{f-1}\]]]></tex-math></alternatives>
</disp-formula> 
bits. In total, the described replacement technique allows us to save 
<disp-formula id="j_infor621_eq_008">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo mathvariant="normal" fence="true" maxsize="2.03em" minsize="2.03em">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false">log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>−</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false">log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo mathvariant="normal" fence="true" maxsize="2.03em" minsize="2.03em">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false">log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ k\bigg({\log _{2}}\frac{k}{f}-(f-1){\log _{2}}\frac{f}{f-1}\bigg)=k{\log _{2}}\frac{k{(f-1)^{f-1}}}{{f^{f}}}\]]]></tex-math></alternatives>
</disp-formula> 
bits of space, given <italic>k</italic> words of frequency <italic>f</italic>. This value is positive, if and only if the argument of the log is larger than 1, that is, <inline-formula id="j_infor621_ineq_036"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$k{(f-1)^{f-1}}\gt {f^{f}}$]]></tex-math></alternatives></inline-formula>.  □</p></statement>
<p>If, for a given frequency <italic>f</italic>, the condition of Theorem <xref rid="j_infor621_stat_001">1</xref> is satisfied, <italic>f</italic> is inserted into a replacement frequency set <inline-formula id="j_infor621_ineq_037"><alternatives><mml:math>
<mml:mi mathvariant="script">R</mml:mi></mml:math><tex-math><![CDATA[$\mathcal{R}$]]></tex-math></alternatives></inline-formula>, as shown in line 3 of Algorithm <xref rid="j_infor621_fig_003">3</xref>. In adaptive encoding, savings may be a bit different than <inline-formula id="j_infor621_ineq_038"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false">log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub><mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
</mml:mstyle></mml:math><tex-math><![CDATA[$k{\log _{2}}\frac{k{(f-1)^{f-1}}}{{f^{f}}}$]]></tex-math></alternatives></inline-formula>. However, experiments we performed on real-world texts show that a list of replacement frequencies obtained by the criterion <inline-formula id="j_infor621_ineq_039"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$k{(f-1)^{f-1}}\gt {f^{f}}$]]></tex-math></alternatives></inline-formula>, appears to be optimal for the adaptive forward encoding as well.</p>
<fig id="j_infor621_fig_003">
<label>Algorithm 3</label>
<caption>
<p><sc>Word-based Forward Encoding with Word Replacement</sc></p>
</caption>
<graphic xlink:href="infor621_g004.jpg"/>
</fig>
<p>In Algorithms <xref rid="j_infor621_fig_003">3</xref> and <xref rid="j_infor621_fig_004">4</xref>, we denote the frequency of a word <italic>w</italic> by <inline-formula id="j_infor621_ineq_040"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathsf{freq}(w)$]]></tex-math></alternatives></inline-formula>. By <inline-formula id="j_infor621_ineq_041"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">MC</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathsf{MC}(x)$]]></tex-math></alternatives></inline-formula>, we denote the meta-character corresponding to frequency <italic>x</italic>. In Algorithm <xref rid="j_infor621_fig_003">3</xref>, we get the set of different frequencies (line 1) and form the replacement sets of frequencies <inline-formula id="j_infor621_ineq_042"><alternatives><mml:math>
<mml:mi mathvariant="script">R</mml:mi></mml:math><tex-math><![CDATA[$\mathcal{R}$]]></tex-math></alternatives></inline-formula> and words having these frequencies <inline-formula id="j_infor621_ineq_043"><alternatives><mml:math>
<mml:mi mathvariant="script">S</mml:mi></mml:math><tex-math><![CDATA[$\mathcal{S}$]]></tex-math></alternatives></inline-formula> in lines 2 and 3. Since the leftmost occurrences of words from <inline-formula id="j_infor621_ineq_044"><alternatives><mml:math>
<mml:mi mathvariant="script">S</mml:mi></mml:math><tex-math><![CDATA[$\mathcal{S}$]]></tex-math></alternatives></inline-formula> are replaced with meta-characters, we decrement the frequencies of these words (line 4) and compensate for this reduction by assigning appropriate multiplicities to the meta-character frequencies (line 5). In lines 6–7, we initialize the model with the frequencies of regular words and meta-characters. The encoding algorithm itself is quite straightforward: for each next word <italic>w</italic>, if <italic>w</italic> is seen for the first time and belongs to the replacement set, we substitute it with the corresponding meta-character (lines 10–11). In lines 12–14, we encode <italic>w</italic>, decrement its frequency and update the model, in accordance with the generic forward-looking encoding Algorithm <xref rid="j_infor621_fig_001">1</xref>.</p>
<p>To ensure the correctness of the encoding algorithm, in line 12, we have to encode <italic>w</italic> with respect to its forward probability <inline-formula id="j_infor621_ineq_045"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$p(w)=\mathsf{freq}(w)/(n-i+1)$]]></tex-math></alternatives></inline-formula>, where <italic>i</italic> is the current text position. As at each text position, we decrease some frequency by 1 (line 13), <inline-formula id="j_infor621_ineq_046"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="script">M</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\textstyle\sum _{w\in \mathcal{M}}}p(w)$]]></tex-math></alternatives></inline-formula> is always 1, which proves that the model built by Algorithm <xref rid="j_infor621_fig_003">3</xref> is correct.</p>
<fig id="j_infor621_fig_004">
<label>Algorithm 4</label>
<caption>
<p><sc>Word-based Forward Decoding with Word Replacement</sc></p>
</caption>
<graphic xlink:href="infor621_g005.jpg"/>
</fig>
<p>The reversed decoding algorithm presented in Algorithm <xref rid="j_infor621_fig_004">4</xref> receives the encoded text <inline-formula id="j_infor621_ineq_047"><alternatives><mml:math>
<mml:mi mathvariant="script">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathcal{E}(T)$]]></tex-math></alternatives></inline-formula> and the dictionary <sans-serif>D</sans-serif> as arguments. We assume <sans-serif>D</sans-serif> is sorted in order of word frequency, where every group of words of frequency belonging to the replacement set <inline-formula id="j_infor621_ineq_048"><alternatives><mml:math>
<mml:mi mathvariant="script">R</mml:mi></mml:math><tex-math><![CDATA[$\mathcal{R}$]]></tex-math></alternatives></inline-formula> is second-level sorted by the position of the leftmost occurrence in the text <italic>T</italic>. In line 1, the list of word frequencies, encoded as part of the dictionary, is decompressed by applying the decoding algorithm, <sans-serif>Freq-List-Decode</sans-serif><inline-formula id="j_infor621_ineq_049"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="sans-serif">D</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(\mathsf{D})$]]></tex-math></alternatives></inline-formula>, for reversing the encoding of Algorithm <xref rid="j_infor621_fig_002">2</xref>. We maintain the array <inline-formula id="j_infor621_ineq_050"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">J</mml:mi></mml:math><tex-math><![CDATA[$\mathsf{J}$]]></tex-math></alternatives></inline-formula> of indices of replaced words that maps each frequency <italic>f</italic> to the list of words with frequency <italic>f</italic> in the dictionary. In line 8, <inline-formula id="j_infor621_ineq_051"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">J</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathsf{J}(f)$]]></tex-math></alternatives></inline-formula> is initialized by pointing to the first word with frequency <italic>f</italic>, which is the leftmost occurrence in <italic>T</italic> of those words. An auxiliary array <sans-serif>G</sans-serif><inline-formula id="j_infor621_ineq_052"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[w]$]]></tex-math></alternatives></inline-formula> is initialized in line 3 and used in line 12 to return the initial frequency corresponding to a meta-character <italic>w</italic>. The word itself is recovered in line 13 using the dictionary <sans-serif>D</sans-serif>, and the pointer is advanced to the following word associated with the same meta-character in line 14.</p>
<p>The time complexity of both encoding and decoding Algorithms <xref rid="j_infor621_fig_003">3</xref> and <xref rid="j_infor621_fig_004">4</xref> is, obviously, <inline-formula id="j_infor621_ineq_053"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(n)$]]></tex-math></alternatives></inline-formula>, where <italic>n</italic> is the length of the text. It is better than the worst-case encoding time <inline-formula id="j_infor621_ineq_054"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo movablelimits="false">log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(n{\log ^{2}}n)$]]></tex-math></alternatives></inline-formula> of the bzip2 compressor, used as the comparison base in Section <xref rid="j_infor621_s_006">6</xref>, and the same as the bzip2 decoding time (according to the bzip2 white paper, Seward, <xref ref-type="bibr" rid="j_infor621_ref_021">2019</xref>).</p>
</sec>
<sec id="j_infor621_s_005">
<label>5</label>
<title>Context-Based Forward-Looking Algorithm</title>
<p>As mentioned in Aljehane and Teahan (<xref ref-type="bibr" rid="j_infor621_ref_002">2017</xref>), the most efficient <sans-serif>PPM</sans-serif> model for word-based alphabets is an order-one model, i.e. to estimate the probability of a word in its context, it is advisable to consider only one previous word. Combining <sans-serif>PPM</sans-serif> with forward coding, we can store the forward frequency of each word after every other word together with the compressed file and update the model accordingly at each step of encoding and decoding. This may significantly reduce the compressed text size, but at the expense of a high overhead from storing numerous context-frequency tables. Thus, we suggest to store these tables partially. Namely, it is reasonable to save the frequency of the word <italic>w</italic> after the word <italic>c</italic> only if the probability of <italic>w</italic> appearing after <italic>c</italic> differs significantly from the unconditional probability of the word <italic>w</italic>.</p>
<p>Let <inline-formula id="j_infor621_ineq_055"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${T^{\prime }}$]]></tex-math></alternatives></inline-formula> be the sequence of terms obtained after replacing low-frequency words in the text by meta-characters. Let us discuss two consecutive terms <inline-formula id="j_infor621_ineq_056"><alternatives><mml:math>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$c={T^{\prime }}[i]$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor621_ineq_057"><alternatives><mml:math>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$w={T^{\prime }}[i+1]$]]></tex-math></alternatives></inline-formula>. We call <italic>c</italic> the <italic>context</italic> and <italic>w</italic> the <italic>term in the context c</italic>. Let <inline-formula id="j_infor621_ineq_058"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathsf{freq}(w|c)$]]></tex-math></alternatives></inline-formula> be the context forward frequency of <italic>w</italic>, i.e. the number of occurrences of <italic>w</italic> right after <italic>c</italic> in the suffix <inline-formula id="j_infor621_ineq_059"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>…</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[${T^{\prime }}[i+1\dots n]$]]></tex-math></alternatives></inline-formula>. The idea is to store context frequencies of some terms in some contexts in a <italic>context table</italic> <inline-formula id="j_infor621_ineq_060"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi></mml:math><tex-math><![CDATA[$CT$]]></tex-math></alternatives></inline-formula>: <inline-formula id="j_infor621_ineq_061"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$CT[c][w]=\mathsf{freq}(w|c)$]]></tex-math></alternatives></inline-formula>. This allows us to use context forward frequencies of terms in the model instead of their unconditional forward frequencies, which makes the model more accurate. Note that we represent the two-dimensional matrix <inline-formula id="j_infor621_ineq_062"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi></mml:math><tex-math><![CDATA[$CT$]]></tex-math></alternatives></inline-formula> as an array of arrays using <inline-formula id="j_infor621_ineq_063"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[c][w]$]]></tex-math></alternatives></inline-formula> rather than <inline-formula id="j_infor621_ineq_064"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[c,w]$]]></tex-math></alternatives></inline-formula> as notation for its indices, to permit references <inline-formula id="j_infor621_ineq_065"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$CT[c]$]]></tex-math></alternatives></inline-formula> to one-dimensional sub-matrices in what follows.</p>
<p>There is a trade-off between the size of context tables and the compression gain they provide. Let <inline-formula id="j_infor621_ineq_066"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathsf{P}(w|c)=\mathsf{freq}(w|c)/\mathsf{freq}(c)$]]></tex-math></alternatives></inline-formula> be the context forward probability of <italic>w</italic>, while <inline-formula id="j_infor621_ineq_067"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="sans-serif">left</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathsf{P}(w)=\mathsf{freq}(w)/(n-\mathsf{left}(w))$]]></tex-math></alternatives></inline-formula> is the unconditional forward probability of <italic>w</italic>, where <inline-formula id="j_infor621_ineq_068"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">left</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathsf{left}(w)$]]></tex-math></alternatives></inline-formula> is the position of the leftmost occurrence of <italic>w</italic>. We set two threshold values <inline-formula id="j_infor621_ineq_069"><alternatives><mml:math>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mspace width="-0.1667em"/>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$t{h_{\hspace{-0.1667em}f}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor621_ineq_070"><alternatives><mml:math>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">g</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$t{h_{g}}$]]></tex-math></alternatives></inline-formula>, representing, respectively, a bound on the frequency and a bound on the expected compression gain, and define an element in the table <inline-formula id="j_infor621_ineq_071"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi></mml:math><tex-math><![CDATA[$CT$]]></tex-math></alternatives></inline-formula> at index <inline-formula id="j_infor621_ineq_072"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[c][w]$]]></tex-math></alternatives></inline-formula> if and only if 
<disp-formula id="j_infor621_eq_009">
<label>(1)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mspace width="-0.1667em"/>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mspace width="1em"/>
<mml:mtext>and</mml:mtext>
<mml:mspace width="1em"/>
<mml:mo fence="true" maxsize="2.03em" minsize="2.03em" stretchy="true">|</mml:mo>
<mml:mo movablelimits="false">log</mml:mo><mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo fence="true" maxsize="2.03em" minsize="2.03em" stretchy="true">|</mml:mo>
<mml:mo>·</mml:mo>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">g</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \mathsf{freq}(c)\gt t{h_{\hspace{-0.1667em}f}}\hspace{1em}\text{and}\hspace{1em}\bigg|\log \frac{P(w|c)}{P(w)}\bigg|\cdot \mathsf{freq}(w|c)\gt t{h_{g}}\]]]></tex-math></alternatives>
</disp-formula> 
(the absolute value is needed, because <inline-formula id="j_infor621_ineq_073"><alternatives><mml:math>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$P(w|c)$]]></tex-math></alternatives></inline-formula> may be smaller or larger than <inline-formula id="j_infor621_ineq_074"><alternatives><mml:math>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$P(w)$]]></tex-math></alternatives></inline-formula>). The first inequality enables us to avoid storing symbol probabilities in low-frequency contexts, since the cost of storing the context itself may exceed the compression gain it provides. The left part of the latter inequality can be considered as a measure of the compression gain achieved by defining the element of the context table at index <inline-formula id="j_infor621_ineq_075"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[c][w]$]]></tex-math></alternatives></inline-formula>. We need it as this gain has to cover at least the cost of storing the term <italic>w</italic> and its context frequency <inline-formula id="j_infor621_ineq_076"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathsf{freq}(w|c)$]]></tex-math></alternatives></inline-formula>. Examples of chosen threshold values are given in the experimental section in Table <xref rid="j_infor621_tab_005">5</xref>.</p>
<p>We shall henceforth use the notations <inline-formula id="j_infor621_ineq_077"><alternatives><mml:math>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi></mml:math><tex-math><![CDATA[$c\in CT$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor621_ineq_078"><alternatives><mml:math>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$w\in CT[c]$]]></tex-math></alternatives></inline-formula> to refer to the fact that <inline-formula id="j_infor621_ineq_079"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$CT[c]$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor621_ineq_080"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$CT[c][w]$]]></tex-math></alternatives></inline-formula> have been defined, respectively. Accordingly, we also say “<italic>c</italic> belongs to <inline-formula id="j_infor621_ineq_081"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi></mml:math><tex-math><![CDATA[$CT$]]></tex-math></alternatives></inline-formula>”.</p>
<fig id="j_infor621_fig_005">
<label>Algorithm 5</label>
<caption>
<p><sc>Context-Aware Forward Encoding</sc></p>
</caption>
<graphic xlink:href="infor621_g006.jpg"/>
</fig>
<p>Algorithm <xref rid="j_infor621_fig_005">5</xref> shows the forward-looking context-based compression method. Its general logic is the same as in Algorithm <xref rid="j_infor621_fig_003">3</xref>, and its initialization is the same as in lines 1–7 of Algorithm <xref rid="j_infor621_fig_003">3</xref>, except for the initialization of the context table <inline-formula id="j_infor621_ineq_082"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi></mml:math><tex-math><![CDATA[$CT$]]></tex-math></alternatives></inline-formula> (line 2) and frequencies of using terms as contexts <sans-serif>cfreq</sans-serif> (line 3). However, we can use different models depending on whether the term <italic>c</italic> belongs to the context table. This choice is reflected in the function <sans-serif>Encode-and-Update-the-Model</sans-serif> given in Algorithm <xref rid="j_infor621_fig_006">6</xref>, using values and arrays precomputed in lines 1–8 of Algorithm <xref rid="j_infor621_fig_005">5</xref>.</p>
<list>
<list-item id="j_infor621_li_008">
<label>–</label>
<p>If <inline-formula id="j_infor621_ineq_083"><alternatives><mml:math>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo stretchy="false">∉</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi></mml:math><tex-math><![CDATA[$c\notin CT$]]></tex-math></alternatives></inline-formula>, then we construct the model based on unconditional forward frequencies of the terms (line 8).</p>
</list-item>
<list-item id="j_infor621_li_009">
<label>–</label>
<p>If <inline-formula id="j_infor621_ineq_084"><alternatives><mml:math>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi></mml:math><tex-math><![CDATA[$c\in CT$]]></tex-math></alternatives></inline-formula>, the probability of a term is calculated differently, depending on whether it belongs to <inline-formula id="j_infor621_ineq_085"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$CT[c]$]]></tex-math></alternatives></inline-formula>. If it does, we use its context probability (line 3) and update the context frequencies (lines 4, 5). Otherwise, we relate its forward frequency to the total frequency of terms that do not belong to <inline-formula id="j_infor621_ineq_086"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$CT[c]$]]></tex-math></alternatives></inline-formula> in the remaining text suffix (the first fraction in line 7). Of course, we should multiply this fraction by the ‘escape’ (out-of-context) probability. Unlike traditional backward <sans-serif>PPM</sans-serif>, the forward compression has no zero-frequency problem, and the ‘escape’ probability can be calculated precisely. Namely, it equals to the ratio of the number of the term <italic>c</italic> occurrences, where the next term does not belong to <inline-formula id="j_infor621_ineq_087"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$CT[c]$]]></tex-math></alternatives></inline-formula>, to the total forward frequency of <italic>c</italic> (second fraction in line 7).</p>
</list-item>
</list>
<p>Note that we use <inline-formula id="j_infor621_ineq_088"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$\mathsf{freq}(c)+1$]]></tex-math></alternatives></inline-formula> instead of <inline-formula id="j_infor621_ineq_089"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathsf{freq}(c)$]]></tex-math></alternatives></inline-formula> in lines 3 and 7, since when we encode the term <italic>w</italic>, its context <italic>c</italic> has already been encoded and its frequency has been decreased by 1. However, we must not count the current term <italic>w</italic> as already encoded, and in relation to it, its context <italic>c</italic> should also be addressed as ‘not yet processed’.</p>
<fig id="j_infor621_fig_006">
<label>Algorithm 6</label>
<caption>
<p><sc>Encode and Update the Model Function</sc></p>
</caption>
<graphic xlink:href="infor621_g007.jpg"/>
</fig>
<p>Let us demonstrate the correctness of the model built in Algorithm <xref rid="j_infor621_fig_006">6</xref>. If <inline-formula id="j_infor621_ineq_090"><alternatives><mml:math>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo stretchy="false">∉</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi></mml:math><tex-math><![CDATA[$c\notin CT$]]></tex-math></alternatives></inline-formula>, then the regular forward model is applied,<xref ref-type="fn" rid="j_infor621_fn_001">1</xref><fn id="j_infor621_fn_001"><label><sup>1</sup></label>
<p>In the case <inline-formula id="j_infor621_ineq_091"><alternatives><mml:math>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo stretchy="false">∉</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi></mml:math><tex-math><![CDATA[$c\notin CT$]]></tex-math></alternatives></inline-formula>, we can use more accurate probabilities of terms based on their occurrences not preceded by any context. However, the cost of storing the ‘out-of-context’ frequencies in a separate table outweighs the savings.</p></fn> the correctness of which is proved in Section <xref rid="j_infor621_s_004">4</xref>. Let us discuss the case <inline-formula id="j_infor621_ineq_092"><alternatives><mml:math>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi></mml:math><tex-math><![CDATA[$c\in CT$]]></tex-math></alternatives></inline-formula>. Summing up both sides of the assignment in line 3, we get 
<disp-formula id="j_infor621_eq_010">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
</mml:mrow>
</mml:munder>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
</mml:mrow>
</mml:munder><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="sans-serif">cfreq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \sum \limits_{w\in CT[c]}p(w)=\sum \limits_{w\in CT[c]}\frac{CT[c][w]}{\mathsf{freq}(c)+1}=\frac{{\textstyle\sum _{w\in CT[c]}}\mathsf{freq}(w|c)}{\mathsf{freq}(c)+1}=\frac{\mathsf{cfreq}(c)}{\mathsf{freq}(c)+1}.\]]]></tex-math></alternatives>
</disp-formula> 
Summing up both sides of the assignment in line 7, we get 
<disp-formula id="j_infor621_eq_011">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd">
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">∉</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
</mml:mrow>
</mml:munder>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mtd>
<mml:mtd class="align-even">
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">∉</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="sans-serif">suffixLen</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>·</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="sans-serif">cfreq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd"/>
<mml:mtd class="align-even">
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="sans-serif">suffixLen</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="sans-serif">suffixLen</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>·</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="sans-serif">cfreq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="sans-serif">cfreq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}\sum \limits_{w\notin CT[c]}p(w)& =\frac{{\textstyle\sum _{w\notin CT[c]}}\mathsf{freq}(w)}{\mathsf{suffixLen}-s}\cdot \frac{\mathsf{freq}(c)+1-\mathsf{cfreq}(c)}{\mathsf{freq}(c)+1}\\ {} & =\frac{\mathsf{suffixLen}-s}{\mathsf{suffixLen}-s}\cdot \frac{\mathsf{freq}(c)+1-\mathsf{cfreq}(c)}{\mathsf{freq}(c)+1}=\frac{\mathsf{freq}(c)+1-\mathsf{cfreq}(c)}{\mathsf{freq}(c)+1},\end{aligned}\]]]></tex-math></alternatives>
</disp-formula> 
where <italic>s</italic> is calculated in line 6. Summing up the probabilities of all terms <italic>w</italic> in the case <inline-formula id="j_infor621_ineq_093"><alternatives><mml:math>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi></mml:math><tex-math><![CDATA[$c\in CT$]]></tex-math></alternatives></inline-formula>, we get: 
<disp-formula id="j_infor621_eq_012">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
</mml:mrow>
</mml:munder>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">∉</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
</mml:mrow>
</mml:munder>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="sans-serif">cfreq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>+</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="sans-serif">cfreq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \sum \limits_{w\in CT[c]}p(w)+\sum \limits_{w\notin CT[c]}p(w)=\frac{\mathsf{cfreq}(c)}{\mathsf{freq}(c)+1}+\frac{\mathsf{freq}(c)+1-\mathsf{cfreq}(c)}{\mathsf{freq}(c)+1}=1.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>Now, we estimate the time complexity of Algorithm <xref rid="j_infor621_fig_005">5</xref>. The context forward frequencies <inline-formula id="j_infor621_ineq_094"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathsf{freq}(w|c)$]]></tex-math></alternatives></inline-formula> can be calculated together with the generic forward frequencies of symbols in <inline-formula id="j_infor621_ineq_095"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(n)$]]></tex-math></alternatives></inline-formula> time. One extra pass over the text is needed to check the inequalities (<xref rid="j_infor621_eq_009">1</xref>) and to construct the context table. This pass can be combined with calculating the frequencies of context terms in line 3 and takes <inline-formula id="j_infor621_ineq_096"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(n)$]]></tex-math></alternatives></inline-formula> time. The main loop in lines 5–10 has <italic>n</italic> iterations, however, summing up <inline-formula id="j_infor621_ineq_097"><alternatives><mml:math>
<mml:mi mathvariant="sans-serif">freq</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathsf{freq}(y)$]]></tex-math></alternatives></inline-formula> in the function <sans-serif>Encode-and-Update-the-Model</sans-serif> (line 6 of Algorithm <xref rid="j_infor621_fig_006">6</xref>) may require <inline-formula id="j_infor621_ineq_098"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(CT[c])$]]></tex-math></alternatives></inline-formula> time. Therefore, the worst-case time complexity of Algorithm <xref rid="j_infor621_fig_005">5</xref> is <inline-formula id="j_infor621_ineq_099"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(nc)$]]></tex-math></alternatives></inline-formula>, where <italic>c</italic> is the maximal number of terms in a particular context. This may be greater than <inline-formula id="j_infor621_ineq_100"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(nk)$]]></tex-math></alternatives></inline-formula> time of the conventional <sans-serif>PPM</sans-serif> approach, where <italic>k</italic> is the maximal context level. However, for all texts we experimented with, the average value <italic>c</italic> per word does not exceed 8, which implies an acceptable time overhead, comparable to that of the preliminary word-splitting phase.</p>
<p>The context-aware forward decoding algorithm mirrors the encoding Algorithm <xref rid="j_infor621_fig_005">5</xref> featured with the meta-character processing technique given in Algorithm <xref rid="j_infor621_fig_004">4</xref>.</p>
</sec>
<sec id="j_infor621_s_006">
<label>6</label>
<title>Experimental Results</title>
<sec id="j_infor621_s_007">
<label>6.1</label>
<title>Experiments on Word-Based Forward-Looking Compression</title>
<p>We tested the word-based compression techniques on three texts in English:</p>
<list>
<list-item id="j_infor621_li_010">
<label>–</label>
<p><sans-serif>Small</sans-serif> – Alice’s Adventures in Wonderland;</p>
</list-item>
<list-item id="j_infor621_li_011">
<label>–</label>
<p><sans-serif>Middle-sized</sans-serif> – The Bible, King James Version;</p>
</list-item>
<list-item id="j_infor621_li_012">
<label>–</label>
<p><sans-serif>Large</sans-serif> – English text from the Pizza&amp;Chili corpus.<xref ref-type="fn" rid="j_infor621_fn_002">2</xref><fn id="j_infor621_fn_002"><label><sup>2</sup></label>
<p><uri>http://pizzachili.dcc.uchile.cl/texts/nlang/</uri></p></fn></p>
</list-item>
</list>
<p>A word is defined as a character sequence between two general white spaces; capitalization and punctuation matters, e.g. we differentiate between ‘<sans-serif>word</sans-serif>’, ‘<sans-serif>Word</sans-serif>’, and ‘<sans-serif>word</sans-serif>’. The results of the net encoding and some statistics are presented in Table <xref rid="j_infor621_tab_002">2</xref>. The first column indicates the input files. Then follow the size of the file in MB, the total number of words, and the number of unique words. The fifth column gives the value of the static entropy <inline-formula id="j_infor621_ineq_101"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${H_{0}}$]]></tex-math></alternatives></inline-formula> in bytes calculated via the Shannon formula, which is used here as a baseline. The following three columns present the differences, in bytes, from the values in the fifth column.</p>
<list>
<list-item id="j_infor621_li_013">
<label>–</label>
<p><italic>Backward</italic>: the entropy calculated over frequencies produced by a standard adaptive encoding. Every word occurrence adds 1 to its frequency, and the initial frequency of a word is also 1.</p>
</list-item>
<list-item id="j_infor621_li_014">
<label>–</label>
<p><sans-serif>Algorithm</sans-serif> <xref rid="j_infor621_fig_001"><sans-serif>1</sans-serif></xref>: the entropy calculated over frequencies produced by a standard word-based forward-looking method described in Section <xref rid="j_infor621_s_002">2</xref>.</p>
</list-item>
<list-item id="j_infor621_li_015">
<label>–</label>
<p><sans-serif>Algorithm</sans-serif> <xref rid="j_infor621_fig_003"><sans-serif>3</sans-serif></xref>: forward-looking compression combined with the replacement of the leftmost occurrences of infrequent words described in Section <xref rid="j_infor621_s_004">4</xref>.</p>
</list-item>
</list>
<p>The last column, <inline-formula id="j_infor621_ineq_102"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="script">L</mml:mi>
<mml:mo stretchy="false">|</mml:mo></mml:math><tex-math><![CDATA[$|\mathcal{L}|$]]></tex-math></alternatives></inline-formula>, presents the size of word frequency information compressed by Algorithm <xref rid="j_infor621_fig_002">2</xref>.</p>
<table-wrap id="j_infor621_tab_002">
<label>Table 2</label>
<caption>
<p>Comparison of text compression methods (in bytes).</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><italic>Size</italic> MB</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><italic>Words</italic> <inline-formula id="j_infor621_ineq_103"><alternatives><mml:math>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$\times {10^{6}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><italic>Words unique</italic></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor621_ineq_104"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${H_{0}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><italic>Backward</italic></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><sans-serif>Algorithm</sans-serif> <xref rid="j_infor621_fig_001"><sans-serif>1</sans-serif></xref></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><sans-serif>Algorithm</sans-serif> <xref rid="j_infor621_fig_002"><sans-serif>2</sans-serif></xref></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor621_ineq_105"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="script">L</mml:mi>
<mml:mo stretchy="false">|</mml:mo></mml:math><tex-math><![CDATA[$|\mathcal{L}|$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>Small</sans-serif></td>
<td style="vertical-align: top; text-align: left">0.15</td>
<td style="vertical-align: top; text-align: left">0.03</td>
<td style="vertical-align: top; text-align: left">5 311</td>
<td style="vertical-align: top; text-align: left">32 015</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor621_ineq_106"><alternatives><mml:math>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mspace width="0.1667em"/>
<mml:mn>380</mml:mn></mml:math><tex-math><![CDATA[$+1\hspace{0.1667em}380$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor621_ineq_107"><alternatives><mml:math>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mspace width="0.1667em"/>
<mml:mn>205</mml:mn></mml:math><tex-math><![CDATA[$-1\hspace{0.1667em}205$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor621_ineq_108"><alternatives><mml:math>
<mml:mo>−</mml:mo>
<mml:mn>6</mml:mn>
<mml:mspace width="0.1667em"/>
<mml:mn>557</mml:mn></mml:math><tex-math><![CDATA[$-6\hspace{0.1667em}557$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">81</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>Middle</sans-serif></td>
<td style="vertical-align: top; text-align: left">3.95</td>
<td style="vertical-align: top; text-align: left">0.77</td>
<td style="vertical-align: top; text-align: left">28 659</td>
<td style="vertical-align: top; text-align: left">907 866</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor621_ineq_109"><alternatives><mml:math>
<mml:mo>+</mml:mo>
<mml:mn>14</mml:mn>
<mml:mspace width="0.1667em"/>
<mml:mn>480</mml:mn></mml:math><tex-math><![CDATA[$+14\hspace{0.1667em}480$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor621_ineq_110"><alternatives><mml:math>
<mml:mo>−</mml:mo>
<mml:mn>7</mml:mn>
<mml:mspace width="0.1667em"/>
<mml:mn>764</mml:mn></mml:math><tex-math><![CDATA[$-7\hspace{0.1667em}764$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor621_ineq_111"><alternatives><mml:math>
<mml:mo>−</mml:mo>
<mml:mn>38</mml:mn>
<mml:mspace width="0.1667em"/>
<mml:mn>322</mml:mn></mml:math><tex-math><![CDATA[$-38\hspace{0.1667em}322$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">407</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><sans-serif>Large</sans-serif></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">200</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">37.0</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">836 002</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">52 805 108</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor621_ineq_112"><alternatives><mml:math>
<mml:mo>+</mml:mo>
<mml:mn>509</mml:mn>
<mml:mspace width="0.1667em"/>
<mml:mn>430</mml:mn></mml:math><tex-math><![CDATA[$+509\hspace{0.1667em}430$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor621_ineq_113"><alternatives><mml:math>
<mml:mo>−</mml:mo>
<mml:mn>214</mml:mn>
<mml:mspace width="0.1667em"/>
<mml:mn>429</mml:mn></mml:math><tex-math><![CDATA[$-214\hspace{0.1667em}429$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor621_ineq_114"><alternatives><mml:math>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mspace width="0.1667em"/>
<mml:mn>660</mml:mn>
<mml:mspace width="0.1667em"/>
<mml:mn>355</mml:mn></mml:math><tex-math><![CDATA[$-1\hspace{0.1667em}660\hspace{0.1667em}355$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2 517</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As can be seen, the size of a compressed frequency sequence is tiny, and the total performance of the forward-looking compression (<sans-serif>Alg. 1</sans-serif>+<inline-formula id="j_infor621_ineq_115"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="script">L</mml:mi>
<mml:mo stretchy="false">|</mml:mo></mml:math><tex-math><![CDATA[$|\mathcal{L}|$]]></tex-math></alternatives></inline-formula>) is essentially better than the traditional adaptive backward one. Combined with the word replacement technique, the outperformance of the forward adaptive encoding over the backward becomes even stronger. However, as the text size grows, the relative advantage decreases since the size of a dictionary becomes less significant in relation to a text.</p>
<p>The values in Table <xref rid="j_infor621_tab_002">2</xref> are obtained assuming that the list of different words is needed anyway for Information Retrieval and is thus already stored. However, when we consider the application of a compression technique to file archiving, the size of a dictionary should be taken into account. In word-based compression, a dictionary is large and has to be compressed itself using a non-word-based technique. The simplest way to compress a dictionary is to use a universal archiver. It performs much better if words are lexicographically sorted. Moreover, special methods like <sans-serif>POM</sans-serif>, or more sophisticated methods for large dictionaries, e.g. Ferragina <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor621_ref_010">2025</xref>), may further improve the compression ratio of the lexicographically ordered dictionary. By contrast, our word replacement method in Section <xref rid="j_infor621_s_004">4</xref> requires ordering the dictionary according to the leftmost appearance of words having the same frequencies in the text, which usually does not coincide with the lexicographical order. These two approaches appear to be alternative and competitive. We tested them both on texts of different sizes in two languages: two English texts from the Canterbury corpus,<xref ref-type="fn" rid="j_infor621_fn_003">3</xref><fn id="j_infor621_fn_003"><label><sup>3</sup></label>
<p><uri>https://corpus.canterbury.ac.nz/</uri></p></fn> the Bible in English, Shakespeare’s complete works, <italic>The Magic Mountain</italic> by Thomas Mann in German, and the Ukrainian fiction corpus.<xref ref-type="fn" rid="j_infor621_fn_004">4</xref><fn id="j_infor621_fn_004"><label><sup>4</sup></label>
<p><uri>https://github.com/zavadsky/corpus/</uri></p></fn> Also, we archived these texts with one of the most powerful BWT-based, Burrows and Wheeler (<xref ref-type="bibr" rid="j_infor621_ref_007">1994</xref>), universal archivers, <sans-serif>bzip2</sans-serif> (which outperforms LZMA2 and Zstd, ultra level 22, on all tested texts). The results are shown in Table <xref rid="j_infor621_tab_003">3</xref>, where the first two columns present the input files and their uncompressed size, while the next columns contain the following:</p>
<list>
<list-item id="j_infor621_li_016">
<label>–</label>
<p><sans-serif>bzip2</sans-serif><inline-formula id="j_infor621_ineq_116"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(T)$]]></tex-math></alternatives></inline-formula>: size of a file compressed by <sans-serif>bzip2</sans-serif> archiver.</p>
</list-item>
<list-item id="j_infor621_li_017">
<label>–</label>
<p><sans-serif>POM</sans-serif><inline-formula id="j_infor621_ineq_117"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(D)$]]></tex-math></alternatives></inline-formula> + <sans-serif>Alg. 1</sans-serif><inline-formula id="j_infor621_ineq_118"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(T)$]]></tex-math></alternatives></inline-formula>: word-based forward-looking text compression plus the sequence of word frequencies compressed using RMD codes of Zavadskyi and Anisimov (<xref ref-type="bibr" rid="j_infor621_ref_023">2020</xref>) and the lexicographically ordered dictionary compressed using the <sans-serif>POM</sans-serif> technique.<xref ref-type="fn" rid="j_infor621_fn_005">5</xref><fn id="j_infor621_fn_005"><label><sup>5</sup></label>
<p><sans-serif>POM</sans-serif> outputs concatenated word suffixes and two sequences of integers: suffix lengths and lengths of common prefixes with the preceding words in the dictionary. We compress concatenated suffixes with <sans-serif>bzip2</sans-serif> while the two sequences of numbers are compressed separately using static Huffman codes. Also, we should compress the word frequencies given in the order of the lexicographically sorted dictionary. Huffman codes are inefficient for this task, as the set of values to be compressed is too large. Thus, we choose the universal encoder that optimizes its compressed size among Elias, Fibonacci, and RMD codes. For all texts, the RMD code <inline-formula id="j_infor621_ineq_119"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>4</mml:mn>
<mml:mtext>–</mml:mtext>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${R_{1,2,4\text{--}\infty }}$]]></tex-math></alternatives></inline-formula> or <inline-formula id="j_infor621_ineq_120"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mtext>–</mml:mtext>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>5</mml:mn>
<mml:mtext>–</mml:mtext>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${R_{1\text{--}3,5\text{--}\infty }}$]]></tex-math></alternatives></inline-formula> performed the best.</p></fn></p>
</list-item>
<list-item id="j_infor621_li_018">
<label>–</label>
<p><inline-formula id="j_infor621_ineq_121"><alternatives><mml:math>
<mml:mtext mathvariant="sans-serif">bzip2</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mtext mathvariant="sans-serif">Alg. 3</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textsf{bzip2}(D)+\textsf{Alg.\hspace{0.17em}3}(T)$]]></tex-math></alternatives></inline-formula>: The proposed forward-looking compression with word replacement, given in Algorithm <xref rid="j_infor621_fig_003">3</xref> plus the dictionary <sans-serif>D</sans-serif> compressed by <sans-serif>bzip2</sans-serif> and the frequency values compressed via Algorithm <xref rid="j_infor621_fig_002">2</xref>. Also, the outperformance over the <sans-serif>bzip2</sans-serif> is shown in percents.</p>
</list-item>
</list>
<p>In all schemes, the <sans-serif>bzip2</sans-serif> has been used on the 9th <italic>ultra</italic> compression level. The forward-looking compression results correspond to arithmetic encoding. The source code can be found at <uri>https://github.com/zavadsky/forward</uri>.</p>
<table-wrap id="j_infor621_tab_003">
<label>Table 3</label>
<caption>
<p>Results of full archiving the natural language texts (in bytes).</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Size MB</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><sans-serif>bzip2</sans-serif></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor621_ineq_122"><alternatives><mml:math>
<mml:mtext mathvariant="sans-serif">POM</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mtext mathvariant="sans-serif">Alg. 1</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textsf{POM}(D)+\textsf{Alg.\hspace{0.17em}1}(T)$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor621_ineq_123"><alternatives><mml:math>
<mml:mtext mathvariant="sans-serif">bzip2</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mtext mathvariant="sans-serif">Alg. 3</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\textsf{bzip2}(D)+\textsf{Alg.\hspace{0.17em}3}(T)$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>alice29.txt</sans-serif></td>
<td style="vertical-align: top; text-align: left">0.148</td>
<td style="vertical-align: top; text-align: left">43 144</td>
<td style="vertical-align: top; text-align: left">46 011</td>
<td style="vertical-align: top; text-align: left"><bold>42 576</bold> (<inline-formula id="j_infor621_ineq_124"><alternatives><mml:math>
<mml:mo>−</mml:mo>
<mml:mn>1.3</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$-1.3\% $]]></tex-math></alternatives></inline-formula>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>plrabn12.txt</sans-serif></td>
<td style="vertical-align: top; text-align: left">0.460</td>
<td style="vertical-align: top; text-align: left">145 310</td>
<td style="vertical-align: top; text-align: left">146 165</td>
<td style="vertical-align: top; text-align: left"><bold>141 591</bold> (<inline-formula id="j_infor621_ineq_125"><alternatives><mml:math>
<mml:mo>−</mml:mo>
<mml:mn>2.6</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$-2.6\% $]]></tex-math></alternatives></inline-formula>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>The Bible</sans-serif></td>
<td style="vertical-align: top; text-align: left">3.95</td>
<td style="vertical-align: top; text-align: left"><bold>844 818</bold></td>
<td style="vertical-align: top; text-align: left">958 420</td>
<td style="vertical-align: top; text-align: left">959 525 (<inline-formula id="j_infor621_ineq_126"><alternatives><mml:math>
<mml:mo>+</mml:mo>
<mml:mn>12</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$+12\% $]]></tex-math></alternatives></inline-formula>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>Shakespeare</sans-serif></td>
<td style="vertical-align: top; text-align: left">5.19</td>
<td style="vertical-align: top; text-align: left">1 473 411</td>
<td style="vertical-align: top; text-align: left">1 401 683</td>
<td style="vertical-align: top; text-align: left"><bold>1 396 240</bold> (<inline-formula id="j_infor621_ineq_127"><alternatives><mml:math>
<mml:mo>−</mml:mo>
<mml:mn>5.5</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$-5.5\% $]]></tex-math></alternatives></inline-formula>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>German</sans-serif></td>
<td style="vertical-align: top; text-align: left">1.01</td>
<td style="vertical-align: top; text-align: left">293 411</td>
<td style="vertical-align: top; text-align: left">280 671</td>
<td style="vertical-align: top; text-align: left"><bold>277 944</bold> (<inline-formula id="j_infor621_ineq_128"><alternatives><mml:math>
<mml:mo>−</mml:mo>
<mml:mn>5.5</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$-5.5\% $]]></tex-math></alternatives></inline-formula>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><sans-serif>Ukrainian</sans-serif></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">15.2</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">3 061 124</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2 976 830</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>2 899 135</bold> (<inline-formula id="j_infor621_ineq_129"><alternatives><mml:math>
<mml:mo>−</mml:mo>
<mml:mn>5.6</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$-5.6\% $]]></tex-math></alternatives></inline-formula>)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The best compression performance is highlighted in bold. The results show that for all texts, except for the Bible, the best choice is combining Algorithms <xref rid="j_infor621_fig_002">2</xref> and <xref rid="j_infor621_fig_003">3</xref> for compression of the frequency list and the text itself, respectively, with <sans-serif>bzip2</sans-serif>-compression of the non-lexicographically ordered word list. The Bible contains many highly repetitive words, which is not beneficial for our word replacement method, intended for the efficient representation of non-frequent alphabet elements. By contrast, for the Ukrainian text, the combination of Algorithm <xref rid="j_infor621_fig_002">2</xref> and <xref rid="j_infor621_fig_003">3</xref> proves to be the most efficient, as Ukrainian is an inflectional language and thus exhibits a much larger number of distinct, low-frequency words.</p>
<p>The English language has indeed the ability to express a term or concept in precise words, with only a small number of variants, e.g. <sans-serif>eat, eats</sans-serif>. The conjugation of verbs in French provides many more terms, e.g. <sans-serif>mange, manges, mangeons, mangez, mangent</sans-serif>. Similarly, the practically unlimited possibility of concatenating terms in German to form Komposita generates many different words, like <sans-serif>Haus, Haustor, Hausnummer, Holzhaus, Bauhaus, Hausbau, Bauhausepoche, Hausbaugesellschaft</sans-serif>, all of which are translated into phrases of several terms. While for English, these phrases would increase the count of the terms <sans-serif>house</sans-serif> or <sans-serif>home</sans-serif>, they are split over many more terms in German. For the German text, the combination of Algorithms <xref rid="j_infor621_fig_002">2</xref> and <xref rid="j_infor621_fig_003">3</xref> also proves to be highly efficient.</p>
<p>This phenomenon will be even more accentuated in highly inflected languages like Hebrew, whose morphology provides the possibility of prefixing a root by articles and prepositions like <sans-serif>the, and, to</sans-serif>, <inline-formula id="j_infor621_ineq_130"><alternatives><mml:math>
<mml:mo>…</mml:mo>
<mml:mspace width="0.1667em"/></mml:math><tex-math><![CDATA[$\dots \hspace{0.1667em}$]]></tex-math></alternatives></inline-formula>, and suffixing it by possessive pronouns, like <sans-serif>mine, his</sans-serif>, <inline-formula id="j_infor621_ineq_131"><alternatives><mml:math>
<mml:mo>…</mml:mo>
<mml:mspace width="0.1667em"/></mml:math><tex-math><![CDATA[$\dots \hspace{0.1667em}$]]></tex-math></alternatives></inline-formula>. A Hebrew noun can thus have hundreds of variants, and a verb even thousands, which will evidently lead to the appearance of many terms with very low occurrence rates.</p>
<p>This principle of getting an advantage from the appearance of rare terms also applies when assessing the efficiency of our approach for other types of data: the more low-frequency symbols the data contain, the greater the benefits of our method, unless the dictionary grows so large that compressing it becomes more critical than compressing the text itself.</p>
<p>In summary, the word-level forward-looking adaptive compression, together with dictionary-aware text preprocessing techniques, demonstrate impressive compression ratios for different types of natural language texts. Combining this word-level approach with character-level BWT-based dictionary compression has been shown to be usually more beneficial than using only the latter for the whole text.</p>
</sec>
<sec id="j_infor621_s_008">
<label>6.2</label>
<title>Experiments on Context-Aware Forward Compression</title>
<p>To evaluate the compression performance of the proposed context-aware forward compression, we considered the following prose and poetry masterpieces in English:</p>
<list>
<list-item id="j_infor621_li_019">
<label>–</label>
<p><italic>Alice</italic> – Alice’s Adventures in Wonderland;</p>
</list-item>
<list-item id="j_infor621_li_020">
<label>–</label>
<p><italic>Sonnets</italic> – Shakespeare’s sonnets;</p>
</list-item>
<list-item id="j_infor621_li_021">
<label>–</label>
<p><italic>plrabn12.txt</italic> – Paradise Lost, a poem by John Milton;</p>
</list-item>
<list-item id="j_infor621_li_022">
<label>–</label>
<p><italic>Harry Potter</italic> – Harry Potter and the Philosopher‘s Stone;</p>
</list-item>
<list-item id="j_infor621_li_023">
<label>–</label>
<p><italic>lcet10.txt</italic> – workshop on electronic text proceedings, the file from Canterbury corpus.</p>
</list-item>
</list>
<p>The file sizes and optimal values for <inline-formula id="j_infor621_ineq_132"><alternatives><mml:math>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mspace width="-0.1667em"/>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$t{h_{\hspace{-0.1667em}f}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor621_ineq_133"><alternatives><mml:math>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">g</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$t{h_{g}}$]]></tex-math></alternatives></inline-formula> for both the mainstream and the punctuation stream appear in Table <xref rid="j_infor621_tab_004">4</xref>. Note that, unlike the previous experiments, we do not compress text corpora since different files in a corpus may have different optimal threshold values.</p>
<table-wrap id="j_infor621_tab_004">
<label>Table 4</label>
<caption>
<p>Test data for context-aware forward compression.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin"/>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Size KB</td>
<td colspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Mainstream</td>
<td colspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Punctuation</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor621_ineq_134"><alternatives><mml:math>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mspace width="-0.1667em"/>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$t{h_{\hspace{-0.1667em}f}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor621_ineq_135"><alternatives><mml:math>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">g</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$t{h_{g}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor621_ineq_136"><alternatives><mml:math>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mspace width="-0.1667em"/>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$t{h_{\hspace{-0.1667em}f}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor621_ineq_137"><alternatives><mml:math>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">g</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$t{h_{g}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>Alice</sans-serif></td>
<td style="vertical-align: top; text-align: left">152</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">14</td>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">14</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>Sonnets</sans-serif></td>
<td style="vertical-align: top; text-align: left">100</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">20</td>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">15</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>plrabn12</sans-serif></td>
<td style="vertical-align: top; text-align: left">467</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">18</td>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">14</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>Harry Potter</sans-serif></td>
<td style="vertical-align: top; text-align: left">429</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">15</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">12</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><sans-serif>lcet10</sans-serif></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">416</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">3</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">14</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">3</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">14</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>We, therefore, split the characters of each data file into two kinds of words, regular words and punctuation words as follows:</p>
<list>
<list-item id="j_infor621_li_024">
<label>–</label>
<p>a word in the <italic>main text stream</italic> is a longest contiguous sequence of alphanumeric characters;</p>
</list-item>
<list-item id="j_infor621_li_025">
<label>–</label>
<p>for a prose text, each sequence of characters between the words of the mainstream is considered as a word of the <italic>punctuation stream</italic>;</p>
</list-item>
<list-item id="j_infor621_li_026">
<label>–</label>
<p>for poetry, a sequence of spaces and punctuation characters in each line of a text is considered a single word of the punctuation stream; groups of two or more consecutive spaces are represented with unique predefined symbols.</p>
</list-item>
</list>
<p>Consider the following prose fragment</p><disp-quote>
<p><sub>⊔</sub><sub>⊔</sub>‘<monospace>Well</monospace>!’<sub>⊔</sub><monospace>thought</monospace> <sub>⊔</sub><monospace>Alice</monospace><sub>⊔</sub><monospace>to</monospace><sub>⊔</sub><monospace>herself</monospace>,<sub>⊔</sub>‘<monospace>after</monospace><sub>⊔</sub><monospace>such</monospace></p>
<p><sub>⊔</sub><monospace>a</monospace><sub>⊔</sub><monospace>fall</monospace><sub>⊔</sub><monospace>as</monospace><sub>⊔</sub><monospace>this</monospace>,<sub>⊔</sub><monospace>I</monospace><sans-serif>EOL</sans-serif></p>
<p><monospace>shall</monospace><sub>⊔</sub><monospace>think</monospace><sub>⊔</sub><monospace>nothing</monospace><sub>⊔</sub><monospace>of</monospace><sub>⊔</sub><monospace>tumbling</monospace><sub>⊔</sub><monospace>downstairs</monospace>!’</p></disp-quote>
<p>Its punctuation stream contains the following words:</p>
<p><sub>⊔</sub><sub>⊔</sub>‘; !’<sub>⊔</sub>; <sub>⊔</sub>; <sub>⊔</sub>; <sub>⊔</sub>; ,<sub>⊔</sub>‘; <sub>⊔</sub>; <sub>⊔</sub>; <sub>⊔</sub>; <sub>⊔</sub>; <sub>⊔</sub>; ,<sub>⊔</sub>; <sans-serif>E</sans-serif>OL; <sub>⊔</sub>; <sub>⊔</sub>; <sub>⊔</sub>; <sub>⊔</sub>; <sub>⊔</sub>; !’.</p>
<p>In a poetry example from Shakespeare’s sonnet</p><disp-quote>
<p><monospace>Music</monospace><sub>⊔</sub><monospace>to</monospace><sub>⊔</sub><monospace>hear</monospace><inline-formula id="j_infor621_ineq_138"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>⊔</mml:mo>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${,_{\bigsqcup }}$]]></tex-math></alternatives></inline-formula><monospace>why</monospace><sub>⊔</sub><monospace>hear’st</monospace><sub>⊔</sub><monospace>thou</monospace><sub>⊔</sub><monospace>music</monospace><sub>⊔</sub><monospace>sadly</monospace>?</p>
<p><monospace>Sweets</monospace><sub>⊔</sub><monospace>with</monospace><sub>⊔</sub><monospace>sweets</monospace><sub>⊔</sub><monospace>war</monospace><sub>⊔</sub><monospace>not</monospace><inline-formula id="j_infor621_ineq_139"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>⊔</mml:mo>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${,_{\bigsqcup }}$]]></tex-math></alternatives></inline-formula><monospace>joy</monospace><sub>⊔</sub><monospace>delights</monospace><sub>⊔</sub><monospace>in</monospace><sub>⊔</sub><monospace>joy</monospace>:</p></disp-quote>
<p>there are two punctuation words: <sub>⊔</sub><sub>⊔</sub>,<sub>⊔</sub><sub>⊔</sub><sub>⊔</sub><sub>⊔</sub><sub>⊔</sub>?; <sub>⊔</sub><sub>⊔</sub><sub>⊔</sub><sub>⊔</sub>,<sub>⊔</sub><sub>⊔</sub><sub>⊔</sub><sub>⊔</sub>:.</p>
<table-wrap id="j_infor621_tab_005">
<label>Table 5</label>
<caption>
<p>Results of full archiving the natural language texts (in bytes).</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Text</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><italic>Alice</italic></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><italic>Sonnets</italic></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><italic>plrabn12</italic></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><italic>Harry Potter</italic></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><italic>lcet10</italic></td>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="4" style="vertical-align: middle; text-align: left"><sans-serif>mainstream</sans-serif></td>
<td style="vertical-align: top; text-align: left"><sans-serif>comp-txt</sans-serif></td>
<td style="vertical-align: top; text-align: left">22591</td>
<td style="vertical-align: top; text-align: left">16646</td>
<td style="vertical-align: top; text-align: left">86138</td>
<td style="vertical-align: top; text-align: left">76695</td>
<td style="vertical-align: top; text-align: left">59703</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>comp-dic</sans-serif></td>
<td style="vertical-align: top; text-align: left">10065</td>
<td style="vertical-align: top; text-align: left">10735</td>
<td style="vertical-align: top; text-align: left">33923</td>
<td style="vertical-align: top; text-align: left">26798</td>
<td style="vertical-align: top; text-align: left">24661</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>con-freq</sans-serif></td>
<td style="vertical-align: top; text-align: left">1899</td>
<td style="vertical-align: top; text-align: left">289</td>
<td style="vertical-align: top; text-align: left">2248</td>
<td style="vertical-align: top; text-align: left">5416</td>
<td style="vertical-align: top; text-align: left">4966</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>uncon-freq</sans-serif></td>
<td style="vertical-align: top; text-align: left">84</td>
<td style="vertical-align: top; text-align: left">66</td>
<td style="vertical-align: top; text-align: left">137</td>
<td style="vertical-align: top; text-align: left">140</td>
<td style="vertical-align: top; text-align: left">117</td>
</tr>
<tr>
<td rowspan="4" style="vertical-align: middle; text-align: left"><sans-serif>punctuation</sans-serif></td>
<td style="vertical-align: top; text-align: left"><sans-serif>comp-txt</sans-serif></td>
<td style="vertical-align: top; text-align: left">5735</td>
<td style="vertical-align: top; text-align: left">1543</td>
<td style="vertical-align: top; text-align: left">9492</td>
<td style="vertical-align: top; text-align: left">4817</td>
<td style="vertical-align: top; text-align: left">10780</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>comp-dic</sans-serif></td>
<td style="vertical-align: top; text-align: left">688</td>
<td style="vertical-align: top; text-align: left">921</td>
<td style="vertical-align: top; text-align: left">4812</td>
<td style="vertical-align: top; text-align: left">355</td>
<td style="vertical-align: top; text-align: left">664</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>con-freq</sans-serif></td>
<td style="vertical-align: top; text-align: left">73</td>
<td style="vertical-align: top; text-align: left">57</td>
<td style="vertical-align: top; text-align: left">40</td>
<td style="vertical-align: top; text-align: left">213</td>
<td style="vertical-align: top; text-align: left">197</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sans-serif>uncon-freq</sans-serif></td>
<td style="vertical-align: top; text-align: left">38</td>
<td style="vertical-align: top; text-align: left">28</td>
<td style="vertical-align: top; text-align: left">55</td>
<td style="vertical-align: top; text-align: left">40</td>
<td style="vertical-align: top; text-align: left">47</td>
</tr>
<tr>
<td colspan="2" style="vertical-align: top; text-align: left"><sans-serif>Total</sans-serif></td>
<td style="vertical-align: top; text-align: left">41173</td>
<td style="vertical-align: top; text-align: left"><bold>30285</bold></td>
<td style="vertical-align: top; text-align: left"><bold>136845</bold></td>
<td style="vertical-align: top; text-align: left"><bold>114474</bold></td>
<td style="vertical-align: top; text-align: left"><bold>101135</bold></td>
</tr>
<tr>
<td colspan="2" style="vertical-align: top; text-align: left"><sans-serif>PPMd</sans-serif></td>
<td style="vertical-align: top; text-align: left"><bold>38969</bold></td>
<td style="vertical-align: top; text-align: left">30361</td>
<td style="vertical-align: top; text-align: left">137386</td>
<td style="vertical-align: top; text-align: left">118529</td>
<td style="vertical-align: top; text-align: left">102519</td>
</tr>
<tr>
<td colspan="2" style="vertical-align: top; text-align: left; border-bottom: solid thin"><sans-serif>BZip2</sans-serif></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">43144</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">31805</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">144011</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">119600</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">107683</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Both streams are encoded independently, applying the context-aware forward-looking technique described in Section <xref rid="j_infor621_s_005">5</xref>. During the decoding, words of a mainstream are interleaved with symbols of a punctuation stream.</p>
<p>We have also tested the ‘backspace character’ approach discussed in Klein and Shapira (<xref ref-type="bibr" rid="j_infor621_ref_018">2009</xref>). It relies on the fact that a single space follows most words. Therefore, these single spaces could be eliminated from the punctuation stream, and words followed by non-space characters could be marked with a special ‘backspace’ symbol that never occurs in a text. The decoder then takes a word from the punctuation stream and puts it after a word from the mainstream, only if the latter is followed by the backspace symbol. Otherwise, the decoder appends a mainstream word with a single space. This approach appeared to be efficient only for the <italic>Harry Potter</italic> file, as it is not split into lines by <sans-serif>EOL</sans-serif> characters artificially and thus contains many more single spaces than other files. Thus, Table <xref rid="j_infor621_tab_005">5</xref> demonstrates the compression of the Harry Potter file with the ‘backspace character’ approach and other files without it.</p>
<p>To compress context frequency tables, we use different multi-delimiter codes. Each table <inline-formula id="j_infor621_ineq_140"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$CT[c]$]]></tex-math></alternatives></inline-formula> consists of the sequence of terms and the sequence of respective frequencies. Terms can be indexed by their positions in the dictionary appended with meta-characters. We sort <inline-formula id="j_infor621_ineq_141"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$CT[c]$]]></tex-math></alternatives></inline-formula> in the order of ascending term indices. Then, we encode two sequences of numbers: (1) the differences between consecutive term indices and (2) the differences between term frequencies and the minimal frequency in <inline-formula id="j_infor621_ineq_142"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$CT[c]$]]></tex-math></alternatives></inline-formula>. Also, we use differential encoding to encode the sequence of context terms, <inline-formula id="j_infor621_ineq_143"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{c\}$]]></tex-math></alternatives></inline-formula>.</p>
<p>The results of full archiving appear in Table <xref rid="j_infor621_tab_005">5</xref> in which the upper part corresponds to the mainstream, and the lower part to the punctuation signs. The numbers report the sizes, in bytes, of the compressed text, the compressed dictionary, the context frequencies, and the unconditional forward frequencies, denoted by <sans-serif>comp-txt</sans-serif>, <sans-serif>comp-dic</sans-serif>, <sans-serif>con-freq</sans-serif> and <sans-serif>uncon-freq</sans-serif>, respectively. As can be seen, an improvement in the performance ratio is evident for the <italic>Alice</italic> and <italic>plrabn12</italic> texts when comparing the results in Table <xref rid="j_infor621_tab_005">5</xref> with those in Table <xref rid="j_infor621_tab_003">3</xref>. Moreover, our method performs a bit better than <sans-serif>PPMd</sans-serif> for all files, except for the shortest one, and essentially better than <sans-serif>bzip2</sans-serif>. The largest improvement (about 3.5%) is achieved for <italic>Harry Potter</italic>, which is a ‘pure’ text file without artificial <sans-serif>EOL</sans-serif> characters and with a relatively low number of punctuation signs.</p>
</sec>
</sec>
<sec id="j_infor621_s_009">
<label>7</label>
<title>Conclusion</title>
<p>This paper presents an innovative adaptive coding method that aligns the forward-looking approach with word-based alphabets, improving natural language text compression efficiency. The paper’s key contributions include:</p>
<list>
<list-item id="j_infor621_li_027">
<label>1.</label>
<p>Extending the forward-looking approach to accommodate word-based alphabets;</p>
</list-item>
<list-item id="j_infor621_li_028">
<label>2.</label>
<p>Introducing an efficient encoding for the header information consisting of word frequencies;</p>
</list-item>
<list-item id="j_infor621_li_029">
<label>3.</label>
<p>Replacing low-frequency words in the text with higher-frequency meta-characters and using an alternative ordering for word alphabets;</p>
</list-item>
<list-item id="j_infor621_li_030">
<label>4.</label>
<p>Adapting the order-one <sans-serif>PPM</sans-serif> algorithm for forward adaptive encoding on word-based alphabets.</p>
</list-item>
</list>
<p>The experiments show that, in terms of compression efficiency, the new approach can outperform existing ones, even when the entire list of words is considered to be a part of the header. Furthermore, the proposed <sans-serif>PPM</sans-serif> style method outperforms <sans-serif>bzip2</sans-serif> by 4–6% and is, in most cases, even better than the powerful <sans-serif>PPMd</sans-serif>-based compressor. Notably, the integration of the <sans-serif>PPM</sans-serif> framework with forward adaptive encoding effectively resolves the zero-frequency problem, which constitutes a major limitation of the classical <sans-serif>PPM</sans-serif> method. Furthermore, the proposed compression schemes are shown to be efficient in terms of both compression performance and algorithmic time complexity, while satisfying acceptable asymptotic bounds.</p>
<p>Experimental evaluation and optimization of encoding and decoding times are left for future work, along with the integration of our approach with other dictionary-based compression and context selection techniques. This will enable the proposed methods to be incorporated into high-performance data compression and information retrieval systems.</p>
</sec>
</body>
<back>
<ref-list id="j_infor621_reflist_001">
<title>References</title>
<ref id="j_infor621_ref_001">
<mixed-citation publication-type="chapter"><string-name><surname>Adiego</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Martínez-Prieto</surname>, <given-names>M.A.</given-names></string-name>, <string-name><surname>de la Fuente</surname>, <given-names>P.</given-names></string-name> (<year>2009</year>). <chapter-title>High performance word-codeword mapping algorithm on PPM</chapter-title>. In: <source>Data Compression Conference, DCC 2009</source>, <conf-loc>Snowbird, UT, USA</conf-loc>, pp. <fpage>23</fpage>–<lpage>32</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_002">
<mixed-citation publication-type="journal"><string-name><surname>Aljehane</surname>, <given-names>N.O.</given-names></string-name>, <string-name><surname>Teahan</surname>, <given-names>W.J.</given-names></string-name> (<year>2017</year>). <article-title>Word-based grammars for PPM</article-title>. <source>International Journal of Advanced Computer Science and Applications</source>, <volume>8</volume>(<issue>10</issue>), <fpage>1</fpage>–<lpage>11</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_003">
<mixed-citation publication-type="journal"><string-name><surname>Avrunin</surname>, <given-names>R.M.</given-names></string-name>, <string-name><surname>Klein</surname>, <given-names>S.T.</given-names></string-name>, <string-name><surname>Shapira</surname>, <given-names>D.</given-names></string-name> (<year>2022</year>). <article-title>Combining forward compression with PPM</article-title>. <source>SN Computer Science</source>, <volume>3</volume>(<issue>239</issue>), <fpage>239</fpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_004">
<mixed-citation publication-type="journal"><string-name><surname>Baruch</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Klein</surname>, <given-names>S.T.</given-names></string-name>, <string-name><surname>Shapira</surname>, <given-names>D.</given-names></string-name> (<year>2017</year>). <article-title>A space efficient direct access data structure</article-title>. <source>Journal of Discrete Algorithms</source>, <volume>43</volume>, <fpage>26</fpage>–<lpage>37</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_005">
<mixed-citation publication-type="journal"><string-name><surname>Baruch</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Klein</surname>, <given-names>S.T.</given-names></string-name>, <string-name><surname>Shapira</surname>, <given-names>D.</given-names></string-name> (<year>2020</year>). <article-title>Accelerated partial decoding in wavelet trees</article-title>. <source>Discrete Applied Mathematics</source>, <volume>274</volume>, <fpage>2</fpage>–<lpage>10</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.dam.2018.07.016" xlink:type="simple">https://doi.org/10.1016/j.dam.2018.07.016</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_006">
<mixed-citation publication-type="journal"><string-name><surname>Bratley</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Choueka</surname>, <given-names>Y.</given-names></string-name> (<year>1982</year>). <article-title>Processing truncated terms in document retrieval systems</article-title>. <source>Information Processing and Management</source>, <volume>18</volume>(<issue>5</issue>), <fpage>257</fpage>–<lpage>266</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_007">
<mixed-citation publication-type="other"><string-name><surname>Burrows</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Wheeler</surname>, <given-names>D.J.</given-names></string-name> (<year>1994</year>). <italic>A Block-Sorting Lossless Data Compression Algorithm</italic>. Technical Report 124. Digital Equipment Corporation.</mixed-citation>
</ref>
<ref id="j_infor621_ref_008">
<mixed-citation publication-type="journal"><string-name><surname>Cleary</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Witten</surname>, <given-names>I.</given-names></string-name> (<year>1984</year>). <article-title>Data compression using adaptive coding and partial string matching</article-title>. <source>IEEE Transactions on Communications</source>, <volume>32</volume>(<issue>4</issue>), <fpage>396</fpage>–<lpage>402</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_009">
<mixed-citation publication-type="journal"><string-name><surname>Elias</surname>, <given-names>P.</given-names></string-name> (<year>1975</year>). <article-title>Universal codeword sets and representations of the integers</article-title>. <source>IEEE Transactions on Information Theory</source>, <volume>21</volume>(<issue>2</issue>), <fpage>194</fpage>–<lpage>203</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_010">
<mixed-citation publication-type="journal"><string-name><surname>Ferragina</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Rotundo</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Vinciguerra</surname>, <given-names>G.</given-names></string-name> (<year>2025</year>). <article-title>Two-level massive string dictionaries</article-title>. <source>Information Systems</source>, <volume>128</volume>, <elocation-id>102490</elocation-id></mixed-citation>
</ref>
<ref id="j_infor621_ref_011">
<mixed-citation publication-type="chapter"><string-name><surname>Fruchtman</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Gross</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Klein</surname>, <given-names>S.T.</given-names></string-name>, <string-name><surname>Shapira</surname>, <given-names>D.</given-names></string-name> (<year>2021</year>). <chapter-title>Backward weighted coding</chapter-title>. In: <string-name><surname>Bilgin</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Marcellin</surname>, <given-names>M.W.</given-names></string-name>, <string-name><surname>Serra-Sagristà</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Storer</surname>, <given-names>J.A.</given-names></string-name> (Eds.), <source>31st Data Compression Conference, DCC 2021</source>, <conf-loc>Snowbird, UT, USA</conf-loc>, <conf-date>March 23–26, 2021</conf-date>. <publisher-name>IEEE</publisher-name>, pp. <fpage>93</fpage>–<lpage>102</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_012">
<mixed-citation publication-type="journal"><string-name><surname>Fruchtman</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Gross</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Klein</surname>, <given-names>S.T.</given-names></string-name>, <string-name><surname>Shapira</surname>, <given-names>D.</given-names></string-name> (<year>2022</year>). <article-title>Weighted forward looking adaptive coding</article-title>. <source>Theoretical Computer Science</source>, <volume>930</volume>, <fpage>86</fpage>–<lpage>99</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_013">
<mixed-citation publication-type="journal"><string-name><surname>Fruchtman</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Gross</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Klein</surname>, <given-names>S.T.</given-names></string-name>, <string-name><surname>Shapira</surname>, <given-names>D.</given-names></string-name> (<year>2023</year>). <article-title>Bidirectional adaptive compression</article-title>. <source>Discrete Applied Mathematics</source>, <volume>330</volume>, <fpage>40</fpage>–<lpage>50</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_014">
<mixed-citation publication-type="journal"><string-name><surname>Grossi</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Vitter</surname>, <given-names>J.S.</given-names></string-name> (<year>2005</year>). <article-title>Compressed suffix arrays and suffix trees with applications to text indexing and string matching</article-title>. <source>SIAM Journal on Computing</source>, <volume>35</volume>(<issue>2</issue>), <fpage>378</fpage>–<lpage>407</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1137/S0097539702402354" xlink:type="simple">https://doi.org/10.1137/S0097539702402354</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_015">
<mixed-citation publication-type="chapter"><string-name><surname>Grossi</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Gupta</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Vitter</surname>, <given-names>J.S.</given-names></string-name> (<year>2003</year>). <chapter-title>High-order entropy-compressed text indexes</chapter-title>. In: <source>SODA ’03: Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA)</source>, pp. <fpage>841</fpage>–<lpage>850</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_016">
<mixed-citation publication-type="journal"><string-name><surname>Huffman</surname>, <given-names>D.</given-names></string-name> (<year>1952</year>). <article-title>A method for the construction of minimum redundancy codes</article-title>. <source>Proceedings of the IRE</source>, <volume>40</volume>(<issue>9</issue>), <fpage>1098</fpage>–<lpage>1101</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_017">
<mixed-citation publication-type="journal"><string-name><surname>Klein</surname>, <given-names>S.T.</given-names></string-name>, <string-name><surname>Saadia</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Shapira</surname>, <given-names>D.</given-names></string-name> (<year>2020</year>). <article-title>Forward looking Huffman coding</article-title>. <source>Theory of Computing Systems</source>, <volume>65</volume>, <fpage>593</fpage>–<lpage>612</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_018">
<mixed-citation publication-type="chapter"><string-name><surname>Klein</surname>, <given-names>S.T.</given-names></string-name>, <string-name><surname>Shapira</surname>, <given-names>D.</given-names></string-name> (<year>2009</year>). <chapter-title>On the usefulness of backspace</chapter-title>. In: <source>Proceedings of the Prague Stringology Conference</source>, <conf-loc>2009, Prague, Czech Republic, August 31–September 2</conf-loc>, pp. <fpage>80</fpage>–<lpage>89</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_019">
<mixed-citation publication-type="journal"><string-name><surname>Moffat</surname>, <given-names>A.</given-names></string-name> (<year>1989</year>). <article-title>Word-based text compression</article-title>. <source>Software: Practice and Experience</source>, <volume>19</volume>(<issue>2</issue>), <fpage>185</fpage>–<lpage>198</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_020">
<mixed-citation publication-type="book"><string-name><surname>Navarro</surname>, <given-names>G.</given-names></string-name> (<year>2016</year>). <source>Compact Data Structures: A Practical Approach</source>. <publisher-name>Cambridge University Press</publisher-name>, <publisher-loc>Cambridge, UK</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_021">
<mixed-citation publication-type="other"><string-name><surname>Seward</surname>, <given-names>J.</given-names></string-name> (<year>2019</year>). bzip2 and libbzip2, version 1.0.8. A program and library for data compression.</mixed-citation>
</ref>
<ref id="j_infor621_ref_022">
<mixed-citation publication-type="journal"><string-name><surname>Vitter</surname>, <given-names>J.S.</given-names></string-name> (<year>1987</year>). <article-title>Design and analysis of dynamic Huffman codes</article-title>. <source>Journal of the ACM</source>, <volume>34</volume>(<issue>4</issue>), <fpage>825</fpage>–<lpage>845</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_023">
<mixed-citation publication-type="chapter"><string-name><surname>Zavadskyi</surname>, <given-names>I.O.</given-names></string-name>, <string-name><surname>Anisimov</surname>, <given-names>A.V.</given-names></string-name> (<year>2020</year>). <chapter-title>Reverse multi-delimiter compression codes</chapter-title>. In: <string-name><surname>Bilgin</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Marcellin</surname>, <given-names>M.W.</given-names></string-name>, <string-name><surname>Serra-Sagristà</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Storer</surname>, <given-names>J.A.</given-names></string-name> (Eds.), <source>Data Compression Conference, DCC 2020</source>, <conf-loc>Snowbird, UT, USA</conf-loc>, <conf-date>March 24–27, 2020</conf-date>. <publisher-name>IEEE</publisher-name>, pp. <fpage>173</fpage>–<lpage>182</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_024">
<mixed-citation publication-type="chapter"><string-name><surname>Zavadskyi</surname>, <given-names>I.O.</given-names></string-name>, <string-name><surname>Klein</surname>, <given-names>S.T.</given-names></string-name>, <string-name><surname>Shapira</surname>, <given-names>D.</given-names></string-name> (<year>2024</year>). <chapter-title>Word-based forward coding</chapter-title>. In: <source>Data Compression Conference, DCC 2024</source>, <conf-loc>Snowbird, UT, USA</conf-loc>, pp. <fpage>352</fpage>–<lpage>361</lpage>.</mixed-citation>
</ref>
<ref id="j_infor621_ref_025">
<mixed-citation publication-type="book"><string-name><surname>Zipf</surname>, <given-names>G.K.</given-names></string-name> (<year>1949</year>). <source>Human Behavior and the Principle of Least Effort: An Introduction to Human Eoclogy</source>. <publisher-name>Addison-Wesley Press</publisher-name>, <publisher-loc>Cambridge, UK</publisher-loc>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
