Full-Text Indexing: Mixed Content #2079

ChristianGruen · 2022-03-05T08:10:58Z

Presently, only text nodes and attribute values end up in the BaseX indexes. Whenever a path expression points to a text node (or an element that only has text nodes as children), it can be rewritten for index access, no matter what the full path looks like. This design decision turned out to be powerful for exact searches and for full-text queries on arbitrary text nodes, but it is too inflexible for mixed-content data.

A few years ago, we added features to restrict indexing to the text nodes of specific element names. We could enhance this approach for full-text queries:

Index the string value of specific elements, which are specified via FTINCLUDE, and
rewrite for index access only those paths that do not address descendants of the indexed elements.

As an example, a user might want to query the head and p elements of a TEI document:

<div>
  <head>No. 2, September 2006</head>
  <p>It was clearly popular, for it appears in Peter Stent’s
advertisements of 1654 and 1662, and is still listed in his successor
John Overton’s catalogue of 1673,<note>Alexander Globe, <title
level="m">Peter Stent, London Printseller, c.</title> 1642-65
(Vancouver, 1985), p. 123 (no.*448).</note> yet only the unique
impression in the British Museum's Department of Prints and Drawings
survives - testimony to the great rarity of such popular material.</p>
</div>

The following queries could then be evaluated via the index:

/div[head contains text '2006']
//p[. contains text 'popular']

Queries such as the following would no longer be rewritten for index access:

//p[text() contains text 'popular']
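For reference, a minimal sketch of how a database with a restricted full-text index might be created using the existing FTINDEX and FTINCLUDE options. The database name 'tei-demo' and the path 'doc.xml' are made up for illustration, and the sample text is shortened. Whether a query is actually answered from the index depends on the rewrite rules discussed in this issue.

```xquery
(: Sketch only: 'tei-demo' and 'doc.xml' are hypothetical names. :)
db:create(
  'tei-demo',
  <div>
    <head>No. 2, September 2006</head>
    <p>It was clearly popular, for it appears in advertisements of 1654 and 1662.</p>
  </div>,
  'doc.xml',
  (: build a full-text index, restricted to head and p elements :)
  map { 'ftindex': true(), 'ftinclude': 'head,p' }
)
```

Running the queries above against such a database and inspecting the query info should show whether a given path was rewritten for index access.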

liamquin · 2022-04-14T15:55:30Z

Why not rewrite //p[text() contains text 'popular'] as
//p[text()[. contains text 'popular']]
Would it then use the index?

Or, maybe better,
text()[. contains text 'popular']/..[self::p]
?

ChristianGruen · 2022-04-18T14:15:39Z

If we can assess at compile time that all p elements in a database are leaf elements (i.e., have a single text node as child), we could indeed rewrite //p[text() contains text 'popular'] for index access, too.

Otherwise, if p elements have child elements, we don’t know which substring of the indexed text occurs in a given text node. The following two expressions yield different results:

<p>popular<suffix>s</suffix></p>        contains text 'popular',
<p>popular<suffix>s</suffix></p>/text() contains text 'popular'
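The difference can be made visible by tokenizing both inputs. This is a small sketch that assumes the BaseX-specific ft:tokenize function with its default options:

```xquery
let $p := <p>popular<suffix>s</suffix></p>
return (
  (: string value of the element: one token, "populars" :)
  ft:tokenize(string($p)),
  (: child text node only: one token, "popular" :)
  ft:tokenize($p/text())
)
```

An index built from the string value of p would contain the token of the first call only, so it cannot answer a query on the text node.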

graydon2014 · 2022-04-25T10:44:43Z

Without regard to practicality of indexing (because I have no idea!),

//p[normalize-space(.) contains text 'popular']

is what I'm usually after -- where is this phrase in the document? There can be a lot of inline markup and for "where's the phrase?" purposes I want to know the nearest common ancestor of all the text nodes in the phrase.

ChristianGruen · 2022-04-25T11:13:50Z

For finding the nearest common ancestor elements, it is still advisable to search at the text node level:

let $xml := document {
  <p>
    There’s a <b>popular</b> saying …
  </p>
}
return $xml//p//text()[. contains text 'popular']/ancestor::*[1]  (:  → <b>...</b> :)

If nodes are atomized, things get complicated because the found tokens may appear on different node levels. The token in the following query is assembled from the child text nodes of p and b:

let $xml := document {
  <p>There’s a <b>p</b>opular saying …</p>
}
return $xml//p[. contains text 'popular']  (:  → <p>...</p> :)

As for normalize-space(.): full-text tokenization already normalizes whitespace (so you can replace normalize-space(.) with .), and it additionally removes diacritics, normalizes case, etc. The behavior can be made explicit by calling ft:tokenize.
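A short sketch of that equivalence, assuming default full-text options; the sample element is invented for illustration:

```xquery
let $p := <p>
  A   <b>popular</b>
  saying
</p>
return (
  (: both predicates select the same element :)
  $p[normalize-space(.) contains text 'popular saying'],
  $p[. contains text 'popular saying'],
  (: the tokens the full-text processor works with :)
  ft:tokenize(string($p))
)
```

Because matching works on tokens, the extra whitespace and line breaks in the string value have no effect on the result.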

graydon2014 · 2022-04-25T12:31:03Z

I managed to express the use case in a muddled way; apologies!

let $xml := document {
<bucket>
<title>Complex Reference</title>
There's a complex <link>reference</link> to this document.
</bucket>
}
return $xml//*[. contains text 'complex reference']

This is the kind of search I want to run against a relatively large amount of content (e.g., a national legal code) where the specific element is not known. In principle it could be one of a number of elements, and in practice it is a variety of elements expressing different semantics. In some cases you want the titles and in other cases the references, but the first step is to find everywhere the phrase occurs. The goal is to get the closest containing element with all the text nodes of the searched phrase.

The query above returns all the elements of which it is true, which is what it's supposed to do:
<bucket>
<title>Complex Reference</title>
There's a complex <link>reference</link>
 to this document.
</bucket>
<title>Complex Reference</title>
There's a complex <link>reference</link>
 to this document.
complex <link>reference</link>

But ideally there'd be a way to do the "closest ancestor" version with the case where it's a multi-word phrase with components in different text nodes. My (probably naive) thought is that maybe there could be an index of string properties of elements, which would allow returning the closest containing element of the full-text match.

ChristianGruen · 2022-07-31T21:45:34Z

Postponed to a later version.

ChristianGruen added this to the 10 milestone Mar 5, 2022

ChristianGruen removed this from the 10 milestone Jul 31, 2022

ChristianGruen added the feature label Oct 31, 2022

ChristianGruen changed the title ~~Full-Text Indexing: Index specific full-text elements~~ Full-Text Indexing: Mixed Content Jul 5, 2023
