Full-Text Indexing: Mixed Content #2079
Comments
Why not rewrite //p[text() contains text 'popular'] as Or, better maybe,
If we can assess at compile time that all Otherwise, if <p>popular<suffix>s</suffix></p> contains text 'popular',
<p>popular<suffix>s</suffix></p>/text() contains text 'popular'
Without regard to practicality of indexing (because I have no idea!), //p[normalize-space(.) contains text 'popular'] is what I'm usually after -- where is this phrase in the document? There can be a lot of inline markup and for "where's the phrase?" purposes I want to know the nearest common ancestor of all the text nodes in the phrase.
For finding the nearest common ancestor elements, it is still advisable to search at the text node level:
let $xml := document {
  <p>
    There’s a <b>popular</b> saying …
  </p>
}
return $xml//p//text()[. contains text 'popular']/ancestor::*[1] (: → <b>...</b> :)
If nodes are atomized, things get complicated because the found tokens may appear on different node levels. The token in the following query is assembled from the child text nodes of
let $xml := document {
  <p>There’s a <b>p</b>opular saying …</p>
}
return $xml//p[. contains text 'popular'] (: → <p>...</p> :)
About
I managed to express the use case in a muddled way; apologies!
let $xml := document {
This is the kind of search I want to run against a relatively large amount of content (e.g., a national legal code) where the specific element is not known. In principle it could be one of a number of elements; in practice it is a variety of elements expressing different semantics. In some cases you want the titles and in other cases the references, but the first step is to find everywhere the phrase occurs. The goal is to get the closest containing element with all the text nodes of the searched phrase.
The query above returns all the elements of which it is true, which is what it's supposed to do:
But ideally there'd be a way to do the "closest ancestor" version for the case where a multi-word phrase has components in different text nodes. My (probably naive) thought is that maybe there could be an index of string properties of elements, which would allow returning the closest containing element of the full-text match.
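One way to approximate the "closest containing element" step without any index support is to collect every element whose string value contains the phrase and then keep only the innermost ones. The following is a minimal sketch under that assumption; the sample document and the phrase are made up for illustration, and the query scans all elements, so it is not expected to scale to a large corpus:

```xquery
(: Hypothetical sample data: the phrase spans two text nodes. :)
let $xml := document {
  <section>
    <title>Sayings</title>
    <p>There is a <b>popular</b> saying about this.</p>
  </section>
}
(: Every element whose atomized string value contains the phrase. :)
let $hits := $xml//*[. contains text 'popular saying']
(: Keep only hits that have no descendant element that is also a hit. :)
return $hits[empty(descendant::* intersect $hits)]
```

For the sample above, the intent is that only the p element survives the filter, since section also contains the phrase but has a matching descendant. Whether a query of this shape could ever be rewritten for index access is exactly the open question of this issue.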
Postponed to a later version.
Presently, only text nodes and attribute values end up in the BaseX indexes. Whenever a path expression points to a text node (or an element that only has text nodes as children), it can be rewritten for index access, no matter what the full path looks like. This design decision turned out to be powerful for exact searches and for full-text queries on arbitrary text nodes, but it is too inflexible for mixed-content data.
A few years ago, we added features to restrict indexing to the text nodes of specific element names. We could enhance this approach for full-text queries:
FTINCLUDE and
As an example, a user might want to query the head and p elements of a TEI document:
The following queries could then be evaluated via the index:
Queries such as the following ones would not be rewritten for index access anymore: