[Bug]: Searching hyphenated words or phrases continued in the next line doesn't work #700

Open
1 task done
jdujava opened this issue Dec 10, 2024 · 2 comments
Labels
bug Something isn't working

Comments


jdujava commented Dec 10, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Problem

Searching for hyphenated words, or for phrases that continue on the next line, doesn't work.

Steps to reproduce

See the attached file: zathura-search-hyphenated.pdf

Try searching for the words/phrases suggested in the file. Occurrences that are hyphenated or otherwise broken across a line break are not found.

Expected behavior

Zathura should find all occurrences. When a hyphen/dash is followed by a newline, it should treat the two parts as one word. Similarly, for the purposes of searching, a newline at the end of a line should be equivalent to a space.
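
The requested behavior can be sketched as a normalization pass over the extracted page text before searching. This is only an illustration in Python, not Zathura's or poppler's code; `normalize_for_search` is a hypothetical helper, and it assumes the extracted text already contains `\n` at line ends and that a hyphen directly before a newline is always a soft hyphen.

```python
import re

def normalize_for_search(text: str) -> str:
    """Collapse line-break artifacts so a plain substring search can match."""
    # Join a word split as "hyphen-\nated": drop the hyphen and the newline.
    # Assumption: a hyphen directly before a newline is a soft hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat any remaining newline (plus adjacent spaces/tabs) as one space.
    text = re.sub(r"[ \t]*\n[ \t]*", " ", text)
    return text

page_text = "Searching for hyphen-\nated words and phrases continued\non the next line."
normalized = normalize_for_search(page_text)

print("hyphenated words" in normalized)
print("continued on the next" in normalized)
```

The soft-hyphen assumption is the weak point: a genuinely hyphenated compound such as "well-\nknown" would be joined into "wellknown", so a real implementation would probably need to try both spellings.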

zathura version (zathura --version)

zathura 0.5.9

girara version (zathura --version)

girara 0.4.5

zathura backend

poppler

@jdujava jdujava added the bug Something isn't working label Dec 10, 2024

alerque commented Dec 11, 2024

Because of the way PDFs are constructed (with a lot of possible variance), it is not always possible to deduce this information. A PDF creator can embed information that could be used for this purpose, but most PDFs are not constructed in a way that makes it as simple as your issue report suggests. A "newline at the end of a line" is just not a thing, nor is a "hyphen followed by a newline", at least not in any universal sort of way. The shaping and positioning of each letter or batch of letters is all done in advance, and absolute positions or relative offsets on the page are recorded, but there is no concept of a "new line". The code for a subscript (which happens to be offset below the previously output characters) is going to look similar to the code for going to a new line: it is just a new x/y position. One can try to guess based on whether the y position goes lower and the x position is reduced, but this guessing can and does go very badly with some PDF constructions.
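
To make the guessing concrete, here is a minimal sketch of the "y goes lower and x is reduced" heuristic. Everything in it is an assumption for illustration: `Run` is a made-up structure rather than a poppler type, y is taken to grow downwards, and the `line_height` threshold is arbitrary.

```python
from typing import NamedTuple

class Run(NamedTuple):
    """A positioned chunk of text (hypothetical, not a poppler type)."""
    x: float   # left edge, in page units
    y: float   # baseline; assumed to grow downwards
    text: str

def looks_like_line_break(prev: Run, cur: Run, line_height: float = 12.0) -> bool:
    # Guess: the baseline drops by a good part of a line AND x jumps back left.
    moved_down = cur.y - prev.y > 0.5 * line_height
    moved_left = cur.x < prev.x
    return moved_down and moved_left

def join_runs(runs: list[Run]) -> str:
    # Concatenate runs, inserting "\n" wherever the heuristic fires.
    out = [runs[0].text] if runs else []
    for prev, cur in zip(runs, runs[1:]):
        if looks_like_line_break(prev, cur):
            out.append("\n")
        out.append(cur.text)
    return "".join(out)

runs = [
    Run(72, 100, "x"),
    Run(78, 103, "2"),            # subscript: slightly lower, further right
    Run(84, 100, " is hyphen-"),
    Run(72, 114, "ated"),         # next line: lower and back at the margin
]
print(repr(join_runs(runs)))
```

In this hand-picked example the subscript is not mistaken for a line break, but that depends entirely on the chosen threshold and layout. Multi-column pages, rotated text, or runs emitted out of reading order would defeat it, which is the failure mode described above.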


jdujava commented Dec 11, 2024

Sure, I agree, in complete generality it is probably more difficult than I made it sound.

However, with the attached PDF (and with virtually every other PDF I have tried this on), selecting/copying text across a line break in Zathura also includes the newline (when I paste it somewhere, the newline appears at the "correct" position).

When I paste it into the Zathura search box, it looks like this:
[image]
but Zathura still can't find the text I copied.
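
One way to make a pasted query like this match, sketched here in Python under the assumption that the page text is available as a string with `\n` at line ends, is to treat every whitespace run in the query as "any whitespace" in the text. `query_to_pattern` is a hypothetical helper, not something Zathura provides.

```python
import re

def query_to_pattern(query: str) -> re.Pattern:
    # Split the query on any whitespace (including a copied newline) and
    # let each gap match any run of whitespace in the page text.
    words = query.split()
    return re.compile(r"\s+".join(re.escape(w) for w in words))

page_text = "phrases continued\non the next line"
pasted_query = "continued\non the"   # copied across the line break
typed_query = "continued on the"     # typed by hand with a space

for q in (pasted_query, typed_query):
    m = query_to_pattern(q).search(page_text)
    print(m.span() if m else None)
```

Because the match runs against the unmodified text, the resulting span can be used directly for highlighting. It does not handle words hyphenated across the break.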

Both browser PDF viewers and, for example, Evince generally handle this correctly (though I am not saying that some special PDFs can't be weird).
