Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve relative URLs within RSS article description #21943

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

zent1n0
Copy link

@zent1n0 zent1n0 commented Dec 3, 2024

In RSS widget, a relative url will cause an infinite loop loading for resource, which won't break until unselect the article explicitly, thus caused the memory leak described in #20117 .

The code provides a conversion over the first 3 forms of relative path as described in MDN web docs using regex, and only function on <a href> and <img src> html tags. With the code, RSS widget now should follow the correct absolute urls.

However, the #20117 could not be fully closed with the code, as now log shows the RSS widget still send around 5 reqs/sec in the condition which was causing a rapid flow of qDebug log which indicates it stuck on a infinite loop sending requests, failing fast and immediately retries. The former situation is eating up a single CPU core at 100% and 3 MiB/s RAM increase, while the latter at below 10% CPU usage and 1 MiB/s RAM.
image
From the snapshot you could know the url is correct, but with no valid response (QNetworkReply::RemoteHostClosedError) and I don't know why. The fact is that the site byr.pt is an IPv6 only site and the url will provide an 30x jump to another path.

Happy with any assistance or suggestions on further improvements. 😃 🚀

@zent1n0 zent1n0 marked this pull request as ready for review December 3, 2024 07:28
@luzpaz luzpaz added the RSS RSS-related issues/changes label Dec 3, 2024
Copy link
Member

@glassez glassez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only some preliminary comments.

.gitignore Outdated
@@ -1,4 +1,5 @@
.vscode/
.cache
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you believe that qBittorrent .gitignore should contain it you need to provide separate PR with description of why it is needed to be done at project level. Otherwise you could just add it in your local .gitignore file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'm using clangd extension and it produces many file under .cache/. I'll draft another PR later.

QString normalizeBasePath = basePath.endsWith(u'/') ? basePath : basePath + u'/';
QRegularExpressionMatchIterator iter = rx.globalMatch(html);

while (iter.hasNext()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invalid coding style.

Suggested change
while (iter.hasNext()) {
while (iter.hasNext())
{

rx.setPattern(
uR"(((<a\s+[^>]*?href|<img\s+[^>]*?src)\s*=\s*["'])((https?|ftp):)?(\/\/[^\/]*)?(\/?[^\/"].*?)(["']))"_s);

QString normalizeBasePath = basePath.endsWith(u'/') ? basePath : basePath + u'/';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
QString normalizeBasePath = basePath.endsWith(u'/') ? basePath : basePath + u'/';
QString normalizedBasePath = basePath.endsWith(u'/') ? basePath : basePath + u'/';

QString relativePath = match.captured(6);
if (relativePath.startsWith(u'/'))
relativePath = relativePath.mid(1);
if (!match.captured(4).isEmpty() && !match.captured(5).isEmpty())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to review/maintain regex-related logic so it would be better to either use named capturing groups or just assign them to local (const) variables before using in the code.

In RSS widget, a relative url will cause an infinite loop loading for
resource, which won't break until unselect the article explicitly

format code
@zent1n0
Copy link
Author

zent1n0 commented Dec 5, 2024

Fixed a bug with //{hostname} like URL and some coding styles, should make the regex logic more friendly to read.

Copy link
Member

@glassez glassez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yet another review iteration.

src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
@@ -91,6 +91,7 @@ private slots:

private:
bool eventFilter(QObject *obj, QEvent *event) override;
void convertRelativePathToAbsolute(QString &html, const QString &basePath) const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function does not need (and isn't supposed to need) access to RSSWidget class members, so it is preferable to declare it in anonymous namespace in rsswidget.cpp file.

src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
zent1n0 and others added 3 commits December 5, 2024 23:34
@zent1n0
Copy link
Author

zent1n0 commented Dec 6, 2024

Adjusted my code. Could you review the code for this round @glassez

Copy link
Member

@Chocobo1 Chocobo1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only commenting on coding style.
I'll leave the correctness to others.

src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
@glassez
Copy link
Member

glassez commented Dec 6, 2024

Convert relative paths to absolute

Well, using the wrong term was what looked confusing to me.
Don't we really talk about the URLs, not the paths?

Co-authored-by: Chocobo1 <[email protected]>
Co-authored-by: Vladimir Golovnev <[email protected]>
@zent1n0 zent1n0 changed the title RSS widget: Convert relative paths to absolute RSS widget: Convert relative URLs to absolute Dec 6, 2024
@glassez glassez changed the title RSS widget: Convert relative URLs to absolute Convert relative URLs to absolute within RSS article descriptions Dec 6, 2024
@glassez glassez changed the title Convert relative URLs to absolute within RSS article descriptions Resolve relative URLs within RSS article descriptions Dec 6, 2024
@glassez glassez changed the title Resolve relative URLs within RSS article descriptions Resolve relative URLs within RSS article description Dec 6, 2024
@@ -54,6 +54,8 @@
#include "feedlistwidget.h"
#include "ui_rsswidget.h"

void convertRelativeUrlToAbsolute(QString &html, const QString &baseUrl);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be inside anonymous namespace as I said before, i.e.:

namespace
{
    void convertRelativeUrlToAbsolute(QString &html, const QString &baseUrl)
    {
        // code
    }
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My misunderstanding of anonymous namespace. I thought it meant an static scope in the source file. I'll correct it later.

const QString fullMatch = match.captured(0);
const QString prefix = match.captured(1);
const QString suffix = match.captured(7);
const QString absolutePath = normalizedBaseUrl + relativePath;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absoluteURL ???

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. Corrected. 🚀

src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
@zent1n0
Copy link
Author

zent1n0 commented Dec 7, 2024

BTW it's not an final solution. The true issue is the RSS htmlbrowser infinitely retries immediate loading resources from html. The issue behind is hard to track with call stack, since it is called by signals and slots.

@zent1n0
Copy link
Author

zent1n0 commented Dec 11, 2024

@glassez Shall we start next review iteration? I'm looking forward to a release with less memory issue in RSS widget. 🚀

@glassez
Copy link
Member

glassez commented Dec 11, 2024

The true issue is the RSS htmlbroser infinitely retries immediate loading resources from html.

What do you mean? Does it repeatedly download the same images etc. from the Internet?

@zent1n0
Copy link
Author

zent1n0 commented Dec 12, 2024

Does it repeatedly download the same images etc. from the Internet?

Yes, and specially when the resource is not available.

In my case a relative URL /oss/userfiles/5e2d209f-2306-46b5-b596-e89cf1c57598 will be completed as http:///oss/userfiles/5e2d209f-2306-46b5-b596-e89cf1c57598 and fails in a millisecond. With my patch it converts into https://byr.pt/oss/userfiles/5e2d209f-2306-46b5-b596-e89cf1c57598 but still got a infinite loop at a slower pace. Maybe because the URL above leads to an redirect into https://oss.byr.pt/byrbt/userfiles/cb/cb785da349087f79d06183902f50b0d252fac050?{some_other_query_params}.

I'm wondering if the htmlbrowser calls to QNetworkRequest could perform with an option that allows redirect, and could get it working with RSS widget.

@glassez
Copy link
Member

glassez commented Dec 14, 2024

Does it repeatedly download the same images etc. from the Internet?

Yes, and specially when the resource is not available.

Yet another question. It behaves the same way if HTML initially contains absolute URLs, right?

@zent1n0
Copy link
Author

zent1n0 commented Dec 14, 2024

Does it repeatedly download the same images etc. from the Internet?

Yes, and specially when the resource is not available.

Yet another question. It behaves the same way if HTML initially contains absolute URLs, right?

No, the absolute URLs will be rendered normally most of the time. Sometimes, the failed elements display as a blank page icon, but I'm not cofident to tell if in this situation the infinite loading issue is occurring.

@glassez
Copy link
Member

glassez commented Dec 14, 2024

Yet another question. It behaves the same way if HTML initially contains absolute URLs, right?

No, the absolute URLs will be rendered normally most of the time. Sometimes, the failed elements display as a blank page icon, but I'm not cofident to tell if in this situation the infinite loading issue is occurring.

I'm confused. Do you still want to say that initially absolute URLs and URLs that was relative and transformed to absolute by your code behave differently? (provided that they point to existing resources)

@zent1n0
Copy link
Author

zent1n0 commented Dec 14, 2024

Yet another question. It behaves the same way if HTML initially contains absolute URLs, right?

Let's figure it out.

  1. My convert function only rewrite the URL when the matched result is not absolute one. If it is absolute, as the line 641-647, this iteration should be skipped, left text unmodified. So the code will not perform differently with an initially absolute URL.
  2. If you mean the issue itself, the initially absolue URLs sometimes fail to load.

Does it repeatedly download the same images etc. from the Internet?

No, the absolute URLs will be rendered normally most of the time. Sometimes, the failed elements display as a blank page icon, but I'm not cofident to tell if in this situation the infinite loading issue is occurring.

I mean I can't confirm the failed elements with an absolute URL would cause the same infinite loading loop, just the same behavior that is to stuck like relative ones mentioned in the issue.

Sorry for my poor English that might make the answer confusing 😢

@glassez
Copy link
Member

glassez commented Dec 16, 2024

If you mean the issue itself

Sure. I understand the logic of the code. I am interested in the results of your investigation on the problem itself.

Does it repeatedly download the same images etc. from the Internet?

Yes, and specially when the resource is not available.

The most important thing that interests me is whether it "repeatedly download the same images etc." when the resource IS vavailable and was successfully downloaded previously.

As for "not available" resources, we have either endless attempts to get them from a local disk (without this patch) or from a network location (with this patch).

Considering that this patch fixes the problem in the case where the converted to absolute URL points to an available resource, I approve it. The problem of endless redownload of non available resources seems to be on Qt side (however, it would be nice to inform them about it, so that they can fix it further).

src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
Co-authored-by: Vladimir Golovnev <[email protected]>
@zent1n0
Copy link
Author

zent1n0 commented Dec 16, 2024

Thanks for your reviews and suggestions. Although it's not an final solution, I'm looking forward to these snippets in released version. I'll be grateful if you could give a probable prediction that which tag will this be included in.

Moreover, it will be great if someone can update uncrustify.cfg to our latest coding style guidelines. In my practice, uncrustify only get a proper new line before the namespace definition, but get other braces in {do, if, for, while} reverted into brace within single line.

@glassez
Copy link
Member

glassez commented Dec 16, 2024

Moreover, it will be great if someone can update uncrustify.cfg to our latest coding style guidelines.

I don't use it so I don't care. Let someone concerned take care of it.

@glassez
Copy link
Member

glassez commented Dec 16, 2024

@zent1n0
How about #21943 (comment)?

src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
@glassez glassez added this to the 5.1 milestone Dec 16, 2024
@glassez glassez self-assigned this Dec 16, 2024
Co-authored-by: Vladimir Golovnev <[email protected]>
glassez
glassez previously approved these changes Dec 17, 2024
@glassez glassez requested a review from a team December 17, 2024 03:10
src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
Copy link
Member

@Chocobo1 Chocobo1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last few comments about coding style (hopefully)

src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
src/gui/rss/rsswidget.cpp Outdated Show resolved Hide resolved
zent1n0 and others added 2 commits December 18, 2024 08:58
Co-authored-by: Chocobo1 <[email protected]>
Co-authored-by: Vladimir Golovnev <[email protected]>
@glassez glassez requested review from Chocobo1 and glassez December 18, 2024 06:41
@zent1n0
Copy link
Author

zent1n0 commented Dec 18, 2024

Should I squash the commits into one?

@glassez
Copy link
Member

glassez commented Dec 18, 2024

Should I squash the commits into one?

You can do it. Otherwise it will be done while merging PR.

@zent1n0
Copy link
Author

zent1n0 commented Dec 18, 2024

I prefer automatic methods 🤖 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
RSS RSS-related issues/changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants