maybe_string dealing with non-unicode strings #1329

wschuell · 2024-11-27T03:06:13Z

Hi,

I am cloning and analyzing a big bunch of repositories, and therefore stumble on rare edge cases.
In this repo: https://github.com/LSSTDESC/DeblenderVAE
One of the refs turned out not to be valid UTF-8 (a branch name, b'\xc3master'), and that causes even the clone to fail.

python3 -c "import pygit2; pygit2.clone_repository(path='temp_repo',url='https://github.com/LSSTDESC/DeblenderVAE')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "pygit2/__init__.py", line 217, in clone_repository
    payload.check_error(err)
  File "pygit2/callbacks.py", line 97, in check_error
    raise self._stored_exception
  File "pygit2/callbacks.py", line 424, in wrapper
    return f(*args)
           ^^^^^^^^
  File "pygit2/callbacks.py", line 565, in _update_tips_cb
    s = maybe_string(refname)
        ^^^^^^^^^^^^^^^^^^^^^
  File "pygit2/utils.py", line 37, in maybe_string
    return ffi.string(ptr).decode('utf8')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 20: invalid continuation byte

I tracked the function being wrapped: <function _update_tips_cb at 0x75b8ebb17420>

I found this, about ref names not necessarily being Unicode:
https://stackoverflow.com/questions/69174955/what-character-encoding-is-used-in-git-symbolic-refs-especially-on-windows

The clone succeeds if I catch the error in pygit2/utils.py like this:

def maybe_string(ptr):
    if not ptr:
        return None

    try:
        return ffi.string(ptr).decode("utf8")
    except UnicodeDecodeError:
        # latin1 maps every byte to a code point, so this fallback never fails
        return ffi.string(ptr).decode("latin1")

But of course there are other encodings than utf8 and latin1. What would a generic solution look like? Does maybe_string need to return a text string rather than a byte string?

If not, I suggest:

def maybe_string(ptr):
    if not ptr:
        return None

    try:
        return ffi.string(ptr).decode("utf8")
    except UnicodeDecodeError:
        return ffi.string(ptr)

which also clones without error. But then the return type is inconsistent: a text string in general, and a byte string only in rare cases...

jdavid · 2024-11-27T09:26:10Z

maybe_string(...) is used in a number of places, so I would keep it returning a text string.

There are other solutions, these work for me:

1. os.fsdecode(...)
2. .decode("utf8", errors="replace")
3. .decode("utf8", errors="surrogateescape")

I think I prefer number 3, could you try it?
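
For comparison, here is a small sketch of how the three options treat the failing ref name. The byte string is reconstructed from the traceback (byte 0xc3 at position 20), so the exact ref is an assumption. os.fsdecode depends on the platform's filesystem encoding, so its result is only printed, not relied on.

```python
import os

# Assumed ref name, reconstructed from the traceback: 0xc3 sits at position 20.
raw = b"refs/remotes/origin/\xc3master"

# Strict decoding is what currently fails in maybe_string.
try:
    raw.decode("utf8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# Option 1: platform-dependent (filesystem encoding and its error handler).
try:
    fs_decoded = os.fsdecode(raw)
except UnicodeDecodeError:
    fs_decoded = None

# Option 2: lossy, the bad byte becomes U+FFFD and cannot be recovered.
replaced = raw.decode("utf8", errors="replace")

# Option 3: lossless, the bad byte becomes the lone surrogate U+DCC3.
escaped = raw.decode("utf8", errors="surrogateescape")
round_trip = escaped.encode("utf8", errors="surrogateescape")

# ascii() keeps the output printable whatever the terminal encoding is.
print(ascii(fs_decoded))
print(ascii(replaced))
print(ascii(escaped), round_trip == raw)
```

Only option 3 lets the original bytes be recovered, provided the same error handler is used when encoding again.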

Also, please add a unit test, it can simply clone this DeblenderVAE repo.

Thanks!

wschuell · 2024-11-27T11:33:15Z

I implemented the surrogateescape solution, plus a unit test using the testrepo. However, I had to create the new branch with subprocess (I hope the git CLI is available in CI), because the surrogateescape handler is not applied when the name is encoded again.
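
A minimal sketch of the decoding side, assuming a stand-in that takes bytes instead of the cffi pointer that the real maybe_string receives. The name decode_git_bytes is hypothetical and only used for illustration.

```python
from typing import Optional


def decode_git_bytes(raw: Optional[bytes]) -> Optional[str]:
    """Hypothetical stand-in for maybe_string, operating on bytes directly."""
    if raw is None:
        return None
    # Valid UTF-8 decodes as usual; undecodable bytes map to lone
    # surrogates (U+DC80..U+DCFF) instead of raising UnicodeDecodeError.
    return raw.decode("utf8", errors="surrogateescape")


print(ascii(decode_git_bytes(b"refs/heads/master")))
print(ascii(decode_git_bytes(b"refs/heads/\xc3master")))
```

The return type stays a text string in every case, which matches the request to keep maybe_string returning str.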

I can look into it, but maybe you can tell me if what I did so far complies.

Also, I just created a cloned folder next to the testrepo folder. Is it cleaned up automatically, or should I clean it up manually?

wschuell · 2024-11-27T11:36:00Z

Just saw that black auto-formatted all the quotes. I can revert it if it's a problem.

jdavid · 2024-11-27T12:29:46Z

Yes please, don't make format changes.

Everything should be cleaned up automatically; this is handled by pytest. For example, when I tried your changes locally, the cloned repo was at /tmp/pytest-of-jdavid/pytest-current/.../testrepo/test_nonunicode_repo

The test fails on Windows (AppVeyor), https://ci.appveyor.com/project/jdavid/pygit2/builds/51066441/job/ueawtoexy0bqussu

Better to create the \xc3master branch once in testrepo (test/data/testrepo.zip), then git won't be required when running the tests, and it may fix the tests in AppVeyor.

wschuell · 2024-11-27T14:23:54Z

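The commit title is self-explanatory: creating the branch through pygit2 instead of subprocess fails because surrogates are not allowed. Here is the pytest output: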

___________________________________ test_nonunicode_branchname[\xc3master] ___________________________________

testrepo = pygit2.Repository('/tmp/pytest-of-wschuell/pytest-18/test_nonunicode_branchname__xc0/testrepo/.git/')
bstring = b'\xc3master'

    def test_nonunicode_branchname(testrepo, bstring):
>       testrepo.branches.local.create(bstring.decode("utf8", errors="surrogateescape"),commit=testrepo.head.target)

test/test_nonunicode.py:44: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pygit2.branches.Branches object at 0x78d035376e70>, name = '\udcc3master'
commit = 2be5719152d4f82c7302b1c0932d8e5f0a4a0e98, force = False

    def create(self, name: str, commit, force=False):
>       return self._repository.create_branch(name, commit, force)
E       UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed

/home/wschuell/.conda/envs/base_conda_12/lib/python3.12/site-packages/pygit2-1.15.1-py3.12-linux-x86_64.egg/pygit2/branches.py:79: UnicodeEncodeError
========================================== short test summary info ===========================================
FAILED test/test_nonunicode.py::test_nonunicode_branchname[\xc3master] - UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed
============================================= 1 failed in 0.04s ==============================================

I tried to track it down, but this goes into the C code.
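
The same error message can be reproduced without pygit2, which suggests (but does not prove) that the name is encoded strictly somewhere on the way to libgit2. A minimal sketch:

```python
# The branch name as it appears in the failing test: '\udcc3master'.
name = b"\xc3master".decode("utf8", errors="surrogateescape")

# Strict encoding rejects the lone surrogate, as in the traceback above.
try:
    name.encode("utf-8")
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)

# Encoding with the same error handler restores the original byte.
recovered = name.encode("utf-8", errors="surrogateescape")

print(message)
print(recovered)
```

So decoding with surrogateescape is only half of the round trip: any place that turns the string back into bytes would need the same error handler.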

wschuell · 2024-11-27T14:36:23Z

Alternatively, it works with subprocess using

subprocess.check_output(cmd.decode('utf8',errors='surrogateescape').split(" "), cwd=testrepo.workdir)

The remaining issue is that non-unicode byte strings cannot be passed to pygit2 to create branches, but at least my case is solved: cloning a repository with an already existing non-unicode branch.
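
A sketch of why the subprocess route can preserve the bytes, assuming a POSIX system. As far as I understand, subprocess converts str arguments the same way os.fsencode does, which applies surrogateescape on POSIX. The command below is hypothetical and is not executed.

```python
import os

# Hypothetical command, for illustration only; nothing is run here.
cmd = b"git branch \xc3master"

# Same decoding and splitting as in the subprocess call above.
args = cmd.decode("utf8", errors="surrogateescape").split(" ")
print([ascii(arg) for arg in args])

if os.name == "posix":
    # On POSIX, os.fsencode uses the surrogateescape handler, so the
    # lone surrogate turns back into the original 0xc3 byte.
    encoded = [os.fsencode(arg) for arg in args]
    print(encoded)
```

On Windows the conversion works differently, so this reasoning should not be assumed to carry over.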

jdavid · 2024-11-27T17:46:29Z

Windows and macOS tests fail.

What I meant before was to add a new test repo, for example test/data/repo_notutf.zip, which already has the non-utf8 branch \xc3master. However, I've tried this and somehow zip screws up the file: the reference \xc3master becomes something else.

Alternatively, what I can do is add such a repo to https://github.com/pygit2/

jdavid · 2024-11-27T17:57:49Z

Just did that, see https://github.com/pygit2/test_branch_notutf

Then:

python3 -c "import pygit2; pygit2.clone_repository(path='temp_repo', url='https://github.com/pygit2/test_branch_notutf.git')"

Could you change the unit test to clone https://github.com/pygit2/test_branch_notutf.git ?
The test should be simpler, and maybe it will work with macOS/Windows

jdavid · 2024-11-29T18:04:51Z

Thanks!

The tests for macOS and Windows still fail though. I've created a branch just to try, but regardless of the decoding method the errors are the same, see https://github.com/libgit2/pygit2/actions/workflows/tests.yml and https://ci.appveyor.com/project/jdavid/pygit2/history

There may be an issue with libgit2 clone on Windows and macOS; this needs some research.

wschuell · 2024-11-29T18:13:49Z

Looking at the error, it seems that a lockfile is written with the name of the ref/branch, and the name is passed as a text string rather than a byte string, which apparently fails because the surrogates are not recognized. I guess it could write files with non-UTF-8 names if the name were passed as bytes, but one would have to track down where this happens. I'll have a look in a few days (unless you do before).
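
One way to probe this hypothesis is to check whether the local filesystem accepts a file whose name contains the raw byte at all. The helper and the file name below are hypothetical; the result is expected to vary by platform and filesystem, so no particular outcome is assumed.

```python
import os
import tempfile


def accepts_bytes_filename(name: bytes) -> bool:
    """Return True if a file with this raw byte name can be created."""
    with tempfile.TemporaryDirectory() as tmp:
        # Build the path entirely from bytes so that no decoding is involved.
        path = os.path.join(os.fsencode(tmp), name)
        try:
            with open(path, "wb"):
                pass
        except (OSError, ValueError):
            # Rejected by the OS or by Python's own path conversion.
            return False
        return os.path.exists(path)


# Illustrative name only; the real lockfile name is not known from the logs.
print(accepts_bytes_filename(b"\xc3master.lock"))
```

If this prints False on a given platform, passing bytes alone would not be enough there.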

b832b5d  maybe_string dealing with non-unicode strings
af8e73d  Unit test for non unicode branch name
5c11ece  test using pygit2 for branch creation instead of subprocess call, but fails because of surrogates not allowed
3256813  reverting to subprocess using str not bytes

wschuell added 3 commits November 27, 2024 22:08

e6a35c5  update test with cloning of existing repo rather than creating new branch
44ef360  Typo in shutil call
c541064  Using surrogateescape for the caught git error message as well
