tests fail non-deterministically (Error: error retrieving size of image) #217

Closed
deeplow opened this issue Sep 27, 2022 · 12 comments · Fixed by #288

Comments

@deeplow
Contributor

deeplow commented Sep 27, 2022

See example in: https://app.circleci.com/pipelines/github/freedomofpress/dangerzone/443

This happens sometimes when I run make test. It only started happening when I started doing tests in parallel (pytest -n 4).
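For reference, the parallelism here comes from pytest-xdist's -n flag; a minimal sketch of the parallel invocation (the install step and the exact test path are assumptions, not taken from the Makefile):

# hypothetical local reproduction of the parallel test run, using pytest-xdist
python -m pip install pytest-xdist
python -m pytest -n 4 tests/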

Podman fails with the following error:

E       AssertionError: assert 1 == 0
E        +  where 1 = <Result CalledProcessError(125, ['/usr/bin/podman', 'image', 'list', '--format', '{{.ID}}', 'dangerzone.rocks/dangerzone'])>.exit_code

tests/test_cli.py:65: AssertionError
----------------------------- Captured stderr call -----------------------------
Error: error retrieving size of image "a19b6c87d9a7e435b0ee80a86453ea9d9adbb211d151404db8baffe43dd531ea": you may need to remove the image to resolve the error: size/digest of layer with ID "652e014cf40878b225ca19dec247b7b1f82aea56f3cfb63a98058fd0dd8b1741" could not be calculated: lgetxattr /home/circleci/.local/share/containers/storage/overlay/d54f9e57784c8d6f6c3e15926b3460e5cbb24bdcd527c4ac13310f05e3dc6c9e/merged/usr/share/mime/application/vnd.coffeescript.xml: no such file or directory
@apyrgio
Contributor

apyrgio commented Oct 27, 2022

It happens regularly on my local machine as well. I usually rerun the tests, but it's a pain for the CI. Good to know we have an issue for this.

@deeplow
Contributor Author

deeplow commented Oct 27, 2022

This only started happening when I added concurrent tests in the CI. Running them sequentially doesn't seem to trigger it.

@apyrgio
Contributor

apyrgio commented Oct 27, 2022

Makes sense. I wonder if this is something that may bite users once we introduce parallel conversions. If a conversion thread is using podman run, and another conversion thread attempts to do the same, we may hit the same bug. Has it ever occurred in your tests?

@deeplow
Contributor Author

deeplow commented Oct 27, 2022

I wondered the same. But that has never happened to me.

@apyrgio apyrgio added this to the 0.5.0 milestone Nov 7, 2022
@apyrgio
Contributor

apyrgio commented Nov 7, 2022

I'm adding this issue to the 0.5.0 milestone because it hurts development velocity and is probably easy to fix. We may choose to drop it, of course, but it's worth investigating.

@deeplow
Contributor Author

deeplow commented Dec 7, 2022

Running the command /usr/bin/podman image list --format {{.ID}} dangerzone.rocks/dangerzone in parallel reproduces the issue:

yes 'dangerzone.rocks/dangerzone' | head -n 8 | xargs -n 1 -P 8 /usr/bin/podman image list --format {{.ID}} 
f50a7c432284
Error: error retrieving size of image "f50a7c432284e0078af648d224f8e0e4f93dd17ce2301be21883d047a2098c2e": you may need to remove the image to resolve the error: size/digest of layer with ID "709197c1057b238d91d21b7bcae11f9a0a8371aea6222b7c01b38c840a8f71c0" could not be calculated: open /home/user/.local/share/containers/storage/overlay/7d0c9bcaec239ff821c5bcc4a7776b4b79423196a1813c4ccdcb0da2e6bfbe9c/merged/usr/share/icons/hicolor/512x512/stock/net: no such file or directory
Error: error retrieving size of image "f50a7c432284e0078af648d224f8e0e4f93dd17ce2301be21883d047a2098c2e": you may need to remove the image to resolve the error: size/digest of layer with ID "709197c1057b238d91d21b7bcae11f9a0a8371aea6222b7c01b38c840a8f71c0" could not be calculated: open /home/user/.local/share/containers/storage/overlay/709197c1057b238d91d21b7bcae11f9a0a8371aea6222b7c01b38c840a8f71c0/merged/usr/share/mime/video: no such file or directory
Error: error retrieving size of image "f50a7c432284e0078af648d224f8e0e4f93dd17ce2301be21883d047a2098c2e": you may need to remove the image to resolve the error: size/digest of layer with ID "c51cab76847b0ecb465b85c303374244bf77a03b200a45ca6ac1d57d49107c97" could not be calculated: lgetxattr /home/user/.local/share/containers/storage/overlay/c51cab76847b0ecb465b85c303374244bf77a03b200a45ca6ac1d57d49107c97/merged/usr/share/gtk-3.0/emoji/de.gresource: no such file or directory
Error: error retrieving size of image "f50a7c432284e0078af648d224f8e0e4f93dd17ce2301be21883d047a2098c2e": you may need to remove the image to resolve the error: size/digest of layer with ID "c51cab76847b0ecb465b85c303374244bf77a03b200a45ca6ac1d57d49107c97" could not be calculated: open /home/user/.local/share/containers/storage/overlay/c51cab76847b0ecb465b85c303374244bf77a03b200a45ca6ac1d57d49107c97/merged/usr/share/icons/hicolor/512x512/stock/table: no such file or directory
f50a7c432284
f50a7c432284
f50a7c432284

@apyrgio
Contributor

apyrgio commented Dec 7, 2022

Quick question: what's the Podman version you tested? I'm asking because while I have encountered this on Ubuntu 20.04, I haven't encountered it in Fedora or later Ubuntu versions, so it may be something that's fixed. If this is the case, we can run the tests sequentially in older versions, and in parallel for newer ones.

@deeplow
Contributor Author

deeplow commented Dec 7, 2022

what's the Podman version you tested?

Podman version 3.4.7 on Fedora. But the same happens on 4.3.1 and on the tip of the master branch (4.4.0-dev).

@deeplow
Contributor Author

deeplow commented Dec 8, 2022

I'm adding this issue to the 0.5.0 milestone because it hurts development velocity and is probably easy to fix. We may choose to drop it, of course, but it's worth investigating.

I was taking a look at this, as the issue is quite frustrating. The reason this happens appears to be a race condition when calculating the image's digests when multiple layers are involved (and no, the result is not cached). The issue is similar to containers/podman#12582.

What solved it

I moved from podman v3.4.7 to v4.3.1 and removed ~/.local/share/containers. After that, the reproducer command no longer showed the bug.
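A minimal sketch of those recovery steps, assuming podman is upgraded through the package manager (the dnf command is an assumption; the storage path, install script, and reproducer command are taken from this thread):

# upgrade podman (dnf shown as an example; use your distribution's package manager)
sudo dnf upgrade podman
podman --version
# wipe the possibly-corrupted local storage and reinstall the container image
rm -rf ~/.local/share/containers
./dev_scripts/dangerzone-install
# re-run the reproducer; it should no longer print "error retrieving size of image"
yes 'dangerzone.rocks/dangerzone' | head -n 8 | xargs -n 1 -P 8 /usr/bin/podman image list --format {{.ID}}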

@deeplow
Contributor Author

deeplow commented Dec 12, 2022

Finding the first working version

bash script (in case this ever needs to be run again)
setup_instructions="\n
Setup Instructions\n
------------------\n
  \n
  1. clone podman and its dependencies into ~/podman/ following the guide\n in https://podman.io/getting-started/installation#installing-development-versions-of-podman\n\n
  2. run this script\n
"

if [ ! -d ~/podman/podman ]; then
  # Abort if the podman source code has not been cloned yet
  echo "ERROR: Podman source code not cloned"
  echo -e "$setup_instructions"
  exit 1
fi

versions=(v4.0.2 v4.0.3 v4.1.0 v4.1.1 v4.2.0 v4.2.1 v4.3.0 v4.3.1)

for v in "${versions[@]}"; do
    echo -e "\n\n"
    echo "============================================"
    echo "      TESTING NOW PODMAN VERSION $v"
    echo "============================================"

    # clean podman state
    sudo rm ~/.local/share/containers -rf

    # build podman
    cd /home/user/podman/podman
    git checkout $v
    make BUILDTAGS="seccomp"
    sudo make install PREFIX=/usr
    podman --version

    # test dangerzone
    cd /home/user/dangerzone

    # install container
    ./dev_scripts/dangerzone-install

    # check that running podman in parallel works before the tests
    time yes 'dangerzone.rocks/dangerzone' | head -n 16 | xargs -n 1 -P 16 /usr/bin/podman images --format {{.ID}}

    # run the tests in parallel to try to corrupt the podman storage
    make test

    # check whether the parallel command still fails afterwards
    time yes 'dangerzone.rocks/dangerzone' | head -n 16 | xargs -n 1 -P 16 /usr/bin/podman images --format {{.ID}}

done

After running the script, look at the output of the second yes command. If all went well, you should see no errors on the screen. If it failed, look for errors of the following two kinds.

ERRO[0004] Can not stat "/home/user/.local/share/containers/storage/overlay/957f5231b23193389e6cbafba9f2803f47d7acd0d6508f2c562f7fe553a1af4c/merged/etc/shadow": lstat /home/user/.local/share/containers/storage/overlay/957f5231b23193389e6cbafba9f2803f47d7acd0d6508f2c562f7fe553a1af4c/merged/etc/shadow: no such file or directory 
Error: error retrieving size of image "a0a475abfdb7a0a39e6b84b1ef76cc3031d06ebd6755e192f71d25dd55a86809": you may need to remove the image to resolve the error: size/digest of layer with ID "a0e56104ca9dc752d2568c5f501546fea06253fbeb3f64022be2870d619086b9" could not be calculated: llistxattr /home/user/.local/share/containers/storage/overlay/a0e56104ca9dc752d2568c5f501546fea06253fbeb3f64022be2870d619086b9/merged/var/cache/fontconfig/3830d5c3ddfd5cd38a049b759396e72e-le64.cache-8: no such file or directory

deeplow added a commit that referenced this issue Dec 12, 2022
The automated tests would sometimes fail when running
"podman images --format {{.ID}}". It turns out that in versions
prior to podman 4.3.0, podman volumes (stored in
~/.local/share/containers) would get corrupted when multiple tests were
run in parallel.

The current solution is to wrap the test command to run sequentially in
versions prior to the fix and in parallel for versions after that.

Fixes #217
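
A rough sketch of how such a wrapper could pick between the two modes (the real wrapper lives in the commit above; the sort -V version comparison and the pytest invocations below are illustrative assumptions):

#!/bin/bash
# hypothetical wrapper: parallel tests only on podman >= 4.3.0, sequential otherwise
podman_version=$(podman --version | awk '{print $3}')
if [ "$(printf '%s\n' 4.3.0 "$podman_version" | sort -V | head -n 1)" = "4.3.0" ]; then
    # podman >= 4.3.0: the storage race is fixed, so parallel tests are safe
    python -m pytest -n 4 tests/
else
    # older podman: run the tests sequentially to avoid corrupting the storage
    python -m pytest tests/
fi
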
deeplow added a commit that referenced this issue Dec 12, 2022
@apyrgio apyrgio modified the milestones: 0.5.0, 0.4.1 Dec 13, 2022
@apyrgio apyrgio assigned apyrgio and deeplow and unassigned apyrgio Dec 14, 2022
deeplow added a commit that referenced this issue Jan 2, 2023
@deeplow deeplow closed this as completed in 84b8212 Jan 9, 2023
deeplow added a commit that referenced this issue Jul 5, 2023
Parallel tests had given us issues in the past [1]. This time, they
weren't playing well with pytest-qt. One hypothesis is that Qt
application components run as singletons and don't play well when there
are two instances.

The symptom we were experiencing was infinite recursion and removing
pytest-xdist solved the issue.

[1]: #217
[2]: https://github.com/freedomofpress/dangerzone/actions/runs/5244389012/jobs/9470323475?pr=439
deeplow added a commit that referenced this issue Jul 18, 2023
apyrgio added a commit that referenced this issue Jul 24, 2023
Run tests sequentially, because in subsequent commits we will add
Qt tests that do not play nice when `pytest` creates new processes [1].

Also, remove the pytest wrapper, whose main task was to decide if tests
can run in parallel [2].

[1]: https://bugreports.qt.io/projects/PYSIDE/issues/PYSIDE-2393
[2]: #217
apyrgio added a commit that referenced this issue Jul 24, 2023
apyrgio added a commit that referenced this issue Jul 24, 2023
deeplow added a commit that referenced this issue Jul 25, 2023
deeplow added a commit that referenced this issue Jul 25, 2023