tests fail non-deterministically (Error: error retrieving size of image) #217

Closed
deeplow opened this issue Sep 27, 2022 · 12 comments · Fixed by #288

Comments

@deeplow
Contributor

deeplow commented Sep 27, 2022

See example in: https://app.circleci.com/pipelines/github/freedomofpress/dangerzone/443

This happens sometimes when I run make test. It only started happening when I started doing tests in parallel (pytest -n 4).
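For reference, the parallelism here comes from pytest-xdist's -n flag; a minimal sketch of the parallel invocation (the install step and the exact test path are assumptions, not taken from the Makefile):

# hypothetical local reproduction of the parallel test run, using pytest-xdist
python -m pip install pytest-xdist
python -m pytest -n 4 tests/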

Podman fails with the following error:

E       AssertionError: assert 1 == 0
E        +  where 1 = <Result CalledProcessError(125, ['/usr/bin/podman', 'image', 'list', '--format', '{{.ID}}', 'dangerzone.rocks/dangerzone'])>.exit_code

tests/test_cli.py:65: AssertionError
----------------------------- Captured stderr call -----------------------------
Error: error retrieving size of image "a19b6c87d9a7e435b0ee80a86453ea9d9adbb211d151404db8baffe43dd531ea": you may need to remove the image to resolve the error: size/digest of layer with ID "652e014cf40878b225ca19dec247b7b1f82aea56f3cfb63a98058fd0dd8b1741" could not be calculated: lgetxattr /home/circleci/.local/share/containers/storage/overlay/d54f9e57784c8d6f6c3e15926b3460e5cbb24bdcd527c4ac13310f05e3dc6c9e/merged/usr/share/mime/application/vnd.coffeescript.xml: no such file or directory
@apyrgio
Contributor

apyrgio commented Oct 27, 2022

It happens regularly on my local machine as well. I usually rerun the tests, but it's a pain for the CI. Good to know we have an issue for this.

@deeplow
Contributor Author

deeplow commented Oct 27, 2022

This only started happening when I added concurrent tests in the CI. Running them sequentially doesn't seem to trigger it.

@apyrgio
Contributor

apyrgio commented Oct 27, 2022

Makes sense. I wonder if this is something that may bite users once we introduce parallel conversions. If a conversion thread is using podman run, and another conversion thread attempts to do the same, we may hit the same bug. Has it ever occurred in your tests?

@deeplow
Contributor Author

deeplow commented Oct 27, 2022

I wondered the same. But that has never happened to me.

@apyrgio apyrgio added this to the 0.5.0 milestone Nov 7, 2022
@apyrgio
Contributor

apyrgio commented Nov 7, 2022

I'm adding this issue to the 0.5.0 milestone because it hurts development velocity and is probably easy to fix. We may choose to drop it, of course, but it's worth investigating.

@deeplow
Contributor Author

deeplow commented Dec 7, 2022

Running the command /usr/bin/podman image list --format {{.ID}} dangerzone.rocks/dangerzone in parallel reproduces the issue:

yes 'dangerzone.rocks/dangerzone' | head -n 8 | xargs -n 1 -P 8 /usr/bin/podman image list --format {{.ID}} 
f50a7c432284
Error: error retrieving size of image "f50a7c432284e0078af648d224f8e0e4f93dd17ce2301be21883d047a2098c2e": you may need to remove the image to resolve the error: size/digest of layer with ID "709197c1057b238d91d21b7bcae11f9a0a8371aea6222b7c01b38c840a8f71c0" could not be calculated: open /home/user/.local/share/containers/storage/overlay/7d0c9bcaec239ff821c5bcc4a7776b4b79423196a1813c4ccdcb0da2e6bfbe9c/merged/usr/share/icons/hicolor/512x512/stock/net: no such file or directory
Error: error retrieving size of image "f50a7c432284e0078af648d224f8e0e4f93dd17ce2301be21883d047a2098c2e": you may need to remove the image to resolve the error: size/digest of layer with ID "709197c1057b238d91d21b7bcae11f9a0a8371aea6222b7c01b38c840a8f71c0" could not be calculated: open /home/user/.local/share/containers/storage/overlay/709197c1057b238d91d21b7bcae11f9a0a8371aea6222b7c01b38c840a8f71c0/merged/usr/share/mime/video: no such file or directory
Error: error retrieving size of image "f50a7c432284e0078af648d224f8e0e4f93dd17ce2301be21883d047a2098c2e": you may need to remove the image to resolve the error: size/digest of layer with ID "c51cab76847b0ecb465b85c303374244bf77a03b200a45ca6ac1d57d49107c97" could not be calculated: lgetxattr /home/user/.local/share/containers/storage/overlay/c51cab76847b0ecb465b85c303374244bf77a03b200a45ca6ac1d57d49107c97/merged/usr/share/gtk-3.0/emoji/de.gresource: no such file or directory
Error: error retrieving size of image "f50a7c432284e0078af648d224f8e0e4f93dd17ce2301be21883d047a2098c2e": you may need to remove the image to resolve the error: size/digest of layer with ID "c51cab76847b0ecb465b85c303374244bf77a03b200a45ca6ac1d57d49107c97" could not be calculated: open /home/user/.local/share/containers/storage/overlay/c51cab76847b0ecb465b85c303374244bf77a03b200a45ca6ac1d57d49107c97/merged/usr/share/icons/hicolor/512x512/stock/table: no such file or directory
f50a7c432284
f50a7c432284
f50a7c432284

@apyrgio
Contributor

apyrgio commented Dec 7, 2022

Quick question: what's the Podman version you tested? I'm asking because while I have encountered this on Ubuntu 20.04, I haven't encountered it in Fedora or later Ubuntu versions, so it may be something that's fixed. If this is the case, we can run the tests sequentially in older versions, and in parallel for newer ones.

@deeplow
Contributor Author

deeplow commented Dec 7, 2022

what's the Podman version you tested?

Podman version 3.4.7 on Fedora. But the same happens on 4.3.1 and on the tip of the master branch (4.4.0-dev).

@deeplow
Contributor Author

deeplow commented Dec 8, 2022

I'm adding this issue to the 0.5.0 milestone because it hurts development velocity and is probably easy to fix. We may choose to drop it, of course, but it's worth investigating.

I was taking a look at this, as the issue is quite frustrating. The reason this happens appears to be a race condition when calculating the image's digests when multiple layers are involved (and no, the result is not cached). The issue is similar to containers/podman#12582.

What solved it

I moved from podman v3.4.7 to v4.3.1 and removed ~/.local/share/containers. After that, the reproducer command no longer showed the bug.
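A minimal sketch of those recovery steps, assuming podman is upgraded through the package manager (the dnf command is an assumption; the storage path, install script, and reproducer command are taken from this thread):

# upgrade podman (dnf shown as an example; use your distribution's package manager)
sudo dnf upgrade podman
podman --version
# wipe the possibly-corrupted local storage and reinstall the container image
rm -rf ~/.local/share/containers
./dev_scripts/dangerzone-install
# re-run the reproducer; it should no longer print "error retrieving size of image"
yes 'dangerzone.rocks/dangerzone' | head -n 8 | xargs -n 1 -P 8 /usr/bin/podman image list --format {{.ID}}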

@deeplow
Contributor Author

deeplow commented Dec 12, 2022

Finding the first working version

bash script (in case this ever needs to be run again)
setup_instructions="\n
Setup Instructions\n
------------------\n
  \n
  1. clone podman and its dependencies into ~/podman/ following the guide\n in https://podman.io/getting-started/installation#installing-development-versions-of-podman\n\n
  2. run this script\n
"

if [ ! -d ~/podman/podman ]; then
  # Abort if the podman source code has not been cloned yet
  echo "ERROR: Podman source code not cloned"
  echo -e "$setup_instructions"
  exit 1
fi

versions=(v4.0.2 v4.0.3 v4.1.0 v4.1.1 v4.2.0 v4.2.1 v4.3.0 v4.3.1)

for v in "${versions[@]}"; do
    echo -e "\n\n"
    echo "============================================"
    echo "      TESTING NOW PODMAN VERSION $v"
    echo "============================================"

    # clean podman state
    sudo rm ~/.local/share/containers -rf

    # build podman
    cd /home/user/podman/podman
    git checkout $v
    make BUILDTAGS="seccomp"
    sudo make install PREFIX=/usr
    podman --version

    # test dangerzone
    cd /home/user/dangerzone

    # install container
    ./dev_scripts/dangerzone-install

    # check that running podman in parallel works before the tests
    time yes 'dangerzone.rocks/dangerzone' | head -n 16 | xargs -n 1 -P 16 /usr/bin/podman images --format {{.ID}}

    # run the tests in parallel to try to corrupt the podman storage
    make test

    # check whether the parallel command still fails afterwards
    time yes 'dangerzone.rocks/dangerzone' | head -n 16 | xargs -n 1 -P 16 /usr/bin/podman images --format {{.ID}}

done

After running the script, look at the output of the second yes command. If all went well, you should see no errors on the screen. If it failed, look for errors of the following two kinds.

ERRO[0004] Can not stat "/home/user/.local/share/containers/storage/overlay/957f5231b23193389e6cbafba9f2803f47d7acd0d6508f2c562f7fe553a1af4c/merged/etc/shadow": lstat /home/user/.local/share/containers/storage/overlay/957f5231b23193389e6cbafba9f2803f47d7acd0d6508f2c562f7fe553a1af4c/merged/etc/shadow: no such file or directory 
Error: error retrieving size of image "a0a475abfdb7a0a39e6b84b1ef76cc3031d06ebd6755e192f71d25dd55a86809": you may need to remove the image to resolve the error: size/digest of layer with ID "a0e56104ca9dc752d2568c5f501546fea06253fbeb3f64022be2870d619086b9" could not be calculated: llistxattr /home/user/.local/share/containers/storage/overlay/a0e56104ca9dc752d2568c5f501546fea06253fbeb3f64022be2870d619086b9/merged/var/cache/fontconfig/3830d5c3ddfd5cd38a049b759396e72e-le64.cache-8: no such file or directory

deeplow added a commit that referenced this issue Dec 12, 2022
The automated tests would sometimes fail when running
"podman images --format {{.ID}}". It turns out that in versions
prior to podman 4.3.0, podman volumes (stored in
~/.local/share/containers) would get corrupted when multiple tests were
run in parallel.

The current solution is to wrap the test command to run sequentially in
versions prior to the fix and in parallel for versions after that.

Fixes #217
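
A rough sketch of how such a wrapper could pick between the two modes (the real wrapper lives in the commit above; the sort -V version comparison and the pytest invocations below are illustrative assumptions):

#!/bin/bash
# hypothetical wrapper: parallel tests only on podman >= 4.3.0, sequential otherwise
podman_version=$(podman --version | awk '{print $3}')
if [ "$(printf '%s\n' 4.3.0 "$podman_version" | sort -V | head -n 1)" = "4.3.0" ]; then
    # podman >= 4.3.0: the storage race is fixed, so parallel tests are safe
    python -m pytest -n 4 tests/
else
    # older podman: run the tests sequentially to avoid corrupting the storage
    python -m pytest tests/
fi
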
deeplow added a commit that referenced this issue Dec 12, 2022
@apyrgio apyrgio modified the milestones: 0.5.0, 0.4.1 Dec 13, 2022
@apyrgio apyrgio assigned apyrgio and deeplow and unassigned apyrgio Dec 14, 2022
deeplow added a commit that referenced this issue Jan 2, 2023
@deeplow deeplow closed this as completed in 84b8212 Jan 9, 2023
deeplow added a commit that referenced this issue Jul 5, 2023
Parallel tests had given us issues in the past [1]. This time, they
weren't playing well with pytest-qt. One hypothesis is that Qt
application components run as singletons and don't play well when there
are two instances.

The symptom we were experiencing was infinite recursion and removing
pytest-xdist solved the issue.

[1]: #217
[2]: https://github.com/freedomofpress/dangerzone/actions/runs/5244389012/jobs/9470323475?pr=439
deeplow added a commit that referenced this issue Jul 18, 2023
apyrgio added a commit that referenced this issue Jul 24, 2023
Run tests sequentially, because in subsequent commits we will add
Qt tests that do not play nice when `pytest` creates new processes [1].

Also, remove the pytest wrapper, whose main task was to decide if tests
can run in parallel [2].

[1]: https://bugreports.qt.io/projects/PYSIDE/issues/PYSIDE-2393
[2]: #217
apyrgio added a commit that referenced this issue Jul 24, 2023
apyrgio added a commit that referenced this issue Jul 24, 2023
deeplow added a commit that referenced this issue Jul 25, 2023
deeplow added a commit that referenced this issue Jul 25, 2023