Comments (3)
Thanks @10upsimon, this sounds like a tricky one to diagnose so nice work on the R&D so far.
TL;DR: Let's try what you've suggested for now, see how we get on, and refine things/diagnose further in a separate issue.
As mentioned, nice work so far. The fact you've seen an apparent improvement makes me think we should take the pragmatic option of applying these changes in the hope we can immediately improve our CI reliability and thus save time in the development lifecycle.
That said, digging into the specific changes, it's a bit of an assortment of seemingly unconnected tweaks. Did you do enough testing of each in isolation to be sure they are all having an effect? At present it's not clear to me whether each of them is strictly necessary, and it would be nice to be more certain that we are only applying changes with a quantifiable effect. For example, I'm not sure the GH Action VM actually has a GPU for `--disable-gpu` to be relevant (unless the flag is introduced for some side effect).
The results are also a bit unquantified in terms of how much of a difference it makes. It's good you've seen an apparent positive change, but we could standardise our testing methods for some clearer results.
What I'd suggest, as mentioned, is to apply these changes for now, but introduce a follow-up issue where we do something along these lines:
- Create a script to run a GH action repeatedly and capture the results (success/failure at a minimum, and a deeper capture of logs etc would be nice).
- Fork the repo to minimise disruption to regular development.
- Using our script, run the VRT action on the fork for a set period of time, e.g. 1 day. Analyze the results for frequency and distribution of success/failures and potentially dig in for deeper analysis.
- Repeat the above for various permutations to help to quantify which are having an effect.
- Consider logging additional details - does Backstop have a verbose/debug logging mode? Or Docker etc? It could also be useful to log additional system resource usage e.g. memory, file handles.
- I'd also try a newer version of the Alpine base image as well as other distros as you've suggested.
With the above testing framework in place we'd be at liberty to test ideas that occur to us in a relatively time-efficient manner. The scripts that come out of it could be useful for future GH action debugging too.
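As a rough illustration of the analysis step, here is a minimal sketch in Python. It assumes each captured run is a record with a `conclusion` field (`"success"` or `"failure"`) and a `variant` label naming the permutation under test; both field names are hypothetical and would depend on how the capture script stores its results.

```python
from collections import defaultdict


def summarize_runs(runs):
    """Return {variant: (successes, total, success_rate)} for finished runs.

    `runs` is an iterable of dicts with hypothetical keys "variant" and
    "conclusion"; anything other than success/failure is ignored.
    """
    counts = defaultdict(lambda: [0, 0])  # variant -> [successes, total]
    for run in runs:
        conclusion = run.get("conclusion")
        if conclusion not in ("success", "failure"):
            continue  # skip cancelled or still-running workflows
        tally = counts[run["variant"]]
        tally[1] += 1
        if conclusion == "success":
            tally[0] += 1
    return {v: (ok, total, ok / total) for v, (ok, total) in counts.items()}
```

Comparing the success rate per variant over the same time window would give a first indication of which changes are actually having an effect.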
How does that sound to you?
from site-kit-wp.
Not seeing what the cause could be. This seems to be CI related, since it doesn't fail locally. Unassigning myself.
@techanvil pinging you here after many days of testing, tweaking, testing some more (because you are the issue creator and may have more insight).
First off, there are not a ton of documented cases of this happening online. For the cases that do arise, the solutions vary, but the common suggestions and/or reasons include:
- Needing to manually install chromium as a package within the OS, manually via the Dockerfile. We already do this.
- Manually adding the `openrc` and `dbus` APK packages within the Alpine Linux powered OS, and adding it as a runtime service with `rc-update`. I have done this via the linked POC PR.
- Removing the headless mode of Backstop, but I did not think that was feasible in this case.
- Adding the following flags to Backstop config: `--no-sandbox`, `--disable-gpu`, `--disable-setuid-sandbox`, `--no-zygote`. We already had `--no-sandbox` defined in our config; I added the other flags as part of the linked PR.
Another topic that came up a few times was that of disk availability. To test this theory and whether it was indeed a disk space issue, I updated the `visual-regression.yml` config to include steps that check disk space usage. In my testing, we seldom exceeded 69-72%, so we can rule that out entirely.
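The workflow steps themselves are not reproduced here. As a rough sketch of the same kind of check, assuming Python is available on the runner, the used percentage of a filesystem can be read with the standard library (the function name and default path are illustrative only):

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    print(f"Disk usage: {disk_usage_percent():.1f}%")
```

Logging this value before and after the Backstop step would show whether usage climbs during a run.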
When I made the above changes, I saw an instant reduction in the occurrence of this error. In fact, for many runs and re-runs, I simply was not able to observe the error again. Unfortunately, it did return every now and then, but few and far between (I ran a large number of test runs).
What I did observe, however, is that when I ran the VRT workflow before around 9am my time, which I'd argue is when most of the US and EU tech industries are still relatively inactive compared to business hours, I had a 100% success rate. This could simply be a coincidence, but the later in the day that I (re)ran the workflow, the more likely I was to encounter the error again, although arguably still less frequently than before.
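One way to check whether this time-of-day pattern holds up would be to bucket run outcomes by the hour they started. A minimal sketch, assuming each run is recorded with a `conclusion` and an ISO 8601 `createdAt` timestamp in UTC (the record shape is an assumption, not something captured in my testing):

```python
from collections import defaultdict
from datetime import datetime


def success_rate_by_hour(runs):
    """Return {utc_hour: success_rate} for runs that finished as success/failure."""
    buckets = defaultdict(lambda: [0, 0])  # hour -> [successes, total]
    for run in runs:
        if run["conclusion"] not in ("success", "failure"):
            continue  # ignore cancelled or in-progress runs
        # Replace a trailing "Z" so fromisoformat() also works before Python 3.11.
        started = datetime.fromisoformat(run["createdAt"].replace("Z", "+00:00"))
        bucket = buckets[started.hour]
        bucket[1] += 1
        if run["conclusion"] == "success":
            bucket[0] += 1
    return {hour: ok / total for hour, (ok, total) in sorted(buckets.items())}
```

With enough runs spread across the day, a clear drop in the rate during business hours would support the shared-resource theory; a flat distribution would suggest the early-morning successes were a coincidence.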
Using the same Linux config in local Docker based testing, I was never able to reproduce the issue, but this too could be a coincidence.
This leads me to conclude, admittedly without much supporting evidence, that this is likely a GitHub resource availability or usage issue, as the workers are shared.
My suggestion would be to apply the following changes and see if we see a general decrease in the occurrence:
- Update the Backstop Dockerfile to include manual installation of the `openrc` and `dbus` APK packages
- Add the following flags to Backstop config: `--no-sandbox`, `--disable-gpu`, `--disable-setuid-sandbox`, `--no-zygote`
Beyond that, I would not discount the fact that maybe this is only happening on the Alpine Linux image? We could try with another distro to test this theory.
Thoughts?