Debugging the request submission pipeline
When a merge request gets merged, a folder is added to the main branch. The backend then submits the jobs on DIRAC, makes an issue in the repository, signs the request as a convener, and removes the folder from main. If a folder remains in the main branch for over 10 minutes, it is typically a sign that something has failed in the background and needs to be debugged.
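The 10-minute rule of thumb above can be sketched as a small check. This is only an illustration: the function name `stale_folders` and its input (a mapping from folder name to merge time) are hypothetical, and how those timestamps are obtained from the repository is deliberately left out.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the text above: a folder lingering on main
# for longer than this usually means a background step has failed.
STALE_AFTER = timedelta(minutes=10)


def stale_folders(merged_at, now, stale_after=STALE_AFTER):
    """Return the names of request folders that have been on main too long.

    merged_at: dict of folder name -> timezone-aware merge time
               (hypothetical input, not something the bot provides).
    now:       timezone-aware reference time.
    """
    return sorted(
        name for name, merged in merged_at.items() if now - merged > stale_after
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    example = {
        "request-a": now - timedelta(minutes=3),   # still within the normal window
        "request-b": now - timedelta(minutes=25),  # worth investigating
    }
    print(stale_folders(example, now))  # ['request-b']
```

A folder flagged this way is only a hint that something went wrong; the sections below describe how to find out which step failed.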
Possible things that could go wrong include:
author GitLab username doesn’t match DIRAC username
author proxy not valid
merging user’s proxy not valid
webhook or OpenShift failure
Luckily, there is a bot feature on MC Requests that can automatically post a message on some errors, and that accepts several commands (cf. /ci-test in Core Software) to perform a couple of actions. These can be used to debug and recover from the issues mentioned above.
It should be noted that while anyone can see the messages the bot posts on the repository, the slash commands are only available to those in the Expert Group as determined in the Simulation GitLab repository. Anyone else is not authorised and the bot will reply as such.
Merging user’s proxy not valid
This failure should happen early on while the system is doing the pre-submission checks. It should also be easily caught as the bot will reply in the merge request if it can’t find a proxy.
The fix for this problem is for the user responsible to get a valid proxy and add it via lhcb-proxy-init. The user will then have to ping the simulation experts, preferably in the MC Liaisons channel, to ask someone to get this fixed.
The expert would then typically have to make a new comment with /retry-singing, after which the system will try to sign the request and clean the repository in one go. Note that it is possible to run only the signing step with /retry-singing --no-clean.
However, if the expert decides to cancel the submission, they can run /clean-repo. This will remove the offending folder from the main branch, but will not submit the jobs to the grid.
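The expert's options in this section can be summarised as a small decision helper. This is a sketch for illustration only: the function `proxy_recovery_command` and its parameters are hypothetical, and only the command strings themselves are taken from the text above.

```python
def proxy_recovery_command(cancel_submission, clean_after_signing=True):
    """Pick the slash command an expert would comment after a proxy failure.

    cancel_submission:   True if the expert decides to abandon the submission.
    clean_after_signing: False to run only the signing step, leaving the
                         folder in place for now.
    """
    if cancel_submission:
        # Removes the folder but does not submit anything to the grid.
        return "/clean-repo"
    if clean_after_signing:
        # Signs the request and cleans the repository in one go.
        return "/retry-singing"
    # Signs the request only.
    return "/retry-singing --no-clean"


if __name__ == "__main__":
    print(proxy_recovery_command(cancel_submission=False))
    print(proxy_recovery_command(cancel_submission=False, clean_after_signing=False))
    print(proxy_recovery_command(cancel_submission=True))
```

The helper does not talk to GitLab; the returned string is what the expert would post as a comment on the merge request.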
Webhook or OpenShift failure
There is a slim chance that the webhook was not received, in which case the system will not do anything. Alternatively, OpenShift is quite aggressive about killing jobs that exceed their resource allocation. Depending on when this happens, some or even all of the requests may not be submitted. This is usually accompanied by the issue not being created, even if all the requests were successfully submitted.
Either failure leaves the merge request in an unknown state, with some of the requests possibly not submitted or the issue not generated.
In this case, use the command /submit-failed. The system will then check whether the database is up to date, check which requests have not been submitted, and advise on the next steps. The bot will reply with the action to take next, which usually takes one of two forms.
Run /submit-failed --do-for-real
In this case, the system has determined that all the requests have been successfully submitted on DIRAC, so the bot now only needs to update the database, clean the repository, and make the issue. Running the above command will trigger these steps.
Run /submit-failed --do-for-real --allow-submit
In this case, the system has determined that some of the requests weren’t submitted to DIRAC. Carefully check the jobs in DIRAC against the information printed by the bot to see which have and have not been submitted. If you are satisfied, then run the above command to trigger DIRAC submission as well as the usual database update, repository cleaning, and issue making.
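The two follow-up forms can be written down as a small decision helper, which also makes the manual DIRAC check explicit. This is a sketch under assumptions: the function `next_submit_failed_command` and its parameters are hypothetical, the bot's own reply remains the authoritative source for which command to run, and only the command strings come from the text above.

```python
def next_submit_failed_command(all_submitted_on_dirac, dirac_checked_by_expert=False):
    """Return the follow-up command after a dry-run /submit-failed, or None.

    all_submitted_on_dirac:  what the bot reported about the DIRAC submissions.
    dirac_checked_by_expert: True once the expert has compared the bot's
                             report with the jobs actually present in DIRAC.
    """
    if all_submitted_on_dirac:
        # Only the bookkeeping is left: database update, repository
        # cleaning, and issue creation.
        return "/submit-failed --do-for-real"
    if not dirac_checked_by_expert:
        # Do not allow resubmission until the jobs have been checked by hand.
        return None
    # Triggers the missing DIRAC submissions plus the usual bookkeeping.
    return "/submit-failed --do-for-real --allow-submit"


if __name__ == "__main__":
    print(next_submit_failed_command(all_submitted_on_dirac=True))
    print(next_submit_failed_command(all_submitted_on_dirac=False))
    print(next_submit_failed_command(False, dirac_checked_by_expert=True))
```

Returning `None` in the unchecked case is a design choice of this sketch: it keeps the `--allow-submit` form behind the manual verification step, since submitting requests that already exist in DIRAC would duplicate them.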