Debugging the request submission pipeline

When a merge request is merged, a folder is added to the main branch. The backend then submits the jobs on DIRAC, creates an issue in the repository, signs the request as a convener, and removes the folder from main. If a folder remains in the main branch for more than 10 minutes, it is typically a sign that something has failed in the background and needs to be debugged.

Possible things that could go wrong include:

  • author GitLab username doesn’t match DIRAC username

  • author proxy not valid

  • merging user’s proxy not valid

  • webhook or OpenShift failure

Fortunately, the MC Requests repository has a bot that automatically posts a message on some errors and accepts several slash commands (cf. /ci-test in Core Software) to perform a couple of actions. These can be used to debug and recover from the issues mentioned above.

Note that while anyone can see the messages the bot posts on the repository, the slash commands are only available to members of the Expert Group as defined in the Simulation GitLab repository. Anyone else is not authorised, and the bot will reply to say so.

Author GitLab username doesn’t match DIRAC username

This failure should happen in the middle of testing, not when the request is submitted. This should be easily caught as the bot will message in the merge request saying that no valid DIRAC mapping exists for the username.

The fix for this problem would be for the author to open a ticket with the CERN git service to ask for the GitLab username to be changed to match the CERN username. This can be done on the service portal. Once fixed, triggering a new pipeline should work as normal.

Author proxy not valid

This failure will also happen in the middle of testing, and should also be caught by the bot, which will post a message in the merge request saying that the author proxy cannot be found.

The fix for this problem would be for the author to get a valid proxy and add it via lhcb-proxy-init. Once fixed, triggering a new pipeline should work as normal.

Merging user’s proxy not valid

This failure should happen early on while the system is doing the pre-submission checks. It should also be easily caught as the bot will reply in the merge request if it can’t find a proxy.

The fix for this problem would be for the user responsible to get a valid proxy and add it via lhcb-proxy-init. The user will then have to ping the simulation experts, preferably in the MC Liaisons channel, to ask someone to resume the submission.

The expert would then typically have to post a new comment with /retry-signing, after which the system will try to sign the request and clean the repository in one go. Note that it is possible to retry only the signing step, without cleaning, by using /retry-signing --no-clean.

However, if the expert decides to cancel the submission, they can run /clean-repo. This will remove the offending folder from the main branch, but will not submit the jobs to the grid.

Webhook or OpenShift failure

There is a slim chance that the webhook was not received, in which case the system will not do anything. Alternatively, OpenShift can be quite aggressive about killing jobs that exceed their resource limits. Depending on when this happens, some or even all of the requests may not be submitted. This is usually accompanied by the issue not being created, even if all requests were successfully submitted.

The end result of either is that the failure leaves the merge request in an unknown state, with some of the requests possibly not submitted or the issue not being generated.

In this case, use the command /submit-failed. The system will then check whether the database is up to date, check which requests have not been submitted, and advise on the next steps. The bot will reply with the action to take next, which usually takes one of two forms.

  1. Run /submit-failed --do-for-real

In this case, the system has determined that all the requests have successfully been submitted on DIRAC, so the bot now only needs to update the database, clean the repository, and create the issue. Running the above command will trigger these steps.

  2. Run /submit-failed --do-for-real --allow-submit

In this case, the system has determined that some of the requests were not submitted to DIRAC. Carefully check the jobs in DIRAC against the information printed by the bot to see which have been submitted and which have not. If you are satisfied, run the above command to trigger the DIRAC submission as well as the usual database update, repository cleaning, and issue creation.
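The choice between the two follow-up commands can be summarised with a small sketch. This is not the bot's actual implementation, only an illustration of the rule described above; the RequestStatus record and the next_command helper are hypothetical names introduced for this example, and the bot's own reply always remains the authoritative advice.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RequestStatus:
    """Hypothetical record for one request in the merged folder."""
    name: str
    submitted_to_dirac: bool


def next_command(statuses: List[RequestStatus]) -> Tuple[str, List[str]]:
    """Return the follow-up command and the names of unsubmitted requests.

    Mirrors the two cases in the text: if every request is already on
    DIRAC, only the bookkeeping steps remain; otherwise submission must
    be explicitly allowed as well.
    """
    missing = [s.name for s in statuses if not s.submitted_to_dirac]
    if missing:
        # Some requests never reached DIRAC: check them by hand first.
        return "/submit-failed --do-for-real --allow-submit", missing
    # Everything is on DIRAC: update database, clean repo, make the issue.
    return "/submit-failed --do-for-real", missing


if __name__ == "__main__":
    example = [
        RequestStatus("request-A", True),   # made-up request names
        RequestStatus("request-B", False),
    ]
    command, missing = next_command(example)
    print(command)
    print("not yet submitted:", missing)
```

In the example, one request is missing from DIRAC, so the sketch selects the variant with --allow-submit and lists the request that should be checked by hand before running it.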