Improve the QA Workflow with Slack and Bitrise

Working as a mobile manual QA engineer is probably not the easiest thing in the world. While the job seems straightforward from the outside, depending on the organization and the complexity of the apps, getting a build with the right variant, flavor or scheme can be a challenging task, and one that often ends with asking the development team for support.

Ever since I started working as a QA Automation - and lately Platform - Engineer, one of my side activities has been trying to find a way to support and empower the manual testing team to save them the hassle of combining the correct pieces of the puzzle while getting their daily activities done.

The problem

Let’s say we have a mobile app that our team of manual QA testers has been assigned to. The app can be built in multiple variants depending on the desired environment, with different configurations and specific flags that might enable or disable certain features in the final build. Then, if development is still ongoing but our tester has been asked to take a look at it for a preliminary check, they’d need a branch to build and, of course, the destination bucket for the APK/IPA: should it be distributed via Firebase App Distribution, TestFlight or Google Play as an internal testing build?

What we had and what we actually wanted

The CI/CD we use as our daily driver is Bitrise, which features this interface to trigger a new build for a given app.

You type the branch, attach a message that will be shown as a description on the summary page, and then pick the workflow.

To have a QA engineer manually trigger a build, they would have to:

  • Ask for permission to access the app, since Bitrise allows apps to be gated behind a simple permissions system.
  • Make sure they’re building the correct branch, as there’s no dropdown menu or autocompletion. They have to type it on their own or copy-paste it, and in case of errors they don’t get immediate feedback: the build gets triggered anyway and fails later at the git clone step. If nobody manually checks that the build actually got past that step, they might come back after a while and notice they wasted time just because of a typo in the branch name!
  • Select the correct workflow: our workflows have pretty straightforward names that give a rough idea of which kind of build is being triggered in a given environment, and our release workflows require a manual judgment step (i.e. even if a release build is triggered, nothing will hit the stores until someone pushes the button). Still, it’s pretty risky to expose production-related workflows - or even worse, test workflows that might have unexpected behaviors - to any user who has access to the app.

Also, finding the required app on Bitrise might take a while, as it’s identified by its unique identifier, the app slug, rather than the name displayed on the dashboard. There’s no friendly name in the URL, so it might take a few clicks to get to the build modal on the website unless the page is saved somewhere.

Leveraging Slack bots

Having seen and also experienced the points mentioned before myself - let’s be honest, nobody can escape from misspelling a branch - in our team we figured it would be easier (and eventually safer) if builds could be triggered directly from a Slack channel. As a matter of fact, Slack has great support when it comes to automation, since its API allows for incredible integrations.

We started considering what we actually had at our disposal to build a simple MVP, mostly to explore the solution and understand if it was viable:

  • We needed to get the apps list from somewhere, with the related branches and workflows. In this regard, Bitrise offers quite an extensive API.
  • We had to choose the technology we wanted to use. At first we were more inclined towards JVM solutions, but later on we went with Express.js combined with Bolt, Slack’s official framework for building apps.
  • We wanted to use Slack modals to allow users to trigger builds. Typing an extra long command in the chat feels pretty hardcore (although support for it eventually arrived), and it’s not easy at all to remember the several parameters, so Slack’s Block Kit seemed a good fit.
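
To give an idea of what such a modal looks like on the wire, here is a minimal sketch of a Block Kit view with three dropdowns. It only builds the JSON payload; the types are hand-written for the example rather than imported from Slack's SDK, and the callback_id, block ids and labels are hypothetical, not the ones Build Bot uses.

```typescript
// Hand-written types covering only the Block Kit fields used in this sketch.
interface PlainTextObject { type: "plain_text"; text: string }
interface SelectOption { text: PlainTextObject; value: string }
interface InputBlock {
  type: "input";
  block_id: string;
  label: PlainTextObject;
  element: {
    type: "static_select";
    action_id: string;
    placeholder: PlainTextObject;
    options: SelectOption[];
  };
}
interface ModalView {
  type: "modal";
  callback_id: string;
  title: PlainTextObject;
  submit: PlainTextObject;
  close: PlainTextObject;
  blocks: InputBlock[];
}

const plain = (text: string): PlainTextObject => ({ type: "plain_text", text });

// Builds one labelled dropdown whose options mirror the given values.
function selectBlock(blockId: string, label: string, values: string[]): InputBlock {
  return {
    type: "input",
    block_id: blockId,
    label: plain(label),
    element: {
      type: "static_select",
      action_id: `${blockId}_select`, // hypothetical naming scheme
      placeholder: plain(`Select a ${label.toLowerCase()}`),
      options: values.map((v) => ({ text: plain(v), value: v })),
    },
  };
}

// Assembles the whole modal: app, branch and workflow dropdowns.
function buildTriggerModal(apps: string[], branches: string[], workflows: string[]): ModalView {
  return {
    type: "modal",
    callback_id: "trigger_build", // hypothetical identifier
    title: plain("Trigger a build"),
    submit: plain("Enqueue"),
    close: plain("Cancel"),
    blocks: [
      selectBlock("app", "App", apps),
      selectBlock("branch", "Branch", branches),
      selectBlock("workflow", "Workflow", workflows),
    ],
  };
}
```

A real modal would likely refresh the branch and workflow lists once an app is picked, and Slack caps the number of options in a static select, so this is a simplification.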

The experience with our first PoC

After a week of working on it, we had our first version ready: our Slack bot was able to list all the available app/branch/workflow combinations and trigger a build in a matter of seconds, with just a /bitrise slash command in the chat.

This is roughly how it worked, having the User, Build Bot and Bitrise in place:

The initial rollout of our Build Bot was very quiet, and as you can see from the diagram above we didn’t have a permissions system in place; instead, we gated the slash command behind a check that prevented anyone but myself and a few other colleagues from using it. Sure, we did have a secondary workspace we used for development and to try stuff out, but at the same time we wanted to keep a production version.

The limits at that time

During some testing, however, we noticed that:

  • Some builds were failing due to non-existing branches: this left us pretty surprised at first, then we discovered that the Bitrise API returns the branches featured at least once in the build history and not the actual existing branches, so it’s no wonder if you’re looking for a release/4.0.0 branch and you get entries like hotfix/1.3.2 from a year ago! Back then the API documentation wasn’t clear enough, and it was only after contacting Bitrise support that we figured out how this worked.
  • A simple misconfiguration could have opened the bot to everyone: not having a proper permissions system in place before publishing the bot to the entire workspace would’ve meant making everything available to ~80 people who could trigger any build in any variant. This was because the API calls to Bitrise were (and still are) performed with an authentication token that belongs to a service account with access to all the apps in our organization, even private ones. Why would someone need access to the super-secret-simple-poc-ci-check-test app on Bitrise? What’s the point of actually using a different tool if it doesn’t solve what is probably one of our most important headaches?
  • Permissions via code require unnecessary deployments: it’s a huge waste of time, as well as the worst way to deal with them. We were honestly tired of redeploying the app for every single change.

The need for a better ACL

While we managed to get our first version up and running, as with all things our needs changed, and there was more to address before publishing the tool to everyone.

For example, we wanted to:

  • Define which apps would show up in the modal for a given user: some apps in our dashboard are pure playground material, some are there for legacy reasons and we don’t want to delete them just to make them disappear from the list.
  • Have stronger control over permissions, defining a list of users with different access tiers who can trigger builds on certain apps with specific workflows limited to their needs. The Release Manager, for instance, is the only one who can trigger production builds later deployed to the stores.
  • Set the permissions dynamically without having to deploy a new version for trivial changes.
  • Have an always updated branch list for any app, at any time, without the need to manually trigger a build on the Bitrise web page first.

With that in mind, we created a microservice - later called Trixie - to help us solve these issues, tackling them one at a time.

Identify Users and create different Access Tiers

We started by fetching the Slack users list from our workspace and combining their UserId + TeamId to identify them, and then we stored them in a PostgreSQL database.

Then, we defined four different tiers of access:

  1. Basic -> Users who can access our main apps and trigger builds that will be deployed to Firebase App Distribution only (for both Android and iOS). This level is not critical at all, because the builds also have specific distribution groups set on Firebase and no production harm can be done.
  2. Developer -> Users who have Basic permissions but can also access internal libraries/pods projects and trigger special workflows on them. This gets important when for example someone wants to deploy a snapshot version of an Android library by using our internal GitHub Packages Gradle plugin (our hub is read-only for most non-CI users).
  3. Release Manager -> Users who have Developer permissions and can submit apps to the stores via release workflows.
  4. Admin -> Users who have Release Manager permissions but can access all apps, all workflows and modify permissions at runtime via Slack (e.g. for allowing a given user to use the bot, since we want it to be opt-in).

Every user has an associated access_tier that defaults to Basic upon creation.
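
The tiers above can be sketched as an ordered list, with a user key built from the Slack team and user ids. This is only an illustration: the strictly linear ordering, the "teamId:userId" key format, the function names and the "release" workflow entry are assumptions made for the example (dev-fad is the only workflow name taken from this post), and Trixie itself is written in Kotlin.

```typescript
// Tiers ordered from least to most privileged (assumed to be strictly linear).
const TIERS = ["basic", "developer", "release_manager", "admin"] as const;
type AccessTier = (typeof TIERS)[number];

interface BotUser {
  key: string; // assumed format: "<teamId>:<userId>"
  accessTier: AccessTier;
}

// Combines the Slack team id and user id into a single identifier.
function userKey(teamId: string, userId: string): string {
  return `${teamId}:${userId}`;
}

// New users default to the Basic tier, as described above.
function createUser(teamId: string, userId: string, accessTier: AccessTier = "basic"): BotUser {
  return { key: userKey(teamId, userId), accessTier };
}

// True when the user's tier is at least as privileged as the required one.
function hasAtLeast(user: BotUser, required: AccessTier): boolean {
  return TIERS.indexOf(user.accessTier) >= TIERS.indexOf(required);
}

// Hypothetical mapping from workflow name to the minimum tier allowed to run it.
const REQUIRED_TIER: { [workflow: string]: AccessTier | undefined } = {
  "dev-fad": "basic",
  "release": "release_manager", // hypothetical workflow name
};

// Unknown workflows are denied by default.
function canTrigger(user: BotUser, workflow: string): boolean {
  const required = REQUIRED_TIER[workflow];
  return required !== undefined && hasAtLeast(user, required);
}
```

Denying unknown workflows by default keeps a newly added workflow invisible until someone explicitly assigns it a tier.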

Solving the outdated branches list issue

Tired of not knowing whether a branch was available or not, we generated an SSO-enabled token from one of our GitHub service accounts, able to access almost any repository within the organization, and used it to perform authenticated API calls to fetch the list of currently active branches for a given repository.

In this case we didn’t store the branches list in our database, as we always want an updated list at any time (and also because the API call is not that expensive to make).
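
As a rough sketch of that call, the snippet below builds a request for GitHub's REST endpoint that lists the branches of a repository and extracts the branch names from its JSON response. It deliberately stops short of performing the HTTP call, so any client can be plugged in; the function and type names are made up for the example, and the owner, repository and token are placeholders.

```typescript
interface HttpRequest {
  url: string;
  headers: { [name: string]: string };
}

// Builds the request for GET /repos/{owner}/{repo}/branches.
// The endpoint is paginated, so a long branch list needs several pages.
function branchesRequest(owner: string, repo: string, token: string, page: number = 1): HttpRequest {
  const path = `/repos/${encodeURIComponent(owner)}/${encodeURIComponent(repo)}/branches`;
  return {
    url: `https://api.github.com${path}?per_page=100&page=${page}`,
    headers: {
      Accept: "application/vnd.github+json",
      Authorization: `Bearer ${token}`,
    },
  };
}

// Only the field this sketch relies on; the real response carries more.
interface GitHubBranch {
  name: string;
}

// Extracts the branch names from one page of the response, sorted for display.
function branchNames(payload: GitHubBranch[]): string[] {
  return payload.map((b) => b.name).sort();
}
```

Because the list comes straight from the repository rather than from the build history, a deleted branch disappears from the dropdown as soon as it is gone from GitHub.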

Give our users a fast way to opt-in

In order to save ourselves from writing long SQL queries and also make the whole thing as easy as possible, we added two additional /buildbot commands, help and request-access.

The former doesn’t really do much, as it sends the caller a DM with a brief overview of what the bot actually does. The latter, on the other hand, automatically creates an access request on behalf of the user that needs to be approved by an Admin with just a simple action on a service request message posted into a private channel.
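
Slack delivers everything typed after a slash command in the text field of the request, so routing the two subcommands boils down to a small parser. The sketch below is a guess at how that dispatch could look; the type and function names are hypothetical, and treating an empty text as "open the build modal" is an assumption.

```typescript
// The actions the bot can take in response to the slash command.
type BotCommand =
  | { kind: "open_modal" }
  | { kind: "help" }
  | { kind: "request_access" }
  | { kind: "unknown"; input: string };

// Maps the text typed after the slash command to an action.
function parseCommandText(text: string): BotCommand {
  const sub = text.trim().toLowerCase();
  switch (sub) {
    case "":
      return { kind: "open_modal" }; // assumed default when no subcommand is given
    case "help":
      return { kind: "help" };
    case "request-access":
      return { kind: "request_access" };
    default:
      return { kind: "unknown", input: sub };
  }
}
```

Returning an explicit "unknown" result, instead of throwing, lets the bot reply with the help text when someone mistypes a subcommand.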

All these interactive changes are still backed by Trixie, which is built with Ktor for dispatching requests and Exposed for managing the underlying database.

What if… An API Gateway?

At this point we had Build Bot making calls to Trixie for fetching permissions, GitHub for the active branches and then Bitrise for triggering builds and getting their status.

Once again, this solution worked, but we were not done yet:

  • Three clients were too many for what we actually needed, and they made the Slack application unnecessarily convoluted for its scope.
  • Build Bot and Trixie were both performing calls to Bitrise. We had to maintain two clients with the same configuration, as the only difference was the language used to implement them (Build Bot is written in TypeScript, while Trixie is written in Kotlin). If one day we were asked to develop a Microsoft Teams or a Discord bot to do the same thing, we’d have to duplicate most of their logic.
  • For the same reason, we wanted to expand the usage of Trixie in the future to trigger builds from other platforms (such as GitHub comments, but this is a topic for another post!), and with that in mind we wanted to centralize as much as possible.

Going back to diagrams, this was the intermediate situation described above:

We then went back and made Trixie our API gateway for triggering builds, finally changing the way our actors interacted with each other.

Time to launch some builds! 🚀

Once everything is up and running, it’s time for our efforts to bear fruit. Let’s say we have this super secret project, called app-to-build-2, that we want to build on the branch named feature/NOJIRA/fancy-refactoring with the workflow dev-fad.

We start by typing /bitrise in any channel on Slack, select the app, branch and workflow from the dropdown menus and confirm via the Enqueue button.

That’s it! Our build has been triggered in a blink of an eye, and looking at the Bitrise dashboard in the app-to-build-2 page, this is what we see:

As you probably have noticed, we build against a branch AND the commit SHA-1 hash at the time of the launch. This is an intended behavior: if we launch a rebuild from the Bitrise dashboard for any reason (e.g. one of the package registries times out), we want the same commit to be built to avoid any issues.
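
For reference, pinning a build to both a branch and a commit can be expressed with the build_params of Bitrise's build trigger endpoint. The sketch below only assembles the request; the function name is made up, the app slug, token and hash are placeholders, and passing the raw access token in the Authorization header reflects my understanding of the Bitrise API rather than anything stated in this post.

```typescript
interface BuildParams {
  branch: string;
  workflow_id: string;
  commit_hash?: string; // pins the build to the commit seen at launch time
  commit_message?: string;
}

interface TriggerRequest {
  method: "POST";
  url: string;
  headers: { [name: string]: string };
  body: string;
}

// Assembles the request for POST /v0.1/apps/{app-slug}/builds.
function triggerRequest(appSlug: string, token: string, params: BuildParams): TriggerRequest {
  return {
    method: "POST",
    url: `https://api.bitrise.io/v0.1/apps/${encodeURIComponent(appSlug)}/builds`,
    headers: {
      Authorization: token, // assumption: Bitrise expects the raw token here
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      hook_info: { type: "bitrise" },
      build_params: params,
    }),
  };
}

// Example with placeholder values; the slug, token and hash are not real.
const exampleTrigger = triggerRequest("app-slug-placeholder", "token-placeholder", {
  branch: "feature/NOJIRA/fancy-refactoring",
  workflow_id: "dev-fad",
  commit_hash: "0000000000000000000000000000000000000000",
});
```

With the hash in the payload, a rebuild launched from the dashboard reuses the same commit even if the branch has moved on in the meantime.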

Our bot has been online since the beginning of September 2020, and more than 1700 builds (1708 to be exact, at the time of writing) have been launched by QA, Product Owners and Developers!

Niccolò Forlini

Senior Mobile Engineer