2021-03-23

Continuous Delivery for iOS, Completed on GitHub by Making Full Use of GitHub Actions + Fastlane

DevOps

f:id:tadashi-nemoto0713:20210224120254p:plain

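I'm Tadashi Nemoto, a Platform Engineer (formerly DevOps Engineer).

Last time, I described how we improved our API / Frontend deployment flow using GitHub Actions + GitLab Flow.

techblog.exawizards.com

We were also able to improve continuous delivery for iOS with GitHub Actions, so this time I'd like to introduce what we did.

I believe the approach applies not only to iOS, but also to continuous delivery for Android development and for multi-platform development such as Flutter.

Adopting Git Flow and a brief explanation of Git Flow
Continuous delivery explained
Summary of the continuous delivery steps and closing remarks
References and notes

Adopting Git Flow and a brief explanation of Git Flow

In the previous article, I said that we adopted GitLab Flow as the deployment flow and branching strategy for API / Frontend. Mobile app releases, however, have the following characteristics, which I felt make them a poor fit for GitLab Flow or GitHub Flow.

Each release must raise the version above the one currently published through App Store Connect / Google Play Console
Releases require review, so you cannot release at any moment of your choosing

Some products use other release flows for mobile apps, such as trunk-based development. However, I felt that trunk-based development suits relatively large-scale development with a fixed cadence (for example, always releasing once a week) and does not match the current state of our mobile development.

For these reasons, we adopted Git Flow, which is widely used as a deployment flow and branching strategy for mobile apps.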

f:id:tadashi-nemoto0713:20210118184341p:plain

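To make the rest of this article easier to follow, here is a brief explanation of Git Flow.

① When implementing a feature, create a feature branch from the develop branch and start working. Once the work and review are complete, merge it into the develop branch.

f:id:tadashi-nemoto0713:20210316161849p:plain

② When it is time to prepare a release, create a release branch from the develop branch.

f:id:tadashi-nemoto0713:20210212164923p:plain

③ Perform the checks and fixes needed for the release on the release branch, and once it is ready to release, merge it into the master and develop branches.

f:id:tadashi-nemoto0713:20210212182725p:plain

④ After merging the release branch into the master branch, add a tag and release to production (in mobile development, submit to App Store Connect / Google Play Console).

f:id:tadashi-nemoto0713:20210212182646p:plain

⑤ If a serious bug is found after the release, perform a hotfix release. Create a hotfix release branch from the master branch.

f:id:tadashi-nemoto0713:20210212171006p:plain

⑥ Once the fixes and checks on the hotfix release branch are done and it is ready to release, merge it into the master and develop branches and release to production again.

f:id:tadashi-nemoto0713:20210317122731p:plain

Next, I'll explain how we implemented continuous delivery within this Git Flow.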

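Continuous delivery explained

Creating the release branch and pull requests

As development on the develop branch in ① progresses and it is time to prepare a release, create the release branch described in ②.

f:id:tadashi-nemoto0713:20210212164923p:plain — Creating the release branch

Release branches are often created by hand on a local machine, but this time we were able to automate this with GitHub Actions + Fastlane.

First, release versions are bumped according to Semantic Versioning. I won't explain Semantic Versioning itself in depth here, but it standardizes how versions are bumped through its Major, Minor, and Patch components.

f:id:tadashi-nemoto0713:20210305142013j:plain:w200 — Semantic Versioning

Fastlane has an action called increment_version_number. Pass it one of Major, Minor, or Patch as an argument and it bumps the version according to Semantic Versioning.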
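To make the bump rules concrete, here is a minimal sketch of Semantic Versioning bumps. It is written in Kotlin purely for illustration (Fastlane itself is configured in Ruby), and the names SemVer and BumpType are hypothetical, not part of Fastlane.

```kotlin
// Illustrative model of Semantic Versioning bumps; not Fastlane's implementation.
enum class BumpType { MAJOR, MINOR, PATCH }

data class SemVer(val major: Int, val minor: Int, val patch: Int) {
    // A higher-order bump resets the lower-order components to zero.
    fun bump(type: BumpType): SemVer = when (type) {
        BumpType.MAJOR -> SemVer(major + 1, 0, 0)
        BumpType.MINOR -> SemVer(major, minor + 1, 0)
        BumpType.PATCH -> SemVer(major, minor, patch + 1)
    }

    override fun toString(): String = "$major.$minor.$patch"

    companion object {
        // Parses a plain "MAJOR.MINOR.PATCH" string (no pre-release or build metadata).
        fun parse(text: String): SemVer {
            val parts = text.split(".").map { it.toInt() }
            require(parts.size == 3) { "expected MAJOR.MINOR.PATCH, got '$text'" }
            return SemVer(parts[0], parts[1], parts[2])
        }
    }
}
```

Under these rules, a Minor bump of 1.2.3 gives 1.3.0, while a Patch bump (used only for hotfix releases in this flow) gives 1.2.4.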

The Fastlane lane below then handles all of the following:

Updating the release version according to Semantic Versioning and committing the changed files
Creating the release branch and pushing it to GitHub
Creating pull requests to the master and develop branches

f:id:tadashi-nemoto0713:20210315131519p:plain

The increment_version_number Fastlane action itself is currently provided for iOS only, but the article below describes an example for Android, which you may find useful.

Automating semantic versioning model in mobile releases | ThoughtWorks

Finally, we run this Fastlane lane from GitHub Actions.

f:id:tadashi-nemoto0713:20210312182827p:plain

GitHub Actions offers many kinds of workflow triggers. Because we want to run this at a time of our choosing, we use workflow_dispatch, which can be triggered manually from the GitHub Actions UI.

Events that trigger workflows - GitHub Docs

workflow_dispatch can also take inputs, so we let the caller specify whether this version bump is Major or Minor (Patch is used only for hotfix releases, so it is not offered here).

As a result, by triggering the workflow from the GitHub Actions UI with Major or Minor specified, we automated updating the release version and creating the release branch and pull requests.

f:id:tadashi-nemoto0713:20210314171333p:plain — Running workflow_dispatch from GitHub Actions

f:id:tadashi-nemoto0713:20210312183859p:plain — Pull requests targeting the master and develop branches

f:id:tadashi-nemoto0713:20210302224859p:plain — File diff for the version change

Merging the two release pull requests at the same time

In the previous step, two pull requests from the release branch to the master and develop branches were created automatically. As explained in ③, once the release branch is ready to release, it is merged into the master and develop branches.

f:id:tadashi-nemoto0713:20210212182725p:plain

You can of course merge the two pull requests by hand, but then you risk forgetting to merge one of them. To merge both pull requests at the same time without forgetting either, I created the following GitHub Actions workflow.

f:id:tadashi-nemoto0713:20210312181035p:plain

With this workflow, adding the release label to a pull request merges both pull requests, the one targeting develop and the one targeting master, at the same time.
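The selection logic behind such a workflow can be sketched as follows. This is an illustrative Kotlin model under stated assumptions, not the workflow shown above: the PullRequest type and pullRequestsToMerge function are hypothetical, and the rule that both targets must be present before anything is merged is my own assumption.

```kotlin
// Hypothetical in-memory model of an open pull request.
data class PullRequest(
    val number: Int,
    val head: String,        // source branch, e.g. the release branch
    val base: String,        // target branch
    val labels: Set<String>
)

private val releaseTargets = setOf("master", "develop")

// Given the pull request that was just labeled, return every pull request that
// should be merged together with it. Returns an empty list unless the label is
// present and both the master and develop pull requests exist for the same head.
fun pullRequestsToMerge(labeled: PullRequest, open: List<PullRequest>): List<PullRequest> {
    if ("release" !in labeled.labels) return emptyList()
    val siblings = open.filter { it.head == labeled.head && it.base in releaseTargets }
    val coveredTargets = siblings.map { it.base }.toSet()
    return if (coveredTargets == releaseTargets) siblings else emptyList()
}
```

Requiring both targets up front is one way to avoid the "merged one, forgot the other" situation described above.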

f:id:tadashi-nemoto0713:20210314132857p:plain

Creating a tag & GitHub Release, and submitting to App Store Connect

Once the release branch has been merged into the develop and master branches and a new commit lands on master, Git Flow calls for creating a tag and releasing to production, as described in ④. In iOS development, this is usually when you submit the app to App Store Connect.

f:id:tadashi-nemoto0713:20210212182646p:plain

Creating the tag & GitHub Release, triggered by a commit to the master branch, can be automated with the following GitHub Actions workflow.

f:id:tadashi-nemoto0713:20210314142612p:plain

f:id:tadashi-nemoto0713:20210314172613p:plain — Creating a tag & GitHub Release

At the same time, we build the release version of the app and upload it to App Store Connect.

f:id:tadashi-nemoto0713:20210314172439p:plain

Here we use Fastlane's deliver action.

Depending on how the deliver action is configured, it can go beyond uploading the build to App Store Connect and also upload metadata and screenshots, submit the app for review, and release it automatically after approval.

Hotfix release

Everything up to this point is the normal release flow.

Apart from normal releases, we perform a hotfix release when a serious bug is found after a release. Following steps ⑤ and ⑥, we create a hotfix release branch from the master branch and, after the fix, merge it into the master and develop branches and release.

f:id:tadashi-nemoto0713:20210212171006p:plain

For a hotfix release, Semantic Versioning means only the Patch component is bumped (x.x.0 → x.x.1).

f:id:tadashi-nemoto0713:20210305135019p:plain

The GitHub Actions workflow checks out the master branch and creates the hotfix release branch and pull requests from there.

f:id:tadashi-nemoto0713:20210314145833p:plain

After the fixes and checks on the hotfix release branch are done, the release proceeds in the same way as a normal release:

Add the release label to the pull request to merge it into the master and develop branches
A commit lands on the master branch, which automatically creates the tag & GitHub Release and submits to App Store Connect

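[Optional] Distributing test builds (Firebase App Distribution)

Although it is not directly related to Git Flow, I'll also briefly explain distributing test builds with GitHub Actions.

To check an app's behavior before release, it is common to distribute test builds through an app distribution service (Firebase App Distribution in our case) and verify them on devices.

Building and uploading the test builds can be automated with GitHub Actions. In Git Flow, good moments to trigger this are when feature development in ① finishes and the feature branch is merged into develop, and when a commit is pushed to the release branch.

f:id:tadashi-nemoto0713:20210212182754p:plain

The GitHub Actions workflow looks like this.

f:id:tadashi-nemoto0713:20210305134944p:plain

In addition to commits to specific branches, this workflow also defines workflow_dispatch as a trigger.

During feature development in ①, you will often find yourself thinking, "I don't want to merge into develop or release yet, but I want to build and distribute this particular feature branch."

f:id:tadashi-nemoto0713:20210317125311p:plain

With workflow_dispatch, you can trigger the workflow manually from the GitHub Actions UI for a branch of your choice.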

f:id:tadashi-nemoto0713:20210212184657p:plain — workflow_dispatch
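The trigger rule described in this section can be summarized in a small sketch. It is an illustrative Kotlin model, not the workflow definition itself; the Trigger type, the shouldDistribute function, and the "release/" branch prefix are assumptions made for the example.

```kotlin
// Hypothetical model of the two ways the distribution workflow can start.
enum class Trigger { PUSH, WORKFLOW_DISPATCH }

// Decide whether a test build should be built and distributed.
// Assumes release branches are named with a "release/" prefix.
fun shouldDistribute(trigger: Trigger, branch: String): Boolean = when (trigger) {
    // A manual run distributes whatever branch the user picked, including feature branches.
    Trigger.WORKFLOW_DISPATCH -> true
    // Automatic runs are limited to develop and release branches.
    Trigger.PUSH -> branch == "develop" || branch.startsWith("release/")
}
```

Keeping the manual trigger unrestricted is what allows a feature branch to be distributed before it is merged.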

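Summary of the continuous delivery steps and closing remarks

Finally, here is a summary of the continuous delivery steps built on GitHub Actions.

[Normal release flow]

Create feature branches and merge them into the develop branch
When it is time to prepare a release, run the release preparation workflow from the GitHub Actions UI (choosing Major or Minor)
The release version is updated (Major or Minor), the release branch is created, and pull requests to the master and develop branches are created
A test build is also built and distributed through Firebase App Distribution at this point, so check the behavior on real devices
When it is ready to release, add the release label to the pull request, and it is merged into the develop and master branches at the same time
A commit lands on the master branch, which automatically creates the tag & GitHub Release and submits to App Store Connect

[Hotfix release]

When it is time to prepare a hotfix release, run the hotfix release preparation workflow from the GitHub Actions UI
The release version is updated (Patch), the release branch is created, and pull requests to the master and develop branches are created
Commit the bug fix to that release branch
Each commit to the release branch automatically builds a test build and distributes it through Firebase App Distribution, so check the behavior on real devices
Once the fix is confirmed, add the release label to the pull request, and it is merged into the develop and master branches at the same time
A commit lands on the master branch, which automatically creates the tag & GitHub Release and submits to App Store Connect

With this, we were able to complete almost all of the work needed for a release on GitHub
(although this depends on how much of the App Store Connect work you automate with Fastlane's deliver action).

Compared with API and frontend development, mobile apps tend to be released less often, yet each release tends to require more work.

With this continuous delivery setup, we hope that:

Improvements to the app reach the market more quickly
Developers can focus more on product development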

hrmos.co

References and notes

Using Github Actions to Automate Our Release Process – Rebecca Franks - @riggaroo
Automating semantic versioning model in mobile releases | ThoughtWorks
The GitHub Actions workflows introduced in this article use self-hosted runners, so some parts are written differently from what GitHub-hosted runners would require.
Some of the GitHub Actions workflows use a personal access token in order to trigger other workflows.

2021-01-28

Improving Continuous Delivery with GitLab Flow + GitHub Actions

DevOps

f:id:tadashi-nemoto0713:20201228172750p:plain

The Japanese version of this blog post can be found here:

techblog.exawizards.com

Hello, I'm Tadashi Nemoto from the DevOps team.

In this article, I will demonstrate how to improve deployment flows using GitHub Actions.

Standard deployment flows and their problems
How about GitHub Flow?
About GitLab Flow
Automatically generate release pull requests using git-pr-release + GitHub Actions
Deploy using GitHub Actions
Effects and challenges
Summary

Standard deployment flows and their problems

Although it varies by department and service, our deployment setup until now was roughly as follows.

Three environments
- develop environment
- staging environment
- production environment
Git Flow as a branching strategy
CI / CD using Jenkins
- develop and staging environments → Deploy with branch updates
- production environment → Software engineers or DevOps engineers deploy by specifying a tag

f:id:tadashi-nemoto0713:20210118184341p:plain

Git Flow works well for large-scale development and for projects with fixed release timing (e.g. iOS / Android work).

However, I have found it to be less beneficial for small to medium scale projects where release timing can be chosen freely (such as API / Frontend work).

In addition, Git Flow involves more complex branch management than other strategies (maintaining release and hotfix branches, for example), and I suspected this was lowering our deployment frequency.

Therefore, I wanted to achieve the following goals by improving deployment flows.

Enable small, autonomous deployments → Increase deployment frequency
Simplify branch management and deployment strategies

How about GitHub Flow?

I think GitHub Flow is often used as an alternative to Git Flow.

GitHub Flow is a branching strategy that is actually used in the development of GitHub services, and I initially looked into it as a possibility for this project.

GitHub Flow has only a master (main) branch and feature branches, allowing for simpler branch management.

f:id:tadashi-nemoto0713:20210115112754p:plain — GitHub Flow(From Understanding the GitHub flow)

The master branch is considered to be a branch that can be deployed to the production environment at any time, and there are many examples of deploying to the production environment triggered by push.

GitHub Flow seemed able to solve the problems I mentioned earlier, but I thought that providing a verification environment before release would become an issue.

As mentioned earlier, in GitHub Flow, the master branch is considered to be the branch that can be deployed to the production environment at any time.

Therefore, each feature branch needs to be used for verification before release.

f:id:tadashi-nemoto0713:20210127144507p:plain — Problem of verification environment in GitHub Flow
(From Introduction to GitLab Flow)

And if you have multiple feature branches / pull requests, you will need to take the following approach.

Switch between one or more validation environments as needed
Launch an environment for each pull request → close the environment after each release

I've been looking for a simpler way to solve the above issues with validation environments, and this time I've decided on GitLab Flow.

About GitLab Flow

GitLab Flow is documented by GitLab.

Introduction to GitLab Flow | GitLab

The above document also lists some of the issues with Git Flow / GitHub Flow.

GitLab Flow allows you to have the branches you need for a release while maintaining the master branch/feature branch relationship of GitHub Flow.

The GitLab Flow documentation introduces the production branch model, the environment branches model, and the release branches model. I thought that the environment branches model would improve our deployment flows while making effective use of our existing environments.

f:id:tadashi-nemoto0713:20210115113159p:plain — Production branch model and environment branches model of GitLab Flow
(From Introduction to GitLab Flow)

In the environment branches model, each branch is paired with an environment, and when branches are modified, they are automatically deployed to their respective environments.

In our case, the mapping looks like this:

master branch → develop environment
staging branch → staging environment
production branch → production environment

You can then proceed with deployment by creating and merging pull requests in master → staging → production.
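The branch-to-environment pairing and the promotion order can be written down as a small lookup. This is only an illustrative Kotlin sketch; environmentFor and nextBranch are hypothetical helper names, not part of GitLab Flow or any tool.

```kotlin
// Environment branches in promotion order, paired with the environment each deploys to.
val environmentBranches: List<Pair<String, String>> = listOf(
    "master" to "develop",
    "staging" to "staging",
    "production" to "production"
)

// Environment that a push to the given branch deploys to, or null for other branches.
fun environmentFor(branch: String): String? =
    environmentBranches.firstOrNull { it.first == branch }?.second

// Branch that a release pull request from the given branch should target,
// or null if the branch is unknown or is already the last stage.
fun nextBranch(branch: String): String? {
    val index = environmentBranches.indexOfFirst { it.first == branch }
    return if (index < 0) null else environmentBranches.getOrNull(index + 1)?.first
}
```

Feature branches map to no environment here, which reflects that only the three environment branches trigger deployments.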

I believe GitLab Flow has the following advantages:

Simpler branch management and deployment than Git Flow
Easier than GitHub Flow to prepare a verification environment before a production release

Automatically generate release pull requests using git-pr-release + GitHub Actions

As mentioned earlier, GitLab Flow deploys by creating and merging pull requests on the master → staging → production branch.

However, if we were to do this manually, I thought the following problems might occur.

Forgetting to create or merge pull requests
Losing track of which changes are included in each pull request

To solve these problems, I automated the creation and updating of the above pull requests using git-pr-release + GitHub Actions.

f:id:tadashi-nemoto0713:20210118142154p:plain — Auto-generated release pull request using git-pr-release + GitHub Actions

git-pr-release can detect differences between branches and create a release pull request with a list of merged pull requests.
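As a rough picture of what such a release pull request contains, the sketch below builds a checklist body from a list of merged pull requests. It is an illustrative Kotlin model and not git-pr-release's own code; the MergedPullRequest type and the exact line format are assumptions.

```kotlin
// Hypothetical summary of a pull request that was merged into the source branch.
data class MergedPullRequest(val number: Int, val title: String, val author: String)

// Build a release pull request body with one checklist line per merged pull request,
// ordered by pull request number. The line format here is an assumption.
fun releaseBody(merged: List<MergedPullRequest>): String =
    merged.sortedBy { it.number }
        .joinToString("\n") { "- [ ] #${it.number} ${it.title} @${it.author}" }
```

A body like this makes it easy to see at a glance which changes a given release will deploy.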

You can run git-pr-release itself as a command, but you can also automate it by creating a GitHub Actions workflow like below.

f:id:tadashi-nemoto0713:20210325114549p:plain

By doing this, we can simplify the deployment flow to the production environment as much as possible, as shown below:

Review pull requests of feature branches and merge them into the master branch
Automatically deploy to the verification environment
Automatically generate release pull requests to deploy to the production environment
Check the validation environment (manual testing or end-to-end testing using the validation environment, etc.)
If there are no problems, merge the pull request and automatically deploy to the production environment

Deploy using GitHub Actions

The actual deployment is also done with GitHub Actions.

It is set to be triggered when there is a push in each environment branch (master, staging, production).

f:id:tadashi-nemoto0713:20210325114621p:plain

We use GitHub Actions self-hosted runners for the deployment jobs. Please see the following blog entry for more details:

techblog.exawizards.com

Effects and challenges

We have introduced this deployment flow in several departments and services, and have been able to achieve the two goals stated at the beginning.

Enable small, autonomous deployments → Increase deployment frequency
Simplify branch management and deployment strategies

In addition, some projects now deploy several times a day, previously having deployed only once every 1-2 weeks.

f:id:tadashi-nemoto0713:20210115113531p:plain

At the same time, there are some issues that we need to be aware of.

GitLab Flow is a relatively flexible flow compared to GitHub Flow, and the number of pull requests that can be deployed to the production environment at one time can be decided flexibly.

Therefore, depending on the situation, the size of the pull requests that can be deployed at one time may become quite large, making it difficult to achieve the original goal of "enabling small deployments".

To reduce the cost of rework in case of a bug, we may need to implement the following policy when using GitLab Flow in the future.

Merge, validate, and deploy features/bug fixes to the production environment for every pull request.
Minor dependency updates created by bots such as Dependabot can be merged, verified, and deployed together.

Summary

This time, I only improved the deployment flow (Continuous Deployment).

Still, we would like to expand the Continuous Testing, DevSecOps, and other mechanisms to continuously improve the service in line with this flow in the future.

I also believe that the optimal deployment flow itself will change depending on the organisation's growth and its requirements, so we will continue to improve upon this in the future.

I hope that this entry will be of some help in improving your own deployment flow.


2021-01-21

Improving and Automating Deployment Flows with GitLab Flow + GitHub Actions

DevOps

f:id:tadashi-nemoto0713:20201228172750p:plain

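I'm Tadashi Nemoto, a DevOps engineer.

In the previous entry, I introduced GitHub Actions self-hosted runners.

This time, I'd like to describe how we used them to improve our deployment flow (mainly for API / Frontend).

Our deployment flow until now and its problems
How about GitHub Flow?
What is GitLab Flow?
Automatically generating release pull requests with git-pr-release + GitHub Actions
Deploying with GitHub Actions
Effects and challenges
Closing remarks

Our deployment flow until now and its problems

Although it varies by department and service, our deployment setup until now was roughly as follows.

Three environments
- develop environment (mainly used by developers)
- staging environment (verification environment before a production release)
- production environment
Git Flow
CI / CD using Jenkins
- develop and staging environments → deployed when the branch is updated
- production environment → deployed by a software engineer or DevOps engineer by specifying a tag

f:id:tadashi-nemoto0713:20210118184341p:plain — Git Flow

Git Flow is a good fit for large-scale development and for products with fixed release timing (such as iOS / Android).

However, I felt it offers little benefit for small to medium scale development where release timing can be chosen freely, such as API / Frontend.

On the contrary, Git Flow requires more complex branch management than other strategies (release and hotfix branches), and I thought this might be lowering our deployment frequency.

So with this improvement to the deployment flow, I wanted to achieve the following.

Enable small, autonomous deployments → increase deployment frequency
Enable simple branch management and deployment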

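How about GitHub Flow?

GitHub Flow is probably the most commonly adopted alternative to Git Flow.

GitHub Flow is the branching strategy actually used in developing GitHub's own services, and at first I also considered whether we could adopt it.

Continuous delivery at GitHub / How GitHub builds and deploy software - Speaker Deck

GitHub Flow has only a master (main) branch and feature branches, which keeps branch management simple.

The master branch is treated as a branch that can be deployed to production at any time, and there are many examples of deploying to production triggered by a push.

GitHub Flow seemed able to solve the problems listed above, but I thought the verification environment before release would become an issue.

As mentioned, in GitHub Flow the master branch is treated as deployable to production at any time.

Verification before release therefore has to be done on each feature branch.

f:id:tadashi-nemoto0713:20210119162043p:plain — The verification environment problem in GitHub Flow
(From Introduction to GitLab Flow)

When multiple feature branches / pull requests exist, you need to take one of the following approaches.

Switch one or more verification environments between branches as needed
Launch an environment for each pull request → close it after release

I wanted a simpler way to solve these verification environment issues, and that is why we introduced GitLab Flow.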

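What is GitLab Flow?

GitLab publishes documentation on GitLab Flow.

Introduction to GitLab Flow | GitLab

A Japanese translation is available in this article.

GitLab flowから学ぶワークフローの実践 | POSTD

The document above also lists issues with Git Flow / GitHub Flow.

GitLab Flow keeps the relationship between the master branch and feature branches from GitHub Flow as it is, while letting you add the branches you need for releases.

The GitLab Flow documentation introduces the production branch model, the environment branches model, and the release branches model. I thought the environment branches model would let us improve the deployment flow while making good use of our existing environments.

In the environment branches model, each branch is paired with an environment, and a change to a branch is automatically deployed to its environment.

In our case, the mapping is as follows.

master branch → develop environment
staging branch → staging environment
production branch → production environment

Deployment then proceeds by creating and merging pull requests from master → staging → production.

I believe this brings the following advantages.

Simpler branch management and deployment than Git Flow
Easier to provide a verification environment before a production release than GitHub Flow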

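Automatically generating release pull requests with git-pr-release + GitHub Actions

As mentioned, GitLab Flow proceeds with deployment by creating and merging pull requests from master → staging → production.

However, I thought doing this by hand risked the following.

Forgetting to create or merge pull requests
Losing track of which changes a pull request contains

To solve these, I automated creating and updating these pull requests with git-pr-release + GitHub Actions.

git-pr-release detects the differences between branches and creates a release pull request that lists the merged pull requests.

git-pr-release can be run as a command, but you can also automate it by creating a GitHub Actions workflow like the one below.

f:id:tadashi-nemoto0713:20210325114549p:plain

With this, we were able to make the deployment flow to production as simple as possible, as follows.

Review the feature branch's pull request and merge it into the master branch
It is automatically deployed to the verification environment, and the release pull request for deploying to production is automatically generated or updated
Check in the verification environment (manual testing, or end-to-end tests against the verification environment, etc.)
If there are no problems, merge the pull request, and it is automatically deployed to production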

Deploying with GitHub Actions

The actual deployment is also done with GitHub Actions.

It is configured to trigger when there is a push to one of the environment branches (master, staging, production).

f:id:tadashi-nemoto0713:20210325114621p:plain

The deployment jobs run on GitHub Actions self-hosted runners.

Please see the entry below for details.

techblog.exawizards.com

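Effects and challenges

We have introduced this deployment flow in several departments and services, and were able to achieve the two goals stated at the beginning.

Enable small, autonomous deployments → increase deployment frequency
Enable simple branch management and deployment

In addition, some services went from deploying once every 1-2 weeks to deploying several times a day.

f:id:tadashi-nemoto0713:20210115113531p:plain

At the same time, some challenges and points to be careful about have emerged.

Compared with GitHub Flow, GitLab Flow is relatively flexible, and the number of pull requests deployed to production at once can also be decided flexibly.

As a result, depending on the situation, the set of pull requests deployed at once can grow large, which may make it hard to achieve the original goal of "enabling small deployments".

To reduce the cost of rework when a bug occurs, we will probably need to settle on policies like the following when using GitLab Flow.

Merge, verify, and deploy features / bug fixes to production one pull request at a time
Merge, verify, and deploy minor dependency updates created by tools such as Dependabot together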

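Closing remarks

This time I improved only the deployment flow (Continuous Deployment), but going forward I would like to build out further mechanisms for continuously improving our services in line with this flow, such as Continuous Testing and DevSecOps.

I also believe the optimal shape of the deployment flow itself changes as the organization and services grow, so we will keep improving it.

I hope this entry is of some help in improving deployment flows in your own team.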


2020-11-04

Creating CI / CD pipeline using GitHub Actions self-hosted runners on AWS ECS

f:id:tadashi-nemoto0713:20201019135904p:plain

This is the English version of this article.

techblog.exawizards.com

Hello, I'm Tadashi Nemoto from the DevOps team.

I joined ExaWizards in July this year to improve CI / CD and promote the use of automated testing in product development.

In this article, I will demonstrate how to create GitHub Actions with self-hosted runners on AWS ECS.

GitHub Actions and self-hosted runners
Running self-hosted runners on Docker
Running self-hosted runners on AWS ECS
Create CI / CD pipeline to deploy an application to AWS ECS
Summary

GitHub Actions and self-hosted runners

You may already know of or use GitHub Actions if you are an active GitHub user.

Combined with the Actions available on GitHub Marketplace, you can easily build a variety of powerful CI/CD pipelines.

In addition, instead of using runners provided by GitHub, you can also prepare your own runner (self-hosted runners).

About self-hosted runners - GitHub Docs

If you use GitHub Actions on a GitHub-provided runner, usage beyond the free quota is billed on a pay-as-you-go basis.

However, there is no additional charge for using GitHub Actions with a self-hosted runner on your own infrastructure.

You can set up new self-hosted runners from the Repository or Organization settings.

f:id:tadashi-nemoto0713:20201016201036p:plain

Once the setup is complete, you will see that it has been added as a runner.

Runners configured in a repository can be executed in that repository, and runners configured in an organization can be executed in all the repositories of that organization.

f:id:tadashi-nemoto0713:20201016201348p:plain

Finally, in your GitHub Actions configuration file, you should indicate that you want it to run using your self-hosted runners.

f:id:tadashi-nemoto0713:20210325114741p:plain

It is true that you will need to set up and maintain self-hosted runners by yourself.

However, I believe it's attractive that we don't have to prepare and maintain the management part of a CI / CD workflow (like a Jenkins master instance). GitHub offers this functionality for free.

Running self-hosted runners on Docker

The source code of the self-hosted runner agent is available as open source, but GitHub currently doesn't provide a Docker image for it.

There are open source projects to get self-hosted runners running on Docker and Kubernetes.

In this article, we will work together to create a CI / CD pipeline using GitHub Actions with self-hosted runners on AWS ECS.

Running self-hosted runners on Docker
Running self-hosted runners on AWS ECS
Creating a CI / CD pipeline to deploy an application to AWS ECS using those self-hosted runners

First, we'll try to run self-hosted runners on our local machine using the below Docker image.

github.com

In order to launch a runner for an organization, run the following docker command:

f:id:tadashi-nemoto0713:20210325114821p:plain

Because we want to run workflows such as docker build, we need to be able to use Docker commands inside this container.

You can solve this by sharing the host machine's Docker daemon with the container (the -v /var/run/docker.sock:/var/run/docker.sock part), an approach known as Docker outside of Docker.

Using Docker-in-Docker for your CI or testing environment? Think twice.

Running self-hosted runners on AWS ECS

Next, we'll use the Docker image from earlier and run it in AWS ECS.

In this article, I will focus on three points, using AWS CDK (TypeScript).

The first is how to achieve Docker outside of Docker (DooD) in AWS ECS (the -v /var/run/docker.sock:/var/run/docker.sock part).

In AWS ECS, you can solve this problem by adding a Volume to the Task side and mounting that Volume on the Container side.

f:id:tadashi-nemoto0713:20210325114902p:plain

f:id:tadashi-nemoto0713:20210325114933p:plain

AWS ECS offers two launch types, Fargate and EC2. Fargate does not currently support the above, so I chose the EC2 launch type this time.

The second is about the role you give to the ECS Task.

The runner needs AWS permissions for the pipeline, for example to push images to ECR and update ECS services. Of course, you can grant them by storing AWS access keys in GitHub and passing them to GitHub Actions.

However, you can eliminate the need to store AWS access keys on the GitHub side by giving the self-hosted runner's container itself the role it needs for the above.

f:id:tadashi-nemoto0713:20210325115036p:plain

The third is about spot instances.

Self-hosted runners are used for CI/CD, so you can leverage spot instances to keep costs low.

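With AWS CDK, you request spot instances by setting spotPrice when adding capacity to the cluster, and you can set the spotInstanceDraining property to true so that ECS drains tasks from a spot instance before it is interrupted.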

f:id:tadashi-nemoto0713:20210325115131p:plain

Create CI / CD pipeline to deploy an application to AWS ECS

Next, we will create a CI / CD pipeline that deploys an application to AWS ECS using these GitHub Actions self-hosted runners.

This can be confusing because both run on AWS ECS, but the cluster running the self-hosted runners and the cluster running the application are assumed to be separate.

f:id:tadashi-nemoto0713:20201020153820p:plain

Docker build and push to ECR
↓
Edit the Task Definition file to point to the new Docker image
↓
Register the Task Definition and wait for the Service to update
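
The middle step, pointing the Task Definition at the newly built image, can be pictured with the sketch below. It is an illustrative Kotlin model only: the TaskDefinition and ContainerDefinition types are simplified stand-ins invented for the example, not the AWS SDK's classes, and the real pipeline performs this step with the GitHub Actions provided by AWS.

```kotlin
// Simplified, hypothetical stand-ins for an ECS task definition and its containers.
data class ContainerDefinition(val name: String, val image: String)
data class TaskDefinition(val family: String, val containers: List<ContainerDefinition>)

// Return a copy of the task definition in which the named container uses the new image.
// All other containers are left untouched; an unknown container name is an error.
fun TaskDefinition.withImage(containerName: String, newImage: String): TaskDefinition {
    require(containers.any { it.name == containerName }) {
        "no container named '$containerName' in task definition '$family'"
    }
    return copy(
        containers = containers.map {
            if (it.name == containerName) it.copy(image = newImage) else it
        }
    )
}
```

The result is what gets registered as a new Task Definition revision in the final step.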

When combined with the steps of GitHub Actions provided by AWS, it looks like this.

f:id:tadashi-nemoto0713:20210325115227p:plain

Normally, you would need to store your AWS access keys in GitHub and pass them in the aws-actions/configure-aws-credentials step.

f:id:tadashi-nemoto0713:20210325115303p:plain

However, this is not necessary this time because we have given self-hosted runners the privileges they need to do so.

Summary

In this article, I have demonstrated how to use GitHub Actions self-hosted runners on AWS ECS. I believe it would be beneficial for the following use cases:

You're using GitHub Actions to deploy an application to AWS, but you don't want to pass AWS access keys to GitHub.
You want to create a cost-efficient CI/CD environment by utilizing spot instances (especially if you expect to significantly exceed your free quota).
You want to run your CI/CD environment on a machine with higher specs than GitHub's runner.

Our team plans to create better CI/CD pipelines based on GitHub Actions in the future.


2020-10-22

Real-time pose estimation in Android

This article is focused on Pose Estimation using TensorFlow Lite. I will guide you through every step from picking an ML model to displaying an output on the screen, with detailed explanations and materials for further reading. We will not dive deep into Machine Learning, however, as our primary goal is to learn how to use the tools provided by TensorFlow to accomplish the task of pose estimation. No prior Machine Learning experience is required, but it is assumed that you have some Java/Kotlin and Android proficiency. Without further ado, let’s get started!

Part 1: TensorFlow Lite

TensorFlow is an open source library for numerical computation and machine learning. It uses Python to provide an API for training and running ML models and deep neural networks, while executing operations in C++.

Data flow graphs are structures that describe how data moves through a series of processing nodes. Each node is a mathematical operation, and each node's input/output is a multidimensional data array, or a tensor.

Simply put, to obtain an array of key points representing a human pose, we need to format the input image to match the model's expected input and run it through the series of transformations described in the model, a process called inference.

TensorFlow Lite is a lightweight version of TensorFlow built specifically for mobile and embedded devices. It supports a set of core operations which have been tuned for performance while staying relatively lean in size. TFLite also provides an interpreter with hardware acceleration in Android (NNAPI). To learn more about TFLite and its constraints, please refer to this guide.

Quick start

To kick start your Android project, please check out the official documentation and this demo app:

Android guide

Pose Estimation demo

In short, to add the tflite module to your project, modify your app's build.gradle as follows:

// Check the latest tensorflow-lite version at JCenter:
// https://bintray.com/google/tensorflow/tensorflow-lite
ext.tfLiteVersion = '0.0.0-nightly'

android {
    defaultConfig {
        ndk {
            // include only relevant architectures to reduce apk size
            abiFilters 'armeabi-v7a', 'arm64-v8a'
        }
    }
}

dependencies {
    // Double quotes are required for Groovy to interpolate $tfLiteVersion
    implementation "org.tensorflow:tensorflow-lite:$tfLiteVersion"
}

The dependency contains core TFLite classes. Let's go over some of them one by one:

Interpreter - A class that helps with building and accessing a native interpreter, which is the interface between Java code and the core C++ TensorFlow Lite logic. Its constructor takes your pre-trained model (as a file or a byte buffer) and Interpreter.Options (more on that later).

Delegate - An interface for providing a native handle to a delegate - an executor that handles partial (or full) computation of a data flow graph.

Tensor - A representation of a multidimensional byte array containing input or output data.

Delegates

By default, all computation will be handled by the CPU. You can parallelize inference on CPU by setting the number of threads the task will run on:

val numThreads = 4 // depends on the number of cores the CPU has
val options = Interpreter.Options().apply { setNumThreads(numThreads) }
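
Rather than hard-coding the thread count, you could derive it from the device. The helper below is a minimal sketch; recommendedThreadCount and its default cap of 4 are assumptions for illustration, not a TensorFlow Lite API or an official recommendation.

```kotlin
// Pick a thread count from the number of available cores, capped at maxThreads.
// Always returns at least 1 so the result is safe to pass to setNumThreads.
fun recommendedThreadCount(
    availableCores: Int = Runtime.getRuntime().availableProcessors(),
    maxThreads: Int = 4
): Int {
    require(maxThreads >= 1) { "maxThreads must be at least 1" }
    return availableCores.coerceIn(1, maxThreads)
}
```

The returned value can be passed to setNumThreads in place of the constant above; measure on your target devices, since more threads do not always mean faster inference.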

TensorFlow Lite provides 3 built-in delegates to run inference on:

GPU - provides a great increase in performance and power efficiency. I would recommend picking the GPU delegate as the default option, with the caveat that your device has to support OpenCL or OpenGL ES 3.1 and that not all operations are supported. You can read more about it in the official docs.

To add the GPU delegate to your project, add the following dependency:

// Check the latest gpu delegate version at JCenter
// https://bintray.com/google/tensorflow/tensorflow-lite-gpu
dependencies {
    // Double quotes so that $tfLiteVersion is interpolated
    implementation "org.tensorflow:tensorflow-lite-gpu:$tfLiteVersion"
}

NNAPI - a delegate that utilizes Neural Networks API providing hardware acceleration on newer Android devices (API 27+). It is included in the tensorflow-lite package, so you don't need to add an extra dependency.

Hexagon - a substitution for the NNAPI delegate on older Android devices that do not fully support Neural Networks API.

Add the following dependency if you want to support older devices:

// Check the latest Hexagon version at JCenter
// https://bintray.com/google/tensorflow/tensorflow-lite-hexagon
ext.tfLiteHexagon = '0.0.0-nightly'
dependencies {
    // double quotes are required for Groovy to interpolate $tfLiteHexagon
    implementation "org.tensorflow:tensorflow-lite-hexagon:$tfLiteHexagon"
}

Below is the complete interpreter setup snippet:

/* (c) ExaWizards */

sealed class DelegateOptions {
    data class CPU(val numThreads: Int): DelegateOptions()
    object GPU: DelegateOptions()
    object NNAPI: DelegateOptions()
    object Hexagon: DelegateOptions()
}

fun createInterpreter(
    context: Context,
    model: MappedByteBuffer,
    delegateOptions: DelegateOptions
): Interpreter {
    val options = Interpreter.Options().apply {
        when (delegateOptions) {
            // CPU is a data class, so match with `is` and read its thread count
            is DelegateOptions.CPU -> setNumThreads(delegateOptions.numThreads)
            DelegateOptions.NNAPI -> setUseNNAPI(true)
            DelegateOptions.GPU -> addDelegate(GpuDelegate())
            // HexagonDelegate needs a Context to load its native libraries
            DelegateOptions.Hexagon -> addDelegate(HexagonDelegate(context))
        }
    }
    return Interpreter(model, options)
}

Support library (experimental)

The TensorFlow team provides an optional package with various utility classes to simplify image operations and tensor buffer processing. If you don't want to deal with bitmap manipulations and bit shifting, then give this library a shot!

Currently it's in beta, so please be careful when adding it to your main application. I'd recommend playing around with it in a side project to catch any potential shortcomings for your use case.

To add the dependency, modify your build.gradle:

// Check the latest support library version at JCenter
// https://bintray.com/google/tensorflow/tensorflow-lite-support
ext.tfLiteSupportVersion = '0.1.0-rc1'
dependencies {
    // double quotes are required for Groovy to interpolate $tfLiteSupportVersion
    implementation "org.tensorflow:tensorflow-lite-support:$tfLiteSupportVersion"
}

Let's take a look at some of the classes and interfaces available:

ImageProcessor - a class that accumulates various transformations and applies them to a target TensorImage

ImageOperator - a base interface for TensorImage transformations, including:

Rot90Op - rotate an image by 90 degrees counter-clockwise N times.
ResizeOp - resize an image to match the target size. It performs scaling, so be careful to preserve your original aspect ratio.
ResizeWithCropOrPadOp - crop or pad an image to match your model's expected input size. It does not scale the original image, make sure to scale it down before applying this operator.

TensorOperator - a base interface for TensorBuffer transformations:

NormalizeOp - perform normalization - adjust buffer values to a common scale, usually in a range of [-1; 1].
QuantizeOp - perform quantization - map float values to a smaller set of integer numbers. It is used in quantized models to increase performance at the cost of precision.
DequantizeOp - reverse quantization.
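To make the arithmetic behind these operators concrete, here is a minimal plain-Kotlin sketch. It assumes the common affine quantization scheme (a scale and a zero point); the function names are mine, and the support library's own operators may round differently, so treat it as an illustration of the idea rather than the library's implementation.

```kotlin
import kotlin.math.roundToInt

// Normalization as in the NormalizeOp description: f(x) = (x - mean) / stddev
fun normalize(value: Float, mean: Float, stddev: Float): Float = (value - mean) / stddev

// Affine quantization (assumed scheme): q = round(x / scale) + zeroPoint
fun quantize(value: Float, scale: Float, zeroPoint: Int): Int =
    (value / scale).roundToInt() + zeroPoint

// Reverse mapping: x is approximately (q - zeroPoint) * scale
fun dequantize(quantized: Int, scale: Float, zeroPoint: Int): Float =
    (quantized - zeroPoint) * scale
```

With mean = 127.5 and stddev = 127.5, normalize maps the pixel range [0; 255] onto [-1; 1].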

Below is an example of building an ImageProcessor and transforming an image bitmap:

/* (c) ExaWizards */

val imageProcessor = ImageProcessor.Builder()
    .add(ResizeOp(scaledHeight, scaledWidth, ResizeOp.ResizeMethod.BILINEAR))
    .add(Rot90Op(numRotations))
    .add(ResizeWithCropOrPadOp(modelHeight, modelWidth))
    // f(x) = (x - 127.5) / 127.5; f(x) ∈ [-1; 1]; x ∈ [0; 255]
    .add(NormalizeOp(127.5f, 127.5f))
    .build()
val tensorImage = TensorImage.fromBitmap(bitmap)
val processedImage = imageProcessor.process(tensorImage)

TFLite wrapper (experimental)

If your model contains metadata, you can use the TensorFlow Lite wrapper code generator. The model wrapper eliminates the need to set up delegates, perform image transformations manually, and deal with raw TensorBuffer output. How helpful the generated code will be depends entirely on the completeness of the metadata. Also, keep in mind that this feature is experimental, so you'll probably want to wait until it becomes stable before replacing all your ML-related logic with generated code.

To learn more about the wrapper code generator, please refer to the official docs.

Part 2: Model

To accomplish our task - human pose estimation - it is crucial that we have a basic understanding of our ML model and learn about our expected inputs/outputs. The TFLite "Getting started" page and linked source code provide enough information to kick start a new Proof of Concept project, but if we are going to make any changes to the core logic or simply want to compare existing options - it's better to know what we're dealing with.

Picking the right model

Let's start by examining a repository of open-sourced ML models - TensorFlow hub. This is a great place to search for domain-specific, format-specific solutions.

Our search query would be an "image pose detection" domain with the "model format" filter set to TFLite (as of June 2020, there's only one model satisfying this criterion: MobileNet_075). Now, we have two options: pure model or model + metadata. From tensorflow.org:

TensorFlow Lite metadata provides a standard for model descriptions. The metadata is an important source of knowledge about what the model does and its input / output information. The metadata consists of both - human readable parts which convey the best practice when using the model, and - machine readable parts that can be leveraged by code generators, such as the TensorFlow Lite Android code generator.

Let's further examine the model with metadata. I found this useful tool to visualize the model's structure: Netron. After uploading the .tflite file we can see the convolutional layers the model has and check what the expected inputs and outputs are:

f:id:ivanpo:20201006145032p:plain — Graph

f:id:ivanpo:20201006145103p:plain — Metadata

A closer look

First, let's understand the input requirements:

Image: FloatArray [1][353][257][3]

To prepare the image for inference, we'll need to scale it down to 353x257 pixels, extract each pixel's RGB value and normalize it, meaning the values should be within [-1;1].

Second, let's pay attention to the outputs:

Image (Grayscale): [1][23][17][17]

The input image reduced to a 23x17 grid of points, where each of the 17 keypoints receives a "confidence score" at every grid point.

Offsets: [1][23][17][34]

Since the output matrix is much smaller than the input, we want a better idea of where the original keypoint might have been. Offset vectors are here to help: once we pick the right (x, y) cell for a keypoint, we map that cell proportionally onto the original image and add the offset to get the final coordinates:

y = keyPoint.y / (23 - 1) * originalHeight + offsets[0][keyPoint.y][keyPoint.x][keyPoint.index]

x = keyPoint.x / (17 - 1) * originalWidth + offsets[0][keyPoint.y][keyPoint.x][keyPoint.index + 17]

Forward displacement: [1][23][17][64]

Backward displacement: [1][23][17][1]

In multi-pose estimation, when there are multiple poses to detect, it is not enough to pick a keypoint with the highest score — we need to pick multiple keypoints and group them into a graph representing a distinct human pose. Displacement arrays are used in a fast greedy decoding algorithm explained in this paper: PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model. I will discuss the implementation later in the series.
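Going back to the offset adjustment for a moment, here is a small worked sketch in plain Kotlin. It assumes the heatmap cell is first mapped proportionally onto the image by dividing by (gridSize - 1), which is what the inference code in Part 3 does; the function name refineCoordinate is only for illustration.

```kotlin
// Map a heatmap cell index plus its offset value to an input-image coordinate.
// cell:      row (for y) or column (for x) of the highest-scoring heatmap cell
// gridSize:  heatmap height (for y) or width (for x), e.g. 23 or 17
// imageSize: input image height (for y) or width (for x), e.g. 353 or 257
// offset:    the matching value from the offsets tensor
fun refineCoordinate(cell: Int, gridSize: Int, imageSize: Int, offset: Float): Float =
    cell / (gridSize - 1).toFloat() * imageSize + offset
```

For example, row 11 of a 23-row heatmap sits halfway down a 353-pixel-high image (176.5), and an offset of 4.5 moves the final y coordinate to 181.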

The output structure seems to correspond with the TensorFlow Pose Estimation starter guide:

Heatmaps: [1][height][width][N]
Offsets: [1][height][width][N * 2]
Forward displacements: [1][height][width][E * 2]
Backward displacements: [1][height][width][E * 2]

You might have noticed that something doesn't add up. The backward displacements matrix should be the same shape as the forward displacements: [1][23][17][64], but instead we are getting [1][23][17][1]. I believe it's a known problem (it is mentioned on StackOverflow), however it only affects multi-pose estimation. For single-pose estimation we will be using a much simpler "brute-force" solution that doesn't involve part-based graph traversal.

Part 3: Inference

Now that I've given an overview of TFLite, models and available support tools, it's time to dive into the process of inference. The goal is to feed a prepared TensorImage to an interpreter and extract 17 key points with their (x, y) location and probability (confidence).

Preparation

If you manually downloaded the right model for your task, I recommend placing it in the /assets folder. If you don't want to check the file into VCS, simply add it to .gitignore and use this handy Gradle script, which will download the file automatically at build time:

/* (c) ExaWizards */

// download.gradle
def targetFile = "src/main/assets/posenet_model_meta.tflite"
def modelFloatDownloadUrl = "https://tfhub.dev/tensorflow/lite-model/posenet/mobilenet/float/075/1/metadata/1?lite-format=tflite"

task downloadModelFloat(type: DownloadUrlTask) {
    doFirst {
        println "Downloading ${modelFloatDownloadUrl}"
    }
    sourceUrl = "${modelFloatDownloadUrl}"
    target = file("${targetFile}")
}

class DownloadUrlTask extends DefaultTask {
    @Input
    String sourceUrl

    @OutputFile
    File target

    @TaskAction
    void download() {
        ant.get(src: sourceUrl, dest: target)
    }
}

// the task defined above is downloadModelFloat
preBuild.dependsOn downloadModelFloat

// Add this line to your build.gradle
// apply from:'download.gradle'

Now, we need to set up an interpreter. Depending on your target devices and benchmarks, you may choose one of the few available delegates, and load the model from the assets folder:

/* (c) ExaWizards */

private fun createInterpreter(device: Model.Device): Interpreter {
        val options = Interpreter.Options().apply {
            when (device) {
                Model.Device.CPU -> setNumThreads(numThreads)
                Model.Device.GPU -> addDelegate(GpuDelegate())
                Model.Device.NNAPI -> setUseNNAPI(true)
            }
        }
        return Interpreter(loadModelFile("posenet_model_meta.tflite", context), options)
}

private fun loadModelFile(path: String, context: Context): MappedByteBuffer {
        val fileDescriptor = context.assets.openFd(path)
        val inputStream = FileInputStream(fileDescriptor.fileDescriptor)
        return inputStream.channel.map(
            FileChannel.MapMode.READ_ONLY, fileDescriptor.startOffset, fileDescriptor.declaredLength
        )
}

To avoid getting errors during model loading, add this to your app's build.gradle to disable .tflite file compression:

android {
    ...
    aaptOptions {
        noCompress "tflite"
    }
}

Assuming you've already prepared a TensorImage (check Part 4 for more info), let's proceed with inference.

Single pose estimation

The interpreter takes an input array of ByteBuffer with a Tensor shape defined by the model; in our case it's [1, 353, 257, 3]. The output array will contain four 4-dimensional float arrays: Heatmaps, Offsets, Forward displacements, Backward displacements. You can get their default shapes by calling

getInterpreter().getOutputTensor(i).shape(),

where i ∈ [0, 3], as we have 4 output tensors.

/* (c) ExaWizards */

val outputMap = mutableMapOf<Int, Any>()

fun estimatePose(tensorImage: TensorImage): Person {
    val inputArray = arrayOf(tensorImage.buffer)
    // allocate one empty 4D float array per output tensor
    (0 until interpreter.outputTensorCount).forEach {
        outputMap[it] = reshapeTo4dArray(interpreter.getOutputTensor(it).shape())
    }
    interpreter.runForMultipleInputsOutputs(inputArray, outputMap)
    // parse outputMap
    return extractKeyPoints(outputMap, tensorImage.width, tensorImage.height)
}

private fun reshapeTo4dArray(shape: IntArray): Array<Array<Array<FloatArray>>> =
        Array(shape[0]) { Array(shape[1]) { Array(shape[2]) { FloatArray(shape[3]) } } }

The next step is to extract key points and create a Person object that contains all of the information we need to draw a person's shape on-screen. Since we are focusing on single pose estimation for now, we will only need two arrays: Heatmaps and Offsets. The idea is to find the locations of the key points with the highest confidence scores, calculate their (x, y) coordinates using offset adjustment and normalize the confidence score to the range [0;1].

/* (c) ExaWizards */

// order is important!
enum class BodyPart {
    NOSE, LEFT_EYE, RIGHT_EYE, LEFT_EAR, RIGHT_EAR, LEFT_SHOULDER, RIGHT_SHOULDER,
    LEFT_ELBOW, RIGHT_ELBOW, LEFT_WRIST, RIGHT_WRIST, LEFT_HIP, RIGHT_HIP,
    LEFT_KNEE, RIGHT_KNEE, LEFT_ANKLE, RIGHT_ANKLE
}

data class Position(val x: Int, val y: Int)
data class KeyPoint(val bodyPart: BodyPart, val position: Position, val score: Float)
data class Person(val keyPoints: List<KeyPoint>, val score: Float)

@Suppress("UNCHECKED_CAST")
private fun extractKeyPoints(
    outputMap: Map<Int, Any>,
    imageWidth: Int,
    imageHeight: Int
): Person {
    val heatMaps = outputMap[0] as Array<Array<Array<FloatArray>>>
    val offsets = outputMap[1] as Array<Array<Array<FloatArray>>>

    val height = heatMaps[0].size
    val width = heatMaps[0][0].size
    val numKeyPoints = heatMaps[0][0][0].size

    val keyPoints = mutableListOf<KeyPoint>()
    val bodyParts = enumValues<BodyPart>()
    var totalConfidence = 0f
    for (keyPoint in 0 until numKeyPoints) {
        var maxVal = heatMaps[0][0][0][keyPoint]
        var maxRow = 0
        var maxCol = 0
        // Find the (row, col) locations of where the keyPoints are most likely to be.
        for (row in 0 until height) {
            for (col in 0 until width) {
                if (heatMaps[0][row][col][keyPoint] > maxVal) {
                    maxVal = heatMaps[0][row][col][keyPoint]
                    maxRow = row
                    maxCol = col
                }
            }
        }
        val yDisplacement = offsets[0][maxRow][maxCol][keyPoint]
        val xDisplacement = offsets[0][maxRow][maxCol][keyPoint + numKeyPoints]
        val yCoord = maxRow / (height - 1).toFloat() * imageHeight + yDisplacement
        val xCoord = maxCol / (width - 1).toFloat() * imageWidth + xDisplacement
        val confidence = sigmoid(maxVal)
        val bodyPart = bodyParts[keyPoint]
        totalConfidence += confidence
        keyPoints.add(KeyPoint(bodyPart, Position(xCoord.toInt(), yCoord.toInt()), confidence))
    }

    return Person(keyPoints, totalConfidence / numKeyPoints)
}

/** Returns a value within [0,1].   */
private fun sigmoid(x: Float): Float {
    return (1.0f / (1.0f + exp(-x)))
}

And there we have it - a Person object containing key point locations and their confidence scores! The next step would be to filter key points by a confidence threshold and translate the coordinates back to the starting image dimensions - remember, we applied a number of transformations (rotation, scale, crop) to the original input. I will discuss this logic later in the series, using a CameraX feed as an example.
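As a preview of the confidence filtering step, here is a minimal plain-Kotlin sketch. ScoredPoint is a simplified stand-in I made up so the example compiles on its own (it is not the KeyPoint class above), and the 0.7 threshold is the same value the overlay code later in this article uses.

```kotlin
import kotlin.math.exp

// Simplified stand-in for the article's KeyPoint type (hypothetical, for illustration only).
data class ScoredPoint(val name: String, val rawScore: Float)

// Squash a raw heatmap value into [0, 1].
fun sigmoid(x: Float): Float = 1.0f / (1.0f + exp(-x))

// Keep only the points whose normalized confidence exceeds the threshold.
fun confidentPoints(points: List<ScoredPoint>, minConfidence: Float = 0.7f): List<ScoredPoint> =
    points.filter { sigmoid(it.rawScore) > minConfidence }
```

Note that the Person object above already stores sigmoid-normalized scores, so with real KeyPoint data you would compare score against the threshold directly.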

Multi-pose estimation

If we want to get more than one person's key points, the brute-force key point search solution above will not work. As I mentioned before, we have to use forward and backward displacement arrays to handle this task.

The idea of a modified algorithm is described in this PersonLab paper:

f:id:ivanpo:20201006145205p:plain — Multipose algorithm

As you can see, the algorithm is non-trivial and requires a bit of time to get right. You can try implementing it yourself, or use one of these open source projects as an example:

PoseNet Typescript by TensorFlow

PoseNet Java by shaqian

Important note: Before you decide to enable multi-pose estimation, make sure your model supports it! The current model listed on TensorFlow Hub returns incorrect displacement arrays, so try using a modified version from this StackOverflow answer instead.

f:id:ivanpo:20201006145314p:plain — Multipose output

Part 4: Camera 1̶ 2̶ X

Android CameraX is a great library used to seamlessly integrate camera logic into the project's codebase by combining existing use cases that interface with the device's camera API: Preview, Image Analysis, Image Capture. If you're not familiar with the CameraX architecture, please refer to the official documentation page.

f:id:ivanpo:20201006145348p:plain — from CameraX documentation

In this part we will focus on combining Preview with Image Analysis to display an inferred human pose on screen in real time.

Preparation

To get started with CameraX and get a better idea of its architecture and capabilities, I recommend following Google’s codelab page. If you want a quick start by looking at a complete implementation, you can refer to my PoseNet sample (coming soon).

Image Analysis

Once you're familiar with the CameraX API, let's start by setting up an ImageAnalysis use case. First, we might want to request a specific resolution by calling

val builder = ImageAnalysis.Builder().setTargetResolution(Size(width, height))

Keep in mind that in order to infer a human pose in real time we will need to heavily downscale our original image to match the Model's input size. However, we can't just request any arbitrary resolution; instead, it will depend on the Camera implementation and will fall back to the nearest available resolution in case the requested size doesn't exist.

Next, let's set an appropriate backpressure strategy. Inference takes time, so we won't be able to process every frame from the camera feed before the next one comes in. To avoid buffer overflow, we will skip subsequent frames until we're done processing the current one:

builder.setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)

Finally, let's create an Analyzer. ImageAnalysis.Analyzer will receive an ImageProxy object, which we will use to get an Image and transform it using the ImageProcessor class provided by the TensorFlow Lite support library.

Here's sample code for the ImageAnalysis setup:

/* (c) ExaWizards */

val useCase: ImageAnalysis = ImageAnalysis.Builder()
    .setTargetResolution(Size(targetWidth, targetHeight))
    .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
    .build()
    .apply(::setAnalyzer)

private fun setAnalyzer(imageAnalysis: ImageAnalysis) {
    imageAnalysis.setAnalyzer(
        Executors.newSingleThreadExecutor(),
        ImageAnalysis.Analyzer { image ->
            val transformedImage = image.use {
                processImage(
                    image.image ?: throw Exception("Unexpected ImageProxy"),
                    image.imageInfo.rotationDegrees,
                    modelConfig.modelWidth,
                    modelConfig.modelHeight
                )
            }
            val person = estimatePose(transformedImage.tensorImage)
            onPoseData(PoseData(
                person,
                transformedImage.originalSize,
                transformedImage.scaledSize,
                transformedImage.paddedSize,
                transformedImage.orientation)
            )
        }
    )
}

data class PoseData(
    val person: Person,
    val originalSize: Size,
    val scaledSize: Size,
    val paddedSize: Size,
    val orientation: Orientation,
    // optional; defaults to null so the five-argument call above compiles
    val transformedBitmap: Bitmap? = null
)

Important note: if you're using the GPU delegate for inference, remember that only the original thread that instantiated a GPU delegate can call it. Here, I'm using Executors.newSingleThreadExecutor() as an image processing executor and lazily creating a GPU instance. That means I cannot reuse the same delegate once I discard the ImageAnalysis object and have to instantiate a new delegate again.

Image Transformation

To prepare an image for inference we need to perform the following series of transformations:

Downscaling → Rotation → Cropping → Normalization

In order to translate the resulting pose coordinates back to the original dimensions, I recommend keeping each step's variables in a data class — that way it will be easier to apply each transformation in reverse order.

Important note: CameraX provides an Image in YUV_420_888 format, which we will convert to RGB values in order to extract a byte buffer for further image processing with PoseNet. I am using RenderScript for YUV → RGB conversion; you can take a look at the "sample approach" here.

The TensorFlow Lite support library provides helper operations discussed earlier, each resulting in creating a new TensorImage that holds a modified Bitmap. A complete image processing function looks something like this:

/* (c) ExaWizards */

private val yuvToRgbConverter = YuvToRgbConverter(context.applicationContext)

data class TransformedImage(
    val tensorImage: TensorImage,
    val originalSize: Size,
    val scaledSize: Size,
    val paddedSize: Size,
    val orientation: Orientation
)

fun processImage(
    image: Image,
    rotationDegrees: Int,
    targetWidth: Int, // input tensor size
    targetHeight: Int // input tensor size
): TransformedImage {
    val imageBitmap = Bitmap.createBitmap(image.width, image.height, Bitmap.Config.ARGB_8888)
    // write the converted RGB pixels into the bitmap allocated above
    yuvToRgbConverter.yuvToRgb(image, imageBitmap)
    val numRotations = rotationDegrees / 90
    val scale = min(image.height.toDouble() / targetWidth, image.width.toDouble() / targetHeight)
    val scaledSize = Size((image.width / scale).toInt(), (image.height / scale).toInt())
    val orientation = if (numRotations % 2 == 0) {
        Orientation.HORIZONTAL
    } else {
        Orientation.VERTICAL
    }
    val imageProcessor = ImageProcessor.Builder()
        .add(ResizeOp(scaledSize.height, scaledSize.width, ResizeOp.ResizeMethod.BILINEAR))
        .add(Rot90Op(-numRotations))
        .add(ResizeWithCropOrPadOp(targetHeight, targetWidth))
        .add(NormalizeOp(127.5f, 127.5f))
        .build()
    val tensorImage = TensorImage.fromBitmap(imageBitmap)
    return TransformedImage(
        imageProcessor.process(tensorImage),
        Size(image.width, image.height),
        scaledSize,
        Size(targetWidth, targetHeight),
        orientation
    )
}

Coordinate translation

The final step is to extract the inferred pose's key points and apply the coordinate translation algorithm to match the camera's preview layout. The tricky part is to add (x, y) padding in case your pose overlay view aspect ratio doesn't match the original image. The CameraX preview window will do the same, and the effect is similar to ImageView's centerCrop scale type. Let's add this extension function:

/* (c) ExaWizards */

private val minConfidence = 0.7f

// width and height are the dimensions of the overlay view
fun PoseData.extractKeyPoints(width: Int, height: Int): Map<BodyPart, PointF> {
    val scaledWidth: Int
    val scaledHeight: Int
    val originalWidth: Int
    val originalHeight: Int
    when (orientation) {
        Orientation.HORIZONTAL -> {
            scaledWidth = scaledSize.width
            scaledHeight = scaledSize.height
            originalWidth = originalSize.width
            originalHeight = originalSize.height
        }
        Orientation.VERTICAL -> {
            scaledWidth = scaledSize.height
            scaledHeight = scaledSize.width
            originalWidth = originalSize.height
            originalHeight = originalSize.width
        }
    }
    val xOffset = (scaledWidth - paddedSize.width) / 2.0
    val yOffset = (scaledHeight - paddedSize.height) / 2.0

    // crop or pad to fit current view
    val originalRatio = originalHeight / originalWidth.toDouble()
    val widthFactor: Double
    val heightFactor: Double
    val xPad: Double
    val yPad: Double
    if (width * originalRatio >= height) {
        // width is the basis
        xPad = .0
        yPad = (height - width * originalRatio) / 2
        widthFactor =
            (width / originalWidth.toDouble()) * originalWidth / scaledWidth.toDouble()
        heightFactor =
            (width * originalRatio / originalHeight.toDouble()) * originalHeight / scaledHeight.toDouble()
    } else {
        xPad = (width - height / originalRatio) / 2
        yPad = .0
        widthFactor =
            ((height / originalRatio) / originalWidth.toDouble()) * originalWidth / scaledWidth.toDouble()
        heightFactor =
            (height / originalHeight.toDouble()) * originalHeight / scaledHeight.toDouble()
    }

    return person.keyPoints
            .asSequence()
            .filter { it.score > minConfidence }
            .map {
                it.bodyPart to it.position.toAdjustedPoints(
                    widthFactor,
                    heightFactor,
                    xOffset,
                    yOffset,
                    xPad,
                    yPad
                )
            }
            .toMap()
}

private fun Position.toAdjustedPoints(
    widthFactor: Double,
    heightFactor: Double,
    xOffset: Double,
    yOffset: Double,
    xPad: Double,
    yPad: Double
) = PointF(
    ((x + xOffset) * widthFactor + xPad).toFloat(),
    ((y + yOffset) * heightFactor + yPad).toFloat()
)

That's it! Now all you need to do is invalidate() the view on every update from the analyzer and draw a circle at each of the extracted key points:

/* (c) ExaWizards */

// inside PoseOverlayView.kt

private var pointMap: Map<BodyPart, PointF> = emptyMap()
    set(value) {
        field = value
        invalidate()
    }

private val circleRadius = 8.0f
private val circlePaint: Paint = Paint().apply {
    color = Color.WHITE
    strokeWidth = 8.0f
}

fun updatePoseData(poseData: PoseData) {
    // pass the overlay view's current size so key points map onto it
    pointMap = poseData.extractKeyPoints(width, height)
}

override fun onDraw(canvas: Canvas?) {
    super.onDraw(canvas)
    canvas ?: return
    canvas.drawColor(Color.TRANSPARENT, PorterDuff.Mode.CLEAR)
    // draw one circle per extracted key point
    pointMap.values.forEach { canvas.drawCircle(it.x, it.y, circleRadius, circlePaint) }
}

Part 5: Definition of Done

Previously we learned how to set up an interpreter, pick the right model, how to attach the CameraX analyzer and draw the output on a canvas. There’s one more thing left to cover: how to improve user experience depending on your use case. You may need more precise key point estimation, or, maybe, fast inference time is critical for a smooth UX. In this part we will discuss some tips and tricks that may be worth considering.

Optimizing for accuracy

PoseNet is a fully convolutional model, meaning it was trained with a specific image size but can process larger images, sacrificing performance in favor of accuracy. The only rule is that the size should be a multiple of 16, plus 1 (see this answer). Previously, we talked about the expected input/output tensor shapes: [1, 353, 257, 3] for the input and [1, 23, 17, X] for the various output tensors. As you may remember, the input shape represents the number of input image pixels times 3 (one Float per RGB channel). The output shape depends on the outputStride: outWidth = ((inputWidth - 1) / outputStride) + 1, where the outputStride can be 8, 16 or 32 (the default shapes above correspond to an outputStride of 16). The lower the outputStride, the higher the accuracy, but the slower the speed.
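The size rule and the output formula are easy to check with a few lines of plain Kotlin. The function names below are my own, and the sketch only restates the arithmetic from the paragraph above.

```kotlin
// A valid input size is a multiple of 16, plus 1 (e.g. 257, 353, 513, 705).
fun isValidInputSize(size: Int): Boolean = size > 1 && (size - 1) % 16 == 0

// Output grid size for a given input size and output stride:
// outSize = ((inputSize - 1) / outputStride) + 1
fun outputSize(inputSize: Int, outputStride: Int): Int = (inputSize - 1) / outputStride + 1
```

With an outputStride of 16, a 353x257 input gives a 23x17 output grid, and a 705x513 input gives 45x33.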

A pre-trained .tflite model does not support a variable output stride, but we can change the input tensor shape and adjust our expectations for the output tensor. Here’s how to do it:

/* (c) ExaWizards */

// create an interpreter first
val interpreter: Interpreter = Interpreter(model, options)

// let's double the size of the default tensor
fun resizeInput() {
    interpreter.resizeInput(0, intArrayOf(1, 705, 513, 3))
}

// remember to scale a processed image size to 705x513 instead of 353x257
fun <T> estimatePose(byteBuffer: ByteBuffer, decoder: Decoder<T>): T {
    val inputArray = arrayOf(byteBuffer)
    // output shapes will become [1, 45, 33, X]
    interpreter.run(inputArray, outputs.buffer)
    return decoder.decode(create4DArray(outputs))
}

Important note: remember that inference time does not scale linearly. On my Pixel 1 test device, using the GPU delegate, I was able to get ~70ms average inference, while doubling the input size brought the time up to ~270ms!

This method is useful if you don’t care about real-time performance and instead are analyzing a static image while running some scene transition animation or showing a brief loading screen after taking a picture.
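A quick back-of-the-envelope check on those timings: doubling each side of the input roughly quadruples the pixel count, and the measured slowdown is close to that. The sketch below only redoes this arithmetic with the two data points quoted above, so read it as a rough sanity check and not as a benchmark.

```kotlin
// Number of pixels in an input of the given height and width.
fun pixelCount(height: Int, width: Int): Long = height.toLong() * width

// 705x513 vs. 353x257: close to 4x the pixels.
val pixelRatio = pixelCount(705, 513).toDouble() / pixelCount(353, 257)

// ~270ms vs. ~70ms: the measured slowdown, a little under 4x.
val timeRatio = 270.0 / 70.0
```

In other words, on this device the inference time seems to follow the number of pixels, not the length of a side.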

Optimizing for performance

If we can afford to sacrifice accuracy to gain true real-time pose estimation even on lower-end devices, it might be a good idea to scale the image down to an even smaller size. Remember to adjust your input/output tensor shape accordingly.

One other bit of advice I can give you is to optimize the image processing part. During my tests on Pixel 1 I was using the TFLite support library, and the image processing took ~60ms on average, almost the same time as inference itself! Here's what it looked like:

/* (c) ExaWizards */

val imageProcessor = ImageProcessor.Builder()
    .add(ResizeOp(scaledSize.height, scaledSize.width, ResizeOp.ResizeMethod.BILINEAR))
    .add(Rot90Op(-numRotations))
    .add(ResizeWithCropOrPadOp(targetHeight, targetWidth))
    .add(NormalizeOp(127.5f, 127.5f))
    .build()
val tensorImage = TensorImage.fromBitmap(imageBitmap)
val tensorBuffer = imageProcessor.process(tensorImage).tensorBuffer

Under the hood each ImageOperator produces a new Bitmap by applying a transformation to the original image, and the last operation in the chain transforms a Bitmap into a ByteBuffer and performs normalization on it. Let's take a look at how we can optimize this:

Combine ResizeOp with Rot90Op
Leave ResizeWithCropOrPadOp as is
Combine Bitmap → ByteBuffer with NormalizeOp

You can create your own operators by implementing the ImageOperator and TensorOperator interfaces, which are a part of the TFLite support library, but I will show you a sample image transformation without ImageProcessor to better understand how it works:

/* (c) ExaWizards */

val rotateMatrix = Matrix()
val scale = min(
    image.height.toDouble() / targetWidth,
    image.width.toDouble() / targetHeight
)
val scaledSize = Size((image.width / scale).toInt(), (image.height / scale).toInt())
val sx: Float = scaledSize.width / image.width.toFloat()
val sy: Float = scaledSize.height / image.height.toFloat()
// combine ResizeOp with Rot90Op
rotateMatrix.preScale(sx, sy)
rotateMatrix.postRotate(rotationDegrees.toFloat())
val rotatedBitmap = Bitmap.createBitmap(
        imageBitmap, 0, 0, imageBitmap.width, imageBitmap.height,
        rotateMatrix, true
    )

// see ResizeWithCropOrPadOp.java for implementation
val croppedBitmap = cropBitmap(rotatedBitmap, targetHeight, targetWidth)

// extract RGB values and normalize them
val mean = 128f
val std = 128f
val bytesPerChannel = 4
val inputChannels = 3
val batchSize = 1
val inputBuffer = ByteBuffer.allocateDirect(
    batchSize * bytesPerChannel * croppedBitmap.height * croppedBitmap.width * inputChannels
)
inputBuffer.order(ByteOrder.nativeOrder())
inputBuffer.rewind()
val intValues = IntArray(croppedBitmap.width * croppedBitmap.height)
croppedBitmap.getPixels(intValues, 0, croppedBitmap.width, 0, 0, croppedBitmap.width, croppedBitmap.height)
for (pixelValue in intValues) {
    inputBuffer.putFloat(((pixelValue shr 16 and 0xFF) - mean) / std)
    inputBuffer.putFloat(((pixelValue shr 8 and 0xFF) - mean) / std)
    inputBuffer.putFloat(((pixelValue and 0xFF) - mean) / std)
}
return inputBuffer

By applying this simple improvement I was able to save ~25ms on average, bringing the image processing time down to ~35ms.
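If the bit shifting in the loop above looks cryptic, here is the same per-pixel step isolated in plain Kotlin. It assumes the packed ARGB layout that Bitmap.getPixels returns (alpha in the top byte, then red, green, blue); the function name normalizedRgb is just for illustration.

```kotlin
// Split a packed ARGB pixel into normalized R, G, B floats.
// mean and std default to 128, matching the snippet above.
fun normalizedRgb(pixel: Int, mean: Float = 128f, std: Float = 128f): FloatArray {
    val r = (pixel shr 16) and 0xFF // bits 16..23
    val g = (pixel shr 8) and 0xFF  // bits 8..15
    val b = pixel and 0xFF          // bits 0..7
    return floatArrayOf((r - mean) / std, (g - mean) / std, (b - mean) / std)
}
```

The alpha channel is simply ignored, since the model only expects three channels per pixel.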

Frame interpolation

My final tip for you is about providing users with a smooth UX even if your computational budget is relatively low.

Like I mentioned before, the Pixel 1 is not the most performant device to run inference on, with an average time of ~100ms (including image processing) using default tensor shapes. That means every pose update will take at least 100ms to appear on screen, resulting in an average of 10 frames per second. What should we do if we simply can't go faster, but still want smooth 60fps updates?

In that case I suggest using a trick involving interpolation. The idea is that, whenever a new pose update arrives, instead of drawing a new frame immediately, we start gradually moving existing points to their new destination over time, creating the illusion of smooth updates. If an update happens before the points reach their previous destination, simply start a new interpolator from their current position to the new one. It's important to remember that this trick will introduce an artificial delay and will de-sync the camera feed and pose overlay view, making the experience arguably worse on more performant devices (i.e., capable of at least 30fps updates). Still, you can make the interpolation time dynamic and adjust it at runtime based on how much time the last inference took to complete.
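Stripped of the Android specifics, the core of the trick fits in a few lines of plain Kotlin. This is a sketch under my own assumptions: Point2 stands in for android.graphics.PointF, and the 16ms and 250ms bounds on the dynamic duration are illustrative values, not measured ones.

```kotlin
// Plain 2D point so the sketch does not depend on android.graphics.PointF.
data class Point2(val x: Float, val y: Float)

// Linear interpolation between two points; fraction is clamped to [0, 1].
fun lerp(start: Point2, end: Point2, fraction: Float): Point2 {
    val t = fraction.coerceIn(0f, 1f)
    return Point2(start.x + (end.x - start.x) * t, start.y + (end.y - start.y) * t)
}

// Hypothetical policy: animate for as long as the last inference took,
// clamped to illustrative bounds of 16ms and 250ms (in nanoseconds).
fun animationDurationNanos(lastInferenceNanos: Long): Long =
    lastInferenceNanos.coerceIn(16_000_000L, 250_000_000L)
```

On every frame you would compute fraction = elapsed / duration and draw lerp(start, end, fraction); when a new pose arrives mid-animation, the current interpolated point becomes the new start.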

/* (c) ExaWizards */

// in FluidPoseView.kt
...
private var pointMap: MutableMap<BodyPart, PointF> = mutableMapOf()
private val interpolator = LinearInterpolator()
private val flow = MutableStateFlow<MutableMap<BodyPart, PointF>?>(null)
private val coroutineScope: CoroutineScope? = (context as? AppCompatActivity)?.lifecycleScope
private var animJob: Job? = null
private val durationNanos = 1e8f

private val evaluator = object : TypeEvaluator<MutableMap<BodyPart, PointF>> {
    private val pointFEvaluator: PointFEvaluator = PointFEvaluator()

    override fun evaluate(
        fraction: Float,
        startValue: MutableMap<BodyPart, PointF>?,
        endValue: MutableMap<BodyPart, PointF>?
    ): MutableMap<BodyPart, PointF> {
        val updated = startValue?.mapValues { entry ->
            val startPointF = entry.value
            val endPointF = endValue?.get(entry.key)
            when {
                startPointF == zeroPoint -> endPointF ?: zeroPoint
                endPointF == null -> zeroPoint
                else -> pointFEvaluator.evaluate(fraction, startPointF, endPointF)
            }
        }?.toMutableMap() ?: mutableMapOf()
        endValue?.forEach {
            updated.addIfAbsent(it.key, it.value)
        }
        return updated
    }
}

override fun onAttachedToWindow() {
    super.onAttachedToWindow()
    animJob = coroutineScope?.launch {
        flow.collectLatest { endValue ->
            endValue ?: return@collectLatest
            val startValue = pointMap
            val startTime = System.nanoTime()
            while (true) {
                val time = awaitFrame()
                val fraction = (time - startTime) / durationNanos
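                if (fraction >= 1.0f) {
                    // Draw the final frame so the points land exactly on their destination.
                    pointMap = evaluator.evaluate(1.0f, startValue, endValue)
                    invalidate()
                    break
                }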
                val interpolatedFraction = interpolator.getInterpolation(fraction)
                pointMap = evaluator.evaluate(interpolatedFraction, startValue, endValue)
                invalidate()
            }
        }
    }
}

override fun onDetachedFromWindow() {
    super.onDetachedFromWindow()
    animJob?.cancel()
}

override fun onDraw(canvas: Canvas?) {
    // zeroPoint, circleRadius and circlePaint are assumed to be defined in the elided part of the class.
    pointMap.values
        .filter { it != zeroPoint }
        .forEach { canvas?.drawCircle(it.x, it.y, circleRadius, circlePaint) }
}

f:id:ivanpo:20201006150119g:plain — Low performance, discrete frames

f:id:ivanpo:20201006150229g:plain — Low performance, interpolated frames

Conclusion

We learned how to integrate TensorFlow Lite into a project, explored the TensorFlow Lite support library, analyzed the Posenet model, discussed what inference is, and leveraged the CameraX API to efficiently analyze the camera feed in real time. You can apply many of the concepts discussed here to other use cases, too. To give you a quick start, I will prepare an open source sample project showcasing the on-device machine learning kit.

Thanks for your time!

2020-10-22

Building a CI / CD pipeline by running GitHub Actions self-hosted runners on AWS ECS

f:id:tadashi-nemoto0713:20201019135904p:plain

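I'm Tadashi Nemoto, a DevOps engineer.

I joined ExaWizards in July, and I have been working on improving our CI / CD pipelines and evangelizing automated testing.

In this post I'll share what I learned through trial and error while building and operating GitHub Actions self-hosted runners on AWS ECS.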

GitHub Actions and self-hosted runners
Running self-hosted runners in Docker
Running self-hosted runners on AWS ECS
Creating a Workflow that deploys an application to AWS ECS
Closing thoughts

GitHub Actions and self-hosted runners

Most GitHub users are probably already familiar with GitHub Actions, and many already use it.

By combining it with the Actions published on GitHub Marketplace, you can easily build a wide variety of CI / CD pipelines.

Instead of using the Runners that GitHub provides, you can also supply your own (self-hosted runners).

About self-hosted runners - GitHub Docs

When you use GitHub Actions with the Runners that GitHub provides, the billing model is a free tier plus pay-as-you-go charges.

With self-hosted runners, however, there is no additional charge for using GitHub Actions.

You can set up self-hosted runners from the Repository or Organization settings (instructions are shown for Linux, macOS, and Windows).

f:id:tadashi-nemoto0713:20201016201036p:plain

Once setup is complete, you can confirm that the machine has been added as a Runner.

A Runner configured for a Repository can run jobs for that Repository, and a Runner configured for an Organization can run jobs for every Repository in that Organization.

f:id:tadashi-nemoto0713:20201016201348p:plain

Finally, specify in the GitHub Actions configuration file that the jobs should run on self-hosted runners.

f:id:tadashi-nemoto0713:20210325114741p:plain

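With self-hosted runners you have to procure and maintain the machines yourself, but I find it appealing that the part that manages the workflows (the equivalent of the Jenkins master) is available for free without having to host it yourself.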

Running self-hosted runners in Docker

The self-hosted runner itself is published as open source, but at the time of writing no Docker image is provided for it.

From what I found, various open source projects are experimenting with ways to run self-hosted runners on Docker and Kubernetes.

This article focuses on AWS ECS and proceeds in the following steps.

Run self-hosted runners in Docker
Run self-hosted runners on AWS ECS
Use those self-hosted runners to build a pipeline that deploys an application to AWS ECS

First, let's start a self-hosted runner locally using the Docker image below.

github.com

To start a Runner for an Organization, run the following docker command.

f:id:tadashi-nemoto0713:20210325114821p:plain

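Because GitHub Actions workflows perform Docker operations (such as builds), Docker commands also need to be usable inside this container.

This is solved by sharing the Docker daemon on the host machine with the container (the -v /var/run/docker.sock:/var/run/docker.sock part, known as Docker outside of Docker).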

Using Docker-in-Docker for your CI or testing environment? Think twice.

Running self-hosted runners on AWS ECS

Next, let's run the same Docker image on AWS ECS.

I'll explain this based on AWS CDK (TypeScript), focusing on three points.

The first is how to achieve Docker outside of Docker (DooD) on AWS ECS (the -v /var/run/docker.sock:/var/run/docker.sock part).

On AWS ECS, you can do this by adding a Volume to the Task and mounting that Volume in the Container.

f:id:tadashi-nemoto0713:20210325114902p:plain

f:id:tadashi-nemoto0713:20210325114933p:plain

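AWS ECS offers two launch types, Fargate and EC2. Fargate did not support the setup above at the time of writing, so I chose the EC2 launch type.

The second is the Role given to the ECS Task.

The GitHub Actions pipeline in this article performs the Docker build, uploads the image to ECR, and deploys to ECS.

Of course, you could do this by storing AWS access keys on the GitHub side and passing them to the GitHub Action.

However, by granting the necessary Role to the self-hosted runners container itself, you remove the need to store AWS access keys on the GitHub side at all.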

f:id:tadashi-nemoto0713:20210325115036p:plain

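The third is Spot Instances.

Because self-hosted runners are used for CI / CD, you can use Spot Instances to keep costs down.

With AWS CDK, you request Spot Instances by setting spotPrice on the cluster capacity. Setting the spotInstanceDraining property to true additionally lets ECS drain the tasks on an instance that is about to be interrupted.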

f:id:tadashi-nemoto0713:20210325115131p:plain

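Creating a Workflow that deploys an application to AWS ECS

Now let's use GitHub Actions with these self-hosted runners to create a Workflow that deploys an application to AWS ECS.

Both are AWS ECS, which is a little confusing, but I assume that the cluster running the self-hosted runners and the cluster running the application are separate.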

f:id:tadashi-nemoto0713:20201020153820p:plain

Concretely, the steps are as follows.

Run docker build and push the image to ECR
↓
Edit the Task Definition file (update the Docker image entry)
↓
Register the new Task Definition and wait until the Service has been updated
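
To make the second step more concrete, here is a minimal Python sketch of what "editing the Task Definition file" amounts to: replacing the image of one container in the task definition JSON. This is only an illustration, not the AWS-provided Action used in this article. The function name render_task_definition and the sample family, container name, and image tags are hypothetical.

```python
import copy
import json


def render_task_definition(task_definition, container_name, image):
    """Return a copy of the task definition with one container's image replaced."""
    rendered = copy.deepcopy(task_definition)
    for container in rendered["containerDefinitions"]:
        if container["name"] == container_name:
            container["image"] = image
            return rendered
    raise ValueError(f"container {container_name!r} not found")


# Hypothetical task definition, reduced to the fields needed for the example.
task_definition = {
    "family": "sample-app",
    "containerDefinitions": [
        {"name": "app", "image": "example/sample-app:old", "essential": True},
    ],
}

rendered = render_task_definition(task_definition, "app", "example/sample-app:new")
print(json.dumps(rendered, indent=2))
```

The input is deep-copied so that the original definition stays untouched, which makes it easy to compare the old and new versions before registering the new one.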

Combined with the GitHub Actions Steps provided by AWS, the Workflow looks like this.

f:id:tadashi-nemoto0713:20210325115227p:plain

As mentioned earlier, you would normally have to store AWS access keys in GitHub and pass them to the aws-actions/configure-aws-credentials Step, as shown below.

f:id:tadashi-nemoto0713:20210325115303p:plain

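In this setup, however, the necessary permissions are granted to the self-hosted runners, so this is not needed.

Closing thoughts

In this post I built GitHub Actions self-hosted runners on AWS ECS. I think this approach pays off in situations like the following.

When deploying an application to AWS with GitHub Actions, you don't want to hand AWS access keys to GitHub unnecessarily
You want to run a cost-effective CI / CD environment by using Spot Instances (especially when you expect to greatly exceed the free tier)
You want to run your CI / CD environment on machines with higher specs than the Runners that GitHub provides

Going forward, I hope to build even better CI / CD pipelines on top of GitHub Actions.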

hrmos.co

2020-09-14

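[Series] Anomaly detection in time series data (1)

Introduction

Hello. I'm Fukunari (福成毅), an ML engineer.

I worked on developing a time series anomaly detection model as a core technology for one of our in-house products (2019/10 to 2020/03). I had never worked on anomaly detection before, but over time I learned that there are many different approaches. Anomaly detection is a technique for finding the anomalous data that arises from events such as machine failures and system outages, and it is expected to be applied in many industries. At the same time, it tends to be a difficult task because of the shortage of labeled data (anomalous data in particular) and constraints specific to time series.

In this post I describe the basic ideas behind anomaly detection and introduce the typical tasks in time series anomaly detection. The topic is split across several posts, so it will run a little long, but I hope you'll bear with me.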

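Basic ideas

This section describes the basic ideas behind anomaly detection.

Unsupervised learning

Anomaly detection can be tackled with either supervised or unsupervised learning, but unsupervised learning is the more standard approach.

The idea is to first build only a "normal model", and then treat any data that this normal model cannot "understand" as anomalous.

Incidentally, even if normal and anomalous data have not been separated, you can still build a normal model as long as the anomalous data is a tiny fraction of the normal data, because it merely acts as noise (that said, building the normal model from normal data alone is still best).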

f:id:t-fukunari:20200424170800p:plain

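Why supervised learning is difficult

Why is unsupervised learning the standard in anomaly detection? Supervised learning is of course possible if you have labels, but I think there are several hurdles to overcome.

Specifically, the following cases come to mind.

There are no labels to begin with

This is a common story. Possible reasons are that no anomaly has been encountered yet, or that the patterns of anomalies have not been fully enumerated. As discussed later, there are also cases where labels cannot be assigned because the definition of normal changes over time.

There is too little anomalous data

Anomalies rarely occur in the first place (which is exactly why they are "anomalies"), so you may have plenty of normal data but almost no anomalous data. A model trained on such imbalanced data tends to be biased toward predicting normal.

There is a high chance of encountering unknown anomalies

A failure in a part that has been lucky enough not to break so far, a new kind of fraud or hacking, and so on: for the most part, what awaits us is unknown anomalies. Even if you build a model from the anomalies known so far, every time an unknown anomaly is observed you are forced to retrain, and sometimes to change the problem setting.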

f:id:t-fukunari:20200424170852p:plain

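Normal models based on probability distributions

So how do we build this "normal model"? There are various ways, but a common approach is to think in terms of a probability distribution. Roughly speaking, you build a histogram from the normal data and smooth it. The normal data here is assumed to contain no anomalies, or only a very small number of them. A very small number is acceptable because they are effectively ignored when the model is fitted.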

f:id:t-fukunari:20200508134927p:plain

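If the data point you want to check is observed where the distribution above assigns low probability, then something that rarely happens under normal conditions has happened, so you can suspect that the point came from a different distribution. In other words, it is not normal. How low the probability must be to count as anomalous, that is, where to set the threshold, is a matter of tuning, but most methods derive from this idea.

You can also work with summary statistics instead of the distribution itself. For example, a point is treated as anomalous when its distance from the mean of the normal data exceeds a constant multiple of the standard deviation. This is useful when there is too little data to estimate the distribution.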
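
As a minimal sketch of these two ideas, assuming the normal data is roughly Gaussian (an assumption made here purely for illustration), the following fits a mean and standard deviation to normal data, evaluates the density of new points, and applies the "constant multiple of the standard deviation" rule. The function names, the sample values, and the threshold k = 3 are hypothetical.

```python
import math
import statistics


def fit_normal_model(normal_data):
    """Summarize normal data by its mean and (population) standard deviation."""
    mean = statistics.fmean(normal_data)
    std = statistics.pstdev(normal_data)
    return mean, std


def gaussian_density(x, mean, std):
    """Density of x under the fitted Gaussian; low values suggest an anomaly."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def is_anomalous(x, mean, std, k=3.0):
    """Statistic-based rule: anomalous if |x - mean| exceeds k standard deviations."""
    return abs(x - mean) > k * std


normal_data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
mean, std = fit_normal_model(normal_data)
for value in (10.1, 12.5):
    print(value, gaussian_density(value, mean, std), is_anomalous(value, mean, std))
```

With this data, 10.1 lies well inside the fitted distribution while 12.5 is far outside it, so only the latter is flagged. In practice the choice of k plays the role of the threshold discussed above.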

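Ideas specific to time series data

The following sections describe how to think about anomaly detection in time series data.

Time series data does not always stay in the same state, and what counts as normal depends on the interval you look at. For anomaly detection in time series you therefore need to be aware of which point in time the data is anomalous relative to. Put differently, the task is to "treat the data in some interval A as normal and check whether another interval B is anomalous relative to interval A". The rest follows the same idea as the anomaly detection described above: build a model on interval A, and if that model cannot "understand" the data in interval B, you can say that "interval B is anomalous relative to interval A". For convenience, the two intervals are referred to below as follows.

Interval A, assumed to be normal → reference interval
Interval B, to be checked for anomalies → evaluation interval

In most tasks you want to judge whether the most recent part of the series is anomalous, so the key is to place the reference interval and the evaluation interval next to each other. Then, by sliding them as shown in the figure below, you check every interval for anomalies.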

f:id:t-fukunari:20210408172512p:plain
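
The sliding of an adjacent reference interval and evaluation interval can be sketched as a small generator. This is only an illustration of the windowing itself; the function name sliding_intervals and its parameters are hypothetical, and how each pair of intervals is scored is left open.

```python
def sliding_intervals(series, reference_size, evaluation_size, step=1):
    """Yield (boundary_index, reference, evaluation) for adjacent windows.

    The reference interval ends exactly where the evaluation interval begins,
    and both slide forward together by `step` points.
    """
    last_start = len(series) - reference_size - evaluation_size
    for start in range(0, last_start + 1, step):
        boundary = start + reference_size
        reference = series[start:boundary]
        evaluation = series[boundary:boundary + evaluation_size]
        yield boundary, reference, evaluation


series = [1, 2, 3, 4, 5, 6, 7]
for boundary, reference, evaluation in sliding_intervals(series, 3, 2):
    print(boundary, reference, evaluation)
```

For the seven-point series above this yields three pairs of intervals, with the boundary moving from index 3 to index 5. A series that is too short for both windows simply yields nothing.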

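I have not experienced this myself, but if the task has an interval that is known to be normal in an absolute sense, you could also fix the reference interval to that normal interval and slide only the evaluation interval. This approach is likely to be effective for tasks such as machine failure detection.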

f:id:t-fukunari:20210408172345p:plain

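Tasks in time series anomaly detection

The two tasks that come up most often in time series anomaly detection are "outlier detection" and "change point detection". You can probably picture the difference from the names alone, but I'll explain them using the reference interval and evaluation interval framework introduced above.

Evaluation interval of a single point → outlier detection

Setting the evaluation interval to a single point means checking whether that one point is anomalous. This task is called "outlier detection".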

f:id:t-fukunari:20210408173002p:plain

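Outlier detection can be further divided into two types.

The first is outliers that do not depend on the time series, that is, outliers that are still recognizable after the series is shuffled. In this case the value itself can be judged anomalous, so the probability-distribution-based normal model described earlier applies. Such outliers can also be detected with a rule-based approach using a threshold.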

f:id:t-fukunari:20210408172700p:plain
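
As a rough sketch of the time-independent case, the following treats each point as a one-point evaluation interval and compares it with the mean and standard deviation of the reference interval immediately before it. The function name detect_outliers, the window size, and the threshold k = 3 are hypothetical choices for illustration.

```python
import statistics


def detect_outliers(series, reference_size, k=3.0):
    """Return indices whose value lies more than k standard deviations
    from the mean of the preceding reference interval."""
    outliers = []
    for i in range(reference_size, len(series)):
        reference = series[i - reference_size:i]
        mean = statistics.fmean(reference)
        std = statistics.pstdev(reference)
        # Skip flat reference intervals, where the standard deviation is zero.
        if std > 0 and abs(series[i] - mean) > k * std:
            outliers.append(i)
    return outliers


series = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 15.0, 10.1, 9.9, 10.0]
print(detect_outliers(series, reference_size=5))  # [6]
```

Note that once the outlier enters the reference interval it inflates the standard deviation, so the points right after it are judged more leniently. This is one reason why building the normal model from normal data alone is preferable.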

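The other is outliers that depend on the time series, that is, values that can no longer be recognized as anomalous once the series is shuffled. For this kind of outlier, models based on time series prediction, such as ChangeFinder, work better.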

f:id:t-fukunari:20210408172724p:plain
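
The sketch below illustrates the prediction-based idea with a deliberately simple stand-in, not ChangeFinder: each point is predicted by linear extrapolation from the two preceding points, and a point is flagged when the prediction error exceeds a threshold. The function name, the predictor, and the threshold are all assumptions chosen for illustration.

```python
def detect_contextual_outliers(series, threshold):
    """Flag points that deviate from a linear extrapolation of the two
    preceding points by more than `threshold`."""
    cleaned = list(series[:2])  # history used for prediction
    outliers = []
    for i in range(2, len(series)):
        predicted = 2 * cleaned[i - 1] - cleaned[i - 2]
        if abs(series[i] - predicted) > threshold:
            outliers.append(i)
            # Replace the outlier with its prediction so that it does not
            # distort the predictions for the following points.
            cleaned.append(predicted)
        else:
            cleaned.append(series[i])
    return outliers


series = [0, 1, 2, 3, 4, 5, 6, 2, 8, 9, 10]
print(detect_contextual_outliers(series, threshold=3))          # [7]
print(detect_contextual_outliers(sorted(series), threshold=3))  # []
```

The value 2 at index 7 lies well within the overall range of the series, so a threshold on the value alone would not catch it. It is flagged only because it breaks the local trend, and once the ordering is lost (here, by sorting) it is no longer detected.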

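Evaluation interval of two or more points → change point detection

In contrast, taking an evaluation interval of some length gives the task called "change point detection". In this case, the evaluation interval is judged anomalous or not as a single "chunk".

Assuming that the reference interval and the evaluation interval are adjacent, if the evaluation interval is judged anomalous, you can say that some kind of "change" occurred between the two intervals, and the boundary between them is the "change point".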

f:id:t-fukunari:20210408174104p:plain

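Like outlier detection, change point detection can be divided into cases that depend on the time series and cases that do not.

For the time-independent case, the simplest approach is to compute summary statistics of the reference interval and the evaluation interval and compare them. So far I have talked about building a model only on the normal interval, but if the evaluation interval also contains enough data, you can build a model on it as well. In terms of distributions, the reference interval and the evaluation interval each get their own distribution. You can then compare the distributions themselves, or use a method called "density ratio estimation", which estimates the ratio of the distributions directly, to check for anomalies.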

f:id:t-fukunari:20210408174952p:plain
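
The simplest variant, comparing summary statistics of the two intervals, might look like the following sketch. It scores each boundary by the difference between the evaluation-interval mean and the reference-interval mean, measured in units of the reference standard deviation. The function name change_scores, the equal window sizes, and the choice of score are assumptions for illustration, not a specific published method.

```python
import statistics


def change_scores(series, window):
    """For each boundary, score how far the evaluation-interval mean is from
    the reference-interval mean, in units of the reference standard deviation."""
    scores = {}
    for boundary in range(window, len(series) - window + 1):
        reference = series[boundary - window:boundary]
        evaluation = series[boundary:boundary + window]
        ref_std = statistics.pstdev(reference)
        if ref_std == 0:
            continue  # the score is undefined for a flat reference interval
        diff = statistics.fmean(evaluation) - statistics.fmean(reference)
        scores[boundary] = abs(diff) / ref_std
    return scores


series = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 5.1, 4.9, 5.0, 5.2, 4.8, 5.0]
scores = change_scores(series, window=4)
change_point = max(scores, key=scores.get)
print(change_point)  # 6
```

The score peaks at the boundary where the reference interval contains only the old level and the evaluation interval contains only the new level, which is index 6 in this series. At later boundaries the reference interval is a mix of both levels, so its standard deviation grows and the score drops.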

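For the time-dependent case, building a prediction model is a possible approach, as with outlier detection. One example is to use the reconstruction error of an AutoEncoder.

Summary and next post

This has become long, so I'll stop here for now. The key points are summarized below.

In anomaly detection, unsupervised learning, which does not use explicit ground-truth labels for training, is the mainstream approach.
The basic idea is to set up two intervals in the time series, model them, and slide the intervals along the series.
Depending on the length of the evaluation interval, the task is broadly either outlier detection or change point detection.
- Both can be further divided by whether or not they depend on the time series.

In the next post I'll describe more concrete approaches to outlier detection and change point detection.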

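Closing

ExaWizards is looking for talented engineers, "wizards" who will work with us to solve social issues. If you are interested, please apply.
Careers | ExaWizards Inc.

The ExaWizards Engineer Blog publishes technical articles on AI and other topics. Please follow us!
We'd also love for you to follow us on LinkedIn!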