Using Airflow to Control Workflows

Posted by Tim Lin on 2019-04-05

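When I first heard about Airflow, my first reaction was: don't we already have Jenkins??

After looking into it properly, Jenkins and Airflow turn out to be quite different. Jenkins does have workflow features now, but the two products were defined differently from the start.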

Airflow is a platform to programmatically author, schedule and monitor data pipelines.

Jenkins is a self-contained, open source automation server which can be used to automate all sorts of tasks related to building, testing, and delivering or deploying software.

Apache Airflow is not a DevOps tool. It is a workflow orchestration tool
primarily designed for managing “ETL” jobs in Hadoop environments. It
basically will execute commands on the specified platform and also
orchestrate data movement. It was never designed to do anything remotely
similar to Jenkins or Gitlab.

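My own conclusion:
Airflow mainly handles workflows, and Jenkins mainly handles CI/CD. Each has its strengths and weaknesses, so I see quite a lot of people using the two together.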

Jenkins + Airflow

Jenkins covers the CI/CD needs,
and Airflow is used to compose Python code dynamically to build dynamic workflows.
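
To make "composing a workflow dynamically in Python" a bit more concrete, here is a minimal sketch that uses only the standard library (`graphlib`, Python 3.9+) and not Airflow's own API. The task names and the `build_workflow` / `execution_order` helpers are hypothetical, made up for illustration; in Airflow the same idea would be expressed with a DAG object and operators.

```python
from graphlib import TopologicalSorter


def build_workflow(tables):
    """Build a task graph: one extract and one load task per table, then a report.

    Returns a dict mapping each task name to the set of tasks it depends on.
    """
    graph = {}
    for table in tables:
        extract = f"extract_{table}"
        load = f"load_{table}"
        graph[extract] = set()      # extract tasks have no dependencies
        graph[load] = {extract}     # each load waits for its own extract
    # The final report waits for every load task.
    graph["report"] = {f"load_{t}" for t in tables}
    return graph


def execution_order(graph):
    """Return one valid order in which the tasks can run."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    workflow = build_workflow(["users", "orders"])
    for task in execution_order(workflow):
        print("run", task)
```

Because the graph is built by a loop, adding a table to the list adds its tasks to the workflow without touching the rest of the code, which is roughly what "dynamic workflow" means here.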

Installing Airflow

Since I already have Anaconda installed, I just used the prebuilt package XD

https://anaconda.org/search?q=airflow

After installing, check it with conda list
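
Besides `conda list`, the install can also be checked from Python itself. A minimal sketch, assuming it is run with the interpreter of the same conda environment; `is_installed` is a hypothetical helper name, not part of Airflow or conda:

```python
import importlib.util


def is_installed(package_name):
    """Return True if the top-level package can be found by the current interpreter."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    print("airflow installed:", is_installed("airflow"))
```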

Installing PostgreSQL

I decided to use PostgreSQL as Airflow's database

sudo apt-get install postgresql postgresql-contrib

How to start it

Once everything above is installed, run the following steps

# initialize the database
airflow initdb


# start the web server on port 8083 (the default port is 8080)
airflow webserver -p 8083

# start the scheduler
airflow scheduler
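
To confirm the web server really came up, one option is to test whether the port accepts TCP connections. This is a minimal sketch: `is_port_open` is a hypothetical helper, and the host `localhost` is an assumption; port 8083 comes from the command above.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("webserver reachable:", is_port_open("localhost", 8083))
```

This only shows that something is listening on the port; opening the page in a browser is still the real check.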

PostgreSQL connection settings
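
As a starting point, Airflow reads its metadata database location from a SQLAlchemy URL. In the Airflow 1.x line (the one with `airflow initdb`) this is the `sql_alchemy_conn` setting under `[core]` in airflow.cfg. The sketch below only builds such a URL. The user, password, host, port and database name are placeholders rather than settings from this post, `build_sql_alchemy_conn` is a hypothetical helper, and the `psycopg2` driver is an assumption.

```python
from urllib.parse import quote_plus


def build_sql_alchemy_conn(user, password, host, port, database):
    """Build a PostgreSQL SQLAlchemy URL, escaping special characters in the credentials."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    print(build_sql_alchemy_conn("airflow", "p@ss", "localhost", 5432, "airflow"))
```

Escaping the password matters because characters such as `@` would otherwise be read as part of the URL structure.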

To be continued…

Reference