
Running Airflow on an EC2 Instance

Apache Airflow is an open-source platform that helps you programmatically author, schedule, and monitor workflows. Running Airflow on an EC2 instance gives you a scalable and flexible environment for managing your data pipelines. In this blog, we’ll walk you through setting up Airflow on an EC2 instance step by step. We’ll cover everything from launching the EC2 instance to installing Airflow, configuring the environment, and running your first workflow.


Apoorva | Dec 1, 2024 | 10 mins


Step 1 : Launch and Connect EC2 Instance 

1 - Search for EC2 on your AWS dashboard, and open it. Then, click on “Launch Instance”. Here we start creating the instance needed for Airflow. 

Launch_Instance (1).png

2 - Next, give the new instance a name; we’re calling it small-airflow for now.

small_airflow.png

3 - Airflow needs a database to run, and we will use PostgreSQL. PostgreSQL needs at least 2 GB of RAM, so we cannot use “t2.micro”, which only has 1 GB. We’ll choose the next option up, “t2.small”. It is not free, but it is quite cheap. For the operating system, we’ll choose “Ubuntu Server 22.04”.

quick_start.png
t2.small.png

4 - You will have to create a key pair if you haven’t already. Here, we have created a key pair as you can see below: 

for_airflow.png

5 - And that’s all! Your EC2 instance is ready once it starts. You can connect to it by going to the EC2 Instances page, selecting your instance, and clicking “Connect”.

EC2_Instance (1).png
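If you prefer connecting over SSH instead of the browser-based “Connect” option, a typical session looks like the sketch below. The key file name (for_airflow.pem) and the public IP are placeholders, not values from this walkthrough; substitute the key pair you created and your instance’s address.

chmod 400 for_airflow.pem                    # SSH refuses keys with open permissions
ssh -i for_airflow.pem ubuntu@<public-ipv4>  # Ubuntu AMIs log in as the "ubuntu" user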

Step 2 : Configure the Instance to run Airflow

1 - We will update the packages already installed on EC2 by using the command : “sudo apt update”.

sudo_apt_update.png

2 - Next, we need to install “pip”, which is used to install Python packages, including Airflow. We will use the command : “sudo apt install python3-pip”.

sudo_apt_update (1).png

3 - To initialize Airflow for the first time, we need SQLite, so we use this command to install SQLite : “sudo apt install sqlite3”.

sudo_apt_install_sqlite3.png
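Putting the three commands from this step together, the setup on a fresh Ubuntu 22.04 instance looks roughly like this (the -y flag just skips the confirmation prompts):

sudo apt update                  # refresh the package index
sudo apt install -y python3-pip  # pip, used to install Python packages such as Airflow
sudo apt install -y sqlite3      # SQLite, used by Airflow's first initialization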

Step 3 : Make a Python Virtual Environment

1 - Now we will set up a Python virtual environment so that Airflow’s packages don’t conflict with the packages installed on your system. First, install the venv module with the command : “sudo apt-get install python3.10-venv”.

apt-get_install_python3.10-venv.png

2 - We will also install the “libpq-dev” library (the PostgreSQL client development headers), which is sometimes needed when installing Airflow’s Postgres support. We’ll use the command “sudo apt-get install libpq-dev”.

sudo_apt-get_install_libpq-dev.png

Step 4 : Install Apache Airflow

1 - First we’ll create and activate our virtual environment using the commands : “python3 -m venv venv” and “source venv/bin/activate”.

virtual_environment.png
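As a quick recap, creating and activating the environment (here named venv, inside the ubuntu user’s home directory) looks like this; keep it active for the rest of the installation:

python3 -m venv venv      # create the environment in ./venv
source venv/bin/activate  # activate it; the shell prompt now shows (venv)
# running "deactivate" later leaves the environment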

2 - Using the command given below, we can install a specific version of Airflow together with its required dependencies. Some dependencies might not work with the latest or certain versions of others, so the constraint file pins all the required ones at once.

apache-airflow-postgres-__2.5.0.png

3 - Command : pip install "apache-airflow[postgres]==2.5.0" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.5.0/constraints-3.7.txt"
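Note that the constraints file should match the Python version on the instance (Ubuntu 22.04 ships Python 3.10, while the command above points at the 3.7 constraints). A version-agnostic sketch of the same install, following the pattern from the official Airflow docs:

AIRFLOW_VERSION=2.5.0
PYTHON_VERSION="$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
pip install "apache-airflow[postgres]==${AIRFLOW_VERSION}" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"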

Then initialize Airflow for the first time with the command : “airflow db init”.

airflow_db_init.png

Step 5 : Install PostgreSQL and connect it to Airflow

1 - By default, Airflow uses SQLite as its database, and we will replace it with PostgreSQL. Postgres allows multiple tasks to run at the same time, whereas SQLite only supports one task at a time.

And this is the command to install Postgres: “sudo apt-get install postgresql postgresql-contrib”

Then switch to the postgres system user : “sudo -i -u postgres”

sudo_apt-get_install_postgresql_postgresql-contrib.png

2 - Next, we will connect to the PostgreSQL server with psql, create an airflow database and an airflow user with a password, and grant that user privileges on the database. Start the client with the command : “psql”

PostGreSQL_database.png

Airflow needs to store data in the database, and that’s why we’re connected to it. 

3 - After connecting we run three commands :

“CREATE DATABASE airflow;
CREATE USER airflow WITH PASSWORD 'airflow';
GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow;”

These create the database, create the user, and grant the user full privileges on that database.

CREATE_DATABASE_airflow.png
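Taken together, the database setup from this step looks like the following session; the database name, user, and password are all “airflow”, matching what the Airflow connection string will use later:

sudo -i -u postgres   # switch to the postgres system user
psql                  # open the PostgreSQL prompt
CREATE DATABASE airflow;                              -- Airflow's metadata database
CREATE USER airflow WITH PASSWORD 'airflow';          -- the user Airflow connects as
GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow;  -- full rights on that database
\q
exit                  # back to the ubuntu user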

4 - Now, quit psql with “\q” and type “exit” to leave the postgres user and return to the ubuntu user.

exit.png

Step 6 : Configure and Run Airflow 

1 - Next, we need to find the Airflow configuration file, which will be in the airflow folder. In the airflow.cfg file there is a default connection string, which points to SQLite. We will replace that connection string with a Postgres connection string that uses the airflow user, the airflow password, localhost, and the airflow database we configured earlier in psql. Use these commands for the same : “cd airflow” and “sed -i 's#sqlite:////home/ubuntu/airflow/airflow.db#postgresql+psycopg2://airflow:airflow@localhost/airflow#g' airflow.cfg”.

sed_-i.png
airflow.cfg.png
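After running sed, the relevant line in airflow.cfg should look like the snippet below (in Airflow 2.5 the setting lives under the [database] section); it is worth opening the file and confirming the change before moving on:

[database]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow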

2 - You may also want to change the executor from SequentialExecutor to LocalExecutor so that Airflow can run multiple tasks in parallel. Use this command for the same : “sed -i 's#SequentialExecutor#LocalExecutor#g' airflow.cfg”.

SequentialExecutor.png
SequentialExecutor-LocalExecutor.png

3 - Airflow also ships with a set of example DAGs, which we may or may not want to show up in the UI. If you want to hide them, you can run this command : “sed -i 's/load_examples = True/load_examples = False/g' airflow.cfg”.

True_load_examples.png

4 - We need to initialize Airflow once more to reload the config file using the command : “airflow db init”.

airflow_db_init-1.png

5 - Finally, we shall create a user with username airflow, first and last name airflow, and the Admin role. The email address will be airflow@gmail.com.

Command : “airflow users create -u airflow -f airflow -l airflow -r Admin -e airflow@gmail.com”

airflow_users.png

6 - Then, a prompt will ask for the password, which we will set as “airflow”, and we are now done initializing Airflow.

initializing_Airflow.png

7 - But we still need to open the port that Airflow’s web server listens on. Go back to the AWS Instances page, select your instance, navigate to the Security tab, click on the security group, go to inbound rules, and then click “Edit inbound rules”.

airflow_works-1.png
instances.png
edit_inbound_rules.png

Add 8080 as the port number, select the source as “Anywhere-IPv4”, and then click “Save rules”. (For anything beyond a quick test, consider restricting the source to your own IP.)

Save_Rules.png

8 - Go back to the terminal of the same instance and type “airflow webserver &”. The ampersand (&) runs it in the background (a combined sketch for backgrounding both processes follows step 9).

Airflow_web_server.png

9 - Then type “airflow scheduler” to start the scheduler.

airflow_scheduler.png
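Putting steps 8 and 9 together, one way to keep both processes running in the background and capture their output is sketched below; the log file names are just examples:

airflow webserver --port 8080 > webserver.log 2>&1 &  # web UI on port 8080
airflow scheduler > scheduler.log 2>&1 &              # schedules and triggers DAG runs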

10 - Now, go back to the Instances page, copy the public IPv4 address, and open it in a new tab with “:8080” added after it to reach the Airflow web server.

AirFlow_works.png

11 - Type “airflow” as the username and “airflow” as the password, and voila! You have Airflow running on your EC2 instance.

Airflow_as_username_and_password.png
Running_Airflow_on_an_EC2_Instance.png
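With the UI up, you can try your first workflow. Below is a minimal sketch of a DAG; it assumes the default DAGs folder (~/airflow/dags), and the file name and DAG id (hello_ec2) are arbitrary. Create it from the instance’s terminal, give the scheduler a moment to pick it up, and then trigger it from the UI:

mkdir -p ~/airflow/dags
cat > ~/airflow/dags/hello_ec2.py <<'EOF'
# Minimal example DAG: one Bash task that prints a message.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_ec2",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # run only when triggered manually
    catchup=False,
) as dag:
    BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello from Airflow on EC2'",
    )
EOF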

Conclusion

Congratulations! You’ve successfully set up Apache Airflow on an EC2 instance. You can now start building and managing your workflows. This setup provides a scalable solution for running your data pipelines, leveraging the power of AWS EC2. If you have any questions or run into any issues, feel free to connect with us.

Happy Airflowing!