Deploying Snowplow 2 - Infrastructure
29 Sep 2024
This is part 2 of our post explaining how to deploy Snowplow on GCP and GKE. In part 1 we gave a high-level overview of the architecture. In this part we will use Terraform to create the required infrastructure in GCP. You probably want to add the following definitions in a folder called snowplow inside your modules folder and then reference the module from your root main.tf (a sketch of the module call follows the list below).
We go over these resources:
- Variables
- Google Cloud Storage
- PostgreSQL
- BigQuery Dataset
- Networking and IP Address
- PubSub topics and subscriptions
- PubSub to GCS Permissions
- Service Account and Permissions
- Wrap Up
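Before diving into the individual files, here is a minimal sketch of what the module call from the root configuration could look like; all the values below are placeholders, so replace them with your own project, region, and database settings:
# main.tf (root module) -- minimal sketch, values are placeholders
module "snowplow" {
  source = "./modules/snowplow"

  project_id         = "my-gcp-project"
  region             = "us-central1"
  igludb_instance_id = "dev-postgres"       # name of the Cloud SQL instance
  igludb_db_name     = "iglu"
  igludb_username    = "iglu"
  igludb_password    = var.igludb_password  # passed through from a root-level variable
}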
Variables
Let’s start with the set of variables that our infrastructure is going to need:
# modules/snowplow/variables.tf
variable "project_id" {
  description = "The ID of the GCP project"
  type        = string
}

variable "region" {
  description = "The region for GCP resources"
  type        = string
}

variable "igludb_instance_id" {
  description = "The ID of the Cloud SQL PostgreSQL instance for Iglu"
  type        = string
}

variable "igludb_db_name" {
  description = "Database name for Iglu"
  type        = string
}

variable "igludb_username" {
  description = "The username for the Iglu database"
  type        = string
}

variable "igludb_password" {
  description = "The password for the Iglu database user"
  type        = string
  sensitive   = true # Mark as sensitive to protect the password
}
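Since igludb_password is marked as sensitive, it is best not to hard-code it or commit it to a tfvars file. One option (our suggestion, not a requirement) is to declare a matching variable in the root module and supply it through the TF_VAR_igludb_password environment variable when running Terraform:
# main.tf (root module) -- hypothetical pass-through variable for the password
variable "igludb_password" {
  description = "The password for the Iglu database user"
  type        = string
  sensitive   = true
}
# Supply it at runtime by exporting TF_VAR_igludb_password before running terraform plan/apply.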
Google Cloud Storage
Next we need three storage buckets: a dead-letter bucket for the BigQuery loader, a bucket for the bad-1 stream, and a bucket for BigQuery bad rows:
# modules/snowplow/storage.tf
resource "google_storage_bucket" "bq_loader_dead_letter_bucket" {
  name                        = "spangle-snowplow-bq-loader-dead-letter"
  location                    = var.region
  storage_class               = "STANDARD"
  force_destroy               = true
  uniform_bucket_level_access = true

  versioning {
    enabled = false
  }
}

resource "google_storage_bucket" "bad_1_bucket" {
  name                        = "spangle-snowplow-bad-1"
  location                    = var.region
  storage_class               = "STANDARD"
  force_destroy               = true
  uniform_bucket_level_access = true

  versioning {
    enabled = false
  }
}

resource "google_storage_bucket" "bq_bad_rows_bucket" {
  name                        = "spangle-snowplow-bq-bad-rows"
  location                    = var.region
  storage_class               = "STANDARD"
  force_destroy               = true
  uniform_bucket_level_access = true

  versioning {
    enabled = false
  }
}
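Bad rows and dead-letter objects can pile up over time. As an optional tweak (our own addition, not part of the standard Snowplow setup), you could add a lifecycle rule to each bucket so that old objects are deleted automatically, for example after 30 days:
# Optional: add inside a google_storage_bucket block to expire old objects
lifecycle_rule {
  condition {
    age = 30 # days
  }
  action {
    type = "Delete"
  }
}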
PostgreSQL
If you already have a Cloud SQL PostgreSQL instance, you can skip the next step; otherwise we need to create one. Here is an example setup:
# modules/snowplow/sql_database.tf
resource "google_sql_database_instance" "dev_postgres_instance" {
  name             = "dev-postgres"
  database_version = "POSTGRES_16"
  region           = var.region

  settings {
    tier                        = "db-custom-2-8192"
    deletion_protection_enabled = true

    backup_configuration {
      enabled                        = true
      location                       = "us"
      point_in_time_recovery_enabled = true
    }

    maintenance_window {
      update_track = "canary"
    }
  }
}
Then we need to create a database for the Iglu server on the PostgreSQL instance created above:
# modules/snowplow/sql_database.tf
# Create a database in the existing Cloud SQL PostgreSQL instance
resource "google_sql_database" "iglu_db" {
  name     = var.igludb_db_name     # Replace with your desired database name
  instance = var.igludb_instance_id # Reference the existing Cloud SQL instance
  project  = var.project_id         # Reference the project ID
}

# Create a user for the database with a password
resource "google_sql_user" "iglu_postgres_user" {
  name     = var.igludb_username    # Use the username variable
  password = var.igludb_password    # Use the password variable
  instance = var.igludb_instance_id # Reference the existing Cloud SQL instance
  project  = var.project_id         # Reference the project ID
}
If you are creating the instance with the definition above, make sure you also add a dependency declaration to the database and user definitions. Something like:
depends_on = [
  google_sql_database_instance.dev_postgres_instance
]
to both google_sql_database.iglu_db and google_sql_user.iglu_postgres_user above.
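For clarity, here is what the database resource looks like with the dependency in place (the user resource gets the same depends_on block):
# modules/snowplow/sql_database.tf -- database resource with explicit dependency
resource "google_sql_database" "iglu_db" {
  name     = var.igludb_db_name
  instance = var.igludb_instance_id
  project  = var.project_id

  depends_on = [
    google_sql_database_instance.dev_postgres_instance
  ]
}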
BigQuery Dataset
Next, let’s create the BigQuery dataset. We call the dataset snowplow; you can of course change that to something else, but then make sure you update the service configuration files, which we will go over in part 3, accordingly.
# modules/snowplow/bigquery.tf
# Create a BigQuery dataset with the specified location and description
resource "google_bigquery_dataset" "snowplow_event_dataset" {
  dataset_id  = "snowplow"
  description = "Snowplow event dataset"
  location    = "US"
  project     = var.project_id
}
Networking and IP Address
Next, let’s create the static IP address that is used as the tracker endpoint:
# modules/snowplow/network.tf
# Create a global static IP address
resource "google_compute_global_address" "snowplow_ingress_ip" {
  name       = "snowplow-ingress-ip"
  ip_version = "IPV4"
}
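You will eventually point a DNS record for your collector endpoint at this address, so it can be handy to expose it as a Terraform output (an optional addition on our part):
# modules/snowplow/network.tf
output "snowplow_ingress_ip_address" {
  description = "Static IP address used as the Snowplow tracker endpoint"
  value       = google_compute_global_address.snowplow_ingress_ip.address
}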
PubSub topics and subscriptions
We also need to create the six PubSub topics and their related subscriptions that we mentioned in Part 1.
# modules/snowplow/pubsub.tf
locals {
  pubsub_topics = [
    # Tuple format: ("topic_name", subscription_needed)
    ["snowplow-bad-1", false],
    ["snowplow-bq-bad-rows", false],
    ["snowplow-bq-loader-server-failed-inserts", true],
    ["snowplow-bq-loader-server-types", true],
    ["snowplow-enriched", true],
    ["snowplow-raw", true]
  ]
}

resource "google_pubsub_topic" "pubsub_topics" {
  for_each = { for topic in local.pubsub_topics : topic[0] => topic }
  name     = each.value[0]
}

resource "google_pubsub_subscription" "pubsub_subscriptions" {
  for_each = { for topic in local.pubsub_topics : topic[0] => topic if topic[1] }
  name     = each.value[0]
  # Reference the topic resource (not just its name) so Terraform creates the topic first
  topic    = google_pubsub_topic.pubsub_topics[each.value[0]].id

  expiration_policy {
    ttl = "604800s" # Set TTL to 7 days (604800 seconds)
  }
}
resource "google_pubsub_subscription" "bad_1_subscription" {
name = "snowplow-bad-1-gcs"
topic = google_pubsub_topic.pubsub_topics["snowplow-bad-1"].id
cloud_storage_config {
bucket = google_storage_bucket.bad_1_bucket.name
filename_prefix = ""
filename_suffix = "-${var.region}"
filename_datetime_format = "YYYY-MM-DD/hh_mm_ssZ"
# max_bytes = 1000000
# max_duration = "300s"
# max_messages = 1000
}
depends_on = [
google_storage_bucket.bad_1_bucket,
google_storage_bucket_iam_member.admin,
]
}
resource "google_pubsub_subscription" "bq_bad_rows_subscription" {
name = "snowplow-bq-bad-rows-gcs"
topic = google_pubsub_topic.pubsub_topics["snowplow-bq-bad-rows"].id
cloud_storage_config {
bucket = google_storage_bucket.bq_bad_rows_bucket.name
filename_prefix = ""
filename_suffix = "-${var.region}"
filename_datetime_format = "YYYY-MM-DD/hh_mm_ssZ"
# max_bytes = 1000000
# max_duration = "300s"
# max_messages = 1000
}
depends_on = [
google_storage_bucket.bq_bad_rows_bucket,
google_storage_bucket_iam_member.admin,
]
}
PubSub to GCS Permissions
Somewhat confusingly, we have to explicitly allow the PubSub service agent to write to the GCS buckets we created above. In other words:
# modules/snowplow/iam.tf
data "google_project" "project" {
}

locals {
  bucket_names = [
    google_storage_bucket.bad_1_bucket.name,
    google_storage_bucket.bq_bad_rows_bucket.name,
  ]
}

resource "google_storage_bucket_iam_member" "admin" {
  for_each = toset(local.bucket_names)
  bucket   = each.value
  role     = "roles/storage.admin"
  member   = "serviceAccount:service-${data.google_project.project.number}@gcp-sa-pubsub.iam.gserviceaccount.com"
}
Service Account and Permissions
Next we create a service account that will be used by all the Snowplow services, and we give it all the permissions these services need. Note that we are using one shared service account purely for simplicity. Ideally you would create a separate service account per service and grant each one only the minimum permissions it needs.
Probably the most non-trivial part of this declaration is the last block, workload_identity_user_binding, where we allow the GKE service account snowplow/snowplow-service-account (i.e. the service account snowplow-service-account in the Kubernetes namespace snowplow) to bind to the snowplow service account in GCP. In other words, we give our Kubernetes services the ability to use the GKE service account snowplow/snowplow-service-account, which can then act as the snowplow service account we create below in GCP. More on that in Part 3.
# modules/snowplow/service_accounts.tf
# Combined Service Account: snowplow
resource "google_service_account" "snowplow" {
  account_id   = "snowplow"
  display_name = "snowplow"
  description  = "Combined service account with all permissions for Snowplow"
}

# List of roles to be assigned to the service account
locals {
  roles = [
    "roles/bigquery.dataEditor",
    "roles/logging.logWriter",
    "roles/pubsub.publisher",
    "roles/pubsub.subscriber",
    "roles/pubsub.viewer",
    "roles/storage.objectViewer"
  ]
}

# Assign all roles to the service account in a loop
resource "google_project_iam_member" "combined_iam_roles" {
  for_each = toset(local.roles)
  project  = var.project_id
  role     = each.value
  member   = "serviceAccount:${google_service_account.snowplow.email}"
}

# Add IAM policy binding to the Google Cloud Storage bucket
resource "google_storage_bucket_iam_member" "bq_loader_dead_letter_bucket_binding" {
  bucket = google_storage_bucket.bq_loader_dead_letter_bucket.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.snowplow.email}"
}

# Add Workload Identity User binding
resource "google_service_account_iam_binding" "workload_identity_user_binding" {
  service_account_id = google_service_account.snowplow.name
  role               = "roles/iam.workloadIdentityUser"
  members = [
    "serviceAccount:${var.project_id}.svc.id.goog[snowplow/snowplow-service-account]"
  ]
}
Wrap Up
Nice! By now we should have eight Terraform files (plus a README) in the modules/snowplow folder. Feel free to run terraform plan and terraform apply.
modules/snowplow
├── README.md
├── bigquery.tf
├── iam.tf
├── network.tf
├── pubsub.tf
├── service_accounts.tf
├── sql_database.tf
├── storage.tf
└── variables.tf
In Part 3 we will go over the Kubernetes cluster deployment.