Snowplow on AWS Fargate - IAM Permissions

February 12, 2020


This is part four of a blog post series about Snowplow on AWS Fargate.


Goal

This post will outline the minimum IAM permissions required to run each Snowplow component on AWS Fargate.

Investigation

I spent quite some time investigating this before writing this post. Questions about the required permissions have come up a number of times, but I was unable to find a definitive answer.

The following are related topics/questions I found along the way:

The following are error messages you are likely to see when encountering permission issues:

The best information I could find previously was “You need to give it create, read, and write permissions to Dynamo” and “I just gave it full access”. Hopefully this post can do a more thorough job!

Scala Stream Collector

Collectors in Snowplow are the top of the data funnel. Raw data from the trackers is received over HTTPS and, in the case of streaming deployments, is then written in Thrift to AWS Kinesis data streams. Valid data is written to a “good” stream and invalid data is written to a “bad” stream. I highly recommend checking out the official technical docs for more details.

In terms of AWS IAM permissions, this one is relatively straightforward.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:PutRecord"
      ],
      "Resource": [
        "${collector_stream_out_good}",
        "${collector_stream_out_bad}"
      ]
    }
  ]
}
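
Whichever component you're deploying, the policy gets attached to the IAM role that the Fargate task runs as (the task role), and that role has to be assumable by the ECS tasks service. The trust (assume role) policy on the task role will look something like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}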

Scala Stream Enricher

Enrichers in Snowplow are second in the data funnel and similar to collectors. Raw data is consumed from AWS Kinesis data streams, modified by all configured “enrichers”, and then written back out to an AWS Kinesis data stream. Valid data is written to a “good” stream and invalid data is written to a “bad” stream. I highly recommend checking out the official technical docs for more details about the Scala Stream Enricher.

However, since it uses the AWS Kinesis Client Library (KCL), the IAM permissions required are more nuanced.

Per the AWS docs, the KCL uses an AWS DynamoDB table to manage the state of the consumer (leases, checkpoints, etc.). The table is typically named after the application name configured for the consumer, and that is the table the ${enricher_state_table} ARN below should point to. This means our Snowplow enricher also needs all of the permissions necessary to manage this state table.

The KCL also emits metrics to CloudWatch, so we’ll need permissions for that as well.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:GetShardIterator",
        "kinesis:GetRecords",
        "kinesis:ListShards"
      ],
      "Resource": [
        "${collector_stream_out_good}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
          "kinesis:ListStreams"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:PutRecord",
        "kinesis:PutRecords"
      ],
      "Resource": [
        "${enricher_stream_out_good}",
        "${enricher_stream_out_bad}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:CreateTable",
        "dynamodb:DescribeTable",
        "dynamodb:Scan",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": [
        "${enricher_state_table}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*"
    }
  ]
}

Note: This example does not include a stream for PII data. If you want that, you’ll need to include it in the third statement, where write permissions to the Kinesis data streams are defined.
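
For example, assuming a ${enricher_stream_out_pii} placeholder for the PII stream (the name is just for illustration), that statement becomes:

{
  "Effect": "Allow",
  "Action": [
    "kinesis:DescribeStream",
    "kinesis:PutRecord",
    "kinesis:PutRecords"
  ],
  "Resource": [
    "${enricher_stream_out_good}",
    "${enricher_stream_out_bad}",
    "${enricher_stream_out_pii}"
  ]
}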

S3 Loader

Loaders in Snowplow are third in the data funnel and similar to enrichers. Data is consumed from an AWS Kinesis data stream (in this setup, the enricher’s “good” stream) and written to a sink, in this case S3. If records cannot be written to the sink, they’re written to a “bad” stream. I highly recommend checking out the official technical docs for more details.

As with the enricher, the AWS Kinesis Client Library (KCL) is used here, so we also need permissions for the DynamoDB state table and CloudWatch.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:GetShardIterator",
        "kinesis:GetRecords",
        "kinesis:ListShards"
      ],
      "Resource": [
        "${enricher_stream_out_good}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
          "kinesis:ListStreams"
      ],
      "Resource": [
          "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:PutRecord",
        "kinesis:PutRecords"
      ],
      "Resource": [
        "${loader_stream_out_bad}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:CreateTable",
        "dynamodb:DescribeTable",
        "dynamodb:Scan",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": [
        "${s3_loader_state_table}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::${bucket_id}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::${bucket_id}/*"
      ]
    }
  ]
}

Elasticsearch Loader

The ES loader and the S3 loader are nearly identical, except we trade the S3 bucket sink for an Elasticsearch cluster. I highly recommend checking out the official technical docs for more details about the ES loader.

As with the enricher and the S3 loader, the AWS Kinesis Client Library (KCL) is used here, so we also need permissions for the DynamoDB state table and CloudWatch.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:GetShardIterator",
        "kinesis:GetRecords",
        "kinesis:ListShards"
      ],
      "Resource": [
        "${enricher_stream_out_good}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
          "kinesis:ListStreams"
      ],
      "Resource": [
          "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:PutRecord",
        "kinesis:PutRecords"
      ],
      "Resource": [
        "${loader_stream_out_bad}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:CreateTable",
        "dynamodb:DescribeTable",
        "dynamodb:Scan",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": [
        "${es_loader_state_table}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*"
    }
  ]
}

If Amazon Elasticsearch Service (AES) is being used and you’re running it inside a VPC, you’ll also need to configure the access policy on the domain. The ${role} principal here is the task role described above, i.e. the role that already has the loader’s IAM policy attached. Check out the official technical docs for more details about AES access policies.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "${role}"
        ]
      },
      "Action": [
        "es:ES*"
      ],
      "Resource": [
        "${domain}",
        "${domain}/*"
      ]
    }
  ]
}
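
Note that es:ES* is a broad grant. The loader only talks to the cluster over HTTP (bulk writes plus the occasional read), so you can likely scope the access policy down to the HTTP actions, though I haven’t verified the exact minimal set:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "${role}"
        ]
      },
      "Action": [
        "es:ESHttpGet",
        "es:ESHttpHead",
        "es:ESHttpPost",
        "es:ESHttpPut"
      ],
      "Resource": [
        "${domain}",
        "${domain}/*"
      ]
    }
  ]
}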

To be continued…

The next blog post in this series will discuss how EmrEtlRunner handles streaming data. Stay tuned!
