Snowplow on AWS Fargate - IAM Permissions


This is part four of a blog post series about Snowplow on AWS Fargate.


Goal

This post outlines the minimum IAM permissions required to run each Snowplow component on AWS Fargate.

Investigation

I spent quite some time investigating this before writing this post. Questions about it have been asked a number of times, but I was unable to find a definitive answer.

The best information I could find previously was “You need to give it create, read, and write permissions to Dynamo” and “I just gave it full access”. Hopefully this post can do a more thorough job!

Scala Stream Collector

Check out the official technical docs for more details about the scala stream collector.

Collectors in Snowplow sit at the top of the data funnel. Raw data from the trackers is received over HTTPS and, in the case of streaming deployments, is then written in Thrift to AWS Kinesis data streams. Valid data is written to a “good” stream and invalid data is written to a “bad” stream. In terms of AWS IAM permissions, this one is relatively straightforward.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:PutRecord"
      ],
      "Resource": [
        "${collector_stream_out_good}",
        "${collector_stream_out_bad}"
      ]
    }
  ]
}
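
The ${...} placeholders in the policy suggest it is rendered from a template, for example with Terraform. If that’s your setup, a minimal sketch of attaching the rendered document to the Fargate task role might look like this (the resource names, file path, and Kinesis stream resources are illustrative assumptions, not prescriptions):

resource "aws_iam_role" "collector_task" {
  name = "snowplow-collector-task"

  # Fargate tasks assume this role via the ECS tasks service principal
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "collector_kinesis" {
  name = "collector-kinesis-write"
  role = aws_iam_role.collector_task.id

  # Render the policy document above, substituting the stream ARNs
  policy = templatefile("${path.module}/policies/collector.json.tpl", {
    collector_stream_out_good = aws_kinesis_stream.good.arn
    collector_stream_out_bad  = aws_kinesis_stream.bad.arn
  })
}

The role’s ARN then goes into the task definition’s task_role_arn; the execution role, which only pulls images and ships logs, does not need any of these permissions.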

Scala Stream Enricher

Check out the official technical docs for more details about the scala stream enricher.

Enrichers in Snowplow are second in the data funnel and similar to collectors. Raw data is consumed from AWS Kinesis data streams, modified by all configured “enrichers”, and then written back out to an AWS Kinesis data stream. Valid data is written to a “good” stream and invalid data is written to a “bad” stream. However, since the enricher uses the AWS Kinesis Client Library (KCL), there is more nuance to the IAM permissions required.

Check out the official docs for more details on the DynamoDB state table.

KCL uses an AWS DynamoDB table to manage the state of the consumer (leases, checkpoints, etc.). This means that our Snowplow enricher also requires all of the permissions necessary to manage this state table.
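
By default, KCL names this table after the consumer’s application name (the appName in the enricher’s configuration), so the table ARN used in the policy below can be constructed up front even though KCL creates the table itself. A minimal Terraform sketch, where the region and app-name variables are assumptions:

data "aws_caller_identity" "current" {}

locals {
  # KCL creates and uses a DynamoDB table named after the consumer's appName,
  # so the ARN can be built before the table exists
  enricher_state_table = "arn:aws:dynamodb:${var.aws_region}:${data.aws_caller_identity.current.account_id}:table/${var.enricher_app_name}"
}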

KCL also emits metrics to CloudWatch, so we’ll need permissions for that as well.

Please note that this example does not include a stream for PII data. If you want that, you’ll need to add its ARN to the third statement, where write permissions to the Kinesis data streams are defined (see the sketch below).
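
If you do configure a PII stream, one way to express that extra resource is with a Terraform policy-document sketch like the following; the stream resource names here are illustrative assumptions, not part of the standard setup.

data "aws_iam_policy_document" "enricher_kinesis_write" {
  statement {
    effect = "Allow"
    actions = [
      "kinesis:DescribeStream",
      "kinesis:PutRecord",
      "kinesis:PutRecords"
    ]
    resources = [
      aws_kinesis_stream.enriched_good.arn,
      aws_kinesis_stream.enriched_bad.arn,
      # Add the PII stream alongside the good/bad streams
      aws_kinesis_stream.enriched_pii.arn
    ]
  }
}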

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:GetShardIterator",
        "kinesis:GetRecords",
        "kinesis:ListShards"
      ],
      "Resource": [
        "${collector_stream_out_good}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
          "kinesis:ListStreams"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:PutRecord",
        "kinesis:PutRecords"
      ],
      "Resource": [
        "${enricher_stream_out_good}",
        "${enricher_stream_out_bad}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:CreateTable",
        "dynamodb:DescribeTable",
        "dynamodb:Scan",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": [
        "${enricher_state_table}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*"
    }
  ]
}

S3 Loader

Check out the official technical docs for more details about the s3 loader.

Loaders in Snowplow are third in the data funnel and similar to enrichers. Data is consumed from AWS Kinesis data streams and written to a sink, in this case S3. If records cannot be written to the sink, they’re written to a “bad” stream. As with the enricher, the AWS Kinesis Client Library (KCL) is used.

Check out the official docs for more details on the DynamoDB state table.

KCL uses an AWS DynamoDB table to manage the state of the consumer (leases, checkpoints, etc.). This means that our Snowplow loader also requires all of the permissions necessary to manage this state table.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:GetShardIterator",
        "kinesis:GetRecords",
        "kinesis:ListShards"
      ],
      "Resource": [
        "${enricher_stream_out_good}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
          "kinesis:ListStreams"
      ],
      "Resource": [
          "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:PutRecord",
        "kinesis:PutRecords"
      ],
      "Resource": [
        "${loader_stream_out_bad}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:CreateTable",
        "dynamodb:DescribeTable",
        "dynamodb:Scan",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": [
        "${s3_loader_state_table}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::${bucket_id}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::${bucket_id}/*"
      ]
    }
  ]
}

Elasticsearch Loader

Check out the official technical docs for more details about the es loader.

The ES and S3 loaders are nearly identical, except we trade the S3 bucket sink for an Elasticsearch cluster.

Check out the official docs for more details on the DynamoDB state table.

KCL uses an AWS DynamoDB table to manage the state of the consumer (leases, checkpoints, etc.). This means that our Snowplow loader also requires all of the permissions necessary to manage this state table.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:GetShardIterator",
        "kinesis:GetRecords",
        "kinesis:ListShards"
      ],
      "Resource": [
        "${enricher_stream_out_good}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
          "kinesis:ListStreams"
      ],
      "Resource": [
          "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:PutRecord",
        "kinesis:PutRecords"
      ],
      "Resource": [
        "${loader_stream_out_bad}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:CreateTable",
        "dynamodb:DescribeTable",
        "dynamodb:Scan",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": [
        "${es_loader_state_table}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*"
    }
  ]
}

Check out the official technical docs for more details about AES access policies.

If Amazon Elasticsearch Service (AES) is being used and you’re running it inside a VPC, you’ll also need to make sure the access policy on the domain is configured. The ${role} referenced below is the task role that already has the IAM permissions defined above attached to it.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "${role}"
        ]
      },
      "Action": [
        "es:ES*"
      ],
      "Resource": [
        "${domain}",
        "${domain}/*"
      ]
    }
  ]
}
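
If the domain itself is managed with Terraform (an assumption, as are the resource names and file path below), the access policy can be supplied when the domain is created. Building the domain ARN by hand avoids the policy referencing its own resource:

locals {
  es_domain_arn = "arn:aws:es:${var.aws_region}:${var.account_id}:domain/snowplow"
}

resource "aws_elasticsearch_domain" "snowplow" {
  domain_name           = "snowplow"
  elasticsearch_version = "7.10"

  vpc_options {
    subnet_ids         = var.private_subnet_ids
    security_group_ids = [aws_security_group.es.id]
  }

  # Grant the ES loader's task role access to the domain
  access_policies = templatefile("${path.module}/policies/es-domain.json.tpl", {
    role   = aws_iam_role.es_loader_task.arn
    domain = local.es_domain_arn
  })
}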

To be continued…

The next blog post in this series will discuss how the EmrEtlRunner handles streaming data. Stay tuned!
