Your service is getting traction. But with traction come hackers. You don’t want to end up in the news because your users’ data leaked. In this post we explore the limits and potential of data-layer isolation with AWS.
What is Isolation?
Isolation means that a service gives each user the illusion of using the service by themselves. A critical type of isolation is data isolation: One user cannot see the data of another user, unless this is intended. Indeed, a user who only writes and reads their own data is not very useful. Instead, data becomes valuable when selectively shared. Unfortunately, as shown by GitHub’s Security Bug Bounty program, achieving data isolation is tricky: The faster features are pushed, the higher the chance that your team introduces a data isolation bug.
End-point vs. Data-Layer Isolation
Your service may perform isolation in several places, the two extremes being end-point isolation (the most common) and data-layer isolation.
End-point isolation validates a request as soon as it is received. For example, WordPress performs authorization in the application layer, i.e., in PHP code. WordPress then uses the same database username and password both for writing a page as a logged-in user and for viewing a page as a guest user. Hence, once a hacker has made it past the first line of defense, they can talk to the database server directly. Such a level of isolation may be sufficient for a website. However, it will certainly fall short when dealing with financial or health data.
Data-layer isolation validates a data access request as close as possible to the data storage. For example, the user submits a request with their token. Your service passes that token down the call chain until it reaches the data layer. Generally, the data layer consists of a database or an object storage. The data layer then validates the token and matches its claims against a policy. The policy tells what data (if any) to serve. For example, the policy may deny “PutObject” when a token claims “page read” scope. Data-layer isolation has the potential to reduce coding errors, since your team needs to review less code. Naturally, the policy itself becomes security-sensitive code. However, policies tend to change less often and your team may treat them with greater care, which reduces isolation bugs.
The Best of Both Worlds
Of course, a service may combine the two extremes. As an example, think of a multi-tenant SaaS application serving the users of several companies (i.e., tenants), such as G Suite. The app may use data-layer isolation between tenants, but end-point isolation between users. The former offers a safety net that keeps you out of the news: One tenant seeing the data of another tenant is a big deal. The latter implements all the fancy sharing features, hence your team will likely change it more often and is more likely to introduce bugs. However, if one user sees the data of another user within the same tenant, this is less critical and your team can fix the service quickly.
Since end-point isolation is common and well-understood, in the rest of the post we will focus on data-layer isolation.
Data-Layer Isolation on AWS
Let us now go through the basic building blocks to achieve data-layer isolation in AWS. I will assume some familiarity with the following AWS services: IAM, Lambda, S3 and DynamoDB. I will also assume that you have written policies before. Each of these building blocks spans several AWS documentation pages. Here, I only convey the minimum needed to understand data-layer isolation and the trade-offs involved.
Policies
The foundation of data-layer isolation is the AWS IAM policy. In essence, a policy is a list of statements. Each statement allows or denies a given action (e.g., “PutObject”) on a given AWS resource (e.g., “S3 bucket my-customers-images”). For DynamoDB, the policy may also include condition keys: Simply put, dynamodb:LeadingKeys restricts the “rows”, whereas dynamodb:Attributes restricts the “columns” that are visible when using the policy.
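To make the “rows” and “columns” restriction concrete, here is a minimal sketch of such a policy, built as a Python dictionary. It assumes a pooled table whose partition key is the tenant ID; the table ARN, tenant ID and attribute names are hypothetical.

```python
import json


def tenant_table_policy(table_arn: str, tenant_id: str) -> dict:
    """Allow reads on one tenant's items only (assumes partition key = tenant ID)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query"],
                "Resource": table_arn,
                "Condition": {
                    "ForAllValues:StringEquals": {
                        # "Rows": only items whose partition key equals the tenant ID.
                        "dynamodb:LeadingKeys": [tenant_id],
                        # "Columns": only these (hypothetical) attributes may be requested.
                        "dynamodb:Attributes": ["tenant_id", "album_id", "title"],
                    },
                    # Force callers to name attributes instead of selecting all of them.
                    "StringEqualsIfExists": {"dynamodb:Select": "SPECIFIC_ATTRIBUTES"},
                },
            }
        ],
    }


policy = tenant_table_policy(
    "arn:aws:dynamodb:eu-west-1:123456789012:table/albums",  # hypothetical ARN
    "tenant-42",  # hypothetical tenant
)
print(json.dumps(policy, indent=2))
```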
Since policies are limited in size to a few KB, and also to improve readability, there are several ways to condense them. First, you may use wildcards (e.g., “S3 objects customer-data/customer1-*”). Second, you may use global condition keys as policy variables; however, this is only useful if you isolate at user granularity.
This is a good moment to step back and think about your isolation strategy, and answer three main questions:
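As an illustration of both techniques, the sketch below combines a wildcard with a policy variable, so that one policy covers every user. It assumes that each user's objects live under a prefix equal to their Cognito identity ID; the bucket name is taken from the example above and the function name is hypothetical.

```python
import json

# Policy variable that IAM substitutes with the caller's Cognito identity ID.
COGNITO_SUB = "${cognito-identity.amazonaws.com:sub}"


def user_prefix_policy(bucket: str) -> dict:
    """One policy for all users: each user only reaches objects under their own prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{COGNITO_SUB}/*",
            },
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{COGNITO_SUB}/*"]}},
            },
        ],
    }


policy = user_prefix_policy("my-customers-images")
# The compact JSON length is what counts against the policy size limit.
print(len(json.dumps(policy, separators=(",", ":"))))
```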
- What granularity of isolation do you want to have in your data layer? Should you isolate per-user, per-groups of users or per-tenant?
- What AWS resources should your service isolate? The two extremes are the silo and the pooled isolation model. In the silo model, each tenant owns a DynamoDB table and an S3 bucket. In contrast, in the pooled model, tenants share DynamoDB tables and S3 buckets, but own row and object prefixes.
- What scopes are there in your service? For example, you may want to prevent users who can “read album” from uploading S3 objects.
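The answers to these questions largely determine what your policy statements look like. The following sketch shows how the silo and pooled models and a set of scopes could translate into S3 statements. The bucket naming scheme, the scope names and the mapping from scopes to actions are all assumptions made for illustration.

```python
# Hypothetical mapping from service scopes to the S3 actions they permit.
SCOPE_ACTIONS = {
    "album:read": ["s3:GetObject"],
    "album:write": ["s3:GetObject", "s3:PutObject"],
}


def s3_resource(tenant_id: str, model: str) -> str:
    """Return the S3 resource a tenant may touch under the given isolation model."""
    if model == "silo":
        # Silo: one bucket per tenant (hypothetical naming scheme).
        return f"arn:aws:s3:::photos-{tenant_id}/*"
    if model == "pooled":
        # Pooled: one shared bucket, one object prefix per tenant.
        return f"arn:aws:s3:::photos-shared/{tenant_id}/*"
    raise ValueError(f"unknown isolation model: {model}")


def statement_for(tenant_id: str, scope: str, model: str) -> dict:
    """Build a single Allow statement for one tenant and one scope."""
    return {
        "Effect": "Allow",
        "Action": SCOPE_ACTIONS[scope],
        "Resource": s3_resource(tenant_id, model),
    }


print(statement_for("acme", "album:read", "pooled"))
```

Note that a “read album” statement built this way does not contain s3:PutObject, so uploads fail at the data layer even if the end-point check is buggy.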
Roles
Policies by themselves are just text documents with no power. Your service needs to attach them to AWS IAM roles. If you are new to roles, think of them as templates for temporary (generally one-hour-long) AWS credentials. Your service may attach policies to roles in two ways. First, it can create managed policies, i.e., stand-alone policies listed in IAM. Second, your service can include a policy inline with the role. The former allows sharing policies among multiple roles, whereas the latter attaches a policy to a single role. Our advice is to use inline policies, so as to simplify tracking of AWS resources created at run-time.
Now comes the scary part: Your service will need to create AWS roles, e.g., when creating a new tenant. Simply creating a “super role that can create any role” will certainly raise red flags with your security team. Instead, you may use a permissions boundary, which is a managed policy stating the maximum access that any role created by your super role may have. This keeps your security team happy and your service flexible.
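A minimal sketch of this setup follows. It assumes that `iam` is a boto3 IAM client and that the created roles are meant to be assumed through a Cognito IdentityPool; the role naming scheme, the policy name and all ARNs are hypothetical.

```python
import json


def super_role_policy(boundary_arn: str, role_arn_pattern: str) -> dict:
    """Policy for the "super role": it may only create roles that carry the boundary."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["iam:CreateRole", "iam:PutRolePolicy"],
            "Resource": role_arn_pattern,
            "Condition": {"StringEquals": {"iam:PermissionsBoundary": boundary_arn}},
        }],
    }


def cognito_trust_policy(identity_pool_id: str) -> dict:
    """Trust policy letting authenticated identities of one IdentityPool assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": "cognito-identity.amazonaws.com"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {"cognito-identity.amazonaws.com:aud": identity_pool_id},
                "ForAnyValue:StringLike": {"cognito-identity.amazonaws.com:amr": "authenticated"},
            },
        }],
    }


def create_tenant_role(iam, stack_name, tenant_id, identity_pool_id, boundary_arn, inline_policy):
    """Create a tenant role with a permissions boundary and one inline policy."""
    # Hypothetical naming scheme: the stack name prefix simplifies later cleanup.
    role_name = f"{stack_name}-tenant-{tenant_id}"
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(cognito_trust_policy(identity_pool_id)),
        PermissionsBoundary=boundary_arn,
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="tenant-data-access",  # hypothetical policy name
        PolicyDocument=json.dumps(inline_policy),
    )
    return role["Role"]["Arn"]
```

With the condition on iam:PermissionsBoundary in place, a CreateRole call that omits the boundary is denied, so even a compromised “create tenant” lambda cannot mint a role with broader access than the boundary allows.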
Let us now explore how your code can assume these roles, whether running inside a lambda, EC2 instance or container.
Cognito IdentityPool
An IdentityPool is, in essence, an AWS resource that converts a token into a role. You can think of it as a secure function that takes a signed JWT token with a set of claims as input and returns temporary AWS credentials for the matching role. Note that the GetCredentialsForIdentity call is a public API. This means that your code needs no AWS credentials, only a valid token. In fact, we ran our lambdas with only the basic execution role, except, of course, for the create-tenant lambda, which had a dedicated “super role”. We will see how to generate suitable tokens in the next section. You may configure an IdentityPool to map claims inside a token to a role in one of several ways:
- IdentityPool maps unauthenticated users to role A and authenticated users to role B. As an example, take an Instagram-like photo sharing app. Code running for anonymous users can assume a role to get S3 objects, whereas the same code running for logged-in users can also put S3 objects. Already a decent level of data-layer isolation.
- You may configure IdentityPool with rules to resolve claims inside the token to a role. Mind the soft limit of 25 rules.
- You may include the cognito:roles claim inside the token and configure IdentityPool to do token-based role mapping. Note that your code may assume only one role out of the list of roles inside the token. Your code can decide which role to assume via the CustomRoleArn parameter.
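The exchange of a token for credentials can be sketched as follows. The sketch assumes that `cognito_identity` is a boto3 "cognito-identity" client and that `provider` is the login provider name configured in the IdentityPool (for a UserPool, this has the form cognito-idp.<region>.amazonaws.com/<user-pool-id>). The function name and the shape of the returned dictionary are choices made for this example.

```python
def credentials_for_token(cognito_identity, identity_pool_id, provider, id_token, role_arn=None):
    """Exchange a JWT token for temporary AWS credentials via an IdentityPool."""
    logins = {provider: id_token}

    # Step 1: resolve the token to a Cognito identity ID.
    identity = cognito_identity.get_id(IdentityPoolId=identity_pool_id, Logins=logins)

    # Step 2: obtain credentials for that identity.
    params = {"IdentityId": identity["IdentityId"], "Logins": logins}
    if role_arn is not None:
        # With token-based role mapping, pick one role out of cognito:roles.
        params["CustomRoleArn"] = role_arn
    credentials = cognito_identity.get_credentials_for_identity(**params)["Credentials"]

    # Rename the keys so they can be passed as keyword arguments to a boto3 session.
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretKey"],
        "aws_session_token": credentials["SessionToken"],
    }
```

Since both calls are public, the client itself can be created without credentials, for example with an unsigned botocore configuration. The returned dictionary can then be used to create a session that is limited to what the role allows.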
Let us now explore ways in which your service can generate JWT tokens suitable for Cognito IdentityPool.
Cognito UserPool
IdentityPool needs an identity provider that can generate the JWT token. If you want to stay in the AWS universe, Cognito UserPool is an easy choice, but notice that IdentityPool supports other providers as well.
In essence, a UserPool is a secure database of users. UserPool factors out many mundane tasks when dealing with users, such as self-signup with email or SMS validation, password reset, and two-factor authentication. The InitiateAuth API allows your service to convert a username and password (potentially via several rounds of challenges) into a JWT token. This is a public API, hence it requires no AWS credentials. The token is populated with some standard claims, such as the user ID and the user’s email address. UserPool also allows you to create groups, assign a role to a group (1-to-1 mapping), and assign users to groups (M-to-N mapping). The roles are automatically added to the cognito:roles claim within the token.
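The sketch below shows the sign-in call and how to peek at the resulting claims. It assumes that `cognito_idp` is a boto3 "cognito-idp" client and that the app client has the USER_PASSWORD_AUTH flow enabled; challenges are not handled. The helper that reads the claims does not verify the signature, so it is only suitable for debugging, never for authorization decisions.

```python
import base64
import json


def sign_in(cognito_idp, client_id: str, username: str, password: str) -> str:
    """Convert a username and password into an ID token (challenges not handled)."""
    response = cognito_idp.initiate_auth(
        AuthFlow="USER_PASSWORD_AUTH",
        ClientId=client_id,
        AuthParameters={"USERNAME": username, "PASSWORD": password},
    )
    if "ChallengeName" in response:
        # E.g. a new password or a second factor is required; out of scope here.
        raise RuntimeError(f"unhandled challenge: {response['ChallengeName']}")
    return response["AuthenticationResult"]["IdToken"]


def unverified_claims(token: str) -> dict:
    """Decode the claims of a JWT WITHOUT verifying its signature (debugging only)."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore the stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))
```

Calling `unverified_claims(token).get("cognito:roles")` on a token of a user who belongs to groups with roles should list the role ARNs that IdentityPool may map the token to.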
Alternatively, you may add and remove roles and claims via a pre-token-generation lambda. Unfortunately, the pre-token-generation lambda only takes user attributes. In fact, the ClientMetaData from InitiateAuth is not sent to the pre-token-generation lambda. This means that your service cannot sub-scope a token. For example, you cannot create a token for user A and a token for project 1 of user A. This may have undesirable effects on what API your service exposes. If you need more flexibility, consider using Auth0 whose equivalent hook does receive client metadata.
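For illustration, a pre-token-generation lambda that derives the role from a user attribute might look like the sketch below. The custom attribute custom:tenant_id, the account ID and the role naming scheme are assumptions. Note that the handler has only the user attributes to work with, which is exactly the limitation described above.

```python
ACCOUNT_ID = "123456789012"  # hypothetical AWS account ID


def handler(event, context):
    """Pre-token-generation trigger: add a tenant claim and the tenant's role."""
    attributes = event["request"]["userAttributes"]
    tenant_id = attributes.get("custom:tenant_id")  # hypothetical custom attribute
    if tenant_id is None:
        # Failing the trigger fails the sign-in, which is safer than a token without a tenant.
        raise ValueError("user is not assigned to a tenant")

    # Hypothetical naming scheme for per-tenant roles.
    role_arn = f"arn:aws:iam::{ACCOUNT_ID}:role/tenant-{tenant_id}"
    event["response"]["claimsOverrideDetails"] = {
        "claimsToAddOrOverride": {"tenant_id": tenant_id},
        "groupOverrideDetails": {
            "iamRolesToOverride": [role_arn],
            "preferredRole": role_arn,
        },
    }
    return event
```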
Putting It All Together
The next two UML sequence diagrams show how your photo sharing service would work with data-layer isolation. The service execution timeline is highlighted in grey (no AWS credentials), blue or green, depending on what data its AWS credentials allow access to. As you can see, your code only has access to the data it is allowed to present, which adds another level of isolation. This helps you push features faster, without having to constantly worry about isolation bugs.
Caution: Eventual Consistency
Policies and roles seem to be eventually consistent. This means that your service may fail in weird ways. Our hourly tests caught the following situations:
- The IAM CreateRole call returned, but Cognito IdentityPool was unaware of the role.
- Cognito IdentityPool was aware of the role, but DynamoDB was not.
- DynamoDB was aware of the role, but the role did not contain the inline policy yet.
It may take as long as 30 seconds for AWS to settle. Fortunately, your service can easily distinguish the above cases and deal with them by retrying. The retry logic may reside either in your service code or in client code. Since AWS bills lambdas by execution time, waiting inside the service for a retry may be costly. Therefore, we suggest returning HTTP 503 Service Unavailable with a suitable Retry-After header.
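A minimal sketch of this division of labour follows. The exception `RoleNotReadyError` is a hypothetical stand-in for whatever errors your service observes in the three situations above, and the response dictionaries follow the shape of an API Gateway proxy response, which is an assumption about how the lambda is exposed.

```python
import time


class RoleNotReadyError(Exception):
    """Hypothetical error: a freshly created role or policy is not visible yet."""


def handle_request(serve, retry_after_seconds: int = 5) -> dict:
    """Service side: do not wait, tell the client when to come back."""
    try:
        body = serve()
    except RoleNotReadyError:
        return {
            "statusCode": 503,
            "headers": {"Retry-After": str(retry_after_seconds)},
            "body": "role not ready yet, please retry",
        }
    return {"statusCode": 200, "headers": {}, "body": body}


def call_with_retry(send, max_attempts: int = 5, sleep=time.sleep) -> dict:
    """Client side: retry on 503, waiting as long as the Retry-After header asks."""
    response = send()
    for _ in range(max_attempts - 1):
        if response["statusCode"] != 503:
            break
        sleep(int(response["headers"].get("Retry-After", "1")))
        response = send()
    return response
```

The waiting happens in the client, so the lambda returns immediately and is not billed for idle time.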
Caveats
- 1024 roles is the default soft limit. If this limit is too small for your use-case, either request an increase of the soft limit or use a coarser granularity of isolation (e.g., per-tenant instead of per-user).
- CloudFormation cannot configure the pre-token-generation lambda of a Cognito UserPool. We suggest using a custom resource or Terraform.
- Make sure your code forgets credentials after serving a request. A common error is to cache AWS credentials in your Lambda or EC2 instance and reuse them when serving the next request, likely from a different user. Better yet, include a test to make sure such errors cannot happen.
- Clearly mark AWS resources created by your service at run-time. Your service should mark roles created at run-time with a stack name to simplify cleanup. Alternatively, you may use AWS role paths, but beware that, if you include roles in your token, they use up precious space.
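Regarding the caveat on forgetting credentials, one way to structure the code is to keep credentials in a local variable of the request handler, never in module-level state, and to test that two consecutive requests do not share them. The sketch below assumes an event with an Authorization header; `credentials_for_token` and `load_albums` are hypothetical callables that your service would provide.

```python
def make_handler(credentials_for_token, load_albums):
    """Build a request handler that obtains fresh credentials for every request."""

    def handler(event, context):
        token = event["headers"]["Authorization"]
        # Local variable only: the credentials go out of scope when the request ends.
        credentials = credentials_for_token(token)
        return load_albums(credentials)

    return handler


def test_no_credential_reuse():
    """Two requests from different users must be served with different credentials."""
    seen = []
    handler = make_handler(
        credentials_for_token=lambda token: {"owner": token},
        load_albums=lambda credentials: seen.append(credentials["owner"]) or [],
    )
    handler({"headers": {"Authorization": "token-of-alice"}}, None)
    handler({"headers": {"Authorization": "token-of-bob"}}, None)
    assert seen == ["token-of-alice", "token-of-bob"]
```

If a later change introduced a module-level credentials cache, the second request would be served with the first user's credentials and the test would fail.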
Take-Aways
- Data isolation is critical to keep you in business.
- End-point isolation is error-prone, but can be aided with data-layer isolation.
- Data-layer isolation is not a silver bullet, but can serve as an added security barrier.
- AWS offers a rich set of features to enable data-layer isolation.
- Still, some features are missing: ClientMetaData is missing in pre-token-generation lambda, CloudFormation cannot set pre-token-generation lambda, policies support limited variables.
Does your service deal with highly sensitive data and need as much isolation as possible? If so, don’t hesitate to contact us!