
Node.js AWS SDK: How to list all the keys of a large S3 bucket?

Let's say you have a big S3 bucket with several thousand files, and you need to list all the keys in that bucket from your Node.js script. The AWS SDK for Node.js provides a method listObjects, but it returns at most 1000 keys per API call. It does, however, also send a flag IsTruncated to indicate whether the result was truncated. If the response contains IsTruncated as true, you need to call listObjects again, but this time you pass a Marker in your parameters which tells AWS:

Hey, I've received the list of objects up to this Marker object, send me the ones after this one please. Thanks!

We'll use this idea to write simple, easy-to-understand code using Async/Await, one of the new JavaScript features introduced in ES8 (ES2017). For that to work, you will need Node.js version 8 or higher.

First, I'll show you the script and then we will break it down to understand what it is doing. So, here's the code:

const AWS = require('aws-sdk');

const s3 = new AWS.S3({
  region: 'eu-central-1',
  accessKeyId: process.env.AWS_ACCESS_KEY_ID,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
});

async function listAllObjectsFromS3Bucket(bucket, prefix) {
  let isTruncated = true;
  let marker;
  while(isTruncated) {
    let params = { Bucket: bucket };
    if (prefix) params.Prefix = prefix;
    if (marker) params.Marker = marker;
    try {
      const response = await s3.listObjects(params).promise();
      response.Contents.forEach(item => {
        console.log(item.Key);
      });
      isTruncated = response.IsTruncated;
      if (isTruncated) {
        marker = response.Contents.slice(-1)[0].Key;
      }
    } catch(error) {
      throw error;
    }
  }
}

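// Handle a rejected promise here so a failed request is reported
// instead of surfacing as an unhandled rejection.
listAllObjectsFromS3Bucket('<your bucket name>', '<optional prefix>')
  .catch(error => {
    console.error(error);
    process.exitCode = 1;
  });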

The above script will print all the keys from the bucket matching the prefix that you provided. If you want to do something useful with the objects instead of just printing them to the console, you can easily tweak the above script to do that.
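As one example of such a tweak, here is a minimal sketch that collects the keys into an array and returns them instead of printing them. The function name collectAllKeys and its client parameter are my own additions for illustration: client is assumed to be any object exposing listObjects(params).promise(), such as the s3 instance created above.

```javascript
// Sketch: gather every key into an array instead of logging it.
// `client` is assumed to behave like an AWS.S3 instance from aws-sdk v2.
async function collectAllKeys(client, bucket, prefix) {
  const keys = [];
  let isTruncated = true;
  let marker;
  while (isTruncated) {
    const params = { Bucket: bucket };
    if (prefix) params.Prefix = prefix;
    if (marker) params.Marker = marker;
    const response = await client.listObjects(params).promise();
    for (const item of response.Contents) {
      keys.push(item.Key);
    }
    isTruncated = response.IsTruncated;
    if (isTruncated) {
      // Continue after the last key of the page we just received.
      marker = response.Contents.slice(-1)[0].Key;
    }
  }
  return keys;
}
```

You could then call it as collectAllKeys(s3, '<your bucket name>').then(keys => console.log(keys.length)). Keep in mind that for a very large bucket the array holds every key in memory at once.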

Now let us break it down into smaller parts and understand what each part is doing. Starting from the top:

const AWS = require('aws-sdk');

const s3 = new AWS.S3({
  region: 'eu-central-1',
  accessKeyId: process.env.AWS_ACCESS_KEY_ID,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
});

This is quite simple. We are importing the module aws-sdk and then instantiating an s3 client using the accessKeyId and secretAccessKey from our environment variables. Instead of process.env.AWS_ACCESS_KEY_ID, we could also use a hard-coded value of our access key id, but I wouldn't recommend that because of security concerns. It's always good to separate configuration from code, and it's good practice to provide credentials via environment variables. In my script, I am using the eu-central-1 region of AWS, but you can change that to the region where your S3 bucket is. Now that we have our s3 client instantiated, we can call the S3-related methods of the AWS API. For our problem, we need just one method, and that is s3.listObjects.

Let us take a look at the next section of our script:

async function listAllObjectsFromS3Bucket(bucket, prefix)

listAllObjectsFromS3Bucket is an asynchronous function which expects two parameters, bucket and prefix. An important thing to note here is the async keyword before the function keyword. This is necessary because we are using the await keyword inside this function. For the ES8 Async/Await to work, any function that uses await must have the async keyword in front of its definition. If you remove the async keyword from the function definition, you will get a SyntaxError. Feel free to try that.
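If you would rather see that SyntaxError without breaking your script, here is a small sketch that compiles the same function body twice at runtime, once as a plain function and once as an async function. The names body, AsyncFunction, errorName and asyncVersion are only for this illustration and are not part of the main script.

```javascript
// Compile the function body at runtime so the parse failure can be caught
// instead of stopping the whole script.
const body = 'const value = await Promise.resolve(42); return value;';

// The constructor shared by all async functions (it has no global name).
const AsyncFunction = Object.getPrototypeOf(async function () {}).constructor;

let errorName = null;
try {
  new Function(body); // plain function: await is not allowed in this body
} catch (error) {
  errorName = error.name;
}
console.log(errorName);

// The same body compiles fine once the function is async.
const asyncVersion = new AsyncFunction(body);
asyncVersion().then(value => console.log(value));
```

The first console.log should report the name of the error thrown for the plain function, while the async version compiles and resolves normally.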

Now if you look inside the while loop in the function definition, you will see the line:

const response = await s3.listObjects(params).promise();

Let's understand what this line of code is doing. If you check the AWS documentation for s3.listObjects method, you will see that the function expects two arguments. First is the params and second is the callback. But in our code we only provided params and no callback. Why is that?

Because almost all the AWS SDK methods also support promises. I would almost always recommend using promises instead of callbacks because of several advantages of promises over callbacks. We won't get into that discussion in this tutorial but let us see how we can use the AWS SDK methods to return promises instead of passing a callback function to them. Well, as you can see in our code, it is quite simple. We just need to do two things:

  • Omit the callback argument from the function call
  • Call the .promise() method to get the promise

So, the result of s3.listObjects(params).promise() will be a promise, which has a then method of its own. Just to give you a clearer picture, consider the following code snippet which uses a callback:

s3.listObjects(params, function (error, data){
    // do something with error and data here
});

We can convert this to promise-based code like below:

s3.listObjects(params).promise()
    .then(function (data){
        // do something with data here
    })
    .catch(function (error) {
        // handle your error here
    });

Async/Await takes it to the next level. Whenever a function returns a promise, we can use Async/Await with that function. So the above code can be re-written using Async/Await like below:

try {
    const data = await s3.listObjects(params).promise();
    // do something with data here
} catch(error) {
    // handle your error here
}

For someone who is new to JavaScript, the Async/Await solution will be much easier to understand. The beauty of Async/Await is that it lets you write asynchronous code as if it were synchronous code. It has some pitfalls, but in many cases it makes your code much easier to follow. Please note, however, that the try/catch block in the above code snippet must be inside a function declared with the async keyword.
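To make that last point concrete, here is a minimal sketch of the try/catch snippet wrapped in an async function. The name countFirstPage and the client parameter are hypothetical: client is assumed to be any object exposing listObjects(params).promise(), as an AWS.S3 instance does.

```javascript
// Sketch: await only works here because the enclosing function is async.
// `client` is assumed to behave like an AWS.S3 instance from aws-sdk v2.
async function countFirstPage(client, params) {
  try {
    const data = await client.listObjects(params).promise();
    // Number of keys returned by this single call.
    return data.Contents.length;
  } catch (error) {
    // Add some context, then rethrow so the caller can decide what to do.
    error.message = `listObjects failed: ${error.message}`;
    throw error;
  }
}
```

Since an async function always returns a promise, the caller either awaits it inside another async function or uses then and catch, for example countFirstPage(s3, { Bucket: '<your bucket name>' }).then(console.log).catch(console.error).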

Now coming to the next section of our code:

response.Contents.forEach(item => {
  console.log(item.Key);
});

Here we are just logging the Key of every item in the response to stdout.

isTruncated = response.IsTruncated;
if (isTruncated) {
  marker = response.Contents.slice(-1)[0].Key;
}

isTruncated is the flag that controls our while loop. It is initialized to true so that the first iteration executes. In subsequent iterations, its value depends on response.IsTruncated as returned by the AWS SDK. When isTruncated is true, we assign the Key of the last element in the response to our variable marker. The listObjects function accepts a parameter called Marker inside the params object. If Marker is not provided, it starts fetching the list of objects from the beginning. When Marker is provided, it starts fetching the list of objects after that key. The expression

response.Contents.slice(-1)[0].Key;

will return the Key of the last element of the response.Contents array. In each iteration of the while loop that returns a truncated response, we set our marker to the key of the last element of that response. When we reach the end of the list, response.IsTruncated will be false and our code will exit the while loop.
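If you want to watch the Marker hand-off without touching AWS, the same loop can be run against a small in-memory stand-in. Everything in this sketch (fakeS3, PAGE_SIZE, storedKeys, tracePagination) is hypothetical and only imitates the parts of listObjects that the loop relies on. It records which Marker was sent with each request.

```javascript
// Hypothetical in-memory stand-in for S3. It mimics only Marker, Contents
// and IsTruncated, and uses a tiny page size so pagination is easy to see.
const PAGE_SIZE = 2; // the real listObjects returns up to 1000 keys per call
const storedKeys = ['a.txt', 'b.txt', 'c.txt', 'd.txt', 'e.txt'];

const fakeS3 = {
  listObjects(params) {
    // Keys strictly after the marker, in sorted order.
    const remaining = storedKeys.filter(key => !params.Marker || key > params.Marker);
    const page = remaining.slice(0, PAGE_SIZE);
    const response = {
      Contents: page.map(key => ({ Key: key })),
      IsTruncated: remaining.length > page.length,
    };
    return { promise: () => Promise.resolve(response) };
  },
};

async function tracePagination() {
  const markersSent = [];
  let isTruncated = true;
  let marker;
  while (isTruncated) {
    const params = { Bucket: 'example-bucket' };
    if (marker) params.Marker = marker;
    markersSent.push(marker); // undefined on the first request
    const response = await fakeS3.listObjects(params).promise();
    isTruncated = response.IsTruncated;
    if (isTruncated) {
      marker = response.Contents.slice(-1)[0].Key;
    }
  }
  return markersSent;
}

tracePagination().then(markers => console.log(markers));
```

With five keys and a page size of two, the loop makes three requests: the first without a Marker, and the next two with the last key of the previous page.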

And that's how we can list all the objects in an S3 bucket with a large number of objects in it. Happy Coding :-)
