
S3st: Stream data from multiple S3 objects directly into your terminal - loige
https://www.npmjs.com/package/s3st
======
aliencat
You can also accomplish this with shell:

    export BUCKET=____; aws s3 ls "$BUCKET" | tail -n+2 | awk '{print $4}' |
      while read k; do aws s3 cp "s3://$BUCKET/$k" -; done

~~~
loige
I especially like the "done" (!) at the end :) Thanks for this one-liner!

------
tbrock
Cool. I had a similar use case and created a tool to stream colorized logs
from CloudWatch to your terminal that is a little more ergonomic to use than
this:

[https://github.com/TylerBrock/saw](https://github.com/TylerBrock/saw)

~~~
loige
I didn't know about saw! It looks very, very cool, and it's written in Go!
Thanks for (making and) sharing this!

------
djhworld
One thing to be careful of with this is the Data Transfer (egress) cost you
will incur streaming data out from S3.

If you just want to do a 'grep'-style action on objects under an S3 prefix, it
might be worth looking into "S3 Select" for your use case instead.
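
For reference, a rough sketch with the aws CLI (bucket, key, and expression are
made up, and it assumes gzipped CSV-ish objects; S3 Select runs per object, so
you'd still loop over keys):

    # hypothetical object; the SQL runs server-side so only matching rows leave S3
    aws s3api select-object-content \
      --bucket my-log-bucket \
      --key logs/2020/01/01/app.csv.gz \
      --expression "SELECT * FROM s3object s WHERE s._1 LIKE '%ERROR%'" \
      --expression-type SQL \
      --input-serialization '{"CSV": {}, "CompressionType": "GZIP"}' \
      --output-serialization '{"CSV": {}}' \
      /dev/stdout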

~~~
loige
Very good point, this is probably worth mentioning in the README. I'll add a
note there for sure!

------
saimiam
Pretty neat. I'm working on a product that relies heavily on S3 buckets and
tagged files.

Does s3st support tags or other ways of identifying which files to stream
other than filtering by the content of the files? Asking because I didn't see
this feature in the demo.

~~~
loige
It doesn't. It only supports filtering objects by prefix. I'm happy to get PRs
for this kind of feature if you have a use case!

~~~
saimiam
Copy that. I’ll take a look.

------
empthought
What are the advantages of this over a shell pipeline with aws-cli?

~~~
rsync
Or, what is the advantage over mounting the S3 bucket as a local filesystem
with s3fs[1] or rclone mount[2]?

[1] [https://github.com/s3fs-fuse/s3fs-fuse](https://github.com/s3fs-fuse/s3fs-fuse)

[2]
[https://rclone.org/commands/rclone_mount/](https://rclone.org/commands/rclone_mount/)
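
A rough sketch of the rclone route, assuming an already configured remote (the
"s3remote" name and the bucket/paths are made up):

    # mount the bucket read-only, then use ordinary tools against it
    rclone mount s3remote:my-log-bucket /mnt/my-log-bucket --read-only &

    grep -r "ERROR" /mnt/my-log-bucket/logs/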

~~~
jorblumesea
If your S3 bucket is huge, that's probably not a good idea. Most likely, the
use case here is streaming and searching for tagged data within a very large
S3 dataset.

------
alainchabat
We have a large S3 bucket (2 billion objects) and we're starting to think about
cleaning it up a bit. Has anyone done this kind of thing, or are there any
tools for:

- categorising what's inside

- checking what's used or not

Thanks!

~~~
ben509
The first step would be figuring out what's in there, so maybe look at Glue[1]
and see if it can determine your existing schema.
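
Something like this, assuming an existing IAM role for Glue (the names and
prefix are made up):

    # create a crawler over the bucket and kick it off; it writes the
    # inferred schema into the Glue Data Catalog
    aws glue create-crawler \
      --name huge-bucket-crawler \
      --role MyGlueServiceRole \
      --database-name huge_bucket_catalog \
      --targets '{"S3Targets": [{"Path": "s3://my-huge-bucket/"}]}'

    aws glue start-crawler --name huge-bucket-crawler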

But usually you need to run arbitrary code against the contents of a large S3
bucket, and that gets tricky. The main problem is tracking what you've done
vs. what you still need to do, because if you haven't categorized your data
yet, you can expect that the code processing it will break.

One technique is queues in SQS:

1. Keys to process

2. Keys that succeeded

3. Keys that failed

(Use regular queues; FIFO queues probably won't be useful. A queue can have an
unlimited backlog, but the maximum message retention period is two weeks.
That's probably more than enough time to iterate on some code in Lambda.)

Your initial Lambda should be triggered by KeysToProcess, which you can seed
from a developer machine: just run ListBucket and create a pile of messages.

When the Lambda is done, it passes its information to KeysThatSucceeded. (Or
possibly another S3 bucket, or Dynamo, or a database, or it can just drop the
key if you determine you don't need it.)

Point your dead letter queue to KeysThatFailed. Let the messages pile up in
there until you've figured out the errors and are ready to try again.

And then you can trigger off KeysThatFailed, point the dead letter queue at a
new KeysThatFailed2, rinse, repeat until you're satisfied it's correct.
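
To give an idea, the seeding step could be as dumb as this (bucket and queue
URL are made up; at billions of keys you'd want send-message-batch and some
parallelism, this just shows the shape of it):

    # list every key and push it onto the KeysToProcess queue
    aws s3api list-objects-v2 --bucket my-huge-bucket \
      --query 'Contents[].Key' --output text \
      | tr '\t' '\n' \
      | while read -r key; do
          aws sqs send-message \
            --queue-url "https://sqs.us-east-1.amazonaws.com/123456789012/KeysToProcess" \
            --message-body "$key"
        done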

[1]: [https://aws.amazon.com/glue/](https://aws.amazon.com/glue/)

------
sdan
What are the potential use cases?

~~~
loige
There's one in the example (streaming from multiple CloudTrail files and
grepping on the resulting stream).
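
For context, doing the same with plain aws-cli looks roughly like this (the
bucket, prefix, and pattern are made up; CloudTrail objects are gzipped, hence
the zcat):

    # stream every CloudTrail object under a prefix and grep the combined output
    export BUCKET=my-trail-bucket
    aws s3 ls "s3://$BUCKET/AWSLogs/" --recursive | awk '{print $4}' |
      while read -r k; do aws s3 cp "s3://$BUCKET/$k" - | zcat; done |
      grep AssumeRole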

