@tokenizer/s3

The tokenizer-s3 module enables seamless integration with Amazon Web Services (AWS) S3, allowing you to read and tokenize data from S3 objects in a streaming fashion. This module extends the functionality of the strtok3 tokenizer by providing support for chunked S3 data access.

Features

  • Streaming Support: Efficiently read and tokenize data from Amazon S3 objects using streaming, which is ideal for handling large files without loading them entirely into memory.

  • Integration with strtok3: Works with the strtok3 tokenizer to process S3 data streams, making it easy to handle various tokenization tasks.

  • Flexible Access: Provides options to configure S3 access, allowing for customized tokenization workflows based on your specific needs.

  • Promise-Based API: Uses a promise-based API for easy integration into modern asynchronous workflows.

Installation

npm install @tokenizer/s3

Sponsor

If you appreciate my work and want to support the development of open-source projects like music-metadata, file-type, and listFix(), consider becoming a sponsor or making a small contribution. Your support helps sustain ongoing development and improvements.

Become a sponsor to Borewit or buy me a coffee.

API Documentation

makeChunkedTokenizerFromS3

Initialize a tokenizer, with the option for random access, from an Amazon S3 client for use in extracting metadata from media files.

Function Signature

function makeChunkedTokenizerFromS3(s3: S3Client, objRequest: GetObjectRequest, options?: IS3Options): Promise<IRandomAccessTokenizer>

Reads the S3 object in chunks, which allows random access.

Parameters

  • s3 (S3Client):

    The S3 client used to make requests to Amazon S3.

    Note: To configure AWS client authentication, see Configuration and credential file settings.

  • objRequest (GetObjectRequest):

    The S3 object request containing details about the S3 object to fetch. This includes properties like the bucket name and object key.

  • options (IS3Options, optional):

Returns

  • Promise<IRandomAccessTokenizer>:

    A Promise that resolves to an instance of IRandomAccessTokenizer. This tokenizer can be used to extract metadata from the specified media file in the S3 object. It supports random access reads.
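Random access over an S3 object is typically served by HTTP range requests, where each chunk is fetched with a `Range` value such as `bytes=0-1023`. The sketch below shows only that arithmetic. The helper `rangeForOffset` and the chunk size are hypothetical, and the chunking this module performs internally may differ.

```typescript
// Hypothetical helper, not part of @tokenizer/s3: computes the HTTP Range
// value for the fixed-size chunk that contains a given byte offset.
function rangeForOffset(offset: number, chunkSize: number, objectSize: number): string {
  if (chunkSize <= 0) {
    throw new RangeError('chunkSize must be positive');
  }
  if (offset < 0 || offset >= objectSize) {
    throw new RangeError(`offset ${offset} is outside the object (size ${objectSize})`);
  }
  // Align the start down to a chunk boundary
  const start = Math.floor(offset / chunkSize) * chunkSize;
  // Range ends are inclusive; the last chunk may be shorter than chunkSize
  const end = Math.min(start + chunkSize, objectSize) - 1;
  return `bytes=${start}-${end}`;
}
```

For example, with 1024-byte chunks and a 5000-byte object, offset 4999 falls in the final, shorter chunk `bytes=4096-4999`.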

makeStreamingTokenizerFromS3

Initialize a tokenizer from an Amazon S3 client for use in extracting metadata from media files.

Function Signature

function makeStreamingTokenizerFromS3(s3: S3Client, objRequest: GetObjectRequest): Promise<ITokenizer>

Reads the S3 object as a stream.

Parameters

  • s3 (S3Client):

    The S3 client used to make requests to Amazon S3.

    Note: To configure AWS client authentication, see Configuration and credential file settings.

  • objRequest (GetObjectRequest):

    The S3 object request containing details about the S3 object to fetch. This includes properties like the bucket name and object key.

Returns

  • Promise<ITokenizer>:

    A Promise that resolves to an instance of ITokenizer. This tokenizer can be used to extract metadata from the specified media file in the S3 object.
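Both functions take an objRequest that names at least the bucket and the object key. When the object location arrives as an `s3://` URI, a small helper can split it into those two properties. This is a minimal sketch: `parseS3Uri` and `ObjectRequestLike` are hypothetical names and are not part of this module or the AWS SDK.

```typescript
// Structural stand-in for the Bucket/Key part of an S3 object request
interface ObjectRequestLike {
  Bucket: string;
  Key: string;
}

// Hypothetical helper: split "s3://bucket/key" into { Bucket, Key }
function parseS3Uri(uri: string): ObjectRequestLike {
  const prefix = 's3://';
  if (uri.indexOf(prefix) !== 0) {
    throw new Error(`Not an S3 URI: ${uri}`);
  }
  const rest = uri.slice(prefix.length);
  // The first slash separates the bucket from the key; the key may contain more slashes
  const slash = rest.indexOf('/');
  if (slash <= 0 || slash === rest.length - 1) {
    throw new Error(`S3 URI must contain both a bucket and a key: ${uri}`);
  }
  return { Bucket: rest.slice(0, slash), Key: rest.slice(slash + 1) };
}
```

The returned object can be passed as the objRequest argument, for example `parseS3Uri('s3://affectlab/1min_35sec.mp4')`.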

Compatibility

As of version 0.3.0, this module has migrated from CommonJS to a pure ECMAScript Module (ESM). The distributed JavaScript codebase is compliant with the ECMAScript 2020 (11th Edition) standard.

This module requires Node.js ≥ 16. It can also be used in a browser environment when bundled with a module bundler.

For backward compatibility with TypeScript projects that compile to CommonJS, you can use load-esm.

Examples

Determine S3 file type

Determine the file type (based on its content) of a file stored in Amazon S3:

import { fileTypeFromTokenizer } from 'file-type';
import { fromEnv } from '@aws-sdk/credential-providers';
import { S3Client } from '@aws-sdk/client-s3';
import { makeChunkedTokenizerFromS3 } from '@tokenizer/s3';

(async () => {

  // Initialize S3 client
  const s3 = new S3Client({
    region: 'eu-west-2',
    credentials: fromEnv(),
  });

  // Initialize S3 tokenizer
  const s3Tokenizer = await makeChunkedTokenizerFromS3(s3, {
    Bucket: 'affectlab',
    Key: '1min_35sec.mp4'
  });

  // Figure out what kind of file it is
  const fileType = await fileTypeFromTokenizer(s3Tokenizer);
  console.log(fileType);
})();

See also example at file-type.
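The value logged in the example above is either an object with `ext` and `mime` fields or `undefined` when file-type does not recognize the content. One way to act on it is to check the detected MIME type against an allow-list. This is a sketch under assumptions: `DetectedType` is a local stand-in for the library's result type, and `isAllowedType` is a hypothetical helper.

```typescript
// Local stand-in for the shape of a file-type detection result
interface DetectedType {
  ext: string;
  mime: string;
}

// Hypothetical helper: true only when a type was detected and its MIME type is allowed
function isAllowedType(detected: DetectedType | undefined, allowedMimes: readonly string[]): boolean {
  // An undefined result means the content type could not be determined
  if (detected === undefined) {
    return false;
  }
  return allowedMimes.indexOf(detected.mime) !== -1;
}
```

With an allow-list of `['video/mp4', 'audio/mpeg']`, a result of `{ ext: 'mp4', mime: 'video/mp4' }` passes, while an unrecognized file is rejected.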

Reading audio metadata from Amazon S3

Retrieve audio metadata from an S3 object using music-metadata:

import { makeChunkedTokenizerFromS3 } from '@tokenizer/s3';
import { S3Client } from '@aws-sdk/client-s3';
import { parseFromTokenizer } from 'music-metadata';

/**
 * Retrieve metadata from Amazon S3 object
 * @param s3 S3 client
 * @param objRequest S3 object request
 * @param options `music-metadata` parsing options
 * @return Metadata
 */
async function parseS3Object(s3, objRequest, options) {
  const s3Tokenizer = await makeChunkedTokenizerFromS3(s3, objRequest);
  return parseFromTokenizer(s3Tokenizer, options);
}

(async () => {
  const s3 = new S3Client({});

  const metadata = await parseS3Object(s3, {
    Bucket: 'standing0media',
    Key: '01 Where The Highway Takes Me.mp3'
  });

  console.log(metadata);
})();

A module implementation of this example can be found in @music-metadata/s3.
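The metadata object returned by music-metadata groups tag values under `common` and technical details under `format`. As a sketch of consuming it, the hypothetical helper below builds a one-line summary. `MetadataLike` is a local stand-in covering only the three fields used here, not the library's full type.

```typescript
// Local stand-in for the subset of music-metadata's result used below
interface MetadataLike {
  common: { title?: string; artist?: string };
  format: { duration?: number }; // duration in seconds
}

// Hypothetical helper: "Artist - Title (m:ss)", with fallbacks for missing fields
function describeTrack(metadata: MetadataLike): string {
  const artist = metadata.common.artist ?? 'Unknown artist';
  const title = metadata.common.title ?? 'Unknown title';
  const duration = metadata.format.duration;
  if (duration === undefined) {
    return `${artist} - ${title}`;
  }
  const minutes = Math.floor(duration / 60);
  const seconds = Math.floor(duration % 60);
  // Zero-pad the seconds to two digits
  const paddedSeconds = seconds < 10 ? `0${seconds}` : `${seconds}`;
  return `${artist} - ${title} (${minutes}:${paddedSeconds})`;
}
```

It could be called as `console.log(describeTrack(metadata))` in place of logging the whole object.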
