Note that these examples are *not* exhaustive and you can use S3 in other places as well.

## S3 FileSystem Implementations

Flink provides three independent S3 filesystem implementations:

| Implementation | Checkpointing | FileSink | Notes |
|----------------|:-------------:|:--------:|-------|
| **Native S3** (`flink-s3-fs-native`) | ✓ | ✓ | **Experimental** in Flink 2.3. Built on AWS SDK v2; no Hadoop dependency. |
| **Presto S3** (`flink-s3-fs-presto`) | ✓ | ✗ | Production-proven for checkpointing. |
| **Hadoop S3** (`flink-s3-fs-hadoop`) | ✓ | ✓ | Mature; the only stable implementation that provides `RecoverableWriter` for the FileSink. |

Previously, users had to choose between Presto (recommended for checkpointing throughput) and Hadoop (the only implementation with `RecoverableWriter`, required by the [FileSink]({{< ref "docs/connectors/datastream/filesystem" >}})). The Native S3 implementation unifies both capabilities in a single plugin, and [benchmarks](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406620396) show roughly 2x higher checkpoint throughput (~200 MB/s vs ~90 MB/s) than the Presto implementation at state sizes up to 15 GB.
All three implementations are self-contained, with no external dependency footprint; there is no need to add Hadoop to the classpath to use them.
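
For example, once a plugin is installed, an S3 URI can be used anywhere Flink expects a filesystem path, such as the checkpoint directory. A minimal sketch, assuming the Flink 2.x `execution.checkpointing.dir` key and a placeholder bucket name:

```yaml
# The s3:// scheme is served by whichever S3 plugin is installed
execution.checkpointing.dir: s3://my-bucket/flink/checkpoints
```
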
## Common Configuration
### Configure Access Credentials

After setting up the S3 FileSystem implementation, you need to make sure that Flink is allowed to access your S3 buckets. The following three approaches are **independent alternatives**; choose the one that fits your environment:

#### Identity and Access Management (IAM) (Recommended)

The recommended way to grant Flink access to S3 on AWS is through IAM roles, so that no long-lived access keys have to be configured or distributed.

The legacy configuration key `s3.path.style.access` is still supported as a fallback.

### Native S3 FileSystem

{{< hint warning >}}
**Experimental**: The Native S3 FileSystem is experimental in Flink 2.3. It is functionally complete and has demonstrated strong performance in benchmarks.
{{< /hint >}}

The Native S3 FileSystem is a pure-Java implementation built on the AWS SDK v2, removing the dependency on Hadoop entirely. It is registered under the schemes *s3://* and *s3a://*. It provides a drop-in replacement for the Presto and Hadoop implementations, supporting checkpointing, the [FileSink]({{< ref "docs/connectors/datastream/filesystem" >}}) (via `RecoverableWriter`), server-side encryption (SSE-S3, SSE-KMS), cross-account access via IAM role assumption, entropy injection, and bulk copy via S3TransferManager.
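
For instance, entropy injection uses the same pair of keys as the other S3 filesystems; a sketch with placeholder values:

```yaml
# Replace the `_entropy_` marker in write paths with 4 random characters
# to spread checkpoint objects across S3 key prefixes
s3.entropy.key: _entropy_
s3.entropy.length: 4
```
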
#### Configuration

In addition to the [common configuration](#common-configuration) options (`s3.access-key`, `s3.secret-key`, `s3.endpoint`, `s3.path-style-access`), the Native S3 FileSystem supports the following options:

```yaml
# Server-side encryption
s3.sse.type: sse-s3 # or sse-kms, aws:kms, AES256, none (default)
s3.sse.kms.key-id: arn:aws:kms:region:account:key/id # Required for SSE-KMS
```

When `fs.s3.aws.credentials.provider` is not set, the Native S3 FileSystem automatically builds a credentials chain in the following order: delegation tokens, static credentials (if `s3.access-key` and `s3.secret-key` are configured), and the AWS SDK v2 `DefaultCredentialsProvider` (environment variables, instance profiles, etc.). You only need to set this option if you require a custom provider chain.
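
If a custom chain is required, the provider can be set explicitly. A sketch, where `com.example.MyCredentialsProvider` is a hypothetical class implementing the AWS SDK v2 `AwsCredentialsProvider` interface:

```yaml
# Override the automatic credentials chain with a custom provider
fs.s3.aws.credentials.provider: com.example.MyCredentialsProvider
```
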
---
### Presto S3 FileSystem

{{< hint info >}}
You don't have to configure this manually if you are running [Flink on EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html).
{{< /hint >}}
The Presto S3 FileSystem is based on code from the [Presto project](https://prestodb.io/). It is registered under the schemes *s3://* and *s3p://*. It is the production-proven choice for checkpointing to S3. It does not support the [FileSink]({{< ref "docs/connectors/datastream/filesystem" >}}) (`createRecoverableWriter` throws `UnsupportedOperationException`).
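
Because it also registers the dedicated *s3p://* scheme, the Presto plugin can be selected explicitly when several S3 plugins are installed. A sketch, assuming the Flink 2.x `execution.checkpointing.dir` key and a placeholder bucket name:

```yaml
# s3p:// always resolves to the Presto S3 plugin
execution.checkpointing.dir: s3p://my-bucket/flink/checkpoints
```
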
#### Setup

To use the Presto S3 FileSystem, copy the JAR file from the `opt` directory of the Flink distribution into a plugin directory before starting Flink, as shown below.
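
A sketch of the standard steps; the `mkdir` step is an assumption that follows the plugin directory convention used by the `cp` command:

```bash
mkdir ./plugins/s3-fs-presto
cp ./opt/flink-s3-fs-presto-{{< version >}}.jar ./plugins/s3-fs-presto/
```
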
#### Configuration

The [common configuration](#common-configuration) options apply and are added to your [Flink configuration file]({{< ref "docs/deployment/config#flink-configuration-file" >}}). In addition, Presto-specific keys are supported via the [Presto file system configuration](https://prestodb.io/docs/0.272/connector/hive.html#amazon-s3-configuration); refer to the Presto documentation for all available options.
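
For example, pointing the plugin at an S3-compatible object store using the common keys (values are placeholders):

```yaml
# Common keys understood by all three S3 filesystem plugins
s3.endpoint: your-endpoint-hostname
s3.path-style-access: true
```
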
---
### Hadoop S3 FileSystem
The Hadoop S3 FileSystem is based on code from the [Hadoop Project](https://hadoop.apache.org/). It is registered under the schemes *s3://* and *s3a://*. It is the only stable implementation that supports the [FileSink]({{< ref "docs/connectors/datastream/filesystem" >}}) (via `RecoverableWriter`).
#### Setup

To use the Hadoop S3 FileSystem, copy the JAR file from the `opt` directory of the Flink distribution into a plugin directory before starting Flink, as shown below.
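
A sketch of the standard steps; the `mkdir` step is an assumption that follows the plugin directory convention used by the `cp` command:

```bash
mkdir ./plugins/s3-fs-hadoop
cp ./opt/flink-s3-fs-hadoop-{{< version >}}.jar ./plugins/s3-fs-hadoop/
```
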
#### Configuration

The [common configuration](#common-configuration) options apply and are added to your [Flink configuration file]({{< ref "docs/deployment/config#flink-configuration-file" >}}). In addition, [Hadoop's s3a configuration keys](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#S3A) are supported; they are translated automatically, so that, for example, `fs.s3a.connection.maximum` becomes `s3.connection.maximum`. Refer to the Hadoop S3A documentation for all available options.
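
For example, raising the S3A connection pool limit through the translated key (the value is a placeholder):

```yaml
# Translated to Hadoop's fs.s3a.connection.maximum
s3.connection.maximum: 100
```
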