
Commit a247bf9

KSDaemon and qiao-x authored

feat(databricks-jdbc-driver): Add export bucket support for Google Cloud Storage (#9445)

* Databricks export bucket for Google Cloud Storage
* simplify docs
* add correct search prefix and filter after extract
* rollback tableFullName construction
* abstract out the bucket type array
* some fixes and code polish
* prepare CI
* add docs
* remove copy-paste

Co-authored-by: qiao-x <105260504+qiao-x@users.noreply.github.com>

1 parent 455174b

File tree: 9 files changed (+28493 lines added, -28 removed)


.github/workflows/drivers-tests.yml

Lines changed: 4 additions & 0 deletions
```diff
@@ -210,6 +210,8 @@ jobs:
           databricks-jdbc-export-bucket-s3-prefix
           databricks-jdbc-export-bucket-azure
           databricks-jdbc-export-bucket-azure-prefix
+          databricks-jdbc-export-bucket-gcs
+          databricks-jdbc-export-bucket-gcs-prefix
           redshift
           redshift-export-bucket-s3
           snowflake
@@ -237,6 +239,8 @@ jobs:
           - databricks-jdbc-export-bucket-s3-prefix
           - databricks-jdbc-export-bucket-azure
           - databricks-jdbc-export-bucket-azure-prefix
+          - databricks-jdbc-export-bucket-gcs
+          - databricks-jdbc-export-bucket-gcs-prefix
           - mssql
           - mysql
           - postgres
```
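These two new matrix entries enable the GCS export-bucket suites in CI; the matching `yarn` test scripts are added in `packages/cubejs-testing-drivers/package.json` below.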

docs/pages/product/configuration/data-sources/databricks-jdbc.mdx

Lines changed: 21 additions & 1 deletion
````diff
@@ -122,6 +122,24 @@ CUBEJS_DB_EXPORT_BUCKET_AWS_SECRET=<AWS_SECRET>
 CUBEJS_DB_EXPORT_BUCKET_AWS_REGION=<AWS_REGION>
 ```

+#### Google Cloud Storage
+
+<InfoBox>
+
+When using an export bucket, remember to assign the **Storage Object Admin**
+role to your Google Cloud credentials (`CUBEJS_DB_EXPORT_GCS_CREDENTIALS`).
+
+</InfoBox>
+
+To use Google Cloud Storage as an export bucket, first complete [the Databricks guide on
+connecting to cloud object storage using Unity Catalog][databricks-docs-uc-gcs].
+
+```dotenv
+CUBEJS_DB_EXPORT_BUCKET=gs://databricks-export-bucket
+CUBEJS_DB_EXPORT_BUCKET_TYPE=gcs
+CUBEJS_DB_EXPORT_GCS_CREDENTIALS=<BASE64_ENCODED_SERVICE_CREDENTIALS_JSON>
+```
+
 #### Azure Blob Storage

 To use Azure Blob Storage as an export bucket, follow [the Databricks guide on
@@ -136,7 +154,7 @@ CUBEJS_DB_EXPORT_BUCKET=wasbs://my-bucket@my-account.blob.core.windows.net
 CUBEJS_DB_EXPORT_BUCKET_AZURE_KEY=<AZURE_STORAGE_ACCOUNT_ACCESS_KEY>
 ```

-Access key provides full access to the configuration and data,
+Access key provides full access to the configuration and data,
 to use a fine-grained control over access to storage resources, follow [the Databricks guide on authorize with Azure Active Directory][authorize-with-azure-active-directory].

 [Create the service principal][azure-authentication-with-service-principal] and replace the access key as follows:
@@ -173,6 +191,8 @@ bucket][self-preaggs-export-bucket] **must be** configured.
   https://docs.databricks.com/data/data-sources/azure/azure-storage.html
 [databricks-docs-uc-s3]:
   https://docs.databricks.com/en/connect/unity-catalog/index.html
+[databricks-docs-uc-gcs]:
+  https://docs.databricks.com/gcp/en/connect/unity-catalog/cloud-storage.html
 [databricks-docs-jdbc-url]:
   https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#get-server-hostname-port-http-path-and-jdbc-url
 [databricks-docs-pat]:
````
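As documented above, `CUBEJS_DB_EXPORT_GCS_CREDENTIALS` expects the service-account key JSON encoded as base64. A minimal Node.js sketch for producing that value (the `service-account.json` path is illustrative, not part of the commit):

```typescript
import { readFileSync } from 'fs';

// Read the downloaded service-account key file and base64-encode it so the
// whole JSON document fits into a single environment variable value.
// 'service-account.json' is a hypothetical path to your key file.
const keyJson = readFileSync('service-account.json', 'utf8');
const encoded = Buffer.from(keyJson, 'utf8').toString('base64');

// Paste the output into CUBEJS_DB_EXPORT_GCS_CREDENTIALS.
console.log(encoded);
```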

packages/cubejs-databricks-jdbc-driver/src/DatabricksDriver.ts

Lines changed: 49 additions & 27 deletions
```diff
@@ -4,27 +4,23 @@
  * @fileoverview The `DatabricksDriver` and related types declaration.
  */

+import { assertDataSource, getEnv, } from '@cubejs-backend/shared';
 import {
-  getEnv,
-  assertDataSource,
-} from '@cubejs-backend/shared';
-import {
+  DatabaseStructure,
   DriverCapabilities,
+  GenericDataBaseType,
   QueryColumnsResult,
   QueryOptions,
   QuerySchemasResult,
   QueryTablesResult,
-  UnloadOptions,
-  GenericDataBaseType,
   TableColumn,
-  DatabaseStructure,
+  UnloadOptions,
 } from '@cubejs-backend/base-driver';
-import {
-  JDBCDriver,
-  JDBCDriverConfiguration,
-} from '@cubejs-backend/jdbc-driver';
+import { JDBCDriver, JDBCDriverConfiguration, } from '@cubejs-backend/jdbc-driver';
 import { DatabricksQuery } from './DatabricksQuery';
-import { resolveJDBCDriver, extractUidFromJdbcUrl } from './helpers';
+import { extractUidFromJdbcUrl, resolveJDBCDriver } from './helpers';
+
+const SUPPORTED_BUCKET_TYPES = ['s3', 'gcs', 'azure'];

 export type DatabricksDriverConfiguration = JDBCDriverConfiguration &
   {
@@ -103,6 +99,11 @@ export type DatabricksDriverConfiguration = JDBCDriverConfiguration &
    * Azure service principal client secret
    */
   azureClientSecret?: string,
+
+  /**
+   * GCS credentials JSON content
+   */
+  gcsCredentials?: string,
 };

 type ShowTableRow = {
@@ -209,7 +210,7 @@ export class DatabricksDriver extends JDBCDriver {
       // common export bucket config
       bucketType:
         conf?.bucketType ||
-        getEnv('dbExportBucketType', { supported: ['s3', 'azure'], dataSource }),
+        getEnv('dbExportBucketType', { supported: SUPPORTED_BUCKET_TYPES, dataSource }),
       exportBucket:
         conf?.exportBucket ||
         getEnv('dbExportBucket', { dataSource }),
@@ -246,6 +247,10 @@ export class DatabricksDriver extends JDBCDriver {
       azureClientSecret:
         conf?.azureClientSecret ||
         getEnv('dbExportBucketAzureClientSecret', { dataSource }),
+      // GCS credentials
+      gcsCredentials:
+        conf?.gcsCredentials ||
+        getEnv('dbExportGCSCredentials', { dataSource }),
     };
     if (config.readOnly === undefined) {
       // we can set readonly to true if there is no bucket config provided
@@ -429,8 +434,7 @@ export class DatabricksDriver extends JDBCDriver {
         metadata[database] = {};
       }

-      const columns = await this.tableColumnTypes(`${database}.${tableName}`);
-      metadata[database][tableName] = columns;
+      metadata[database][tableName] = await this.tableColumnTypes(`${database}.${tableName}`);
     }));

     return metadata;
@@ -527,7 +531,7 @@ export class DatabricksDriver extends JDBCDriver {
    * Returns table columns types.
    */
   public override async tableColumnTypes(table: string): Promise<TableColumn[]> {
-    let tableFullName = '';
+    let tableFullName: string;
     const tableArray = table.split('.');

     if (tableArray.length === 3) {
@@ -643,7 +647,7 @@ export class DatabricksDriver extends JDBCDriver {
    * export bucket data.
    */
   public async unload(tableName: string, options: UnloadOptions) {
-    if (!['azure', 's3'].includes(this.config.bucketType as string)) {
+    if (!SUPPORTED_BUCKET_TYPES.includes(this.config.bucketType as string)) {
       throw new Error(`Unsupported export bucket type: ${
         this.config.bucketType
       }`);
@@ -733,6 +737,12 @@ export class DatabricksDriver extends JDBCDriver {
         url.host,
         objectSearchPrefix,
       );
+    } else if (this.config.bucketType === 'gcs') {
+      return this.extractFilesFromGCS(
+        { credentials: this.config.gcsCredentials },
+        url.host,
+        objectSearchPrefix,
+      );
     } else {
       throw new Error(`Unsupported export bucket type: ${
         this.config.bucketType
@@ -759,16 +769,22 @@ export class DatabricksDriver extends JDBCDriver {
    *
    * For Azure blob storage you need to configure account access key in
    * Cluster -> Configuration -> Advanced options
-   * (https://docs.databricks.com/data/data-sources/azure/azure-storage.html#access-azure-blob-storage-directly)
+   * https://docs.databricks.com/data/data-sources/azure/azure-storage.html#access-azure-blob-storage-directly
    *
    * `fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>`
    *
    * For S3 bucket storage you need to configure AWS access key and secret in
    * Cluster -> Configuration -> Advanced options
-   * (https://docs.databricks.com/data/data-sources/aws/amazon-s3.html#access-s3-buckets-directly)
+   * https://docs.databricks.com/data/data-sources/aws/amazon-s3.html#access-s3-buckets-directly
    *
    * `fs.s3a.access.key <aws-access-key>`
    * `fs.s3a.secret.key <aws-secret-key>`
+   *
+   * For Google cloud storage you can configure storage credentials and create an external location to access it
+   * or configure account service key (legacy)
+   * https://docs.databricks.com/gcp/en/connect/unity-catalog/cloud-storage/storage-credentials
+   * https://docs.databricks.com/gcp/en/connect/unity-catalog/cloud-storage/external-locations
+   * https://docs.databricks.com/aws/en/connect/storage/gcs
    */
   private async createExternalTableFromSql(tableFullName: string, sql: string, params: unknown[], columns: ColumnInfo[]) {
     let select = sql;
@@ -780,15 +796,15 @@ export class DatabricksDriver extends JDBCDriver {
     try {
       await this.query(
         `
-        CREATE TABLE ${tableFullName}
-        USING CSV LOCATION '${this.config.exportBucketMountDir || this.config.exportBucket}/${tableFullName}.csv'
+        CREATE TABLE ${tableFullName}_tmp
+        USING CSV LOCATION '${this.config.exportBucketMountDir || this.config.exportBucket}/${tableFullName}'
         OPTIONS (escape = '"')
         AS (${select});
         `,
         params,
       );
     } finally {
-      await this.query(`DROP TABLE IF EXISTS ${tableFullName};`, []);
+      await this.query(`DROP TABLE IF EXISTS ${tableFullName}_tmp;`, []);
     }
   }

@@ -798,30 +814,36 @@ export class DatabricksDriver extends JDBCDriver {
    *
    * For Azure blob storage you need to configure account access key in
    * Cluster -> Configuration -> Advanced options
-   * (https://docs.databricks.com/data/data-sources/azure/azure-storage.html#access-azure-blob-storage-directly)
+   * https://docs.databricks.com/data/data-sources/azure/azure-storage.html#access-azure-blob-storage-directly
    *
    * `fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>`
    *
    * For S3 bucket storage you need to configure AWS access key and secret in
    * Cluster -> Configuration -> Advanced options
-   * (https://docs.databricks.com/data/data-sources/aws/amazon-s3.html#access-s3-buckets-directly)
+   * https://docs.databricks.com/data/data-sources/aws/amazon-s3.html#access-s3-buckets-directly
    *
    * `fs.s3a.access.key <aws-access-key>`
    * `fs.s3a.secret.key <aws-secret-key>`
+   *
+   * For Google cloud storage you can configure storage credentials and create an external location to access it
+   * or configure account service key (legacy)
+   * https://docs.databricks.com/gcp/en/connect/unity-catalog/cloud-storage/storage-credentials
+   * https://docs.databricks.com/gcp/en/connect/unity-catalog/cloud-storage/external-locations
+   * https://docs.databricks.com/aws/en/connect/storage/gcs
    */
   private async createExternalTableFromTable(tableFullName: string, columns: ColumnInfo[]) {
     try {
       await this.query(
         `
-        CREATE TABLE _${tableFullName}
-        USING CSV LOCATION '${this.config.exportBucketMountDir || this.config.exportBucket}/${tableFullName}.csv'
+        CREATE TABLE ${tableFullName}_tmp
+        USING CSV LOCATION '${this.config.exportBucketMountDir || this.config.exportBucket}/${tableFullName}'
         OPTIONS (escape = '"')
         AS SELECT ${this.generateTableColumnsForExport(columns)} FROM ${tableFullName}
         `,
         [],
       );
     } finally {
-      await this.query(`DROP TABLE IF EXISTS _${tableFullName};`, []);
+      await this.query(`DROP TABLE IF EXISTS ${tableFullName}_tmp;`, []);
     }
   }
 }
```
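The `unload` path now accepts `gcs` and delegates to `this.extractFilesFromGCS(...)`, whose implementation is not part of this diff. Below is a hedged sketch of what such a helper could look like with the `@google-cloud/storage` client, consistent with the commit's "add correct search prefix and filter after extract" note; the function name, signature, and signed-URL expiry here are assumptions, not the shipped code:

```typescript
import { Storage } from '@google-cloud/storage';

// Hypothetical stand-in for the extractFilesFromGCS helper invoked above.
async function extractFilesFromGCS(
  gcsConfig: { credentials?: string }, // base64-encoded service-account JSON
  bucketName: string,
  objectSearchPrefix: string,          // prefix of the exported CSV parts
): Promise<string[]> {
  // Decode the credentials the same way the env var is documented: base64 -> JSON.
  const credentials = JSON.parse(
    Buffer.from(gcsConfig.credentials ?? '', 'base64').toString('utf8'),
  );
  const storage = new Storage({ credentials });

  // List objects under the search prefix, then filter down to the CSV parts,
  // since the prefix can also match metadata files written alongside them.
  const [files] = await storage
    .bucket(bucketName)
    .getFiles({ prefix: objectSearchPrefix });
  const csvFiles = files.filter((f) => f.name.endsWith('.csv'));

  // Return short-lived signed URLs so the files can be downloaded directly.
  return Promise.all(
    csvFiles.map(async (f) => {
      const [url] = await f.getSignedUrl({
        action: 'read',
        expires: Date.now() + 60 * 60 * 1000, // 1 hour (assumed)
      });
      return url;
    }),
  );
}
```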

packages/cubejs-testing-drivers/fixtures/databricks-jdbc.json

Lines changed: 18 additions & 0 deletions
```diff
@@ -39,6 +39,24 @@
         "CUBEJS_DB_EXPORT_BUCKET_AZURE_KEY": "${DRIVERS_TESTS_CUBEJS_DB_EXPORT_BUCKET_AZURE_KEY}"
       }
     }
+  },
+  "export-bucket-gcs": {
+    "cube": {
+      "environment": {
+        "CUBEJS_DB_EXPORT_BUCKET_TYPE": "gcs",
+        "CUBEJS_DB_EXPORT_BUCKET": "gs://databricks-drivers-tests-preaggs",
+        "CUBEJS_DB_EXPORT_GCS_CREDENTIALS": "${DRIVERS_TESTS_CUBEJS_DB_EXPORT_GCS_CREDENTIALS}"
+      }
+    }
+  },
+  "export-bucket-gcs-prefix": {
+    "cube": {
+      "environment": {
+        "CUBEJS_DB_EXPORT_BUCKET_TYPE": "gcs",
+        "CUBEJS_DB_EXPORT_BUCKET": "gs://databricks-drivers-tests-preaggs/testing_prefix/for_export_buckets",
+        "CUBEJS_DB_EXPORT_GCS_CREDENTIALS": "${DRIVERS_TESTS_CUBEJS_DB_EXPORT_GCS_CREDENTIALS}"
+      }
+    }
   }
 },
 "cube": {
```

packages/cubejs-testing-drivers/package.json

Lines changed: 2 additions & 0 deletions
```diff
@@ -32,6 +32,8 @@
     "databricks-jdbc-export-bucket-s3-prefix-full": "yarn test-driver -i dist/test/databricks-jdbc-export-bucket-s3-prefix-full.test.js",
     "databricks-jdbc-export-bucket-azure-full": "yarn test-driver -i dist/test/databricks-jdbc-export-bucket-azure-full.test.js",
     "databricks-jdbc-export-bucket-azure-prefix-full": "yarn test-driver -i dist/test/databricks-jdbc-export-bucket-azure-prefix-full.test.js",
+    "databricks-jdbc-export-bucket-gcs-full": "yarn test-driver -i dist/test/databricks-jdbc-export-bucket-gcs-full.test.js",
+    "databricks-jdbc-export-bucket-gcs-prefix-full": "yarn test-driver -i dist/test/databricks-jdbc-export-bucket-gcs-prefix-full.test.js",
     "mssql-driver": "yarn test-driver -i dist/test/mssql-driver.test.js",
     "mssql-core": "yarn test-driver -i dist/test/mssql-core.test.js",
     "mssql-full": "yarn test-driver -i dist/test/mssql-full.test.js",
```
