This repository was archived by the owner on Dec 4, 2019. It is now read-only.

Long Time to Collect Results of Distributed Spark-Sklearn Training #114

@wjohnson

Description

I'm running a grid search over 15 combinations of LogisticRegression hyperparameters with spark-sklearn. All of the Spark tasks complete, but it then takes a very long time to collect the results back on the driver. My guess is that it's the size of the coefficient arrays being shipped back to the driver; I've seen the same behavior several times when working with wide datasets or deep random forests. Is this just expected network traffic?

Data set size: 31,358 rows, 10,000 columns

from sklearn.linear_model import LogisticRegression
from spark_sklearn import GridSearchCV

# Two parameter blocks: 3 C values x 3 solvers for l2 (9 combinations),
# plus 3 C values x 2 penalties for saga (6 combinations) = 15 in total.
param_grid = [
    dict(
        penalty=['l2'],
        C=[1.0, 0.5, 0.1],
        solver=['newton-cg', 'lbfgs', 'sag'],
    ),
    dict(
        penalty=['l1', 'elasticnet'],
        C=[1.0, 0.5, 0.1],
        solver=['saga'],
    ),
]

# sc is the SparkContext provided by the Databricks runtime;
# X_train / y_train are the 31,358 x 10,000 training data.
grid = GridSearchCV(sc, estimator=LogisticRegression(max_iter=500),
                    param_grid=param_grid, n_jobs=-1, cv=5)
grid_result = grid.fit(X_train, y_train)
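
As a rough way to test the coefficient-payload guess, here is a minimal sketch (hypothetical data, not from the original report) that pickles one fitted model to see how many bytes each completed task would ship back to the driver. Since the size of coef_ depends only on the feature count, a small stand-in row sample is enough:

import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: coef_ size depends only on n_features,
# so 1,000 rows is enough to measure the per-model payload.
n_features = 10_000
X_small = np.random.rand(1_000, n_features)
y_small = np.random.randint(0, 2, size=1_000)

model = LogisticRegression(max_iter=500, solver='lbfgs').fit(X_small, y_small)

payload_mb = len(pickle.dumps(model)) / 1e6
print(f"one fitted LogisticRegression: {payload_mb:.2f} MB")

# cv=5 folds x 15 parameter combinations = 75 fits in this grid.
print(f"75 fitted models: {75 * payload_mb:.1f} MB")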

Environment:

  • Azure Databricks ML Runtime 5.5
  • 9 worker nodes, each with 56 GB RAM and 8 cores
