This repository was archived by the owner on Dec 4, 2019. It is now read-only.

Long Time to Collect Results of Distributed Spark-Sklearn Training #114

@wjohnson

Description

I'm running a grid search over 15 combinations of LogisticRegression hyperparameters with spark-sklearn. All of the Spark tasks complete, but it then takes a very long time to collect the results back on the driver. My guess is that it's the size of the coefficient arrays being shipped back to the driver; I've seen the same behavior several times when working with wide datasets or deep random forests. Is this just expected network traffic?

Data set size: 31,358 rows, 10,000 columns

from sklearn.linear_model import LogisticRegression
from spark_sklearn import GridSearchCV

# Two parameter blocks: 3 C values x 3 solvers for l2 (9 combinations),
# plus 3 C values x 2 penalties for saga (6 combinations) = 15 in total.
param_grid = [
    dict(
        penalty=['l2'],
        C=[1.0, 0.5, 0.1],
        solver=['newton-cg', 'lbfgs', 'sag'],
    ),
    dict(
        penalty=['l1', 'elasticnet'],
        C=[1.0, 0.5, 0.1],
        solver=['saga'],
    ),
]

# sc is the SparkContext provided by the Databricks runtime;
# X_train / y_train are the 31,358 x 10,000 training data.
grid = GridSearchCV(sc, estimator=LogisticRegression(max_iter=500),
                    param_grid=param_grid, n_jobs=-1, cv=5)
grid_result = grid.fit(X_train, y_train)
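
As a rough way to test the coefficient-payload guess, here is a minimal sketch (hypothetical data, not from the original report) that pickles one fitted model to see how many bytes each completed task would ship back to the driver. Since the size of coef_ depends only on the feature count, a small stand-in row sample is enough:

import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: coef_ size depends only on n_features,
# so 1,000 rows is enough to measure the per-model payload.
n_features = 10_000
X_small = np.random.rand(1_000, n_features)
y_small = np.random.randint(0, 2, size=1_000)

model = LogisticRegression(max_iter=500, solver='lbfgs').fit(X_small, y_small)

payload_mb = len(pickle.dumps(model)) / 1e6
print(f"one fitted LogisticRegression: {payload_mb:.2f} MB")

# cv=5 folds x 15 parameter combinations = 75 fits in this grid.
print(f"75 fitted models: {75 * payload_mb:.1f} MB")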

Environment:

  • Azure Databricks ML Runtime 5.5
  • 9 worker nodes, each with 56 GB RAM and 8 cores
