This repository was archived by the owner on Mar 10, 2025. It is now read-only.
Query Test Runs
Denny Lee edited this page Mar 11, 2017
This page records query test runs using the different Spark-to-DocumentDB connector methods.
The results below were collected by connecting Spark to DocumentDB via pyDocumentDB with the following configuration:
- Single VM Spark cluster (one master, one worker) on Azure DS11 v2 VM (14GB RAM, 2 cores) running Ubuntu 16.04 LTS using Spark 2.1.
- DocumentDB single partition collection configured to 10,000 RUs
- `airport.codes` has 512 documents
- `DepartureDelays.flights` has 1.05M documents (single collection)
- `DepartureDelays.flights (pColl)` has 1.39M documents (partitioned collection)
The queries were:
- Q1: `SELECT c.City FROM c WHERE c.State='WA'`
- Q2: `SELECT TOP 100 c.date, c.delay, c.distance, c.origin, c.destination FROM c`
- Q3: `SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'`
- Q4: `SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c`
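The response times reported on this page have the shape of Python `timedelta` values. The page does not show the timing code, so the following is only a minimal sketch of how a first and second run could be measured, not the harness actually used. `time_query` and `fake_query` are hypothetical names. The stand-in query returns seven in-memory rows so the example runs without a DocumentDB account. In a real run, the callable would wrap the pyDocumentDB query for the collection under test.

```python
from datetime import datetime, timedelta


def time_query(run_query):
    """Run `run_query` once and return (row_count, elapsed timedelta)."""
    start = datetime.now()
    # Materialize the results so the timing covers fetching every row,
    # not just issuing the query.
    rows = list(run_query())
    return len(rows), datetime.now() - start


def fake_query():
    """Hypothetical stand-in for a real query; yields seven in-memory rows."""
    return ({"City": "Seattle", "State": "WA"} for _ in range(7))


# Time the same query twice, mirroring the "First" and "Second" columns.
count, first = time_query(fake_query)
_, second = time_query(fake_query)
print(count, first, second)
```

Materializing the result with `list(...)` matters here: a lazily evaluated result would otherwise be timed before any rows were fetched.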
Below are the results from querying a single collection:
| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|---|---|---|---|---|
| Q1 | 7 | airport.codes | 0:00:00.225645 | 0:00:00.006784 |
| Q2 | 100 | DepartureDelays.flights | 0:00:00.214985 | 0:00:00.009669 |
| Q3 | 14,808 | DepartureDelays.flights | 0:00:01.498699 | 0:00:01.323917 |
| Q4 | 1,048,575 | DepartureDelays.flights | 0:01:37.518344 | |
Below are the results from querying a partitioned collection (25 partitions):
| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|---|---|---|---|---|
| Q2 | 100 | DepartureDelays.flights (pColl) | 0:00:00.774820 | 0:00:00.508290 |
| Q3 | 23,078 | DepartureDelays.flights (pColl) | 0:00:05.146107 | 0:00:03.234670 |
| Q4 | 1,391,578 | DepartureDelays.flights (pColl) | 0:02:36.335267 | |
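To compare the runs more directly, the reported times can be converted into rows per second and a first-to-second-run ratio. The sketch below does this with the figures copied from the two tables. `parse_elapsed` and `summarize` are hypothetical helper names, and the `single` / `pColl` labels are shorthand for the two collection types.

```python
from datetime import timedelta


def parse_elapsed(text):
    """Parse an 'H:MM:SS.ffffff' string into a timedelta."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


# (query, collection, rows, first run, second run or None), copied from the tables.
runs = [
    ("Q1", "single", 7, "0:00:00.225645", "0:00:00.006784"),
    ("Q2", "single", 100, "0:00:00.214985", "0:00:00.009669"),
    ("Q3", "single", 14808, "0:00:01.498699", "0:00:01.323917"),
    ("Q4", "single", 1048575, "0:01:37.518344", None),
    ("Q2", "pColl", 100, "0:00:00.774820", "0:00:00.508290"),
    ("Q3", "pColl", 23078, "0:00:05.146107", "0:00:03.234670"),
    ("Q4", "pColl", 1391578, "0:02:36.335267", None),
]


def summarize(rows, first, second):
    """Return (rows per second on the first run, first/second ratio or None)."""
    first_s = parse_elapsed(first).total_seconds()
    rows_per_s = rows / first_s
    speedup = first_s / parse_elapsed(second).total_seconds() if second else None
    return rows_per_s, speedup


for query, kind, rows, first, second in runs:
    rows_per_s, speedup = summarize(rows, first, second)
    speedup_text = f"{speedup:.1f}x" if speedup is not None else "n/a"
    print(f"{query} ({kind}): {rows_per_s:,.0f} rows/s first run, "
          f"second-run speedup {speedup_text}")
```

On these figures, the full scans (Q4) work out to roughly 10,750 rows/s for the single collection and roughly 8,900 rows/s for the partitioned collection. The second run helps most for the small result sets (Q1 and Q2 on the single collection).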