Query exporting only 2ˆ28 documents (data have more) even after setting query limit to 2ˆ29 #5176

Closed
laisuchoa opened this issue Aug 14, 2024 · 1 comment · Fixed by #5515

laisuchoa commented Aug 14, 2024

Description

We're working with a CouchDB database that holds about 272 million records. When attempting to retrieve all documents with a curl command and writing the output to a file, we noticed that only 268435456 (2^28) records were returned, despite the database having more entries.

The curl command in question:

curl -k -X GET "https://$(cat credential_couchdb)@{host}/{db}/_all_docs?include_docs=true" > all_docs

This led us to suspect a default query limit issue, which aligns with discussions in this GitHub pull request.

To resolve this, we updated the config file to set both partition_query_limit and query_limit to 536870912 (2^29).

"query_server_config":{"partition_query_limit":"536870912","query_limit":"536870912"}

Despite this change, and although the offset field in the output file shows the correct total number of records, the output still contains only 268435456 records.

Expected Behaviour

We expect the number of records in the output file to match the offset field value rather than being capped at this default limit.

Additional Context

The CouchDB instance in question is integrated as a state database within a Hyperledger Fabric setup.
The database is updated once a day.

@lucasmation

laisuchoa changed the title from "Query limit configuration not reflecting expected document retrieval count in large database" to "Query exporting only 2ˆ28 documents (data have more) even after setting query limit to 2ˆ29" on Aug 19, 2024

rnewson commented Aug 19, 2024

-define(MAX_VIEW_LIMIT, 16#10000000).

limit = ?MAX_VIEW_LIMIT,

You may need to paginate, or pass a larger ?limit= query parameter.
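(For context, 16#10000000 is Erlang hex notation for 268435456, i.e. 2^28, which matches the truncated row count reported above. The following is not part of the original thread: a minimal pagination sketch along the lines of the suggestion, assuming jq is available and reusing the {host}/{db} and credential_couchdb placeholders from the report; the page size and the page.json / all_docs.ndjson file names are arbitrary illustrations.)

PAGE_SIZE=1000000
STARTKEY=""
BASE="https://$(cat credential_couchdb)@{host}/{db}/_all_docs?include_docs=true&limit=$PAGE_SIZE"
while true; do
  if [ -z "$STARTKEY" ]; then
    curl -sk "$BASE" > page.json
  else
    # Resume from the last key of the previous page; skip=1 drops the row we already have.
    curl -sk "$BASE&startkey=$STARTKEY&skip=1" > page.json
  fi
  jq -c '.rows[]' page.json >> all_docs.ndjson     # append one row per line
  ROWS=$(jq '.rows | length' page.json)
  [ "$ROWS" -lt "$PAGE_SIZE" ] && break            # a short page means we reached the end
  STARTKEY=$(jq -r '.rows[-1].key | @json | @uri' page.json)
done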

nickva added a commit that referenced this issue Apr 24, 2025
Previously, we set a default limit that was an effective "infinity" (2^28).
However, that turned out to be too low and it surprised a user when it
truncated their all_docs output skipping the rest of the data. Fix that by
increasing the limit to 100B.

We have a "query_limit" config parameter to increase the limit; however, it turned out to be broken and did not take effect when the user tried it, so fix that as well.

Since we still support using the local API (localhost:5986 or _node/$node), and
the worker function is used for the clustered and local calls, the
`apply_limit/2` function is called twice for the same request: once on the
coordinator, and then on each worker. Since the coordinator sets limit as limit
+ skip, that means that we'd be failing all skip > 0 calls on the
coordinator (limit + skip > max_limit). To handle that use an extra option flag
to indicate that we already applied the limit and don't apply it again.

When running the tests, we also discovered that our _dbs_info and list endpoints did not validate query parameters, so ensure we validate them, which means users can now configure query limits for them.

Fix #5176
nickva added a commit that referenced this issue Apr 25, 2025
Previously, we set a default limit that was an effective "infinity" (2^28).
However, that turned out to be too low and it surprised a user when it
truncated their all_docs output skipping the rest of the data. Fix that by
increasing the limit to 100B.

We have a "query_limit" config parameter to customize the limit; however, it turned out to be broken and did not take effect, so fix that as well.

Since we still support using the local API (localhost:5986 or _node/$node) the
`apply_limit/2` function is called twice for the same request: once on the
coordinator, and then on each worker. Since the coordinator sets `limit` as
`limit+skip`, that means that we'd be failing all `skip > 0` calls on the
coordinator (`limit + skip > max_limit`). To handle that, use an extra option
flag to indicate that we already applied the limit and don't apply it again.

When running the tests, we also discovered that our `_dbs_info` and `list` endpoints did not validate query parameters, so ensure we validate them, which means users can now configure query limits for them.

Fix #5176
nickva added a commit that referenced this issue Apr 25, 2025
Previously, we set a default limit that was an effective infinity (2^28). It seems that back in the 32-bit days this was Erlang's largest small integer [1].
However, that turned out to be too low and it surprised a user when it
truncated their all_docs output skipping some of the data. Fix that by
increasing the limit to a larger "infinity" (highest 64 bit Erlang small
integer [1]).

We did have a "query_limit" config parameter to customize the limit; however, it turned out to be broken and did not take effect when the user tried it for all_docs. Fix that and add a test to ensure the limit gets reduced appropriately. To make the setting more user friendly, allow `infinity` as the value.

Also, in the case of all_docs, we validated args and applied the limit check twice: once in the coordinator and again on each worker, which wasted CPU resources and made things a bit confusing. To fix that, remove the validation from the common worker code in couch_mrview and validate once, either on the coordinator side or in the local (port 5986) callback, right in the HTTP callback.

[1] https://www.erlang.org/doc/system/memory.html

Fix #5176
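(Not part of the commit message: on a release containing this change, the new `infinity` value could presumably be set the same way as the numeric limit earlier in this thread — a sketch only, assuming the setting still lives in the query_server_config section and the same node config API applies.)

curl -k -X PUT "https://$(cat credential_couchdb)@{host}/_node/_local/_config/query_server_config/query_limit" -d '"infinity"'  # sketch: remove the effective cap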
nickva closed this as completed in 7aa8a4e on Apr 30, 2025