p2p: Improved peer selection with /24 subnet deduplication to disadvantage 'spy nodes' #9939
base: master
Conversation
First pass: some high-level commentary, and also some reminders to use `const` in variable declarations when possible.
// build a set of all the /16 we're connected to, and prefer a peer that's not in that set
std::set<uint32_t> classB;
if (&zone == &m_network_zones.at(epee::net_utils::zone::public_)) // at returns reference, not copy
// Build a list of all distinct /24 subnets we are connected to right now; to catch
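For illustration only (not the PR's actual code, and assuming the address is already in host byte order): a minimal sketch of how a /16 versus /24 key can be derived from an IPv4 address by masking, and how such keys could be collected into a set like the `classB` set above.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Keep only the network bits of a host-order IPv4 address.
inline uint32_t subnet16_key(uint32_t ip) { return ip & 0xFFFF0000u; } // /16 ("class B")
inline uint32_t subnet24_key(uint32_t ip) { return ip & 0xFFFFFF00u; } // /24

// Collect the distinct /24 subnets of the peers we are currently connected to.
std::set<uint32_t> connected_subnets24(const std::vector<uint32_t>& peer_ips)
{
  std::set<uint32_t> subnets;
  for (const uint32_t ip : peer_ips)
    subnets.insert(subnet24_key(ip));
  return subnets;
}
```

Narrowing from /16 to /24 only changes which low bits are zeroed out, so the exclusion set becomes finer-grained.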
Why narrow the subnets of connections we avoid duplicating from /16 to /24? This would allow the code to make connections to IPs more similar to the set of connections that we already have, not less.
The background here is @Rucknium's paper, to be found here; see especially Figure 2 on page 5.
Rucknium and I discussed /16 versus /24 back and forth over several weeks. If I read the backlog correctly, I asked whether /24, instead of the /16 proposed in the paper, would be alright as well, and Rucknium agreed to attempt that. They also did a large number of test runs with statistics to check whether my code actually worked, and to what degree, and /24 seemed to work out.
When proposing "going finer", I imagined that with /24 you exclude fewer honest nodes that just have the bad luck to be in the same subnet as a spy node. With /16 this danger seems much bigger to me. And, as of now, the spy nodes do us the favor of mostly crowding nicely into /24 subnets that they fill completely, using all 256 IPs there.
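For scale, a bit of arithmetic not in the original comment: a /16 block is 256 times larger than a /24 block, so excluding at the /16 level sweeps in far more bystander addresses around any spy-hosting /24.

$$
2^{32-16} = 65{,}536 \ \text{addresses per /16}, \qquad
2^{32-24} = 256 \ \text{addresses per /24}, \qquad
\frac{65{,}536}{256} = 256 \ \text{distinct /24 blocks per /16}.
$$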
> This would allow the code to make connections to IPs more similar to the set of connections that we already have, not less.
Not sure how to weigh that. In any case, the whole peer selection process is essentially random on several levels. I don't think there is a real danger that we accidentally seek out /24 subnets that are immediate neighbours of spy-node-infested subnets and are thus probably infested themselves. I would also speculate that if this were a real danger, Rucknium's test runs should have unearthed it.
I think we're misunderstanding each other here. I mean that the old code was already de-duplicating against subnets it was already connected to; that is different from the new type of de-duplication where we treat many IPs under the same /24 subnet as having one total vote. Did Rucknium discuss this type of de-duplication at all?
I don't think we have a misunderstanding. Yes, it's about this very first phase of making a list of subnets we are already connected to. Originally built as a list of /16 subnets, now narrowed down to a list of /24 subnets.
And yes, Rucknium and I understood it as such. As I said, we discussed back and forth whether to use different subnet sizes for this initial "connected to" list and for the later candidate filtering, or to treat everything as /24. The latter is where we landed, and Rucknium gave their OK to it after some simulations and/or test runs with their actual daemon, as far as I remember.
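To make the two phases being discussed concrete, here is a hedged sketch (illustrative only, with invented names; not the PR's actual implementation) of how they differ: phase one excludes candidates whose /24 we are already connected to, and phase two collapses the remaining candidates so that each /24 contributes at most one entry before the uniform random pick.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <random>
#include <set>
#include <vector>

// Illustrative /24 key for a host-order IPv4 address.
static uint32_t subnet24(uint32_t ip) { return ip & 0xFFFFFF00u; }

uint32_t pick_peer(const std::vector<uint32_t>& candidates,
                   const std::vector<uint32_t>& connected,
                   std::mt19937& rng)
{
  // Phase 1: the "already connected to" list, now built at /24 granularity.
  std::set<uint32_t> connected_subnets;
  for (const uint32_t ip : connected)
    connected_subnets.insert(subnet24(ip));

  // Phase 2: one entry ("one vote") per remaining /24. The real code may pick a
  // random representative within a subnet; "first seen wins" keeps the sketch short.
  std::map<uint32_t, uint32_t> one_per_subnet;
  for (const uint32_t ip : candidates)
  {
    if (connected_subnets.count(subnet24(ip)))
      continue; // skip: we already have a connection into this /24
    one_per_subnet.emplace(subnet24(ip), ip);
  }
  if (one_per_subnet.empty())
    return 0; // sentinel: caller would fall back to another strategy

  // Uniform random choice among the surviving subnets.
  std::uniform_int_distribution<std::size_t> dist(0, one_per_subnet.size() - 1);
  return std::next(one_per_subnet.begin(), static_cast<std::ptrdiff_t>(dist(rng)))->second;
}
```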
I ran some simulations that produced realistic network connections when nodes adopt different subnet connection schemes.
The simulation works like this:
1. A set of IP addresses of reachable nodes is collected, based on a Monero network scan from May 2025. Only reachable nodes are used in the simulation, but the network simulation function has the option to set up unreachable nodes if desired. Since the IP addresses of unreachable nodes are unknown, they cannot be used for the network latency calculations later.
2. Each node follows a procedure to establish "connections" to other nodes:
3. The node considers the whole list of IP addresses. Any node it is already connected to is excluded. Any node that is in a subnet of a node that it is already connected to is also excluded (this subnet level is a configurable parameter).
4. The remaining candidate list is de-duplicated based on a specified subnet level, which can be different from the already-connected subnet level.
5. A peer is chosen from this list with uniform random selection.
6. Steps 3-5 are performed 12 times to fill up the default outbound connection limit.
7. A peer churning process is simulated. This involves randomly dropping a peer from the outbound peer list and re-running steps 3-5 to fill the dropped peer's slot. 100 peers are dropped and their slots refilled. This churn step is needed because the "already-connected" criterion statistically depends on the number of peer slots filled. (A simplified code sketch of steps 3, 5, and 6 and of this churn step follows after this list.)
8. Nodes that are on the MRL ban list accept inbound connections but do not establish their own outbound connections. In the simulation, they are skipped when it is their turn to create connections.
Note that any "limits" on the number of inbound peer connections to nodes are ignored.
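As referenced in the list above, a rough C++-style sketch of the per-node procedure (an abstraction of the steps described in the list, not the actual R simulation source; all names are invented, and step 4, the candidate-level subnet de-duplication, is left out to keep the sketch short):

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <set>
#include <vector>

static uint32_t subnet24(uint32_t ip) { return ip & 0xFFFFFF00u; }

// Steps 3 and 5, compressed: exclude peers we are connected to (and peers whose
// /24 we are connected to), then pick uniformly at random among the rest.
static uint32_t select_peer(const std::vector<uint32_t>& all_ips,
                            const std::vector<uint32_t>& connected,
                            std::mt19937& rng)
{
  std::set<uint32_t> used_ips(connected.begin(), connected.end());
  std::set<uint32_t> used_nets;
  for (const uint32_t ip : connected)
    used_nets.insert(subnet24(ip));

  std::vector<uint32_t> eligible;
  for (const uint32_t ip : all_ips)
    if (!used_ips.count(ip) && !used_nets.count(subnet24(ip)))
      eligible.push_back(ip);
  if (eligible.empty())
    return 0; // sentinel: nothing eligible

  std::uniform_int_distribution<std::size_t> pick(0, eligible.size() - 1);
  return eligible[pick(rng)];
}

// Step 6 plus the churn step: fill the 12 default outbound slots, then drop a
// random peer and refill its slot, 100 times.
std::vector<uint32_t> simulate_node(const std::vector<uint32_t>& all_ips, std::mt19937& rng)
{
  std::vector<uint32_t> connected;
  while (connected.size() < 12)
  {
    const uint32_t peer = select_peer(all_ips, connected, rng);
    if (peer == 0) break;
    connected.push_back(peer);
  }
  for (int churn = 0; churn < 100 && !connected.empty(); ++churn)
  {
    std::uniform_int_distribution<std::size_t> drop(0, connected.size() - 1);
    connected.erase(connected.begin() + static_cast<std::ptrdiff_t>(drop(rng)));
    const uint32_t peer = select_peer(all_ips, connected, rng);
    if (peer != 0) connected.push_back(peer);
  }
  return connected;
}
```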
I performed simulations of four scenarios using the `gen.network()` function of version 0.2.4 of my `xmrpeers` R package. The scenarios are the combinations of the "already-connected" subnet level being set to /16 or /24, and subnet deduplication either being enabled at the /24 level or not being enabled at all.
- The status quo scenario is the "already-connected" subnet level set to /16 and subnet deduplication not enabled.
- The scenario of this PR is the "already-connected" subnet level set to /24 and subnet deduplication enabled.
The simulation creates a network/graph edgelist. The edgelist can be used to compute the number of outbound connections of honest nodes to suspected spy nodes. It can also be used to roughly measure network latency. Below are some tables.
Average number of connections to spy nodes
already.connected.subnet.level | do.deduplication | mean.malicious.connections |
---|---|---|
16 | TRUE | 1.047757 |
24 | TRUE | 1.064399 |
16 | FALSE | 3.582851 |
24 | FALSE | 3.626628 |
When honest nodes have the default 12 outbound connections and subnet deduplication is enabled, setting the "already-connected" subnet qualification to the /16 subnet level gets you an average of 1.048 connections to the spy nodes. When it is /24, the average is 1.064.
@jeffro256, you asked:
> But there is also geographical distribution concerns also, not related to spy nodes. How does going from /16 to /24 affect that? A less diverse set of already-connected peers might slow down tx&block propagation for those farthest away from the "center" of the node graph
According to Wikipedia:
> The eccentricity $\epsilon(v)$ of a vertex $v$ is the greatest distance between $v$ and any other vertex. … It can be thought of as how far a node is from the node most distant from it in the graph.
The eccentricity of every node can be measured. For eccentricity computations, the spy nodes were removed from the network because they do not establish outbound connections.
The edges of the network graph were weighted by the geographic distance between the IP addresses. The unit of each edge weight is the number of milliseconds it would take light to travel along the Earth's surface on the shortest path between the two points. Actual internet latencies will be slower, because internet cables do not follow the shortest path between each IP address's geographic location and because of processing at internet exchange points.
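As a worked example of the unit (my arithmetic, mirroring the `1000 * weighted.eccentricity / speed.of.light` conversion in the reproduction code below): an edge whose endpoints are 20,000 km apart along the Earth's surface would get a weight of

$$
t \;=\; \frac{1000 \cdot d}{c} \;=\; \frac{1000 \cdot 2\times 10^{7}\ \text{m}}{299{,}792{,}458\ \text{m/s}} \;\approx\; 66.7\ \text{ms}.
$$

The eccentricity values in the tables are then the weighted shortest-path distances from each node to its most distant node, per the definition quoted above.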
Mean eccentricity, in milliseconds
already.connected.subnet.level | do.deduplication | mean.weighted.eccentricity |
---|---|---|
16 | TRUE | 69.41301 |
24 | TRUE | 68.58541 |
16 | FALSE | 70.05076 |
24 | FALSE | 70.27645 |
When subnet deduplication is enabled and the "already-connected" subnet qualification is /16, the time that a photon could travel from an average node to the "most distant" node in the network is 69.41 milliseconds. The next table computes the maximum, i.e. the maximum distance between any two nodes on the network.
Maximum eccentricity, in milliseconds
already.connected.subnet.level | do.deduplication | max.weighted.eccentricity |
---|---|---|
16 | TRUE | 106.5484 |
24 | TRUE | 108.3377 |
16 | FALSE | 110.4373 |
24 | FALSE | 110.8845 |
We also get unweighted eccentricity, which just counts each hop as one unit. The unweighted eccentricity of every honest node in the network in every scenario was 4, i.e. the minimum number of hops needed to reach the node most distant from a given starting node was always 4, regardless of which node we start at. In a few of the simulation runs with a different random seed, a few of the nodes had an unweighted eccentricity of 5 when `do.deduplication` was `FALSE`. But with this random seed, every value is 4:
Table of unweighted eccentricity
already.connected.subnet.level | do.deduplication | table.unweighted.eccentricity |
---|---|---|
16 | TRUE | 4: 2764 |
24 | TRUE | 4: 2764 |
16 | FALSE | 4: 2764 |
24 | FALSE | 4: 2764 |
Code to reproduce the analysis is below. Note that this code makes network calls to a geo-IP service.
Reproduction code
# WARNING: This script makes network calls to ip-api.com
install.packages(c("remotes", "geosphere", "knitr"))
remotes::install_github("Rucknium/xmrpeers", upgrade = FALSE,
ref = "35408988f80a306627f5c10abb10ffb89f008bf4")
# ref is to get specific version so this is reproducible
library(data.table)
data("good_peers", "ban_list", package = "xmrpeers")
good_peers.cleaned <- good_peers
good_peers.cleaned <- stringr::str_extract(good_peers.cleaned, "[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}")
# Remove ports from IP addresses
good_peers.cleaned <- good_peers.cleaned[good_peers.cleaned != ""]
# Removes IPv6 addresses
good_peers.cleaned <- unique(na.omit(good_peers.cleaned))
# https://ip-api.com/docs/api:batch
# Docs say:
# This endpoint is limited to 15 requests per minute from an IP address.
# A request needs to formatted as a JSON array, containing up to 100 IP addresses or objects.
# Only POST requests are accepted.
# The returned HTTP header X-Rl contains the number of requests remaining in the
# current rate limit window. X-Ttl contains the seconds until the limit is reset.
# Your implementation should always check the value of the X-Rl header, and if
# its is 0 you must not send any more requests for the duration of X-Ttl in seconds.
good_peers.cleaned.split <- split(good_peers.cleaned,
f = rep(seq_len(ceiling(length(good_peers.cleaned) / 100)), each = 100)[seq_along(good_peers.cleaned)])
# Split into chunks of 100, with remainder
handle <- RCurl::getCurlHandle()
h = RCurl::basicHeaderGatherer()
w = RCurl::basicTextGatherer()
ip.geo <- list()
for (i in seq_along(good_peers.cleaned.split)) {
json.post <- RJSONIO::toJSON(
good_peers.cleaned.split[[i]]
)
# returns NULL
RCurl::postForm("http://ip-api.com/batch",
.opts = RCurl::curlOptions(
postfields = json.post,
httpheader = c('Content-Type' = 'application/json', Accept = 'application/json'),
headerfunction = h$update,
writefunc = w$update
# Updates the value of "h" and "w" as a side effect.
# (header = TRUE would print the header to the console)
),
curl = handle
)
# "postForm does not return response with headerfunction"
# https://github.com/omegahat/RCurl/issues/5
# "You should also pass the writefunc (e.g. basicTextGatherer) along with the headerfunction."
ip.geo[[i]] <- data.table::rbindlist(RJSONIO::fromJSON(w$value()), fill = TRUE,
ignore.attr = TRUE)
w$reset()
# Reset to make room for the new data
# "This can be used to reuse the same object across requests but to avoid
# cumulating new input with the material from previous requests."
throttle.limit <- as.numeric(h$value()[["X-Rl"]])
if (throttle.limit == 0 ) {
Sys.sleep(75)
# Sleep for 75 seconds if (unexpectedly) we have no more requests available
# during this minute.
}
message(i, " of ", length(good_peers.cleaned.split), " API requests completed. ",
"Throttle limit in current minute: ", throttle.limit)
Sys.sleep(5)
# Sleep 5 seconds. Gives one second tolerance to the limit of 15 requests per minute.
}
ip.geo <- data.table::rbindlist(ip.geo, fill = TRUE)
ip.geo <- ip.geo[, .(ip = query, lat, lon)]
stopifnot( all(complete.cases(ip.geo)))
# Make sure no missing values
future::plan(future::multisession,
workers = max(c(1, floor(parallelly::availableCores()/5))))
# Each process uses 400-500% CPU, so set to available threads divided by 5
peer.selection.scenarios <- expand.grid(
already.connected.subnet.level = c(16, 24),
do.deduplication = c(TRUE, FALSE))
results <- vector("list", nrow(peer.selection.scenarios))
set.seed(314)
for (scenario in seq_len(nrow(peer.selection.scenarios))) {
message("\nStarting simulation for scenario ", scenario, " of ",
nrow(peer.selection.scenarios))
already.connected.subnet.level <-
peer.selection.scenarios$already.connected.subnet.level[scenario]
do.deduplication <- peer.selection.scenarios$do.deduplication[scenario]
# Need these to be set as variables due to scoping issues with parallelized sessions
generated.network <- xmrpeers::gen.network(
outbound.ips = good_peers.cleaned,
malicious.ips = ban_list,
n.unreachable = 0,
already.connected.subnet.level = already.connected.subnet.level,
deduplication.subnet.level = 24,
do.deduplication = do.deduplication,
default.outbound.connections = 12,
dropped.connection.churns = 100,
compute.network.stats = FALSE
)
malicious.indices <- generated.network$nodes[ xmrpeers::in.ip.set(ip, ban_list), index]
nodes.honest <- generated.network$nodes[ ! index %in% malicious.indices, ]
edgelist.honest <- generated.network$edgelist[ ! origin %in% malicious.indices, ]
edgelist.honest.completely <- generated.network$edgelist[ ! destination %in% malicious.indices, ]
edgelist.honest[, malicious.connection := destination %in% malicious.indices]
malicious.connection.count <- edgelist.honest[, .(count = sum(malicious.connection)), by = "origin"]
results[[scenario]]$nrow <- nrow(malicious.connection.count)
results[[scenario]]$summary <- summary(malicious.connection.count$count)
nodes.honest <- merge(nodes.honest, ip.geo, by = "ip", all.x = TRUE)
p1 <- nodes.honest[match(edgelist.honest.completely$origin, index), .(lon, lat)]
p2 <- nodes.honest[match(edgelist.honest.completely$destination, index), .(lon, lat)]
edge.weights <- geosphere::distGeo(as.matrix(p1), as.matrix(p2))
edgelist.honest.completely[, weight := edge.weights]
# igraph::eccentricity() will automatically use weights
edgelist.graph <- igraph::graph_from_data_frame(edgelist.honest.completely, directed = FALSE)
# NOTE: graph_from_data_frame() is correct. graph_from_edgelist() gives incorrect
# results, probably from gaps in the index numbers.
weighted.eccentricity <- igraph::eccentricity(edgelist.graph)
speed.of.light <- 299792458 # meters per second
results[[scenario]]$weighted.eccentricity.summary <-
summary( 1000 * weighted.eccentricity / speed.of.light)
# Multiply by 1000 to get milliseconds
unweighted.eccentricity <- igraph::eccentricity(edgelist.graph, weights = NA)
results[[scenario]]$unweighted.eccentricity.table <-
table(unweighted.eccentricity)
}
scenarios.results <- data.table::as.data.table(peer.selection.scenarios)
scenarios.results[, mean.malicious.connections := sapply(results, function(x) x$summary[["Mean"]]) ]
scenarios.results[, mean.weighted.eccentricity :=
sapply(results, function(x) x$weighted.eccentricity.summary[["Mean"]]) ]
scenarios.results[, max.weighted.eccentricity :=
sapply(results, function(x) x$weighted.eccentricity.summary[["Max."]]) ]
scenarios.results[, table.unweighted.eccentricity :=
sapply(results, function(x) {
y <- x$unweighted.eccentricity.table
paste0(names(y), ": ", unclass(y), collapse = "; ")
}) ]
knitr::kable(scenarios.results[,
.(already.connected.subnet.level, do.deduplication, mean.malicious.connections)],
format = "pipe")
knitr::kable(scenarios.results[,
.(already.connected.subnet.level, do.deduplication, mean.weighted.eccentricity)],
format = "pipe")
knitr::kable(scenarios.results[,
.(already.connected.subnet.level, do.deduplication, max.weighted.eccentricity)],
format = "pipe")
knitr::kable(scenarios.results[,
.(already.connected.subnet.level, do.deduplication, table.unweighted.eccentricity)],
format = "pipe")
Anecdotally, this does a great job at reducing the number of connections to spy nodes. My running spy node connection count ratio went from ~22% to ~1.5%. You can run this patch to test it out: log_outgoing_spies.patch. Run your
2nd pass
This logic is looking pretty good, and does well in light stress testing. It does a good job of clearing out the spies, too. I'll be ready to approve after some small cleanups.
p2p: Improved peer selection with /24 subnet deduplication to disadvantage 'spy nodes'
For background info about those "spy nodes", check this GitHub issue, this Reddit post, and this wonderful work from Rucknium.
This PR was developed in close cooperation with @Rucknium, who gave much valuable input and also verified that the new method improves things: the daemon connects to considerably fewer "spy nodes" with it.
As I rewrote large parts of the `make_new_connection_from_peerlist` method, the diff against the old version of the method comes out quite confusing. For review, it may be easier to hold the two full method sources side by side.