Commit f210ffd

README: consistently write "web graph" or "WebGraph" referencing the framework
1 parent 29620f6 commit f210ffd

2 files changed

Lines changed: 13 additions & 13 deletions

README.md

Lines changed: 6 additions & 6 deletions
@@ -11,7 +11,7 @@ The Java tools are compiled and packaged by [Maven](https://maven.apache.org/).
1111
java -cp target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar <classname> <args>...
1212
```
1313

14-
The assembly jar file includes also the [WebGraph](https://webgraph.di.unimi.it/) and [LAW](https://law.di.unimi.it/software.php) packages required to process the webgraphs and compute [PageRank](https://en.wikipedia.org/wiki/PageRank) or [Harmonic Centrality](https://en.wikipedia.org/wiki/Centrality#Harmonic_centrality).
14+
The assembly jar file includes also the [WebGraph](https://webgraph.di.unimi.it/) and [LAW](https://law.di.unimi.it/software.php) packages required to process the web graphs and compute [PageRank](https://en.wikipedia.org/wiki/PageRank) or [Harmonic Centrality](https://en.wikipedia.org/wiki/Centrality#Harmonic_centrality).
1515

1616

1717
### Javadocs
@@ -26,7 +26,7 @@ Run `mvn spotless:check` and `mvn spotless:apply`, see the [Spotless Maven guide
2626

2727
## Memory and Disk Requirements
2828

29-
Note that the webgraphs are usually multiple Gigabytes in size and require for processing
29+
Note that the web graphs are usually multiple Gigabytes in size and require for processing
3030
- a sufficient Java heap size ([Java option](https://docs.oracle.com/en/java/javase/21/docs/specs/man/java.html#extra-options-for-java) `-Xmx`)
3131
- enough disk space to store the graphs and temporary data.
3232

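To check which heap limit the JVM actually received from `-Xmx`, a small helper along the following lines could be used. This is a minimal sketch, not part of cc-webgraph: the class `HeapCheck` and its methods are hypothetical names, and only the standard `Runtime.maxMemory()` call is relied upon.

```java
/** Hypothetical helper (not part of cc-webgraph): reports the JVM heap limit. */
class HeapCheck {

    /** Maximum heap in MiB, as set via -Xmx or chosen by the JVM by default. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** True if the heap limit is at least the given number of MiB. */
    static boolean hasHeapMiB(long requiredMiB) {
        return maxHeapMiB() >= requiredMiB;
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it with different `-Xmx` values shows whether the option took effect before a large graph is loaded.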
@@ -49,7 +49,7 @@ To analyze the graph structure and calculate rankings you may further process th
4949

5050
A couple of scripts that may help you run the WebGraph tools to build and process the graphs are provided in [src/script/webgraph_ranking/](src/script/webgraph_ranking/). They're also used to prepare the Common Crawl web graph releases.
5151

52-
To process a webgraph and rank the nodes, you should first adapt the configuration to your graph and hardware setup:
52+
To process a web graph and rank the nodes, you should first adapt the configuration to your graph and hardware setup:
5353
```
5454
vi ./src/script/webgraph_ranking/webgraph_config.sh
5555
```
@@ -59,12 +59,12 @@ After running
5959
```
6060
the `output_dir/` should contain all generated files, e.g. `graph_name.graph` and `graph_name-ranks.txt.gz`.
6161

62-
The shell script is easily adapted to your needs. Please refer to the [LAW dataset tutorial](https://law.di.unimi.it/tutorial.php), the [API docs of LAW](https://law.di.unimi.it/software/law-docs/index.html) and [webgraph](https://webgraph.di.unimi.it/docs/) for further information.
62+
The shell script is easily adapted to your needs. Please refer to the [LAW dataset tutorial](https://law.di.unimi.it/tutorial.php), the [API docs of LAW](https://law.di.unimi.it/software/law-docs/index.html) and [WebGraph](https://webgraph.di.unimi.it/docs/) for further information.
6363

6464

65-
## Exploring Webgraph Data Sets
65+
## Exploring Web Graph Data Sets
6666

67-
The Common Crawl webgraph data sets are announced on the [Common Crawl web site](https://commoncrawl.org/tag/webgraph/).
67+
The Common Crawl web graph data sets are announced on the [Common Crawl web site](https://commoncrawl.org/tag/webgraph/).
6868

6969
For instructions on how to explore the web graphs using the JShell, please see the tutorial [Interactive Graph Exploration](./graph-exploration-README.md).
7070

graph-exploration-README.md

Lines changed: 7 additions & 7 deletions
@@ -1,6 +1,6 @@
11
# Interactive Graph Exploration
22

3-
A tutorial how to interactively explore the Common Crawl webgraphs – or other graphs using the webgraph format – using the [JShell](https://docs.oracle.com/en/java/javase/21/jshell/index.html) and the [GraphExplorer](src/main/java/org/commoncrawl/webgraph/explore/GraphExplorer.java) class.
3+
A tutorial how to interactively explore the Common Crawl web graphs – or other graphs using the [WebGraph](https://webgraph.di.unimi.it/) format – using the [JShell](https://docs.oracle.com/en/java/javase/21/jshell/index.html) and the [GraphExplorer](src/main/java/org/commoncrawl/webgraph/explore/GraphExplorer.java) class.
44

55

66
## Quick Start
@@ -117,13 +117,13 @@ A tutorial how to interactively explore the Common Crawl webgraphs – or other
117117
## Using the Java Classes
118118

119119
The Java classes "GraphExplorer" and "Graph" bundle a set of methods which help exploring the graphs:
120-
- load the webgraph, its transpose and the vertex map
120+
- load the web graph, its transpose and the vertex map
121121
- access the vertices and their successors or predecessors
122122
- utilities to import or export a list of vertices or counts from or into a file
123123

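To illustrate what "its transpose" means in the list above, here is a toy sketch that uses plain arrays of successor lists. It does not use the WebGraph or cc-webgraph APIs; the class `TransposeDemo` and its method are hypothetical and only show the idea that the successors in the transposed graph are the predecessors in the original one.

```java
import java.util.Arrays;

/** Toy illustration (hypothetical, not the WebGraph API) of graph transposition. */
class TransposeDemo {

    /** For every arc u -> v in the input, the result contains the arc v -> u. */
    static int[][] transpose(int[][] successors) {
        int n = successors.length;
        // Count incoming arcs: the indegree of v is its outdegree in the transpose.
        int[] indegree = new int[n];
        for (int[] succ : successors) {
            for (int v : succ) {
                indegree[v]++;
            }
        }
        int[][] result = new int[n][];
        for (int v = 0; v < n; v++) {
            result[v] = new int[indegree[v]];
        }
        // Fill the predecessor lists; iterating u in ascending order keeps them sorted.
        int[] next = new int[n];
        for (int u = 0; u < n; u++) {
            for (int v : successors[u]) {
                result[v][next[v]++] = u;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] g = { {1, 2}, {2}, {} };   // arcs: 0->1, 0->2, 1->2
        int[][] gT = transpose(g);
        // Successors of vertex 2 in the transpose = predecessors of 2 in g.
        System.out.println(Arrays.toString(gT[2]));
    }
}
```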
124124
The methods are bundled in the classes of the Java package `org.commoncrawl.webgraph.explore`. To get an overview over all provided methods, inspect the source code or see the section [Javadocs](README.md#javadocs) in the main README for how to read the Javadocs. Here only few examples are presented.
125125

126-
We start again with launching the JShell and loading a webgraph:
126+
We start again with launching the JShell and loading a web graph:
127127

128128
```
129129
$> jshell --class-path "$CC_WEBGRAPH_JAR" \
@@ -144,7 +144,7 @@ jshell> e.getGraph()
144144
$45 ==> org.commoncrawl.webgraph.explore.Graph@4f933fd1
145145
```
146146

147-
First, the vertices in the webgraphs are represented by numbers. So, we need to translate between vertex label and ID:
147+
First, the vertices in the web graphs are represented by numbers. So, we need to translate between vertex label and ID:
148148

149149
```
150150
jshell> g.vertexLabelToId("org.wikipedia")
@@ -154,7 +154,7 @@ jshell> g.vertexIdToLabel(115107569)
154154
$47 ==> "org.wikipedia"
155155
```
156156

157-
One important note: Common Crawl's webgraphs list the host or domain names in [reverse domain name notation](https://en.wikipedia.org/wiki/Reverse_domain_name_notation). The vertex lists are sorted by the reversed names in lexicographic order and then numbered continuously. This gives a close-to-perfect compression of the webgraphs themselves. Most of the arcs are close in terms of locality because subdomains or sites of the same region (by country-code top-level domain) are listed in one continuous block. Cf. the paper [The WebGraph Framework I: Compression Techniques](https://vigna.di.unimi.it/ftp/papers/WebGraphI.pdf) by Paolo Boldi and Sebastiano Vigna.
157+
One important note: Common Crawl's web graphs list the host or domain names in [reverse domain name notation](https://en.wikipedia.org/wiki/Reverse_domain_name_notation). The vertex lists are sorted by the reversed names in lexicographic order and then numbered continuously. This gives a close-to-perfect compression of the web graphs themselves. Most of the arcs are close in terms of locality because subdomains or sites of the same region (by country-code top-level domain) are listed in one continuous block. Cf. the paper [The WebGraph Framework I: Compression Techniques](https://vigna.di.unimi.it/ftp/papers/WebGraphI.pdf) by Paolo Boldi and Sebastiano Vigna.
158158

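The effect of sorting by reversed names can be sketched with a few made-up host names. This is only a toy illustration under simple assumptions (plain `String` ordering, list index used as vertex ID); the class `ReverseHostOrder` and its methods are hypothetical and are not the code used to build the Common Crawl vertex lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy illustration (hypothetical names) of reverse domain name ordering. */
class ReverseHostOrder {

    /** Reverses the dot-separated parts: "en.wikipedia.org" becomes "org.wikipedia.en". */
    static String reverse(String host) {
        List<String> parts = Arrays.asList(host.split("\\."));
        Collections.reverse(parts);
        return String.join(".", parts);
    }

    /** Reverses all names and sorts them; the list index then serves as vertex ID. */
    static List<String> vertexList(List<String> hosts) {
        List<String> labels = new ArrayList<>();
        for (String host : hosts) {
            labels.add(reverse(host));
        }
        Collections.sort(labels);
        return labels;
    }

    /** Looks up the ID of a label by binary search, or returns -1 if it is absent. */
    static int labelToId(List<String> labels, String label) {
        int pos = Collections.binarySearch(labels, label);
        return pos >= 0 ? pos : -1;
    }

    public static void main(String[] args) {
        List<String> labels = vertexList(List.of(
                "en.wikipedia.org", "example.com", "de.wikipedia.org",
                "wikipedia.org", "example.de"));
        // The three wikipedia.org names end up in one contiguous block of IDs.
        for (int id = 0; id < labels.size(); id++) {
            System.out.println(id + "\t" + labels.get(id));
        }
    }
}
```

Because related hosts receive neighbouring IDs, links between them connect vertices with nearby numbers, which is the locality the compression benefits from.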
159159
Now, let's look at how many other domains are linked from Wikipedia:
160160

@@ -163,7 +163,7 @@ jshell> g.outdegree("org.wikipedia")
163163
$46 ==> 2106338
164164
```
165165

166-
Another note: Common Crawl's webgraphs are based on sample crawls of the web. Like the crawls, the webgraphs are not complete, and Wikipedia may in reality link to far more domains. But 2 million linked domains is already not a small sample.
166+
Another note: Common Crawl's web graphs are based on sample crawls of the web. Like the crawls, the web graphs are not complete, and Wikipedia may in reality link to far more domains. But 2 million linked domains is already not a small sample.
167167

168168
The Graph class also gives you access to the successors of a vertex, as array or stream of integers, but also as stream of strings (vertex labels):
169169

@@ -215,7 +215,7 @@ abogado.super
215215
ac.789bet
216216
```
217217

218-
Technically, webgraphs only store successor lists. But the Graph class holds also two graphs: the "original" one and its transpose. In the transposed graph "successors" are "predecessors", and "outdegree" means "indegree". Some methods on a deeper level take one of the two webgraphs as argument, here it makes a difference whether you pass `g.graph` or `g.graphT`, here to a method which translates vertex IDs to labels and extracts the top-level domain:
218+
Technically, web graphs only store successor lists. But the Graph class holds also two graphs: the "original" one and its transpose. In the transposed graph "successors" are "predecessors", and "outdegree" means "indegree". Some methods on a deeper level take one of the two web graphs as argument, here it makes a difference whether you pass `g.graph` or `g.graphT`, here to a method which translates vertex IDs to labels and extracts the top-level domain:
219219

220220
```
221221
jshell> g.successorTopLevelDomainStream(g.graph, g.vertexLabelToId("org.wikipedia")).limit(5).forEach(System.out::println)
