
Web Crawling

PiggyPiglet edited this page Feb 6, 2021 · 3 revisions

Web crawling is how indexes are originally generated. The concept is relatively simple (although the implementation really isn't). First, find the type index of the javadoc, e.g. the JDK 11 index. If the index is split across multiple pages, as it is for the JDK, locate all of the index pages and load each one into a jsoup document.

Unfortunately, javadocs have changed quite a bit over the years, but some things stay consistent. It's essential that we rely on those consistent things to determine what's what in the HTML, unless we want to write multiple implementations for different javadoc versions. Luckily, I was able to find such a consistency for figuring out which entries in the indexes are types. Each anchor in the index has a title attribute, which always starts with the kind of thing the entry is, for example "class in java.util" or "annotation in java.util". This lets us easily determine whether an entry is a type (and at the beginning of the indexing process, we ONLY want types, not methods or fields).

So, we have a set containing the values "class", "enum", "annotation", and "interface". If the title attribute starts with one of those, we can be sure this entry in the index is a type, so we grab the URL of that type's javadoc page and pass it off to the page deserializer/crawler.
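A rough sketch of that check, not the project's actual code. The class and method names (`IndexEntryFilter`, `isType`) are made up for illustration, and the title string is assumed to have already been read from the anchor's title attribute:

```java
import java.util.Set;

// Hypothetical helper: decides whether an index entry is a type,
// based only on how its title attribute starts.
final class IndexEntryFilter {
    // The four prefixes named in the text above.
    private static final Set<String> TYPE_PREFIXES =
            Set.of("class", "enum", "annotation", "interface");

    // True when the title starts with one of the known type kinds.
    static boolean isType(String title) {
        return title != null && TYPE_PREFIXES.stream().anyMatch(title::startsWith);
    }
}
```

With this, `IndexEntryFilter.isType("class in java.util")` is true, while a title that starts with anything else is skipped during the first indexing pass.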

Here's where things start getting increasingly complicated. I mentioned consistency before. Unfortunately, there aren't always consistencies between different javadoc versions (or the ones that exist take more effort to use than writing 2 separate impls), so our logic branches in 2 at this point.

I didn't actually check when exactly such a drastic change happened; all I know is that some javadocs are "old" and some are "new". There are actually 4 versions of javadocs that I discovered while making this project. The first is so ghastly old that I didn't bother implementing it. The second is what I call "old" in the project, and the last 2 are "new". The last 2 are grouped together because, apart from some really small changes, they're the same.

In our javadoc page deserializer, the first thing we grab is the description. Despite its name, the description contains a whole lot more than the description, which will be discussed later. We also fetch the header title (not to be confused with <head><title>blah</title></head>), which is the big name at the top of the page. We then take the element immediately before the title, which will be the package. We use this method because some objects have multiple packages, most of which are irrelevant for our uses, so we just need the last.

With this info, we head into the type deserializer, which is one of the three components the javadoc page deserializer calls. Now, back to the description. It isn't just the text description of the object (e.g. "this event is called when a pigeon flies into the left jet engine"); it also includes the type declaration and key metadata about the type. The type declaration is the little snippet of code which mirrors the declaration in the class's actual source, e.g.

public final class PigeonEngineCollisionEvent
extends Event
implements Cancellable

This is where we get our type's annotations, modifiers, actual type (class, enum, etc), name, type parameters, super classes, and super interfaces. Additionally, we also scan for things like "All Implemented Interfaces", and deprecation blocks/comments.
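To make that a bit more concrete, here's a minimal sketch of pulling those pieces out of a declaration like the one above. It is not the project's implementation: `TypeDeclarationSketch` is a hypothetical name, and it assumes the declaration has no type parameters or annotation arguments (anything containing spaces or commas inside `<...>` or `(...)` would need to be handled first).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical holder for the parts of a type declaration.
final class TypeDeclarationSketch {
    private static final Set<String> KINDS = Set.of("class", "enum", "interface", "@interface");

    final List<String> annotations = new ArrayList<>();
    final List<String> modifiers = new ArrayList<>();
    final List<String> superClasses = new ArrayList<>();
    final List<String> superInterfaces = new ArrayList<>();
    String kind;
    String name;

    static TypeDeclarationSketch parse(String declaration) {
        TypeDeclarationSketch type = new TypeDeclarationSketch();
        // Line breaks and commas are treated as plain separators.
        String[] tokens = declaration.trim().split("[\\s,]+");
        int i = 0;
        // Everything before the kind keyword is an annotation or a modifier.
        for (; i < tokens.length && !KINDS.contains(tokens[i]); i++) {
            (tokens[i].startsWith("@") ? type.annotations : type.modifiers).add(tokens[i]);
        }
        if (i + 1 >= tokens.length) {
            throw new IllegalArgumentException("no type kind and name in: " + declaration);
        }
        type.kind = tokens[i++];
        type.name = tokens[i++];
        List<String> target = null;
        for (; i < tokens.length; i++) {
            if (tokens[i].equals("extends")) {
                // An interface "extends" interfaces; anything else extends a class.
                target = type.kind.equals("interface") ? type.superInterfaces : type.superClasses;
            } else if (tokens[i].equals("implements")) {
                target = type.superInterfaces;
            } else if (target != null) {
                target.add(tokens[i]);
            }
        }
        return type;
    }
}
```

Feeding it the PigeonEngineCollisionEvent declaration gives the modifiers `public` and `final`, the kind `class`, one super class (`Event`) and one super interface (`Cancellable`).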

Now that our type is deserialized, we're back in the javadoc page deserializer, onto the fun "detail" deserializations. Details are things like methods, constructors, enum constants, fields, etc. They're called details because that's what the actual javadoc calls them. In "old" javadocs, the table name is literally "Field Detail", or similar. In "new" javadocs, these tables have a class called .detail, which we use to fetch the individual details. Anyway, we get all these details, group them into methods (methods, constructors) and fields (fields, constants), then pass them off to their respective deserializers.

For methods, that's the MethodDeserializer. The first thing the method deserializer does is call the detail deserializer! Because the javadoc treats these "details" so similarly, the code is also similar, and some things can be abstracted away into a general-purpose detail deserializer.

The detail deserializer can deserialize the "declaration code snippet" to retrieve its annotations, modifiers, return type, and name, and it also gets some basic details such as the description and deprecation status/comment. Unfortunately, there aren't really any consistencies for the code snippet, so we have to do it the hard way, with two separate implementations. The old impl uses some horrendous string magic to figure out what is what, whereas the new impl has nice little classes that we can use.

Back in the method deserializer, we can begin working on the other things we need (parameters, throws, and returns). First of all, we need to determine whether this method is a constructor, which we can do by simply checking if the method name is equal to the parent type name (which is passed in by the javadoc page deserializer). If it is, we can skip ahead and set our modifiers to what the detail deserializer assumed was the return type, and our return type to what the detail deserializer assumed was the method name. This works because the detail deserializer uses indexes to figure out what is what: the last element is the name, the second-to-last is the return type, and anything before that is a modifier. The issue with constructors, though, is that they only have two parts, a modifier and a name that doubles as the return type. The detail deserializer doesn't know this (and doesn't need to).
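Here's a small sketch of the positional split and the constructor fix-up described above. The names (`ConstructorFixSketch`, `fromSignature`, `fixIfConstructor`) are hypothetical, and it assumes the signature has already had its parameters and annotations removed and contains at least two tokens:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for what the detail deserializer hands back.
final class ConstructorFixSketch {
    List<String> modifiers;
    String returnType;
    String name;

    // Positional split: last token is the name, second-to-last is the
    // return type, and everything before that is a modifier.
    static ConstructorFixSketch fromSignature(String signature) {
        List<String> tokens = Arrays.asList(signature.trim().split("\\s+"));
        int size = tokens.size();
        if (size < 2) {
            throw new IllegalArgumentException("expected at least two tokens: " + signature);
        }
        ConstructorFixSketch detail = new ConstructorFixSketch();
        detail.name = tokens.get(size - 1);
        detail.returnType = tokens.get(size - 2);
        detail.modifiers = new ArrayList<>(tokens.subList(0, size - 2));
        return detail;
    }

    // For "public Foo" the split above guesses returnType = "public" and
    // name = "Foo", so both values are shifted into their real slots.
    boolean fixIfConstructor(String parentTypeName) {
        if (!name.equals(parentTypeName)) {
            return false; // an ordinary method, nothing to fix
        }
        modifiers.add(returnType); // the "return type" was really a modifier
        returnType = name;         // a constructor returns its own type
        return true;
    }
}
```

The nice part of doing it this way is that the positional split stays dumb and shared, and only the method-specific code needs to know constructors exist.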

On to parameters, where once again we have to split into our old and new impls. If it's old, we find the first index of the method name followed by "(" (or by a non-breaking space and then "(") in the code snippet. We then take the substring of the code snippet from just after that opening parenthesis to the last index of ")". That leaves us with our parameters alone. The new impl uses the .arguments class to determine parameters.
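The old-style substring approach could look roughly like this. It's only a sketch under a couple of assumptions: `OldParameterSketch` and `parameterString` are invented names, and the non-breaking space is assumed to show up as the character U+00A0 in the extracted text.

```java
// Hypothetical helper for the "old" javadoc layout.
final class OldParameterSketch {
    // Returns the raw text between the method's "(" and the last ")",
    // or an empty string when the snippet does not match.
    static String parameterString(String snippet, String methodName) {
        // Try "name(" first, then "name" + non-breaking space + "(".
        int nameIndex = snippet.indexOf(methodName + "(");
        if (nameIndex < 0) {
            nameIndex = snippet.indexOf(methodName + "\u00a0(");
        }
        int close = snippet.lastIndexOf(')');
        if (nameIndex < 0 || close < 0) {
            return "";
        }
        // Start just after the opening parenthesis that follows the name.
        int start = snippet.indexOf('(', nameIndex) + 1;
        return close < start ? "" : snippet.substring(start, close);
    }
}
```

For a snippet like `public void execute(String name, int count)` and the method name `execute`, this leaves `String name, int count`.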

These impls only get us the parameter string though, which could include mutability modifiers, annotations, type parameters, all sorts of stinky things that could mess up our code. The first step is to kill annotations & type parameters, or more specifically their content. While they may contain essential information, unfortunately there's really no way to split up the parameters with them there. Picture this:

public void execute(@NotNull @Description("the player's name") final CustomString<String, BaseCommand> name)

Believe me, I really tried to work with this, but it just wasn't happening. We need two key things from the parameter for searching purposes, the type, and the name. With all the extra spaces, commas, and things that could confuse the program, I decided it was best to just cut the contents of annotations and type parameters. So essentially that gets converted to:

public void execute(@NotNull @Description final CustomString name)

Yeah, it's not ideal, feel free to PR a fix.
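For anyone who wants to poke at it, here's one way the stripping step could be sketched. This is not the project's code: `ParameterCleanupSketch` and its methods are hypothetical, and it works on the parameter string only (the part between the method's parentheses). It drops everything inside `(...)` and `<...>`, including string literals, then takes the last two tokens of each parameter as the type and the name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: removes annotation arguments and type parameters.
final class ParameterCleanupSketch {
    static String stripNested(String parameters) {
        StringBuilder out = new StringBuilder();
        int depth = 0;            // nesting depth of (...) and <...>
        boolean inString = false; // inside a "..." literal
        for (int i = 0; i < parameters.length(); i++) {
            char c = parameters.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '(' || c == '<') {
                depth++;
            } else if (c == ')' || c == '>') {
                if (depth > 0) depth--;
            } else if (depth == 0) {
                out.append(c); // only top-level characters survive
            }
        }
        return out.toString();
    }

    // Returns {type, name} pairs, one per parameter.
    static List<String[]> typeAndName(String parameters) {
        List<String[]> result = new ArrayList<>();
        for (String parameter : stripNested(parameters).split(",")) {
            String[] tokens = parameter.trim().split("\\s+");
            if (tokens.length >= 2) {
                result.add(new String[] {tokens[tokens.length - 2], tokens[tokens.length - 1]});
            }
        }
        return result;
    }
}
```

Once the nested content is gone, splitting on commas is safe, which is the whole point of throwing that information away.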

After all that, we simply scan the rest of the html for parameter descriptions, throwable descriptions, and returns descriptions.

The field deserializer actually has nothing unique, it relies entirely on the detail deserializer. Have a look for yourself lol, it's literally just a call to the detail deserializer.

After that, we head back into the javadoc page deserializer. So close to the end now. With our methods and fields in hand, we go back to the parent type and add them all (as FQNs) to its field & method sets. I'm not really a fan of the mutability requirement here, but oh well. We then combine the deserialized objects (the type and its details) into a single set, which will be used by the web crawl populator. Only one thing is left to do, which is to apply the details of parent types to the types that inherit from them. A little bit of recursion and we're done.
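The recursion for that last step might look something like the sketch below. Everything here is an assumption for illustration: `InheritanceSketch` is a made-up name, types and members are plain strings rather than the project's objects, and how the real code represents an inherited member's FQN isn't covered.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical resolver: a type's members are its own plus everything
// its parents (and their parents) declare.
final class InheritanceSketch {
    private final Map<String, List<String>> parents;  // type -> direct supertypes
    private final Map<String, Set<String>> declared;  // type -> members declared on it
    private final Map<String, Set<String>> resolved = new HashMap<>();

    InheritanceSketch(Map<String, List<String>> parents, Map<String, Set<String>> declared) {
        this.parents = parents;
        this.declared = declared;
    }

    Set<String> membersOf(String type) {
        Set<String> known = resolved.get(type);
        if (known != null) {
            return known; // already resolved, or currently being resolved
        }
        Set<String> members = new LinkedHashSet<>(declared.getOrDefault(type, Set.of()));
        // Register before recursing so shared ancestors are only walked once.
        resolved.put(type, members);
        for (String parent : parents.getOrDefault(type, List.of())) {
            members.addAll(membersOf(parent));
        }
        return members;
    }
}
```

Caching the resolved sets matters for something like the JDK, where a handful of ancestors are shared by thousands of types.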

Now, this might sound like a long, complicated process, but let's get some perspective. JDK 11 has 4686 indexable types, and those types have a total of 52428 details (methods/constructors/fields/constants), which are all fully deserialized in a mere ~3800ms on my computer (i7 7700K). And half of that time is from jsoup alone!
