-
Notifications
You must be signed in to change notification settings - Fork 187
Add support for parent/child TypeMatchers #86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Thanks for investing the time on this. Overall, LGTM, but just a small comment: The hierarchy solves part of the #73 issue's problem about deterministic matching, but what about the other matchers? Should all they use hierarchical matchers according to the proposed implementation? I might understand that the purpose of this PR is not strictly to solve this issue but makes sense to me to solve it consistently before releasing a new major version after the merge since there are public interface breaking changes here. Fundamentally, I would change the Would you be interested in pursuing that goal here or do you prefer to merge things and iterate separately in that direction? |
|
I thought about priority some. What I realized was, even with priority you still want hierarchy. Otherwise, you're forced into priority inversion in order to establish the hierarchy. IOW, say I have a buffer full of zip-signature data. In order for priority to correctly identify it no matter the type, I have to test every signature that represents a zip file, in backwards order from the least likely to the most, before I can declare it a zip file. If it happens to not be an epub, or an Office doc, then it'll fail all of those tests and match the zip signature, at which point I can call it a zip file. With hierarchy, for the most part the same thing happens. To declare it a zip file, I have to test it against every one of its child matchers. But there are two key differences:
Ideally, the more-specific matchers wouldn't be in Priority/ordering isn't necessarily a bad idea. But in terms of efficiency, I'd think the most benefit would come from prioritizing the most likely/common types. But to do that without then misidentifying more specific types that share the same signature, their parent/child relationship -- which in a sense is the inverse of priority -- still needs to be tracked as well. As far as working on it, I can't promise any timeframe, but I'm not abandoning this PR. If it needs further development, I'm happy to discuss changes. A priority implementation, specifically, I think is probably a separate enough thing that even if I were to work on it, I'd do it as a separate PR. I don't see that as something that would invalidate this PR, or vice versa. |
Going a different route from #73, this PR modifies the matchers interface to account for parent/child relationships between types. (I initially called the children
subtypes, as you can see from the branch name, but since that term's already used in the MIME code it would've been confusing as heck.)Modifications to existing code
matchers/matchers.goMatcherandTyperMatcherinterface requires a method,Match([]bytes), which returnstrueif the buffer matches the matching functionTyperinterface adds Type([]bytes), which returns the associatedtypes.Type` if the buffer matches the matching functionByteMatcherByteMatcherimplementsMatcherTypeMatcheris now a struct, containing atypes.Typeand aByteMatcherTypeMatcherimplements bothMatcherandTyperTypeMatcher.Match()checks whether the buffer matches the type or any of its child typesTypeMatcher.Type()checks the buffer against any child types, then the parent type, and returns the type on successful matchChildMatchersis amap[types.Type][]TypeMatcherwhich registers the parent-child relationship between aTypeand its childTypeMatchers(not directly betweenTypeMatchers)match.goMatch()andType()methods, instead of runningByteMatchersdirectly.filetype.goIs()andIsType()functions now useMatch()andType(), instead of runningByteMatchersdirectly. (Actually,Isnow just callsIsType, which checks the type'sTypeMatcher.Match().)Is()andIsType()will return true for both a buffer's specific type, and its parent type. (e.g.IsType(types.TypeZip)andIs("zip")are true for a buffer containing.docxbytes.)New source files
The new file
matchers/children.goadds storage for the the parent-child mappings:TypeMatcheris expanded with anAddChild(types.Type, ByteMatcher)method, to add a child of the matcher'sTypeAddChildType(parent, child types.Type, ByteMatcher)is also added, to create children for types that don't yet have aTypeMatcher. (This was necessary to keep thematchers.Map{}functionality working, sinceChildMatcherswill end up getting populated beforeMatchers.)Updates to existing data structures
Epub,Docx,Xlsx, andPptxare all made children ofTypeZipTesting
ChildMatcherfunctionalityNotes/Caveats
In theory this could be extended to support a multi-level hierarchy, with child types having children of their own. However, as that wasn't necessary to support current needs, only one level is implemented. The
Type()method ofTypeMatcherdoesn't use theType()method of its children to look up their matching types, it queries the child'sByteMatcherdirectly. RecursiveChildMatcherlookups would be necessary to support children of children.There's no way to support parent-child relationships in an arbitrary
MatchMap()call. Because it does run theByteMatchers it's passed directly, it will ignoreChildMatchersand can return incorrect results if it's passed amatchers.Mapcontaining overlapping matchers.Fixes #38, fixes #40