-
Notifications
You must be signed in to change notification settings - Fork 11
New trait to support work variables #89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
I haven't had time to review the code, but the idea makes complete sense. If no one objects for a while, I'd say merge it. I'd like to hear from @digama0 if he has a chance to review it. |
src/formula.rs
Outdated
} | ||
|
||
fn label_for_symbol(&self, _: Symbol) -> (Label, TypeCode) { | ||
panic!("Symbols in regular formulas shall only be regular symbols"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a panic? Even in the regular case, var -> varhyp lookups are a thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only place where it is called currently is a case where it has been established the symbol tested is work variable, so an implementation is currently not required.
grammar.rs
, line 1276:
let opt_node = if !rslv.is_work_variable(token.symbol) {
map.get(&(SymbolType::Constant, token.symbol))
} else {
// The next symbol is not a regular symbol, resolve it first
let (label, typecode) = rslv.label_for_symbol(token.symbol);
But you are right that we could provide an implementation for the case where the symbol provided is a regular variable. I'll add that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See commit.
src/formula.rs
Outdated
}); | ||
} | ||
dt.finish() | ||
} | ||
} | ||
|
||
type FlattenStack<'a> = Vec<( | ||
Either<TokenIter<'a>, Symbol>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest using an enum
here instead of Either
. It is not obvious what the two variants mean from the type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.stack.pop(); | ||
self.next() | ||
(Either::Right(symbol), _) => { | ||
self.stack.pop(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems wrong. You are popping the stack but you haven't done anything with the sibling_iter yet. Isn't that going to lose items following the work variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually whenever there is a single symbol on the stack, there are no siblings, so there is no sibling iterator to track.
This is actually better visible with the new enum you suggested above.
src/formula.rs
Outdated
match self.symbol_token(symbol) { | ||
Cow::Borrowed(token) => as_str(token).into(), | ||
Cow::Owned(token) => String::from_utf8(token) | ||
.expect("TokenPtr is supposed to be UTF8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be modified to use as_str()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I try e.g. as_str(&token)
, Rust complains that I'm trying to return a reference to a local variable.
Indeed, as_str
does not return an owned structure, but just a &str
which is borrowed from its input token
.
In any case there is duplicated code here, and I agree this is a concern. Maybe I could move this code to a new as_string
function, right next to as_str
?
Or do you have a better suggestion ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as_string sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See as_string commit
@@ -1102,29 +1100,6 @@ impl Grammar { | |||
{ | |||
// Empty parser command | |||
match k.as_ref(buf) { | |||
b"syntax" => match rest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this get removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I shall have put that in a separated PR.
This piece of code records the typecodes declared with the $j syntax
commands.
However this is done both at the initialization phase, and again here during the handling of the special garden_path
commands, which happens after a first grammar has been built.
As a result, the typecodes
array contained duplicated elements.
src/grammar_tests.rs
Outdated
@@ -116,7 +116,7 @@ fn test_parse_formula() { | |||
&mut fmla_vec.clone().into_iter(), | |||
&[wff, class], | |||
false, | |||
&names, | |||
&names as &Nameset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&names as &Nameset, | |
&**names, |
Alternatively and additionally, Arc<T: Resolver>: Resolver
seems like a reasonable impl and would allow you to pass &names
directly, although it is probably better at runtime to undo the indirection here rather than in the method calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll check that - here this is only for test purpose, so we don't need to squeeze out the highest performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/nameck.rs
Outdated
@@ -63,7 +63,7 @@ use std::u32; | |||
/// | |||
/// [INC]: https://github.com/sorear/smetamath-rs/issues/11 | |||
#[derive(Copy, Clone, Debug, PartialOrd, Ord, PartialEq, Eq, Default, Hash)] | |||
pub struct Atom(u32); | |||
pub struct Atom(pub u32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the motivation for making this public? This will make garbage atoms user-constructible, and its relation to the other invariants of the library are not clear. I don't see why you would ever want to be able to write Atom(37)
. Read only access to the number can be done by a Atom::index()
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main drawback of this approach: an implementation of the Resolver
trait needs to build and return its own Symbol
for work variables, which are atoms.
Slightly better here might be to only provide a public constructor, this would at least make these atoms practically unmutable, since even with a mutable reference to one, a library caller could not modify the underlying identifier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the resolver needs to create Symbol
s, then it should have access to a name generator which can ensure that the Atom
s are globally unique, and it should not be directly responsible for manipulating the number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#[must_use] | ||
pub fn last_atom(&self) -> Atom { | ||
Atom(self.atom_table.table.len() as u32) | ||
pub fn is_extra_atom(&self, atom: Atom) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that we should promise that extra atoms always come in a block after regular atoms, or even that there is a way to distinguish them. What if the parser is restarted after some extra atoms are created, in incremental use cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So right now I'm kind of eating the cake starting from both sides:
Nameset
naturally allocates atoms from 0 up, continuously, while I'm getting the identifiers from MAX down, assuming the passed ID stays small.
What my client currently does to compute the ID, is that it also ranks them and assumes that small indices are used first (&W1
, &S1
, &C1
, &W2
, &S2
, &C2
, &W3
, etc.) This allows a direct computation without any lookup.
There are currently around 100'000 atoms (99'577) in set.mm, so there is still plenty of space until 2^32.
Beyond the fact that this method works with direct computations and no lookup, which shall be faster, it most importantly works with an immutable Nameset
instance. If Nameset
were to reserve internal spots for the "extra" atoms, this means it needs to be mutated, which goes against the current philosophy of "once the DB is parsed, everything is immutable", which allows very fast execution since we only pass around Arc
s to the different structures of the DB.
Let's assume we went up to work variable &W99999
, and that each time the parser is restarted everything changes and it requires a new set of 100'000 atom indices. Using u32
as currently, the parsing could be restarted 43'000 times until there is a clash.
Do you have another idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beyond the fact that this method works with direct computations and no lookup, which shall be faster, it most importantly works with an immutable
Nameset
instance. IfNameset
were to reserve internal spots for the "extra" atoms, this means it needs to be mutated, which goes against the current philosophy of "once the DB is parsed, everything is immutable", which allows very fast execution since we only pass aroundArc
s to the different structures of the DB.
I think that a name generator that does not mutate can't possibly work correctly. I think it is fine to use interior mutability here, and the only state a name generator needs to track is the last allocated ID so a AtomicU32
will suffice, so API-wise it can still take a &self
.
I'm still a little dubious about this mmj2-styled work variable thing. This really ought to be some kind of enum, work vars are not the same as regular vars in a whole bunch of different ways, and adding phantom $f
statements to the database seems like a really weird encoding to me.
This introduces a trait
Resolver
meant to handle work variables outside of metamath-knife.The new trait shall be able to resolve work variable names like e.g.
&W1
into a virtual symbol, as well as into the corresponding a virtual floating axiom, and vice-versa for printing back that symbol.An implementation is provided in the form of
Nameset
, which does not resolve any additional work variables, but only regular symbols straight from the database.