Explaining parse rules

Another recent topic here contained a parsing example: a function that scans a string and returns true or false depending on whether the string contains a specified whole word. In other words, a string containing "truth" would NOT count as containing "ruth." A function was presented that solves that problem (pasted below), and it works just fine, but I would like to be able to explain the rules to someone, and I don't really know the wording to use. It has to be something more than, "It works, so just leave it alone and don't modify it." I wonder if someone might offer some way to talk about it. I just now read the Core user guide chapter on parsing, and it helped, but not completely.
    
I am thinking of something along the lines of, "The input string can contain any number of delimiters, followed by the word we are searching for, and then any delimiter or else the end of the string." If we find the word, return ??? what exactly? Or, if we don't find the word, then there can be any characters or delimiters to the end. (And then what does the "end skip" do?)
    
Thank you.
    
REBOL [
     Title: "Parse whole words"
     Purpose: {Demo function to scan given text for some whole word
     and just return truth or falsity if the word is found.}
]
    
space: charset "^/^- "
punct: charset {!"#$%&'()*+,-./:;<=>?@[]^^`{|}~}
chars: complement delims: union space punct
    
find-word: func [
     "Scan text for a given word, return true or false"
     text [string!] word [string!] /local mark
][
     parse/all text [
         any delims ; leading space/punctuation
         any [
             mark: word [delims | end] (return mark)
             ; We have our word bound by space, we're done here.
             ; Remember that anything other than NONE or FALSE counts as true.
             |
             some chars some delims
             ; Skips a match and then further space
         ]
         end skip ; Force return FALSE from PARSE.
     ]
]
    
;; Quick tests
either find-word "the quick brown fox" "quick brown" [
     print ["'quick brown' found"]
] [
     print ["'quick brown' NOT found"]
]
either find-word "the quick brown fox" "xxxxxx" [
     print ["'xxxxxx' found"]
] [
     print ["'xxxxxx' NOT found"]
]
halt


posted by:   Steven White       28-Nov-2018/16:54:53-8:00



Steven,
    
Thanks for this function, it works great. I have incorporated it into my first REBOL application. It really helps with more accurate text searches.
    
Mike

posted by:   Michael Todd       28-Nov-2018/19:44:22-8:00



That's the one from Chris in the other thread. I did not want to pollute the other thread with questions not related to the main issue of refreshing text lists. I did "functionize" it for my own use, but I copied the code right out of his response. I have the impression that help-forum protocol is to keep things clean and on-topic. It might also help for searching in the future after we have forgotten the present. I don't know; I am a Perpetual Beginner. Interestingly (off-topic), I had the experience once of searching rebolforum to see if anyone had asked some question, and found that *I* had asked that question, and had forgotten.
    
I am hoping that by collecting enough parsing examples I can either come to understand it, or have so many examples it won't matter if I can't understand it because I will be able to copy somebody else's solution.

posted by:   Steven White       28-Nov-2018/22:23:35-8:00



<q> "The input string can contain any number of delimiters, followed by the word we are searching for, and then any delimiter or else the end of the string." If we find the word, return ??? what exactly? Or, if we don't find the word, then there can be any characters or delimiters to the end. (And then what does the "end skip" do?) </q>
    
I should start by saying that I don't normally use RETURN in this way: I don't much care for RETURN at all, and would rather let all the branches play out to their conclusion. I used it here for expediency.
    
The problem here is defined as 'whole words'--we need to discern what a whole word is. In Regex, you might write `/\bWordToMatch\b/` with that handy little `\b` shorthand which is a zero-width match between a stream of `\w` (word) characters and `\W` (non-word) characters.
    
In Rebol, there is no `\b`, so we can't say, for instance: FIND "THIS STRING" "^(WORD-BOUNDARY)STRING^(WORD-BOUNDARY)", so we'll have to resort to Rebol's all-powerful string-splitting penknife: PARSE. Indeed, there are no built-in character classes in Rebol (try HELP BITSET! at the console--compare that to HELP TUPLE! in Rebol/View), and no magic `\b` marker, so [TO "^(WORD-BOUNDARY)STRING^(WORD-BOUNDARY)"] is out of the question.
    
Our whole word match is based on this--our word is bound at the start either by the head of the string or some `\W` and at the end by the tail of the string or some `\W`. We now have to ponder `\w` and `\W` (specifically the latter) in order to see our string in binary terms: `/(\W+|\w+)*/`. For the purposes of this exercise, I'm going to set these definitions as follows--our NON-WORD is simply SPACE and PUNCTUATION, and our WORD is anything that's not that:
    
     word: complement non-word: charset {^/^- !"#$%&'()*+,-./:;<=>?@[]^^`{|}~}
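
A quick console check, offered only as a sketch, shows how these two charsets split characters into the two classes. In Rebol 2, FIND on a bitset gives TRUE for a member and NONE otherwise; the sample characters below are illustrations of mine, not part of the original function.

```rebol
REBOL [Title: "Word / non-word charsets (sketch)"]

; Same definition as above: NON-WORD is space and punctuation,
; WORD is simply everything else.
word: complement non-word: charset {^/^- !"#$%&'()*+,-./:;<=>?@[]^^`{|}~}

probe find non-word #" "   ; space is a NON-WORD character
probe find non-word #"."   ; so is punctuation
probe find word #"o"       ; letters fall into WORD
probe find word #"7"       ; as do digits, since only space/punctuation were excluded
probe find word #"."       ; punctuation is not a WORD character
```

Note that anything missing from the NON-WORD list (digits, for example) counts as a WORD character under this definition.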
    
To flesh out our thinking, we'll say:
    
     phrase: "Is neither fowl nor owl though."
     term: "owl"
    
A positive match is simply:
    
     positive-match: [term [non-word | end]] ; TERM followed by any NON-WORD character or END (tail).
    
The tricky part is making sure that what precedes it is also NON-WORD. Matching HEAD is easy:
    
     [any non-word positive-match]
    
That's great if it's at the beginning or only preceded by space/punctuation. What if there are other words beforehand? Well, if we have a WORD sequence that isn't POSITIVE-MATCH, then we assume it will have a NON-WORD sequence before we can once again test for POSITIVE-MATCH:
    
     negative-match: [some word some non-word]
    
We go looping through our string, skipping through those NEGATIVE-MATCHes until we hit upon a POSITIVE-MATCH:
    
     [
         any non-word ; we want to start our loop either at HEAD or after a NON-WORD sequence
         some [
             positive-match ; always test this first
             |
             negative-match ; we're at the end of a NON-WORD sequence here
         ]
     ]
    
Ok, this is great, but we need some way to mark out our successful find when we get there. We'll introduce MARK.
    
     mark: none
    
     some [
         mark: positive-match
         |
         negative-match
     ]
    
It's difficult to discern, though, whether MARK is at the beginning of a POSITIVE-MATCH or a NEGATIVE-MATCH. If you run this rule now, MARK will have moved on past our match (to "though." or beyond) because we didn't stop the loop. Adding BREAK will ensure that the loop won't run again:
    
     some [
         mark: positive-match break
         | ...
     ]
    
However, if PHRASE doesn't contain TERM, then we're going to end up with a false positive. Instead of waiting for the PARSE rule to play out, I've short-circuited the whole thing by just RETURNing the mark (which breaks out of PARSE back to the containing function) and adding END SKIP to ensure that PARSE (and our function) will always return false if there is no POSITIVE-MATCH: if END is TRUE, then SKIP will always be FALSE. Our function thus behaves like FIND.
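
To make the END SKIP and RETURN tricks concrete, here is a minimal sketch that isolates each one. The function name FIRST-A and the sample strings are hypothetical, made up purely for illustration.

```rebol
REBOL [Title: "END SKIP and RETURN inside PARSE (sketch)"]

; TO END walks to the tail of the input, so this rule succeeds.
probe parse/all "any text" [to end]            ; true

; END succeeds only at the tail; SKIP then needs one more character,
; which cannot exist there, so the rule as a whole fails.
probe parse/all "any text" [to end end skip]   ; false

; Away from the tail, END itself fails first, so the result is the same.
probe parse/all "any text" [end skip]          ; false

; FIRST-A (hypothetical): RETURN inside a paren leaves PARSE and the
; function at once; END SKIP forces FALSE when nothing was found.
first-a: func [text [string!] /local mark][
    parse/all text [
        any [mark: "a" (return mark) | skip]
        end skip
    ]
]

probe first-a "xxabc"   ; "abc"
probe first-a "xyz"     ; false
```

The last two calls show the FIND-like behaviour: a position in the string on success, a false value otherwise.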
    
Ok, so one final tweak to bypass the RETURN/END SKIP hack would be:
    
     some [
         mark: positive-match break ; BREAK stops the loop on success
         |
         (mark: none) negative-match ; MARK is reset
     ]
    
And there we have it, MARK will either refer to the point before POSITIVE-MATCH, or be NONE. The return value of PARSE no longer matters.
    
     find-word: func [
         phrase [string!] term [string!]
         /local word non-word mark
     ][
         word: complement non-word: charset {^/^- !"#$%&'()*+,-./:;<=>?@[]^^`{|}~}
         mark: none
    
         parse/all phrase [
             any non-word
             some [
                 mark: term [non-word | end] break
                 |
                 (mark: none) some word some non-word
             ]
         ]
    
         mark
     ]
    
     probe find-word "Is neither fowl nor owl though." "owl"
     probe find-word "Is neither fowl nor owl though." "bowl"
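
Since the function hands back either a position or NONE, it can sit directly in a conditional. The following is only a sketch: it repeats a compact copy of FIND-WORD so that it runs on its own, and the sample phrases are hypothetical ones picked to exercise the edges of the rule.

```rebol
REBOL [Title: "find-word edge cases (sketch)"]

; Compact copy of FIND-WORD from above, repeated so this snippet stands alone.
find-word: func [phrase [string!] term [string!] /local word non-word mark][
    word: complement non-word: charset {^/^- !"#$%&'()*+,-./:;<=>?@[]^^`{|}~}
    mark: none
    parse/all phrase [
        any non-word
        some [
            mark: term [non-word | end] break
            |
            (mark: none) some word some non-word
        ]
    ]
    mark
]

; Hypothetical sample phrases, one per edge of the rule.
foreach [phrase term] [
    "owl, fowl"    "owl"    ; term at the head, followed by punctuation
    "fowl and owl" "owl"    ; term at the tail, closed by END rather than NON-WORD
    "fowl"         "owl"    ; term present only inside a longer word
][
    either mark: find-word phrase term [
        print [mold term "found at index" index? mark "of" mold phrase]
    ][
        print [mold term "not found as a whole word in" mold phrase]
    ]
]
```

INDEX? turns the returned position into a character offset, which is about the only extra step needed to use the result the way one would use FIND's.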


posted by:   Chris       29-Nov-2018/13:03:31-8:00



When experimenting with PARSE, it can be illuminating to use Red's PARSE-TRACE function. Although Red's PARSE is not entirely compatible with Rebol 2's (it has some extra features), it will give you some idea of how PARSE's decision making works:
    
     parse-trace "aabbbaaab" [some ["a" | some "b"]]


posted by:   Chris       29-Nov-2018/13:23:07-8:00



Also--there's some background for using PARSE in this way in my response to pattern-matching URLs here:
    
http://ross-gill.com/page/Beyond_Regular_Expressions

posted by:   Chris       30-Nov-2018/12:52:13-8:00



^^^ That URL pattern matching forms the basis for matching URLs for the feed I wrote for this forum:
    
http://rebol.info/feeds/rebolforum-full.feed

posted by:   Chris       30-Nov-2018/13:49:53-8:00



First, here it is fixed so we can see what is going on inside of it.
    
space: charset "^/^- "
punct: charset {!"#$%&'()*+,-./:;<=>?@[]^^`{|}~}
delims: union space punct
    
chars: complement delims
    
find-word: func [
    "Return true when a word is in text"
    text [string!]
    word [string!]
    /local mark
][
    parse/all text [
        any delims
        any [
            mark: word [delims | end] (return mark)
            |
            some chars some delims
        ]
        end skip
    ]
]
    
1. Tell REBOL to match every character, including whitespace, against the rules expressed in the PARSE dialect. That is what /all does.
2. 0 or more times, match any delimiters (punctuation marks or space).
3. 0 or more times
    mark the current position in the input; if the input at that position matches the word to find, followed by a delimiter (space or punctuation) or the end of input, tell REBOL to return the marked position, i.e., the input from the found word to the end.
    or
    1 or more times, consume characters or delimiters
4. Force PARSE to fail, so the function returns false when the word was not found. Without this, PARSE would return true whenever the rules happen to consume the input all the way to its end.


posted by:   Stone Johnson       21-Aug-2019/0:19:04-7:00



Errata:
    
In 3.
1 or more times consume characters
1 or more times consume delimiters


posted by:   Stone Johnson       21-Aug-2019/0:23:21-7:00


