More parsing confusion
I happened to be reading some of Nick's documentation and came upon the parsing example below. The purpose of the example is to remove in-line comments from REBOL code. I understand in concept what it is doing, namely, find the first semicolon, then look ahead to the end of the line, calculate the number of characters between that semicolon and the end of the line, and remove that many characters. But I can't EXPLAIN it.

What does the "begin:" mean? And the "ending:"? And why is there the ":begin" at the end of the "remove" line? And why is there the "any" in the rule?

I feel like there is some key concept in parsing that I am supposed to know but don't, and if I did, I would have this head-slapping moment of clarity where I would say, "Oh, of course, it's so obvious." I wonder if someone might give me some guidance in understanding this example so I can continue my march toward parsing knowledge. Thank you.

    R E B O L []
    CODE: {
    Owner_Name: "" ;; A
    Co-Owner_Name: "" ;; B
    Mail_Address_1: "" ;; C
    Mail_Address_2: "" ;; D
    In_care_of: "" ;; E
    City: "" ;; F
    State: "" ;; G
    Country: "" ;; H
    Zip: "" ;; I
    }
    parse/all code [any [
        to #";" begin: to newline ending: (
            remove/part begin ((index? ending) - (index? begin))) :begin
        ]
    ]
    write %uncommented.txt CODE
    editor CODE ; all comments removed
posted by: Steven White 8-May-2019/17:13:10-7:00
This method of comment removal doesn't actually work in general, because you have to consider semicolons that appear inside strings. Interestingly enough, I was writing a parse rule that did actual comment removal last week (it uses the Ren-C/Red-ism AHEAD, but shows the general method):

https://github.com/metaeducation/ren-c/blob/65fcd12516f220a08893b9045bfd6ec79e72cabb/tools/common.r#L386

> What does the "begin:" mean?

Rebol has historically used SET-WORD!s to capture the current parse position into a variable. Hence the SET refers to setting the variable. Correspondingly, it has used GET-WORD!s to seek the parse position to the position held in that variable.

I think this is questionable. For one thing, it was hard for me in the beginning to know that SET-WORD! didn't mean "set the parse position" and GET-WORD! didn't mean "get the parse position". But it also seems that a keyword like SEEK would be clearer (you'd have been less confused, right?)

https://forum.rebol.info/t/changing-set-word-and-get-word-in-parse/1139

> And why is there the ":begin" at the end of the "remove" line?

The parse position was at the end of the line. If you remove material from the start of the comment to the end of the line, the parse position will now be out of date, and somewhere on a future line. Seeking to the index saved in the `begin` variable puts you at the right place for processing the next comment.

R3-Alpha and Red have a REMOVE command in the parse dialect that takes care of this problem in one swoop. You say `remove [...rule...]` and it will effectively mark the begin, mark the end, and fix up the position.

> And why is there the "any" in the rule?

The rule inside the ANY finds and removes one comment. If your goal is to remove several comments, you need some rule that does iteration, like ANY or SOME.

In writing my own comment removal rule linked above, I gained an understanding of how WHILE is different from ANY, and why it is necessary to have:

https://forum.rebol.info/t/parses-advancement-rule-bad/1159
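A minimal sketch of the REMOVE keyword mentioned above, assuming Red or R3-Alpha (Rebol 2 has no such parse keyword). The sample string and the word `code` are made up for illustration, and the sketch has the same limitation as the original rule: it treats every semicolon as the start of a comment, including one inside a string.

```rebol
; Assumes Red or R3-Alpha, where REMOVE is a parse keyword.
code: {a: 1 ; first^/b: 2 ; second^/}   ; hypothetical sample input

parse code [
    any [
        to #";"               ; advance to the next semicolon
        remove [to newline]   ; delete from there up to, but not including, the newline
    ]
]

probe code
```

No `begin:`, `ending:` or `:begin` is needed here, because REMOVE is expected to leave the parse position at the point where the removed text used to start.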
posted by: Fork 8-May-2019/17:34:54-7:00
To Steven and to all trying to grasp hold of PARSE.

1. PARSE is a word that tells REBOL to match either char! or datatype! values, depending upon the input being either string! or block!, to an expression written in the Parse Dialect.

2. Carl's Deep Lake Metaphor (source: http://www.rebol.com/article/0103.html ):

"Although we like to make REBOL look a lot like other programming languages, it is much deeper than it looks. I've said before that REBOL is like a lake. You see the surface and think that is all there is to it. But, once you step into it, you discover there is another dimension. "

3. Behind the word! parse itself is data to which REBOL imputes meaning. That meaning is to **make the data work as a function**. Thus, someone, presumably Carl, wrote PARSE in the standard REBOL dialect, aka the "Do dialect." REBOL knows how to interpret ("parse") that dialect already.

4. Understanding REBOL Itself, First

First, let's review what happens in the standard REBOL dialect with string!(s).

    >> s: "This is a string."
    == "This is a string."

Right now, s marks the string! from the **head view.** Another way to say it: s tags the string! from the head view.

If you were to tell REBOL to skip ahead in the string!, say this:

    >> s: skip s 3
    == "s is a string."

you have told REBOL to give you a view of the string! from index position of 3. Think of it as this:

_Hey REBOL. I want the 4-view of the string!_

If we ask REBOL to give us the head view again, i.e., the 1-view of the string:

    >> s: head s
    == "This is a string."

We can ask REBOL what the position of the view is:

    >> index? s
    == 1

The word index? asks REBOL to return the current view position. Now, if we tell REBOL to skip from the head view, aka the 1-view:

    >> s: skip s 3
    == "s is a string."

When we ask REBOL, REBOL will tell us we are at the 4-view. That is so because we told REBOL to skip the 1-view, 2-view and 3-view.

    >> index? s
    == 4

So why did I tell you all about **regular REBOL** first? Because if you can understand the above, you can understand the PARSE dialect.

5. Parse Dialect Marking (Tagging)

You ask, "What does the "begin:" mean? And the "ending:"? And why is there the ":begin" at the end of the "remove" line?"

The REBOL manual covers it here: http://www.rebol.com/docs/core23/rebolcore-15.html#section-7.4

From the first paragraph in that section (15.7.4), "The copy action makes a copy of the substring that it finds..."

copy is a parse dialect word. Someone wrote this in the REBOL manual: "copy the next match sequence to a variable." But what it does is this:

REBOL, using parse to process an expression written in parse dialect with a string! input, keep **a view of the string** from this point forward until told to truncate the view.

Again, from the first paragraph in 15.7.4, "... but that is not always desirable. In some cases, it is better to save the current position of the input ..."

Thus, all begin: means is this:

**Hey REBOL, make an index pointer beginning at this point in the input stream.**

Let's go back to our string! s and parse it with some rules expressed in the parse dialect. These are the dialect words: copy, to, end. REBOL interprets "T" as a user-created rule through PARSE.

    >> s: head s
    == "This is a string."
    >> parse s [ "T" copy x begin: to end]
    == true

So we have told REBOL, using parse, to process the input that s handles this way:

  1. keep going if the first char! in s matches our rule "T"
  2. keep an entire copy of s from the second char! forward
  3. keep a view of s from the second char! forward

Let's check that:

    >> x
    == "his is a string."
    >> begin
    == "his is a string."

These look the same on your screen as they do on mine; but are they the same? Well, if we check the index positions of each of these **views**, REBOL tells us that it has a view of x from the first position, i.e., a 1-view, while it has a view of begin from the second position, i.e., a 2-view.

    >> index? x
    == 1
    >> index? begin
    == 2

Let's ask REBOL if these are the same things.

    >> same? x begin
    == false

We should feel good about this because **copy** in the parse dialect told REBOL to make a new string!, i.e., a copy of it. Think of that action like this:

    >> newstring: copy {}
    == ""

But what about asking REBOL to see if begin and the 2-view of s are the same?

    >> s
    == "This is a string."
    >> same? next s begin
    == true

The above is the same as this:

    >> s
    == "This is a string."
    >> s: next s
    == "his is a string."
    >> same? s begin
    == true

Because begin is a 2-view of s. It is not a new string!, i.e., it is not a copy of s. It is a **view** of s.

    >> begin
    == "his is a string."

Do you get it yet?

6. ANOTHER VIEW

So ending: is merely another view of the input.

    >> s: head s
    == "This is a string."
    >> parse s [ "T" begin: 3 skip ending: to end]
    == true
    >> s
    == "This is a string."
    >> begin
    == "his is a string."
    >> ending
    == " is a string."

The words begin: and ending: have no special meaning. Through PARSE, REBOL imputes meaning. We could have chosen different words, say **a** and **z**, like so:

    >> parse s [ "T" a: 3 skip z: to end]
    == true
    >> a
    == "his is a string."
    >> z
    == " is a string."

7. Telling REBOL to Copy Between Tags

Look what happens if we add back the dialect directive copy and the name x.

    >> parse s [ "T" copy x a: 3 skip z: to end]
    == true
    >> x
    == "his"

We have told REBOL, using parse, to process the input that s handles this way:

  1. keep going if the first char! in s matches our rule "T"
  2. keep a copy of s from the second char! to the fourth char!

Now REBOL only copies what is between the marks a and z. Using PARSE, we have told REBOL to do this, in effect:

    >> f: copy/part next s 3
    == "his"
    >> x
    == "his"

REBOL now has two words, f and x, each of which REBOL evaluates to "his". f points to a different "his" than x does.

    >> same? f x
    == false

But the "his" of x and the "his" of f are equal in length and char!(s).

    >> equal? f x
    == true

8. Modifying a String through PARSE

And so you asked: "And why is there the ":begin" at the end of the "remove" line?"

Let's start with a simpler example.

    s: {Owner_Name: "" ;; A #"^/"}

First, the /all refinement tells REBOL to look at every char! in the input. This tells REBOL to parse every character until it gets to a semi-colon and then zip through to the end.

    >> parse/all s [ any [ to #";" to end] ]
    == true

    parse/all s [ any [ to #";" begin: to end] ]

From this point forward, it is more like a dialog between us and REBOL.

Hey REBOL, parse every character until you get to a semi-colon and then give me a view before you zip through to the end.

    >> parse/all s [ any [ to #";" begin: to end ]]
    == true
    >> begin
    == {;; A #"
    "}

Hey REBOL ...

  1. parse every character until you get to a semi-colon
  2. and then give me a view to the newline char!

    >> parse/all s [ any [ to #";" begin: to newline to end ]]
    == true
    >> begin
    == {;; A #"
    "}

Hey REBOL ...

  1. parse every character until you get to a semi-colon
  2. and then give me a view to the newline char!
  3. and then give me a view to the end

    >> parse/all s [ any [ to #";" begin: to newline ending: to end ]]
    == true
    >> begin
    == {;; A #"
    "}
    >> ending
    == {
    "}

Now the bits between () are data in the DO dialect that asks REBOL to give it meaning and work with it.

First, let's look at it as **ordinary REBOL data, er, code**.

    >> temp: [begin :begin]
    == [begin :begin]
    >> type? first temp
    == word!
    >> type? second temp
    == get-word!

So what does a word! tell REBOL to do, and what does a get-word! tell REBOL to do, when given either one in the so-called "default function environment", as Carl calls it?

  1. begin tells REBOL to evaluate it, i.e., give it meaning.
  2. :begin tells REBOL to return whatever has been associated with it without any evaluation, i.e., **give me the code, er, data, but do not evaluate it.**

Both of these look the same because begin handles a string!

    >> begin
    == {;; A #"
    "}
    >> :begin
    == {;; A #"
    "}

But to make it clear to you:

    box: func [
        a [integer!]
        b [integer!]
    ][
        as-pair a b
    ]

    >> box
    ** Script Error: box expected a argument of type: integer
    ** Near: box

This fails because REBOL expects two arguments to be passed to box.

    >> box 23 19
    == 23x19

So, now get the data, er, code:

    >> code: :box

And this fails for the same reason as above. code names the same code, er, data as box does. REBOL merely made an association to it in the dictionary.

    >> code
    ** Script Error: code expected a argument of type: integer
    ** Near: code
    >> code 23 19
    == 23x19

Back to our story ...

Recall that begin and ending are merely views of s.

    >> ending
    == {
    "}
    >> (index? ending)
    == 27
    >> begin
    == {;; A #"
    "}
    >> (index? begin)
    == 20

ending is a 27th-view of s and begin is a 20th-view of s, i.e., from the 27th and 20th positions respectively.

This asks REBOL to do some arithmetic.

    >> ((index? ending) - (index? begin))
    == 7

Why? We only want what is between the start of the comment and the position right before the newline char!

    >> length? begin
    == 9
    >> length? ending
    == 2
    >> begin
    == {;; A #"
    "}
    >> ending
    == {
    "}

At the console, let's ask REBOL what REMOVE does:

    USAGE:
        REMOVE series /part range

    DESCRIPTION:
        Removes value(s) from a series and returns after the remove.
        REMOVE is an action value.

    ARGUMENTS:
        series -- (Type: series port bitset none)

    REFINEMENTS:
        /part -- Removes to a given length or position.
            range -- (Type: number series port pair)

First, let's make a copy of begin so we can play with it.

    >> haha: copy begin
    == {;; A #"
    "}

remove/part tells REBOL to remove value(s) from a series to a given length or position. Taking the 7 from above:

    >> remove/part haha 7
    == {
    "}

REMOVE returns the series at the point of the removal, which here is the head of haha.

    >> :haha
    == {
    "}

So this is how we leverage computers programmatically, that is, through substitution.

    >> remove/part haha ((index? ending) - (index? begin))
    == {
    "}
    >> begin
    == {;; A #"
    "}
    >> ending
    == {
    "}
    >> haha
    == {
    "}

So now, if we were to update our s string! like so:

    s: {
    Owner_Name: "" ;; A
    Co-Owner_Name: "" ;; B
    }

Telling REBOL to use PARSE to get meaning from a parse dialect:

    >> parse/all s [ any [ to #";" begin: to newline ending: to end ]]
    == true
    >> begin
    == {;; A
    Co-Owner_Name: "" ;; B
    }
    >> ending
    == {
    Co-Owner_Name: "" ;; B
    }

See? begin has a view from the first semi-colon, and ending has a view from the newline to the end of the string. That includes the next line.

So if we modify our parse rules:

    >> parse/all s [ any [
    [ to #";" begin: to newline ending: to end (
    [ remove/part begin ((index? ending) - (index? begin))
    [ ) :begin
    [ ]
    [ ]
    == false

REBOL removes the comments via the DO dialect.

    >> s
    == {
    Owner_Name: ""
    Co-Owner_Name: ""
    }

So why the :begin? Well, after removing some char!(s), s has shrunk. We need to tell REBOL that, so REBOL can start a new view of the permanently altered input. Without it, the parse position stays where `to end` left it, and there is nothing left to scan.

Let's modify our parse rules to print what is going on.

    >> parse/all s [ any [
    [ to #";" begin: to newline ending: to end
    [ (
    [ print join "the string is: " s
    [ print join "it's length is:" length? s
    [ print join "length of view position begin is: " length? begin
    [ print join "amount to remove is: " ((index? ending) - (index? begin)))
    [ (
    [ remove/part begin ((index? ending) - (index? begin))
    [ )
    [ (print join "view position is now: " index? :begin) :begin
    [ ]
    [ ]
    the string is:
    Owner_Name: "" ;; A
    Co-Owner_Name: "" ;; B

    it's length is:50
    length of view position begin is: 29
    amount to remove is: 5
    view position is now: 22
    the string is:
    Owner_Name: ""
    Co-Owner_Name: "" ;; B

    it's length is:45
    length of view position begin is: 5
    amount to remove is: 4
    view position is now: 41
    == false

If we do not tell REBOL to start with the new altered view of s, REBOL thinks its job is done.

    >> parse/all s [ any [
    [ to #";" begin: to newline ending: to end
    [ (
    [ print join "the string is: " s
    [ print join "it's length is:" length? s
    [ print join "length of view position begin is: " length? begin
    [ print join "amount to remove is: " ((index? ending) - (index? begin)))
    [ (
    [ remove/part begin ((index? ending) - (index? begin))
    [ )
    [ (print join "view position is now: " index? :begin)
    [ ]
    [ ]
    the string is:
    Owner_Name: "" ;; A
    Co-Owner_Name: "" ;; B

    it's length is:50
    length of view position begin is: 29
    amount to remove is: 5
    view position is now: 22
    == true
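Pulling the pieces together, here is a minimal Rebol 2 sketch of the mark, remove and seek pattern on a small string. The sample input and the words `text`, `mark` and `eol` are made up for illustration. Like the rule under discussion, it assumes every comment is followed by a newline and that no semicolon appears inside a string.

```rebol
; Hypothetical sample input: two lines, each ending in a comment.
text: {a: 1 ; first^/b: 2 ; second^/}

parse/all text [
    any [
        to #";" mark:            ; mark: saves the position of the semicolon
        to newline eol:          ; eol: saves the position of the newline
        (remove/part mark eol)   ; delete everything between the two saved positions
        :mark                    ; seek back to mark; the newline now sits there
    ]
]

probe text
```

Here remove/part is given the end position itself rather than a computed length, which avoids the index? subtraction. After the run, text should hold the two assignments with no comment text left. Taking out the :mark line leaves the parse position pointing past the text that was just removed, which is the situation the seek is there to repair.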
posted by: Stone Johnson 20-Aug-2019/17:13:59-7:00
Errata for the above:

This: "you have told REBOL to give you a view of the string! from index position of 3."

Should be: "you have told REBOL to give you a view of the string! from index position of 4."
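The corrected position can be checked at the console. A minimal sketch, where the word `t` is made up for illustration:

```rebol
s: "This is a string."
t: skip s 3            ; step past the first three characters
print index? t         ; 4, because series positions are 1-based
print same? s head t   ; true: t is another view of the same string, not a copy
```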
posted by: Stone Johnson 20-Aug-2019/17:29:26-7:00