


Parse trying to learn parse

this is my log1.txt file
line0: header
line1: aaaa
line1: bbbb
line1: cccc
line2: ddd
line2: eee
    
this is my code
REBOL []
    
log: read/string %log1.txt
    
parse/all log [ some [ thru "line0:" copy msgLine0 to newline
                         | thru "line1:" copy msgLine1 to newline
                         | thru "line2:" copy msgline2 to newline
    
    
                         ( print rejoin [ msgLine0 " * " msgline1 " * " msgLine2]
                            msgLine0: msgLine1: msgLine2: copy "" )
                        ]
    
                ]
    
    
my goal is to extract the 2nd column from the log1.txt file using parse.
    
these are the results:-
header * cccc * ddd
* * eee
    
it has missed lines aaaa, bbbb
    
the logic I am using in plain language is:-
one or more line0 | one or more line1 | one or more line2
    
wonder why this does not work.
    
can someone help me understand? thank you

posted by:   nubie     20-Feb-2020/21:14:21-8:00



Code in parentheses runs when the rule it is part of matches.
    
The way you have written this, your code in parentheses is connected only with the case that the third rule in the alternate list matches.
    
e.g. if you are to write:
    
     some [ rule0 | rule1 | rule2 (code) ]
    
It's like you had written:
    
     some [ rule0 | rule1 | [rule2 (code)] ]
    
So it will only be when rule2 matches that the code runs. Hence you are running the code exactly two times; once for each time the line2 matches.
    
If you wish the code to run each time *any* of the rules matches, you would write:
    
     some [ [rule0 | rule1 | rule2] (code) ]
    
If you wanted the code to run only once, after the repeated matching is over (i.e. when the SOME rule has finished), you would write:
    
     some [rule0 | rule1 | rule2] (code)
    
Hopefully that clarifies things...
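    
A minimal sketch of the three placements, using a made-up three-character input rather than the log file (the strings and messages here are only for illustration). It can be pasted into a console:

```rebol
; (code) belongs to the last alternate only:
; runs once, when "c" matches
parse/all "abc" [some ["a" | "b" | "c" (print "after c only")]]

; (code) follows the grouped alternates:
; runs after every successful match, so three times here
parse/all "abc" [some [["a" | "b" | "c"] (print "after each match")]]

; (code) follows SOME itself:
; runs once, when the whole repetition has finished
parse/all "abc" [some ["a" | "b" | "c"] (print "after the loop")]
```

The first and third calls each print their message once; the second prints its message three times.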

posted by:   Fork     20-Feb-2020/22:37:40-8:00



Thanks Fork for the clear explanations. I modified the code according to your advice. It works :)
    
I am able to extract the 2nd column from the data file. It is so much easier than reading line by line and extracting the 2nd column.
    
showing the modified code
    
REBOL []
    
log: read/string %log1.txt
    
msgLine0: msgLine1: msgLine2: copy ""
    
parse/all log [ some [ [ thru "line0:" copy msgLine0 to newline
                             | thru "line1:" copy msgLine1 to newline
                             | thru "line2:" copy msgline2 to newline
    
                         ]
    
                         ( print rejoin [ msgLine0 " * " msgLine1 " * " msgLine2]
                            msgLine0: msgLine1: msgLine2: copy "" )
    
                        ]
                ]
    
    
and results after running the code:-
header * *
* aaaa *
* bbbb *
* cccc *
* * ddd
* * eee

posted by:   nubie     21-Feb-2020/8:01:05-8:00



Hi Fork,
I tried something a little bit different but the results are weird. I just repeated the first group of data a 2nd time    
    
line0: header
line1: aaaa
line1: bbbb
line1: cccc
line2: ddd
line2: eee
line0: header2
line1: fff
line1: ggg
line1: hhh
line2: iii
line2: jjj
    
and I run the same code above, and this is what I am getting
    
header * *
header2 * *
* fff *
* ggg *
* hhh *
* * iii
* * jjj
    
where did the aaaa, bbbb, cccc, ddd, eee go? they were there when there was less data.
    
I am kind of puzzled as to how parse works
    
generally speaking how does parse work?
    
does it take the whole data set and compare it against rule1, then the whole data set against rule2, then the whole data set against rule3?
    
or does it take one line at a time and go against rule1, then rule2, then rule3? line1 of data against rule1, line1 of data against rule2, line1 of data against rule3, then line2, then line3, and so on...
    
I tried adding a (print index? log) on each parse rule e.g
thru "line0:" copy msgLine0 to newline ( print index? log )
    
but it returns me 1 all the time.
    
kind of scratching my head...
    
    
    


posted by:   nubie     21-Feb-2020/17:53:54-8:00



You are using alternates (the pipe character, |, is used to separate the alternates). They are run in priority order, and you are using THRU with that... so the THRU of the earlier rules will always take priority.
    
For instance:
    
     parse "aba" [
         some [
             thru "a" (print "A!")
                 |
             thru "b" (print "B!")
         ]
     ]
    
That will give you:
    
     A!
     A!
    
Because it will try the alternate ruleset once, find it can reach an "a". Then try the alternate ruleset a second time, and find it can reach an "a" again. It never even looks for a "b" until it has already passed it.
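    
One way to watch this happen is to capture the current input position with a set-word inside the rule. The sketch below is the same example with a marker added; the word `pos` is just a name picked for illustration. This is also why `(print index? log)` always printed 1: PARSE keeps its own position and never moves the word `log`, which still refers to the head of the string.

```rebol
parse/all "aba" [
    some [
        ; pos: stores the input position reached after the THRU
        thru "a" pos: (print ["A! now at index" index? pos])
            |
        thru "b" pos: (print ["B! now at index" index? pos])
    ]
]
```

That should print:

     A! now at index 2
     A! now at index 4

The second THRU "a" jumps from index 2 straight past the "b" to the end of the input, so the "b" alternate is never reached.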
    
So combining THRU and an alternates list is going to get you a pecking order you don't appear to like. What other choice you make depends on what you are looking for. For instance:
    
     some [
         ; grab the data assuming `line` starts each line
         ; copy up TO (but not including the newline)
         ;
         [ "line0:" space copy msgLine0 to newline
         | "line1:" space copy msgLine1 to newline
         | "line2:" space copy msgline2 to newline ]
    
         newline ; now consume the newline (could also SKIP)
        
         ( print rejoin [ msgLine0 " * " msgLine1 " * " msgLine2]
            msgLine0: msgLine1: msgLine2: copy "" )    
     ]
    
This is assuming that your input data has a newline even on the last line (this is actually a convention in Unix--that the last line of a file should have a newline on it--which has good reasons). But if you don't like that assumption you can have rules like `[newline | end]` to match either. And you can say things like `nend: [newline | end]` to make compound rules and reference them.
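    
A small sketch of the `nend` idea, assuming made-up inline data whose last line has no newline. The names `data`, `msg` and `non-newline` are only for illustration. It copies with a charset instead of `to nend`, because not every Rebol version accepts a block rule after TO:

```rebol
; hypothetical data: no newline after the last line
data: "line1: aaaa^/line2: ddd"

non-newline: complement charset "^/"   ; any character except newline
nend: [newline | end]                  ; end of line OR end of input

parse/all data [
    some [
        ["line1:" | "line2:"] " "
        copy msg some non-newline      ; the text up to the line end
        nend                           ; consume the newline, or match the end
        (print msg)
    ]
]
```

Both lines get printed, "aaaa" and then "ddd", even though the last one is not terminated by a newline.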

posted by:   Fork     21-Feb-2020/21:41:49-8:00



Thanks a lot Fork for the clear explanations again.
    
I was using thru, as I thought it was easier to use and it would cover many scenarios, not really knowing the full implication. Now it is clearer to me with the explanations you provided above.
    
I have modified the code now based on your advice. I have defined a space variable and use "any space" in the rule, just to cover a scenario, that a line may be starting with space.
    
REBOL []
    
log: read/string %log1.txt
    
msgLine0: msgLine1: msgLine2: copy ""
    
space: " "
    
parse/all log [ some [ [ any space "line0:" copy msgLine0 to newline
                            | any space "line1:" copy msgLine1 to newline
                            | any space "line2:" copy msgline2 to newline
                         ]
    
                         skip
    
                         ( print rejoin [ msgLine0 " * " msgLine1 " * " msgLine2]
                            msgLine0: msgLine1: msgLine2: copy ""
                         )
    
                        ]
                ]
    
    
    
I will continue playing with parse, to learn more about it. So far, after those 4 days reading the parse chapter in the Rebol core docs and playing with it, I like what it can do. The next thing I will try is to see if I can find a way to trace it when it is running. One thing that I have been trying to see is what gets fed into each rule at each step of the way and where the pointer is.
    
    
Thanks again for your help.


posted by:   nubie     22-Feb-2020/9:14:28-8:00



Nice that you are enjoying PARSE. It's a fairly addictive alternative to RegEx. Being able not just to form named rules to break your problem into smaller parts... but also to build those rules programmatically... can be kind of a revelation (especially if you're not coming from a Lisp background).
    
It's one of the best practical examples so far of how Rebol has taken the same box of parts (like blocks and words and code in parentheses) and given them a new meaning. There's this freedom for what you can make it do when there are really "no keywords".
    
So once you catch on to this idea, you can look at solving other problem domains with a similar "liberated" mindset.

posted by:   Fork     22-Feb-2020/10:04:23-8:00



Yes definitely better than regular expressions :)

posted by:   nubie     24-Feb-2020/22:24:51-8:00