
Parsing with to

I'm building a web scraping application. When I model the parsing like this, it works as I expect:
     p: " stuff <h1> a header </h1> <p> some words </p> <p> more words </p> whatever "
     chars: charset [ #"a" - #"z" ]
     h1: [ <h1> copy title any chars (print title) </h1> ]
     para: [ <p> copy ptext any chars (print ptext) </p> ]
     r: parse p [ any chars h1 some para any chars end ]
     print r
Output is:
     a header
     some words
     more words
If, instead, I do something similar with "to", I get an error.
     p: " stuff <h1> a header </h1> <p> some words </p> <p> more words </p> whatever "
     header: [ <h1> copy title to </h1> (print title) ]
     para: [ <p> copy ptext to </p> (print ptext) ]
     r: parse p [ to header [ some para ] to end ]
     print r
     ** Script Error: Invalid argument: <h1> copy title to </h1> print title
     ** Near: r: parse p [to header [some para] to end]
Can someone explain why "to" behaves so differently than "any chars"? I'm sure that I'm missing something.
Thanks for your help.

posted by:   Andyh       25-Jan-2012/23:46:28-8:00

This is because string parsing and block parsing are totally different. Take a look at the Core manual for details.
In string parsing the input consists of characters; in block parsing it consists of REBOL values (words, numbers, blocks).

posted by:   Endo       26-Jan-2012/3:26:36-8:00

TO cannot accept a rule as an argument. You could instead use:
     r: parse p [ to <h1> header [some para ] to end]
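
One further detail, offered as a sketch rather than a tested fix: TO stops just before its target, so after "copy title to </h1>" the input is still positioned at the closing tag. The rules below therefore match each closing tag explicitly before moving on. This assumes REBOL 2 and the sample string from the question; the explicit </h1> and </p> matches are an addition that does not appear in the original rules.

```rebol
p: " stuff <h1> a header </h1> <p> some words </p> <p> more words </p> whatever "

; TO leaves the position in front of the closing tag,
; so each rule matches that tag itself before returning.
header: [ <h1> copy title to </h1> </h1> (print title) ]
para:   [ <p> copy ptext to </p> </p> (print ptext) ]

; TO is given a literal target (<h1>); the HEADER rule then follows it.
r: parse p [ to <h1> header some para to end ]
print r
```

On real pages the opening tags often carry attributes, in which case a literal <p> would likely not match and the rules would need adjusting.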

posted by:   DocKimbel       28-Jan-2012/3:38:03-8:00

Thanks a bunch Doc. I think I've got it. Now I'll try some real web pages!

posted by:   Andyh       29-Jan-2012/15:02:26-8:00

Happy scraping! That is what made me want to learn REBOL in the first place, twelve years ago. ;-)

posted by:   DocKimbel       30-Jan-2012/13:47:12-8:00