


Parsing baby steps

Still just not getting the parsing. I am looking at the manual, and at the makedoc2 source code (a parse-fest if there ever was one) and trying to concoct a simple example to see what it is in my head that might be blocking my understanding. The example below has an explanation in the comments for what I am trying to do in the example. I understand the principle. When I get the === I will copy what follows to the end of the line, trim it, and surround it with h1 tags. When I get a blank line, I will copy to the next blank line and surround what I get with the p tags. But I just don't see what to do to make that happen and I wonder if anyone can offer guidance.
    
Thank you.
    
R E B O L []
    
;; [---------------------------------------------------------------------------]
;; [ Demo for the purpose of trying to understand parsing.                     ]
;; [                                                                         ]
;; [ This demo will transform simple text with one markup item into html.     ]
;; [ The one markup item is the === on one line that indicates a heading     ]
;; [ at the h1 level. The text on that line should be trimmed and surrounded ]
;; [ by the h1 tags. The other lines of text should be divided on the         ]
;; [ blank line and be surrounded by the "p" tags.                             ]
;; [                                                                         ]
;; [ So text like this:                                                        ]
;; [                                                                         ]
;; [ ===Heading 1                                                             ]
;; [                                                                         ]
;; [ Paragraph 1-1                                                             ]
;; [                                                                         ]
;; [ ===Heading 2                                                             ]
;; [                                                                         ]
;; [ Paragraph 2-1                                                             ]
;; [                                                                         ]
;; [ Should be transformed to this:                                            ]
;; [                                                                         ]
;; [ <h1>Heading 1</h1>                                                      ]
;; [ <p>Paragraph 1-1</p>                                                    ]
;; [ <h1>Heading 2</h1>                                                      ]
;; [ <p>Paragraph 2-1</p>                                                    ]
;; [                                                                         ]
;; [ ...or something equivalent.                                             ]
;; [---------------------------------------------------------------------------]
    
;; -- This is the sample input data that we will parse.
IN-TEXT: {
===Heading one
    
This is a paragraph of text under heading one.
We would want it surrounded by the "p" tags.
    
This is a second paragraph
that should have its own set of "p" tags.
    
===A second heading
    
The above heading would be emitted with the "h1"
tags.
    
And here is a second paragraph under the second
heading just to show things are working
}
    
;; -- This will be the parsed input data with its html tags.
HTML-OUT: copy ""
    
;; -- Parse IN-TEXT, mark it up, and append it to HTML-OUT.
    
;; ???
    
;; -- Display the output and halt for probing.
print HTML-OUT
halt
    


posted by:   Steven White     10-Sep-2018/13:13:06-7:00



A simplistic approach to this would be to use BITSET! to positively identify content portions. This is sort-of how MakeDoc works, but with a few more subtle rules.
    
; anything but newlines
content: complement charset "^/"
    
scan-doc: func [text [string!]][
     ; parse/all for rebol 2
     collect [
         parse/all text [
             any [
                 newline
                 | "===" opt " " copy part some content (
                     keep 'heading
                     keep part
                 )
                 | copy part [some content any [newline some content] (
                     keep 'para
                     keep part
                 )
             ]
         ]
     ]
]
    
probe scan-doc {
    
=== A Header
    
A Paragraph
    
Another Paragraph
    
}

posted by:   Chris     10-Sep-2018/13:30:33-7:00



* One line in my post above was missing a close bracket. It should read:
    
| copy part [some content any [newline some content]] (
    
Note that this lets you identify multiline paragraphs.
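
To get from that block to the HTML the original question asked for, something like the following might do. It is only a sketch for Rebol 2, assuming COLLECT and KEEP are available (2.7.6 or later). SCAN-DOC is the function from above with the bracket fix applied; EMIT-HTML is a name made up here for illustration, not part of MakeDoc.

```rebol
REBOL []

; anything but newlines
content: complement charset "^/"

; scan-doc as posted above, with the missing close bracket added
scan-doc: func [text [string!]][
    collect [
        parse/all text [
            any [
                newline
                | "===" opt " " copy part some content (
                    keep 'heading
                    keep part
                )
                | copy part [some content any [newline some content]] (
                    keep 'para
                    keep part
                )
            ]
        ]
    ]
]

; hypothetical helper: turn [type text type text ...] into an HTML string
emit-html: func [doc [block!] /local out][
    out: copy ""
    foreach [type text] doc [
        append out switch type [
            heading [rejoin ["<h1>" text "</h1>^/"]]
            para    [rejoin ["<p>" text "</p>^/"]]
        ]
    ]
    out
]

print emit-html scan-doc {===Heading one

First paragraph,
on two lines.

===A second heading

Second paragraph.
}
```

Each heading should come out wrapped in h1 tags and each paragraph, including the two-line one, in p tags. Keeping the scan and the emit as separate steps means the parse rules never have to know about HTML.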

posted by:   Chris     10-Sep-2018/13:33:45-7:00



While this seems an easy task, a few matters make it more difficult than it looks with historical PARSE.
    
One of the not-so-easy aspects is that TO and THRU don't accept complex rules as their argument, which complicates your paragraph termination conditions. That decision was reversed in Red (and will be in Ren-C too, when time permits).
    
You can try this in Red. There are likely workarounds for Rebol2 and R3-Alpha, but I'd rather treat the fact that it doesn't work there as-is as a bug than figure out what the workaround would be:
    
     heading-rule: [
         "===" copy heading to "^/" (
             append html-out reduce [
                 <h1> heading </h1> newline
             ]
         )
     ]
    
     paragraph-rule: [
         copy paragraph to ["^/^/" | "^/" end | end] (
             append html-out reduce [
                 <p> paragraph </p> newline
             ]
         )
     ]
    
     parse in-text [
         (html-out: copy {})
         some [newline | heading-rule | paragraph-rule]
     ]
    
     print mold html-out

posted by:   Fork     10-Sep-2018/15:17:01-7:00



Here is your mini markdown, in effect.
    
md: {
===Heading 1
    
Paragraph 1-1
    
===Heading 2
    
Paragraph 2-1
}
    
REBOL is interactive. Start small. Hammer away until you train your mind to experience ah ha! moments.
    
Take big problems and break those into smaller ones.
    
>> md: {=}
== "="
>> parse md ["="]
== true
>> md: {===}
== "==="
>> parse md [3 "="]
== true
>> md: {===H}
== "===H"
>> parse md [3 "=" letter]
== true
>> md: {===Heading 1}
== "===Heading 1"
>> parse md [3 "=" some [letter | figure]]
== true
>> parse md [3 "=" hgd: some [letter | figure] ]
== true
>> hgd
== "Heading 1"
    
OK, let's make our rules. We need rules for the headings and rules for the paragraphs. Our goal is to fill in these blocks with valid parse dialect.
    
head-rules: [
]
    
para-rules: [
]
    
For the heading rules:
    
1. match zero or more newline characters
2. match three equals signs
3. copy one or more letters or digits into hgd, then skip ahead to the next newline
4. build an html element with the opening and closing tags and the content in the middle
    
    
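Note: the rules rely on these definitions (letter and figure are also what the console session above uses):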
    
letter: charset [#"a" - #"z" #"A" - #"Z"]
figure: charset [#"0" - #"9"]
    
page: copy {}
    
hr: head-rules: [
    any newline
    3 "=" copy hgd some [letter | figure] to newline
    (
        append page rejoin ["" build-tag [h1] hgd build-tag [/h1]]
    )
]
    
Note: BUILD-TAG has bugs. The empty string! in front is a hack to overcome them: REJOIN takes its result type from the first item, so leading with "" yields a string! rather than a tag!.
    
For the paragraph rules:
    
1. match zero or more newline characters
2. copy one or more letters into pg
3. copy a digit into x
4. match a dash
5. copy a digit into y, then skip ahead to the next newline
6. build an html element with the opening and closing tags and the content in the middle
    
    
pr: para-rules: [
    any newline
    copy pg some letter copy x figure dash copy y figure to newline
    (
        append page rejoin ["" build-tag [p] pg " " x "-" y build-tag [/p]]
    )    
]
    
rules: [
    any [() some head-rules | some para-rules]
    to end
]
    
What is happening there?
    
1. zero or more times, try the sub-rules
2. each try matches head-rules one or more times, or else para-rules one or more times
3. to end then skips whatever input is left
    
Note: the empty () is a hack. It lets you use the escape key in REBOL 2.7.8 in case the any gets REBOL stuck in a parsing loop.
    
>> parse md rules
== true
>> page
== {<h1>Heading 1</h1><p>Paragraph 1-1</p><h1>Heading 2</h1><p>Paragraph 2-1</p>}


posted by:   Stone Johnson     21-Aug-2019/1:45-7:00



hr: head-rules: [
    3 "=" copy hgd some [letter | figure] to newline
    (
        append page maketag 'h1 hgd
    )
]
    
pr: para-rules: [
    copy pg some letter copy x figure dash copy y figure to newline
    (
        append page maketag 'p rejoin [pg " " x "-" y]
    )    
]
    
maketag: func [
    tag [word!]
    text [string!]
][
    rejoin ["" build-tag reduce [tag] text build-tag to-block rejoin ["/" tag]]
]
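
If the BUILD-TAG workaround bothers you, one alternative is to join plain strings and skip tag! values altogether. This is only a sketch, not a replacement anyone above proposed; WRAP-TAG is a made-up name. Because the first item in the block is already a string!, REJOIN returns a string! and the leading "" is not needed.

```rebol
REBOL []

; hypothetical helper: wrap text in an opening and closing tag
; built from plain strings, so no BUILD-TAG and no "" hack
wrap-tag: func [
    tag [word!]
    text [string!]
][
    rejoin ["<" tag ">" text "</" tag ">"]
]

probe wrap-tag 'h1 "Heading 1"
probe wrap-tag 'p "Paragraph 1-1"
```

It takes the same arguments as maketag, so it could be swapped into the rules in its place. It does no escaping of the text and has no support for attributes.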
    
rules: [
    any [() newline | some head-rules | some para-rules]
    to end
]
    
>> parse md rules
== true
>> page
== {<h1>Heading 1</h1><p>Paragraph 1-1</p><h1>Heading 2</h1><p>Paragraph 2-1</p>}


posted by:   REFACTORING FROM ABOVE     21-Aug-2019/1:58:13-7:00