Home   Archive   Permalink



PARSE for whitespace and case

I wrote this script that tells you if two strings are identical, barring whitespace and capitalisation differences.
    
     effectively-same? 'a' 'a' ==> ['same']
     effectively-same? 'A' 'a' ==> ['case']             ;; same ignoring case
     effectively-same? 'a ' ' a' ==> ['whitespace']         ;; same, ignoring non-separating whitespace
     effectively-same? 'a' 'A ' ==> ['whitespace' 'case'] ;; same ignoring whitespace and case
     effectively-same? 'a' 'b' ==> ['different']
    
I'd be fascinated to see if any PARSE guru can turn my 20 lines of procedural code into
a single, elegant parse that runs x100 faster. If you have a go, you can choose your own result format - doesn't
have to be a block.
    
Have fun, if you give it a try.
    
Sunanda
    
     effectively-same?: func [
;; ====================================
     target [string!]
     other [string!]
     /local
     ws-target
     ws-other
     result
     ][
        result: copy []
        if strict-equal? target other [
            return ['same']
         ]
             ;; remove whitespace
        ws-target: trim/lines copy target
        ws-other: trim/lines copy other
             ;; and further fold case to lower
        lc-target: lowercase copy ws-target
        lc-other: lowercase copy ws-other
        if not strict-equal? lc-target lc-other [return ['different']]
        
        if all [equal? ws-target ws-other
                not equal? target other][
                 append result ['whitespace'] ;; Whitespace differs
                ]    
        if all [not strict-equal? ws-target ws-other
                strict-equal? lc-target lc-other][
                append result 'case'        ;; Case differs
             ]
     return result
     ]
        
        
Test harness that works with the above code:    
        
     test-es: func [
;; ====================================
     /local
        test-sets
        res
        ws-chars
        str1
        str2
][
     ws-chars: reduce [' ' newline tab lf cr]
     test-sets: [
             'a'    'a'     ['same']
             'a'    'A'     ['case']
             'a'    '?a'     ['whitespace']
             '?a'    'a'     ['whitespace']
             '?a'    'a?'    ['whitespace']
             'a'    'a?'     ['whitespace']
             '?a?' 'A?'    ['case' 'whitespace']
             '?a?' '?b?' ['different']
             'a' 'b'     ['different']
             ]
    
    loop 100 [
         foreach [target other expected-result] test-sets [
             str1: copy target
             str2: copy other
             while [find str1 '?'] [replace str1 '?' random/only ws-chars]
             while [find str2 '?'] [replace str2 '?' random/only ws-chars]
             res: sort effectively-same? str1 str2
             if res <> expected-result [
                 print 'Oops!'
                 print ['str1:' mold str1]
                 print ['str2:' mold str2]
                 print ['expected' mold expected-result]
                 print ['got' mold res]
                 ]
             ]
         ]
     print 'test-es ended'
] ;; func
        test-es         ;; run tests

posted by:   Sunanda       3-Oct-2016/6:51:45-7:00



Shorter and a little faster:
    
    effectively-same?: func [
        target [string!]
        other [string!]
    ][
        case [
            strict-equal? target other [[same]]
            equal? target other [[case]]
            strict-equal? target: trim/lines copy target other: trim/lines copy other [[whitespace]]
            equal? target other [[case whitespace]]
            true [[different]]
        ]
    ]
    
This PARSE version might be a little faster on larger strings, not sure:
    
    effectively-same?: func [
        target [string!]
        other [string!]
    ][
        case [
            parse/all/case target [other][[same]]
            parse/all target [other][[case]]
            parse/case target: trim/lines copy target reduce [other: parse other " ^/^M^-"][[whitespace]]
            parse/all target [other][[case whitespace]]
            true [[different]]
        ]
    ]
    
My inclination is that the overhead of building a single parse rule that checks for case and non-case matches AND space permutations would likely preclude a single parse rule being more efficient, or indeed elegant.

posted by:   Chris       3-Oct-2016/15:15:21-7:00



A change to the line in the PARSE version:
    
    parse/case target: trim/lines copy target reduce [other: trim/lines copy other][[whitespace]]


posted by:   Chris       3-Oct-2016/15:17:55-7:00



First, you claim this: "I wrote this script that tells you if two strings are identical."
    
But with computers, identity means same memory address. To check that:
    
>> a: "string"
== "string"
>> b: copy a
== "string"
    
>> same? a b
== false
    
>> d: a
== "string"
>> same? a d
== true
    
The first fails because the RVC gets told to make a copy at a pseudo new address (the RVC does memory management).
    
The second works because the RVC gets told to reference the same evaluation result with another name.
    
That said, you can check if the strings match, that is, if the strings have the same sequence of characters.
    
You present the problems as these (numbered for reference)
    
1. same == 'a' 'a'
2. same, no case == 'A' 'a'
3. same, ignore whitespace == 'a ' ' a'
4. same ignore whitespace and case == 'a' 'A '
5. different 'a' 'b'
    
        
alike: func [
    
    y [string!]
    z [string!]
    /nocase
    /nospace
][
    any [
        ;; 4
        if all [nocase nospace] [
            z: trim/all z
            return parse trim/all y [z]    
        ]        
        ;; 3
        if nospace [
            z: trim/all z
            return parse/case trim/all y [ some [space | z ] ]
        ]    
        ;; 2
        if nocase [ return parse/all y [z] ]
        ;; 1
        return parse/all/case y [z]
    ]
]
    

posted by:   Time Series Lord       3-Oct-2016/16:49:55-7:00



And it would be better if I had named the above as alike?

posted by:   Time Series Lord       3-Oct-2016/16:55:10-7:00



;; 4 could be written as:
    
        if all [nocase nospace] [
            return parse trim/all y reduce [trim/all z ]
        ]        
    
    


posted by:   Time Series Lord       3-Oct-2016/17:06-7:00



Final version.
    
Note: The code above was copied from the wrong scratch efforts.
    
Output:
    
>> alike? "a" "a"
== true
>> alike?/nocase "A" "a"
== true
>> alike?/nospace "a " "a"
== true
>> alike?/nospace/nocase "a" "A "
== true
>> alike? "a" "b"
== false
    
alike?: func [
    
    y [any-string!]
    z [any-string!]
    /nocase
    /nospace
][
    any [
    
        ;; 4
        if all [nocase nospace] [
            return parse trim/all y reduce [trim/all z ]
        ]        
        ;; 3
        if nospace [
            return parse/all y [ some [space | z ] ]
        ]    
        ;; 2
        if nocase [ return parse/all y [z] ]
        ;; 1
        return parse/all/case y [z]
    ]
]

posted by:   Time Series Lord       3-Oct-2016/17:13:18-7:00



Thanks for the replies -- some interesting approaches there. I'm on a dodgy internet connection while abroad, so I'll play with them offline and comment more later.
    
same? vs equal? -- a similar problem exists with FIND which matches on equal? not same? So it is not useful for finding cached live cached code segments.
    
My usage case for effectively-same? is that the strings have come from different files -- so, yes, in strict Rebol terminology, I need an equals? not a same? Issue here is mixing user language ("tell me if these two are the same") with Rebolese.

posted by:   Sunanda       4-Oct-2016/2:19:32-7:00



Note that TRIM is a modifying function--if you TRIM as in some of those examples, then the original string is lost.
    
(modifying functions: alter, append, change, clear, detab, entab, insert, lowercase, remove, remove-each, replace, reverse, sort, trim, uppercase)

posted by:   Chris       4-Oct-2016/12:32:08-7:00



Now that list of modifying functions is interesting, not in itself but in that you know it. Did you find that in the documentation or did you hack it out by trial and error? It seems like that is information that a "REBOL master" should have memorized. And that brings up the next question, what other bits of information should a person know to be really handy with REBOL. Not that I'm expecting you to tell me, it's just an interesting idea that there should be some things that a person should "just know."

posted by:   Steven White       4-Oct-2016/13:39:42-7:00



That is right Chris. Trim is a cheat. However, Sunada failed to specify in what state he needed strings.
    
That can be fixed with ease:
    
alike?: func [
        
     y [any-string!]
     z [any-string!]
     /nocase
     /nospace
][
     any [
        
         ;; 4
         if all [nocase nospace] [
             Z: TRIM/ALL COPY Z
             Y: TRIM/ALL COPY Y
             return parse Y [Z]
         ]        
         ;; 3
         if nospace [
             return parse/all y [ some [space | z ] ]
         ]    
         ;; 2
         if nocase [ return parse/all y [z] ]
         ;; 1
         return parse/all/case y [z]
     ]
]
    


posted by:   Time Series Lord       4-Oct-2016/14:04:58-7:00



Steven, the list of modifying functions comes from the dictionary: http://www.rebol.com/docs/dictionary.html (group no. 7) and there's some background on them in this post: http://www.rebol.com/article/0206.html 'Is REBOL a Pure Functional Language?'
    
The most important piece of handy information to retain is where the docs are :)
    


posted by:   Chris       4-Oct-2016/20:55:59-7:00



Thanks again for the replies. I used up most of the time I was going to spend looking at them because I'd stupidly
saved the webpage as a PDF - oh the damage that does to code -- all sorts of unacceptable (to Rebol) hyphen and space and
newline characters. In the end, it took me a whole 10 lines of Rebol code to clean the mess. If I inadvertently broke
your code in doing so - my apologies: and disregard my comments about its behaviour.
    
Warning: I am going to move the goalposts.....I'm looking for code that runs unchanged under all three mainstream
variants: R2, R3, and Red.
    
Chris, your "Shorter and a little faster" looks good! Runs all my published (and unpublished) test cases, and
is just fine in R2, R3 and Red.
    
And I like the way you rethought the problem. I started with identifying [whitespace] so identifying [case] was bolted on.
You did it the other way around - and got a much more elegant result.
    
Chris too, your parse version is almost as good. Works in R2 and R3 with one blip. Red says "parse has no ALL refinement".
The blip is: effectively-same "" "" (ie two empty strings) - comes back: [whitespace].
    
Time, alike? looks good, but it's not quite up their with "Shorted and a litter faster": R2 and Red both dislike
some of the parse dialect (SPACE and ALL respectively). In R3 it has the same blip as Chris' parse version.
    
If any thing, this highlights a common issue in the Rebolsphere. It is easy (with wrapper functions along the lines of
r2-forward.r) to write portable Rebol code that will run in all mainstream Rebol versions. But portable parse is
problematic - especially as there is no (easy) user way to extend it with wrappers. So, for a while at least, scripts
that depend on parse are less portable.
    
Steve, Chris, TRIM. Chris, you are quite right - you can trash the input strings as much as you like, you've only been given copies.
    
    
Thanks again, guys.

posted by:   Sunanda       5-Oct-2016/2:17:12-7:00



"Warning: I am going to move the goalposts.....I'm looking for code that runs unchanged under all three mainstream
variants: R2, R3, and Red." ~ Sunanda
    
Well, since you're not paying me to produce production code, Sunanda, I'll bow out.
    
I banged out alike? in seconds at the console. alike? is REBOLish as it returns a logic!, which is what anyone should want.
    
Good luck getting others to work for you.

posted by:   Time Series Lord       5-Oct-2016/11:40:29-7:00