Parse for Pattern Matching exercise

Hi!
    
I am interested in learning Parse better and decided to attempt an email validator. However, my solution returns false for a pattern that should be standard enough, and I hope that someone can see where I made a mistake, as I feel quite lost.
    
Please note that I am not asking for a good email validator; my aim is to improve my knowledge of Rebol’s Parse, so I am more interested in flaws in my logic and in other approaches to solving the problem than in a recipe that just works.
    
So let me state the conditions for the exercise:
    
1. An email address is made of two parts: a user name and a domain name, separated by an at sign (@):
    
name @ domain
    
2. The name is allowed to include the dot, the underscore or the hyphen provided that the dot is not the last or first character.
    
3. The domain can only include a hyphen as a special character, as long as it is not the last or first character...
    
4. but it can include more than one dot and, to make things really crazy...
    
    
5. it doesn't really require an extension (like .com, .org or .info) which, I know, is super odd, yet it’s valid and makes this exercise challenging.
    
Given these conditions I came up with:
    
alphanum: charset [#"A" - #"Z" #"a" - #"z" #"0" - #"9"]
underscore: charset [#"_"]
hyphen: charset [#"-"]
dot: charset [#"."]
chars: union union alphanum underscore hyphen
chardot: union chars dot
alphahyphen: union alphanum hyphen
    
name: [some chars any chardot some chars]
domain: [some alphanum any alphahyphen some alphanum any [dot some alphanum]]
email-rules: [name #"@" domain]
    
parse "boris.xyz@mywebsite.org" email-rules
    
This version expects name and domain to be at least two characters long but, other than that, should not reject most mainstream versions, like the one in the example. Yet, to my surprise, the parse returned false. Any ideas why?
    
Thanks in advance!

posted by:   brotherdamian     24-Jul-2019/3:39:26-7:00



I fully support your efforts to understand parsing, because I also do not understand it, and I will be interested in seeing any answers you might get. However, off-topic a bit, you know, I assume, that because email is one of the REBOL data types, there is a faster way to solve this particular problem:
    
R E B O L []
    
either equal? email! type? load "boris.xyz@mywebsite.org" [
     print "valid email"
] [
     print "not valid email"
]
    
either equal? email! type? load "some-string of characters" [
     print "valid email"
] [
     print "not valid email"
]
    
halt

posted by:   Steven White     2-Aug-2019/16:09:49-7:00



Hi Steven.
    
From: http://www.rebol.com/rebol-core.html
    
"Rich Set of Built-in Datatypes
In addition to the datatypes found in most languages, REBOL can also express money, times, dates, words, tags, logic, lists, hashes, tuples, XY pairs, and many other datatypes."
    
Given that, why use string parsing, which goes char! by char!, when you can leverage REBOL to the fullest and parse on datatypes, in this case email!?
    
;; the dopey way passes data along in strings
>> stringemail: "boris.xyz@mywebsite.org"
== "boris.xyz@mywebsite.org"
    
;; Here is a fast way, perhaps the fastest way to validate a string email
    
>> email? attempt [to-email "boris.xyz@mywebsite.org" ]
== true
    
;; REBOL will transmute string emails into email!
;; That is your first hint it is valid.
>> to-email stringemail: "boris.xyz@mywebsite.org"
== boris.xyz@mywebsite.org
    
;; you can put anything into a block way fast
>> to-block to-email stringemail: "boris.xyz@mywebsite.org"
== [boris.xyz@mywebsite.org]
    
>> parse to-block to-email stringemail: "boris.xyz@mywebsite.org" [email!]
== true
    
;; or
    
>> parse to-block to-email "boris.xyz@mywebsite.org" [email!]
== true
    
;; or
    
>> parse reduce [to-email "boris.xyz@mywebsite.org"] [email!]
== true
    
;; or
    
>> temp: copy []
== []
>> append temp 'to-email
== [to-email]
>> append temp "boris.xyz@mywebsite.org"
== [to-email "boris.xyz@mywebsite.org"]
>> parse reduce temp [email!]
== true
    
    
But if you truly want to muck around with strings, let's get into it.
    
    
1. The word parse labels a function.
2. That function reads string!(s) or block!(s) and attempts to get meaning from the implied patterns matched against specified patterns.
3. Said another way, the input might imply a pattern that you hope matches a pattern you need.
    
The key requirements of Parse.
1. Parsing rules must be written in the Parse dialect.
2. Parsing rules must be in a rules block.
3. Data must be a series! (either a string! or a block!).
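
A minimal sketch of the third requirement, assuming Rebol 2 (the /all refinement, which stops parse from skipping whitespace in strings, belongs to Rebol 2; the sample values are made up for illustration). With a string!, the rules match characters; with a block!, they match whole values:

```rebol
; string input: each rule matches characters
print parse/all "ten" ["t" "e" "n"]                        ; true

; block input: each rule matches one value, here by datatype
print parse [10 "ten" a@b.com] [integer! string! email!]   ; true
```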
    
OK, so let's do a bit of comparison to series! in REBOL.
    
Let's say you have this string. Well, you should know that for any string!, which REBOL "models" internally, REBOL has a pointer that tracks where you (or your "script," i.e., your messages to REBOL) are in the string!.
    
>> s: "This is a well-meaning sentence."
== "This is a well-meaning sentence."
>> index? s
== 1
    
;; you're at the head of s. You have a "head view" of it.
    
>> head? s
== true
    
;; now let's skip "down" (maybe that should be across) the string by 1
    
>> s: next s
== "his is a well-meaning sentence."
>> index? s
== 2
    
;; The first bit of the string did not disappear. REBOL still has the whole string!.
    
>> head s
== "This is a well-meaning sentence."
    
;; Even though the current view of the string starts from index 2.
    
>> s
== "his is a well-meaning sentence."
    
;; More skipping
    
>> s: next s
== "is is a well-meaning sentence."
>> index? s
== 3
    
;; this is more like a leap by comparison
    
>> s: skip s 5
== " a well-meaning sentence."
    
>> index? s
== 8
    
OK, why am I focusing your thoughts on this? Well, when you tell REBOL to parse a string, that is what REBOL does: it moves through the string one character at a time.
    
The difference is this, though. At the current index, REBOL tries to match the input against the current rule in your rules block!. If a match happens, REBOL advances past the matched input and carries on with the "next" rule in your rules block!.
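
The link between the series index and parse can be seen with a set-word inside the rules, which stores the current input position. This is only a sketch, assuming Rebol 2; the word mark is my own choice of name:

```rebol
s: "This is a well-meaning sentence."

; skip four characters, remember the position, then run to the end
parse/all s [4 skip mark: to end]

print index? mark    ; 5
probe mark           ; " is a well-meaning sentence."
```

The value stored in mark is the same kind of "view" of the string that next and skip return in the console session above.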
    
When REBOL applies parse (the function) against input (string! or block!), REBOL checks whether the input at the current position matches the rule it is currently looking at.
    
REBOL will keep applying parse until either a failure or reaching "input exhaustion," that is reaching the end by matching everything. A failure makes REBOL return false and stops the matching (parsing) regardless of how much stuff remains. At input exhaustion, REBOL returns true.
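
A small sketch of those outcomes, assuming Rebol 2 (digit is a name I made up for the example). Note that rules which match but stop short of the end of the input also give false:

```rebol
digit: charset "0123456789"

print parse/all "123"  [some digit]   ; true:  rules matched and input is exhausted
print parse/all "123x" [some digit]   ; false: rules matched, but "x" is left over
print parse/all "x123" [some digit]   ; false: the very first character fails
```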
    
Let's break down your problem into two problems:
    
1. Parse the name.
2. Parse the domain.
    
OK, as I see your problem constraints:
    
1. No name can start with a dot or end with a dot.
2. Names can begin with underscores, hyphens along with letters and numbers.
3. No domain can start with a hyphen or end with a hyphen.
4. No domain can have an underscore.
5. A domain begins after the @ symbol.
6. A domain can have more than one dot.
    
Looked at another way:
    
1. Any name must start with a letter, number, hyphen or underscore.
2. The characters in the middle can be any of letter, number, hyphen, underscore or dot.
3. Any name must end with a letter, number, hyphen or underscore.
    
1. Any domain must start with a letter, number or dot.
2. The characters in the middle can be any of letter, number, hyphen, or dot.
3. Any domain must end with a letter, number or dot.
    
Here is our general form: parse input rules. From the above, we have one problem broken into two:
    
names-rules: [
]

domain-rules: [
]

And thus our rules word! will look like this:

rules: [
    names-rules
    "@"
    domain-rules
]
    
    
    
Here is one solution:
    
letter: charset [#"a" - #"z" #"A" - #"Z"]
figure: charset [#"0" - #"9"]
dot: charset {.}
uscore: charset {_}
hyphen: charset {-}
    
Here are the rules for names.
    
nse: [ letter | figure | uscore | hyphen ]
nmid: [ letter | figure | uscore | hyphen | dot ]
names-rules: [nse nn nmid nse]
    
Now let's tackle the domain.
    
dse: [ letter | figure | dot ]    
dmid: [ letter | figure | hyphen | dot ]    
domain-rules: [ dse dn dmid dse ]
    
Notice that in the big rules block, I've done something extra.
    
1. I split the email using parse, which puts the results into a block.
2. I get the length less 2 for each of the name and the domain.
3. To get the numbers, I put that into a block and reduce it.
4. With a handle to it, I can get each number.
    
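Why did I do this? Well, a rule of this shape fails if the middle is matched with the Parse dialect word 'some.' It keeps matching for as long as it can and never gives characters back, so it also swallows the last character and leaves nothing for the end rule. We want the middle stuff matched an exact number of times, leaving the last character to be checked against the hard-core constraint.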
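
Here is a sketch of that effect using the charsets from the original question, assuming Rebol 2. The second rule is not the counting method used here; it is one possible alternative that avoids counting by requiring every run of dots to be followed by at least one non-dot character:

```rebol
alphanum: charset [#"A" - #"Z" #"a" - #"z" #"0" - #"9"]
chars: union alphanum charset "_-"
dot: charset "."
chardot: union chars dot

; original shape: 'any chardot' swallows ".xyz" right to the end,
; so the final 'some chars' has nothing left to match
print parse/all "boris.xyz" [some chars any chardot some chars]      ; false

; alternative shape: dots are only accepted when more chars follow them
print parse/all "boris.xyz" [some chars any [some dot some chars]]   ; true
```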
    
This method of solving is the same for the domain.
    
rules: [    
    (
    times: reduce [ -2 + length? first b: parse s "@" -2 + length? second b]
    nn: times/1
    dn: times/2     
    )
    names-rules
    "@"
    domain-rules
]
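
To see what the paren at the top of the rules block works out, the same steps can be run by hand at the console. This sketch assumes Rebol 2, where parse with a string as its second argument splits the input on those characters (and on whitespace):

```rebol
s: "boris.xyz_@parsed.com"

b: parse s "@"
probe b                        ; ["boris.xyz_" "parsed.com"]

; each part has 10 characters, so 8 are left for the middle rule
print -2 + length? first b     ; 8
print -2 + length? second b    ; 8
```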
    
>> s: "-bo.-_ris.xyz_@.234d---fadf.com"
== "-bo.-_ris.xyz_@.234d---fadf.com"
>> parse s rules
== true
    
>> parse s rules
== true
>> s
== "-bo.-_ris.xyz_@.234d---fadf.com"
>> s: "boris.xyz_@parsed.com"
== "boris.xyz_@parsed.com"
>> parse s rules
== true
>> s: "boris.xyz.@parsed.com"
== "boris.xyz.@parsed.com"
>> parse s rules
== false
>> s: "-----_____boris.xyz@parsed.com"
== "-----_____boris.xyz@parsed.com"
>> parse s rules
== true
    
I hope this helps you. It has helped me in thinking through the exercise.
    
You might want to test this with test cases. If the answer fails at some point against a test case, let me know.
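
A possible starting point for such test cases, as a sketch only and assuming Rebol 2. The rule names (name-rule, label, domain-rule, email-rule) are hypothetical, and these rules follow the conditions in the original question, so unlike my rules above they do not let a domain start or end with a dot:

```rebol
alphanum: charset [#"A" - #"Z" #"a" - #"z" #"0" - #"9"]
chars: union alphanum charset "_-"

; dots in the name must be followed by at least one non-dot character
name-rule: [some chars any [some "." some chars]]

; a domain label may contain hyphens, but not at its start or end
label: [some alphanum any [some "-" some alphanum]]
domain-rule: [label any ["." label]]

email-rule: [name-rule "@" domain-rule]

; pairs of address and expected result
foreach [address expected] [
    "boris.xyz@mywebsite.org"  true
    "boris@localhost"          true
    "boris@my-site.co.uk"      true
    "boris.xyz.@parsed.com"    false
    ".boris@parsed.com"        false
    "boris@-parsed.com"        false
    "boris@parsed-.com"        false
] [
    print [address "->" parse/all address email-rule "expected" expected]
]
```

Each printed line shows the parse result next to the expected value, so a mismatch is easy to spot.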


posted by:   Stone Johnson     20-Aug-2019/3:06:20-7:00