


Web Scraping

Hello, I recently discovered the Rebol language and would like to ask whether it is well suited to web scraping. I want to write a program that:
    
1 Connects to a webpage: http://www.oanda.com/currency/historical-rates
    
2 Fills 2 date fields
3 Fills 2 text fields
4 Clicks the Get Table button
5 Waits for the requested data
6 Captures the number that follows "High:" into a variable
7 Captures the number that follows "Low:" into a second variable.
    
Do you think this is possible with Rebol 2 or 3?
Any hints would be welcome.
    
Cheers
    
Francisco

posted by:   Francisco       24-Aug-2010/7:30:56-7:00



Hi, Rebol can do this very easily.
You need to POST your data as shown below (I'm too lazy to do the whole job, so this just gives the idea):
    
t: read/custom http://www.oanda.com/currency/historical-rates [POST "exch2=usd&date1=06/07/08&date=09/07/08&..."]
    
Add any other form elements in place of the "..."; t will hold the server response.
Then parse t. Here is an example of parse:
    
t: {
     <TR>
     <TD BGCOLOR="#DFDFDF"><FONT FACE="Verdana" size="2">  High:</FONT></TD>
     <TD BGCOLOR="#EFEFEF"><FONT FACE="Verdana" size="2"> 0.79060 </FONT></TD>
     </TR>
     <TR>
     <TD BGCOLOR="#DFDFDF"><FONT FACE="Verdana" size="2">  Low:</FONT></TD>
     <TD BGCOLOR="#EFEFEF"><FONT FACE="Verdana" size="2"> 0.77410 </FONT></TD>
     </TR>
}
    
>> parse/all t [some [thru {<FONT FACE="Verdana" size="2">} copy v to {<} (print v) ]]
    
;result will be:
  High:
0.79060
  Low:
0.77410
    
Hope this will help. Sorry for my laziness.

posted by:   Endo       24-Aug-2010/8:11:11-7:00



@Nick: my post looks weird because it contains some HTML. Could you fix that?

posted by:   Endo       24-Aug-2010/8:13:58-7:00



Hello Endo:
    
Thanks for your help.
    
Could you please post the part of the string that clicks the "Get Table" button?
    
On another development platform (.NET), I had to use:
    
WebBrowserControl1.Navigate(Uri,String)
Uri,String: javascript:document.getElementById('SUBMIT').click();
    
    
Cheers
    
Francisco

posted by:   Francisco       24-Aug-2010/9:11:52-7:00



Hi, I don't simulate clicking the Get Table button. Instead, I send an HTTP request and get the result.
Open the page and look at its source. You'll see the <form ...> tag and form elements such as <input name=date1 ...>, which give you the names of the form elements to fill in your request:
    
t: read/custom http://example.com [POST {date1="010203"&currency="usd"}] ;etc..
    
Your request goes to the server and t holds the server response, which you can then parse.
    
Also look at this document: http://www.rebol.org/art-display-article.r?article=x60w
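
As a rough sketch of how such a request string could be assembled, the snippet below builds a POST body from name/value pairs and percent-encodes the values. It assumes Rebol 2 (where to-hex yields 8 hex digits); the field names and values are only examples, and url-encode is a hypothetical helper, not a built-in.

```rebol
REBOL []

; Example name/value pairs (illustrative only; take the real names from the page's form)
fields: [
    "date1" "01/01/10"
    "date"  "01/02/10"
    "exch"  "EUR"
    "expr"  "USD"
]

; Characters that can be sent without encoding
safe: charset [#"a" - #"z" #"A" - #"Z" #"0" - #"9" "-_.~"]

; Hypothetical helper: percent-encode every character not in 'safe
url-encode: func [text [string!] /local out] [
    out: copy ""
    foreach ch text [
        either find safe ch [
            append out ch
        ] [
            ; to-hex returns 8 hex digits in Rebol 2; keep the last two
            append out join "%" copy/part skip to-string to-hex to-integer ch 6 2
        ]
    ]
    out
]

; Join the pairs as name=value&name=value
body: copy ""
foreach [name value] fields [
    if not empty? body [append body "&"]
    append body rejoin [url-encode name "=" url-encode value]
]

print body
; The request itself would then look like this (URL is a placeholder):
; t: read/custom http://example.com reduce ['POST body]
```

Building the string from a block keeps the field list easy to edit, and encoding the values avoids problems with characters such as "/" in dates.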


posted by:   Endo       24-Aug-2010/10:35:11-7:00



Hello,
    
I'm still reading about the parser... (Serious stuff!!...) But as a preliminary test, I've used the following line in the rebol console:
    
t: read/custom http://www.oanda.com/currency/historical-rates [POST "exch2=eur&expr2=usd&date1=01/01/10&date=01/02/10"]
    
After that, I've sent t to the editor:
    
editor t
    
There I found the raw HTML of the first screen (the query form), but the result screen is missing.
    
Do you see anything wrong in the code sent to the webserver?
    
Cheers
    
Francisco

posted by:   Francisco       24-Aug-2010/14:43:02-7:00



Francisco,
    
Here's an example string that works:
    
     t: read/custom http://www.oanda.com/currency/historical-rates [POST {lang=en&result=1&date1=08%2F18%2F10&date=08%2F24%2F10&date_fmt=us&exch=USD&exch2=&expr=EUR&expr2=&margin_fixed=0&format=HTML&SUBMIT=Get+Table}]
    
To get that, instead of wading through the source, I saved the page, changed POST to GET in the source, opened the edited page, clicked the "Get Table" button, and copied the string from the URL bar :)
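
A string copied that way is easier to check when it is split back into its fields. The following is only a sketch, assuming Rebol 2's string-splitting form of parse; the query text is the one from this post, and the names query, item, name and value are just illustrative.

```rebol
REBOL []

query: {lang=en&result=1&date1=08%2F18%2F10&date=08%2F24%2F10&date_fmt=us&exch=USD&exch2=&expr=EUR&expr2=&margin_fixed=0&format=HTML&SUBMIT=Get+Table}

names: copy []
foreach item parse/all query "&" [
    ; Split each item at "="; value may come back as none for empty fields
    set [name value] parse/all item "="
    append names name
    ; dehex turns %2F back into "/"
    print [name "=" dehex any [value ""]]
]
```

Each line of output shows one form field and its decoded value, which makes missing or misspelled fields easy to spot.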

posted by:   Nick       24-Aug-2010/17:21:54-7:00



Endo,
    
Since you're communicating with REBOLers here, a surefire way to send HTML code, with no worry about formatting errors, is to just post the compressed string:
    
     editor decompress #{
789C2BB152A8B60909B2E35200019B1017052777677F1FFF205B256517371054
B2B371F3F70B5170737476B5550A4B2D4A49CC4B545228CEAC4AB5553252B253
CB4B2A2EB056F0C84CCFB0B2D10729B5B3D10F71C16AA2AB1B08123251C140CF
DCD2C0CC4001C338200BEA54AAB8D927BF9C8A4E363731C4E5E45A005B7E5AF8
67010000
}
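
For anyone wondering how to produce such a string, here is a minimal sketch, assuming Rebol 2. The sample HTML and the names html and packed are just illustrative.

```rebol
REBOL []

; Any text containing HTML that the forum might mangle
html: {<TD BGCOLOR="#EFEFEF"> 0.79060 </TD>}

; compress returns a binary value; probe prints it in #{...} form, ready to paste
packed: compress html
probe packed

; The reader gets the original text back with decompress
print to-string decompress packed
```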

posted by:   Nick       24-Aug-2010/17:32:43-7:00



Francisco,
    
In this situation, you can use REBOL nicely for your needs. If you need to do page scraping, AutoIt is a really useful tool (Windows only).

posted by:   Nick       24-Aug-2010/23:07:50-7:00



Hi,
First of all, you should include all the form elements, because the server may need them:
    
p: [POST "exch2=eur&expr2=usd&date1=01/01/10&date=01/02/10&result=1&date_fmt=dd/mm/yy&exch=EUR&expr=USD&margin_fixed=0&name=format=HTML"]
    
url: http://www.oanda.com/currency/historical-rates
    
editor read/custom url p
or
print read/custom url p
This will get the correct results. Parsing is a little bit difficult at first glance, but then you realize that it is really powerful.


posted by:   Endo       25-Aug-2010/2:57:10-7:00



Small fix:
p: [POST {exch2=eur&expr2=usd&date1=01/01/10&date=01/02/10&result=1&date_fmt=dd/mm/yy&exch=EUR&expr=USD&margin_fixed=0&format=HTML}]
    
The previous post had "name=format=HTML"; it should be "format=HTML".

posted by:   Endo       25-Aug-2010/2:59:37-7:00



Endo & Nick: Thanks a lot... I'm still "processing" the parsing section of the tutorials.
    
Francisco

posted by:   Francisco       25-Aug-2010/9:27:33-7:00



Francisco,
    
For what you want to do with parse, it's really not too complex. To extract strings from a larger block of text, using start and end search strings, memorize this syntax:
    
parse (your-text-block) [
     any [thru {start} copy variable to {end} (actions with variable)] to end
]
    
Here's an example from http://re-bol.com/examples.txt :
    
REBOL []
code: {
     view layout [
         btn {some text}
         btn {some more text}
     ]
}
strings: copy []
parse code [
     any [thru "{" copy a-string to "}" (append strings a-string)] to end
]
foreach str strings [
     if true = request rejoin [{Change "} str {"?}] [
         replace/all code str request-text/title rejoin [
             {Change "} str {" to:}
         ]
     ]
]
editor code
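
Applied to the High/Low table from earlier in this thread, the same pattern might look like the sketch below. It assumes the response still uses the FONT tags shown in that sample; the names high, low, v and font-tag are just illustrative.

```rebol
REBOL []

; Sample fragment of the server response, based on the HTML posted earlier
t: {
<TR>
<TD BGCOLOR="#DFDFDF"><FONT FACE="Verdana" size="2">  High:</FONT></TD>
<TD BGCOLOR="#EFEFEF"><FONT FACE="Verdana" size="2"> 0.79060 </FONT></TD>
</TR>
<TR>
<TD BGCOLOR="#DFDFDF"><FONT FACE="Verdana" size="2">  Low:</FONT></TD>
<TD BGCOLOR="#EFEFEF"><FONT FACE="Verdana" size="2"> 0.77410 </FONT></TD>
</TR>
}

high: low: none
font-tag: {<FONT FACE="Verdana" size="2">}

parse/all t [
    ; Skip past the label, then past the next FONT tag, and copy up to the next "<"
    thru "High:" thru font-tag copy v to "<" (high: to-decimal trim v)
    thru "Low:"  thru font-tag copy v to "<" (low: to-decimal trim v)
    to end
]

print [high low]
```

Anchoring on the "High:" and "Low:" labels first means the rule does not depend on how many other FONT tags appear before them in the page.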

posted by:   Nick       25-Aug-2010/10:16:05-7:00



Hello,
Thanks for your support. With your help I was able to load the variables high and low with the correct numeric values taken from the webpage. There is still a lot more to do, as this is the core of a financial forecasting tool that has to retrieve the weekly ranges for a number of past weeks and, after some calculations, apply the results to the current live data.
    
In my book, the retrieval part of the problem was the hardest, and with your help and Rebol it has been a piece of cake.
    
Cheers
    
Francisco

posted by:   Francisco       25-Aug-2010/19:53:45-7:00



By the way, if you need values other than high and low, you can do this:
    
Get the result in CSV format and parse out the text between the PRE tags (<PRE>(take-this-part)</PRE>). t will then be:
    
t: {08/20/2010,0.77960
08/21/2010,0.78530
08/22/2010,0.78710
08/23/2010,0.78710
08/24/2010,0.7880
08/25/2010,0.79080
08/26/2010,0.79040}
    
b: map-each v extract next parse t none 2 [to-decimal v]
    
This will get all the decimal values for the selected week:
    
>> sort b
== [0.7796 0.7853 0.7871 0.7871 0.788 0.7904 0.7908]
>> first b
== 0.7796
>> last b
== 0.7908
>> first maximum-of b
== 0.7908
>> first minimum-of b
== 0.7796
    
etc. Hope this will help.
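
Put together, the PRE extraction and the CSV step could be sketched like this, assuming Rebol 2. The page text here is a made-up stand-in for the real response, and the names page, csv, cols and rates are just illustrative.

```rebol
REBOL []

; Stand-in for the server response; the real page layout around PRE is assumed
page: {<HTML><BODY><PRE>08/20/2010,0.77960
08/21/2010,0.78530
08/22/2010,0.78710</PRE></BODY></HTML>}

; Take the text between the PRE tags
csv: none
parse/all page [thru "<PRE>" copy csv to "</PRE>" to end]

; Split into lines, then each line into date and rate
rates: copy []
foreach line parse/all csv "^/" [
    cols: parse/all line ","
    if 2 = length? cols [append rates to-decimal second cols]
]

print ["low:" first minimum-of rates "high:" first maximum-of rates]
```

Splitting line by line, rather than on all delimiters at once, keeps each date paired with its rate in case both are needed later.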


posted by:   Endo       26-Aug-2010/3:23:19-7:00



"take-this-part" means the text inside the PRE tags.

posted by:   Endo       26-Aug-2010/3:25:35-7:00