Home   Archive   Permalink



Options for structuring a flat-file database

Hi!
    
I am preparing to write a small application to help a friend keep track of her impressive CD collection (that she also so generously shares with friends, me included). This would be helpful to her but not really vital, as it is really more of an opportunity for me to further learn REBOL.
    
Because I do not expect this application to be too complex, I decided not to interface with a Relational Database System like MySQL or SQLite and use instead flat-files. I intend to have a file for her CDs, another for the people she lends to, mix those to produce a file on circulating CDs and a fourth file for when the CDs return, to keep a history. The result of returns could change some values in the CD file.
    
I have been reading all I can about REBOL doing databases like this. All agree on using a block that I can “Load” and “Save”. So far so good. But then we seem to have several options for structuring these flat flies. One suggests a big block of strings and other REBOL values (like dates), with a form like:
    
table-users: [ “name1” “surname1” #phone number1 “name2” “surname2” #phone number2 …]
    
and etcetera, which should work fine as long a I stick to the structure. With this version I can use neat tricks for series like: extract, find, select and sort/skip.
    
But then there is the recommendation of using a block for each record, as in:
    
table-users: [
    [ “name1” “surname1” #phone number1 ]
    [ “name2” “surname2” #phone number2 ]
    …
]
    
which should create more manageable data (instead of one big blob of text) that could be handled with foreach loops and path notation but seems to require some intermediate allocation of data. For example Nick in his tutorial uses:
    
user-list: copy []
foreach user table-users [append user-list user/1]
user-list: sort user-list
    
so that he can offer the users as data for a text-list (notice that he was able to use “sort” only after extraction). Meanwhile, Nick offers a different -elegant and seemingly economic- approach when dealing with the big blob of text:
    
text-list data (sort extract table-users 3)        ;however, how do I pick the surname instead?
    
Also, important tools like “exclude”, “union” and “intersect” seem to work with blocks of values instead of blocks of blocks (wow, going all “meta” here).
    
Then there is this book on REBOL by Ralph Roberts (2001, yeah, old) where he says that not all blocks are equal. He describes the “List” and the “Hash”data types (which look just like a block), where the former provides fast and efficient modification of values and the later is optimized for searches in large files. Unfortunately the examples do not clarify my options.
    
As I try to figure how to approach this exercise I focus on efficiency. Let's imagine that this app would be used by a big library instead. Which approach would handle to most number of records without breaking down? How many records before it would be better not to load them into memory and read them line by line from the file instead? Which option would perform faster searches? Which would introduce changes in values securely? Do I need to worry about sticking to hash or list data types? Should I introduce a primary key number for each record in case I have duplicate CDs (now this is looking like a database)? How can I perform a search in the one-block-per-record approach that will find a record with this name AND this surname?
    
I am grateful for any insights and recommendations that you could provide and also hope that these questions will help other beginner rebolers in enhancing their craft.
    
Cheers,

posted by:   Damian       27-Jun-2013/12:52:13-7:00



Hi Damian,
    
I have only time today to post a quick initial reply - I'll try to get back to you more later during the next couple days.
    
I once created a flat file database for a commercial inventory app which currently stores approximately 1.3 million fields of data (including all historical inventory items). That app regularly needs to perform 10 separate searches on each field in the entire database, using the 'find function. That *entire* process (all 10 searches) takes approximately .2 seconds to complete (1/5 of a second) each time it's run, on an inexpensive Windows netbook with an Atom processor.
    
That database uses the format:
    
inventory: [
"32259100090" "(ITEM NAME)" $4.72 "n" "$5.90" "14-Feb-2013" "0" $4.72 "(SUPPLIER)"
"29349820340" "(ITEM NAME)" $3.59 "n" "$4.99" "12-Mar-2012" "0" $4.76 "(SUPPLIER)"
]
    
It's easy to write test cases to compare the real time required to perform searches on any exact data format options you may consider, using "now/time/precise". The best answer for me has always been to actually perform such tests on randomly generated sample data sets - writing that code only takes a minute (running it to generate huge amounts of data may require a coffee break, but it's well worth the time). Take a peek at http://business-programming.com/business_programming.html#section-16 for some examples.
    
For single user applications with similar volumes of data as in the example above, where the entire database can be held in memory on a single machine, the block formats perform well. I tend to use the "flat" format most often (as in the example data above), because it's simple and fast, both in terms of writing code and performance. Most of the operations you had questions about can be handled with either of the following types of code patterns:
    
R E B O L []
inventory: [
"32259100090" "(ITEM NAME)" $4.72 "n" "$5.90" "14-Feb-2013" "0" $4.72 "(SUPPLIER)"
"29349820340" "(ITEM NAME)" $3.59 "n" "$4.99" "12-Mar-2012" "0" $4.76 "(SUPPLIER)"
]
    
temp: copy []
foreach [sku item p1 t p2 date qty p3 supplier] inventory [
     ; do something with each item, or each line
     if ((sku = "32259100090") and ((to-date date) < 14-Mar-2013)) [
         append temp reduce [sku item p1 t p2 date qty p3 supplier]
     ]
]
probe temp
    
count: 0
repeat i (length? inventory) [attempt [
     if (("32259100090" = y: copy inventory/:i) and ((z: copy to-date pick inventory (i + 1)) < 14-Mar-2013)) [
         append temp reduce [y z]
     ]
]]
probe temp
halt
    
I use foreach most often. I use indexes when I need to find any fields in relation to any other field (i.e., fields that skip any i + n or i - n number of indexes away from a given field in the file), when there is no unique key (i.e., when 2 or more fields such as fist name AND last name fields form the only unique identifier for a data row). But I can't think of a situation where it wouldn't be better to provide a unique key if it's required to differentiate between unique rows (SKU is the unique identifier in the example above).
    
You can see that saving data such as dates in the proper format saves 1 conversion step (to-date date). I suspect in most cases, you'll want to save data in the proper format, instead of as strings, but you can run tests on sample data to compare performance, if you'll be doing a lot of that sort of operation, and you want to perform string searches, for example.
    
I use nested blocks most often when *sorting by column* will be a primary operation. You can convert the database quickly from flat to nested formats. Again, several million fields should only take a fraction of a second, so this is not a dramatic performance bottleneck. If you're not sure in any cases, just run tests on some sample data.
    
For network applications with a few users, I prefer to write a small handler that accepts search data over a network port, performs the search in local memory, and then delivers the results over the network port. For bigger data sets or many simultaneous users, you may want to look at Doc's MySQL driver or other database options. These little flat file structures are really meant for what Carl describes as "programming in the small". They're convenient and portable for all sorts of user apps, but if you're going to be maintaining big piles of data, a database system may likely be among your expected tools.

posted by:   Nick       28-Jun-2013/8:42:45-7:00



P.S., regarding ";however, how do I pick the surname instead?"
    
R E B O L []
table-users: [
        "name1" "surname1" #9348223
        "name2" "surname2" #1232394
]
probe sort extract at table-users 1 3
probe sort extract at table-users 2 3
probe sort extract at table-users 3 3
halt

posted by:   Nick       29-Jun-2013/23:53:25-7:00



Thank you, Nick, for your insights and feedback! I went to read the link you recommend and it was also very helpful. What I am understanding so far I can summarize like this:
    
1. Flat-file means one block of REBOL data type values (including but not limited to strings). In contrast, nested blocks stands for a block of data where each record (i.e. a set of attributes) is also a block.
    
2. A flat-file database is easy to implement and run and can be transformed quite easily into a nested- blocks format if it were required.
    
3. You have developed an app that uses a flat-file structure that currently holds about 1.3 million records, can load them all into memory and takes a fifth of a second to sort, find or otherwise process; using an average (for today's standards) windows computer (and this would still be considered a small amount of data in the sense that it does not require the processing power of a RDBMS, correct?).
    
4. It is possible to time processing algorithms and pick the more efficient ones.
    
5. The design of data processing algorithms should consider minimizing the repetition of tasks that are slower or tax the processor more.
    
This helps me a lot and points in the direction of using a flat file structure that I can LOAD and handle as a Series data type, using a marker and an index, and all the other Series functions, to retrieve or set values.
    
There seems to be a third option for organizing data but that it's likely more appropriate for data received from a third party app: that of delimiter-separated values (like comas or tabs) and where each set of attributes has its own line, that is, are terminated by a newline character. Something like this:
    
name1    surname1    phone1
name2    surname2    phone2
etc.
    (notice that the tab and newline characters are invisible)
    
Apparently, the way to import this collection of data would be with the read/lines function/refinement. But it seems that this function/refinement will create a block of strings, with each string closed where the newline character was and, within the string, each value separated by the delimiter. Following the example, we would get something like this:
    
[“name1^-surname1^-#phone1” “name2^-surname2^-#phone2” “name3^-surn...” etc.]
    
To process this would require parsing and conversion, each time, and then transforming the block used by the script back into a tab-separated values list. Such seems like a lot of unnecessary work. Is this correct?
    
Cheers,

posted by:   Damian       30-Jun-2013/20:48:34-7:00



The only reason I ever use delimiters is when importing or exporting from/to a foreign software program or language. For example, for the inventory program mentioned above, the grocery store receives varied inventory information from every one of their many suppliers, all in different formats - mostly spreadsheets of their own design, often with extra lines and labels scattered throughout, sometimes even just text files and plain text emails containing the sku, pricing, qty, etc. information.
    
I wrote a little importer program to help convert the data from all their different sources, and to save it as part of one consistently formatted REBOL flat file. This import process would have to be performed no matter what language my program was written in (and I find REBOL's tools such as 'parse much easier to use than regular expressions, for example). The importer does a bunch of little helpful things like updating quantities, checking for badly formed data: nonsense symbols in bar codes, letters in prices, missing data fields, missing delimiters, extra lines and headers that are only meant to be read by humans, etc. The manager and owners also generally read their invoices and rearrange columns in the spreadsheets before importing.
    
The entire importer script in the case above is less than a page of code, and it saves a *lot* of manual labor dealing with all the inconsistent sources. Some basics about importing CSV files are described a bit at http://business-programming.com/business_programming.html#section-3.11 and http://business-programming.com/business_programming.html#section-17.8 , but this sort of work is mostly about understanding the specific requirements of each parsing situation. In the situation above, we noticed for example, that workers at each supplier regularly used consistent labeling and layout patterns, and that data regularly contained certain types of errors that were regularly identifiable, so I wrote parse code to handle each case, every time the users identified a "problem" with a new supply sheet (usually just a line or two of code). One supplier always sent their inventory lists in PDF format, so I used a little 3rd party app that extracted the data in their PDF format, into nice parse-able columns, and then automated the whole process of launching the app, reading/parsing/importing the data, etc. That may sound like a lot, but was just a few lines of code that took all of maybe 5 minutes to write.
    
Here's another example from intro of the same tutorial above. I put this together as a result of helping someone parse the output of their Paypal sales reports. It downloads a report from a web site, and parses and displays a total of the values in the "Gross Sales" (column #8). 4 short lines:
    
sum: $0
foreach line at read/lines http://re-bol.com/Download.csv 2 [
     sum: sum + to-money pick (parse/all line ",") 8
]
alert join "Total Gross Sales: " sum
    
Many more code examples of this type of parsing and data organization are found throughout the tutorial above, for example http://business-programming.com/business_programming.html#section-6.16 . Really, *all* the examples in the tutorial are meant to provide insight about how "real" things are accomplished using REBOL block format. Look at all the case studies to see working code and explanations about how I thought through the process (take a look, for example, at the guitar accompanient program to see how I created wave files to be played back as needed, or the freecell card game to see how I converted and used playing card images, or the Textris example to see how I kept tract of playing piece data in a textual "video" game, etc.)
    
Again, the only reason to use data formatted with delimiters in REBOL is if you want to *share* data with *other* programs, programming languages, data services, etc. I've done this not only for things like spreadsheets, but also for examples such as http://www.ns-basic.com/nsbasic.html#section-14 and http://rfobasic.com/#section-8.7 (examples of sharing data between NS-Basic and RFO Basic, which have limited abilities to process/exchange structured data formats). You'll find parsers for formats such as JSON and other common formats already written in REBOL (that one by Chris Ross Gill) to ease the data exchange process. The MySQL driver by Nenad (Doc Kimbel), for example, makes the connection and transfer between REBOL data structures and MySQL storage native and simple (results are returned in native REBOL blocks). There's also a great driver for SQLite by Ashley at http://www.dobeash.com/sqlite.html . The tutorial above has examples of how to use each. The tutorial at http://re-bol.com/etsy_api_tutorial.html explains how to use REBOL with Etsy.com developer API (based on a RESTful URL interface with JSON-formatted responses - Etsy linked that article in their developer resources http://www.etsy.com/developers/documentation/resources/developer_resources).
    
You can choose to use tools like that whenever you need or want, use SQL when it makes sense, RDBMS systems for bigger multi-user apps in particular, and still use any of the features of REBOL to build the rest of your app: GUIs, structure and flow of interaction with users (conditional logic handling, etc.), network socket and HTTP interaction, email, graphic output, CGI processing, data import/export, etc. For the types of apps that I write most often, native REBOL block structure is typically all that I need to store and manipulate user data.
    
Hope that helps :)

posted by:   Nick       2-Jul-2013/11:48:17-7:00



Carl Sassenrath also wrote a small phone directory available at rebol.org and likely in the Rebol Cookbook. I've repurposed that code for exactly the same purposes you have mentioned with little fuss. Added fields and filters and buttons etc. Works like a charm. I also repurposed it for use as a quick and dirty knowledge base tool - like Evernote and the former Catch mobile apps. I have 1600+ records of varying complexity from simple content like web-site login credentials to sections of code etc.

posted by:   Brock       10-Sep-2013/17:05:38-7:00