write seems to be adding extra characters

REBOL []
myLines: read/lines %test1.txt
write/lines %test2.txt myLines
    
halt
    
    
Why does write/lines add one extra newline character to the data it reads from another file?
    
If my input file has one line, my output file has one extra line, with just a linefeed.


posted by:   AlloAllo     20-Nov-2018/23:29:26-8:00



Technically speaking, the definition of a "line" in a file is that it ends in a newline:
    
This means all files should end in newlines (though on Windows, many programs violate this):
    
https://stackoverflow.com/questions/729692/why-should-text-files-end-with-a-newline
    
So that is the convention WRITE/LINES should be following. If you deliberately want to go against the standard and have no newline on the last line, you'd have to implement your own WRITE-LINES-NO-NEWLINE-ON-LAST-LINE. But I'd recommend against that, and instead get your input files to comply with the standard.
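    
For illustration only, here is a minimal sketch of what such a helper might look like. The name WRITE-LINES-NO-FINAL-NEWLINE is hypothetical (nothing like it ships with Rebol), and it assumes plain WRITE keeps its usual text-mode behavior:

```rebol
; Hypothetical helper: write a block of lines, but with no newline
; after the last one (i.e. a deliberately "unfinished" text file).
write-lines-no-final-newline: func [
    file [file!]
    lines [block!]
    /local out
][
    out: copy ""
    foreach line lines [
        append out line
        append out newline
    ]
    ; drop the newline that was added after the last line
    if not empty? out [remove back tail out]
    write file out
]
```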
    
If you mean that your original file already had a newline on its last line, and your new file has *two* newlines, then that's a bug.

posted by:   Fork     21-Nov-2018/1:20:15-8:00



I've updated an old ticket to suggest that READ/LINES error by default (without some explicit signal set on a port or codec of some kind) if the input file lacks a terminal newline:
    
https://github.com/rebol/rebol-issues/issues/2102#issuecomment-440706477
    
(I don't think there's much hope in fighting software complexity unless stands are made on these basic issues!)

posted by:   Fork     21-Nov-2018/11:36:42-8:00



Thanks, Fork.
    
write %test.txt "Hello Rebol"
write/lines %test1.txt read %test.txt
    
Try these two lines, then open test1.txt: it has one extra character that the original file test.txt does not have.
    
    


posted by:   AlloAllo     23-Nov-2018/8:24:16-8:00



> Try these two lines, then open test1.txt:
> it has one extra character that the
> original file test.txt does not have.
    
That case is a bit odd, because WRITE/LINES is accepting a plain STRING!. It's not clear what it should do in that case. As it happens R3-Alpha doesn't add the newline, but I think it should...if it's going to let WRITE/LINES take STRING! at all:
    
I've opened a bug about that:
    
https://github.com/rebol/rebol-issues/issues/2328
    
I guess to make a long story short, your file is unfinished if it is to be considered a "text file". But WRITE alone doesn't guarantee a finished file... it doesn't know if you're going to add more:
    
     write %test.txt "Hello Rebol"
     ... ;-- some more code
     write/append %test.txt "verse!^/"
    
That might have been intended to get a file with "Hello Rebolverse!" on its own line in it.
    
The best thing for you to know is that to make a valid complete text file, it needs a newline on the last line. This is important for raw binary concatenation:
    
     bin1: read/binary %file1.txt
     bin2: read/binary %file2.txt
     write %concatenated.txt append bin1 bin2
    
There are "dumb" utilities on the command line which act like this, assuming a file already has an ending newline and they can just add more to it without checking. It's a compelling reason why the standard exists. So I think the best solution to this problem is for you to follow the standard, and the best solution for the system is to find ways of helping you follow it.
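    
A rough sketch of the kind of check that would catch this before concatenating. It assumes READ/BINARY returns the raw bytes and that LAST on a BINARY! gives an integer (as it does in Rebol2); the function name ENDS-WITH-NEWLINE? is made up for the example:

```rebol
; Hypothetical check: does the file end in a line feed (byte 10)?
ends-with-newline?: func [file [file!] /local data] [
    data: read/binary file
    all [
        not empty? data
        10 = last data
    ]
]

; Report any input that is not a "finished" text file
foreach file [%file1.txt %file2.txt] [
    if not ends-with-newline? file [
        print [file "has no terminal newline"]
    ]
]
```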
    
If you want a second opinion you can try raising an issue on Red's GitHub, since it acts like Rebol2--and you could see if they agree that terminal newlines should be enforced (one way or another) when using "/LINES"

posted by:   Fork     23-Nov-2018/23:54:54-8:00



When we do the write %test.txt "Hello Rebol", is there a special character on that line when we open the test.txt file? Is that an end-of-line or a newline?
    
Yeah, what actually happened is that I was given a txt file with a lot of lines to process. At the end of each line there is an end-of-line or newline.
    
So I did a read/lines of the file and used a foreach to process each line.
    
I kept all the processed lines in a variable,
    
and then at the end I did a write/lines to a file.
    
When I gave the file to my colleague to process, he had problems. This is when we found out that Rebol had added an extra empty line to the file with just one end-of-line or newline, which is weird to me, as the input file already had an end-of-line or newline at the end of the LAST line.
    
I did more tests to find out whether read/lines or write/lines is the problem, and this is how I found the extra character that was added.
    
I will try Red to see if it does the same.

posted by:   AlloAllo     24-Nov-2018/23:01:05-8:00



"newline" and "end of line" aren't real things, in ASCII there is just "carriage return" (CR) and "line feed" (LF).
    
Windows historically used a two-character sequence to end lines, `CR LF`, while Unix has used just `LF`. Macs used lone `CR` in the distant past, but eventually shifted to Unix compatibility.
    
In order to "simplify" things, Rebol on Windows transforms CR/LF into just LF in strings. Then it puts it back as CR/LF when you write the file out. I think this is a mistake to do automatically, and the right answer is to notice that basically the only program on Windows that can't handle line feeds is Notepad...and that it's time to make a stand and standardize your files on Windows to just LF:
    
http://blog.hostilefork.com/death-to-carriage-return/
    
Given all the ways you can corrupt CR/LF in text files (mixtures, backwards as LF CR, etc.) it's clearly more of a hassle. It's better to consider these to be a different file format, treat them as if you were loading a different character encoding or character set...then you can specify to the codec what you want it to do with the edge cases. But I want text-based processing to just error by default if your file contains CRs.
    
(Again: how is one ever going to make a stand against complexity if you don't push back? Those who want to remain in the world of complexity should be the ones paying for it, by using a codec...which you could just opt out of building into the executable if you were ready to commit to living in a saner world.)
    
> when I gave the file to my colleague to process, he had problems,
> this is when we found out that, Rebol had added an extra empty line
> to the file with just one end-of-line or newline, which is weird to me
> as the input file already had an end-of-line or newline at the end of
> the LAST line in the input file.
    
The test case you've given doesn't reproduce this. So you'll have to make a very small test case that does.
    
Note that from within Rebol, you can print the binary contents of a file. Here in Rebol2 on Windows:
    
     >> write %test.txt "AAA^/"
     >> print mold read/binary %test.txt
     #{4141410D0A}
    
There you see 3 A's, and a CR (0x0D) and LF (0x0A) sequence (on Unix it should be just LF). If you make a second file with WRITE/LINES of a READ/LINES of that, you get the same thing out.
    
     >> write/lines %rewritten.txt read/lines %test.txt
     >> print mold read/binary %rewritten.txt
     #{4141410D0A}
    
So if you're seeing something other than this behavior, you should look at what's going on. It may be that your processing made a block of lines and whatever your processing did added a newline:
    
     >> write/lines %duplicate.txt ["Hello" "World^/"]
     >> print mold read/binary %duplicate.txt
     #{48656C6C6F0D0A576F726C640D0A0D0A}
    
So if you're doing something between the READ/LINES and WRITE/LINES that could make something like that happen, that could be where the actual problem is.
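    
One way to look for that, sketched here under the assumption that the processed lines are held in a block called LINES (a placeholder name, with stand-in data), is to scan for strings that already contain a line break before handing them to WRITE/LINES:

```rebol
lines: ["Hello" "World^/"]   ; stand-in for your processed data

; Flag any line that already carries its own newline character
foreach line lines [
    if find line newline [
        print ["embedded newline in:" mold line]
    ]
]
```

Any line reported this way would get a second line ending from WRITE/LINES, which shows up as an extra empty line in the output.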
    
Again: try to produce the smallest case that actually demonstrates the problem, since your examples haven't seemed to so far.

posted by:   Fork     25-Nov-2018/0:15:39-8:00