a) UUencode format b) binary vs. text files

Hello,

Could a kind soul please tell me...

a) where I can find a source code for UU<en|de>code algorithm?
I've tried an archie search and looked through SimTel and Garbo,
but all I've found are executable en/decoders. I don't actually
need a full source code: just the specification for the uuencode
format would do, so I could write my own routines.

b) Is there a reliable way to determine whether a file is a plain
text file (so it can be read with TP's "readln," for instance) or
a binary file (so "blockread" should better be used)? I've read Prof.
Salmi's FAQs and looked in lots of programming handbooks, other faqs
etc. but the only suggestion I've found is that a binary file contains
ASCII characters with codes above 128 - obviously the Wrong Thing if
a file in question contains, e.g. tables, bullets (ascii #254 etc.)
or just national characters. Probably the perfect solution would be
to look at the way DOS stores the file, as long as DOS itself is able
to distinguish the two types. For now I read the first n bytes from
the file I want to check (usually n=20) and look for character #0.
It seems that _most_ exe, com, and data files have it, but this method
is purely heuristic and might not always work.

(NB there is a program called "filetype" (filtyp11.zip in Garbo archives)
that tries to determine the type of the file based on its header, but
its function is to detect particular formats, e.g. word processor or
database files, while for a plain text file it only reports "unknown.")

Thanks in advance,
Marek Jedlinski
(marek...@magnum.lodz.pl)

(PS. I'm new to this group so if I've committed a Gross Stupidity
unawares, flame at will. Or, on second thought, don't :)

 

Re:a) UUencode format b) binary vs. text files


In article <Pine.SCO.3.91.951224191950.1781A-100...@interkom.magnum.lodz.pl>
           marek...@interkom.magnum.lodz.pl "Marek Jedlinski" writes:

Quote
> Hello,

> Could a kind soul please tell me...

> a) where I can find a source code for UU<en|de>code algorithm?
> I've tried an archie search and looked through SimTel and Garbo,
> but all I've found are executable en/decoders. I don't actually
> need a full source code: just the specification for the uuencode
> format would do, so I could write my own routines.

There are a few Turbo Pascal source code versions available on the net, there
is one in the Simtel Archive, and various ones in the SWAG files. If you can't
find one, let me know, and I'll mail you one.

Nigel.

          /----------------------------------------------------------\
          | Nigel Goodwin   | Internet : nig...@lpilsley.demon.co.uk |
          | Lower Pilsley   |                                        |
          | Chesterfield    |                                        |
          | England         |                                        |
          \----------------------------------------------------------/
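For reference, here is a minimal sketch of the uuencode line format as it is commonly described; it is not taken from any of the sources mentioned above. The encoded file starts with a "begin <mode> <name>" line. Each data line carries up to 45 bytes: one character giving the byte count (count + 32), then four characters for every three bytes, each holding six bits plus 32. The data ends with a zero-length line followed by "end". Many encoders write a backquote instead of a space for the value 0, and the sketch assumes that convention. The names uuchar and uuencodeline and the file name cat.txt are made up for the illustration.

```pascal
program uuline;

{ Map a 6-bit value (0..63) to its uuencode character. Zero is written
  as a backquote instead of a space (an assumed, but common, convention). }
function uuchar(v: byte): char;
begin
  if v = 0 then
    uuchar := '`'
  else
    uuchar := chr(v + 32);
end;

{ Encode one line. The caller must pass at most 45 bytes in data. }
function uuencodeline(data: string): string;
var
  line: string;
  i: integer;
  b1, b2, b3: byte;
begin
  { The first character carries the number of data bytes on the line. }
  line := uuchar(length(data));
  i := 1;
  while i <= length(data) do
    begin
      { Take the next three bytes, padding a short final group with zeros. }
      b1 := ord(data[i]);
      if i + 1 <= length(data) then b2 := ord(data[i + 1]) else b2 := 0;
      if i + 2 <= length(data) then b3 := ord(data[i + 2]) else b3 := 0;
      { Split the 24 bits into four 6-bit values. }
      line := line + uuchar(b1 shr 2)
                   + uuchar(((b1 and 3) shl 4) or (b2 shr 4))
                   + uuchar(((b2 and 15) shl 2) or (b3 shr 6))
                   + uuchar(b3 and 63);
      i := i + 3;
    end;
  uuencodeline := line;
end;

begin
  writeln('begin 644 cat.txt');
  writeln(uuencodeline('Cat'));   { the three bytes 'Cat' encode as #0V%T }
  writeln('`');                   { zero-length line marks the end of data }
  writeln('end');
end.
```

Decoding reverses the steps: subtract 32 from each character, mask to six bits so that a backquote and a space both give 0, reassemble three bytes from every four values, and keep only as many bytes as the count character says.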

Re:a) UUencode format b) binary vs. text files


Quote
>b) Is there a reliable way to determine whether a file is a plain
>text file (so it can be read with TP's "readln," for instance) or
>a binary file (so "blockread" should better be used)? I've read Prof.
>Salmi's FAQs and looked in lots of programming handbooks, other faqs
>etc. but the only suggestion I've found is that a binary file contains
>ASCII characters with codes above 128 - obviously the Wrong Thing if
>a file in question contains, e.g. tables, bullets (ascii #254 etc.)
>or just national characters. Probably the perfect solution would be
>to look at the way DOS stores the file, as long as DOS itself is able
>to distinguish the two types. For now I read the first n bytes from
>the file I want to check (usually n=20) and look for character #0.
>It seems that _most_ exe, com, and data files have it, but this method
>is purely heuristic and might not always work.

     It is very easy for us humans to tell if a file is text or binary: simply
view it with a utility that assumes it is text and puts some symbol in for
non-text characters.  We'd call it "text" if we could read it (i.e. if it were
limited to the printable ASCII character set, and the characters formed
sensible groups, separated by spaces and probably by <CR><LF> into words and
lines).

     If the question is "Should I use ReadLn or BlockRead", then you really
need to know (in advance) what the file contains.  Note that the simple
presence of a null character (#0) does NOT mean the file isn't text -- it
simply means that this character should be skipped when read and interpreted
as text.  Also note that if the file IS, in fact, a text file, then the
NewLine (which, depending on your implementation, can be <CR>, <LF>, or
<CR><LF>) is processed separately and differently from other characters (in
point of fact, it should be read as a space).

     If what you are trying to do is write a routine that deals with any and
all file types, then you do not want to assume anything about any of the
characters, including nulls, <CR>, and <LF>.

     I think what I would be tempted to try to answer the question "is this
text?" would be to scan the file and derive several statistics.  One is how
many control characters (codes below "space") are there, excluding <CR>, <LF>,
<tab>, and <FF>; another would be the average and maximum word length
(assuming space as a separator), and the average and maximum line length.  I'd
certainly feel justified in thinking a file is text if there were no odd
control characters, an average/max word length of, say, 5 and 12 characters,
and an average/max line length of 60 and 75 characters.  On the other hand, if
1/8 of the characters were control characters, the average word length were
50, and the average line length 62, I'd say "binary".  Where to draw the line
between these takes some experimentation ...

Bob Schor
Pascal Enthusiast
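
The statistics described above could be gathered along the following lines. This is only a sketch under stated assumptions: lines are taken to end in <LF> or <CR><LF>, and tab and form feed are treated as word separators in addition to space. It prints the numbers and leaves the text/binary verdict, and the thresholds for it, to the reader. The program name textstat and the buffer size are arbitrary choices.

```pascal
program textstat;

const
  bufsize = 4096;

var
  f: file;
  buf: array[1..bufsize] of byte;
  got, i: word;
  b: byte;
  ctrl, words, wordsum, maxword, wordlen: longint;
  lines, linesum, maxline, linelen: longint;

{ Close off the current word, if any, and update the word statistics. }
procedure endword;
begin
  if wordlen > 0 then
    begin
      inc(words);
      wordsum := wordsum + wordlen;
      if wordlen > maxword then maxword := wordlen;
      wordlen := 0;
    end;
end;

{ Close off the current line and update the line statistics. }
procedure endline;
begin
  endword;
  inc(lines);
  linesum := linesum + linelen;
  if linelen > maxline then maxline := linelen;
  linelen := 0;
end;

begin
  if paramcount <> 1 then
    begin
      writeln('usage: textstat filename');
      halt(1);
    end;
  assign(f, paramstr(1));
  {$I-} reset(f, 1); {$I+}
  if ioresult <> 0 then
    begin
      writeln('cannot open ', paramstr(1));
      halt(1);
    end;
  ctrl := 0; words := 0; wordsum := 0; maxword := 0; wordlen := 0;
  lines := 0; linesum := 0; maxline := 0; linelen := 0;
  repeat
    blockread(f, buf, bufsize, got);
    for i := 1 to got do
      begin
        b := buf[i];
        if b = 10 then
          endline                       { LF ends the word and the line }
        else if b <> 13 then            { CR is ignored entirely }
          begin
            inc(linelen);
            if (b = 32) or (b = 9) or (b = 12) then
              endword                   { space, tab, form feed }
            else
              begin
                inc(wordlen);
                if b < 32 then inc(ctrl);   { any other control code }
              end;
          end;
      end;
  until got = 0;
  close(f);
  if linelen > 0 then endline;          { last line without a line end }
  writeln('odd control characters: ', ctrl);
  if words > 0 then
    writeln('word length avg/max: ', wordsum div words, '/', maxword);
  if lines > 0 then
    writeln('line length avg/max: ', linesum div lines, '/', maxline);
end.
```

Running it over a few known text files and a few known executables is one way to do the experimentation mentioned above.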

Re:a) UUencode format b) binary vs. text files


In message <Pine.SCO.3.91.951224191950.1781A-100...@interkom.magnum.lodz.pl>
        Marek Jedlinski <marek...@interkom.magnum.lodz.pl> writes:

Quote
> Hello,
> Could a kind soul please tell me...
> a) where I can find a source code for UU<en|de>code algorithm?
> I've tried an archie search and looked through SimTel and Garbo,
> but all I've found are executable en/decoders. I don't actually
> need a full source code: just the specification for the uuencode
> format would do, so I could write my own routines.

Hello, I have one.  I'll mail it to you.

--
    --==**RG**==----==**RG**==----==**RG**==----==**RG==----==**RG**==--
    irc: /nick _Nitro............................................#coders
    www: ...........................http://www.zetnet.co.uk/users/rgriff

    --==**RG**==----==**RG**==----==**RG**==----==**RG==----==**RG**==--

Re:a) UUencode format b) binary vs. text files


Quote
bsc...@vms.cis.pitt.edu (Bob Schor) wrote:
>     I think what I would be tempted to try to answer the question "is this
>text?"

I answered this one with the idea of using readln in an {$I-} state to
check for it messing up.  Unfortunately, the TP programmer's reference
backs me up on this one, but READLN doesn't change the IOResult,
and it merrily does a readln on a binary file (even though the
reference says ONLY USE ON TEXT FILES...)  An interesting problem.  It
SHOULD work this way, though to tell whether we got a text file or
not, by my thinking...

If you are wondering, my little test program, which implements this
idea..

program test;
   var
       thefile: text;
       thestr: string;
    begin
        assign(thefile, 'PASCAL6.EXE');
        reset(thefile);
        {$I-}readln(thefile, thestr);{$I+}
        if IOResult <> 0 then
            writeln('This is not a text file.')
        else
            writeln('This is a text file.');
         close(thefile);
     end.

Readln is only supposed to be used on text files, and the TP
programmers reference says that READLN will change IOResult
depending upon proper result...Maybe they forgot this usage
problem?  Bug?  I'm beginning to wonder...Any ideas, people?

Re:a) UUencode format b) binary vs. text files


Quote
On Wed, 27 Dec 1995 17:10:49 GMT, ggr...@2sprint.net (Valtar) wrote:
>bsc...@vms.cis.pitt.edu (Bob Schor) wrote:

>>     I think what I would be tempted to try to answer the question "is this
>>text?"

>I answered this one with the idea of using readln in an {$I-} state to
>check for it messing up.  Unfortunately, the TP programmer's reference
>backs me up on this one, but READLN doesn't change the IOResult,
>and it merrily does a readln on a binary file (even though the
>reference says ONLY USE ON TEXT FILES...)  An interesting problem.  It
>SHOULD work this way, though to tell whether we got a text file or
>not, by my thinking...

>If you are wondering, my little test program, which implements this
>idea..

>program test;
>   var
>       thefile: text;
>       thestr: string;
>    begin
>        assign(thefile, 'PASCAL6.EXE');
>        reset(thefile);
>        {$I-}readln(thefile, thestr);{$I+}
>        if IOResult <> 0 then
>            writeln('This is not a text file.')
>        else
>            writeln('This is a text file.');
>         close(thefile);
>     end.

>Readln is only supposed to be used on text files, and the TP
>programmers reference says that READLN will change IOResult
>depending upon proper result...Maybe they forgot this usage
>problem?  Bug?  I'm beginning to wonder...Any ideas, people?

You are misinterpreting what they mean when they say that READLN is only to be
used with text files. What they mean is that you should use it only with files
that you have declared as text files.

You can declare any file as a text file. There is absolutely no way for the
program to tell if the file that you opened is actually a "text" file containing
only ASCII text. Indeed, a text file can have any data in it, but the "lines"
should be terminated with a CR/LF. Of course a line can be any length...

Don't over-interpret the manuals. They mean exactly what they say. Check to see
what they actually mean by something like "text file" before you question the
results.

Richard S. (dick) MacDonald - AD0J

"If all economists were laid end to end,
   they would not reach a conclusion."  
  --- George Bernard Shaw              

Re:a) UUencode format b) binary vs. text files


Quote
On Wed, 27 Dec 1995 17:10:49 GMT, ggr...@2sprint.net (Valtar) wrote:
>program test;
>   var
>       thefile: text;
>       thestr: string;
>    begin
>        assign(thefile, 'PASCAL6.EXE');
>        reset(thefile);
>        {$I-}readln(thefile, thestr);{$I+}
>        if IOResult <> 0 then
>            writeln('This is not a text file.')
>        else
>            writeln('This is a text file.');
>         close(thefile);
>     end.
>Readln is only supposed to be used on text files, and the TP
>programmers reference says that READLN will change IOResult
>depending upon proper result...Maybe they forgot this usage
>problem?  Bug?  I'm beginning to wonder...Any ideas, people?

Look up the difference between 'supposed to use on' and 'only works
on'.

DOS stores files as byte streams - it doesn't care what the byte
values are.

ReadLn reads from the last read byte + 1 to the next EOL.  If the next
EOL is 500 bytes away and you are reading into a string variable, only
the first 255 characters (the string's maximum length) are stored; the
rest of the line is skipped without any error.

IOResult is the error code returned by DOS on the read attempt.  Since
DOS doesn't care what or how many, there's no error, UNLESS you're
looking for the next EOL and fall off the end of the file.

Deciding whether an unknown file is text or not requires 2 things:
1 - Your definition of text - is a Word file text?
2 - Lots and lots of code.

Good luck.

--
Al

Re:a) UUencode format b) binary vs. text files


In article: <bschor.315.19496...@vms.cis.pitt.edu>  bsc...@vms.cis.pitt.edu (Bob
Schor) writes:

Quote

> >b) Is there a reliable way to determine whether a file is a plain
> >text file (so it can be read with TP's "readln," for instance) or
> >a binary file (so "blockread" should better be used)? I've read Prof.

[snip]

Quote

>      I think what I would be tempted to try to answer the question "is this
> text?" would be to scan the file and derive several statistics.  One is how
> many control characters (codes below "space") are there, excluding <CR>, <LF>,
> <tab>, and <FF>; another would be the average and maximum word length
> (assuming space as a separator), and the average and maximum line length.  I'd
> certainly feel justified in thinking a file is text if there were no odd
> control characters, an average/max word length of, say, 5 and 12 characters,
> and an average/max line length of 60 and 75 characters.  On the other hand, if
> 1/8 of the characters were control characters, the average word length were
> 50, and the average line length 62, I'd say "binary".  Where to draw the line
> between these takes some experimentation ...

> Bob Schor
> Pascal Enthusiast

The other thing to consider is if there is any place that the gap between 'End
of line' markers (<CR>,<LF> or <CR><LF>) is greater than 255 chars as that is
the maximum line size allowed in BP's readln, thus meaning that if such a file
was read using readln, you would lose the end of at least one line.

--
  Alan Wood             I have mastered the programming,
 ~~~~~~~~~~~            now to get the programs working.

Re:a) UUencode format b) binary vs. text files


Quote
On Thu, 04 Jan 1996 13:36:13 GMT, Alan Wood <A...@abfl.co.uk> wrote:
>The other thing to consider is if there is any place that the gap between 'End
>of line' markers (<CR>,<LF> or <CR><LF>) is greater than 255 chars as that is
>the maximum line size allowed in BP's readln, thus meaning that if such a file
>was read using readln, you would lose the end of at least one line.

I don't think that's true.  The little program below writes out 5
numbers to a 500 character line, and then reads them back in with no
problem in a single ReadLn.  It's written in BP 7.

Maybe you're thinking of the string length limitation?

Duncan Murdoch

Here's the program:

var                                      
  f : text;                              
  i,j,k,l,m : integer;                  
begin                                    
  assign(f,'test.txt');                  
  rewrite(f);                            
  for i:=1 to 5 do                      
    write(f,i:100);                      
  writeln(f);                            
  close(f);                              
  reset(f);                              
  readln(f,i,j,k,l,m);                  
  writeln('m=',m);                      
  close(f);                              
end.                                    
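
The string length limitation can be seen with a small variation on the same idea. This is a sketch, not a claim about every compiler: it assumes a Turbo/Borland Pascal style short string, declared explicitly as string[255], and the file name longline.txt is made up. Under that assumption the first ReadLn should store only 255 of the 500 characters and discard the rest of the line, so the second ReadLn starts cleanly on the next line.

```pascal
program strlimit;

var
  f: text;
  s: string[255];   { explicit short string, the Turbo Pascal maximum }
  i: integer;
begin
  { Write one 500-character line followed by a short second line. }
  assign(f, 'longline.txt');
  rewrite(f);
  for i := 1 to 500 do
    write(f, 'x');
  writeln(f);
  writeln(f, 'second');
  close(f);

  { Read both lines back into the short string. }
  reset(f);
  readln(f, s);
  writeln('first readln stored ', length(s), ' characters');
  readln(f, s);
  writeln('next readln stored: ', s);
  close(f);
end.
```

No I/O error is raised for the truncated line, which is why IOResult is of no help in spotting it.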

Re:a) UUencode format b) binary vs. text files


Quote
Marek Jedlinski <marek...@interkom.magnum.lodz.pl> writes:
> b) Is there a reliable way to determine whether a file is a plain
> text file (so it can be read with TP's "readln," for instance) or
> a binary file (so "blockread" should better be used)?

yes.  read the whole file with blockread until you find an indication
that the file was not text (or eof).  if you don't find such an
indication, then the file was text.

Quote
> etc. but the only suggestion I've found is that a binary file contains
> ASCII characters with codes above 128 - obviously the Wrong Thing if
> a file in question contains, e.g. tables, bullets (ascii #254 etc.)
> or just national characters.

why are these the Wrong Thing?

I don't test for `text' -- I test for `ASCII'.

here's the code I use, including an older slower version:

function isasciifile(fn: string): boolean;

{ Returns true if the first checkedsize bytes of the file are all
  tab, LF, CR, or printable 7-bit ASCII (32..126). A file that cannot
  be opened is reported as not ASCII. Define veryslowisasciifile to
  compile the older byte-at-a-time version instead of the blockread one. }

const
  checkedsize=1024;

var
  isascii: boolean;
{$ifdef veryslowisasciifile}
  inf: file of byte;
  stillsearching: boolean;
{$else}
  inf: file;
  buffer: array[1..checkedsize] of byte;
  numread: word;
{$endif}
  whichbyte: integer;
  onebyte: byte;

begin
  isascii := true;

  assign(inf,fn);
{$I-}
{$ifdef veryslowisasciifile}
  reset(inf);
{$else}
  reset(inf,1);
{$endif}
{$I+}

  if ioresult<>0 then
    isascii := false
  else
    begin
{$ifdef veryslowisasciifile}
      { older version: read one byte at a time }
      stillsearching := true;

      for whichbyte := 1 to checkedsize do
        if stillsearching then
          begin
            if eof(inf) then
              stillsearching := false
            else
              begin
                read(inf,onebyte);
                if not
                (
                 (onebyte=9)
                or
                 (onebyte=10)
                or
                 (onebyte=13)
                or
                 ( (onebyte>=32) and (onebyte<=126) )
                )
                  then
                    begin
                      isascii := false;
                      stillsearching := false;
                    end;
              end;
          end;
      close(inf);
{$else}
      { current version: read the block in one call, then scan it }
      blockread(inf,buffer,checkedsize,numread);
      close(inf);

      for whichbyte := 1 to numread do
        if isascii then
          begin
            onebyte := buffer[whichbyte];
            if not
            (
             (onebyte=9)
            or
             (onebyte=10)
            or
             (onebyte=13)
            or
             ( (onebyte>=32) and (onebyte<=126) )
            )
              then
                isascii := false;
          end;
{$endif}

    end;

  isasciifile := isascii;
end;
--
Russell_Sch...@locutus.ofB.ORG  Shad 86c
