Bob Sch
Delphi Developer |
Wed, 18 Jun 1902 08:00:00 GMT
Re: a) UUencode format b) binary vs. text files
Quote> b) Is there a reliable way to determine whether a file is a plain text file (so it can be read with TP's "readln," for instance) or a binary file (so "blockread" should better be used)? I've read Prof. Salmi's FAQs and looked in lots of programming handbooks, other faqs etc. but the only suggestion I've found is that a binary file contains ASCII characters with codes above 128 - obviously the Wrong Thing if a file in question contains, e.g. tables, bullets (ascii #254 etc.) or just national characters. Probably the perfect solution would be to look at the way DOS stores the file, as long as DOS itself is able to distinguish the two types. For now I read the first n bytes from the file I want to check (usually n=20) and look for character #0. It seems that _most_ exe, com, and data files have it, but this method is purely heuristic and might not always work.
It is very easy for us humans to tell if a file is text or binary: simply view it with a utility that assumes it is text and substitutes some symbol for non-text characters. We'd call it "text" if we could read it (i.e. if it were limited to the printable ASCII character set, and the characters formed sensible groups, separated by spaces and probably by <CR><LF>, into words and lines).

If the question is "Should I use ReadLn or BlockRead?", then you really need to know (in advance) what the file contains. Note that the simple presence of a null character (#0) does NOT mean the file isn't text -- it simply means that this character should be skipped when the file is read and interpreted as text. Also note that if the file IS, in fact, a text file, then the NewLine (which, depending on your implementation, can be <CR>, <LF>, or <CR><LF>) is processed separately and differently from other characters (in point of fact, it should be read as a space).

If what you are trying to do is write a routine that deals with any and all file types, then you do not want to assume anything about any of the characters, including nulls, <CR>, and <LF>.

To answer the question "is this text?", I would be tempted to scan the file and derive several statistics. One is how many control characters (codes below "space") there are, excluding <CR>, <LF>, <tab>, and <FF>; others would be the average and maximum word length (assuming space as a separator), and the average and maximum line length. I'd certainly feel justified in thinking a file is text if there were no odd control characters, an average/max word length of, say, 5 and 12 characters, and an average/max line length of 60 and 75 characters. On the other hand, if 1/8 of the characters were control characters, the average word length were 50, and the average line length 62, I'd say "binary". Where to draw the line between these takes some experimentation ...

Bob Schor
Pascal Enthusiast
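The statistics suggested above could be gathered roughly as follows. This is only a sketch, written in Python rather than Turbo Pascal for brevity; the names text_stats and looks_like_text and all of the threshold values are assumptions made for illustration, not anything from the post, and the cut-offs would need the experimentation mentioned above.

```python
# Sketch only: the thresholds below are illustrative guesses, not values
# taken from the post; tune them by experiment on your own files.
ALLOWED_CONTROLS = {9, 10, 12, 13}  # <tab>, <LF>, <FF>, <CR>


def average(lengths):
    """Mean of a list of lengths, or 0.0 for an empty list."""
    return sum(lengths) / len(lengths) if lengths else 0.0


def text_stats(data: bytes) -> dict:
    """Collect control-character, word-length, and line-length statistics."""
    # Count control characters (codes below space), excluding the usual
    # text-file ones: <tab>, <LF>, <FF>, <CR>.
    odd_controls = sum(1 for b in data if b < 32 and b not in ALLOWED_CONTROLS)

    # Treat <CR><LF>, a lone <CR>, and a lone <LF> all as line breaks.
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    lines = normalized.split(b"\n")
    if lines[-1] == b"":
        lines.pop()  # a trailing line break does not start another line

    # Words are separated by spaces (and by line breaks).
    word_lengths = [len(w) for line in lines for w in line.split(b" ") if w]
    line_lengths = [len(line) for line in lines]

    return {
        "control_fraction": odd_controls / len(data) if data else 0.0,
        "avg_word": average(word_lengths),
        "max_word": max(word_lengths, default=0),
        "avg_line": average(line_lengths),
        "max_line": max(line_lengths, default=0),
    }


def looks_like_text(data: bytes,
                    max_control_fraction: float = 0.01,  # hypothetical cut-off
                    max_avg_word: float = 15.0,          # hypothetical cut-off
                    max_avg_line: float = 100.0) -> bool:  # hypothetical cut-off
    """Guess 'text' when all three statistics stay under their cut-offs."""
    stats = text_stats(data)
    return (stats["control_fraction"] <= max_control_fraction
            and stats["avg_word"] <= max_avg_word
            and stats["avg_line"] <= max_avg_line)
```

In practice one would open the file in binary mode, read the first few kilobytes, and pass that block to looks_like_text; the answer is still a guess, just a better-informed one than testing for #0 alone.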