Revised March 28. Use your browser's Reload or Refresh button to get the latest version.
The assignment was: write an awk program to check file type. Here is the detailed problem statement.
Here is my solution, which I handed out in class:
# afile - file type detection program in Awk

# Check each of these patterns on every input line
# Use variables to remember whether we have seen each pattern anywhere in file
# Awk variables are initialized to FALSE (0 or ""), any other values mean TRUE

# HTML and Scheme use positive logic
# Make identification if pattern is found anywhere in file

/<html>|<HTML>/ { html = 1 }    # Found HTML file. Any nonzero value means TRUE
/\(define \(/   { scheme = 1 }  # Found Scheme file. Use \ to escape (

# Numeric and Text require negative logic
# Disqualify if forbidden character is found anywhere in file
# Use ^ complement operator in character class

/[^0-9$%*./^= ]/ { not_numeric = 1 }  # Found non-numeric character
/[^\000-\177]/   { not_text = 1 }     # Found non-ASCII byte

# When we reach the end of the file, more than one variable might be TRUE.
# Use if-else (similar to Scheme cond) to establish precedence

END {
    if (html) print "HTML"
    else if (scheme) print "Scheme"
    else if (!not_numeric) print "Numeric"
    else if (!not_text) print "Text"
    else print "Other"
}
I awarded up to four points for each solution, one point each for:
Many solutions were quite different from mine. Here are some interesting approaches.
Many solutions read the entire file into one big string and then, in the END section, check that string for each pattern. With this method, it is not necessary to use variables to remember what was seen on each line. It looks like this:
# contents variable stores entire file contents in one string
# This rule executed for each line, $0 is entire line
# The awk string concatenation operator is juxtaposition

{ contents = contents $0 " " }

# At END, check contents in precedence order

END {
    if (contents ~ /<html>|<HTML>/) { print "HTML" }
    else ...
    ...
}
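For anyone who wants to try the whole-file approach, here is a minimal sketch with every branch written out, reusing the patterns from my solution. It is wrapped in a shell function, classify_whole (a name invented for this example), so it can be pasted into a shell. Two details are assumptions of the sketch, not part of any handed-in solution: the Numeric pattern is kept in a string so the / in it needs no escaping, and the Text range starts at \001 because some awk implementations do not handle NUL characters.

```sh
# classify_whole: hypothetical helper, reads a file on stdin and prints its type
classify_whole() {
    awk '
    # Append every input line to one big string, separated by a space
    { contents = contents $0 " " }

    # At END, test the whole string in precedence order
    END {
        # Assumption: pattern held in a string so the / needs no escaping
        not_numeric = "[^0-9$%*./^= ]"

        if (contents ~ /<html>|<HTML>/)          print "HTML"
        else if (contents ~ /\(define \(/)       print "Scheme"
        else if (contents !~ not_numeric)        print "Numeric"
        else if (contents !~ /[^\001-\177]/)     print "Text"
        else                                     print "Other"
    }'
}
```

Try it with something like `classify_whole < README.html`.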
Some solutions used alternate (but equally effective) ways of expressing the logic needed to check for the numeric and text types.
/[^0-9]/ { print "This line contains characters that are not digits" }
# $0 is the entire input line, !~ is the not-match operator
$0 !~ /[^0-9]/ { print "This line is all digits" }
Some solutions used exit where the file type could be classified immediately.
/<html>|<HTML>/ { print "HTML"; exit }
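One caveat when exit is combined with an END section: in awk, exit inside a pattern-action rule stops reading input but still runs the END action, so END must know that a type was already printed. The sketch below shows one way to handle that. The function name classify_early and the variable decided are invented for this example, and only HTML exits early because it has the highest precedence.

```sh
# classify_early: hypothetical helper, reads a file on stdin
classify_early() {
    awk '
    # HTML has the highest precedence, so it is safe to stop at the first match
    /<html>|<HTML>/ { print "HTML"; decided = 1; exit }

    # Scheme cannot stop early: an HTML tag might still appear on a later line
    /\(define \(/   { scheme = 1 }

    END {
        if (decided) exit    # exit still runs END, so skip the rest here
        if (scheme) print "Scheme"
        else print "neither HTML nor Scheme"
    }'
}
```

Without the decided test, an HTML file that also contains a Scheme definition before the tag would print two types.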
This section describes some of the errors people made.
A file is not numeric (or text) if it contains even one non-numeric (or non-text) character. That character might appear at the end of the last line in the file, so you have to check the entire file. Any solution that exits or prints Numeric or Text before reading the entire file must be wrong.
The numeric type was allowed to contain digits, spaces, and a specific list of special characters including $, %, * etc. It was not sufficient to just check that the characters were not alphabetic: not all non-alphabetic characters were permitted.
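Both of these mistakes show up on small test inputs. The sketch below contrasts a check that only rejects letters with one that rejects every character outside the permitted set and waits until END to decide. The function names naive_numeric and is_numeric are invented for this example, and the permitted set is copied from my solution.

```sh
# naive_numeric: hypothetical helper that only rejects alphabetic characters
naive_numeric() {
    awk '
    /[A-Za-z]/ { has_alpha = 1 }
    END { print (has_alpha ? "not Numeric" : "Numeric") }'
}

# is_numeric: hypothetical helper that rejects anything outside the permitted set
# The pattern is passed in as a string so the / in it needs no escaping
is_numeric() {
    awk -v bad='[^0-9$%*./^= ]' '
    $0 ~ bad { not_numeric = 1 }    # remember the violation, decide only at END
    END { print (not_numeric ? "not Numeric" : "Numeric") }'
}

printf '12 # 3\n'         | naive_numeric   # accepts the forbidden #
printf '12 # 3\n'         | is_numeric      # rejects it
printf '1 2 3\n4 5 6x\n'  | is_numeric      # catches the x at the very end of the file
```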
This does not work as intended:
if (afile == "HTML" || "Scheme" || "Numeric" || "Text") print afile
else print "Other"
In awk, any nonzero number or nonempty string means true, so "Scheme" and the other string constants always evaluate to true, and the else branch can never be reached. This is the correct way to express what was probably intended. Each operand of the || (or) operator must be a complete test:
if (afile == "HTML" || afile == "Scheme" || afile == "Numeric" || afile == "Text") ...
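The difference is easy to see by running both tests on a value that should fall through to Other. This is only a demonstration: the function name demo_or, the value "Junk", and the variables buggy and correct are invented for the example.

```sh
# demo_or: hypothetical helper, needs no input because everything runs in BEGIN
demo_or() {
    awk 'BEGIN {
        afile = "Junk"

        # Buggy: "Scheme" is a nonempty string, so the condition is always true
        if (afile == "HTML" || "Scheme" || "Numeric" || "Text") buggy = "accepted"
        else buggy = "Other"

        # Correct: every operand of || is a complete comparison
        if (afile == "HTML" || afile == "Scheme" || afile == "Numeric" || afile == "Text") correct = "accepted"
        else correct = "Other"

        print "buggy test:", buggy
        print "correct test:", correct
    }'
}

demo_or
```

The buggy test reports accepted and the correct test reports Other.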
The last else branch in a cascaded if ... else if ... else ... should be the default action that is executed when all the previous if tests are false. There should not be another if after the last else.
In this case the last if, which tests for Text, is redundant because the Other branch has already ruled out any character outside that range:
if (input ~ /[^\000-\176]/) { print "Other" }
else ...
else if (input ~ /[\000-\176]/) { print "Text" }
The previous example is not incorrect, but this would be sufficient:
if (input ~ /[^\000-\176]/) { print "Other" }
else ...
else { print "Text" }
These are not errors but I found them hard to understand.
Several solutions contained this rule:
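if (input !~ /[A-Za-z]/ && /[0-9]/) { print "Numeric" }
It prints Numeric when input contains no alphabetic characters and the current input line $0 contains a digit. The match operators ~ and !~ bind tighter than &&, so the condition groups as (input !~ /[A-Za-z]/) && /[0-9]/, and the bare pattern /[0-9]/ is matched against $0. The table on p. 46 of the Awk book lists the operators in order of increasing precedence, which makes it easy to read the wrong way round.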
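A small experiment settles how awk groups this condition. The sketch below evaluates the condition as written, then with !~ grouped first, then with && grouped first, and prints the three results side by side. The function name demo_parse and the variables as_written, match_first, and and_first are invented for this example; input is passed on the command line and $0 comes from standard input.

```sh
# demo_parse: hypothetical helper; $1 becomes the awk variable input, stdin supplies $0
demo_parse() {
    awk -v input="$1" '{
        as_written  = (input !~ /[A-Za-z]/ && /[0-9]/)
        match_first = ((input !~ /[A-Za-z]/) && ($0 ~ /[0-9]/))
        and_first   = (input !~ (/[A-Za-z]/ && /[0-9]/))
        print as_written, match_first, and_first
    }'
}

echo 'no digits' | demo_parse '12 34'    # first two columns agree, third differs
echo '42'        | demo_parse '12 # 34'  # note that the # is accepted
```

The first two columns always agree, which shows that !~ is applied before &&.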
Another solution used a pattern in an if:
if (/<html>/ || /<HTML>/) { x = "HTML"; exit }
I would have thought a match operator ~ was needed, but a pattern used by itself as an expression is matched against the whole input line $0 implicitly.
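The implicit match can be checked by evaluating the bare form and the explicit $0 ~ form on the same lines. The function name demo_bare and the variables bare and explicit are invented for this example.

```sh
# demo_bare: hypothetical helper, prints the two results for each input line
demo_bare() {
    awk '{
        bare     = (/<html>/ || /<HTML>/)
        explicit = ($0 ~ /<html>/ || $0 ~ /<HTML>/)
        print bare, explicit
    }'
}

printf '<HTML>\nplain text\n' | demo_bare
```

Both columns agree on every line: 1 1 for the line with the tag and 0 0 for the other.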
I did not require that test data and sample test runs be handed in, but of course you should have tested your program. Some obviously incorrect solutions would have been exposed by a simple test.
Some solutions came with sample test runs on simple one-line test cases that would not have been sufficient to expose errors.
An adequate set of test runs would include:
We can invoke afile from a shell script to handle a whole directory at once:
#!/bin/sh
# afile-loop: invoke afile for all the files in a directory
# $1 is first command line argument, should be a directory
# afile script must be in working directory when you run this command
# Variables are quoted so that file names containing spaces are handled

for f in "$1"/*
do
    if [ -d "$f" ]    # test if $f is a directory - Awk chokes on directories
    then
        echo "$(basename "$f") is a directory"
    else
        echo "$(basename "$f") is $(awk -f afile < "$f")"
    fi
done
Here it is in action:
$ ./afile-loop /usr/java/jdk1.3
COPYRIGHT is Text
LICENSE is Text
README is Other
README.html is HTML
bin is a directory
demo is a directory
.. etc. ...
man is a directory
src.jar is Other