
18. File Input/Output

S-Lang provides built-in support for two different I/O facilities. The simpler interface is modeled upon the C language stdio interface and consists of functions such as fopen, fgets, etc. The other is modeled on the lower-level POSIX interface and consists of functions such as open, read, etc. In addition to permitting more control, the lower-level interface permits one to access network objects as well as disk files.

When reading data formatted in text files, e.g., columns of numbers, do not overlook the high-level routines in the slsh library. In particular, the readascii function is quite flexible and can read data from text files that are formatted in a variety of ways. For data stored in a standard binary format such as HDF or FITS, the corresponding modules should be used.

18.1 Input/Output via stdio

Stdio Overview

The stdio interface consists of functions such as fopen, fclose, fgets, fread, fread_bytes, and fprintf; the reference manual describes the complete set.

In addition, the interface supports the popen and pclose functions on systems where the corresponding C functions are available.

Before reading or writing to a file, it must first be opened using the fopen function. The only exceptions to this rule involve use of the pre-opened streams: stdin, stdout, and stderr. fopen accepts two arguments: a file name and a string argument that indicates how the file is to be opened, e.g., for reading, writing, update, etc. It returns a File_Type stream object that is used as an argument to all other functions of the stdio interface. Upon failure, it returns NULL. See the reference manual for more information about fopen.
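As a minimal sketch of this open, check, and use pattern applied to writing rather than reading, the following function writes an array of strings to a file, one string per line. The name write_lines and its arguments are illustrative and not part of the interface. The sketch assumes the fputs and fclose return conventions (-1 upon failure) and the WriteError exception described in the reference manual.

```slang
% Hypothetical helper: write each element of the string array `lines'
% to `file', appending a newline to each one.
define write_lines (file, lines)
{
   variable fp = fopen (file, "w");    % "w" creates or truncates the file
   if (fp == NULL)
     throw OpenError, "Unable to open $file for writing"$;

   variable line;
   foreach line (lines)
     {
        % fputs is assumed to return -1 upon failure
        if (-1 == fputs (line + "\n", fp))
          throw WriteError, "Write to $file failed"$;
     }

   % Closing explicitly allows a failure at close time to be reported
   if (-1 == fclose (fp))
     throw WriteError, "Failed to close $file"$;
}
```

It might be called as write_lines ("/tmp/example.txt", ["first", "second"]), where the path is only a placeholder.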

Stdio Examples

In this section, some simple examples of the use of the stdio interface are presented. It is important to realize that all the functions of the interface return something, and that return value must be handled in some way by the caller.

The first example involves writing a function to count the number of lines in a text file. To do this, we shall read in the lines, one by one, and count them:

    define count_lines_in_file (file)
    {
       variable fp, line, count;

       fp = fopen (file, "r");    % Open the file for reading
       if (fp == NULL)
         throw OpenError, "$file failed to open"$;

       count = 0;
       while (-1 != fgets (&line, fp))
         count++;

       () = fclose (fp);
       return count;
    }
Note that &line was passed to the fgets function. When fgets returns, line will contain the line of text read in from the file. Also note how the return value from fclose was handled (discarded in this case).

Although the preceding example closed the file via fclose, there is no need to explicitly close a file because the interpreter will automatically close a file when it is no longer referenced. Since the only variable to reference the file is fp, it would have automatically been closed when the function returned.

Suppose that it is desired to count the number of characters in the file instead of the number of lines. To do this, the while loop could be modified to count the characters as follows:

      while (-1 != fgets (&line, fp))
        count += strlen (line);
The main difficulty with this approach is that it will not work for binary files, i.e., files that contain null characters. For such files, the file should be opened in binary mode via
      fp = fopen (file, "rb");
and then the data read using the fread function:
      while (-1 != fread (&line, Char_Type, 1024, fp))
           count += length (line);
The fread function requires two additional arguments: the type of object to read (Char_Type in this case), and the number of such objects to be read. The function returns the number of objects actually read, or -1 upon failure. The objects themselves are assigned to the referenced variable (line in this example) as an array of the specified type.

Sometimes it is more convenient to obtain the data from a file in the form of a character string instead of an array of characters. The fread_bytes function may be used in such situations. Using this function, the equivalent of the above loop is

      while (-1 != fread_bytes (&line, 1024, fp))
           count += bstrlen (line);

The foreach construct also works with File_Type objects. For example, the number of characters in a file may be counted via

     foreach ch (fp) using ("char")
       count++;
Similarly, one can count both the number of lines and the number of characters using:
     foreach line (fp) using ("line")
      {
         num_lines++;
         count += strlen (line);
      }
Often one is not interested in trailing whitespace in the lines of a file. To have trailing whitespace automatically stripped from the lines as they are read in, use the "wsline" form, e.g.,
     foreach line (fp) using ("wsline")
      {
          .
          .
      }
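As a sketch of how the "wsline" form might be used in practice, the following function counts the lines of a file that contain something other than whitespace. The name count_nonblank_lines is illustrative only. The sketch relies on the behavior described above: a line consisting solely of whitespace is reduced to an empty string once its trailing whitespace has been stripped.

```slang
% Hypothetical helper: count the lines of `file' that are not blank.
define count_nonblank_lines (file)
{
   variable fp = fopen (file, "r");
   if (fp == NULL)
     throw OpenError, "Unable to open $file"$;

   variable line, count = 0;
   foreach line (fp) using ("wsline")
     {
        % After stripping, a blank line has zero length
        if (strlen (line))
          count++;
     }

   % fp is closed automatically when it goes out of scope
   return count;
}
```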

Finally, it should be mentioned that none of these examples should be used to count the number of bytes in a file when that information is more readily accessible by another means. For example, it is preferable to get this information via the stat_file function:

     define count_chars_in_file (file)
     {
        variable st;

        st = stat_file (file);
        if (st == NULL)
          throw IOError, "stat_file failed";
        return st.st_size;
     }

18.2 POSIX I/O

18.3 Advanced I/O techniques

The previous examples illustrate how to read and write objects of a single data-type from a file, e.g.,

      num = fread (&a, Double_Type, 20, fp);
would result in a Double_Type[num] array being assigned to a if successful. However, suppose that the binary data file consists of numbers in a specified byte-order. How can one read such objects with the proper byte swapping? The answer is to use the fread_bytes function to read the objects as a (binary) character string and then unpack the resulting string into the specified data type, or types. This process is facilitated using the pack and unpack functions.

The pack function follows the syntax

     BString_Type pack (format-string, item-list);
and combines the objects in the item-list according to format-string into a binary string and returns the result. Likewise, the unpack function may be used to convert a binary string into separate data objects:
     (variable-list) = unpack (format-string, binary-string);

The format string consists of one or more data-type specification characters, and each may be followed by an optional decimal length specifier. Specifically, the data-types are specified according to the following table:

     c     char
     C     unsigned char
     h     short
     H     unsigned short
     i     int
     I     unsigned int
     l     long
     L     unsigned long
     j     16 bit int
     J     16 bit unsigned int
     k     32 bit int
     K     32 bit unsigned int
     f     float
     d     double
     F     32 bit float
     D     64 bit float
     s     character string, null padded
     S     character string, space padded
     z     character string, null padded
     x     a null pad character
A decimal length specifier may follow the data-type specifier. With the exception of the s and S specifiers, the length specifier indicates how many objects of that data type are to be packed or unpacked from the string. When used with the s or S specifiers, it indicates the field width to be used. If the length specifier is not present, the length defaults to one.

With the exception of c, C, s, S, z, and x, each of these may be prefixed by a character that indicates the byte-order of the object:

     >    big-endian order (network order)
     <    little-endian order
     =    native byte-order
The default is to use the native byte order.

Here are a few examples that should make this more clear:

     a = pack ("cc", 'A', 'B');         % ==> a = "AB";
     a = pack ("c2", 'A', 'B');         % ==> a = "AB";
     a = pack ("xxcxxc", 'A', 'B');     % ==> a = "\0\0A\0\0B";
     a = pack ("h2", 'A', 'B');         % ==> a = "\0A\0B" or "A\0B\0"
     a = pack (">h2", 'A', 'B');        % ==> a = "\0A\0B"
     a = pack ("<h2", 'A', 'B');        % ==> a = "A\0B\0"
     a = pack ("s4", "AB");             % ==> a = "AB\0\0"
     a = pack ("s4s2", "AB", "CD");     % ==> a = "AB\0\0CD"
     a = pack ("S4", "AB");             % ==> a = "AB  "
     a = pack ("S4S2", "AB", "CD");     % ==> a = "AB  CD"

When unpacking, if the length specifier is greater than one, then an array of that length will be returned. In addition, trailing whitespace and null characters are stripped when unpacking an object given by the S specifier. Here are a few examples:

    (x,y) = unpack ("cc", "AB");         % ==> x = 'A', y = 'B'
    x = unpack ("c2", "AB");             % ==> x = ['A', 'B']
    x = unpack ("x<H", "\0\xAB\xCD");    % ==> x = 0xCDABuh
    x = unpack ("xxs4", "a b c\0d e f");  % ==> x = "b c\0"
    x = unpack ("xxS4", "a b c\0d e f");  % ==> x = "b c"
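To tie the byte-order prefixes and the length specifiers together, here is a small round-trip sketch. The record layout (a 16 bit id, a 32 bit count, and an 8 character name) is invented for illustration and does not correspond to any particular file format. Note that the byte-order prefix is repeated for each numeric specifier.

```slang
% Hypothetical record: 16 bit id, 32 bit count, 8 character name,
% with the numeric fields stored in big-endian (network) order.
variable fmt = ">J >K S8";             % 2 + 4 + 8 = 14 bytes

variable rec = pack (fmt, 0x0102, 1000, "abc");

% Unpacking with the same format recovers the original values.  The
% name comes back without the space padding that pack added.
variable id, count, name;
(id, count, name) = unpack (fmt, rec);
```

Because the format fixes both the field widths and the byte order, the packed string should be the same on any host, which is what makes this approach suitable for portable binary files.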

Example: Reading /var/log/wtmp

Consider the task of reading the Unix system file /var/log/wtmp, which contains login records about who logged onto the system. This file format is documented in section 5 of the online Unix man pages, and consists of a sequence of entries formatted according to the C structure utmp defined in the utmp.h C header file. The actual details of the structure may vary from one version of Unix to another. For the purposes of this example, consider its definition under the Linux operating system running on an Intel 32 bit processor:

    struct utmp {
       short ut_type;              /* type of login */
       pid_t ut_pid;               /* pid of process */
       char ut_line[12];           /* device name of tty - "/dev/" */
       char ut_id[2];              /* init id or abbrev. ttyname */
       time_t ut_time;             /* login time */
       char ut_user[8];            /* user name */
       char ut_host[16];           /* host name for remote login */
       long ut_addr;               /* IP addr of remote host */
    };
On this system, pid_t is defined to be an int and time_t is a long. Hence, a format specifier for the pack and unpack functions is easily constructed to be:
     "h i S12 S2 l S8 S16 l"
However, this particular definition is naive because it does not allow for structure padding performed by the C compiler in order to align the data types on suitable word boundaries. Fortunately, the intrinsic function pad_pack_format may be used to modify a format by adding the correct amount of padding in the right places. In fact, pad_pack_format applied to the above format on an Intel-based Linux system produces the result:
     "h x2 i S12 S2 x2 l S8 S16 l"
Here we see that 4 bytes of padding were added.

The other missing piece of information is the size of the structure. This is useful because we would like to read the file one structure at a time. Knowing the size of the various data types makes this easy; however, it is even easier to use the sizeof_pack intrinsic function, which returns the size (in bytes) of the structure described by the pack format.
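The native specifiers h, i, and l take their sizes from the host, so the size reported for the format above depends on the machine running the script; on a typical 64 bit Linux system, for instance, a long occupies 8 bytes rather than 4. As a sketch, assuming one wants to describe the 32 bit little-endian layout shown above regardless of the host, the fixed-width specifiers from the table may be used with the padding written out by hand:

```slang
% The 32 bit Intel layout spelled out with fixed-width specifiers:
% j = 16 bit int, k = 32 bit int, x2 = two pad bytes.
variable fixed_format = "<j x2 <k S12 S2 x2 <k S8 S16 <k";

% 2 + 2 + 4 + 12 + 2 + 2 + 4 + 8 + 16 + 4 = 56 bytes
variable fixed_size = sizeof_pack (fixed_format);
```

The names fixed_format and fixed_size are illustrative. pad_pack_format is not applied here because the padding is already explicit in the format.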

So, with all the pieces in place, it is rather straightforward to write the code:

    variable format, size, fp, buf;

    typedef struct
    {
       ut_type, ut_pid, ut_line, ut_id,
       ut_time, ut_user, ut_host, ut_addr
    } UTMP_Type;

    format = pad_pack_format ("h i S12 S2 l S8 S16 l");
    size = sizeof_pack (format);

    define print_utmp (u)
    {

      () = fprintf (stdout, "%-16s %-12s %-16s %s\n",
                    u.ut_user, u.ut_line, u.ut_host, ctime (u.ut_time));
    }

   fp = fopen ("/var/log/wtmp", "rb");
   if (fp == NULL)
     throw OpenError, "Unable to open wtmp file";

   () = fprintf (stdout, "%-16s %-12s %-16s %s\n",
                          "USER", "TTY", "FROM", "LOGIN@");

   variable U = @UTMP_Type;

   % Read each record as a binary string, the form that unpack expects
   while (-1 != fread_bytes (&buf, size, fp))
     {
       set_struct_fields (U, unpack (format, buf));
       print_utmp (U);
     }

   () = fclose (fp);
A few comments about this example are in order. First, a new data type called UTMP_Type was created, although this was not really necessary. The file was opened in binary mode, but this too was optional because on a Unix system there is no distinction between binary and text modes. The print_utmp function does not print all of the structure fields. Finally, the return values from fprintf and fclose were handled by discarding them.

