Chapter 5. Hints on a few assignments

File processing

Hints for the Scandir assignment

The Assignment

The BSD scandir() takes a directory name, a comparison function, a filter function, a pointer to some work space, and proceeds to read the directory keeping only the files that pass the filter function, and then sorting the result according to the given comparison function. The Scandir assignment is to write an object-oriented, C++ library that accomplishes the same thing.

This section is about some things to think about when designing and implementing your scandir library. Since there have been many versions of this assignment over the past few years, not all of the following suggestions will pertain directly to the current assignment. However, you should think carefully about each of them.

Common sense design issues

  • We already have the scandir function. Part of the point of this assignment is to replace it with an object-oriented solution to the same problem. This means that you should 1: Figure out exactly what problem scandir() solves, and 2: solve the same problem with classes and objects. If your solution is to have a Scandir class with one member function that takes two function pointers and returns a vector of file names, then you need to rethink your design. That and many things similar are most definitely NOT object-oriented. Why? What does it mean to be object-oriented? (Clearly, the answer is not just "use classes.")

    In particular, you should be making sensible use of 1: inheritance, 2: encapsulation, and 3: virtual functions. If you are not using all three of these, rethink your design. How can you tell if you are using all three of these?

  • You should not need global variables of any kind. The scandir() function certainly doesn't. This includes having an int somewhere that specifies which kind of sort to do:

    Example 5-1. DO NOT DO THIS OR ANYTHING LIKE THIS!!!

    #define ALPHASORT 1
    #define DATESORT 2
    #define SIZESORT 3
    
    int SortType;
    
    void SomeClass::Sort( vectord<files> & v )
    {
      switch( SortType )
        {
         case ALPHASORT:
    
        ...
    
        }
    }
    Likewise, do not use a global flag for things like: whether or not to leave out "." files, whether to print quietly or verbosely, etc. Why shouldn't you do it this way?

  • Continuing on that note, we can use any comparison function we can code with the scandir() function without having to touch the source code for scandir. If it is impossible or painfully difficult to add a new sort order to your solution to the scandir problem without editing your files, then you need to rethink your design.

  • The same goes with filtering. The scandir part will be in a library, and will be a black box as far as other people using your code are concerned. Your scandir library should be at least as useful and flexible as the scandir() function.

  • Part of the point of object-oriented programming is to avoid long, complicated if-else chains. How does it do this, and why?

  • If you filter in one place, don't filter somewhere else too. If you sort in one place, don't sort somewhere else too.

  • Don't needlessly invent special cases. You should do all your filtering "the same way." You should do all your sorting "the same way."

We've had similar assignments in 108 in the past and people have made a number of rather strange design decisions. The use of a global variable to change sort order actually turned up in several different groups. Many groups also did really bizarre things like this:

Example 5-2. A silly way to sort

enum SortTypes { ALPAH_SORT, DATE_SORT, SIZE_SORT, CUSTOM_SORT };

typedef int (*ComparisonFunction)( DirEntry &, DirEntry & );

int TheSortType = ALPHA_SORT;
ComparisonFunction CustomCompare = NULL;

void DoSort( vector<...> ... )
{
  switch( TheSortType )
    {
       case ALPHA_SORT:
	etc.

       case CUSTOM_SORT:
          if( (*CustomCompare)( list[i], list[j] ) )
             {
             ...
             }
      }
}
In other words, they could handle the general case of sorting (in an admittedly awkward way), but they chose to handle other cases in a completely different way. They could have handled alphabetical, date, and size sorting by just writing different "custom" comparison functions, which would have made their code much simpler. And, they could have avoided the global variable by passing in the comparison function. This is what I meant about do things "the same way" and don't invent special cases for yourself.

Now suppose I wanted sools to filter out all files older than some number of days, something like this:

sools --showOlderThan 5
You need to be able to pass that 5 to the filter somehow, so if your filtering mechanism uses function pointers, we're sunk. We could use a global variable, but that's ugly and non-o.o. We sort of need to pass in the 5 with the filter function, which we can't do with just a function pointer. The same thing if I want files that match some glob regular expression:
sools --showMatches '*.txt'
How do we pass the '*.txt' to the filter? Well, we could just provide lots of filter functions somewhere:

Example 5-3. A silly way to filter

DoFilter( StringFilterFunction, string )
DoFilter( IntFilterFunction, int )
that let you pass in something that will be sent to the filter function in addition to the DirEntry. Of course, this quickly gets out of hand, and we can't possibly anticipate all the different kinds of filters that people might want to use, so we might come up with something like:

Example 5-4. A slightly less silly way to filter

DoFilter( VoidPtrFilterFunction, void * )
which sort of works but is decidedly not type-safe and is ugly and not really o.o.

The trick is to realize that C++ gives us a very nice tool for solving this very problem: the virtual function. With that, I will leave you with one last hint. The following is just about the most useful class I have ever written (no joke):

Example 5-5. The Command class

class Command
{
  public:
    virtual void Execute() = 0;
    virtual ~Command() {}
};
Its cousin, template class ArgCommand, is also quite useful. I leave you to figure out what it looks like.

Things you might do without thinking, but don't have to do

  • You do not need to do sorting and filtering within the same function, or even within the same class.

  • You do not need to have a class named Scandir or DirScanner, etc. Why might that be a bad thing?

  • You will almost certainly need to sort a vector of some kind. (Or use a linked list, or a self-sorting list of some kind.) However, it does not have to be publicly visible as a raw data structure. You can hide it behind Iterators or Visitors. What does this get you?

  • There are ways to solve the scandir problem that do not involve passing around function pointers.

  • There are at least two ways to implement the reversed-order flag of sools without physically reversing the order of a vector of file objects and without passing a flag to a sort function to tell it increasing or deceasing order and without writing twice as many comparison functions.

Leave room to make it powerful

  • There are ways to solve the scandir problem that allow for very sophisticated filtering. Imagine someone who wants to find all non-executable files larger than 10Mb and older than one week.

Write it in C++ using C++ and STL idioms

  • Consider using Iterators. There are good ways to avoid returning vectors or passing them around. (Why should you avoid that? the section called Don't pass containers around by value in Chapter 4)

  • Consider using Command objects. So, what are command objects? Think about this: How is a virtual function call like using a function pointer, and why is it better?

    I don't just mean an object with an overloaded operator()(...) method. While that certainly can be used, there's no particular reason to use it. (Or is there?)

    You will perhaps want to look at hints on sorting with STL: the section called Sorting with STL in Chapter 4.

Gotchas

  • Be careful when searching directories recursively. Each time you open directory, it uses a file handle. In Solaris, a process can only open a maximum of 64 file handles at once. If you need to open more, you must first close an already open file handle, otherwise your program will crash. There are clever (object-oriented) ways around this problem, too. Where in DirStream is the file handle opened, and where is it closed?

Miscellaneous issues

Brainteasers

  • When is it good to have a class with a single member function? When is it bad? Are you sure?