Separation of Concerns in Data

A key principle of software architecture and design is Separation of Concerns. It is the principle of composing a software system from parts whose responsibilities overlap as little as possible.

Normally, separation of concerns is applied to classes or components. However, it is also important in data. I don’t mean database schema design. I mean something more general than that.

I recently encountered a simple violation of separation of concerns in data that had non-trivial consequences long after the violation was introduced. To describe the problem, I’m going to obfuscate the actual details and instead concoct an imaginary software system that exhibits the same issue.

Imagine a data analysis application that runs as a desktop GUI program. The application organizes the analysis of data as projects, where a project is associated with one or more datasets, application settings, data filters, report templates, and so on. A project file describes the structure of the project: it contains a catalog of all the supporting files, along with metadata about the project and those files.

Now imagine that we take the core of the application and wrap it in a web service (the idea is to convert the single-user application to a multi-user client-server system). For a variety of reasons the web service keeps a cache of certain information from the project file as well as some information from other files in the project. The web service watches for changes to files in the project and updates the cache whenever the files change.

Unfortunately, somewhere in the history of the application someone decided it would be a good idea to write a timestamp in the project file every time an analysis was performed. The project structure isn’t changed during the analysis, and the input data and other files are most often unchanged as well. However, every time an analysis is run and a report is generated, the project file gets updated to put the current time in one of the fields.

This results in the web service going through the cache update process every time¹ the analysis application is executed, even though the cache only needs to be updated when something structural changes in the project.
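To make the effect concrete, here is a minimal sketch. It assumes the project file is JSON and that the timestamp lives in a field called `last_run`; both are inventions for this imaginary system, as are the function names. A watcher that fingerprints the file as written sees a change after every run, while a fingerprint that skips the timestamp does not.

```python
import hashlib
import json

# Hypothetical field name; the imaginary project file format is an assumption.
VOLATILE_FIELDS = {"last_run"}


def raw_fingerprint(project_text: str) -> str:
    """Hash the file exactly as written; this changes on every analysis run."""
    return hashlib.sha256(project_text.encode("utf-8")).hexdigest()


def structural_fingerprint(project_text: str) -> str:
    """Hash the project with the volatile fields removed."""
    project = json.loads(project_text)
    stable = {k: v for k, v in project.items() if k not in VOLATILE_FIELDS}
    canonical = json.dumps(stable, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two versions of the same project file, differing only in the run timestamp.
before = json.dumps(
    {"datasets": ["sales.csv"], "filters": ["q1"], "last_run": "2024-01-01T09:00:00"}
)
after = json.dumps(
    {"datasets": ["sales.csv"], "filters": ["q1"], "last_run": "2024-01-02T09:00:00"}
)

needs_refresh_naive = raw_fingerprint(before) != raw_fingerprint(after)
needs_refresh_structural = structural_fingerprint(before) != structural_fingerprint(after)
```

The second function is exactly the sort of code that exists only because the file mixes two concerns: the consumer has to know which fields to ignore.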

It would have been far better to create a status file in which analysis execution timestamps could be recorded. In other words, the concern of the project file is structure and metadata; the last run timestamp is not a concern for the project file. So separate that concern and store the information in a different file, one that is concerned with tracking run history.
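A rough sketch of that separation, again for the imaginary system: the file names `project.json` and `run_history.jsonl` and the function names are assumptions chosen for illustration. Each run appends a line to the history file and never touches the project file.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Tuple


def record_run(project_dir: Path) -> None:
    """Append a run timestamp to a history file, leaving the project file alone."""
    history = project_dir / "run_history.jsonl"  # hypothetical file name
    entry = {"finished_at": datetime.now(timezone.utc).isoformat()}
    with history.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def demo() -> Tuple[bytes, bytes, int]:
    """Run two 'analyses' and report the project file bytes before and after."""
    with tempfile.TemporaryDirectory() as tmp:
        project_dir = Path(tmp)
        project_file = project_dir / "project.json"  # hypothetical file name
        project_file.write_text(
            json.dumps({"datasets": ["sales.csv"]}), encoding="utf-8"
        )

        before = project_file.read_bytes()
        record_run(project_dir)
        record_run(project_dir)
        after = project_file.read_bytes()

        history_text = (project_dir / "run_history.jsonl").read_text(encoding="utf-8")
        return before, after, len(history_text.splitlines())
```

With this layout, a watcher that cares only about project structure can watch `project.json` alone, and anything interested in run history reads the other file.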

Architectural and design principles deserve to be considered when making the small choices, not just when making the big ones.

¹ Yes, we can do some checks and avoid much of the unnecessary processing, but if the “last run” timestamp were simply stored in a different file (e.g. a run status file, run history file, or run log file), we wouldn’t have had to build a bunch of crufty code to work around the issue.