Agile Renaissance


On the Semantic Distance between Input & Output, Problem Partitioning & Complexity

Print the article

This entry was posted on 5/10/2007 4:56 PM and is filed under Development.

I first used the term Semantic Distance as a way of discussing the distance between the execution capabilities of a programming language and the execution requirements of the solution. Reg Braithwaite took this concept and expanded on it in by looking at "how well programs are written for human consumption" — this is yet another in a long list of excellent posts from Reg (just call me fanboy). His observation linking comprehensibility of the program to "the form of the code matches the data the code generates" is the same as how my colleagues and I at OmniMark Technologies used to judge how good an OmniMark program was, by how much it's structure matched the structure of the solution. There is, however, more to Semantic Distance and in particular its relationship to input and output, problem partitioning and complexity. In this post, I want to explore this, by recounting how I first learned about Semantic Distance and how to overcome it and then briefly explore its implications on dealing with complexity.

Back in the fall of 1990, my friend Sam Wilmott used OmniMark to convert a document, an Intergraph document (if I remember correctly), to one based on a specific SGML DTD. It took him about 2 weeks to write the conversion program. Afterwards, he was unhappy with his solution so he decided to re-write it. This time as two programs coupled together using the OmniMark co-routine mechanism context-translate. I had naively thought that it would take him another 2 weeks, but he was done in 3 1/2 days. Boy were he and I surprised. A little while latter, I observed again this non-linear decrease in the amount of development time when writing a solution for a second time using the co-routining model. Then it happened again and again and it was also observed by some of my other colleagues and some of our customers. This phenomenon delighted me. It also bothered me, because I could not find an explanation for why it was happening. So I started trying to solve it myself.

My first attempt was to attribute this non-linear decrease in development time to the reduced learning curve when you solve a problem for a second time. However, at best, this explanation would only account for some percentage of the time difference and in some cases the programmers were already knowledgeable about the input and output formats which meant that the learning curve time difference between the two solutions was essentially zero. My first clue towards a better explanation was the observation that this decrease in development time happened independent of the source and target languages but that the amount of the decrease in development time did vary.

Specifically, I noticed a difference in time decreases between different translations into the same CALS DTD. In other words, the target language was the same but the source languages were different and the time taken to develop their respective solutions was also different. To capture this observation, I started drawing two points in space and labeled the left dot L1, the source language, and the other dot L2, the target language. I then drew a straight line between them and said that this was the syntactic distance between the lexical units of the two languages. This model was too simplistic. It could not explain the non-linear decrease in time caused by introducing an intermediate language, L1i, between L1 and L2. Another issue with it, was that I had observed in one case, that it took more time to write a program for two lexically similar languages then it took for a program between two lexically very dissimilar languages. This was clue number two.

When I looked at the two sets of languages I observed that the two lexically dissimilar languages in fact had similar semantics and the two lexically similar languages had very different semantics. To capture this notion of semantic difference I drew a curve over top of the line connecting the two dots. I then reasoned that the area under the curve was where the semantics lay. Assume for now, that the curve is a 180 degree arc of a circle, the area under the curve is then

(PI * ((L2 - L1)/2)**2) / 2
When you introduce an intermediate language between L1 and L2 you end up with two smaller semi-circles whose combined area is less than the area of the initial semi-circle. This difference in area is due to the use of the power function, square, in the formula for the area of a circle. This model shows the non-linear decrease in time that was being observed. However, the magnitude of difference between the areas is not as large as what we were observing. So, on the suggestion of Mark Baker, I now use spheres when presenting this model today. The volume of sphere is calculated with the power function, cubed, which more approximates the observed decreases in time.

Implications

  1. Semantic Distance is actually more "volumetric" than scalar in nature.
  2. Breaking up the Semantic Distance between the Languages L1 and L2, into two smaller distances by introducing an intermediate language is problem partitioning. The effect of this partitioning is non-linear in nature.
  3. The amount of complexity associated with translating between input and output is proportional to the Semantic Distance between input and output.

Problems with the Model

Here are the major issues with this model that I have identified:

  1. This weakness is a variation of Zeno's Paradox about Achilles and a tortoise. The model predicts that if partitioning were recursively applied, then there should be no work to be done, which is equivalent to the argument that Achilles can never pass the tortoise. Clearly, this is not the case. In fact, there are aspects of the Semantic Distance that cannot be sub-divided, they are atomic in nature. These atomic "aspects" are sometimes completely orthogonal to any other aspects of the Semantic Distance between the input and output languages. This means that Semantic Distance is in fact n-dimensional in nature. And yes Reg, this means that there cannot be one language which has the shortest Semantic Distance between its capabilities and those needed to handle the n-dimensional nature of the Semantic Distance between any possible pair of input and output languages.
  2. As my friend Jacques Legare, pointed out, the model does not show or prove that there is a relationship between Semantic Distance, complexity and implementation effort. To me there has to be, it is the only ting that makes sense to me, but showing it, proving it, is an open question. I am hoping that some reader will be able to provide assistance in answering this question, one way or the other?
  3. As nice as I find the explanation, and while it seems correct to me, I wonder if there is a better or more correct and accurate explanation?

Two Bits of Advice on Writing Conversion Programs I Learned from Sam Wilmott

When translating from one input language to another, create the first intermediate language close to the input rather than the output language. Many times it is enough to just normalize the input. That is convert the raw and messy input into the format that a perfect author would created. This often times or with a little bit of design eliminates most of the corner cases and makes translating it into the output straight forward. The first output to create when using this technique should be a stylized human readable version, which captures as much information from the input as possible. This readable version provides you with three things: visibility or feedback, easily testable output and easier maintenance. Readability and testability helps to make sure that you are translating to the normalized form correctly and quickly and the same applies if maintenance is required. Once you have the conversion completed, you can convert the readable format into the required output format. If conversion to and from the readable output becomes a bottleneck, you just have to replace the output statement in the first program.

To eliminate problematic complexity between different Semantic Distance axes, introduce intermediate languages which only deal with changing one of the Semantic axis at a time.

Summary

To me, the neat things about this journey, was learning to use problem partitioning to reduce the complexity and time to overcome the Semantic Distance between input and output and coming to understanding just how powerful and wise the age old advice that you conqueror large problems/tasks by breaking them down into smaller ones.

 

What did you think of this article?




Trackbacks
Trackback specific URL for this entry
  • No trackbacks exist for this entry.
Comments

    • 5/23/2007 3:54 PM beza1e1 wrote:
      You should make some graphs to your descriptions that would make it much easier to understand.
      Reply to this
      1. 5/23/2007 4:45 PM Norbert Winklareth wrote:
        Dear Andreas Zwinkau,

        Good observation, I will do so. However, it will not happen until mid-June and I will let you know when I do.

        Thanks for stopping by, Norbert


        Reply to this
    Leave a comment

    Submitted comments will be subject to moderation before being displayed.

     Enter the above security code (required)

     Name

     Email (will not be published)

     Website

    Your comment is 0 characters limited to 3000 characters.