The inconsistency of strlen and strsub

I ran into a weird bug in my GDL code that I could not explain at all. To save you from the same amount of head scratching, I want to document this inconsistency here.
In a for loop I recently did some text operations on strings containing newline characters (\n). It worked great on the first paragraph, but on the next paragraph the result was off, not by one, but by two, and it got progressively worse with every paragraph after that.

Let's start with a very easy example:

test = "1\n2\n3"

So far there are no surprises here. If we showed this on the plan with text2 0, 0, test, we would get:

1
2
3

as our output.

Let's calculate the length of the string (and use "text2" again to show it in the floor plan, a poor man's "print()" if you will!): text2 0, 0, strlen(test).

Now, the result would be "5", since "\n" is counted as one character (EOL) here. This might already come as a surprise to some, but it is a reasonable result. Nearly every programming language (I couldn't find one that doesn't) counts "\n" as one character when probed with its native string length function.

But let's continue with the next operation.
For whatever reason, we need to split the text into multiple parts. For that we can use the strsub function.
As a refresher, here are the three parameters strsub expects: strsub(str, n, m).

  • The first, str, is the string which will be hacked apart.
  • Secondly, n is an integer offset at which the extracted substring starts, where 0 < n ≤ strlen(str). Keep in mind that GDL indices are not zero-based, so the first character of a string really is the "first" (and not the 0th).
  • And lastly, the second integer m is the length of the wanted substring counted from the offset n, where 0 < m ≤ strlen(str)-n+1.
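
As a quick sanity check of these parameters, here is a minimal sketch on a string without any escape sequences. The variable name word is made up for illustration:

```gdl
word = "Archicad"

! Start at the 1st character, take 4 characters
text2 0, 0, strsub(word, 1, 4)      ! "Arch"

! Start at the 4th character, take 3 characters
text2 0, -1, strsub(word, 4, 3)     ! "hic"
```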

Let's try it out:

text2 0, -1, strsub(test, 1, 1)

Predictably, this will show just "1" in the plan.

text2 0, -1, strsub(test, 1, 3)

Now it gets spooky. This will still just show "1"!
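"\n" is clearly, and surprisingly, counted as 2 characters in "strsub"! The three characters we asked for are "1", "\" and "n", which show up as a "1" followed by a line break.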
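
Putting the two calls next to each other makes the mismatch easy to reproduce. The values in the comments are the ones reported in this post for a 2D script; treat them as observations rather than documented behaviour:

```gdl
test = "1\n2\n3"

! strlen treats each "\n" as one character
text2 0, 0, strlen(test)            ! reported: 5

! strsub treats each "\n" as two characters ("\" and "n"),
! so three characters cover only "1" plus the first newline
text2 0, -2, strsub(test, 1, 3)     ! reported: "1"
```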

To prove this further, we can try

test = "1\n23\n4"
text2 0,-1, strsub(test, 1, 4)

If you predicted we would see

1
23

you’d be wrong. Instead it will just be

1
2


Now, what do you think will

text2 0,-1, strsub(test, 3, 1)

yield as a result?

If you said "n", you'd be right. Again, this might come as a surprise, but in a way it is very straightforward: you can literally read the code without any knowledge of special characters and still get the "right" result. Seen in isolation, the third character of the test string is indeed an "n", even though we would never see it when the string is used in its intended context.
But how come that

text2 0,-1, strsub(test, 2, 1)

will be empty? Shouldn't we see the backslash appear in the plan?
That, however, is not possible.

If we look into the GDL reference manual, we see that "\" is a special control character that cannot be used on its own, only together with "n" and "t" (for newline and tab, respectively).
If we want to show an actual backslash, we need to escape it with another backslash, like this: \\.
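
A minimal sketch of that escaping rule, using a made-up path string purely for illustration:

```gdl
! "\\" in the source stands for one literal backslash in the output
text2 0, 0, "C:\\Temp"      ! expected to show C:\Temp
```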

And from this perspective it starts to make sense (somewhat):
As long as we do not actually "write" the text, these are still two separate characters, but one of them is not valid on its own, so we cannot see it.

Still, this leaves us in a bit of a pickle.
Now we know that "strlen()" can be deceiving: as far as strsub is concerned, the string might be much longer than reported. This can break loops, for example, in more than one way: we might cut a string at the wrong position, or simply assume the string is exhausted while it actually is not.

Of course, when we know that there are no newlines or tabs in the string we process, none of this matters.
But when the user provides the strings, we cannot be sure that they did not sneak in some of these bad boys.

For this reason I came up with this little algorithm that will give us the correct™ number of characters in a string, disregarding escape sequences and treating every char as one no matter what. As written, it only looks for "\n"; tabs ("\t") would presumably need the same treatment.

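input_string = "Hello\nWorld\n"

! Length as reported by strlen: each "\n" counts as one character
total_chars = strlen(input_string)

_str_rest = input_string
extra_newlines = 0
loop_count = 1

! The string is cut right behind every newline found, so each strstr call
! only ever sees the part that has not been scanned yet
while loop_count <= 256 do
    newline_pos = strstr(_str_rest, "\n")

    if newline_pos > 0 then
        ! Found a newline: count it and continue behind it
        extra_newlines = extra_newlines + 1
        _str_rest = strsub(_str_rest, newline_pos+1, 256)
    else
        ! No newline left, we are done
        goto "EXITLOOP"
    endif
    ! Safeguard against infinite loops
    loop_count = loop_count + 1
endwhile
"EXITLOOP":
! Every newline adds one character on top of what strlen reports
result = total_chars + extra_newlines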
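
For the sample input, strlen reports 12 (ten letters plus two newlines, each counted once), the loop finds two newlines, and result comes out as 14. Assuming strsub really indexes each "\n" as two characters, as observed above, result is the upper bound to use when walking the string with strsub. A minimal sketch, where the names i and ch are only illustrative:

```gdl
! Walk the string one strsub position at a time.
! Using strlen(input_string) as the bound would stop two positions early.
for i = 1 to result
    ch = strsub(input_string, i, 1)
    ! ... process ch here; per the observations above,
    ! the "\" half of a "\n" comes back empty
next i
```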

Ultimately, the best way forward would be for Graphisoft to fix it, which could be easy: just ignore what all other languages are doing and count every escape sequence as two characters everywhere, even in strlen(), and all would be good.

Addendum:

As Bernd Schwarzenbacher correctly pointed out, a "fix" would most likely end up as a new command, strlen{2}, to maintain backwards compatibility. Otherwise, a lot of existing scripts would break in unexpected ways.