The inconsistency of strlen and strsub
I witnessed a weird bug in my GDL code that I could not explain at all. To save you from the same amount of head scratching I want to document this inconsistency here.
In a for loop I recently did some text operations on strings containing newline characters (\n). It worked great on the first paragraph, but on the next paragraph the result was off, not by one but by two characters, and it kept getting progressively worse from there.
Let's start with a very easy example:
test = "1\n2\n3"
So far there are no surprises here. If we showed this on the plan with text2 0, 0, test we would get:
1
2
3
as our output.
Let's calculate the length of the string (and use "text2" again to show it in the floor plan, a poor man's "print()" if you want!): text2 0, 0, strlen(test).
Now, the result would be "5" since "\n" is counted as one character (EOL) here. This might come as a surprise for some already, but it is a reasonable result. Nearly every programming language (I couldn't find one that doesn't) counts "\n" as one character when being probed with their native string length functions.
But let's continue with the next operation.
For reasons we need to split the text into multiple parts. For that we can use the strsub function.
As a refresher let me provide you the three parameters strsub wants: strsub(str, n, m).
- The first, str, is the string which will be hacked apart.
- Secondly, n is an integer offset from where the extracted substring should start, where 0 < n ≤ strlen(str). Please mind that since GDL is not zero-based for its indices, the first character in a string is in fact the "first" (and not the 0th).
- And lastly, the second integer m is the length of the substring wanted from the offset n, where 0 < m < strlen(str)-n+1.
Let's try it out:
text2 0, -1, strsub(test, 1, 1)
This will show predictably just "1" in the plan.
text2 0, -1, strsub(test, 1, 3)
Now it gets spooky. This will still just show "1"!
"\n" is clearly – and surprisingly – counted as 2 characters in "strsub"!
To prove this further we can try
test = "1\n23\n4"
text2 0,-1, strsub(test, 1, 4)
If you predicted we would see
1
23
you’d be wrong. Instead it will just be
1
2
Now, what do you think will
text2 0,-1, strsub(test, 3, 1)
yield as a result?
If you said "n" you'd be right. Again, this might come as a surprise, but in a way it is at least very straightforward: you can literally read the code without any knowledge of special characters and you'd get the "right" result. The third character of the test string, seen on its own, is indeed an "n", even though we would never see it when the string is used in its intended context.
But how come that
text2 0,-1, strsub(test, 2, 1)
will be empty? Shouldn’t we see the backslash appearing in plan?
That, however, is not possible.
If we look into the GDL reference manual we see that "\" is a special control character and cannot actually be used on its own, only together with "n" and "t" (for newline and tabulator respectively).
If we want to show the actual backslash we need to escape it itself like this \\.
And from this perspective it starts to make sense (somewhat):
As long as we do not actually "write" the text, those are still two separate characters – but one of them is not valid on its own, so we cannot see it.
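To make this mental model concrete, here is a small sketch in Python (chosen only because it is easy to run outside of Archicad). It is not GDL and says nothing about how Graphisoft actually implements these functions. The names rendered, model_strlen and model_strsub are hypothetical, and the rules are assumptions derived from the observations above: strsub cuts the raw source characters, and a backslash only shows up as part of "\n" or "\t".

```python
# Assumed model, not GDL: only "\n" and "\t" are treated as escape sequences.
ESCAPES = {"n": "\n", "t": "\t"}


def rendered(source: str) -> str:
    """Turn raw source characters into what would be displayed in this model."""
    out = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "\\":
            nxt = source[i + 1] if i + 1 < len(source) else ""
            if nxt in ESCAPES:
                # A complete escape sequence becomes one displayed character.
                out.append(ESCAPES[nxt])
                i += 2
            else:
                # A lone backslash is not valid on its own: nothing is shown.
                i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)


def model_strlen(source: str) -> int:
    """Length as strlen reports it: every escape sequence counts once."""
    return len(rendered(source))


def model_strsub(source: str, n: int, m: int) -> str:
    """Substring as strsub behaves here: 1-based, cut on RAW characters."""
    return rendered(source[n - 1:n - 1 + m])


test = r"1\n2\n3"    # raw string: backslash and "n" are two characters
test2 = r"1\n23\n4"

print(model_strlen(test))               # 5
print(repr(model_strsub(test, 1, 3)))   # '1\n' (shows as "1" plus a line break)
print(repr(model_strsub(test2, 1, 4)))  # '1\n2'
print(repr(model_strsub(test2, 3, 1)))  # 'n'
print(repr(model_strsub(test2, 2, 1)))  # '' (the lone backslash is dropped)
```

Under these assumptions the model reproduces every result from the examples above, which at least suggests the "two raw characters" explanation is consistent.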
Still this leaves us in a bit of a pickle.
Now we know that "strlen()" can be deceiving – the actual string might be much longer, which might break e.g. loops for multiple reasons, be it because we cut a string at the wrong positions, or just because we plainly think that the string is exhausted while it actually is not.
Of course, when we know that there are no newlines or tabs in the string we process, none of this matters.
But when the user provides the strings, we cannot be sure that they did not sneak in some of these bad boys.
For this reason I conceived this little algorithm that will give us the correct™ number of characters in a string – disregarding escape sequences and treating every char as one no matter what.
input_string = "Hello\nWorld\n"
total_chars = strlen(input_string)
_str_rest = input_string
extra_newlines = 0
current_pos = 1
while current_pos <= 256 do
    newline_pos = strstr(_str_rest, "\n")
    if newline_pos > 0 then
        ! Found a newline, increment counter
        extra_newlines = extra_newlines + 1
        _str_rest = strsub(_str_rest, newline_pos+1, 256)
    else
        goto "EXITLOOP"
    endif
    ! infinite-loop safeguard
    current_pos = current_pos + 1
endwhile
"EXITLOOP":
result = total_chars + extra_newlines
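As a cross-check of the arithmetic outside of Archicad, the same count can be sketched in a few lines of Python. This is only an illustration, not a replacement for the GDL above, and strsub_length is a hypothetical name. The GDL snippet looks for "\n" only; counting "\t" as well is my assumption, based on the manual listing both as escape sequences.

```python
def strsub_length(displayed: str) -> int:
    """Length in strsub terms: every newline or tab counts as two characters."""
    escapes = displayed.count("\n") + displayed.count("\t")
    return len(displayed) + escapes


# Same input as the GDL example: strlen gives 12, plus 2 newlines.
print(strsub_length("Hello\nWorld\n"))  # 14
```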
Ultimately the best way forward would be for Graphisoft to fix it, which could be easy: Just ignore what all other languages are doing and count all escape sequences as two characters everywhere, even with strlen(), and all would be good.
Addendum:
As Bernd Schwarzenbacher correctly pointed out, a "fix" would most likely end up as a new command strlen{2} to maintain backwards compatibility. A lot of existing scripts would break in unexpected ways otherwise.
