I still double space after a period, because fuck you, it is easier to read. But as a bonus, it helped me prove that something I wrote wasn’t AI. You literally cannot get an AI to add double spaces after a period. It will say “Yeah, OK, I can do that” and then spit out a paragraph without it. Give it a try, it’s pretty funny.
This is because spaces are typically handled by the model’s tokenizer rather than by the model itself.
In many cases it would be redundant to represent spaces explicitly, so tokenizers collapse them away entirely; the model then reads the tokens as if the spaces never existed.
For example it might output: thequickbrownfoxjumpsoverthelazydog
Except it would actually be a list of numbers like: [1, 256, 6273, 7836, 1922, 2244, 3245, 256, 6734, 1176, 2]
Then the tokenizer decodes this and adds the spaces because they are assumed to be there. The tokenizer has no knowledge of your request, and the model output typically does not include spaces, hence your output sentence will not have double spaces.
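For the curious, here’s a minimal sketch of that text-to-IDs-to-text pipeline, using OpenAI’s tiktoken as a stand-in for whatever tokenizer a given model actually uses (the IDs in the example above are only illustrative; real IDs depend entirely on the vocabulary):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is one real BPE vocabulary; other models use different ones.
enc = tiktoken.get_encoding("cl100k_base")

text = "the quick brown fox jumps over the lazy dog"
ids = enc.encode(text)                 # text -> list of integer token IDs
print(ids)                             # the model only ever sees numbers like these
print([enc.decode([i]) for i in ids])  # what each individual ID maps back to
print(enc.decode(ids))                 # IDs -> text again
```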
I’d expect tokenizers to include spaces in the tokens themselves. Words can be constructed from multiple tokens, so you can’t really insert spaces based on token boundaries. And stripping spaces would throw away too much information.
In my tests, plenty of LLMs are also capable of seeing and using double spaces when accessed through the right interface.
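One quick way to check what the tokenizer itself does is to round-trip a double-spaced sentence and look at which tokens come back (again using tiktoken as an example; other tokenizers may behave differently):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "First sentence.  Second sentence."  # two spaces after the period
ids = enc.encode(text)

print(enc.decode(ids) == text)          # True: the round trip is lossless here
print([enc.decode([i]) for i in ids])   # shows which token(s) carry the extra space
```

With this vocabulary the round trip is lossless, so nothing at the tokenizer level forbids double spaces; whether a model ever emits them comes down to training and to what the interface strips.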
The tokenizer is capable of decoding spaceless tokens into compound words following a set of rules referred to as a grammar in Natural Language Processing (NLP). I do LLM research and have spent an uncomfortable amount of time staring at the encoded outputs of most tokenizers when debugging. Normally spaces are not included.
There is of course a token for spaces in special circumstances, but I don’t know exactly how each tokenizer implements those spaces. So it does make sense that some models would be capable of the behavior you find in your tests, but that appears to be an emergent behavior, and it is very interesting to see it work successfully.
I intended my original comment to convey that it’s not surprising that LLMs might fail to follow an instruction to include spaces, since they normally don’t see spaces except in special circumstances. Similar to how it’s unsurprising that LLMs are bad at numerical operations because of how they apply Markov-chain-style probabilities to each next token, one at a time.
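As a side note on the numbers point, the tokenizer is part of that story too: you can see how digits get chopped up before the model ever sees them (again just tiktoken as an illustration; the exact splits vary by vocabulary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for n in ["7", "42", "1234", "123456789"]:
    pieces = [enc.decode([i]) for i in enc.encode(n)]
    print(n, "->", pieces)  # longer numbers come out as multi-digit chunks
```

Since the model sees chunks like "123" and "456" rather than individual digits, digit-by-digit arithmetic is harder for it than it looks from the outside.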
Yeah, I would expect it to be hard, similar to asking an LLM to substitute every letter e with an a, which I’m sure they struggle with but still manage to perform.
In this context, though, it’s a bit misleading to explain OP’s observed behavior that way, since it implies the behavior stems from the fundamental nature of LLMs, when in practice every model I have tested did have the ability.
It does seem that LLMs simply don’t use double spaces (or I have not noticed them doing it anywhere yet), but if you trained or just system-prompted them differently they could easily start to. So it isn’t a very stable method for non-AI identification.
Edit: And of course you’d have to make sure the interfaces also don’t strip double spaces, as was guessed elsewhere. I have not checked other interfaces and would not be surprised either way. This, too, can’t be overly hard to fix with a few select character conversions, even in the worst cases. And clearly at least my interface already managed to pass them through just fine.
So… Why don’t I see double spaces after your periods? Test. For. Double. Spaces.
EDIT: Yep, double spaces were removed from my test. So, that’s why. Although, they are still there as I’m editing this. So, not removed, just hidden, I guess?
Web browsers collapse whitespace by default, which means that, sans any trickery or deliberate use of non-breaking spaces, any number of spaces between words gets reduced to one. Since apparently every single thing in the modern world is displayed via some kind of encapsulated little browser engine nowadays, the majority of double spaces left in the universe that are not already firmly nailed down into print now appear as singles. And thus the convention is almost totally lost.
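If you actually wanted double spaces to survive that collapsing, the non-breaking-space trick mentioned above is the usual route. A minimal sketch of that conversion (the function name and the NBSP-plus-space choice are just one way to do it):

```python
NBSP = "\u00a0"  # non-breaking space: browsers don't collapse it away

def preserve_double_spaces(text: str) -> str:
    # Replace each double space with NBSP + normal space so HTML
    # whitespace collapsing leaves the visual gap intact.
    return text.replace("  ", NBSP + " ")

print(preserve_double_spaces("One sentence.  Two spaces survive."))
```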
This seems to match up with some quick tests I did just now on the pseudonymized chatbot interface of DuckDuckGo.
ChatGPT, Llama, and Claude all managed to use double spaces themselves, and all but Llama managed to tell that I was using them too.
It might well depend on the platform, with the “native” applications for them stripping them on both ends.
tests
Mistral seems a bit confused and uses triple spaces.
Markdown usually collapses double spaces, yeah. But you can force the double spaces. Like this.
Double spaces after periods can create “rivers.” This makes text more difficult to read for those with dyslexia. Whatever is used as a text editor is probably stripping them out for accessibility reasons. I suppose double spaces made sense with monospaced fonts.
https://apastyle.apa.org/style-grammar-guidelines/paper-format/accessibility/typography#myth4
HTML rendering collapses whitespace; it has nothing to do with accessibility. I would like to see the research on double-spacing causing rivers, because I’ve only ever noticed them in justified text, where I would expect the renderer to be inserting extra space after a full stop compared to between words within a sentence anyway.
I’ve seen a lot of dubious legibility claims when it comes to typography including:
serif is more legible
sans-serif is more legible
comic sans is more legible for people with dyslexia
and so on.
LLMs can’t count because they’re not brains. Their output is the statistically most likely next token, and since a lot of electronic text wasn’t double-spaced after a period, they can’t “follow” that instruction.