I’ve actually messed with this a bit. The problem is more that it can’t count to begin with. Even if you ask it to spell out each letter individually (i.e., so each letter becomes its own token), it still gets the count wrong.
In my experience, reasoning models can count, but not very consistently. I’ve tried random assortments of letters and they sometimes count them correctly. They seem to have a much harder time when the same letter repeats many times, perhaps because repeated runs are tokenized irregularly.
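To make the tokenization point concrete, here’s a toy sketch (the token boundaries below are hypothetical, not from a real BPE vocabulary): a repeated run of the same letter can get split into uneven multi-character tokens, so counting requires knowing how many of that letter each token contains, which isn’t directly visible at the token level.

```python
# Character-level counting is trivial in code; a model never sees
# characters, only tokens.
text = "sssssss"

# A BPE-style tokenizer might split a repeated run unevenly
# (hypothetical segmentation for illustration):
tokens = ["sss", "ss", "ss"]
assert "".join(tokens) == text

# To count letters, the model has to know the per-token spelling --
# information that is not encoded in the token IDs themselves.
per_token = [t.count("s") for t in tokens]
print(sum(per_token))  # 7 -- easy for code, opaque at the token level
```

The same string seen as three opaque IDs gives the model nothing to sum over, which is consistent with repeated-letter strings being harder than random assortments.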