Sunday, 22 February 2015

A compression scheme for English sentences.

Just thought of this in my head...

There are around 1,025,000 words in the English language.
4 bytes can hold 2^32 = 4,294,967,296 different values, which is roughly 4,200 times as many.
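As a quick check of that arithmetic, here is a small Python sketch. It takes the 1,025,000 figure above as given, and the names WORD_COUNT and codes_available are made up for illustration.

```python
# Rough capacity check, assuming the post's estimate of about 1,025,000 words.
WORD_COUNT = 1_025_000


def codes_available(num_bytes: int) -> int:
    """Number of distinct values that fit in `num_bytes` 8-bit bytes."""
    return 256 ** num_bytes


# For 1 to 4 bytes: how many values there are, and how many times
# over that covers the assumed word count.
for n in range(1, 5):
    total = codes_available(n)
    print(n, total, round(total / WORD_COUNT, 2))
```

By this count 3 bytes (16,777,216 values) would already be enough to give every word its own code.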
So using 8-bit characters, here is a way of compressing an English sentence.
Go through every word in the English language (you could use only 2^16 or 2^24 of them) and give each word its own unique sequence of characters. With just 4 8-bit characters per word you can cover far more than the number of words in an English dictionary. You don't need to use all four bytes every time either: the first words can get just 1 character each (256 possibilities).
Each character has only 256 possibilities, but combining characters gives many more: 256^2 = 65,536 with two and 256^3 = 16,777,216 with three.
Now you can compress an English sentence by looking up each word and replacing it with its unique characters.
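Then just use a "#" character to separate them, or a " " character to both separate them and stand for a space in the sentence. Those two characters then have to be kept out of the codes themselves, which takes a little off the counts above.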
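Here is a rough Python sketch of the idea so far, not a finished implementation. The five-word WORDS list is a hypothetical stand-in for a real dictionary, every function name is invented for this example, and it assumes the "#" and " " byte values are never used inside a code so the separator stays unambiguous. Capital letters, punctuation and words missing from the list are not handled.

```python
# Sketch only: the word list and all names here are hypothetical.
SEPARATOR = b" "

# Byte values allowed inside a code. "#" and " " are left out (an assumption
# of this sketch) so a separator can never be mistaken for part of a code.
ALPHABET = bytes(b for b in range(256) if b not in b"# ")
BASE = len(ALPHABET)  # 254 usable values per character


def index_to_code(index: int) -> bytes:
    """Give word number `index` the shortest code not already taken."""
    length, block = 1, BASE
    while index >= block:  # skip past all the shorter codes
        index -= block
        length += 1
        block = BASE ** length
    digits = []
    for _ in range(length):
        index, digit = divmod(index, BASE)
        digits.append(ALPHABET[digit])
    return bytes(reversed(digits))


def code_to_index(code: bytes) -> int:
    """Inverse of index_to_code."""
    offset = sum(BASE ** k for k in range(1, len(code)))
    value = 0
    for byte in code:
        value = value * BASE + ALPHABET.index(byte)
    return offset + value


def compress(sentence: str, words: list[str]) -> bytes:
    """Replace each word with its code, joined by the separator."""
    lookup = {word: i for i, word in enumerate(words)}
    return SEPARATOR.join(index_to_code(lookup[w]) for w in sentence.split())


def decompress(data: bytes, words: list[str]) -> str:
    """Turn the codes back into words."""
    if not data:
        return ""
    return " ".join(words[code_to_index(c)] for c in data.split(SEPARATOR))


# A tiny stand-in for the real dictionary.
WORDS = ["the", "compression", "sentence", "english", "works"]

packed = compress("the english sentence compression works", WORDS)
print(len(packed), decompress(packed, WORDS))
```

With this toy list every word fits in one character, so the 38-character sentence packs into 5 codes plus 4 separators.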
If you want most words in written sentences to take up only 2 characters (excluding the "#"/" "), which is over 65 thousand possibilities, just order the words by how often they are used and give the shortest codes to the most used ones.
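The ordering step could look something like this in Python. It is only a sketch: the sample text is made up, rank_words and code_length are hypothetical names, and the default of 256 values per character ignores whatever gets reserved for the separator.

```python
from collections import Counter


def rank_words(corpus: str) -> list[str]:
    """Words from most to least used; ties are broken alphabetically."""
    counts = Counter(corpus.lower().split())
    return sorted(counts, key=lambda word: (-counts[word], word))


def code_length(rank: int, base: int = 256) -> int:
    """Characters needed for the word at position `rank` (0 = most used)."""
    length, block = 1, base
    while rank >= block:  # all codes of this length are already used up
        rank -= block
        length += 1
        block = base ** length
    return length


# Made-up sample text; a real ranking would need a large body of writing.
sample = "the cat and the dog and the bird"
for rank, word in enumerate(rank_words(sample)):
    print(word, code_length(rank))
```

Under these assumptions the 256 most used words take 1 character, the next 65,536 take 2, and only the rarer words need 3 or more.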
A lot of English words are longer than 4 letters, so this should work fine.
And if you go through this entire post you will see that a lot of the words are over four letters long!

