Detecting English

To detect English in a string, we need a list of English words. You can download such a list here. We then compute the percentage of words in the message that appear in that list. If it is at least 20%, and letters and whitespace (as opposed to digits and special characters) make up at least 70% of the message, we treat the message as English.


CHARACTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" # this string can contain other characters like !, @, etc.
LETTERS_AND_SPACE = CHARACTERS + CHARACTERS.lower() + " \t\n" # valid letters (uppercase and lowercase) and whitespace characters

def load_dictionary(file_path): # loading a dictionary from a file and returning a set of uppercase words
    with open(file_path, "r") as dictionary_file:
        return set(word.strip().upper() for word in dictionary_file if word.strip()) # stripping whitespace, ignoring empty lines, and uppercasing each word so it can match the uppercased message

ENGLISH_WORDS = load_dictionary("dictionary.txt")

def remove_non_letters(message): # removing non-letter characters from the message
    return "".join(symbol for symbol in message if symbol in LETTERS_AND_SPACE)

def get_english_count(message): # returning the fraction (0.0 to 1.0) of words in the message that are English words
    message = message.upper()
    cleaned_message = remove_non_letters(message)
    possible_words = cleaned_message.split() # splitting the cleaned message into individual words

    if not possible_words: # checking whether the list is empty (there are no words at all)
        return 0.0

    matches = sum(1 for word in possible_words if word in ENGLISH_WORDS) # counting how many of the found words are valid English words
    return matches / len(possible_words) # return the ratio of valid words to total words

def is_english(message, word_percentage=20.0, letter_percentage=70.0):
    words_match = get_english_count(message) * 100 >= word_percentage # checking whether the percentage of English words meets the required threshold
    num_letters = len(remove_non_letters(message)) # counting the letters and whitespace characters that remain in the cleaned message
    message_letters_percentage = (num_letters / len(message)) * 100 if message else 0 # calculating the percentage of letters and whitespace in the original message (0 for an empty message)
    letters_match = message_letters_percentage >= letter_percentage # checking whether the percentage of letters meets the required threshold
    return words_match and letters_match

print(is_english("This is my secret message."))
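
To make the two thresholds concrete, here is a minimal sketch that computes both percentages for the sample message. It does not read a dictionary file: SAMPLE_WORDS is a hypothetical stand-in for the real word list, and word_and_letter_percentages is an illustrative helper that is not part of the code above.

```python
import string

# Hypothetical stand-in for the dictionary file: a tiny uppercase word set.
SAMPLE_WORDS = {"THIS", "IS", "MY", "SECRET", "MESSAGE"}
# Characters that count as "letters": ASCII letters plus whitespace.
ALLOWED = string.ascii_letters + " \t\n"

def word_and_letter_percentages(message):
    """Return (percentage of known words, percentage of letters/whitespace)."""
    if not message:
        return 0.0, 0.0
    cleaned = "".join(ch for ch in message if ch in ALLOWED)
    words = cleaned.upper().split()
    known = sum(1 for word in words if word in SAMPLE_WORDS)
    word_pct = 100 * known / len(words) if words else 0.0
    letter_pct = 100 * len(cleaned) / len(message)
    return word_pct, letter_pct

word_pct, letter_pct = word_and_letter_percentages("This is my secret message.")
print(f"{word_pct:.1f}% words, {letter_pct:.1f}% letters or whitespace")
# prints: 100.0% words, 96.2% letters or whitespace
```

The message has 26 characters, and only the final period is removed, so 25 of 26 characters (about 96.2%) count as letters or whitespace. Both values are above the default thresholds.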

Detecting the language of a string can be useful, e.g., in a brute-force attack on a cipher. Instead of inspecting every candidate decryption by hand, we can automatically discard the ones that do not look like English and keep only the most probable ones.
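
As an illustration, the sketch below brute-forces a Caesar cipher and keeps only the candidates that pass an English check. The Caesar cipher is chosen here only as an example and is an assumption, as are the names caesar_shift, looks_english and brute_force. SAMPLE_WORDS is a hypothetical stand-in for a full word list.

```python
import string

# Hypothetical stand-in for the full word list loaded from a file.
SAMPLE_WORDS = {"THIS", "IS", "MY", "SECRET", "MESSAGE"}
ALPHABET = string.ascii_uppercase
ALLOWED = string.ascii_letters + " \t\n"

def looks_english(message, word_percentage=20.0, letter_percentage=70.0):
    """Simplified check: enough known words and enough letters/whitespace."""
    if not message:
        return False
    cleaned = "".join(ch for ch in message if ch in ALLOWED)
    words = cleaned.upper().split()
    if not words:
        return False
    word_pct = 100 * sum(1 for word in words if word in SAMPLE_WORDS) / len(words)
    letter_pct = 100 * len(cleaned) / len(message)
    return word_pct >= word_percentage and letter_pct >= letter_percentage

def caesar_shift(text, key):
    """Shift each ASCII letter back by `key` positions; keep everything else."""
    result = []
    for ch in text:
        if ch in string.ascii_letters:
            shifted = ALPHABET[(ALPHABET.index(ch.upper()) - key) % 26]
            result.append(shifted if ch.isupper() else shifted.lower())
        else:
            result.append(ch)
    return "".join(result)

def brute_force(ciphertext):
    """Try all 26 keys and return (key, text) pairs that look like English."""
    candidates = []
    for key in range(26):
        text = caesar_shift(ciphertext, key)
        if looks_english(text):
            candidates.append((key, text))
    return candidates

# Encrypt with key 3 (a negative shift moves letters forward), then attack.
ciphertext = caesar_shift("This is my secret message.", -3)
candidates = brute_force(ciphertext)
for key, text in candidates:
    print(key, text)
```

With this tiny word set, only key 3 survives the filter. With a real dictionary, more than one candidate may pass, so the result is best treated as a shortlist for manual review.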