Alright, let me tell you about this thing I was messing around with – ‘eng vs ire’. Basically, I was trying to see if I could get some kind of language detection going, real simple stuff.

First off, I started gathering some data. I figured, gotta have stuff to test against, right? So, I went scavenging around for some English and Irish text. I wasn’t aiming for anything fancy, just a decent chunk of words from each language.
Then, the coding part. I decided to keep it super basic and used Python. I know, I know, real cutting-edge stuff. But hey, it’s quick and easy, and I just wanted to see if the idea had any legs.
Here’s what I did:
- I cleaned up the text – you know, lowercased everything, got rid of punctuation. Just the usual.
- I figured I’d count the frequency of letters in each language sample. So, how many ‘a’s, ‘b’s, ‘c’s, and so on.
- I built a simple “profile” for each language based on these letter frequencies. (There’s a rough sketch of all this right after the list.)
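Here’s roughly what that pipeline looked like. Names like `build_profile` are made up for this sketch (the real script was scrappier), but the logic is the same:

```python
from collections import Counter

def build_profile(text):
    """Turn a chunk of text into a letter-frequency profile."""
    # Lowercase and keep only the letters; .isalpha() also keeps
    # Irish accented vowels like á and é, which I wanted counted
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    if not total:
        return {}
    # Relative frequencies, so samples of different sizes compare fairly
    return {ch: n / total for ch, n in counts.items()}
```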
Okay, so now I had these language profiles. Time to test it out. I’d feed it a snippet of text, do the same letter frequency counting thing, and compare it to the English and Irish profiles.
The comparison? I just used a simple distance metric – like, how far apart are the frequencies? The closer the distance, the more likely it was that language, or so I thought.
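And here’s the comparison step. I’m showing Euclidean distance because it’s the simplest thing that fits “how far apart are the frequencies”; treat that as an assumption, since cosine or summed absolute differences would slot in exactly the same way. `guess_language` reuses `build_profile` from the sketch above:

```python
import math

def distance(p, q):
    """Euclidean distance between two letter-frequency profiles.
    (Euclidean is my stand-in for a 'simple distance metric';
    any per-letter difference measure works the same way.)"""
    letters = set(p) | set(q)
    return math.sqrt(sum((p.get(ch, 0.0) - q.get(ch, 0.0)) ** 2
                         for ch in letters))

def guess_language(snippet, profiles):
    """Pick whichever profile the snippet sits closest to.
    `profiles` is a dict like {"eng": ..., "ire": ...}, each value
    built with build_profile() from the earlier sketch."""
    snippet_profile = build_profile(snippet)
    return min(profiles, key=lambda lang: distance(snippet_profile, profiles[lang]))

# e.g. guess_language("Dia dhuit, conas atá tú?", profiles)
# should come back "ire" if the profiles are any good
```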

Did it work? Sort of. For really obvious cases – like, a whole paragraph in English versus a whole paragraph in Irish – yeah, it usually got it right. But when you threw in shorter sentences, or sentences full of similar-looking words, it got wonky, which makes sense in hindsight: a sentence or two just doesn’t contain enough letters for the frequency counts to settle anywhere near the language’s true distribution.
The big problem was the data, I reckon. I didn’t have nearly enough to build solid profiles. Plus, Irish is written in basically the same Latin alphabet as English (just with accented vowels like á, é, í, ó, ú thrown in), so the two frequency profiles overlap a lot and simple letter counting wasn’t cutting it on its own.
What did I learn? Well, language detection is trickier than I thought! And I definitely need more data if I want to make this thing even remotely accurate. Might try using n-grams next time, see if that helps. It was a fun little experiment, though.
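In case I do get around to it, the n-gram version would barely change the code: count sliding windows of n characters instead of single letters, then reuse the same distance comparison. Totally untested sketch, just the direction I’d go:

```python
from collections import Counter

def ngram_profile(text, n=2):
    """Like build_profile, but over character n-grams (pairs by default).
    Hypothetical next step; I haven't run this on the eng/ire data."""
    # Keep spaces this time, since word boundaries are part of the
    # signal that n-grams pick up on
    cleaned = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    grams = Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()} if total else {}
```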