Programming/ Mathematics Challenge - Programming

Programming/ Mathematics Challenge by Maximip(m): 1:10pm On Nov 18, 2011

I noticed that when we hear a Nigerian name we can easily tell where the person is from, e.g. Emeka (South East), Bayo (South West), etc. Sometimes it's because we've heard the name before and learnt that it's from a particular region; at other times, we can still tell on hearing the name for the first time.

Can you write a computer program to do the same: take a name as input and output where the person is from? I think it'll be an interesting problem.

Re: Programming/ Mathematics Challenge by Chimanet(m): 4:49pm On Nov 18, 2011

@OP abeg, no comment. We've got programmers in Naija.

Re: Programming/ Mathematics Challenge by Beaf: 7:34am On Nov 19, 2011

Maximip:

I noticed that when we hear a Nigerian name we can easily tell where the person is from, e.g. Emeka (South East), Bayo (South West), etc. Sometimes it's because we've heard the name before and learnt that it's from a particular region; at other times, we can still tell on hearing the name for the first time.

Can you write a computer program to do the same: take a name as input and output where the person is from? I think it'll be an interesting problem.

It shouldn't be too difficult. At the most basic, we would need a few weighted rules about use of vowels and consonants e.g. a name like "Wumi" would be most likely Yoruba, because of the combination of w and u, while "Walid" would be most likely Arabic, because of the combination of w and a, as well as i and d.

If we could add an algorithm like Momel to read the pitch of the vowels, then we would have added intonation to our arsenal! We could weight the pitch of vowels by language according to whether they tend downwards or upwards. For example, if we are presented with a word with 3 straight dips in intonation, e.g. "Ikebe," "Oghene," "Urhobo," "Isoko," we could guess the language is Urhobo and not be far wrong.

A little AI will sort it out.

Re: Programming/ Mathematics Challenge by ektbear: 3:18am On Nov 21, 2011

I don't think this is an easy problem, sounds like a legitimate research problem.

If someone provides you with a database of already labeled words, (e.g., Emeka=>SE, Bayo=>SW, etc), and the database is fairly large you can probably use some statistical classification techniques to get decent results (on classifying unlabeled words that you must categorize.)

But if you don't have labeled data, then it becomes hard. You have to generate the labels yourself. Which I guess would involve either:
1) Hiring someone to generate the labels for you
2) Trying to figure out the labels yourself by extracting useful features and doing some sort of clustering. The hope is that the features you extract create a large amount of space between words from different categories. Perhaps intonation is a useful feature for separating words, as Beaf suggests.

But yeah, by no means trivial or something already in a textbook somewhere.

Re: Programming/ Mathematics Challenge by ektbear: 3:39am On Nov 21, 2011

Thinking about it further, I guess maybe it isn't as hard as I'm making it seem.

You could try approach #3: just take digitized language-specific media (dictionaries, newspapers in the target languages), and use it to create a pretty huge database.

At that point classification might be fairly easy, I'm guessing.

So basically you have this huge corpus of documents, and you know which language source each one came from. You label words by the source (e.g., Yoruba, Edo, etc.). For words that appear in multiple sources, I guess you apply some simple rule to classify them. Tunde might appear in both Igbo-language media and Yoruba-language media, but presumably it will appear more frequently in the latter than the former, so it is correct to classify it as a Yoruba word.

This will give you a huge database of words already labelled. Then you could probably come up with some pretty effective simple techniques that work well to classify words that aren't in your database.

And of course, if a word already exists in your database, then you'll be able to classify it. For example, if your corpus of Igbo language documents has the word "Emeka" in it, then you've probably classified this word previously as Igbo. So easy if someone asks you to classify it again (since you already have a label for this word.)
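A rough sketch of that labelling step, in Python. The `corpora` dictionary and its tiny word lists are made up for illustration; a real version would read words out of digitized dictionaries or newspapers. The "simple rule" here is assumed to be "the source that uses the word most often wins".

```python
from collections import Counter, defaultdict

def build_label_database(corpora):
    """Map each word to the language whose corpus uses it most often.

    `corpora` is a hypothetical dict: language name -> list of words
    taken from media in that language.
    """
    counts = defaultdict(Counter)  # word -> Counter(language -> occurrences)
    for language, words in corpora.items():
        for word in words:
            counts[word.lower()][language] += 1
    # Simple rule for words seen in several sources: most frequent source wins.
    return {word: by_lang.most_common(1)[0][0] for word, by_lang in counts.items()}

# Toy data, invented for this example only.
corpora = {
    "Yoruba": ["tunde", "bayo", "tunde", "wumi", "tunde"],
    "Igbo": ["emeka", "tunde", "chidi", "emeka"],
}
labels = build_label_database(corpora)
print(labels["tunde"])  # Yoruba (3 occurrences versus 1)
print(labels["emeka"])  # Igbo
```

Looking up a word that is already in `labels` is then a plain dictionary access; words that are not in it need a classifier built on top of the database.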

Re: Programming/ Mathematics Challenge by Maximip(m): 5:57am On Nov 21, 2011

Beaf:

we would need a few weighted rules about use of vowels and consonants e

Yeah, I think it'll be a complex set of rules because the combination of 2 or 3 letters could be common between both regions

ekt_bear:

2) Trying to figure out the labels yourself by extracting useful features and doing some sort of clustering. The hope is that the features you extract create a large amount of space between words from different categories. Perhaps intonation is a useful feature for separating words, as Beaf suggests.

I think this will be the ideal solution although I don't think it'll be trivial. I guess the program should be able to take a bunch of names and create a set of rules to help it distinguish it from another set.

Re: Programming/ Mathematics Challenge by ektbear: 6:11am On Nov 21, 2011

Haha no joke.

Unsupervised learning is really, really hard.

This is why if it were me I'd try approach #3 first.

Re: Programming/ Mathematics Challenge by Beaf: 7:11am On Nov 21, 2011

Maximip:

Yeah, I think it'll be a complex set of rules because the combination of 2 or 3 letters could be common between both regions

A combination of 2 or 3 letters is much more uncommon than you think. Take a simple name like Emeka, if you had never heard it before, you would intuitively know that it sounds Igbo. Why? It is because of the combination of vowels and consonants in a way that is peculiar to Igbo language.

If you had to draw up a simple programme to classify the single word Emeka as either Igbo or Yoruba, you would be best served by a programme that applies logic similar to the following oversimplified example:

Allot the accompanying score if true.
Score = 0

Igbo words frequently start with the letter, e. Score += 1
Yoruba words infrequently start with the letter, e. Score -= 1

Igbo words frequently have the consonant-vowel combination, ka. Score += 1
Yoruba words infrequently have the consonant-vowel combination, ka. Score -= 1

Igbo words frequently have the consonant-vowel combination, *ka (where * is any letter). Score += 3
Yoruba words very rarely have the consonant-vowel combination, *ka (where * is any letter). Score -= 3

Igbo words frequently have the consonant-vowel combination, eme. Score += 5
Yoruba words do not have the consonant-vowel combination, eme. Score -= 5

The more positive the score, the more likely it is Igbo. The same could be done for all Nigerian languages.
About where to get the data? All languages have dictionaries, all we need do is run statistics for frequently used combinations through them and the job is 90% done.
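The toy rules above could be sketched in Python roughly as follows. This is only one reading of them: each Igbo/Yoruba pair is collapsed into a single signed weight per pattern (positive leans Igbo, negative leans Yoruba), and the "wu" rule is a made-up Yoruba-leaning rule added so that the score can go negative. The names `RULES`, `igbo_vs_yoruba_score` and `guess` are invented for the example.

```python
import re

# Signed weights: positive favours Igbo, negative favours Yoruba.
RULES = [
    (r"^e", 1),    # starts with "e"
    (r"ka", 1),    # contains "ka"
    (r".ka", 3),   # "ka" preceded by any letter (the "*ka" rule)
    (r"eme", 5),   # contains "eme"
    (r"wu", -3),   # hypothetical Yoruba-leaning rule, not from the list above
]

def igbo_vs_yoruba_score(name):
    """Sum the weights of every rule whose pattern occurs in the name."""
    lowered = name.lower()
    return sum(weight for pattern, weight in RULES if re.search(pattern, lowered))

def guess(name):
    """Turn the signed score into a label; zero means no rule fired either way."""
    score = igbo_vs_yoruba_score(name)
    if score > 0:
        return "Igbo"
    if score < 0:
        return "Yoruba"
    return "unknown"

print(igbo_vs_yoruba_score("Emeka"), guess("Emeka"))  # 10 Igbo
print(igbo_vs_yoruba_score("Wumi"), guess("Wumi"))    # -3 Yoruba
```

With hand-written weights like these the hard part is choosing the numbers, which is where the dictionary statistics would come in.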

Re: Programming/ Mathematics Challenge by Beaf: 7:27am On Nov 21, 2011

It just struck me that Google and Bing do language recognition, so I googled around and came up with this. The bolded part is basically the same solution I thought up:

Statistical Approaches

This can be done by comparing the compressibility of the text to the compressibility of texts in the known languages. This approach is known as mutual information based distance measure [1]. The same techniques can also be used to empirically construct family trees of languages which closely correspond to the trees constructed using historical methods.

Another technique, as described by Dunning (1994) is to create a language n-gram model from a "training text" for each of the languages. Then, for any piece of text needing to be identified, a similar model is made, and that model is compared to each stored language model. The language model which is most similar to the model from the piece of text is the most likely language. This approach is problematic when the input text is in a language there is no model for. In this case, the method returns a random, "most similar" language as its result. Another problem are pieces of input text that are composed of several languages, as is common on the Web. For a more recent method, see Řehůřek and Kolkus (2009).

http://en.wikipedia.org/wiki/Language_identification
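For the compressibility idea, a very rough sketch using Python's standard `zlib` module: measure how many extra compressed bytes a piece of text costs when it is appended to a reference text in each language, and pick the cheapest. The function names and the reference strings are invented for illustration, and on inputs as short as a single name the byte differences are tiny, so this is probably only useful with large reference texts.

```python
import zlib

def extra_bytes(reference, text):
    """Compressed size added by appending `text` to `reference`."""
    base = len(zlib.compress(reference.encode("utf-8")))
    combined = len(zlib.compress((reference + text).encode("utf-8")))
    return combined - base

def closest_language(references, text):
    """Pick the language whose reference text makes `text` cheapest to encode."""
    return min(references, key=lambda lang: extra_bytes(references[lang], text))

# Toy reference texts, invented for this example; real ones would be much longer.
references = {
    "Igbo": "emeka nneka chidi obinna chukwuemeka nnamdi ",
    "Yoruba": "bayo tunde wumi adebayo olumide kunle ",
}
print(closest_language(references, "emeka"))
```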

More detail about n-grams here: http://en.wikipedia.org/wiki/N-gram
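A minimal sketch of the n-gram idea in Python, assuming character trigrams and cosine similarity as the way of comparing models (the quoted text does not say which comparison to use). The training names are a tiny invented sample, and `^` and `$` are just markers for the start and end of a word.

```python
from collections import Counter
import math

def ngrams(word, n=3):
    """Character n-grams of a word, padded with start/end markers."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def profile(words, n=3):
    """N-gram counts over a list of words: the 'model' for one language."""
    counts = Counter()
    for word in words:
        counts.update(ngrams(word, n))
    return counts

def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def identify(name, models, n=3):
    """Return the language whose model is most similar to the name's model."""
    query = profile([name], n)
    return max(models, key=lambda lang: cosine(query, models[lang]))

# Toy training text, invented for this example.
models = {
    "Igbo": profile(["Emeka", "Chidi", "Nneka", "Obinna", "Chukwuemeka"]),
    "Yoruba": profile(["Bayo", "Tunde", "Wumi", "Adebayo", "Olumide"]),
}
print(identify("Emenike", models))  # Igbo
print(identify("Adewumi", models))  # Yoruba
```

As the quoted text warns, a name that shares no n-grams with any model still gets an answer (here, whichever language comes first), so a real version would want a "don't know" threshold.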

I'm bloody good!

Re: Programming/ Mathematics Challenge by Maximip(m): 11:00am On Nov 21, 2011

ekt_bear:

This is why if it were me I'd try approach #3 first.

That approach will get stuck when it sees a new name.

Beaf:

A combination of 2 or 3 letters is much more uncommon than you think. Take a simple name like Emeka, if you had never heard it before, you would intuitively know that it sounds Igbo. Why? It is because of the combination of vowels and consonants in a way that is peculiar to Igbo language.

If you had to draw up a simple programme to classify the single word Emeka as either Igbo or Yoruba, you would be best served by a programme that applies logic similar to the following oversimplified example:

Allot the accompanying score if true.
Score = 0

Igbo words frequently start with the letter, e. Score += 1
Yoruba words infrequently start with the letter, e. Score -= 1

Igbo words frequently have the consonant-vowel combination, ka. Score += 1
Yoruba words infrequently have the consonant-vowel combination, ka. Score -= 1

Igbo words frequently have the consonant-vowel combination, *ka (where * is any letter). Score += 3
Yoruba words very rarely have the consonant-vowel combination, *ka (where * is any letter). Score -= 3

Igbo words frequently have the consonant-vowel combination, eme. Score += 5
Yoruba words do not have the consonant-vowel combination, eme. Score -= 5

The more positive the score, the more likely it is Igbo. The same could be done for all Nigerian languages.
About where to get the data? All languages have dictionaries, all we need do is run statistics for frequently used combinations through them and the job is 90% done.

I thought this was impressive even before seeing the n-gram post.

The bulk of the work would be in creating a system that'll generate those rules (the model) on its own after receiving a bunch of words (the training text) from a particular language. Advanced AI!

Re: Programming/ Mathematics Challenge by Beaf: 11:12am On Nov 21, 2011

Maximip:

I thought this was impressive even before seeing the n-gram post.

Thanks, bruv. You know how to make a black brother blush.

Maximip:

The bulk of the work would be in creating a system that'll generate those rules (the model) on its own after receiving a bunch of words (the training text) from a particular language. Advanced AI!

Yes, you are right indeed. The system could generate its own rules.
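One hedged way to picture "generating its own rules": count letter pairs in two training lists and turn the counts into signed weights, so that pairs common in one list score positive and pairs common in the other score negative. The sketch below assumes add-one smoothing and a log ratio as the weight, which is just one reasonable choice; the function names and the word lists are invented for the example.

```python
from collections import Counter
import math

def bigrams(word):
    """Adjacent letter pairs of a word, e.g. 'emeka' -> em, me, ek, ka."""
    w = word.lower()
    return [w[i:i + 2] for i in range(len(w) - 1)]

def learn_weights(positive_words, negative_words):
    """Learn one signed weight per bigram from two training lists.

    Weight = log of (smoothed frequency in the positive list) minus
    log of (smoothed frequency in the negative list).
    """
    pos = Counter(b for w in positive_words for b in bigrams(w))
    neg = Counter(b for w in negative_words for b in bigrams(w))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)  # add-one smoothing
    neg_total = sum(neg.values()) + len(vocab)
    weights = {
        b: math.log((pos[b] + 1) / pos_total) - math.log((neg[b] + 1) / neg_total)
        for b in vocab
    }
    # Weight used for bigrams never seen in either training list.
    default = math.log(1 / pos_total) - math.log(1 / neg_total)
    return weights, default

def score(name, weights, default):
    """Sum the learned weights of the name's bigrams; positive leans Igbo here."""
    return sum(weights.get(b, default) for b in bigrams(name))

# Toy training text, invented for this example.
igbo_names = ["Emeka", "Nneka", "Chidi", "Obinna"]
yoruba_names = ["Bayo", "Tunde", "Wumi", "Kunle"]
weights, default = learn_weights(igbo_names, yoruba_names)

print(score("Nnamdi", weights, default) > 0)     # True: leans Igbo
print(score("Babatunde", weights, default) < 0)  # True: leans Yoruba
```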

Re: Programming/ Mathematics Challenge by ektbear: 11:19am On Nov 21, 2011

Maximip:

That approach will get stuck when it sees a new name.

No, it does not. Building the database is only the first step, and isn't even really the hardest (at least conceptually.)

The point is to use your database of labeled words to build a classifier that can classify unseen words. That was the point of the following:

ekt_bear:

Then you could probably come up with some pretty effective simple techniques that work well to classify words that aren't in your database.

Basically, when you do approach 3, there are usually three steps:
1) Build a labeled database of words.
2) Turn words into features (so some function that maps words to, say, vectors. A simple one is just a histogram of letter frequencies. Of course there are tons and tons of things you can do.)
3) Use some classification algorithm to build a classifier (e.g., stock techniques like nearest neighbors or naive Bayes).

The output of step 3 produces something that can classify any word. Of course, there are some issues in step 2 in finding good features. So you could try a few things and see how well they work.
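The three steps can be sketched in a few lines of Python. This assumes the simplest choices mentioned above: a letter-frequency histogram as the feature and a 1-nearest-neighbor rule as the classifier. The `database` contents are a toy, made-up sample using the SE/SW labels from earlier in the thread, and `math.dist` needs Python 3.8 or later.

```python
from collections import Counter
import math
import string

def letter_histogram(word):
    """Step 2: map a word to a 26-dimensional vector of relative letter frequencies."""
    letters = [c for c in word.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters) or 1  # avoid dividing by zero on empty input
    return [counts[c] / total for c in string.ascii_lowercase]

def nearest_neighbor(word, database):
    """Step 3: return the label of the closest labeled word (1-nearest neighbor)."""
    query = letter_histogram(word)
    best = min(database, key=lambda known: math.dist(query, letter_histogram(known)))
    return database[best]

# Step 1: a labeled database (toy data, invented for this example).
database = {
    "emeka": "SE", "nneka": "SE", "chidi": "SE",
    "bayo": "SW", "tunde": "SW", "wumi": "SW",
}

print(nearest_neighbor("Emeke", database))  # SE (closest to "emeka")
print(nearest_neighbor("Tunji", database))  # SW (closest to "tunde")
```

Letter histograms throw away letter order, so swapping in n-gram features for step 2 is the obvious next thing to try.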
