A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 - Programming

Welcome, Guest: Register On Nairaland / LOGIN! / Trending / Recent / New
Stats: 3,150,648 members, 7,809,446 topics. Date: Friday, 26 April 2024 at 09:46 AM

A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 - Programming - Nairaland

Nairaland Forum / Science/Technology / Programming / A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 (2835 Views)

For Beginners: Learn How To Create A Simple Android Native App / How To Make A Simple Calculator In Notepad Using .bat Format / Ludo Game Algorithm Wanted For AI Project (2) (3) (4)

(1) (Reply) (Go Down)

A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 2:46pm On Jan 03, 2013

Sometimes you come across a problem where you have to parse a sentence or some group of words and you want to have tolerance for user mistakes
A problem like this requires you to find how close a word is to another... e.g if "reinteprete" is similar to the correct spelling, "reinterpret" and by what percentage...

Does this problem exist?? if so, follow me lets figure this out
I assume you can implement many simple algorithm
The algorithm is in two phases... of which I developed the latter phase.
PHASE ONE: Word similarity
This phase is very simple and based on the simple concept of "bigram"(check your dictionary) ...originally by $$$$$$...
EXPLANATION:
SPLIT THE WORD INTO A LIST OF ITS BIGRAMS
bigrams carries a little bit of planar properties of Words. eg. "naira" and "niara", though same characters but different ordering... bigrams solves this. because the bigrams of NAIRA is NA, AI, IR, RA ... while NIARA is NI, IA, AR, RA.
.............the only bigram NAIRA has in NIARA is RA... and same as NIARA in NAIRA.
++++++Let A and B represent NAIRA and NIARA respectively... and the percentage similarity is now:
=(sb / t * 100)

where..
sb = (bigram of A found in B + bigram of B found in A) = (1 + 1);
t = (total count of bigrams in A and B) = 8;

similarity is now (2 / 8 * 100) = 25%..... Sweet!!!!!!

try this for "naira" and "nairra" ... the answer should be 89%....
...this means the user must have made mistake of adding extra "r"..(planar property!!!);

in conclusion, "naira, nairra" is similar than "naira, niara".
There is a consistency and planar property with this simple approach.

More benchmarks...
"internalization" is 63% similar to "interdenominationalism"

What I would like this community do is to implement this in "PHP, ASP, Python, Java, C++".
//Implement it in the Language you can.

[size=14pt]I Use This Opportunity to Humbly request Nairaland to integrate source code snippet Editor and Viewer into this section!!!(I believe there are many free ones out there)[/size]
*******************************************************************************************************
//Here is a simple bigram implementation I did in Python. works with 2.5 and 3.2.. due to Nairaland's input parsing, ignore asteriks(*) and replace the braces {} with []...
def returnbigrams(STR):
****"""returns a list of bigrams from str(STR)"""
****if(not isinstance(STR, str)):
********return {}
****return {STR{n:n+2} for n in range(len(STR) - 1)}

THe image attached is a snippet from my C++ implementation....
*************************************************************************************************************************************
I prototype my most of algorithm in Python... So I will post a Python implementation of PHASE ONE solution soon...(someone should recommend me code-pasting a site).
NOTE: I have implemented everything I need based on the post... am currently writing a small C++ library... so :-) .. I just wanted to share a little knowledge with those that care to learn this. its pretty basic, simple and and fun to learn... Lets go!
*************************************************************************************************************************************

All Python programmers should give a shout out!!! cause no other language can implement bigram under 4(achievable) lines of codes!! :-D.

We will deal with the second phase... later :: :-)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Timothy.

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by ciphoenix: 4:20pm On Jan 03, 2013

.Duplicate

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by ciphoenix: 4:20pm On Jan 03, 2013

Try IDEONE.com

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 4:58pm On Jan 03, 2013

ciphoenix: Try IDEONE.com

thanks a lot! Seems cool! Am on mobile now. will post it there as soon as I buy a new data bundle for my PC. It just got finished. :-(

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by ciphoenix: 6:46pm On Jan 03, 2013

WhiZTiM:
thanks a lot! Seems cool! Am on mobile now. will post it there as soon as I buy a new data bundle for my PC. It just got finished. :-(

you're welcome.

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 1:23pm On Jan 04, 2013

lol @ shoutout ...
interesting AI logic...am following this grin

btw i made an attempt


def returnBigrams(STR):
    return [STR[n:n+2] for n in range(len(STR) - 1)]


str1 =returnBigrams( "naira" )
str2 =returnBigrams( "niara" )
sb = 0
t = len(str1) + len(str2)

for n in str1:
    for m in str2:
        if n == m:
            sb+=2.0
            
s=(sb/t)*100
print s

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 10:13pm On Jan 04, 2013

turned it into a function.

₱®ÌИСΞ:
lol @ shoutout ...
interesting AI logic...am following this grin

btw i made an attempt


def returnBigrams(STR):
    return [STR[n:n+2] for n in range(len(STR) - 1)]

def similarity(str1, str2):
    str1 =returnBigrams(str1)
    str2 =returnBigrams(str2)
    sb = 0
    t = len(str1) + len(str2)

    for n in str1:
        for m in str2:
            if n == m:
                sb+=2.0
            
    s=(sb/t)*100
    return s

nice one there!! Exactly how I implemented mine except for one thing that makes yours break under some conditions... Have you tested this code??

Try it with, *naira* and *naira*.
Try it with, *church* and *church*. ...
A glitch right? Can you think of the glitch??.

Will post mine soon...... I still hvnt bought internet data on PC yet. Currently using my Phone's Operamini...

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 11:40pm On Jan 04, 2013

yea its giving 140% ... i think its because of the repeated "ch"
but isnt that a good thing? u know ... in matching stuffs ... the higher match is the likely chosen
say for example i matched "church" and "churce" nd it gave me 100% but with "church" it gave me 140% so d latter is chosen
maybe im wrong sha...would like to see ur code tho

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 12:47am On Jan 05, 2013

₱®ÌИСΞ:
yea its giving 140% ... i think its because of the repeated "ch"
but isnt that a good thing? u know ... in matching stuffs ... the higher match is the likely chosen
say for example i matched "church" and "churce" nd it gave me 100% but with "church" it gave me 140% so d latter is chosen
maybe im wrong sha...would like to see ur code tho

add a simple break statement. I mean, break the inner loop on first match. That was the glitch the orignal author didn't mention in his paper... http://www.cis.upenn.edu/~pereira/papers/sim-mlj.pdf

Anyway, you do not have to spend your time reading the paper, I will dish out the necessary thing here + some working codes in Python.

Back to the topic, I found out to keep the planer property. You gotta make assumption that 1st match is what you need. And remove it from the inner loop! I mean, add this code inside the sb += 2 block. ...i.e
...


    if n == m:
        sb += 2.0
        str2.remove(m)
        break

....
Works?? ... : - )

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 2:06am On Jan 05, 2013

it worked yea... cool

...

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 1:38am On Jan 06, 2013

aiite... Should get my PC on internet soon. Before then, how do you think we can apply this to short sentences... e.g "taye is happy" and "taiwo is happier"??
... Jst a little drill, however simple or complex your idea is may help everyone...

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 2:03am On Jan 06, 2013

well i think the sentences should be broken down to arrays of words, and the similarity of each index checked against the other, then number of words and number of letters in each sentence would be added to the mix grin

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 11:16am On Jan 06, 2013

What isp do u use?

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 2:57pm On Jan 06, 2013

₱®ÌИСΞ:
What isp do u use?

MTN.
I use both MTN-Hotspot WiFi located where I live and their mobile 3.5G network.

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 5:43pm On Jan 07, 2013

Ohhkayy... to continue this...
I have uploaded a sample Python code that covers the entirety of this first part....
check it out here...
http://ideone.com/RokLXI

....now the next thing is to explain the algorithm to newbies...
1. get the two words to compare and split the words into bigrams. e.g WhiZTiM turns to Wh, hi, iZ, ZT, Ti, iM

2. read the first post in this thread... :-)

Questions are ever welcomed... !

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 5:46pm On Jan 07, 2013

the delimiter function there is for use in the second part... that is when its now sentences...
--------------------Thinking what am thing....

...Natural Language Processing! yeah... thats right... oops, this is just a very very basic

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 11:51pm On Jan 07, 2013

true that..

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 7:07pm On Jan 09, 2013

@whiZTiM where u @ na

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by spikesC(m): 7:16pm On Jan 09, 2013

pls oooooooooooo, dnt be discouraged. Am following this o

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 12:40pm On Jan 10, 2013

Yes sirs!! ....I have been kinda busy lately...
Ohkkay... Spikes C...

... now that we have got this phase ready... we look at the next part that caters for sentences of few words...
this next phase is only reasonable for sentences of less than 5 words.
for example...
____-- we are writing a music library software.. and in the specs, we needed to be able to group Albums together... now, we know that some people may have tracks of the same album... but the "Album" ID3v2 tag may vary spelling wise...
secondly few statements like ... "Getting higher on rank" and "Getting on higher rank" are kinda similar... how do we measure its similarity>> ?

heres my idea...
1. Split up the string into two lists of individual Words, discarding away all kinds of delimiters like [,:;. -=] ...etc
2. have the two word_lists in a single nested loops.. i.e a for inside a for
3. if the list do not have the same length, maintain the word_list with the higher number words in the outer loop and the shorter in the inner. eg. list1:"xx yy zz", list2:"xx zz"
5. create an empty list, list1_ratios, outside the main loop and another, list2_ratios, inside the main loop.
4. iterate over... using getting the bigram similarity of every word in list1 with list2.... I mean... ->
5. if list1 = "Getting", "higher", "on", "rank", "oh" and list2 = "Getting", "on", "higher", "rank".
---then pick "Getting" in list1
store the percentage similarity of list1's "Getting" with list2's "Getting", "on", "higher" and "rank"... in list2_ratios
...you then have a list of 4 floating point numbers... before you exit the inner loop... get the largest item of list2_ratios and append to list1_ratios
6. do the same... until... all iteration is done...

now list1_ratios has 5 floating point numbers...
what do you think we can do next

?

PLS: I know my explanation is a bit crappy, pls ask for clarifications if you do not understand.

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by lordZOUGA(m): 8:16pm On Jan 10, 2013

nice algorithm

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 3:50am On Jan 15, 2013

@WhiZTiM by adding


sent1= "Getting higher on rank oh"
sent2= "Getting on higher rank"

list1 =  split_on_delimeters(sent1)
list2 =  split_on_delimeters(sent2)
list1_ratios=[]

if len(list1) > len(list2):
    for x in list1:
        list2_ratios=[]
        for y in list2:
            list2_ratios.append(word_similarity(x,y))
        list1_ratios.append(max(list2_ratios))
    print list1_ratios
else:
    for x in list2:
        list2_ratios=[]
        for y in list1:
            list2_ratios.append(word_similarity(x,y))
        list1_ratios.append(max(list2_ratios))
    print list1_ratios

according to your instructions...the output of list1_ratios is

[100.0, 100.0, 100.0, 100.0, 0.0]

as seen @ http://ideone.com/rn9J1H

so i guess we could do something like (2*(sum of values on list1_ratios)) / (len(list1) + len(list2))

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 1:02pm On Jan 15, 2013

lordZOUGA: nice algorithm

Thanks man,,,,,

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 1:29pm On Jan 15, 2013

@₱®ÌИСΞ

Yeah... That's great... Impressive! You are totally on track....

The next thing now is umm to use a basic statistical knowledge... That is...
After having gotten the similarity list we are looking for...
We can see that it has the same number of elements as its wordlist...

Ths last concept has to do with "weights".
Each word constitutes a weight in the original sentence.
And the sum of those weights gives us the total weight of the sentence(excluding all delimiters and symbols).

This weight is simply based on the number of characters of a word relative to its word list.
For example. The weight of 'school' in the sentence list "he goes to school everyday"
is ... len(school) / sum(word_list).

So what next.? Calculate the weights of the larger(if applicable) list.
..... The similarity list we have should be multiplied by its matching weight.
And.... Sum them up!!!!!!!'

Example:
If
list1 = [ab, aba, baba, babu]
List2 = [baba, ab, aba]
....then.
similarity_list = [100.0, 100.0, 100.0, 0.0]
weight_list = [0.154, 0.231, 0.308
, 0.0]
now weighted_similarity array is = [15.4, 23.1, 30.8, 0.0]...
Now similarity = sum(weighted_similarity). =
69.3

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 1:00pm On Jan 16, 2013

@WhiZTiM thanks boss
i made modification to the code to output the required values needed as seen here http://ideone.com/qZyAvx

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 12:45am On Jan 19, 2013

@WhiZTiM.....sup with d continuation

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 6:24pm On Jan 21, 2013

₱®ÌИСΞ:
@WhiZTiM.....sup with d continuation

I have been a bit ill... Secondly school work... Wi
ll try updating this this week.

This pretty much of the first part. And I appreciate you and everyone that followed. I will end this part with with the introduction of lexical parsing and hash map lookups.
Where we can determine if "he is intelligent" is same as "he is brilliant".

Currently, will lookup my C++ code, and try to write python implementation.

Part2 will be about, simple grammer systems. Can you train a system to be corrected so that it learns the best choice of word to use

. (mostly statistics, of which I never liked)

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 8:10pm On Jan 21, 2013

oh ok...sori bro
never liked stats too tho... undecided

all d same

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 1:33pm On Feb 01, 2013

I think... I should be getting ready with the next part... Probably before then, I would post a Quiz in the programming section... An interesting one on parsing.

Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 2:09pm On Feb 01, 2013

kul...

(1) (Reply)

I Need An Android App Developer. / Computer Science Algorithm Problem / Battling Google, Microsoft Changes How It Builds Software

(Go Up)

Sections: politics (1) business autos (1) jobs (1) career education (1) romance computers phones travel sports fashion health
religion celebs tv-movies music-radio literature webmasters programming techmarket

Links: (1) (2) (3) (4) (5) (6) (7) (8) (9) (10)

Nairaland - Copyright © 2005 - 2024 Oluwaseun Osewa. All rights reserved. See How To Advertise. 47
Disclaimer: Every Nairaland member is solely responsible for anything that he/she posts or uploads on Nairaland.