Nairaland Forum / Science/Technology / Programming / A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 (2835 Views)
A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 2:46pm On Jan 03, 2013 |
Sometimes you come across a problem where you have to parse a sentence or a group of words, and you want some tolerance for user mistakes. A problem like this requires you to measure how close one word is to another, e.g. whether "reinteprete" is similar to the correct spelling "reinterpret", and by what percentage. Does this problem sound familiar? If so, follow me and let's figure this out. I assume you can implement simple algorithms.

The algorithm is in two phases, of which I developed the latter.

PHASE ONE: Word similarity

This phase is very simple and is based on the concept of the "bigram" (check your dictionary) ...originally by $$$$$$...

EXPLANATION: SPLIT EACH WORD INTO A LIST OF ITS BIGRAMS

Bigrams carry a little of the planar properties of words. For example, "naira" and "niara" have the same characters but in a different order, and bigrams capture that: the bigrams of NAIRA are NA, AI, IR, RA, while those of NIARA are NI, IA, AR, RA. The only bigram of NAIRA found in NIARA is RA, and likewise the only bigram of NIARA found in NAIRA is RA.

Let A and B represent NAIRA and NIARA respectively. The percentage similarity is then

similarity = (sb / t) * 100

where
sb = (bigrams of A found in B) + (bigrams of B found in A) = 1 + 1 = 2
t = (total count of bigrams in A and B) = 4 + 4 = 8

so the similarity is (2 / 8) * 100 = 25%. Sweet!

Try this for "naira" and "nairra": the answer should be 89%. This means the user most likely made the mistake of adding an extra "r" (planar property!). In conclusion, "naira, nairra" is more similar than "naira, niara". This simple approach is consistent and preserves the planar property.

More benchmarks: "internalization" is 63% similar to "interdenominationalism".

What I would like this community to do is implement this in PHP, ASP, Python, Java, C++... whichever language you can.
[size=14pt]I use this opportunity to humbly request Nairaland to integrate a source code snippet editor and viewer into this section! (I believe there are many free ones out there.)[/size]

Here is a simple bigram implementation I did in Python. It works with 2.5 and 3.2.

    def returnbigrams(STR):
        """returns a list of bigrams from str(STR)"""
        # anything that is not a string has no bigrams
        if not isinstance(STR, str):
            return []
        # every pair of adjacent characters, left to right
        return [STR[n:n+2] for n in range(len(STR) - 1)]

The image attached is a snippet from my C++ implementation.

I prototype most of my algorithms in Python, so I will post a Python implementation of the PHASE ONE solution soon (someone should recommend a code-pasting site).

NOTE: I have implemented everything I need based on this post and am currently writing a small C++ library, so :-) I just wanted to share a little knowledge with those that care to learn. It's pretty basic, simple and fun to learn... Let's go!

All Python programmers should give a shout out, because no other language can implement bigrams in under 4 (achievable) lines of code!! :-D

We will deal with the second phase later. :-)
~Timothy.
|
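The phase-one formula above can be sketched in a few lines of Python. This is only an illustration, not the author's posted code: the names `bigrams` and `word_similarity` are made up here, and a `Counter` intersection is assumed as the way to pair each bigram with at most one partner. Written as 2 * shared / total, the measure has the same form as the Sørensen-Dice coefficient applied to bigrams.

```python
from collections import Counter


def bigrams(word):
    """Return the overlapping two-character slices of word, in order."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def word_similarity(a, b):
    """Percentage similarity: 100 * (2 * shared bigrams) / (total bigrams)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = len(grams_a) + len(grams_b)
    if total == 0:
        return 0.0  # assumption: words shorter than 2 characters score 0
    # Counter intersection keeps the smaller count of each bigram, so a
    # repeated bigram is only matched as often as it occurs in both words.
    shared = sum((Counter(grams_a) & Counter(grams_b)).values())
    return 100.0 * 2 * shared / total


print(word_similarity("naira", "niara"))          # 25.0
print(round(word_similarity("naira", "nairra")))  # 89
print(round(word_similarity("internalization", "interdenominationalism")))  # 63
```

The three printed values agree with the benchmarks quoted in the post.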
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by ciphoenix: 4:20pm On Jan 03, 2013 |
Try IDEONE.com |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 4:58pm On Jan 03, 2013 |
ciphoenix: Try IDEONE.com
Thanks a lot! Seems cool! I'm on mobile now. Will post it there as soon as I buy a new data bundle for my PC; the last one just finished. :-( |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by ciphoenix: 6:46pm On Jan 03, 2013 |
@WhiZTiM you're welcome. |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 1:23pm On Jan 04, 2013 |
lol @ shoutout ... interesting AI logic...am following this btw i made an attempt
|
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 10:13pm On Jan 04, 2013 |
turned it into a function.
₱®ÌИСΞ: Nice one there!! Exactly how I implemented mine, except for one thing that makes yours break under some conditions. Have you tested this code? Try it with *naira* and *naira*. Try it with *church* and *church*... A glitch, right? Can you think of the cause? Will post mine soon... I still haven't bought internet data for my PC. Currently using my phone's Opera Mini. |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 11:40pm On Jan 04, 2013 |
Yeah, it's giving 140%... I think it's because of the repeated "ch". But isn't that a good thing? You know, in matching stuff, the higher match is the likely choice. For example, I matched "church" and "churce" and it gave me 100%, but with "church" it gave me 140%, so the latter is chosen. Maybe I'm wrong sha... would like to see your code though. |
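To see where the 140% comes from, here is a hypothetical reconstruction of a naive nested-loop count. It is not PrinceNN's actual code, and the function names are made up for illustration. Every matching pair adds 2 to sb, so the two "ch" bigrams in "church" match each other four times instead of twice.

```python
def bigrams(word):
    """Return the overlapping two-character slices of word, in order."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def naive_similarity(a, b):
    """Count every matching pair of bigrams, with no break and no removal."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    sb = 0
    for x in grams_a:
        for y in grams_b:
            if x == y:
                sb += 2  # repeated bigrams are counted once per pairing
    return 100.0 * sb / (len(grams_a) + len(grams_b))


print(naive_similarity("church", "church"))  # 140.0
print(naive_similarity("church", "churce"))  # 100.0
```

Both numbers match the ones reported above: 7 pairings out of 10 bigrams give 140%, and the misspelling "churce" still scores 100% because the second "ch" of "church" is matched against the only "ch" of "churce" again.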
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 12:47am On Jan 05, 2013 |
₱®ÌИСΞ: Add a simple break statement. I mean, break the inner loop on the first match. That was the glitch the original author didn't mention in his paper... http://www.cis.upenn.edu/~pereira/papers/sim-mlj.pdf
Anyway, you do not have to spend your time reading the paper; I will dish out the necessary parts here, plus some working code in Python. Back to the topic: I found that to keep the planar property, you have to assume that the first match is the one you need, and remove it from the inner loop! I mean, add this code inside the sb += 2 block. ...i.e ...
.... Works?? ... : - ) |
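A minimal sketch of the fix described above, assuming the loop structure discussed in the thread (the exact snippet that was posted is not shown here, and the names are illustrative): on the first match, the matched bigram is removed from the inner list and the inner loop is broken.

```python
def bigrams(word):
    """Return the overlapping two-character slices of word, in order."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def similarity(a, b):
    """Bigram similarity where each bigram can be matched at most once."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = len(grams_a) + len(grams_b)
    if total == 0:
        return 0.0  # assumption: words shorter than 2 characters score 0
    remaining = list(grams_b)  # copy that is consumed as matches are found
    sb = 0
    for x in grams_a:
        for i, y in enumerate(remaining):
            if x == y:
                sb += 2
                del remaining[i]  # a matched bigram cannot match again
                break             # only the first match counts
    return 100.0 * sb / total


print(similarity("church", "church"))  # 100.0
print(similarity("church", "churce"))  # 80.0
```

With the removal in place, identical words score exactly 100% and "churce" drops to 80%, so the score no longer exceeds 100%.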
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 2:06am On Jan 05, 2013 |
it worked yea... ... |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 1:38am On Jan 06, 2013 |
aiite... Should get my PC on the internet soon. Before then, how do you think we can apply this to short sentences, e.g. "taye is happy" and "taiwo is happier"? Just a little drill; however simple or complex your idea is, it may help everyone. |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 2:03am On Jan 06, 2013 |
Well, I think the sentences should be broken down into arrays of words and the similarity of each index checked against the other; then the number of words and the number of letters in each sentence would be added to the mix. |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 11:16am On Jan 06, 2013 |
What isp do u use? |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 2:57pm On Jan 06, 2013 |
₱®ÌИСΞ:MTN. I use both MTN-Hotspot WiFi located where I live and their mobile 3.5G network. |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 5:43pm On Jan 07, 2013 |
Okay... to continue this: I have uploaded a sample Python code that covers the entirety of this first part. Check it out here: http://ideone.com/RokLXI
Now the next thing is to explain the algorithm to newbies:
1. Get the two words to compare and split them into bigrams, e.g. WhiZTiM turns into Wh, hi, iZ, ZT, Ti, iM.
2. Read the first post in this thread. :-)
Questions are always welcome! |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 5:46pm On Jan 07, 2013 |
The delimiter function there is for use in the second part, that is, when we are dealing with sentences. Thinking what I'm thinking... Natural Language Processing! Yeah, that's right... oops, this is just very, very basic. |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 11:51pm On Jan 07, 2013 |
true that.. |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 7:07pm On Jan 09, 2013 |
@whiZTiM where u @ na |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by spikesC(m): 7:16pm On Jan 09, 2013 |
pls oooooooooooo, dnt be discouraged. Am following this o |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 12:40pm On Jan 10, 2013 |
Yes sirs!! I have been kinda busy lately... Okay, Spikes C... now that we have this phase ready, we look at the next part, which caters for sentences of a few words. This next phase is only reasonable for sentences of fewer than 5 words.

For example, say we are writing music library software, and in the specs we need to be able to group albums together. We know that some people may have tracks from the same album, but the "Album" ID3v2 tag may vary spelling-wise. Secondly, a few statements like "Getting higher on rank" and "Getting on higher rank" are kind of similar. How do we measure their similarity?

Here's my idea:
1. Split each string into a list of individual words, discarding all kinds of delimiters like [,:;. -=] etc.
2. Put the two word lists in a single nested loop, i.e. a for inside a for.
3. If the lists do not have the same length, keep the word list with more words in the outer loop and the shorter one in the inner loop, e.g. list1: "xx yy zz", list2: "xx zz".
4. Create an empty list, list1_ratios, outside the main loop, and another, list2_ratios, inside the main loop (so it starts empty for each word of list1).
5. Iterate, getting the bigram similarity of every word in list1 with every word in list2. For example, if list1 = "Getting", "higher", "on", "rank", "oh" and list2 = "Getting", "on", "higher", "rank", then pick "Getting" in list1 and store the percentage similarity of list1's "Getting" with list2's "Getting", "on", "higher" and "rank" in list2_ratios. You then have a list of 4 floating point numbers. Once the inner loop finishes, get the largest item of list2_ratios and append it to list1_ratios.
6. Do the same until all iterations are done. Now list1_ratios has 5 floating point numbers.

What do you think we can do next?

PLS: I know my explanation is a bit crappy; please ask for clarification if you do not understand. |
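The steps above can be sketched as follows. This is a rough illustration under stated assumptions, not the code from the ideone links: `split_words` and `best_match_ratios` are made-up names, the delimiter set is the one listed in step 1 plus any whitespace, and the word similarity is the bigram measure from the first part, implemented here with a `Counter` intersection.

```python
import re
from collections import Counter


def bigrams(word):
    """Return the overlapping two-character slices of word, in order."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def word_similarity(a, b):
    """Bigram similarity in percent; each bigram is matched at most once."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = len(grams_a) + len(grams_b)
    if total == 0:
        return 0.0  # assumption: one-letter words score 0
    shared = sum((Counter(grams_a) & Counter(grams_b)).values())
    return 100.0 * 2 * shared / total


def split_words(sentence):
    """Step 1: split on , : ; . - = and whitespace, dropping empty pieces."""
    return [w for w in re.split(r"[,:;.\-=\s]+", sentence) if w]


def best_match_ratios(sentence1, sentence2):
    """Steps 2 to 6: for each word of the longer list, keep its best match."""
    list1, list2 = split_words(sentence1), split_words(sentence2)
    if len(list1) < len(list2):
        list1, list2 = list2, list1  # the longer list drives the outer loop
    list1_ratios = []
    for w1 in list1:
        list2_ratios = [word_similarity(w1, w2) for w2 in list2]
        list1_ratios.append(max(list2_ratios) if list2_ratios else 0.0)
    return list1_ratios


print(best_match_ratios("Getting higher on rank oh", "Getting on higher rank"))
# [100.0, 100.0, 100.0, 100.0, 0.0]
```

Because each word keeps only its best match, word order does not matter here: "on" and "higher" still score 100% even though they are swapped in the second sentence.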
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by lordZOUGA(m): 8:16pm On Jan 10, 2013 |
nice algorithm |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 3:50am On Jan 15, 2013 |
@WhiZTiM by adding
according to your instructions... the output of list1_ratios is [100.0, 100.0, 100.0, 100.0, 0.0], as seen @ http://ideone.com/rn9J1H
So I guess we could do something like (2 * (sum of values in list1_ratios)) / (len(list1) + len(list2)) |
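The formula proposed here can be tried directly on the reported output. The function name `combined_score` is just an illustrative choice.

```python
def combined_score(list1_ratios, len_list1, len_list2):
    """(2 * sum of the best-match ratios) / (word count of both lists)."""
    return 2.0 * sum(list1_ratios) / (len_list1 + len_list2)


ratios = [100.0, 100.0, 100.0, 100.0, 0.0]
print(round(combined_score(ratios, 5, 4), 1))  # 88.9
```

One thing to watch with this form: when the longer list repeats a word that the shorter list contains only once, the score can go above 100. For example, ratios of [100.0, 100.0] for a 2-word list against a 1-word list give 2 * 200 / 3, about 133.3.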
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 1:02pm On Jan 15, 2013 |
lordZOUGA: nice algorithm
Thanks man! |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 1:29pm On Jan 15, 2013 |
@₱®ÌИСΞ Yeah... That's great... Impressive! You are totally on track. The next thing now is to use some basic statistics. Having obtained the similarity list we were looking for, we can see that it has the same number of elements as its word list. This last concept has to do with "weights". Each word carries a weight in the original sentence, and the sum of those weights gives the total weight of the sentence (excluding all delimiters and symbols). The weight is simply the number of characters in a word relative to the total number of characters in its word list. For example, the weight of "school" in the sentence list "he goes to school everyday" is len("school") / (total characters in word_list) = 6 / 22.

So what next? Calculate the weights of the larger (if applicable) list. Each entry of the similarity list should be multiplied by its matching weight. And... sum them up!

Example: if
list1 = [ab, aba, baba, babu]
list2 = [baba, ab, aba]
then
similarity_list = [100.0, 100.0, 100.0, 80.0] (babu scores 80% against aba, since they share the bigrams ab and ba)
weight_list = [0.154, 0.231, 0.308, 0.308]
weighted_similarity = [15.4, 23.1, 30.8, 24.6]
similarity = sum(weighted_similarity) ≈ 93.8 |
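A small sketch of the weighting step, under the same assumptions as the rest of this thread's bigram measure (the names `word_similarity` and `weighted_similarity` are illustrative, and the word similarity is implemented with a `Counter` intersection): each word of the longer list contributes its best-match percentage multiplied by its share of the list's characters.

```python
from collections import Counter


def bigrams(word):
    """Return the overlapping two-character slices of word, in order."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def word_similarity(a, b):
    """Bigram similarity in percent; each bigram is matched at most once."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = len(grams_a) + len(grams_b)
    if total == 0:
        return 0.0  # assumption: one-letter words score 0
    shared = sum((Counter(grams_a) & Counter(grams_b)).values())
    return 100.0 * 2 * shared / total


def weighted_similarity(list1, list2):
    """Sum of best-match ratios, each weighted by the word's character share."""
    if len(list1) < len(list2):
        list1, list2 = list2, list1  # weights come from the longer list
    total_chars = sum(len(w) for w in list1)
    if total_chars == 0 or not list2:
        return 0.0
    score = 0.0
    for word in list1:
        best = max(word_similarity(word, other) for other in list2)
        score += best * len(word) / float(total_chars)
    return score


print(round(weighted_similarity(["ab", "aba", "baba", "babu"],
                                ["baba", "ab", "aba"]), 1))  # 93.8
```

Note that with the bigram measure used in this thread, babu is not a complete miss: it scores 80% against aba because the two share ab and ba, which is why the sketch reports roughly 93.8 for these lists. Since the weights of one list always sum to 1, this form cannot exceed 100.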
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 1:00pm On Jan 16, 2013 |
@WhiZTiM thanks boss. I modified the code to output the required values, as seen here: http://ideone.com/qZyAvx |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 12:45am On Jan 19, 2013 |
@WhiZTiM.....sup with d continuation |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 6:24pm On Jan 21, 2013 |
₱®ÌИСΞ: I have been a bit ill... and secondly, school work... Will try updating this this week. This is pretty much the end of the first part, and I appreciate you and everyone that followed. I will end this part with an introduction to lexical parsing and hash map lookups, where we can determine whether "he is intelligent" is the same as "he is brilliant". For now, I will look up my C++ code and try to write a Python implementation. Part 2 will be about simple grammar systems: can you train a system by correcting it, so that it learns the best choice of word to use? (Mostly statistics, which I never liked.) |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 8:10pm On Jan 21, 2013 |
oh ok...sori bro never liked stats too tho... all d same |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by WhiZTiM(m): 1:33pm On Feb 01, 2013 |
I think... I should be getting ready with the next part... Probably before then, I would post a Quiz in the programming section... An interesting one on parsing. |
Re: A Simple And Very Useful -WORD And SENTENCE- Similarity Algorithm ...part_1 by PrinceNN(m): 2:09pm On Feb 01, 2013 |
kul... |