PHP Classes

PHP Similar Text Percentage: Compare two strings to compute a similarity score

Last updated: 2018-06-05
Version: similar-text 1.0.1
License: Custom (specified...
PHP version: 5
Categories: Algorithms, PHP 5, Text processing

This class can compare two strings to compute a similarity score.

It takes the text of two strings and analyzes them using pure PHP code to evaluate how equal they are.

The class returns a number that represents a percentage telling how similar the two strings are.

It achieves that by sorting words, ignoring white space and punctuation, removing or adding words, stripping URLs, replacing words by acronyms or expanding acronyms into the original words, comparing words with similar sounds using stems, checking parts of the strings, replacing words by abbreviations, or using anagrams.

Innovation Award: PHP Programming Innovation award nominee, April 2018, Number 6
PHP comes with built-in functions for comparing strings and determining how similar they are.

This package provides a pure PHP solution that works in a more sophisticated way by comparing text sentence by sentence, rather than word by word.

Manuel Lemos
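
For comparison, here is a minimal sketch of what the built-in functions mentioned above report. The helper name builtinScores is hypothetical and is not part of this package; it only wraps PHP's own similar_text() and levenshtein() functions.

```php
<?php
// Hypothetical helper (not part of the package): collects the scores
// reported by two of PHP's built-in string comparison functions.
function builtinScores($a, $b)
{
    // similar_text() returns the number of matching characters and
    // writes the similarity percentage into its third argument.
    $matching = similar_text($a, $b, $percent);

    return array(
        'matching'    => $matching,
        'percent'     => $percent,
        // levenshtein() returns the edit distance (0 for equal strings).
        'levenshtein' => levenshtein($a, $b),
    );
}

// The same words in a different order do not score as a full match.
var_dump(builtinScores('tocooksomething', 'somethingtocook'));
```

Because the built-in similar_text() works on common substrings in order, reordered text scores below 100%, which is one of the cases this package is meant to handle.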
Author: zinsou A.A.E.Moïse
Details
The PHP Similar Text class is a very basic script that can be used to detect similarities between strings.
It can help detect similarity even when these cases occur:
WORD REORDER, WHITESPACE AND PUNCTUATION, REMOVE WORDS, ADD WORDS, URL STRIPPING,
FORM ACRONYM, EXPAND ACRONYM, STEMMING, SUBSTRING, SUPERSTRING, ABBREVIATION, ANAGRAM

The static similarText method of the similar_text class takes two strings and some optional
parameters:
- $round (integer, default 2): number of decimal places used to round the result
- $insensitive (boolean, default true): detect similarities regardless of letter case
- &$stats (array, default false): very useful to know whether the shortest string is fully contained in the longest
 (character by character, "contain"), whether the shortest is really contained in the longest (as a substring, even
 when exploded into pieces, "reallycontain"), and what percentage of the shortest is really contained in the longest ("percentageRc")
- &$difference (array, default false): reports the differing characters in two members, chars_notintstr0 and chars_notintstr1


It returns a percentage of similarity that tells how similar the shortest string is to the longest.
Note that the two strings can have the same length.
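
To make the returned percentage more concrete, here is a minimal sketch of one simple scoring scheme: count the characters the two strings share (as multisets) and divide by the length of the longest string. This is an assumption, not the class's confirmed algorithm; the function name charOverlapPercent is hypothetical. It is shown only because it yields the same percentages as the sample outputs on this page.

```php
<?php
// Hypothetical sketch (not the package's code): percentage of characters
// shared by two strings, relative to the length of the longest one.
function charOverlapPercent($a, $b, $round = 2, $insensitive = true)
{
    if ($insensitive) {
        $a = strtolower($a);
        $b = strtolower($b);
    }

    $longest = max(strlen($a), strlen($b));
    if ($longest === 0) {
        return 100.0; // assumption: two empty strings count as identical
    }

    // count_chars(..., 1) maps each byte value to its number of occurrences.
    $countsA = count_chars($a, 1);
    $countsB = count_chars($b, 1);

    // Sum, for every byte, the occurrences present in both strings.
    $shared = 0;
    foreach ($countsA as $byte => $n) {
        if (isset($countsB[$byte])) {
            $shared += min($n, $countsB[$byte]);
        }
    }

    return round(100 * $shared / $longest, $round);
}

var_dump(charOverlapPercent('icontem class', 'iCONtem classes')); // float(86.67)
```

The sketch works on bytes, so multi-byte characters would need the mbstring functions instead.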

Example of usage:
First of all, include the class source file:
require_once 'similar_text.class.php';
Procedural style:
$d=array();
var_dump(similarText('icontem jsclasses.org','iCONtem PHPclasses.org',2,true,$s,$d),$s,$d);
/* return 
float(86.36)
array(3) {
  ["reallycontain"]=>
  bool(false)
  ["percentageRc"]=>
  float(50)
  ["contain"]=>
  bool(false)
}
array(2) {
  ["chars_notintstr1"]=>
  array(1) {
    [8]=>
    string(1) "j"
  }
  ["chars_notintstr0"]=>
  array(3) {
    [8]=>
    string(1) "p"
    [9]=>
    string(1) "h"
    [10]=>
    string(1) "p"
  }
}
 */
 
Another example:
 $d=array();
 var_dump(similarText('icontem class','iCONtem classes',2,true,$s,$d),$s,$d);
 /* return 
 float(86.67)
array(3) {
  ["reallycontain"]=>
  bool(true)
  ["percentageRc"]=>
  int(100)
  ["contain"]=>
  bool(true)
}
array(1) {
  ["chars_notintstr0"]=>
  array(2) {
    [13]=>
    string(1) "e"
    [14]=>
    string(1) "s"
  }
}
 */
Another example:
 $d=array();
var_dump(similarText('give me my pda','give me my personal digital assistant',2,false,$s,$d),$s);
/*return 
float(37.84)
array(3) {
  ["reallycontain"]=>
  bool(false)
  ["percentageRc"]=>
  float(75)
  ["contain"]=>
  bool(true)
}
*/
In this last example you can see that the most significant values are percentageRc (percentage really contained) and
contain. The similarity percentage is easier to interpret when these two values are available.
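
As an illustration of what "percentage really contained" could mean, here is a minimal sketch that splits the shortest string into words and counts how many of them appear as substrings of the longest string. This reading is an assumption inferred from the sample outputs on this page, not the class's confirmed implementation, and the function name wordContainmentPercent is hypothetical.

```php
<?php
// Hypothetical sketch (not the package's code): percentage of the words of
// the shortest string that occur as substrings of the longest string.
function wordContainmentPercent($a, $b, $insensitive = true)
{
    if ($insensitive) {
        $a = strtolower($a);
        $b = strtolower($b);
    }

    // Pick the shortest and longest strings; on a tie, $a is the shortest.
    if (strlen($a) <= strlen($b)) {
        list($short, $long) = array($a, $b);
    } else {
        list($short, $long) = array($b, $a);
    }

    // Split the shortest string on whitespace, dropping empty pieces.
    $words = preg_split('/\s+/', $short, -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return 0.0;
    }

    $found = 0;
    foreach ($words as $word) {
        if (strpos($long, $word) !== false) {
            $found++;
        }
    }

    return round(100 * $found / count($words), 2);
}

// "give", "me" and "my" are found, "pda" is not: 3 words out of 4.
var_dump(wordContainmentPercent('give me my pda', 'give me my personal digital assistant', false)); // float(75)
```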

And another example:
 $d=array();
var_dump(similarText('how to cook some something ?','Something to cook',2,true,$s),$s);
/*
float(60.71)
array(3) {
  ["reallycontain"]=>
  bool(true)
  ["percentageRc"]=>
  float(100)
  ["contain"]=>
  bool(true)
}
*/

And finally, another example:
 $d=array();
 
 var_dump(similarText('tocooksomething','Somethingtocook',2,true,$s,$d),$s);
 /* return 
 float(100)
array(3) {
  ["reallycontain"]=>
  bool(false)
  ["percentageRc"]=>
  float(0)
  ["contain"]=>
  bool(true)
}
As you can see, a full similarity has been detected because the two strings are ANAGRAMS,
but you can interpret the result with the stats array and the similarity percentage.
 Together they tell us that the two strings are similar because they first have the same length,
 then they have the same characters (similarity percentage==100% + "contain" value==true), but
 they are not identical ("reallycontain" value==false + "percentageRc" !=100)
 */
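
The anagram reasoning above (same length, same characters, but not identical) can be checked independently of the class. The following is a minimal sketch using only PHP built-ins; the function name isAnagram is hypothetical and is not part of the package.

```php
<?php
// Hypothetical sketch (not the package's code): two strings are anagrams
// when they have the same length and the same count of every character.
function isAnagram($a, $b, $insensitive = true)
{
    if ($insensitive) {
        $a = strtolower($a);
        $b = strtolower($b);
    }

    // count_chars(..., 1) returns byte value => occurrences, sorted by byte,
    // so equal character counts produce equal arrays.
    return strlen($a) === strlen($b)
        && count_chars($a, 1) == count_chars($b, 1);
}

var_dump(isAnagram('tocooksomething', 'Somethingtocook')); // bool(true)
```

Note that identical strings also pass this check, so a separate equality test is needed to tell an anagram from an exact match.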
 
 
 
OOP style: similar_text::similarText($str_0,$str_1,$round,$insensitive,$stats,$difference);

For bug reports and improvements, contact me at leizmo@gmail.com or use the forum.

Files
File                     Role      Description
license.txt              Lic.      license
readme.txt               Doc.      readme
similar_text.class.php   Class     class source
testST.php               Example   example script
