PHP Classes

Seltz analyzer: Extract important words from HTML documents

Last updated: 2015-03-19
Version: seltz_analyzer 0.4
License: Free for non-comm...
PHP version: 5.2
Categories: HTML, PHP 5, Text processing

This class can be used to extract important words from HTML documents.

It can process a well-formed XHTML document and extract the words contained in the document.

The class gives each word a score depending on conditions such as whether its first letter is upper case or whether the word is inside strong or bold tags.

It returns an associative array of words sorted by importance score.

Innovation Award
PHP Programming Innovation award nominee
March 2009
Number 8

Prize: One book of choice by Manning
There are many solutions for determining the most important keywords of a text document.

However, when the text is part of an HTML document, the importance of each keyword may be affected by the emphasis given by the tags that enclose each keyword.

This class implements an approach for determining the most important keywords in an HTML document considering also the tags that format them.

Manuel Lemos
Name: Seltzlab (available for paid consulting)
Classes: 1 package
Country: Italy
Age: 36
Innovation award nominee: 1x

seltz_analyzer is a PHP class that tries to find the most important words inside a well-formed XHTML fragment.
Every word gets a score based on its role in the XHTML structure. For example, a word inside a strong tag gets 5 points.
In addition, the class applies some simple syntax rules. For example, a word whose first character is uppercase gets 4 points.
The rules applied are listed below:
- weight_ucfirst = 4, the first character is uppercase and the character before the word is not an interruption character
- weight_ucfirst_multi = + weight_ucfirst, weight_ucfirst is satisfied and the previous word satisfied it too
- weight_pspell = 3, the PHP pspell module is loaded and the pspell spell checker does not recognize the word as a valid word
- weight_strong = 5, the word is inside a strong or b tag
- weight_em = 5, the word is inside an em or i tag
- weight_span = 4, the word is inside a span tag
- weight_p = 1, the word is inside a p tag
- weight_acronym = 2, the word is inside an acronym tag
- weight_cite = 1, the word is inside a cite tag
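
The two capitalization rules can be illustrated with a minimal sketch. It is not the class's own code: the function name, the constant, the set of interruption characters (. ! ? : ;) and the choice to treat the start of the text as an interruption are all assumptions made for illustration, and the sketch only handles ASCII letters.

```php
<?php
// Assumed value taken from the rule list above (weight_ucfirst = 4).
define('SKETCH_WEIGHT_UCFIRST', 4);

// Hypothetical helper: scores words by the ucfirst and ucfirst_multi rules only.
function sketch_ucfirst_scores($text)
{
    $scores = array();
    $prevQualified = false;
    // Capture every ASCII word together with its byte offset in the text.
    preg_match_all('/[A-Za-z]+/', $text, $matches, PREG_OFFSET_CAPTURE);
    foreach ($matches[0] as $match) {
        list($word, $offset) = $match;
        // Last non-space character before the word; the start of the
        // text is treated as an interruption (assumption).
        $before = rtrim(substr($text, 0, $offset));
        $prevChar = ($before === '') ? '.' : substr($before, -1);
        $isUpper = ($word[0] === strtoupper($word[0]));
        $qualifies = $isUpper && strpos('.!?:;', $prevChar) === false;
        $points = 0;
        if ($qualifies) {
            $points = SKETCH_WEIGHT_UCFIRST;
            if ($prevQualified) {
                // ucfirst_multi: the previous word qualified too.
                $points += SKETCH_WEIGHT_UCFIRST;
            }
        }
        $scores[$word] = (isset($scores[$word]) ? $scores[$word] : 0) + $points;
        $prevQualified = $qualifies;
    }
    return $scores;
}

$scores = sketch_ucfirst_scores('The class was written by Seltz Lab in Italy. Italy is nice.');
print_r($scores);
```

Under these assumptions "Seltz" gets 4 points, "Lab" gets 8 because the word before it also qualified, and "Italy" gets 4 points only for its first occurrence, since the second one follows a full stop.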

Heading tags h1, h2, ... hn set a multiplier for the score of the words that follow; the multiplier is inversely proportional to the heading level (1/n).

Scores are cumulative, so the more often a word is used, the more weight it gains.
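
How tag weights, the heading multiplier and cumulative scoring might combine can be sketched with PHP's DOM extension. This is an illustration only, not the class's implementation: the function sketch_walk is hypothetical, nested tag weights are assumed to add up, the multiplier is assumed to start at 1 before the first heading, and the pspell and capitalization rules are left out.

```php
<?php
// Tag weights copied from the English rule list above.
$weights = array(
    'strong' => 5, 'b' => 5, 'em' => 5, 'i' => 5,
    'span' => 4, 'p' => 1, 'acronym' => 2, 'cite' => 1,
);

// Hypothetical walker: visits nodes in document order and adds the
// weights of all enclosing tags to every word found in a text node.
function sketch_walk(DOMNode $node, array $weights, $inherited, &$multiplier, array &$scores)
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $found = preg_split('/[^\p{L}\p{N}]+/u', $child->nodeValue, -1, PREG_SPLIT_NO_EMPTY);
            foreach ($found as $word) {
                $previous = isset($scores[$word]) ? $scores[$word] : 0;
                // Cumulative: every occurrence adds to the word's total.
                $scores[$word] = $previous + $inherited * $multiplier;
            }
        } elseif ($child instanceof DOMElement) {
            $tag = strtolower($child->nodeName);
            if (preg_match('/^h([1-6])$/', $tag, $m)) {
                // A heading hn sets the multiplier to 1/n from here on.
                $multiplier = 1 / (int) $m[1];
            }
            $extra = isset($weights[$tag]) ? $weights[$tag] : 0;
            sketch_walk($child, $weights, $inherited + $extra, $multiplier, $scores);
        }
    }
}

$doc = new DOMDocument();
$doc->loadXML('<div><h2>Title</h2><p>plain <strong>bold</strong> plain</p></div>');

$scores = array();
$multiplier = 1;
sketch_walk($doc, $weights, 0, $multiplier, $scores);
arsort($scores);
print_r($scores);
```

With this sample the h2 sets the multiplier to 1/2, so "bold" ends up with (1 + 5) / 2 = 3 points and "plain", which occurs twice inside the p tag, with 0.5 + 0.5 = 1 point. "Title" scores 0 here because the sketch gives no weight to the heading tag itself.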

This work is licensed under Creative Commons Attribution-Share Alike 3.0. To view a copy of the license, visit the Creative Commons website or write to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

seltz_analyzer is a PHP class that parses an XHTML document and finds the semantically most important words in the text.
Nothing transcendental, really: no artificial intelligence, no magic.
The parser analyzes the XHTML structure of the document and, based on the use of certain tags such as strong or em, assigns a score to a word or a phrase.
Besides deriving semantic importance from the document structure, it also assigns scores based on some rules tied to the syntax of the text.
The weights applied in the various cases are listed below:
- weight_ucfirst = 4, the word starts with an uppercase letter and is not preceded by an interruption character
- weight_ucfirst_multi = + weight_ucfirst, weight_ucfirst is satisfied and the previous word also satisfied weight_ucfirst
- weight_pspell = 3, the PHP pspell module is loaded and the word is not recognized by the pspell spell checker
- weight_strong = 5, the word is inside a strong or b tag
- weight_em = 5, the word is inside an em or i tag
- weight_span = 4, the word is inside a span tag
- weight_p = 1, the word is inside a p tag
- weight_acronym = 4, the word is inside an acronym tag
- weight_cite = 1, the word is inside a cite tag

The heading tags h1, h2, ... hn set an inversely proportional multiplier (1/n) for the words that follow.

The score is cumulative, so the more often a word is used, the more weight it gains.

This work is released under the Creative Commons Attribution-Share Alike 3.0 Italy license. To read a copy of the license, visit the Creative Commons website or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.


// load the class (assumes seltz_analyzer.php is in the same directory)
require_once 'seltz_analyzer.php';

// seltz_analyzer configuration params
$conf = array(
    'doclang' => 'en', // 2-character document language, default en
    'encoding' => 'utf-8' // document encoding, default iso-8859-1
);

// new seltz_analyzer object
// 1st param is a well-formed XHTML fragment (must have a root element) or a URL to a web document
// ($html is assumed to be set by the caller before this point)
// 2nd param is the configuration array
$mydoc = new seltz_analyzer($html, $conf);
// the buildStruct method returns an associative array with words as keys and scores as values
$words = $mydoc->buildStruct();
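
Once the result is available, picking the highest-scoring words takes only standard array functions. The sketch below uses a made-up $words array in the word => score shape described above, so it runs without the class; the values are invented for illustration. The description says the returned array is already sorted by importance, so the arsort call is only a safeguard.

```php
<?php
// Hypothetical result in the shape returned by buildStruct(): word => score.
$words = array('analyzer' => 12, 'the' => 1, 'Seltz' => 9, 'xhtml' => 7);

// Sort by score, highest first, keeping the word keys.
arsort($words);

// Keep the three best words; preserve_keys = true matters for
// purely numeric words such as "2009", which PHP stores as integer keys.
$top = array_slice($words, 0, 3, true);

foreach ($top as $word => $score) {
    echo $word . ': ' . $score . "\n";
}
```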

Files
File                 Role    Description
seltz_analyzer.php   Class   PHP class
README               Doc.    Readme file (accessible without login)
