PHP Classes

This is a very fast class however it's not working effectivel...


Subject: This is a very fast class however...
Summary: Package rating comment
Messages: 8
Author: Jase
Date: 2010-06-14 10:16:08
Update: 2010-07-01 18:13:14
 

Jase rated this package as follows:

Utility: Not sure
Consistency: Sufficient
Examples: Good

  1. This is a very fast class however...
Jase - 2010-06-14 10:16:08
This is a very fast class; however, it doesn't work effectively as a tokenizer. Tokenizers generally go character by character (very slow in PHP, I agree), but I tested this on quite a lot of files and it's just not general enough. Symbols should not be catalogued as word or digit tokens unless specified. I tried it on a PHP file that tokenized itself, with the class appended below a usage example. The class was tokenized well, but the example at the top was a different story: everything from <?php down to the class keyword, including all the example code, came back as one block. This is not the correct behaviour. Don't expect this to work as expected.

  2. Re: This is a very fast class however...
Domenico Pontari - 2010-06-22 09:43:40 - In reply to message 1 from Jase
Hi Jase,
Can you send me the file you used? I'll try to improve this class.
Thank you for your interest.
Fair

  3. Re: This is a very fast class however...
Jase - 2010-06-23 17:24:18 - In reply to message 2 from Domenico Pontari
Hi Domenico

It was a standard PHP class I was using, as I'm trying to write a minimizer (I'm writing a huge library framework and wanted to offer a minimized version for those with shared hosting accounts and minimal space).

The original file I used has changed considerably, so I don't have it on hand at the moment (I had no version control on it as it was just a test class).


May I suggest however doing a tokenizer using strcspn, strspn, and strpos? I've started a class like this but it's only halfway through and already proving much faster. I will try and re-create my class to show the issues (using PHP 5.2 with Ubuntu 10.04 as an apache2 module).
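
As a rough illustration of the three built-in functions mentioned above (a sketch only, not code from either class; the sample string and variable names are made up for the example):

```php
<?php
// Hypothetical input, chosen only to show what each call returns.
$data = "foo = 42;";

// strcspn: length of the initial run that contains NONE of the listed chars.
$wordLen = strcspn($data, " =;");            // covers "foo"

// strspn: length of the run made up ONLY of the listed chars,
// starting at an offset (here: the whitespace right after "foo").
$spaceLen = strspn($data, " \t", $wordLen);  // covers the single space

// strpos: position of the next occurrence of a substring, or false.
$semi = strpos($data, ";");

var_dump($wordLen, $spaceLen, $semi);
```

Each call returns a length or a position in one step, which is what lets a scanner jump over a run of characters instead of looping over them one at a time in PHP code.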


  4. Re: This is a very fast class however...
Jase - 2010-06-23 17:42:45 - In reply to message 2 from Domenico Pontari
Hi

I've just run a very simple example using the class's own file as input for the parser.

If you run the code (from a CLI would be simple enough):
cd Desktop
php name_of_file.php

you will see instantly that huge blocks are getting flooded into the var_dump. There are many cases where the last few lines of a method end up in the same block as a PHPDoc comment, etc.

The script below is just the class with one of your usage examples at the top, but it tokenizes itself.

I know this code is meant more for simple strings, but if done correctly it should split the PHP correctly (obviously it's then up to the user to determine what is meaningful in the tokenized array): for example, detecting a string, detecting a variable, detecting keywords like if, while, etc.
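
For comparison, this is the kind of classification meant here. The sketch below uses PHP's built-in token_get_all() and token_name() from the tokenizer extension, not the class under discussion; the sample source string and the $summary array are assumptions made for the example.

```php
<?php
// Hypothetical snippet to tokenize; any PHP source string would do.
$source = '<?php if ($a) { echo "hi"; }';

$summary = array();
foreach (token_get_all($source) as $token) {
    if (is_array($token)) {
        // Array tokens are array(id, text, line); token_name() turns the id
        // into a readable name such as T_IF or T_VARIABLE.
        $summary[] = token_name($token[0]) . ': ' . $token[1];
    } else {
        // Single-character symbols such as ( ) { } ; come back as plain strings.
        $summary[] = 'SYMBOL: ' . $token;
    }
}
print_r($summary);
```

Keywords, variables, quoted strings and symbols each arrive as their own token, which is the behaviour a minimizer would need.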

I highly suggest rewriting this class around the built-in string functions. Using strcspn(), strpos() etc., you can achieve the desired effect much more quickly, with fewer resources and less code. strcspn is one of the best for capturing strings: if you put " in the strcspn mask (it can look for multiple characters at once), you can jump from one end of the string to the other, and a loop that also checks for the escape char can then find the next unescaped ".
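
A minimal sketch of that string-reading loop, assuming double-quoted strings with backslash escapes. The function name readQuotedString and the sample input are hypothetical, not taken from either class:

```php
<?php
/**
 * Hypothetical helper: reads a double-quoted string that starts at $start
 * and returns it including its quotes, or false if it is never closed.
 */
function readQuotedString($data, $start)
{
    $length = strlen($data);
    $pos = $start + 1; // step over the opening quote

    while ($pos < $length) {
        // Jump straight to the next quote or backslash.
        $pos += strcspn($data, '"\\', $pos);
        if ($pos >= $length) {
            break; // ran off the end: no closing quote
        }
        if ($data[$pos] === '\\') {
            $pos += 2; // skip the backslash and the character it escapes
            continue;
        }
        // Unescaped closing quote found.
        return substr($data, $start, $pos - $start + 1);
    }
    return false;
}

$code = 'echo "say \"hi\" now"; // rest';
var_dump(readQuotedString($code, 5));
```

The loop body only runs once per quote or backslash, so long strings with few escapes are covered in a handful of strcspn calls.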

I use strspn to skip whitespace very simply, with the method below.
Bear in mind that whitespace here means a space or a tab; newlines are \r and \n. This also handles \r\n together because of strspn's behaviour.

/**
* Skips whitespace but also returns it just in case it is needed
*
* @param bool $include_newlines whether \r and \n count as whitespace
* @return string|false the skipped whitespace, or false if there was none
*/
public function skipWhitespace($include_newlines = true){
    $whitespace = self::WHITESPACE;
    if($include_newlines)
        $whitespace .= self::NEWLINE;

    //find how long the run of whitespace is from the current position
    $result = strspn($this->data, $whitespace, $this->position);

    if($result > 0){
        //extract the whitespace
        $space = substr($this->data, $this->position, $result);
        //set the position ahead of the whitespace. Minus 1 as positions start from 0, not 1.
        $this->skipChars(($result - 1));
        //return the whitespace
        return $space;
    }
    else{
        return false;
    }
}

Also, I highly suggest detecting symbols separately from words. Symbols like ! & ^ etc. shouldn't be included along with keywords, as that makes code parsing and HTML parsing much more difficult.
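
One way this separation could look, as a sketch only. The function splitTokens, the character classes and the token type names are all assumptions made for the example, not part of either class:

```php
<?php
/**
 * Hypothetical sketch: splits $data into whitespace, number, word and
 * symbol tokens, so that symbols never end up glued to a word.
 */
function splitTokens($data)
{
    $letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_';
    $digits = '0123456789';
    // Checked in this order, so a token starting with a digit is a number.
    $classes = array(
        'whitespace' => " \t\r\n",
        'number'     => $digits,
        'word'       => $letters . $digits,
    );

    $tokens = array();
    $pos = 0;
    $length = strlen($data);
    while ($pos < $length) {
        $type = 'symbol';
        $len = 1; // default: one symbol character per token
        foreach ($classes as $name => $chars) {
            // strspn measures the whole run of matching characters at once.
            $run = strspn($data, $chars, $pos);
            if ($run > 0) {
                $type = $name;
                $len = $run;
                break;
            }
        }
        $tokens[] = array($type, substr($data, $pos, $len));
        $pos += $len;
    }
    return $tokens;
}

print_r(splitTokens('if(!$a)'));
```

Every character that is not whitespace, a digit or a word character becomes its own symbol token, leaving it to the caller to decide which symbols matter.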

Hopefully I've not overloaded you here, and I hope I've helped. Obviously my method will be different to yours, but hopefully I've suggested ways for you to improve certain aspects, or given you an idea you hadn't thought of.

<?php

$tokenizer = tokenizer::registerDefaultTokenizer();
var_dump ($tokenizer->tokenize (file_get_contents(__FILE__)));


/**
* TOKENIZER library
*
* This library has the following features:
* 1. you can define a list of different sequences to cut tokens, not only a single char.
* 2. you can define a list of markers to avoid the cutting of tokens.
* 3. for each marker you can define opening, closure and escaping sequence of chars: you are not obliged to use only one char.
*
* @package tokenizer
* @version 1.2
* @author Domenico Pontari <fairsayan@gmail.com>
* @copyright Copyright (c) 2010, Domenico Pontari
* @license http://opensource.org/licenses/bsd-license.php New and Simplified BSD licenses
* @link http://www.phpclasses.org/package/5969-PHP-Tokenizer-split-strings-into-tokens.html
*/


/**
* This class defines sequences of chars to avoid the cutting of tokens
*
* Markers are sequences of chars to avoid the cutting of tokens. For each marker
* you can define opening, closure and escaping sequence of chars:
* you are not obliged to use only one char.
* @package tokenizer
*/
class marker {
protected $name;
protected $opening;
protected $closure;
protected $escaping;

function getName () { return $this->name; }
function setName ($name) {$this->name = $name;}

function getOpening () { return $this->opening; }
function getClosure () { return $this->closure; }
function getEscaping () { return $this->escaping; }

function setMarker ($opening, $closure = '', $escaping = '', $name = '') {
$this->opening = $opening;
$this->closure = ($closure == '')?$opening:$closure;
$this->escaping = $escaping;
$this->name = ($name == '')?$opening:$name;
}


/**
* delete all marks in a string (opening and closure) and transform each escaping
* sequence into the opening sequence
* @return string
*/
function unMark ($string) {
$tokenizer = new tokenizer ();
$limits = array ($this->opening);
$result = array ();
if ($this->escaping != '') array_push ($limits, $this->escaping);
if ($this->closure != $this->opening) array_push ($limits, $this->closure);
$tokenizer->setLimits ($limits);
$tokens = $tokenizer->tokenize ($string);
foreach ($tokens as $token) {
if (($token == $this->opening) || ($token == $this->closure)) continue;
if (($this->escaping != '') && ($token == $this->escaping)) array_push ($result, $this->opening);
else array_push ($result, $token);
}
return implode ('', $result);
}

function __construct ($opening, $closure = '', $escaping = '', $name = '') {
$this->setMarker ($opening, $closure, $escaping, $name);
}
}


/**
* This class performs the tokenization of a string into pieces
*
* @package tokenizer
*/
class tokenizer {
/**
* @var array an array of strings: a list of different sequences to cut tokens
*/
protected $limits = array();

/**
* @var array an array of markers to avoid the cutting of tokens
* @see marker
*/
protected $markers = array();


/**
* @var array an array of strings: the result of the tokenizer
*/
protected $tokens = false;

/**
* @var array an array of boolean
*/
protected $limitsMap = false;

/**
* @param array an array of markers the tokenizer
* @return void
*/
function setMarkers ($markers) { $this->markers = $markers; }

function getMarkers () {return $this->markers;}

function getMarker ($markerName) {
$result = false;
foreach ($this->markers as $marker)
if ($marker->getName() == $markerName) return $marker;
return $result;
}

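function setMarker ($marker) {
$name = $marker->getName ();
// replace an existing marker with the same name, otherwise append the new one
foreach ($this->markers as $pos => $oldMarker)
if ($oldMarker->getName() == $name) { $this->markers[$pos] = $marker; return; }
array_push ($this->markers, $marker);
}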

/**
* @param array array of chars
* @return void
*/
function setLimits ($limits) { $this->limits = $limits; }

/**
* @return array|false retrieve again the result of tokenize function
* @see tokenize
*/
function getTokens ($withLimits = false) {
if (($withLimits)||($this->tokens === false)) return $this->tokens;

$result = array();
foreach ($this->tokens as $num => $token)
if (!$this->limitsMap[$num]) array_push ($result, $token);
return $result;
}

function getLimitsMap () { return $this->limitsMap; }

/**
* Execute a tokenization
*
* @param string
* @return array|string if an error occurs return an error message otherwise return an array of tokens
*/
function tokenize ($string) {
if (empty($string)) return array();
$limits = array();
$escapingLimit = array (chr(92), '(', ')', '|'); // it's important to preserve this order
$escapedLimit = array (chr(92) . chr(92), '\(', '\)', '\|');
foreach ($this->limits as $limit) {
$limit = str_replace($escapingLimit, $escapedLimit, $limit);
array_push ($limits, "($limit)");
}

$this->tokens = preg_split("/" . implode('|', $limits) . "/", $string,-1,PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);

$numOfMarkers = count($this->markers);
if ($numOfMarkers > 0) $avoidLevel = array_fill(0, $numOfMarkers, 0);
else $avoidLevel = array();
$result = array();
$someActiveMarker = false;

foreach ($this->tokens as $num => $token) {
$tokenLen = strlen($token);
if ($someActiveMarker) $result[0] .= $token;
else array_unshift($result, $token);
$someActiveMarker = false;
for ($posMarker = 0; $posMarker < $numOfMarkers; ++$posMarker) {
$openingStr = $this->markers[$posMarker]->getOpening();
$closureStr = $this->markers[$posMarker]->getClosure();
$escapingStr = $this->markers[$posMarker]->getEscaping();
for ($i = 0; $i < $tokenLen; $i++) {
$comparingOpening = substr($token, $i, strlen($openingStr));
$comparingClosure = substr($token, $i, strlen($closureStr));
$comparingEscaping = substr($token, $i, strlen($escapingStr));

if ($avoidLevel[$posMarker] == 0) {// evaluate opening only
if ($comparingOpening == $openingStr) {
++$avoidLevel[$posMarker];
$i += strlen($openingStr) - 1;
}
} else {
if ((strlen($escapingStr) > 0) &&
($comparingEscaping == $escapingStr)) {
$i += strlen($escapingStr) - 1;
} elseif ($comparingClosure == $closureStr) {
--$avoidLevel[$posMarker];
$i += strlen($closureStr) - 1;
}
}
}
if ($avoidLevel[$posMarker] > 0) $someActiveMarker = true;
}
}
$this->tokens = array_reverse($result, false);
$this->limitsMap = array ();
foreach ($this->tokens as $token)
if (in_array($token, $this->limits, true)) array_push ($this->limitsMap, true);
else array_push ($this->limitsMap, false);

return $this->tokens;
}

/**
* Set default markers for the tokenizer:
* 1. ' with \' as escaping sequence
* 2. " with \" as escaping sequence
* 3. () without escaping chars
*
* @return void
*/
function setDefaultMarkers () {
$this->markers = array();
array_push ($this->markers, new marker ('(',')'));
array_push ($this->markers, new marker ("'", "'", "\'"));
array_push ($this->markers, new marker ('"', '"', '\"'));
}

/**
* Set default limit: the blank space
* @return void
*/
function setDefaultLimits () { $this->limits = array (' '); }

/**
* Return a new tokenizer with default limit and markers
* @return tokenizer
*/
static function registerDefaultTokenizer () {
$result = new tokenizer();
$result->setDefaultLimits();
$result->setDefaultMarkers();
return $result;
}
}

/**
* @package tokenizer
*/
class parser {
protected $expression;

protected $tokenizer;

protected $operators;

protected $errorMsg;

protected $elements;

protected $operatorsMap;

/**
* This version of the constructor automatically creates a default tokenizer
*/
function __construct ($expression, $operators, $tokenizer = false) {
if ($tokenizer === false) $tokenizer = tokenizer::registerDefaultTokenizer();
$this->setTokenizer ($tokenizer);
$this->setOperators ($operators);
$this->setExpression($expression);
}

/**
* @param string
* @return void
*/
function setExpression ($expression) { $this->expression = $expression; }

/**
* @param tokenizer
* @return void
* @see tokenizer
*/
function setTokenizer ($tokenizer) { $this->tokenizer = $tokenizer; }

/**
* @param array an array of strings in which each string is an operator in the expression
* @return void
*/
function setOperators ($operators) { $this->operators = $operators; }

/**
* Parse the expression to find operators and operand
* @return array
*/
function parse () {
$tokens = $this->tokenizer->tokenize ($this->expression);
$result = array();

foreach ($tokens as $pos => $token)
if (($pos == 0) || (in_array($token, $this->operators)) ||
(in_array($tokens[$pos - 1], $this->operators))) array_unshift($result, $token);
else $result[0] .= $token;
$this->elements = array_reverse($result, false);

$this->operatorsMap = array();

foreach ($this->elements as $pos => $el)
if (in_array($el, $this->operators, true)) array_push ($this->operatorsMap, $el);
else array_push ($this->operatorsMap, false);

return $this->elements;
}

/**
* Get the operator position in the element list
* @param int the n-th operator
* @param string|array|false it could be a single operator or a list of operators (array). If false all operators are valid
* @return int|false
*/
function getOperatorPositionInElementList ($nthOperator, $operatorType = false) {
$this->errorMsg = '';

$currOperator = 0;
$isArray = is_array($operatorType);
foreach ($this->operatorsMap as $pos => $operator) {
if ( (($operatorType === false)&&(in_array($operator, $this->operators, true))) ||
((!$isArray)&&($operatorType !== false)&&($operator === $operatorType)) ||
(($isArray)&&(in_array($operator, $operatorType, true)))
) {
if ($currOperator++ == $nthOperator) return $pos;
}
}
$this->errorMsg = "Don't exist $nthOperator operators in the expression";
return false;
}

/**
* Get the left operand of an operator in the expression
* @param int the n-th operator
* @param string|array|false it could be a single operator or a list of operators (array). If false all operators are valid
* @return string|false
*/
function getLeftOperand ($nthOperator, $operatorType = false) {
$pos = $this->getOperatorPositionInElementList ($nthOperator, $operatorType);
if ($pos === false) return false;
return $this->elements[$pos - 1];
}

/**
* Get the right operand of an operator in the expression
* @param int the n-th operator
* @param string|array|false it could be a single operator or a list of operators (array). If false all operators are valid
* @return string|false
*/
function getRightOperand ($nthOperator, $operatorType = false) {
$pos = $this->getOperatorPositionInElementList ($nthOperator, $operatorType);
if ($pos === false) return false;
return $this->elements[$pos + 1];
}
}

?>

  5. Re: This is a very fast class however...
Domenico Pontari - 2010-07-01 16:35:16 - In reply to message 4 from Jase
Hi Jase,
I uploaded a new version of tokenizer.php that uses the "explode" function instead of "preg_split". I visited some web sites, e.g.:
mail-archive.com/php-general@lists. ...
Many of them say that explode is the fastest function (I suppose because it doesn't support regular expressions). Anyway, this new algorithm can handle special chars more easily, as you suggested.
Let me know what you think.
Fair

  6. Re: This is a very fast class however...
Jase - 2010-07-01 17:34:24 - In reply to message 5 from Domenico Pontari
Hi Domenico

Yes, explode should be much quicker as there's no regex engine overhead. Large parsing tasks should avoid regex like the plague lol.
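
A quick way to check this claim on a given machine, as a sketch. The input string is made up, the timings will vary by PHP version and hardware, and no particular result is implied:

```php
<?php
// Hypothetical input: a long space-separated string built for the test.
$text = str_repeat('lorem ipsum dolor sit amet ', 20000);

$start = microtime(true);
$viaExplode = explode(' ', $text);
$explodeTime = microtime(true) - $start;

$start = microtime(true);
$viaRegex = preg_split('/ /', $text);
$regexTime = microtime(true) - $start;

// Both calls split on a single space, so the pieces should be identical.
var_dump($viaExplode === $viaRegex);
printf("explode: %.5fs, preg_split: %.5fs\n", $explodeTime, $regexTime);
```

A single run like this is only indicative; repeating it several times and on realistic input gives a fairer comparison.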

I'll essentially run the same test as I did previously with the PHP file to see how it behaves. I shall post another comment when it's done

  7. Re: This is a very fast class however...
Jase - 2010-07-01 17:47:46 - In reply to message 6 from Jase
I've just re-run the test in the same way I did previously.

It's working much better in the way it captures the data. However, after a while it begins to lose a bit of control and presents lumps of code in one var_dump section. Please look at the bottom of this post for the example.

To be honest, I think you are still going to be very limited with explode, but it may be the right sacrifice: this is amazingly fast, but the accuracy is lacking.

The tokenizer I'm working on is completely accurate but a fair bit slower, as I do character-by-character scanning (which is why I use strcspn, strpos etc.).

Getting the speed improvement and improved accuracy is a good start, but for complete accuracy you may need to switch to character-by-character scanning with some shortcut functions to improve speed. Maybe that could be the next version? I wouldn't want to change this version too much, as it may break other people's scripts that are using it successfully.

EXAMPLE:
See how [1498] states that it is 3148 characters long
[1495]=>
string(1) " "
[1496]=>
string(2) "1."
[1497]=>
string(1) " "
[1498]=>
string(3148) "' with \' as escaping sequence
* 2. " with \" as escaping sequence
* 3. () without escaping chars
*
* @return void
*/
function setDefaultMarkers () {
$this->markers = array();
array_push ($this->markers, new marker ('(',')'));
array_push ($this->markers, new marker ("'", "'", "\'"));
array_push ($this->markers, new marker ('"', '"', '\"'));
}

/**
* Set default limit: the blank space




  8. Re: This is a very fast class however...
Domenico Pontari - 2010-07-01 18:13:14 - In reply to message 7 from Jase
It depends on the default markers: (), '' and "".
If you only want to define a whitespace limit, use:

$tokenizer = tokenizer::registerDefaultTokenizer();
$tokenizer->setMarkers(array());
var_dump ($tokenizer->tokenize (file_get_contents(__FILE__)));

or

$tokenizer = new tokenizer();
$tokenizer->setLimits(array(' '));
var_dump ($tokenizer->tokenize (file_get_contents(__FILE__)));
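
To illustrate why a marker can swallow a large block, here is a deliberately simplified model. It is not the tokenizer class itself: splitWithQuoteMarker is a hypothetical function that only knows the single-quote marker, ignores escaping, and drops the space delimiters instead of returning them as tokens.

```php
<?php
/**
 * Simplified, hypothetical model of marker behaviour: split on spaces, then
 * glue pieces back together while a single-quote marker is still open.
 */
function splitWithQuoteMarker($text)
{
    $tokens = array();
    $open = false; // true while inside an unclosed ' ... '
    foreach (explode(' ', $text) as $piece) {
        if ($open) {
            // Still inside the marker: append to the previous token.
            $tokens[count($tokens) - 1] .= ' ' . $piece;
        } else {
            $tokens[] = $piece;
        }
        // An odd number of quotes in this piece toggles the marker state.
        if (substr_count($piece, "'") % 2 === 1) {
            $open = !$open;
        }
    }
    return $tokens;
}

print_r(splitWithQuoteMarker("say 'hello world' now"));
print_r(splitWithQuoteMarker("1. ' with no closing quote in sight"));
```

In the second call the lone quote opens a marker that never closes, so everything after it ends up in one token. A stray quote character inside a comment could plausibly have the same effect on a whole file.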