Extracts tokens from a query using regex-defined token types. The library is written in PHP 8.2.
The package is available on Packagist and can be installed using Composer:
composer require codeinc/query-token-extractor
use CodeInc\QueryTokensExtractor\QueryTokensExtractor;
use CodeInc\QueryTokensExtractor\Type\RegexType;
use CodeInc\QueryTokensExtractor\Type\WordType;
use CodeInc\QueryTokensExtractor\Type\FrenchPhoneNumberType;
use CodeInc\QueryTokensExtractor\Type\FrenchPostalCodeType;
use CodeInc\QueryTokensExtractor\Type\YearType;
use CodeInc\QueryTokensExtractor\Dto\QueryToken;
$tokensExtractor = new QueryTokensExtractor([
    new FrenchPhoneNumberType(),
    new FrenchPostalCodeType(),
    new YearType(),
    new RegexType('my_custom_token', '/^this a custom token/ui'),
    new WordType(),
]);
$tokens = $tokensExtractor->extract('paris (75001) these are words 01.00.00.00.00 this a custom token 2023');
/** @var QueryToken $token */
foreach ($tokens as $token) {
    echo "Position: " . $token->position . "\n"
        . "Class: " . get_class($token->type) . "\n"
        . "Name: " . $token->type->name . "\n"
        . "Value: " . $token->value . "\n";
}
The above example produces the following output:
Position: 0
Class: CodeInc\QueryTokensExtractor\Type\WordType
Name: word
Value: paris
Position: 1
Class: CodeInc\QueryTokensExtractor\Type\FrenchPostalCodeType
Name: french_postal_code
Value: 75001
Position: 2
Class: CodeInc\QueryTokensExtractor\Type\WordType
Name: word
Value: these
Position: 3
Class: CodeInc\QueryTokensExtractor\Type\WordType
Name: word
Value: are
Position: 4
Class: CodeInc\QueryTokensExtractor\Type\WordType
Name: word
Value: words
Position: 5
Class: CodeInc\QueryTokensExtractor\Type\FrenchPhoneNumberType
Name: french_phone_number
Value: 01 00 00 00 00 (the original value without punctuation)
Position: 6
Class: CodeInc\QueryTokensExtractor\Type\RegexType
Name: my_custom_token
Value: this a custom token
Position: 7
Class: CodeInc\QueryTokensExtractor\Type\YearType
Name: year
Value: 2023
WordType: extracts words from the query
YearType: extracts years from the query
FrenchPhoneNumberType: extracts French phone numbers from the query
FrenchPostalCodeType: extracts French postal codes from the query
HashtagType: extracts hashtags from the query
RegexType: extracts tokens from the query using a regex
The token type priority is determined by the order in which the token types are passed to the QueryTokensExtractor constructor.
Priority determines the order in which token types are tried: the higher the priority, the sooner the token is extracted.
WordType should always be passed last, as it will match any string.
use CodeInc\QueryTokensExtractor\QueryTokensExtractor;
use CodeInc\QueryTokensExtractor\Type\WordType;
use CodeInc\QueryTokensExtractor\Type\FrenchPhoneNumberType;
use CodeInc\QueryTokensExtractor\Type\FrenchPostalCodeType;
use CodeInc\QueryTokensExtractor\Type\YearType;
$tokensExtractor = new QueryTokensExtractor([
    new FrenchPhoneNumberType(), // highest priority
    new FrenchPostalCodeType(),
    new YearType(),
    new WordType(), // lowest priority
]);
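To illustrate why the order matters, here is a minimal sketch of priority-ordered matching. It is not the library's implementation: it uses a plain preg_match() loop, and the function name sketchExtract and both regexes are assumptions made up for this example.

```php
<?php
// Hypothetical stand-in for the extraction loop, not the library's real code.
// $typesByPriority maps a type name to a regex anchored at the start of the remaining query.
function sketchExtract(string $query, array $typesByPriority): array
{
    $tokens = [];
    $rest = ltrim($query);
    while ($rest !== '') {
        $matched = false;
        foreach ($typesByPriority as $name => $regexp) {
            // The first type that matches wins, so array order acts as priority.
            if (preg_match($regexp, $rest, $m) === 1 && $m[0] !== '') {
                $tokens[] = [$name, $m[0]];
                $rest = ltrim(substr($rest, strlen($m[0])));
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            // Nothing matched: drop one byte so the loop always makes progress.
            $rest = ltrim(substr($rest, 1));
        }
    }
    return $tokens;
}

// Assumed patterns, for illustration only.
$year = '/^\d{4}\b/';
$word = '/^\w+/';

$yearFirst = sketchExtract('paris 2023', ['year' => $year, 'word' => $word]);
$wordFirst = sketchExtract('paris 2023', ['word' => $word, 'year' => $year]);
```

With the year pattern first, 2023 is tagged as a year; with the word pattern first, the same text is swallowed as a word, which is why a catch-all type belongs at the end of the list.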
Custom token types can be created by instantiating or extending RegexType. The constructor of RegexType takes three arguments:
string $name: the name of the token type
string $regexp: the regex used to extract the token
\Closure $valueFormatter: a closure used to format the extracted value (optional)
The regexp's capturing group named value is used as the extracted value (for instance, the HashtagType type uses the regex '/^#(?<value>.[a-z0-9_]+)/ui'). If no group named value is defined, the whole match is used as the token value.
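The fallback from the named group to the whole match can be sketched with plain preg_match(). The helper name sketchTokenValue is hypothetical and only mirrors the rule described above; the library's own code may differ.

```php
<?php
// Hypothetical helper: returns the token value for a match, or null when the regexp does not match.
function sketchTokenValue(string $regexp, string $query): ?string
{
    if (preg_match($regexp, $query, $matches) !== 1) {
        return null;
    }
    // Prefer the group named "value"; otherwise fall back to the whole match.
    return $matches['value'] ?? $matches[0];
}

// Regex with a "value" group: the leading "#" is left out of the value.
$hashtag = sketchTokenValue('/^#(?<value>.[a-z0-9_]+)/ui', '#php8 rocks');

// Regex without a "value" group: the whole match becomes the value.
$custom = sketchTokenValue('/^this a custom token/ui', 'this a custom token 2023');
```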
The regexp should always start with ^ and should not constrain the end of the string with $, because the query is split into tokens using the preg_replace_callback() function.
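The effect of the anchors can be checked in isolation. The snippet below is a small sketch with assumed four-digit year patterns (not the library's YearType regex): a token is usually followed by the rest of the query, so a trailing $ prevents the match.

```php
<?php
// Assumed patterns, for illustration only.
$remaining = '2023 paris';

// Anchored at the start only: matches the leading token and ignores what follows.
$prefixOnly = preg_match('/^\d{4}/', $remaining, $m);

// Anchored at both ends: fails because " paris" follows the year.
$wholeString = preg_match('/^\d{4}$/', $remaining);
```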
use CodeInc\QueryTokensExtractor\Type\RegexType;
class MyCustomTokenType extends RegexType
{
    public function __construct()
    {
        parent::__construct(
            name: 'my_custom_token',
            regexp: '/^this a custom token/ui'
        );
    }
}
// alternatively tokens can be defined directly using the RegexType class
$myCustomToken2 = new RegexType(
    name: 'my_custom_token',
    regexp: '/^this a custom token/ui'
);
The extracted token value can be formatted using the valueFormatter closure. The closure takes the extracted value as its argument and must return the formatted value.
use CodeInc\QueryTokensExtractor\QueryTokensExtractor;
use CodeInc\QueryTokensExtractor\Type\RegexType;
$tokensExtractor = new QueryTokensExtractor([
    new RegexType(
        name: 'my_custom_token',
        regexp: '/^this a custom token/ui',
        // a simple closure called by QueryToken::getFormattedValue()
        valueFormatter: fn($value) => strtoupper($value)
    )
]);
$tokens = $tokensExtractor->extract('this a custom token');
$tokens->getByPosition(0)->getFormattedValue(); // THIS A CUSTOM TOKEN
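For readers who want to see the formatter mechanism without installing the package, here is a self-contained sketch. The class SketchToken is hypothetical and is not the library's QueryToken; returning the raw value when no formatter is set is an assumption of this sketch.

```php
<?php
// Hypothetical stand-in for a token carrying an optional formatter closure.
final class SketchToken
{
    public function __construct(
        public readonly string $value,
        private readonly ?Closure $valueFormatter = null,
    ) {
    }

    public function getFormattedValue(): string
    {
        // Assumption: without a formatter, the raw value is returned unchanged.
        return $this->valueFormatter !== null
            ? ($this->valueFormatter)($this->value)
            : $this->value;
    }
}

$upper = new SketchToken('this a custom token', fn (string $v): string => strtoupper($v));
$plain = new SketchToken('this a custom token');
```

Keeping the raw value and the formatted value separate means the original text stays available even when a formatter is attached.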
This library is published under the MIT license (see the LICENSE file).