\b treats only ASCII characters as letters, so characters with diacritics count as non-word characters.
As a result, a word like leçon is split in two: "le" is a stopword and gets removed, and the search is run
on "çon" (which is not a French word [1], so it returns no results).
[1] "con" (without the cedilla) is a French word, but I won't tell you what it means...
Anyway, there is probably no "con" in most catalogues ;-)
Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
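
The behaviour described above can be reproduced outside Koha with a minimal sketch. It assumes the operand reaches the regex as a native 8-bit string (ç stored as the single byte \xE7, without Perl's UTF-8 flag) and that no locale or unicode_strings feature is active; that is the situation in which \b sees ç as a non-word character. The helper names strip_with_b and strip_with_isalpha are hypothetical, and \Q...\E is added here only as a precaution; the patch itself interpolates the stopword directly.

```perl
use strict;
use warnings;

# Hypothetical helper: the original approach, one \b-delimited substitution.
sub strip_with_b {
    my ( $operand, $stopword ) = @_;
    $operand =~ s/\b\Q$stopword\E\b//i;
    return $operand;
}

# Hypothetical helper: the three substitutions introduced by the patch.
# \P{IsAlpha} matches any character that is NOT alphabetic in Unicode terms.
sub strip_with_isalpha {
    my ( $operand, $stopword ) = @_;
    $operand =~ s/\P{IsAlpha}\Q$stopword\E\P{IsAlpha}/ /i;    # stopword in the middle
    $operand =~ s/^\Q$stopword\E\P{IsAlpha}/ /i;              # stopword at the start
    $operand =~ s/\P{IsAlpha}\Q$stopword\E$/ /i;              # stopword at the end
    return $operand;
}

# Assumption: "leçon" as a native 8-bit string, ç being the byte \xE7.
my $word = "le\xE7on";

printf "\\b pattern:           word %s\n",
    strip_with_b( $word, 'le' ) eq $word ? 'kept intact' : 'truncated';
printf "\\P{IsAlpha} pattern:  word %s\n",
    strip_with_isalpha( $word, 'le' ) eq $word ? 'kept intact' : 'truncated';

# A real stopword followed by a space is still removed by the new patterns.
printf "stopword removed:     '%s'\n", strip_with_isalpha( 'le chat', 'le' );
```

With these assumptions, the \b version strips "le" out of the word, while the \P{IsAlpha} version leaves the word alone and still removes a standalone "le" followed by a space.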
for ( my $i = 0 ; $i <= @operands ; $i++ ) {
my $operand = $operands[$i];
# remove stopwords from operand : parse all stopwords & remove them (case insensitive)
+ # we use the Unicode IsAlpha definition to handle diacritics correctly.
+ # otherwise, a French word like "leçon" is split into "le" and "çon"; "le" is a stopword, so we are left with "çon"
+ # and find nothing...
foreach (keys %{C4::Context->stopwords}) {
- $operand=~ s/\b$_\b//i;
+ $operand=~ s/\P{IsAlpha}$_\P{IsAlpha}/ /i;
+ $operand=~ s/^$_\P{IsAlpha}/ /i;
+ $operand=~ s/\P{IsAlpha}$_$/ /i;
}
my $index = $indexes[$i];
my $stemmed_operand;