Kind of a major issue with the search block (lack of UTF-8 support)

Just started looking into this, but the current search block seems to strip everything that isn't Latin-1… I develop sites in Swedish, and it strips all our precious little "åäö" characters from searches, rendering it rather useless.
(The create-aliases-from-page-name thingy also strips all non-Latin characters. What about character folding rules there?)
For the search, maybe use Sphinx as the search engine? Stemming and character folding rules are kind of neat to have around :) not to mention support for misspellings using, for example, Aspell dictionaries and/or custom dicts…
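
For anyone wondering what character folding could look like for aliases, here is a minimal sketch. It is in Python rather than PHP and is not how concrete5 builds its aliases; `fold` and `make_alias` are made-up names for illustration. It decomposes each accented letter and drops the combining mark, so "ä" becomes "a" instead of vanishing.

```python
import re
import unicodedata


def fold(text: str) -> str:
    """Fold accented letters to their base letters (hypothetical helper)."""
    # NFKD splits "ä" into "a" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def make_alias(page_name: str) -> str:
    """Build a URL alias from a page name (hypothetical helper)."""
    folded = fold(page_name).lower()
    # Any run of characters outside a-z and 0-9 becomes a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", folded).strip("-")


print(make_alias("Blocks vänster"))  # blocks-vanster
```

Folding "å" and "ä" both to "a" loses information in Swedish, so whether that is acceptable for aliases is a judgment call.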

I haven't the faintest idea what kind of pain that would be to implement, but it's an idea…
As I would really like to see site search become usable.
They always suck beyond crap.
What I love about Google is the spell corrector. I don't even bother to check my spelling when searching, as I know that 9 times out of 10 it will find what I meant.
That feature should be available in a site search… or you might as well just use the "site:" operator on Google to do searches :)
Anyway, does anyone have any experience with the search block stripping characters from the search term? How do I fix it?

philiph
 
katz515 replied:
I think the search index itself is built on MySQL full-text search.

We Japanese have a different reason why we cannot use the MySQL search index.

But I was able to build the search index in MySQL.

There are a few possible reasons why this might not work for you:

1. The collation of your MySQL database (utf8_general_ci)
2. PHP's internal encoding
3. Your server may not have mbstring installed


These are my php.ini settings:

default_charset = UTF-8
mbstring.language = neutral
mbstring.internal_encoding = UTF-8
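
To show why the encoding settings matter, here is a small sketch in Python (not PHP, and not concrete5's code). It assumes a pipeline that reads UTF-8 bytes as if they were Latin-1 and then strips everything except plain ASCII letters, digits and spaces; the function names are made up for illustration.

```python
import re


def misread_as_latin1(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def strip_non_ascii(text: str) -> str:
    """Keep only ASCII letters, digits and spaces (assumed filter)."""
    return re.sub(r"[^A-Za-z0-9 ]", "", text)


word = "vänster"
# 7 characters, but 8 bytes in UTF-8 because "ä" takes two bytes.
print(len(word), len(word.encode("utf-8")))  # 7 8

# Read as Latin-1, the two bytes of "ä" turn into two junk characters,
# and an ASCII-only filter then removes them entirely.
print(strip_non_ascii(misread_as_latin1(word)))  # vnster
```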


I checked that the search block was able to index UTF-8 characters.
philiph replied:
Right, I seem to have those things in place. And yes, it does indeed index UTF-8 characters.

But… when actually searching, I get conflicting results.

For example, searching for "Blocks vänster" I get results (including UTF-8 chars):
http://foretagsfokus.se/index.php/examples/search?search_paths%5B%5...

However, when searching for only "vänster" I get nothing… why is this?
http://foretagsfokus.se/index.php/examples/search?search_paths%5B%5...
katz515 replied:
I see your point...

That's indeed weird.

If the search query were no more than 3 characters, you could put it down to a MySQL setting (the full-text minimum word length)...

http://www.concrete5.org/index.php?cID=9672...

But your string is long enough....
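
A rough sketch of that length check, in Python for illustration only. It assumes the default `ft_min_word_len` of 4 that MySQL uses for MyISAM full-text indexes and counts characters, not bytes; your server may be configured differently, and `indexable_words` is a made-up helper.

```python
# Assumed MySQL default for MyISAM full-text indexes; check your server.
FT_MIN_WORD_LEN = 4


def indexable_words(query: str, min_len: int = FT_MIN_WORD_LEN) -> list[str]:
    """Return the query words long enough to be indexed (hypothetical helper)."""
    return [word for word in query.split() if len(word) >= min_len]


print(indexable_words("Blocks vänster"))  # both words pass
print(indexable_words("åäö"))             # too short, nothing left
```

So "vänster", at 7 characters, is well above the limit, and the word length does not explain the missing result.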


Oh I found out why~!

Hey check your HTML code.

It's "v&auml;nster" in the HTML... That's the reason why it doesn't come up... I think.
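
Here is a small sketch of the idea, in Python for illustration. It assumes the stored page content contains the named entity `&auml;` and that a plain substring match stands in for the real search; `contains_word` is a made-up helper. Decoding the entities before matching (or before indexing) makes the word findable.

```python
import html


def contains_word(text: str, word: str) -> bool:
    """Case-insensitive substring match, standing in for a real search."""
    return word.lower() in text.lower()


# Assumed stored content, with the editor's escaped "ä".
stored = "<p>Blocks v&auml;nster</p>"

print(contains_word(stored, "vänster"))                 # False
print(contains_word(html.unescape(stored), "vänster"))  # True
print(contains_word(stored, "Blocks"))                  # True
```

If the search returns pages that match any word in the query, "Blocks" alone would be enough to bring the page up, which might be why the two-word search still gets results. That part is a guess.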
philiph replied:
I believe concrete5 / TinyMCE escapes characters… damn, how would one go about solving this problem? And how does the search find "vänster" when it's coupled with the word next to it? I'm at a loss…
katz515 replied:
When I translated concrete5 into Japanese, I had to repackage TinyMCE with the Japanese version of the files, which are under

/concrete/js/tiny_mce/

You may have to do the same with the Swedish one.

But one simple solution would be to add a Google search field to your website and have Google index your site.