1
donuts
4y

How do case insensitive searches work?

Usually if I needed case insensitivity, I'd probably just store all the keys in lowercase and for searches convert whatever is typed to lowercase.

But what about Google. Can't convert everything to lower.

And if need to check the lower and upper case of each letter, like in a Tree, you would need to go down 2^N paths?

Comments
  • 3
    Downcase the data (and input) in your comparator?
  • 0
    Well, to be able to talk about these *cases*, they have to be letters. They don't have to be latin. Once you have a lexicon of all alphabets (with the unicode codes) it will be easy to implement.
  • 0
    Also the more accurate the casing is the more relevant the result is. Dunno how u gonna calculate that but a for loop could do it easily
  • 0
    @Root yes but is there an algorithm solution? How do DBs and well Google do it or even Word, Chrome, notepad? Even Regex allows case insensitive matching?
  • 0
    @melezorus34 Ok so say I enter all permutations of "aVeryLongString" into a "database" along with some other values.

    I will search the same, but it should return all matches in their original casing.
  • 0
    @billgates
    if you are searching for "aSd" case insensitive the regexp would be
    /[Aa][Ss][Dd]/
    You need the lexicon bro. Maybe get one with klington.
  • 0
    I'm just wondering how Google, databases, text editors do it. Chrome, you Find any text and it will pretty quickly highlight all matches
  • 0
    @billgates well, in js we can just do this:

    var regexr= new Regexr(input,"i"),
    arrayOfObjectsWithNames=[(...)];

    //We need the ones matching so
    return arrayOfObjectsWithNames
    .filter(object=>regexr.test(object.name))

    //Is it too complex? Heck yeah, I can't help it.
  • 0
    @melezorus34 ok but that's my question? The only way I can think of is it runs a finite state machine where the next match is either A or a ... But that doesn't work for Google or databases?

    It would take forever given how much data they hold
  • 0
    @billgates
    You can have outputs with regexr search bubbles.
    SomeHTMLText.replace(/(the thin)/gi ,
    matched=>`<mark>${matched}</mark>`);
    Will match any "the thin" in any text, case insensitive, and wrap them around highlighting tags.
  • 0
    @billgates Sir, you are having a conversation with a child now. I can't optimise my thinking speed to code and debug at the same time. I don't know what trick they have at their sleeves. Probs docker swarm/k8s.
  • 0
    Also, my second comment states that there could be title/text relevance to sort, e.g. 500, which would be picken after they
    SORT BY audit_score
    Or something I don't know. Its 3 am.
  • 1
    I wish you wisdom through this question, while asking nicely to ping me too when you find a proper solution. Mine is a hypotetical solution. **Do not take an advice from a child on coffee²**
  • 0
    At a guess, when Google builds their indexes of all of the Internet, they do something clever to store the search keys in lower case.
  • 2
    There's graduate programs on that subject.

    https://elastic.co/elasticon/conf/...
  • 5
    @billgates one thing Google does is split the index up over thousands of machines.

    The query is sent to each which returns the local matches to an aggregator that gets results from many index servers and that then returns to the searcher.

    Thats a simplified solution.

    Its quite possible that the index servers store a lower case version for first match and then only do case sensitive on the resulting matches but google does not always explain everything.

    They used to have some very good papers on it though but its several years since I took time reading them.
  • 0
    database use btree index
    big data use map reduce
    ngram - key value approach where you store each part of word in key value index ex from at least 2 characters for devrant it will be:
    de
    dev
    devr
    devra
    devran
    devrant

    you can combine all above techniques + add some rank and/or machine learning, geo awarness to improve results
  • 0
    At database level you can use a language coalition which is not case sensitive.
  • 0
    @NoToJavaScript you mean collation? But I'm not sure how that works.

    Like even if you tell it, A=a, B=b, etc. The only way I can think of would be to create a index where the keys are all lowercase and then map them to the original values?
  • 0
    @billgates Yep, sorry :)

    An example, we have this :

    [Name] NVARCHAR (MAX) COLLATE SQL_Latin1_General_CP1_CI_AI NOT NULL,

    That works for search : not case sensetive, also ignores frenchs acents.

    But I agree it is not "universal". It works for our case (french/english, not case sensetive)
Add Comment