C/C++ fuzzy string search library bounty (claimed)

This bounty has been claimed and is now unavailable.

I can grant $200 to a developer upon completion of a C or C++ library that roughly mirrors fuse.js’s functionality. Your API must roughly provide the features in this API example.

template <typename Key = std::string>
struct FuzzySearchDatabase {
	void addEntry(Key key, std::vector<std::string> strings);
	void setWeights(std::vector<float> weights);
	void setThreshold(float);
	std::vector<Key> search(std::string query);
};

Code must be 100% correct and not crash for all inputs. You may release the library under an MIT/BSD-like open-source license, or assign the copyright to VCV. Email contact@vcvrack.com if you’d like to develop this project.

5 Likes

I’m considering loosening the definition of “fuzzy” for this one. If you don’t want to deal with string distance algorithms (like Levenshtein), I’ll accept an algorithm that only matches substrings per token. This means that typing “stru di” should match “Audible Instruments”. Both queries and searched strings should be tokenized/separated into words. Like fuse.js, you should return a [0, 1] score for every match using a creative scoring system, so a string equal to “stru di” should obviously have a better score.

Even with this loosened requirement, I’m still offering $200 for a high-quality API and implementation.

1 Like

This is for VCV Library I assume? Interesting.

SQlite & FTS with clever queries, prefix etc.? https://www.sqlite.org/fts5.html

Using a heavy library with lots of unnecessary parts is a negative for me, but if you think that’s the best way, see the first post to apply.

I’m looking into it. SQLite seems like a sledgehammer to me. I would prefer to code it up myself. Have not formally applied for the job yet, but it’s an interesting task.

For anyone who’s interested in this, here’s an article on Sublime Text’s fuzzy search that’s well worth reading: https://www.forrestthewoods.com/blog/reverse_engineering_sublime_texts_fuzzy_match/

didn’t there used to be some annual prize in information retrieval?

Well @Vortico there is C++_ there. I have not had time to study it, but would that do?

Why not using what already exists - https://github.com/luceneplusplus/LucenePlusPlus ?

Afaik there is hardly any better search engine than Lucene.

This FuzzyQuery unit test should give a hint how to use this.

@andrew.bernard : that implementation would need both adaptation to the VCV Rack use case (multiple fields w/weights) and tuning to the domain (CamelCase is much less meaningful outside of source file names).

Also, Sublime Text 3 handles transpositions but that implementation doesn’t include that.

But it’s an interesting starting point.

Lucene in C++ seems to me to be the way to go.

Did you take a look how large and complicated that library seems? It looks like a huge code base. (Also has a dependency on Boost.)

I think you are missing the point. The requirements are formulated clearly in the opening post. Also, this request is about a database without permanent storage.

Just a reminder that this is a job offer, not a discussion thread. I’m closing this thread because I don’t think anything needs to be discussed beyond my initial two posts.

1 Like