Lucene on OpenCL sounds neat!
In Lucene's nightly indexing benchmarks (http://home.apache.org/~mikemccand/lucenebench/indexing.html) I index an export of Wikipedia's English content, including terms, docIDs, term frequencies, positions, and also points, doc values and stored fields. The full (messy!) source code is in this repository: https://github.com/mikemccand/luceneutil.
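
For concreteness, here is a minimal sketch of indexing one page with all of those parts through the stock Lucene API. The field names ("body", "title", "pageId") and values are placeholders I made up for illustration, not what luceneutil actually uses:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class IndexOnePage {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(Paths.get("/path/to/index")),
            new IndexWriterConfig(new StandardAnalyzer()));
        Document doc = new Document();
        // Inverted field: terms, docIDs, term frequencies, positions
        doc.add(new TextField("body", "full page text ...", Field.Store.NO));
        // Stored field, returned verbatim with search hits
        doc.add(new StoredField("title", "Some page title"));
        // Point, indexed into the BKD tree for range/exact queries
        doc.add(new IntPoint("pageId", 42));
        // Doc values: column-stride storage for sorting/faceting
        doc.add(new NumericDocValuesField("pageId", 42));
        writer.addDocument(doc);
        writer.close();
      }
    }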
Both initial indexing and merging are CPU/IO intensive, but both are very amenable to soaking up the hardware's concurrency.
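
As a rough sketch of what "soaking up the concurrency" means in practice (this is not luceneutil's actual driver): IndexWriter is thread-safe, so you can feed one writer from many indexing threads, while the default ConcurrentMergeScheduler runs merges on its own background threads. The nextDocument() source here is hypothetical:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Reuses the writer from the previous sketch:
    ExecutorService pool = Executors.newFixedThreadPool(8);
    for (int i = 0; i < 8; i++) {
      pool.execute(() -> {
        Document doc;
        // nextDocument() is a hypothetical source of parsed Wikipedia pages
        while ((doc = nextDocument()) != null) {
          try {
            writer.addDocument(doc);  // safe to call from multiple threads
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    writer.close();  // also waits for any still-running merges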
On whether there's a market, that's beyond my pay grade ;) I just work on
the bits! Different users care about different things.
On Fri, Jun 17, 2016 at 6:52 PM, Steve Casselman wrote:
Hi Mike. I’m writing code for the Altera OpenCL SDK. I have a code base that gives me a non-Lucene-format index. I was wondering, in your benchmark, what kind of data do you collect? Do you collect all the position and frequency data? I’m also curious about what you see as the biggest bottleneck in creating an index: is it building the index from the data, merging the indexes, or something else? Do you feel the algorithm is CPU-, memory- or disk-bound? And finally, do you think there is a market for accelerated indexing? Say I could quadruple the price/performance yet still produce 100% Lucene-compatible indexes; would people pay for that?