
[Lucene] Which one is better - Lucene OR Google Search Appliance

J. Delgado
Nov 30, 2008 at 5:11 pm

On Fri, Nov 28, 2008 at 1:28 AM, Mike_SearchGuru wrote:
Many thanks for your responses. Yes, you are right, we should not be
considering license costs, and I agree with you.

Let me further answer your questions:
1) Each PDF file is on average about 100 pages long and 4 MB in size.

Have you considered chopping the document into 100 separate pages and
indexing those while storing in a field the link to the complete doc? In
that way you get relevant hits at the page level and can navigate back to
the original doc.

If you need help with this, I think I have a PDF "chopper" script (Windows)
somewhere; it uses PDFBox. Otherwise it should be relatively easy to do
yourself.
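
A minimal sketch of the page-level idea, assuming the text of each page has already been extracted (for example with PDFBox; the extraction step is not shown here). The field names `doc_url`, `page`, and `text`, the helper functions, and the example URL are all hypothetical and only illustrate the record layout: one index entry per page, each carrying a link back to the complete document. The inverted index is a toy in-memory structure, not Lucene.

```python
from collections import defaultdict

def page_records(doc_url, page_texts):
    """Turn one document into one record per page, each linking back to the full doc."""
    return [
        {"doc_url": doc_url, "page": number, "text": text}
        for number, text in enumerate(page_texts, start=1)
    ]

def build_index(records):
    """Toy inverted index: term -> set of (doc_url, page) pairs."""
    index = defaultdict(set)
    for record in records:
        for term in record["text"].lower().split():
            index[term].add((record["doc_url"], record["page"]))
    return index

def search(index, term):
    """Return page-level hits, sorted so the output is stable."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical three-page document; the URL is a placeholder.
records = page_records(
    "http://example.com/report.pdf",
    ["Annual report summary", "Revenue grew this year", "Revenue outlook"],
)
index = build_index(records)
hits = search(index, "revenue")
print(hits)
```

Each hit names both the page and the original document, so a result list can show the relevant page while still linking to the full PDF.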

However, we are not indexing the whole lot. We will only be indexing a
few parts, i.e. the headlines in the PDF files. So I would say only some 5% of
each document will ever be indexed.
2) All files are in English.
3) We don't need any control over how the PDFs are indexed.
4) Every week we have an increase of 5000 PDFs that need to be indexed.
5) We need a facility whereby we can create multiple indexes so that we can
keep the size of these indexes as small as possible, BUT when a query is
fired we want to be able to pull information from all these multiple
indexes.
6) No need for any access controls.
7) On the time factor: if it takes 1 sec to index a PDF file (assuming that
the content to index is 30 KB), then we will be in trouble, as we can't wait 93
days for everything to be indexed. So what we might do is split our docs
into multiple parts and index them separately on separate servers (maybe 10
servers), and that should cut the 93 days to about 9 days. The question here is:
can we then group all those indexes on one server later on when going live?
8) Currently our PDF file size for all 8 million already adds up to 40
terabytes.
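
Points 5 and 7 both come down to building several small indexes independently and then either querying them together or merging them into one. The sketch below illustrates both operations on toy in-memory inverted indexes. It is not Lucene code, and the function names and file names are hypothetical. In Lucene itself the analogous facilities are `MultiReader` (search several indexes as one) and `IndexWriter.addIndexes` (merge indexes), but check the API documentation for the version you use.

```python
from collections import defaultdict

def build_shard(docs):
    """Build one small inverted index (term -> set of doc ids) from {doc_id: text}."""
    shard = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            shard[term].add(doc_id)
    return shard

def search_all(shards, term):
    """Query every shard and combine the hits, as if they were one index."""
    hits = set()
    for shard in shards:
        hits |= shard.get(term.lower(), set())
    return sorted(hits)

def merge_shards(shards):
    """Merge independently built shards into a single index.

    Assumes doc ids are unique across shards.
    """
    merged = defaultdict(set)
    for shard in shards:
        for term, doc_ids in shard.items():
            merged[term] |= doc_ids
    return merged

# Two shards, as if built on two separate servers (hypothetical data).
shard_a = build_shard({"a1.pdf": "Quarterly earnings headline", "a2.pdf": "Merger headline"})
shard_b = build_shard({"b1.pdf": "Earnings forecast"})

federated_hits = search_all([shard_a, shard_b], "earnings")
merged_hits = sorted(merge_shards([shard_a, shard_b])["earnings"])
print(federated_hits, merged_hits)
```

Both approaches return the same hits here; the trade-off is between keeping the indexes small and separate at query time versus paying a one-off merge cost before going live.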



awarnier wrote:
Mike_SearchGuru wrote:
OK, basically we have 8 million PDFs to index and we have good technical
people in our company.

The question is: is Lucene slower than GSA in terms of indexing PDFs?
Are there any license costs if it is used commercially? If yes, then what
are the costs?
What are the downsides of Lucene as opposed to GSA? These are my
questions, and if you can answer them it will be a great help.

Thanks
Ali



Ian Holsman wrote:
Mike_SearchGuru wrote:
We are evaluating Lucene at the moment and also considering Google
Search Appliance. Is there anyone who can guide us on which one is better,
apart from Google being expensive, as we have 8 million PDFs to index?

Can someone help us by clearly identifying which one is better?
Hi Mike.

Firstly, GSA is so much more than just a search library, which is what
Lucene is. In your analysis you should be looking at things like Solr
(which will give you a web interface to the Lucene library), and Tika
or Nutch to actually put your documents into the index itself.

As for which is better, we have no idea what your requirements are
(aside from wanting to avoid spending money) or what your
organization's technical capabilities are (are you willing to spend 1-3
getting up to speed with the open source tools, for example?), so it will
be hard for us to judge.
Hi.
I am not an expert on either GSA or Lucene, but reading your description
above, I would ask myself a couple of questions first of all.

You have 8 million PDFs which you want to index. That is, presumably,
to make their content searchable later by some users.
Let's say that you go through the entire collection of PDFs, and index
every single word in them, no matter with which tool (both GSA and
Lucene can do that).

Assuming that these 8 million PDFs are all in English, you have a good
chance that just about any word of the English language will occur
thousands of times. So, a user searching for something will find
thousands of hits, just like when you search in Google. Will that be
useful to them ?
In other words, the question is: do you want some control over how the
8 million PDFs are going to be indexed, or not?

The second question is about access. When your documents are all
indexed, should then any user be able to access any item of the
collection ? or do you want some form of access-control, to determine
who gets access to what ?

The answer to the above will already provide some elements to make
choices.

A couple more notes :
- assume it takes just 1 second to read and index one PDF document. You
have 8,000,000 documents, and there are 86,400 seconds in a day.
Assuming no delays at all in passing these documents over any kind of
network, that means that it would take 93 days to index the collection.
- assume one PDF document contains on average 30 KB of pure text. A
reasonable estimate for full-text indexing is an index that is, in size,
approximately 3 times as large as the original text.
You do the calculation.
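
The two estimates above can be worked through as follows. This is only a back-of-the-envelope sketch using the figures quoted in this thread (1 second per document, 30 KB of text per document, an index about 3 times the text size). The 10-server case is the split proposed elsewhere in the thread and assumes the work divides perfectly evenly; decimal units (1 GB = 1,000,000 KB) are also an assumption.

```python
DOCS = 8_000_000
SECONDS_PER_DOC = 1
SECONDS_PER_DAY = 86_400
TEXT_KB_PER_DOC = 30
INDEX_TO_TEXT_RATIO = 3   # index size relative to the raw text
SERVERS = 10              # assumes a perfectly even split of the work

# Indexing time on one server, then spread across several servers.
days_single = DOCS * SECONDS_PER_DOC / SECONDS_PER_DAY
days_parallel = days_single / SERVERS

# Raw text volume and the resulting index size, in decimal gigabytes.
text_gb = DOCS * TEXT_KB_PER_DOC / 1_000_000
index_gb = text_gb * INDEX_TO_TEXT_RATIO

print(f"single server: {days_single:.1f} days")
print(f"{SERVERS} servers: {days_parallel:.1f} days")
print(f"text: {text_gb:.0f} GB, index: {index_gb:.0f} GB")
```

Under these assumptions the single-server run takes about 92.6 days (the 93 days quoted above), ten servers bring that to about 9.3 days, and 240 GB of text yields an index of roughly 720 GB.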

You might thus want to analyse this seriously, and not make a decision
based purely on the cost of a license.
--
View this message in context:
http://www.nabble.com/Which-one-is-better---Lucene-OR-Google-Search-Appliance-tp20725398p20731258.html
Sent from the Lucene - General mailing list archive at Nabble.com.