On Fri, Nov 28, 2008 at 1:28 AM, Mike_SearchGuru wrote:
Many thanks for your responses. Yes, you are right that we should not base the decision on license costs alone, and I agree with you.

Let me further answer your questions:
1) each PDF file is on average about 100 pages long and 4 MB in size.

Have you considered chopping the document into 100 separate pages and
indexing those while storing in a field the link to the complete doc? In
that way you get relevant hits at the page level and can navigate back to
the original doc.

If you need help on this I think I have a PDF "chopper" script (Windows)
somewhere ... it uses PDFBox. Otherwise it should be relatively easy to do.

However, we are not indexing the whole lot. We will only be indexing very
few parts, i.e. the headlines in the PDF files. So I would say only some 5% of
each document will ever be indexed.
2) all files are in English
3) we don't need any control over how the PDFs are indexed.
4) every week we have an increase of 5,000 PDFs that need to be indexed
5) we need a facility whereby we can create multiple indexes so that we can
keep the size of these indexes as small as possible, BUT when a query is
fired we want to be able to pull information from all these multiple indexes.
6) no need for any access controls
7) on the time factor - if it takes 1 sec to index a PDF file (assuming the
content to index is 30 KB), then we will be in trouble, as we can't wait 93
days for everything to be indexed. So what we might do is split our docs into
multiple parts and index them separately on separate servers (maybe 10
servers), which should cut the 93 days down to about 9 days. The question here
is: can we then group all those indexes on one server later on when going live?
8) currently our PDF file size for all 8 million adds up to 40 terabytes
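The arithmetic behind point 7 can be sketched quickly. Note that the 1-second-per-PDF figure and the 10-server count are this thread's assumptions, not measurements:

```python
# Back-of-the-envelope estimate for the sharded indexing plan in point 7.
# Assumptions from this thread (not measurements): 1 second to extract and
# index one PDF, 8 million PDFs, and 10 servers working in parallel.

TOTAL_DOCS = 8_000_000
SECONDS_PER_DOC = 1.0
SECONDS_PER_DAY = 86_400
SERVERS = 10

single_server_days = TOTAL_DOCS * SECONDS_PER_DOC / SECONDS_PER_DAY
sharded_days = single_server_days / SERVERS

print(f"one server: {single_server_days:.1f} days")  # ~92.6 days
print(f"ten servers: {sharded_days:.1f} days")       # ~9.3 days
```

On the follow-up question: Lucene indexes built on separate machines can later be searched together or merged into a single index (IndexWriter has an addIndexes method for this), so consolidating on one server when going live is feasible; the exact mechanics depend on the Lucene version.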

awarnier wrote:
Mike_SearchGuru wrote:
OK, basically we have 8 million PDFs to index and we have good technical
people in our company.

The question is: is Lucene slower than GSA in terms of indexing PDFs?
Are there any license costs if used commercially? If yes, then what are
the costs? What are the downsides of Lucene as opposed to GSA? These are my
questions, and if you can answer them it will be a great help.


Ian Holsman wrote:
Mike_SearchGuru wrote:
We are evaluating Lucene at the moment and also considering the Google
Search Appliance. Is there anyone who can guide us on which one is better,
apart from Google being expensive, as we have 8 million PDFs to index?

Can someone help us by clearly identifying which one is better.
Hi Mike.

Firstly, GSA is much more than just a search library, which is what
Lucene is. In your analysis you should be looking at things like Solr
(which will give you a web interface to the Lucene library), and Tika or
Nutch to actually get your documents into the index itself.

As for which is better, we have no idea what your requirements are
(aside from wanting to avoid spending money) or what your
organization's technical capabilities are (are you willing to spend 1-3
months getting up to speed with the open source tools, for example), so it
will be hard for us to judge.
I am not an expert on either GSA or Lucene, but reading your description
above, I would first ask myself a couple of questions.

You have 8 million PDFs which you want to index. That is, presumably,
to make their content searchable later by some users.
Let's say that you go through the entire collection of PDFs, and index
every single word in them, no matter with which tool (both GSA and
Lucene can do that).

Assuming that these 8 million PDFs are all in English, you have a good
chance that just about any word of the English language will occur
thousands of times. So, a user searching for something will find
thousands of hits, just like when you search in Google. Will that be
useful to them?
In other words, the question is: do you want some control over how the
8 million PDFs are going to be indexed, or not?

The second question is about access. When your documents are all
indexed, should any user then be able to access any item of the
collection? Or do you want some form of access control, to determine
who gets access to what?

The answer to the above will already provide some elements to make a decision.

A couple more notes :
- assume it takes just 1 second to read and index one PDF document. You
have 8,000,000 documents, and there are 86,400 seconds in a day.
Assuming no delays at all in passing these documents over any kind of
network, that means that it would take 93 days to index the collection.
- assume one PDF document contains on average 30 KB of pure text. A
reasonable estimate for full-text indexing is an index that is, in size,
approximately 3 times as large as the original text.
You make the calculation.
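That calculation, using the 30 KB-per-document and 3x-ratio estimates above (both rough guesses, not measurements), works out as:

```python
# Rough index-size estimate from the figures in the note above:
# 8 million PDFs, ~30 KB of extracted text each, index ~3x the raw text.

DOCS = 8_000_000
TEXT_KB_PER_DOC = 30
INDEX_TO_TEXT_RATIO = 3

total_text_gb = DOCS * TEXT_KB_PER_DOC / 1_000_000  # KB -> GB (decimal)
index_gb = total_text_gb * INDEX_TO_TEXT_RATIO

print(f"raw extracted text: ~{total_text_gb:.0f} GB")  # ~240 GB
print(f"estimated index:    ~{index_gb:.0f} GB")       # ~720 GB
```

So roughly 720 GB of index: far smaller than the 40 TB of source PDFs, but large enough that splitting into multiple indexes, as proposed earlier in the thread, is worth considering.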

You might thus want to analyse this seriously, and not make a decision
based purely on the cost of a license.
Sent from the Lucene - General mailing list archive at Nabble.com.

posted Nov 27, '08 at 8:49p · active Dec 4, '08 at 5:15p