FAQ
I'm stepping through an RDF file (the Project Gutenberg catalog) and sending
data to a Lucene index to allow searches of titles, authors and such. However,
the Gutenberg RDF is a little bit "special". It has two sections: one for
titles, authors, collaborators and such, and (after all the books) a second
section that has the download links. The connection is a kind of foreign key
that exists on both tags (a unique numeric id). While I don't need to search
on the download link, I do need to store it.

I'm memory limited and can't hold the 200 MB catalog file in memory. I'm
wondering if there is some way for me to use the numeric id to connect the two
kinds of information without having to keep everything in memory. A first pass
for the things I want, and a second using the numeric id? It seems very
clumsy. I'm not actually using a database, and I don't want to use very large
libraries; Compass is 60 MB (!). I tried LuceneSail for a while, but it has
stopped working and the code is a mess (it is not adapted to the filtering of
the Gutenberg RDF that I'm doing).
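
For reference, the catalog can be walked in constant memory with a pull
parser rather than a DOM. Below is a minimal sketch using StAX, assuming the
old catalog.rdf layout where book records are pgterms:etext elements carrying
the numeric id in rdf:ID and download records are pgterms:file elements
carrying the link in rdf:about; the element and attribute names are
assumptions, so adjust them to the actual file:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    // NOTE: the element/attribute names below are assumptions about the
    // old catalog.rdf layout, not verified against the file.
    public class CatalogScan {
        private static final String RDF_NS =
                "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

        public static void main(String[] args) throws Exception {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new BufferedInputStream(
                            new FileInputStream("catalog.rdf")));
            while (r.hasNext()) {
                if (r.next() != XMLStreamConstants.START_ELEMENT) continue;
                if ("etext".equals(r.getLocalName())) {
                    // Book record: numeric id is in rdf:ID, e.g. "etext12345".
                    System.out.println("book " + r.getAttributeValue(RDF_NS, "ID"));
                } else if ("file".equals(r.getLocalName())) {
                    // Download record: rdf:about holds the link; the matching
                    // id sits in a child element (dcterms:isFormatOf, not shown).
                    System.out.println("file " + r.getAttributeValue(RDF_NS, "about"));
                }
            }
            r.close();
        }
    }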

  • Erick Erickson at Oct 31, 2010 at 8:46 pm
    Hmmmm. Are you too memory limited to do a first pass through the file and
    save the key/download-link part in a map, then make another pass through
    the file indexing the data and grabbing the link from your map? I'm
    assuming that there's a lot less than 200 MB in just the key/link part.

    Alternatively (and this would probably be kinda slow, but...) still do a
    two-pass process, but instead of building a map, put the data in a Lucene
    index on disk. Then the second pass searches that index for the data to
    add to the docs in your "real" index.

    HTH
    Erick
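
    A minimal sketch of the map variant, using the Lucene 3.x API current at
    the time; the RDF parsing that fills the map and yields each book record
    is elided, and the field names and sample values are hypothetical
    placeholders:

        import java.io.File;
        import java.util.HashMap;
        import java.util.Map;

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.FSDirectory;
        import org.apache.lucene.util.Version;

        public class TwoPassIndexer {
            public static void main(String[] args) throws Exception {
                // Pass 1 (parsing elided): keep only id -> link, which should
                // be far smaller than the 200 MB file itself.
                Map<String, String> links = new HashMap<String, String>();

                IndexWriter writer = new IndexWriter(
                        FSDirectory.open(new File("index")),
                        new StandardAnalyzer(Version.LUCENE_30),
                        IndexWriter.MaxFieldLength.UNLIMITED);

                // Pass 2 (parsing elided): for each book record, join on id.
                String id = "12345", title = "...", author = "..."; // placeholders
                Document doc = new Document();
                doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
                doc.add(new Field("author", author, Field.Store.YES, Field.Index.ANALYZED));
                String link = links.get(id);
                if (link != null) {
                    // Stored but not indexed: retrievable, not searchable.
                    doc.add(new Field("link", link, Field.Store.YES, Field.Index.NO));
                }
                writer.addDocument(doc);
                writer.close();
            }
        }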
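
    And a sketch of the disk-backed alternative: pass 1 writes the (id, link)
    pairs into a scratch Lucene index instead of a HashMap, and pass 2 looks
    each link up by id with a TermQuery. The field names and the id value are
    again hypothetical:

        import java.io.File;

        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.TermQuery;
        import org.apache.lucene.search.TopDocs;
        import org.apache.lucene.store.FSDirectory;

        public class LinkLookup {
            public static void main(String[] args) throws Exception {
                // Pass 1 (elided) stored docs with a NOT_ANALYZED "id" field
                // and a stored-only "link" field in this scratch index.
                IndexSearcher searcher = new IndexSearcher(
                        IndexReader.open(FSDirectory.open(new File("scratch-index"))));

                // During pass 2, resolve the link as each book is indexed.
                TopDocs hits = searcher.search(new TermQuery(new Term("id", "12345")), 1);
                String link = hits.totalHits > 0
                        ? searcher.doc(hits.scoreDocs[0].doc).get("link")
                        : null;
                System.out.println(link);
                searcher.close();
            }
        }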
  • Paulo Levi at Oct 31, 2010 at 9:17 pm
    Yes, that's what I ended up doing. I will probably "fork" a new Java VM
    instead of doing it in the same one. That way I can control the memory
    requirements, though it hasn't given me any problems (it actually even
    worked with -Xmx, though it probably wouldn't if I did something else in
    the program at the same time). I'm not indexing the book subjects yet
    either; I'll need to do some sort of string caching for those, and for
    the authors.
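
    A minimal sketch of forking the indexer into a child VM so its heap cap
    is independent of the parent; TwoPassIndexer is a hypothetical main class
    assumed to be on the same classpath:

        import java.io.File;
        import java.io.InputStream;

        public class ForkIndexer {
            public static void main(String[] args) throws Exception {
                String java = new File(
                        new File(System.getProperty("java.home"), "bin"),
                        "java").getPath();
                ProcessBuilder pb = new ProcessBuilder(
                        java,
                        "-Xmx32m",                            // cap the child heap
                        "-cp", System.getProperty("java.class.path"),
                        "TwoPassIndexer");                    // hypothetical main class
                pb.redirectErrorStream(true);                 // merge stderr into stdout
                Process p = pb.start();
                // Drain output so the child can't block on a full pipe.
                InputStream in = p.getInputStream();
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    System.out.write(buf, 0, n);
                }
                System.out.println("indexer exited with " + p.waitFor());
            }
        }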
  • Paulo Levi at Oct 31, 2010 at 9:18 pm
    I meant -Xmx32m
  • Erick Erickson at Oct 31, 2010 at 11:16 pm
    32M is tiny. Is this a self-imposed memory constraint or do you really have
    hardware that's that limited? I ask because "just give the VM more memory"
    is the very first option I'd suggest...

    Best
    Erick
  • Paulo Levi at Nov 1, 2010 at 8:54 am
    Just self-imposed.
