Grokbase Groups: Lucene dev, April 2002
Thread: InputStream handling problem
Hi,

I just joined the list, and I'm not sure if this is the correct place to ask, but
I believe my problem is a development issue. I searched the archives but
didn't find any reference to it being raised before.

I am creating a Directory implementation for the JDataStore database from Borland,
and I have the following problem: Lucene tries to delete a file that is
still open by some of the InputStreams.

JDataStore has direct support for streams and files, so my stream
implementation does not do much - it opens the underlying stream and
delegates calls to it. But JDataStore does not allow you to delete a file
while there is at least one open stream on it.

I wrote code that monitors all open streams on a file and closes
them before deleting. Then I hit another problem: after the file is deleted,
there is still activity (readInternal(...)) on streams that were closed
before the deletion. This seems to be a bug.

Also, I modified RAMDirectory and implemented a mechanism for counting
references to each RAMFile (+1 when a stream is opened, -1 when a stream is
closed). In RAMDirectory.deleteFile(String) I check the reference count and,
if it is > 0, throw a java.lang.Error (if you throw an IOException from
Directory.deleteFile(String), Lucene does not seem to notice it). I do get
these errors when running ThreadSafetyTest (I had to make a small
modification so that a single Directory instance is reused instead of being
created on demand).
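As a rough illustration, the reference counting described above could be sketched like this (RefCountingStore, streamOpened, and streamClosed are hypothetical names, not the actual RAMDirectory API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the reference counting described above: each open
// stream increments a per-file counter, closing decrements it, and
// deleteFile refuses to remove a file whose counter is still positive.
public class RefCountingStore {
    private final Map<String, Integer> refCounts = new HashMap<>();

    public synchronized void streamOpened(String name) {   // +1 on open
        refCounts.merge(name, 1, Integer::sum);
    }

    public synchronized void streamClosed(String name) {   // -1 on close
        refCounts.merge(name, -1, Integer::sum);
    }

    public synchronized void deleteFile(String name) {
        Integer count = refCounts.get(name);
        if (count != null && count > 0) {
            // Analogous to the java.lang.Error thrown in the modified RAMDirectory
            throw new IllegalStateException(
                "Cannot delete file while there's interest in it: " + name);
        }
        refCounts.remove(name);
    }
}
```

The synchronized methods matter because ThreadSafetyTest exercises the directory from several threads at once.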

Should I post my changes that show the problem here?

Thank you in advance.

Best regards,
Roman Rokytskyy




  • Peter Carlson at Apr 22, 2002 at 2:37 pm
    Thanks for the work Roman,

    Please send a test case, that will be really helpful.

    --Peter
  • Roman Rokytskyy at Apr 22, 2002 at 2:49 pm
    Please send a test case, that will be really helpful.
    Thanks for the quick reply.

    Attached are two modified classes from the latest CVS update. Please
    comment out or remove any references to JDSDirectory (the files are copies
    from my working environment, and JDSDirectory depends on JDataStore
    classes, so it is not included).

    Best regards,
    Roman Rokytskyy
  • Roman Rokytskyy at Apr 22, 2002 at 2:51 pm
    Sorry for the double copies of each file. This was an Outlook error. I will
    be more careful next time.
    -----Original Message-----
    From: Roman Rokytskyy
    Sent: Monday, 22 April 2002 16:56
    To: Lucene Developers List
    Subject: RE: InputStream handling problem

  • Otis Gospodnetic at Apr 25, 2002 at 2:43 am
    Hello,

    I just used your classes (I picked the ones that looked right) and ran
    ThreadSafetyTest, but I can't get it to throw any Errors/Exceptions.

    I can send you the 2 .java files I picked from the 4 that you sent;
    perhaps I picked the wrong ones...

    Otis


    --- Roman Rokytskyy wrote:
    ATTACHMENT part 2 application/x-javascript name=ThreadSafetyTest.java
    ATTACHMENT part 3 application/x-javascript name=RAMDirectory.java
    ATTACHMENT part 4 application/x-javascript name=RAMDirectory.java
    ATTACHMENT part 5 application/x-javascript name=ThreadSafetyTest.java
  • Otis Gospodnetic at Apr 25, 2002 at 3:36 am
    Ah, sorry. After I hit send, I realized that I should have cleared my
    CLASSPATH; an old jar with the old RAMDirectory was being used.

    I do get the error now:
    java.lang.Error: Cannot delete file while there's interest in it
        at org.apache.lucene.store.RAMDirectory.deleteFile(RAMDirectory.java:145)
        at org.apache.lucene.index.IndexWriter.deleteFiles(IndexWriter.java:364)
        at org.apache.lucene.index.IndexWriter.deleteSegments(IndexWriter.java:345)
        at org.apache.lucene.index.IndexWriter.access$200(IndexWriter.java:87)
        at org.apache.lucene.index.IndexWriter$2.doBody(IndexWriter.java:325)
        at org.apache.lucene.store.Lock$With.run(Lock.java:116)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:322)
        at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:283)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:185)
        at org.apache.lucene.ThreadSafetyTest$IndexerThread.run(ThreadSafetyTest.java:112)


    Well, you may be onto something, although I don't see the culprit yet.
    At first I thought you forgot to add this in RAMDirectory:

    public final OutputStream createFile(String name) {
      RAMFile file = new RAMFile();
      files.put(name, file);
      // OG
      incInterest(name); // this
      return new RAMOutputStream(file, name);
    }

    However, that causes the error to happen right away, which I didn't
    expect, since I am increasing 'interest'. Oh, I see - you are increasing
    it in the RAMOutputStream constructor...

    Hm, well, it looks like I am only able to confirm your observations:

    [otis@linux2 classes]$ java org.apache.lucene.ThreadSafetyTest &> o.log
    [otis@linux2 classes]$ grep Error o.log
    java.lang.Error: Cannot delete file while there's interest in it:
    _b.fdx
    [otis@linux2 classes]$
    [otis@linux2 classes]$ grep _b.fdx o.log
    Increased interest in _b.fdx to 1
    Decreased interest in _b.fdx to 0
    Increased interest in _b.fdx to 1
    Increased interest in _b.fdx to 2
    Increased interest in _b.fdx to 3
    Decreased interest in _b.fdx to 2
    Increased interest in _b.fdx to 3
    Increased interest in _b.fdx to 4
    Increased interest in _b.fdx to 5
    Increased interest in _b.fdx to 6
    Decreased interest in _b.fdx to 5
    Increased interest in _b.fdx to 6
    Increased interest in _b.fdx to 7
    Increased interest in _b.fdx to 8
    Decreased interest in _b.fdx to 7
    Increased interest in _b.fdx to 8
    java.lang.Error: Cannot delete file while there's interest in it:
    _b.fdx
    Decreased interest in _b.fdx to 7

    It looks like the places where you increase interest and the places where
    you decrease or drop it are balanced. The superclass of RAMDirectory has
    only abstract methods, so nothing is happening there.

    I don't know - are you sure that what you are seeing really is a problem,
    i.e. that it is wrong to get rid of a file for which there is interest?
    It sounds logical, but maybe Doug wrote something that we can't find
    that makes this an okay thing to do. If this is a bug, I wonder why more
    people haven't complained about it...

    Sorry I couldn't help more, maybe somebody else can find the problem.

    Otis



  • Dmitry Serebrennikov at Apr 25, 2002 at 3:53 am

    Otis Gospodnetic wrote:
    I don't know - are you sure that what you are seeing really is a problem,
    i.e. that it is wrong to get rid of a file for which there is interest?
    It sounds logical, but maybe Doug wrote something that we can't find
    that makes this an okay thing to do. If this is a bug, I wonder why more
    people haven't complained about it...

    Sorry I couldn't help more, maybe somebody else can find the problem.

    Otis
    I think originally the code was written to work on Unix, where deleting
    an open file is OK - it simply removes the directory entry, so no one
    else can open the file anymore. The OS keeps track of the file's readers
    and cleans up after the last reader has closed the file (or died). Later
    on, when the code was moved to Windows, there was a problem with this,
    because Windows will refuse to delete a file if it is open. So the delete
    code would fail (I think File.delete() just returns false; no exceptions
    are thrown). To work around this, there is the "deletable" file, which
    lists the segments that could be deleted (but weren't, because they were
    still open). Periodically this file is checked and the deletes are
    attempted again.

    So, I think Otis is right - it's really not a "problem", though it is an
    interesting design question. There is an issue of whether it is good
    practice to make use of OS-specific behavior in this way; obviously,
    portability suffers. I'm not sure if there are performance arguments one
    way or another (Doug?).

    Dmitry.



  • Roman Rokytskyy at Apr 25, 2002 at 9:49 am

    So, I think Otis is right - it's really not a "problem", though it is an
    interesting design question. There is an issue of whether it is good
    practice to make use of OS-specific behavior in this way; obviously,
    portability suffers. I'm not sure if there are performance arguments one
    way or another (Doug?).
    My main concern here was that I get exceptions from JDataStore, and I had
    to add checks for whether there is still an open stream. If you say that
    the file will be deleted, I have no problem with it. :) I will trust that
    eventually all streams will be closed and the file will be deleted.

    Another interesting design issue is cloneable input streams (this is where
    the problem comes from). Since I had never encountered them before, it
    would be great to add some Javadoc that explains what state a cloned input
    stream is supposed to be in. In JDataStore, input streams are random
    access and I can seek wherever I want, but what if some stream does not
    provide such functionality?

    Best regards,
    Roman Rokytskyy


  • Dmitry Serebrennikov at Apr 25, 2002 at 6:39 pm

    Roman Rokytskyy wrote:

    So, I think Otis is right - it's really not a "problem", though it is an
    interesting design question. There is an issue of whether it is good
    practice to make use of OS-specific behavior in this way; obviously,
    portability suffers. I'm not sure if there are performance arguments one
    way or another (Doug?).
    My main concern here was that I get exceptions from JDataStore, and I had
    to add checks for whether there is still an open stream. If you say that
    the file will be deleted, I have no problem with it. :) I will trust that
    eventually all streams will be closed and the file will be deleted.
    I'm pretty sure it's OK, but you may want to run a few tests to make sure.
    Another interesting design issue is cloneable input streams (this is where
    the problem comes from). Since I had never encountered them before, it
    would be great to add some Javadoc that explains what state a cloned input
    stream is supposed to be in. In JDataStore, input streams are random
    access and I can seek wherever I want, but what if some stream does not
    provide such functionality?
    Yes, I forgot about that one. It's even more interesting than that! The
    stream objects that Doug coded are not java.io streams; they are wrappers
    on top of those. Each clone maintains its own seek offset. Essentially,
    they share the same OS file handle but present an abstraction of multiple
    independent streams into the same file.

    Dmitry.



  • Roman Rokytskyy at Apr 25, 2002 at 6:53 pm

    Yes, I forgot about that one. It's even more interesting than that! The
    stream objects that Doug coded are not java.io streams; they are wrappers
    on top of those. Each clone maintains its own seek offset. Essentially,
    they share the same OS file handle but present an abstraction of multiple
    independent streams into the same file.
    Sorry, but isn't file handle sharing something specific to FSInputStream?
    Why do we force that at the abstract class level?

    I would suggest a factory pattern, where an input stream is created for a
    file and how this is handled is up to the implementation: FSDirectory will
    share handles, RAMDirectory will hold references to the same RAMFile
    object, and my JDataStoreDirectory will rely on JDataStore to manage it
    efficiently.

    Should I try to rewrite it? (I would also appreciate your opinion on
    whether I should touch that code at all.)

    Thanks,
    Roman Rokytskyy


  • Dmitry Serebrennikov at Apr 25, 2002 at 10:01 pm

    Roman Rokytskyy wrote:

    Yes, I forgot about that one. It's even more interesting than that! The
    stream objects that Doug coded are not java.io streams; they are wrappers
    on top of those. Each clone maintains its own seek offset. Essentially,
    they share the same OS file handle but present an abstraction of multiple
    independent streams into the same file.
    Sorry, but isn't file handle sharing something specific to FSInputStream?
    Why do we force that at the abstract class level?
    I'm sorry, I should have been more specific. The file handle is only in
    the picture when FSInputStream is cloned. From what I can tell after a
    quick look, InputStream is responsible for buffering, and it delegates to
    subclasses (via a call to readInternal) to refill the buffer from the
    underlying data store. When cloned, the InputStream clones the buffer
    (in the hope that the next read will still hit the buffered data, I
    suppose), but after that it has its own seek position and its own buffer.
    In the case of FSInputStream, a Descriptor object is shared between the
    clones; in the case of RAMInputStream, the RAMFile is the shared object.
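A minimal sketch of the clone semantics just described (class and method names are hypothetical, not Lucene's actual stream classes): clones share the underlying data but each keeps its own position.

```java
// Hypothetical illustration: clones share the underlying byte[] ("the file")
// but each maintains its own seek position, like Lucene's cloned streams
// sharing one OS file handle / RAMFile while seeking independently.
public class SharedDataInput implements Cloneable {
    private final byte[] data;  // shared between clones
    private int position;       // private per clone

    public SharedDataInput(byte[] data) { this.data = data; }

    public void seek(int pos) { position = pos; }

    public byte readByte() { return data[position++]; }

    @Override
    public SharedDataInput clone() {
        try {
            // Shallow copy: the data array stays shared, the position is copied.
            return (SharedDataInput) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);  // cannot happen, we implement Cloneable
        }
    }
}
```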

    I would suggest a factory pattern, where an input stream is created for a
    file and how this is handled is up to the implementation: FSDirectory will
    share handles, RAMDirectory will hold references to the same RAMFile
    object, and my JDataStoreDirectory will rely on JDataStore to manage it
    efficiently.
    Perhaps a factory pattern would be more flexible, but it looks like the
    existing code does a pretty good job for the RAM and FS cases. Would the
    factory pattern allow a better database implementation?

    Should I try to rewrite it? (I would also appreciate your opinion on
    whether I should touch that code at all.)
    I don't know, I have not heard many complaints about that code recently.
    There is activity in terms of creating a crawler / content handler
    framework. There is also a need to handle "update" better, I think. For
    example, I think it would be great to have deletes go through
    IndexWriter and get "cached" in the new segment, to be later applied to
    the prior segments during optimization. This would make deletes and adds
    transactional.
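The "cached deletes" idea could look roughly like this (an entirely hypothetical sketch with made-up names; this is not how IndexWriter worked at the time):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: deletions are buffered alongside newly added documents
// and only applied to prior segments during a merge/optimize, so a batch of
// adds and deletes becomes visible in one step.
public class BufferedDeletesSketch {
    private final List<String> addedDocs = new ArrayList<>();
    private final List<String> bufferedDeleteTerms = new ArrayList<>();

    public void addDocument(String doc) { addedDocs.add(doc); }

    public void deleteByTerm(String term) { bufferedDeleteTerms.add(term); }

    // At merge time the buffered deletes are handed off and cleared together,
    // which is what would make the adds and deletes transactional.
    public List<String> flushDeletesForMerge() {
        List<String> toApply = new ArrayList<>(bufferedDeleteTerms);
        bufferedDeleteTerms.clear();
        return toApply;
    }
}
```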

    Another thing on my wish / todo list is to reduce the number of OS files
    that must be open. Once you get a lot of indexes, with a number of
    stored fields, and keep re-indexing them, the number of open files grows
    rather quickly. And if Lucene is part of another program that already
    has other file IO needs, you end up quickly pushing into the max open
    files limit of the OS. The idea I have for this one is to implement a
    different kind of segment - one that is composed of a single file. Once
    a segment is created by IndexWriter, it never changes (besides the
    deletes), so it could easily be stored as a single file.
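A single-file segment could be sketched along these lines (hypothetical names; Lucene did later add a "compound file" format in this spirit): concatenate the sub-files and keep a small directory of name -> (offset, length).

```java
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a one-file segment: sub-file bodies are concatenated
// and a directory maps each name to its (offset, length), so every logical
// file remains independently readable through a single OS file handle.
public class CompoundFileSketch {
    private static final class Entry {
        final int offset, length;
        Entry(int offset, int length) { this.offset = offset; this.length = length; }
    }

    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    private final Map<String, Entry> directory = new LinkedHashMap<>();

    public void addFile(String name, byte[] contents) {
        directory.put(name, new Entry(data.size(), contents.length));
        data.write(contents, 0, contents.length);
    }

    public byte[] readFile(String name) {
        Entry e = directory.get(name);
        byte[] all = data.toByteArray();
        byte[] out = new byte[e.length];
        System.arraycopy(all, e.offset, out, 0, e.length);
        return out;
    }
}
```

This works precisely because, as noted above, a finished segment never changes (besides deletes), so the offsets stay valid.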

    These are just a few areas that are my favorites... But then again, if
    you see another problem that's in your way, chances are that there are
    other people out there with the same issue.

    In any case, good luck!
    Dmitry.

  • Roman Rokytskyy at Apr 26, 2002 at 10:14 am

    I'm sorry, I should have been more specific. The file handle is only in
    the picture when FSInputStream is cloned. From what I can tell after a
    quick look, InputStream is responsible for buffering, and it delegates to
    subclasses (via a call to readInternal) to refill the buffer from the
    underlying data store. When cloned, the InputStream clones the buffer
    (in the hope that the next read will still hit the buffered data, I
    suppose), but after that it has its own seek position and its own buffer.
    In the case of FSInputStream, a Descriptor object is shared between the
    clones; in the case of RAMInputStream, the RAMFile is the shared object.
    What is the reason to have a buffer in RAMInputStream? To have another
    copy of the same data?
    Perhaps a factory pattern would be more flexible, but it looks like the
    existing code does a pretty good job for the RAM and FS cases. Would the
    factory pattern allow a better database implementation?
    It might. If you use an embedded database like JDataStore, you should not
    cache data internally; the database does this. So the buffer and cache
    simply introduce additional memory consumption.
    I don't know, I have not heard many complaints about that code recently.
    Ok, I will try it "as is" with JDataStore, and if it works - fine.
    There is activity in terms of creating a crawler / content handler
    framework. There is also a need to handle "update" better, I think. For
    example, I think it would be great to have deletes go through
    IndexWriter and get "cached" in the new segment, to be later applied to
    the prior segments during optimization. This would make deletes and adds
    transactional.
    Ok, I will have a look, but I have almost no experience with Lucene.
    Another thing on my wish / todo list is to reduce the number of OS files
    that must be open. Once you get a lot of indexes, with a number of
    stored fields, and keep re-indexing them, the number of open files grows
    rather quickly. And if Lucene is part of another program that already
    has other file IO needs, you end up quickly pushing into the max open
    files limit of the OS. The idea I have for this one is to implement a
    different kind of segment - one that is composed of a single file. Once
    a segment is created by IndexWriter, it never changes (besides the
    deletes), so it could easily be stored as a single file.
    I will check this with JDataStore. Maybe we could borrow a couple of
    ideas from them (like the built-in file system)... This would simplify
    life: one file for all indices, transaction support(?), backup, etc.

    Thanks!
    Roman Rokytskyy


  • Dmitry Serebrennikov at Apr 26, 2002 at 4:02 pm

    Roman Rokytskyy wrote:

    I'm sorry, I should have been more specific. The file handle is only in
    the picture when FSInputStream is cloned. From what I can tell after a
    quick look, InputStream is responsible for buffering, and it delegates to
    subclasses (via a call to readInternal) to refill the buffer from the
    underlying data store. When cloned, the InputStream clones the buffer
    (in the hope that the next read will still hit the buffered data, I
    suppose), but after that it has its own seek position and its own buffer.
    In the case of FSInputStream, a Descriptor object is shared between the
    clones; in the case of RAMInputStream, the RAMFile is the shared object.
    What is the reason to have a buffer in RAMInputStream? To have another
    copy of the same data?
    Good point. That just goes to show that I shouldn't try to be an authority
    on the topic without taking a more detailed look at the whole picture.
    Perhaps a factory pattern would be more flexible, but it looks like the
    existing code does a pretty good job for the RAM and FS cases. Would the
    factory pattern allow a better database implementation?
    It might. If you use an embedded database like JDataStore, you should not
    cache data internally; the database does this. So the buffer and cache
    simply introduce additional memory consumption.
    I don't know, I have not heard many complaints about that code recently.
    Ok, I will try it "as is" with JDataStore, and if it works - fine.
    There is activity in terms of creating a crawler / content handler
    framework. There is also a need to handle "update" better, I think. For
    example, I think it would be great to have deletes go through
    IndexWriter and get "cached" in the new segment, to be later applied to
    the prior segments during optimization. This would make deletes and adds
    transactional.
    Ok, I will have a look, but I have almost no experience with Lucene.
    Another thing on my wish / todo list is to reduce the number of OS files
    that must be open. Once you get a lot of indexes, with a number of
    stored fields, and keep re-indexing them, the number of open files grows
    rather quickly. And if Lucene is part of another program that already
    has other file IO needs, you end up quickly pushing into the max open
    files limit of the OS. The idea I have for this one is to implement a
    different kind of segment - one that is composed of a single file. Once
    a segment is created by IndexWriter, it never changes (besides the
    deletes), so it could easily be stored as a single file.
    I will check this with JDataStore. Maybe we could borrow a couple of
    ideas from them (like the built-in file system)... This would simplify
    life: one file for all indices, transaction support(?), backup, etc.
    This JDataStore - I assume it is proprietary to Borland? The source isn't
    available, is it? Probably many of the problems they address won't exist
    in Lucene if we only use this for finished segments, since those will be
    read-only. I think there are a lot of issues related to fragmentation and
    growth of files that a filesystem has to address if it supports writing.

    Dmitry.




Discussion Overview
group: dev
category: lucene
posted: Apr 22, '02 at 1:53p
active: Apr 26, '02 at 4:02p
posts: 13
users: 4
website: lucene.apache.org
