Hello :-)

FYI, here is a list of apparent Tika 0.8 conversion failures when run
from Xapian's omindex on a Debian 6 Squeeze 64-bit system with 4 GB memory:

Type   Tried   Failed   Rate
doc    10268      345   3.35%
docx     248        0
odp        7        0
ods       71        0
odt      136        0
pdf     3888      150   3.85%
pps       29        3   10.34%
ppsx      12        0
ppt      331        0
pptx      24        0
rtf      698        1   0.14%
xls     3339        2   0.05%
xlsx      63        0

The statistics were generated by searching omindex output for
.$ext" failed
where $ext was each of the listed extensions in turn.
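
The counting step described above could be sketched roughly as follows. This is only an illustration, not the script actually used: the function names and the sample log line format are assumptions, and the only thing taken from the post is the search string `.$ext" failed`.

```python
from collections import Counter


def count_failures(log_lines, extensions):
    """Count lines containing `.<ext>" failed` for each extension.

    The trailing double quote keeps "doc" from also matching ".docx".
    """
    counts = Counter({ext: 0 for ext in extensions})
    for line in log_lines:
        for ext in extensions:
            if f'.{ext}" failed' in line:
                counts[ext] += 1
    return counts


def failure_rate(tried, failed):
    """Failure rate as a percentage; 0.0 when nothing was tried."""
    return 100.0 * failed / tried if tried else 0.0


# Hypothetical log lines; the real omindex output may look different.
sample_log = [
    '"java -jar tika.jar --text /srv/docs/a.doc" failed',
    '"java -jar tika.jar --text /srv/docs/b.docx" failed',
    'Indexing "/srv/docs/c.doc" ... added',
]
print(count_failures(sample_log, ["doc", "docx", "pdf"]))

# Rates recomputed from the counts reported in the table.
for ext, tried, failed in [("doc", 10268, 345), ("pdf", 3888, 150), ("xls", 3339, 2)]:
    print(f"{ext}: {failure_rate(tried, failed):.2f}%")
```

Rounded to two places, the reported counts give 3.36%, 3.86% and 0.06% for doc, pdf and xls, so the percentages in the table look truncated rather than rounded.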

More information can be supplied on request.

Best

Charles

  • Olly Betts at Sep 1, 2011 at 1:21 pm

    On Tue, Aug 09, 2011 at 09:14:20PM +0530, Charles wrote:
    > FYI, here is a list of apparent Tika 0.8 conversion failures when run
    > from Xapian's omindex on a Debian 6 Squeeze 64-bit system with 4 GB memory:
    > [...]
    > More information can be supplied on request.

    It would be interesting to know how these compare with the failure rates
    for other filter programs on the same set of documents.

    Without anything to compare these to, it's hard to know if they're good
    or bad. For example, perhaps all those 345 failed .doc files are
    "readme.doc" and actually plain text. Or perhaps they are all valid and
    would be read fine by a different filter.

    Cheers,
    Olly
  • Charles at Sep 29, 2011 at 11:01 am

    On 01/09/11 18:51, Olly Betts wrote:
    > It would be interesting to know how these compare with the failure rates
    > for other filter programs on the same set of documents.
    >
    > Without anything to compare these to, it's hard to know if they're good
    > or bad. For example, perhaps all those 345 failed .doc files are
    > "readme.doc" and actually plain text. Or perhaps they are all valid and
    > would be read fine by a different filter.

    Hello Olly :-)

    Sorry for delay; pressure of higher priorities including having typhoid
    fever.

    The omindex command is run with MIME types. For .doc files the option is:
    --filter "application/msword:java -jar $tika_jar --text
    so it seems unlikely that the failed .doc files are anything other than
    Word files.

    I plan to enhance the bash script that runs omindex, making the filters
    configurable. When that is done I plan to try a variety of filters and
    compare results. It may be that various filters fail on a different
    selection of files so this may give better coverage in the index.
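
One way to picture the "variety of filters" idea is a fallback chain: try each filter command in turn and keep the first one that produces text. The sketch below is an assumption about how that might look outside omindex, not something omindex does itself. `extract_text` and `FILTERS` are hypothetical names, the jar path is a placeholder, and `antiword` and `catdoc` are just examples of other .doc converters.

```python
import subprocess


def extract_text(path, filters, timeout=60):
    """Run each filter command on `path`; return the first non-empty output.

    `filters` is a list of argv lists; the file path is appended to each.
    Returns None if every filter fails, times out, or prints nothing.
    """
    for argv in filters:
        try:
            result = subprocess.run(
                [*argv, path], capture_output=True, timeout=timeout
            )
        except (OSError, subprocess.TimeoutExpired):
            continue  # filter not installed, or it hung: try the next one
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout.decode("utf-8", errors="replace")
    return None


# Hypothetical filter order for .doc files; adjust paths to your system.
FILTERS = [
    ["java", "-jar", "tika-app.jar", "--text"],
    ["antiword"],
    ["catdoc"],
]
```

Logging which filter finally succeeded for each file would also give the per-filter comparison Olly asked about.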

    Would you like a few of the failing .doc files to investigate?

    Best

    Charles
  • Olly Betts at Oct 5, 2011 at 2:38 pm

    On Thu, Sep 29, 2011 at 04:31:43PM +0530, Charles wrote:
    > Sorry for delay; pressure of higher priorities including having typhoid
    > fever.

    My...

    > The omindex command is run with MIME types. For .doc files the option is:
    > --filter "application/msword:java -jar $tika_jar --text
    > so it seems unlikely that the failed .doc files are anything other than
    > Word files.

    By default, omindex currently uses a list of extension->MIME
    content-type mappings, and only consults the magic library for
    extensions it doesn't know. So any file with a .doc extension will be
    considered as application/msword (unless you run omindex with
    '--mime-type=doc:').

    This is a bit dubious as it's pretty common to find files with a .doc
    extension which are actually RTF - that mechanism comes from before we
    had libmagic support. I think it is worth keeping as libmagic doesn't
    correctly identify every filetype, but we should probably trim the
    default list a bit.
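
The lookup order described above (extension table first, content sniffing only for unknown extensions) can be illustrated with a small sketch. This is not omindex's code: the table is a made-up subset, `sniff` is a tiny stand-in for libmagic that knows two signatures, and treating a blank mapping as "fall through to sniffing" is an assumption based on the `--mime-type=doc:` remark.

```python
from pathlib import Path

# Illustrative subset only; not omindex's real default table.
EXTENSION_MAP = {
    "doc": "application/msword",
    "rtf": "text/rtf",
    "pdf": "application/pdf",
}

# Signature of the OLE2 compound file format used by legacy Word files.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def sniff(data):
    """Stand-in for libmagic: recognise two signatures, else None."""
    if data.startswith(b"{\\rtf"):
        return "text/rtf"
    if data.startswith(OLE2_MAGIC):
        # Simplification: OLE2 containers also hold .xls, .ppt, etc.
        return "application/msword"
    return None


def mime_type(filename, data, ext_map=EXTENSION_MAP):
    """Trust the extension table first; sniff content only as a fallback."""
    ext = Path(filename).suffix.lstrip(".").lower()
    mapped = ext_map.get(ext)
    if mapped:  # a non-empty mapping wins without looking at the content
        return mapped
    return sniff(data)


rtf_bytes = b"{\\rtf1\\ansi Hello}"
print(mime_type("notes.doc", rtf_bytes))  # extension wins: application/msword

# Assumed equivalent of --mime-type=doc: (blank the mapping for .doc).
no_doc = {**EXTENSION_MAP, "doc": ""}
print(mime_type("notes.doc", rtf_bytes, no_doc))  # content wins: text/rtf
```

An RTF file named notes.doc is handed to the Word filter in the first case, which is the kind of failure this thread is discussing.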

    > I plan to enhance the bash script that runs omindex, making the filters
    > configurable. When that is done I plan to try a variety of filters and
    > compare results. It may be that various filters fail on a different
    > selection of files so this may give better coverage in the index.
    >
    > Would you like a few of the failing .doc files to investigate?

    Feel free to send a few, though I'm crazily busy at the moment so might
    not manage to investigate for a while. If they're OK to make public
    feel free to attach them to a ticket in trac which will allow others to
    investigate too.

    Cheers,
    Olly
  • James Aylett at Oct 5, 2011 at 3:23 pm

    On 5 Oct 2011, at 15:38, Olly Betts wrote:

    > By default, omindex currently uses a list of extension->MIME
    > content-type mappings, and only consults the magic library for
    > extensions it doesn't know. So any file with a .doc extension will be
    > considered as application/msword (unless you run omindex with
    > '--mime-type=doc:').
    >
    > This is a bit dubious as it's pretty common to find files with a .doc
    > extension which are actually RTF - that mechanism comes from before we
    > had libmagic support. I think it is worth keeping as libmagic doesn't
    > correctly identify every filetype, but we should probably trim the
    > default list a bit.

    Would it make sense to have a mode where libmagic is tried first, and if it fails to provide anything we can use we fall back to the internal table? We could configure it with something illegal at the start of a MIME type, such as '+'.
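
A rough sketch of that proposed mode, for illustration only. The '+' prefix is just the suggestion made in this message, not an existing omindex feature, and `resolve` is a hypothetical helper. `sniffed` stands for whatever libmagic returned, or None if it had no usable answer.

```python
def resolve(ext, ext_map, sniffed):
    """Pick a MIME type from the extension table and a libmagic result.

    A table entry starting with '+' means "ask libmagic first, and use the
    rest of the entry only if libmagic has no answer". Any other entry
    keeps the existing behaviour: the table wins, libmagic is the fallback.
    """
    entry = ext_map.get(ext, "")
    if entry.startswith("+"):
        return sniffed or entry[1:]
    return entry or sniffed


table = {"doc": "+application/msword", "pdf": "application/pdf"}
print(resolve("doc", table, "text/rtf"))  # libmagic wins
print(resolve("doc", table, None))        # falls back to the table entry
print(resolve("pdf", table, "text/plain"))  # plain entry: table wins
```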

    J

    --
    James Aylett
    talktorex.co.uk - xapian.org - devfort.com
  • Olly Betts at Oct 6, 2011 at 5:30 am

    On Wed, Oct 05, 2011 at 04:23:23PM +0100, James Aylett wrote:
    > Would it make sense to have a mode where libmagic is tried first, and
    > if it fails to provide anything we can use we fall back to the
    > internal table? We could configure it with something illegal at the
    > start of a MIME type, such as '+'.

    I suppose the question is if there are situations where libmagic says
    it doesn't know, and the extension tells us the type, but not reliably
    enough that we would want to just trust the extension. If there aren't
    situations where it would really help, it's just complicating the code
    and the mental model the user needs to build for no reason.

    (If the extension is reliable, then we can just use it as we do now,
    and if libmagic gives a wrong answer we wouldn't get to such a
    fallback.)

    It's definitely useful to have a list of "trustworthy" extensions which
    is checked first, as there are cases where libmagic thinks it knows the
    answer but is just wrong. The problem is usually due to a rule which is
    considered before the correct one not being specific enough, for
    example:

    http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=569147

    This sort of bug seems to be depressingly common, mostly because a
    lot of file formats lack reliable magic sequences.

    For .doc at least, I think the best approach is to just ask libmagic.

    Cheers,
    Olly
  • Charles at Oct 8, 2011 at 7:45 am

    On 05/10/11 20:08, Olly Betts wrote:
    > > Would you like a few of the failing .doc files to investigate?
    >
    > Feel free to send a few, though I'm crazily busy at the moment so might
    > not manage to investigate for a while. If they're OK to make public
    > feel free to attach them to a ticket in trac which will allow others to
    > investigate too.

    Would you prefer sample files that failed with the 0.10 version or the
    development 1.0 snapshot?

Discussion Overview
group: xapian-discuss
categories: xapian
posted: Aug 9, '11 at 3:44p
active: Oct 8, '11 at 7:45a
posts: 7
users: 3
website: xapian.org
irc: #xapian
