Hi all,

I have tried detecting several file formats, and so far the Microsoft Office
ones are the only ones that are not detected accurately.

If Tika's detect(File file) method is used, MS Office files are detected as
follows, and I guess this is the expected result:
doc - "application/msword"
docx - "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

But if Tika's detect(InputStream is) method is used, the picture is not the
same. The results are:
doc - "application/x-tika-msoffice"
docx - "application/x-tika-ooxml"

The test files were created with MS Office 2007.
I couldn't find out why I get different results for the same files.
Please let me know if I am doing something wrong, or if there is a good
reason for this behaviour.

Best Regards,
Nasko


  • Nick Burch at Sep 23, 2012 at 5:37 pm

    On Sun, 23 Sep 2012, naskoo wrote:
    > But If Tika's detect(InputStream is) method is used the picture is not the
    > same.
    > The results are:
    > doc - "application/x-tika-msoffice"
    > docx - "application/x-tika-ooxml"

    Try wrapping your InputStream as a TikaInputStream - for full container
    detection Tika needs to be able to read the whole file, but still have it
    available for the parser

    Nick
  • Naskoo at Sep 23, 2012 at 6:07 pm
    Thanks for the suggestion. That solves the problem to some extent.
    I ran some more tests, but this time I removed the MS file extensions.
    I get the same inconsistent results as before, even if I use
    TikaInputStream as a wrapper.
    Probably TikaInputStream just adds some metadata to include the file
    extension in the detection.
  • Jukka Zitting at Sep 23, 2012 at 7:34 pm
    Hi,
    On Sun, Sep 23, 2012 at 8:07 PM, naskoo wrote:
    > Thanks for the suggestion. That way the problem is solved at some point.
    > I run some more tests, but this time I removed the ms file extensions.
    > I get the same not consistent results as before, even if I use
    > TikaInputStream as a wrapper.
    > Probably TikaInputStream just adds some metadata to include the file
    > extension in the detection.

    It doesn't add extra metadata (unless explicitly requested). Instead
    the TikaInputStream class allows Tika parsers and detectors to use
    random access for reading the underlying file.

    The MS Office detectors (and a few other features in Tika) rely on
    that functionality, and thus won't give as accurate results when given
    just a plain InputStream instance.

    BR,

    Jukka Zitting
  • Nick Burch at Sep 23, 2012 at 9:39 pm

    On Sun, 23 Sep 2012, naskoo wrote:
    > Probably TikaInputStream just adds some metadata to include the file
    > extension in the detection.

    Nope, as I said:

    >> Try wrapping your InputStream as a TikaInputStream - for full container
    >> detection Tika needs to be able to read the whole file, but still have
    >> it available for the parser

    TikaInputStream provides this buffering, which allows a detector to read
    the whole file to identify what it contains (which container formats
    need), whilst still allowing a parser to get at the whole contents to
    process it

    Nick
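
For reference, the suggestion in this thread can be sketched roughly as
follows. This is a minimal sketch, not code posted by anyone above: the class
and method names (DetectComparison, detectPlain, detectWrapped) are made up
for illustration, and it assumes tika-core is on the classpath. As far as I
know, the container-aware Office detectors ship with tika-parsers, so without
that jar the wrapped call will probably still return a generic type.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;

public class DetectComparison {
    private static final Tika TIKA = new Tika();

    /** Plain stream: detectors can only peek at the buffered leading bytes. */
    static String detectPlain(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return TIKA.detect(in);
        }
    }

    /** Wrapped stream: detectors that need it can read the whole file. */
    static String detectWrapped(Path file) throws IOException {
        try (InputStream in = TikaInputStream.get(Files.newInputStream(file))) {
            return TIKA.detect(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input: path to a .doc or .docx, with or without extension.
        Path file = Paths.get(args[0]);
        System.out.println("plain:   " + detectPlain(file));
        System.out.println("wrapped: " + detectWrapped(file));
    }
}
```

Neither method passes a file name to Tika, so any difference between the two
results comes from how much of the content the detectors can read, not from
the extension.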

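To illustrate why the leading bytes alone are not enough for container
formats, here is a rough sketch using only the JDK. It is not how Tika is
implemented: the class ContainerSniffer, its method names and the returned
labels are all invented for this example, and checking for a
"word/document.xml" entry is a simplification of how a .docx is really
identified. The point is only that a header check can name the container
(OLE2 or ZIP), while naming the document type means reading inside it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ContainerSniffer {
    // OLE2 compound file signature, shared by .doc, .xls, .ppt and others.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature, shared by .docx, .xlsx, .jar and others.
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Header-only guess: names the container, not what is inside it. */
    static String sniffHeader(byte[] data) {
        if (startsWith(data, OLE2_MAGIC)) return "ole2-container";
        if (startsWith(data, ZIP_MAGIC)) return "zip-container";
        return "unknown";
    }

    /** Full read: walks the ZIP entries looking for a format-specific part. */
    static String sniffFull(byte[] data) {
        if (!startsWith(data, ZIP_MAGIC)) return sniffHeader(data);
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                if (e.getName().equals("word/document.xml")) return "docx";
                if (e.getName().equals("xl/workbook.xml")) return "xlsx";
            }
            return "zip-container";
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Builds a tiny in-memory ZIP with one empty entry, for demonstration. */
    static byte[] zipWith(String entryName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] fakeDocx = zipWith("word/document.xml");
        System.out.println(sniffHeader(fakeDocx)); // container only
        System.out.println(sniffFull(fakeDocx));   // type found inside
    }
}
```

The generic "application/x-tika-msoffice" and "application/x-tika-ooxml"
results in the question correspond to the header-only case here.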
posted Sep 23, '12 at 4:34p
active Sep 23, '12 at 9:39p