FAQ
Hello all,

I know, this is not the right group to ask this question, thought some of you guys might have experienced.

I newbie with Tika. I am using latest version 0.8 version. I extracted text from PDF document but found spaces and new line missing. Indexing the data gives wrong result. Could any one in this group could help me? I am using tika directly to extract the contents, which later gets indexed.

Regards
Ganesh
Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Search Discussions

  • Lance Norskog at Dec 3, 2010 at 7:00 am
    The text should come out as a stream of words with space, but without
    any of the formatting in the PDF. Extraction is only good enough to
    tell you that a word is somewhere inside a PDF file. Can you post a
    short bit of the text that it extracted?

    Also, you should try this test on different PDF files that were made
    with different software.
    On Thu, Dec 2, 2010 at 9:35 PM, Ganesh wrote:
    Hello all,

    I know, this is not the right group to ask this question, thought some of you guys might have experienced.

    I newbie with Tika. I am using latest version 0.8 version. I extracted text from PDF document but found spaces and new line missing. Indexing the data gives wrong result. Could any one in this group could help me? I am using tika directly to extract the contents, which later gets indexed.

    Regards
    Ganesh
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    --
    Lance Norskog
    goksron@gmail.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Alexander Aristov at Dec 3, 2010 at 9:10 am
    anyway even if you get correct whitespaces and new lines this won't affect
    indexing.

    Best Regards
    Alexander Aristov

    On 3 December 2010 10:00, Lance Norskog wrote:

    The text should come out as a stream of words with space, but without
    any of the formatting in the PDF. Extraction is only good enough to
    tell you that a word is somewhere inside a PDF file. Can you post a
    short bit of the text that it extracted?

    Also, you should try this test on different PDF files that were made
    with different software.
    On Thu, Dec 2, 2010 at 9:35 PM, Ganesh wrote:
    Hello all,

    I know, this is not the right group to ask this question, thought some of
    you guys might have experienced.
    I newbie with Tika. I am using latest version 0.8 version. I extracted
    text from PDF document but found spaces and new line missing. Indexing the
    data gives wrong result. Could any one in this group could help me? I am
    using tika directly to extract the contents, which later gets indexed.
    Regards
    Ganesh
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
    Download Now! http://messenger.yahoo.com/download.php
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    --
    Lance Norskog
    goksron@gmail.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Ganesh at Dec 3, 2010 at 10:44 am
    The main problem is i am not getting whitespace and newline char. This is happening only for PDF documents.

    Sample outoput: Someofthedifferencesare but it should be Some of the differences are

    Regards
    Ganesh

    ----- Original Message -----
    From: "Alexander Aristov" <alexander.aristov@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, December 03, 2010 2:39 PM
    Subject: Re: PDF text extracted without spaces

    anyway even if you get correct whitespaces and new lines this won't affect
    indexing.

    Best Regards
    Alexander Aristov

    On 3 December 2010 10:00, Lance Norskog wrote:

    The text should come out as a stream of words with space, but without
    any of the formatting in the PDF. Extraction is only good enough to
    tell you that a word is somewhere inside a PDF file. Can you post a
    short bit of the text that it extracted?

    Also, you should try this test on different PDF files that were made
    with different software.
    On Thu, Dec 2, 2010 at 9:35 PM, Ganesh wrote:
    Hello all,

    I know, this is not the right group to ask this question, thought some of
    you guys might have experienced.
    I newbie with Tika. I am using latest version 0.8 version. I extracted
    text from PDF document but found spaces and new line missing. Indexing the
    data gives wrong result. Could any one in this group could help me? I am
    using tika directly to extract the contents, which later gets indexed.
    Regards
    Ganesh
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
    Download Now! http://messenger.yahoo.com/download.php
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    --
    Lance Norskog
    goksron@gmail.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • McGibbney, Lewis John at Dec 3, 2010 at 11:10 am
    Hi Ganesh

    I encountered this same problem last week. I was thinking if it was possible to include at minimum a WhitespaceAnalyzer somewhere within Tika which would solve the problem. I am not sure of how this would be done as I am not familiar with Tika codebase.

    Unfortunately I don't think that the solution to the first part of this problem lies within the java-user mailing list.

    When were you sending extracted contents to Lucene... at what later stage?

    Thank you

    Lewis

    -----Original Message-----
    From: Ganesh
    Sent: 03 December 2010 10:44
    To: java-user@lucene.apache.org
    Subject: Re: PDF text extracted without spaces

    The main problem is i am not getting whitespace and newline char. This is happening only for PDF documents.

    Sample outoput: Someofthedifferencesare but it should be Some of the differences are

    Regards
    Ganesh

    ----- Original Message -----
    From: "Alexander Aristov" <alexander.aristov@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, December 03, 2010 2:39 PM
    Subject: Re: PDF text extracted without spaces

    anyway even if you get correct whitespaces and new lines this won't affect
    indexing.

    Best Regards
    Alexander Aristov

    On 3 December 2010 10:00, Lance Norskog wrote:

    The text should come out as a stream of words with space, but without
    any of the formatting in the PDF. Extraction is only good enough to
    tell you that a word is somewhere inside a PDF file. Can you post a
    short bit of the text that it extracted?

    Also, you should try this test on different PDF files that were made
    with different software.
    On Thu, Dec 2, 2010 at 9:35 PM, Ganesh wrote:
    Hello all,

    I know, this is not the right group to ask this question, thought some of
    you guys might have experienced.
    I newbie with Tika. I am using latest version 0.8 version. I extracted
    text from PDF document but found spaces and new line missing. Indexing the
    data gives wrong result. Could any one in this group could help me? I am
    using tika directly to extract the contents, which later gets indexed.
    Regards
    Ganesh
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
    Download Now! http://messenger.yahoo.com/download.php
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    --
    Lance Norskog
    goksron@gmail.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems

    Glasgow Caledonian University is a registered Scottish charity, number SC021474

    Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009
    http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
  • Ganesh at Dec 3, 2010 at 11:56 am
    I first extract the contents from documents using tika and latter index it with Lucene. The problem is the extracted text from PDF using tika has no whitespaces.

    Regards
    Ganesh


    ----- Original Message -----
    From: "McGibbney, Lewis John" <Lewis.McGibbney@gcu.ac.uk>
    To: <java-user@lucene.apache.org>
    Sent: Friday, December 03, 2010 4:40 PM
    Subject: RE: PDF text extracted without spaces

    Hi Ganesh

    I encountered this same problem last week. I was thinking if it was possible to include at minimum a WhitespaceAnalyzer somewhere within Tika which would solve the problem. I am not sure of how this would be done as I am not familiar with Tika codebase.

    Unfortunately I don't think that the solution to the first part of this problem lies within the java-user mailing list.

    When were you sending extracted contents to Lucene... at what later stage?

    Thank you

    Lewis

    -----Original Message-----
    From: Ganesh
    Sent: 03 December 2010 10:44
    To: java-user@lucene.apache.org
    Subject: Re: PDF text extracted without spaces

    The main problem is i am not getting whitespace and newline char. This is happening only for PDF documents.

    Sample outoput: Someofthedifferencesare but it should be Some of the differences are

    Regards
    Ganesh

    ----- Original Message -----
    From: "Alexander Aristov" <alexander.aristov@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, December 03, 2010 2:39 PM
    Subject: Re: PDF text extracted without spaces

    anyway even if you get correct whitespaces and new lines this won't affect
    indexing.

    Best Regards
    Alexander Aristov

    On 3 December 2010 10:00, Lance Norskog wrote:

    The text should come out as a stream of words with space, but without
    any of the formatting in the PDF. Extraction is only good enough to
    tell you that a word is somewhere inside a PDF file. Can you post a
    short bit of the text that it extracted?

    Also, you should try this test on different PDF files that were made
    with different software.
    On Thu, Dec 2, 2010 at 9:35 PM, Ganesh wrote:
    Hello all,

    I know, this is not the right group to ask this question, thought some of
    you guys might have experienced.
    I newbie with Tika. I am using latest version 0.8 version. I extracted
    text from PDF document but found spaces and new line missing. Indexing the
    data gives wrong result. Could any one in this group could help me? I am
    using tika directly to extract the contents, which later gets indexed.
    Regards
    Ganesh
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
    Download Now! http://messenger.yahoo.com/download.php
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    --
    Lance Norskog
    goksron@gmail.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems

    Glasgow Caledonian University is a registered Scottish charity, number SC021474

    Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009
    http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Ian Lea at Dec 3, 2010 at 12:09 pm
    Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
    Have you tried asking on the tika mailing list?
    http://tika.apache.org/mail-lists.html.


    --
    Ian.

    On Fri, Dec 3, 2010 at 11:55 AM, Ganesh wrote:
    I first extract the contents from documents using tika and latter index it with Lucene. The problem is the extracted text from PDF using tika has no whitespaces.

    Regards
    Ganesh


    ----- Original Message -----
    From: "McGibbney, Lewis John" <Lewis.McGibbney@gcu.ac.uk>
    To: <java-user@lucene.apache.org>
    Sent: Friday, December 03, 2010 4:40 PM
    Subject: RE: PDF text extracted without spaces

    Hi Ganesh

    I encountered this same problem last week. I was thinking if it was possible to include at minimum a WhitespaceAnalyzer somewhere within Tika which would solve the problem. I am not sure of how this would be done as I am not familiar with Tika codebase.

    Unfortunately I don't think that the solution to the first part of this problem lies within the java-user mailing list.

    When were you sending extracted contents to Lucene... at what later stage?

    Thank you

    Lewis

    -----Original Message-----
    From: Ganesh
    Sent: 03 December 2010 10:44
    To: java-user@lucene.apache.org
    Subject: Re: PDF text extracted without spaces

    The main problem is i am not getting whitespace and newline char. This is happening only for PDF documents.

    Sample outoput: Someofthedifferencesare but it should be Some of the differences are

    Regards
    Ganesh

    ----- Original Message -----
    From: "Alexander Aristov" <alexander.aristov@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, December 03, 2010 2:39 PM
    Subject: Re: PDF text extracted without spaces

    anyway even if you get correct whitespaces and new lines this won't affect
    indexing.

    Best Regards
    Alexander Aristov

    On 3 December 2010 10:00, Lance Norskog wrote:

    The text should come out as a stream of words with space, but without
    any of the formatting in the PDF. Extraction is only good enough to
    tell you that a word is somewhere inside a PDF file.  Can you post a
    short bit of the text that it extracted?

    Also, you should try this test on different PDF files that were made
    with different software.
    On Thu, Dec 2, 2010 at 9:35 PM, Ganesh wrote:
    Hello all,

    I know, this is not the right group to ask this question, thought some of
    you guys might have experienced.
    I newbie with Tika. I am using latest version 0.8 version. I extracted
    text from PDF document but found spaces and new line missing. Indexing the
    data gives wrong result. Could any one in this group could help me? I am
    using tika directly to extract the contents, which later gets indexed.
    Regards
    Ganesh
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
    Download Now! http://messenger.yahoo.com/download.php
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    --
    Lance Norskog
    goksron@gmail.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems

    Glasgow Caledonian University is a registered Scottish charity, number SC021474

    Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009
    http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Fabiano Nunes at Dec 3, 2010 at 1:52 pm
    Have you ever tried other extractor tool than PDFBox? I used to extract
    contents with pdfbox: its capability of extract contents wasn't a problem,
    but its lack of structure information was.
    You can try poppler-utils (pdftotext) to extract contents with
    layout structure.

    Fabiano Nunes




    On Fri, Dec 3, 2010 at 10:08 AM, Ian Lea wrote:

    Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
    Have you tried asking on the tika mailing list?
    http://tika.apache.org/mail-lists.html.


    --
    Ian.

    On Fri, Dec 3, 2010 at 11:55 AM, Ganesh wrote:
    I first extract the contents from documents using tika and latter index
    it with Lucene. The problem is the extracted text from PDF using tika has no
    whitespaces.
    Regards
    Ganesh


    ----- Original Message -----
    From: "McGibbney, Lewis John" <Lewis.McGibbney@gcu.ac.uk>
    To: <java-user@lucene.apache.org>
    Sent: Friday, December 03, 2010 4:40 PM
    Subject: RE: PDF text extracted without spaces

    Hi Ganesh

    I encountered this same problem last week. I was thinking if it was
    possible to include at minimum a WhitespaceAnalyzer somewhere within Tika
    which would solve the problem. I am not sure of how this would be done as I
    am not familiar with Tika codebase.
    Unfortunately I don't think that the solution to the first part of this
    problem lies within the java-user mailing list.
    When were you sending extracted contents to Lucene... at what later
    stage?
    Thank you

    Lewis

    -----Original Message-----
    From: Ganesh
    Sent: 03 December 2010 10:44
    To: java-user@lucene.apache.org
    Subject: Re: PDF text extracted without spaces

    The main problem is i am not getting whitespace and newline char. This
    is happening only for PDF documents.
    Sample outoput: Someofthedifferencesare but it should be Some of the
    differences are
    Regards
    Ganesh

    ----- Original Message -----
    From: "Alexander Aristov" <alexander.aristov@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, December 03, 2010 2:39 PM
    Subject: Re: PDF text extracted without spaces

    anyway even if you get correct whitespaces and new lines this won't
    affect
    indexing.

    Best Regards
    Alexander Aristov

    On 3 December 2010 10:00, Lance Norskog wrote:

    The text should come out as a stream of words with space, but without
    any of the formatting in the PDF. Extraction is only good enough to
    tell you that a word is somewhere inside a PDF file. Can you post a
    short bit of the text that it extracted?

    Also, you should try this test on different PDF files that were made
    with different software.
    On Thu, Dec 2, 2010 at 9:35 PM, Ganesh wrote:
    Hello all,

    I know, this is not the right group to ask this question, thought
    some of
    you guys might have experienced.
    I newbie with Tika. I am using latest version 0.8 version. I
    extracted
    text from PDF document but found spaces and new line missing. Indexing
    the
    data gives wrong result. Could any one in this group could help me? I
    am
    using tika directly to extract the contents, which later gets indexed.
    Regards
    Ganesh
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
    Download Now! http://messenger.yahoo.com/download.php
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    --
    Lance Norskog
    goksron@gmail.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
    Download Now! http://messenger.yahoo.com/download.php
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    Email has been scanned for viruses by Altman Technologies' email
    management service - www.altman.co.uk/emailsystems
    Glasgow Caledonian University is a registered Scottish charity, number
    SC021474
    Winner: Times Higher Education’s Widening Participation Initiative of
    the Year 2009 and Herald Society’s Education Initiative of the Year 2009
    http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
    Download Now! http://messenger.yahoo.com/download.php
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Hans Merkl at Dec 3, 2010 at 2:36 pm
    pdftotext is much better and faster from my experience.

    On Fri, Dec 3, 2010 at 08:52, Fabiano Nunes wrote:

    Have you ever tried other extractor tool than PDFBox? I used to extract
    contents with pdfbox: its capability of extract contents wasn't a problem,
    but its lack of structure information was.
    You can try poppler-utils (pdftotext) to extract contents with
    layout structure.

    Fabiano Nunes




    On Fri, Dec 3, 2010 at 10:08 AM, Ian Lea wrote:

    Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
    Have you tried asking on the tika mailing list?
    http://tika.apache.org/mail-lists.html.


    --
    Ian.

    On Fri, Dec 3, 2010 at 11:55 AM, Ganesh wrote:
    I first extract the contents from documents using tika and latter index
    it with Lucene. The problem is the extracted text from PDF using tika has no
    whitespaces.
    Regards
    Ganesh


    ----- Original Message -----
    From: "McGibbney, Lewis John" <Lewis.McGibbney@gcu.ac.uk>
    To: <java-user@lucene.apache.org>
    Sent: Friday, December 03, 2010 4:40 PM
    Subject: RE: PDF text extracted without spaces

    Hi Ganesh

    I encountered this same problem last week. I was thinking if it was
    possible to include at minimum a WhitespaceAnalyzer somewhere within Tika
    which would solve the problem. I am not sure of how this would be done as I
    am not familiar with Tika codebase.
    Unfortunately I don't think that the solution to the first part of
    this
    problem lies within the java-user mailing list.
    When were you sending extracted contents to Lucene... at what later
    stage?
    Thank you

    Lewis

    -----Original Message-----
    From: Ganesh
    Sent: 03 December 2010 10:44
    To: java-user@lucene.apache.org
    Subject: Re: PDF text extracted without spaces

    The main problem is i am not getting whitespace and newline char. This
    is happening only for PDF documents.
    Sample outoput: Someofthedifferencesare but it should be Some of the
    differences are
    Regards
    Ganesh

    ----- Original Message -----
    From: "Alexander Aristov" <alexander.aristov@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, December 03, 2010 2:39 PM
    Subject: Re: PDF text extracted without spaces

    anyway even if you get correct whitespaces and new lines this won't
    affect
    indexing.

    Best Regards
    Alexander Aristov

    On 3 December 2010 10:00, Lance Norskog wrote:

    The text should come out as a stream of words with space, but
    without
    any of the formatting in the PDF. Extraction is only good enough to
    tell you that a word is somewhere inside a PDF file. Can you post a
    short bit of the text that it extracted?

    Also, you should try this test on different PDF files that were made
    with different software.
    On Thu, Dec 2, 2010 at 9:35 PM, Ganesh wrote:
    Hello all,

    I know, this is not the right group to ask this question, thought
    some of
    you guys might have experienced.
    I newbie with Tika. I am using latest version 0.8 version. I
    extracted
    text from PDF document but found spaces and new line missing.
    Indexing
    the
    data gives wrong result. Could any one in this group could help me?
    I
    am
    using tika directly to extract the contents, which later gets
    indexed.
    Regards
    Ganesh
    Send free SMS to your Friends on Mobile from your Yahoo!
    Messenger.
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    --
    Lance Norskog
    goksron@gmail.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
    Download Now! http://messenger.yahoo.com/download.php
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    Email has been scanned for viruses by Altman Technologies' email
    management service - www.altman.co.uk/emailsystems
    Glasgow Caledonian University is a registered Scottish charity, number
    SC021474
    Winner: Times Higher Education’s Widening Participation Initiative of
    the Year 2009 and Herald Society’s Education Initiative of the Year 2009
    http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
    Download Now! http://messenger.yahoo.com/download.php
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    --

    Hans Merkl
    Right On Point, LLC
    215 Victor Parkway, Suite E
    Annapolis, MD 21403

    Phone: (443) 951-4324
    E-mail: hmerkl@rightonpoint.us
  • Ralph Seward at Dec 3, 2010 at 2:52 pm
    pdftotext has usually worked quite well for my purposes. More info at
    http://www.foolabs.com/xpdf/about.html .

    "Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X
    components (pdftops, pdftotext, etc.) also run on Win32 systems and should
    run on pretty much any system with a decent C++ compiler."

    Ralph
    On Fri, Dec 3, 2010 at 9:35 AM, Hans Merkl wrote:

    pdftotext is much better and faster from my experience.

    On Fri, Dec 3, 2010 at 08:52, Fabiano Nunes wrote:

    Have you ever tried other extractor tool than PDFBox? I used to extract
    contents with pdfbox: its capability of extract contents wasn't a
    problem,
    but its lack of structure information was.
    You can try poppler-utils (pdftotext) to extract contents with
    layout structure.

    Fabiano Nunes




    On Fri, Dec 3, 2010 at 10:08 AM, Ian Lea wrote:

    Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
    Have you tried asking on the tika mailing list?
    http://tika.apache.org/mail-lists.html.


    --
    Ian.

    On Fri, Dec 3, 2010 at 11:55 AM, Ganesh wrote:
    I first extract the contents from documents using tika and latter
    index
    it with Lucene. The problem is the extracted text from PDF using tika
    has
    no
    whitespaces.
    Regards
    Ganesh


    ----- Original Message -----
    From: "McGibbney, Lewis John" <Lewis.McGibbney@gcu.ac.uk>
    To: <java-user@lucene.apache.org>
    Sent: Friday, December 03, 2010 4:40 PM
    Subject: RE: PDF text extracted without spaces

    Hi Ganesh

    I encountered this same problem last week. I was thinking if it was
    possible to include at minimum a WhitespaceAnalyzer somewhere within
    Tika
    which would solve the problem. I am not sure of how this would be done
    as
    I
    am not familiar with Tika codebase.
    Unfortunately I don't think that the solution to the first part of
    this
    problem lies within the java-user mailing list.
    When were you sending extracted contents to Lucene... at what later
    stage?
    Thank you

    Lewis

    -----Original Message-----
    From: Ganesh
    Sent: 03 December 2010 10:44
    To: java-user@lucene.apache.org
    Subject: Re: PDF text extracted without spaces

    The main problem is i am not getting whitespace and newline char.
    This
    is happening only for PDF documents.
    Sample outoput: Someofthedifferencesare but it should be Some of
    the
    differences are
    Regards
    Ganesh

    ----- Original Message -----
    From: "Alexander Aristov" <alexander.aristov@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, December 03, 2010 2:39 PM
    Subject: Re: PDF text extracted without spaces

    anyway even if you get correct whitespaces and new lines this
    won't
    affect
    indexing.

    Best Regards
    Alexander Aristov

    On 3 December 2010 10:00, Lance Norskog wrote:

    The text should come out as a stream of words with space, but
    without
    any of the formatting in the PDF. Extraction is only good enough
    to
    tell you that a word is somewhere inside a PDF file. Can you
    post a
    short bit of the text that it extracted?

    Also, you should try this test on different PDF files that were
    made
    with different software.

    On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailgane@yahoo.co.in>
    wrote:
    Hello all,

    I know, this is not the right group to ask this question,
    thought
    some of
    you guys might have experienced.
    I newbie with Tika. I am using latest version 0.8 version. I
    extracted
    text from PDF document but found spaces and new line missing.
    Indexing
    the
    data gives wrong result. Could any one in this group could help
    me?
    I
    am
    using tika directly to extract the contents, which later gets
    indexed.
    Regards
    Ganesh
    Send free SMS to your Friends on Mobile from your Yahoo!
    Messenger.
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail:
    java-user-help@lucene.apache.org


    --
    Lance Norskog
    goksron@gmail.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
    Download Now! http://messenger.yahoo.com/download.php
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    Email has been scanned for viruses by Altman Technologies' email
    management service - www.altman.co.uk/emailsystems
    Glasgow Caledonian University is a registered Scottish charity,
    number
    SC021474
    Winner: Times Higher Education’s Widening Participation Initiative
    of
    the Year 2009 and Herald Society’s Education Initiative of the Year
    2009
    http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
    Download Now! http://messenger.yahoo.com/download.php
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    --

    Hans Merkl
    Right On Point, LLC
    215 Victor Parkway, Suite E
    Annapolis, MD 21403

    Phone: (443) 951-4324
    E-mail: hmerkl@rightonpoint.us
  • Ganesh at Dec 6, 2010 at 9:47 am
    Thanks for the link. I downloaded the tool and pdftotext extracts correctly. I also dropped a mail to the tika user group. It is a regression in latest release https://issues.apache.org/jira/browse/TIKA-548. Hopefully coming release will have a fix.

    Regards
    Ganesh

    ----- Original Message -----
    From: "Ralph Seward" <rj.seward@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, December 03, 2010 8:21 PM
    Subject: Re: PDF text extracted without spaces


    pdftotext has usually worked quite well for my purposes. More info at
    http://www.foolabs.com/xpdf/about.html .

    "Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X
    components (pdftops, pdftotext, etc.) also run on Win32 systems and should
    run on pretty much any system with a decent C++ compiler."

    Ralph
    On Fri, Dec 3, 2010 at 9:35 AM, Hans Merkl wrote:

    pdftotext is much better and faster from my experience.

    On Fri, Dec 3, 2010 at 08:52, Fabiano Nunes wrote:

    Have you ever tried other extractor tool than PDFBox? I used to extract
    contents with pdfbox: its capability of extract contents wasn't a
    problem,
    but its lack of structure information was.
    You can try poppler-utils (pdftotext) to extract contents with
    layout structure.

    Fabiano Nunes




    On Fri, Dec 3, 2010 at 10:08 AM, Ian Lea wrote:

    Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
    Have you tried asking on the tika mailing list?
    http://tika.apache.org/mail-lists.html.


    --
    Ian.

    On Fri, Dec 3, 2010 at 11:55 AM, Ganesh wrote:
    I first extract the contents from documents using tika and latter
    index
    it with Lucene. The problem is the extracted text from PDF using tika
    has
    no
    whitespaces.
    Regards
    Ganesh


    ----- Original Message -----
    From: "McGibbney, Lewis John" <Lewis.McGibbney@gcu.ac.uk>
    To: <java-user@lucene.apache.org>
    Sent: Friday, December 03, 2010 4:40 PM
    Subject: RE: PDF text extracted without spaces

    Hi Ganesh

    I encountered this same problem last week. I was thinking if it was
    possible to include at minimum a WhitespaceAnalyzer somewhere within
    Tika
    which would solve the problem. I am not sure of how this would be done
    as
    I
    am not familiar with Tika codebase.
    Unfortunately I don't think that the solution to the first part of
    this
    problem lies within the java-user mailing list.
    When were you sending extracted contents to Lucene... at what later
    stage?
    Thank you

    Lewis

    -----Original Message-----
    From: Ganesh
    Sent: 03 December 2010 10:44
    To: java-user@lucene.apache.org
    Subject: Re: PDF text extracted without spaces

    The main problem is i am not getting whitespace and newline char.
    This
    is happening only for PDF documents.
    Sample outoput: Someofthedifferencesare but it should be Some of
    the
    differences are
    Regards
    Ganesh

    ----- Original Message -----
    From: "Alexander Aristov" <alexander.aristov@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, December 03, 2010 2:39 PM
    Subject: Re: PDF text extracted without spaces

    anyway even if you get correct whitespaces and new lines this
    won't
    affect
    indexing.

    Best Regards
    Alexander Aristov

    On 3 December 2010 10:00, Lance Norskog wrote:

    The text should come out as a stream of words with space, but
    without
    any of the formatting in the PDF. Extraction is only good enough
    to
    tell you that a word is somewhere inside a PDF file. Can you
    post a
    short bit of the text that it extracted?

    Also, you should try this test on different PDF files that were
    made
    with different software.

    On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailgane@yahoo.co.in>
    wrote:
    Hello all,

    I know, this is not the right group to ask this question,
    thought
    some of
    you guys might have experienced.
    I newbie with Tika. I am using latest version 0.8 version. I
    extracted
    text from PDF document but found spaces and new line missing.
    Indexing
    the
    data gives wrong result. Could any one in this group could help
    me?
    I
    am
    using tika directly to extract the contents, which later gets
    indexed.
    Regards
    Ganesh
    Send free SMS to your Friends on Mobile from your Yahoo!
    Messenger.
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail:
    java-user-help@lucene.apache.org


    --
    Lance Norskog
    goksron@gmail.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
    Download Now! http://messenger.yahoo.com/download.php
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    Email has been scanned for viruses by Altman Technologies' email
    management service - www.altman.co.uk/emailsystems
    Glasgow Caledonian University is a registered Scottish charity,
    number
    SC021474
    Winner: Times Higher Education’s Widening Participation Initiative
    of
    the Year 2009 and Herald Society’s Education Initiative of the Year
    2009
    http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
    Download Now! http://messenger.yahoo.com/download.php
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    --

    Hans Merkl
    Right On Point, LLC
    215 Victor Parkway, Suite E
    Annapolis, MD 21403

    Phone: (443) 951-4324
    E-mail: hmerkl@rightonpoint.us
    Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedDec 3, '10 at 5:35a
activeDec 6, '10 at 9:47a
posts11
users8
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase