85 discussions - 266 posts

  • Hello, is there an option to define the content types that should be parsed in an archive file? For example, I have a zip archive that contains jar and pdf files; Tika should only parse the pdf files ...
    Daniel Knapp
    Dec 4, 2009 at 2:58 pm
    Dec 4, 2009 at 2:58 pm
  • Hello all, after the 0.5 release the project's site is missing the API documentation, such as http://lucene.apache.org/tika/apidocs/org/apache/tika/parser/Parser.html, although some pages ...
    Alex Ott
    Dec 4, 2009 at 1:28 pm
    Dec 4, 2009 at 1:28 pm
  • Hi all, I converted the package to "tika-app-0.4.exe" using ikvmc.exe, so that I could use Tika in a .NET environment. Everything has worked fine so far except with an .rtf file. The output (only contains ...
    Li Leon
    Dec 4, 2009 at 8:09 am
    Dec 4, 2009 at 8:09 am
  • Hi all, I'm using the following command to filter out the attached doc which is in Chinese. The doc was filtered fine but only with gibberish output. Any ideas? "type "chinese char.doc" | java -jar ...
    Li Leon
    Dec 4, 2009 at 3:05 am
    Dec 4, 2009 at 7:58 am
  • Hi all, I got an exception when filtering the attached Excel file using "type bugs.xls | java -jar tika-app-0.4.jar -". Any ideas? The embedded object seemed to cause the problem. Thanks,
    Li Leon
    Dec 4, 2009 at 2:38 am
    Dec 4, 2009 at 10:01 am
  • David Stuart
    Dec 3, 2009 at 7:01 pm
    Dec 4, 2009 at 10:34 am
  • Hello list, This question is about how to get the content of <div id="article"> ..interesting content...</div> Is the <div> element skipped on purpose or is there a way to tell the parser what to pass ...
    Anne Blankert
    Dec 3, 2009 at 5:43 pm
    Dec 3, 2009 at 5:43 pm
  • Hello, I want to set the content-type of a file before I parse it. I've seen on the Tika website that this is possible and an advantage during the parsing process. What is the right name and value of ...
    Daniel Knapp
    Dec 2, 2009 at 8:53 pm
    Dec 3, 2009 at 1:12 am
  • Hello all, I want to implement an XML handler that will behave differently depending on the MIME type of the processed document, for example outputting CSV for spreadsheets. May I assume that when XML ...
    Alex Ott
    Nov 27, 2009 at 12:47 pm
    Nov 27, 2009 at 12:51 pm
  • Hi, I'm trying to build Tika 0.5 and I can't get past the tests. I get stuck at the Running org.apache.tika.mime.MimeDetectionTest step. I looked at the sources and saw that MimeDetectionTest.java ...
    Georger Araujo
    Nov 24, 2009 at 8:43 pm
    Nov 24, 2009 at 9:44 pm
  • (...apologies for the cross posting...) The Apache Lucene project is pleased to announce the release of Apache Tika 0.5. The release contents have been pushed out to the main Apache release site and ...
    Mattmann, Chris A (388J)
    Nov 22, 2009 at 3:51 pm
    Nov 22, 2009 at 3:51 pm
  • Hello all, I have one question: is it possible to extract text not only from a single document, but also from documents embedded in an archive? When I send an archive (.zip) to Tika, I get only a list ...
    Alex Ott
    Nov 21, 2009 at 6:22 pm
    Nov 26, 2009 at 1:05 pm
  • Hi, is there a separate list for Nutch? Thank you, Mark
    Mark Kerzner
    Nov 16, 2009 at 7:39 pm
    Nov 16, 2009 at 7:52 pm
  • Hi, I'm using the UIMA Tika Annotator which calls Tika in the following way: Parser.parse(originalStream, handler, md); originalStream is a BufferedInputStream. I've upgraded the Tika dependencies ...
    Wermter, Joachim
    Nov 13, 2009 at 2:53 pm
    Nov 13, 2009 at 2:53 pm
  • Hi, I tried to extract text from Office 2007 Word and Excel files, and Tika thinks they are XML files. The "file" command in Linux thinks they are "zip" files. Where should I look for the current format ...
    Mark Kerzner
    Nov 12, 2009 at 2:43 am
    Nov 12, 2009 at 1:01 pm
  • Team, For those Lucene fanatics not in Oakland this week for ApacheCon US, don't miss the FREE live video streaming, starting today: http://streaming.linux-magazin.de/en/program-apachecon-us-2009.htm ...
    Michael McCandless
    Nov 4, 2009 at 1:26 pm
    Nov 4, 2009 at 1:26 pm
  • Hello List, According to the changelog, Tika now inserts whitespace when parsing html documents, but I do not understand how to get it to work. My HTML document has a fragment: <td valign="top" ...
    Anne Blankert
    Oct 16, 2009 at 5:21 pm
    Oct 24, 2009 at 6:54 pm
  • Hello, How come Ken's recently added MboxParser is not in 0.5-SNAPSHOT? Doesn't the snapshot get rebuilt and pushed to the Maven repo on a nightly basis? If I get the 0.5-SNAPSHOT jar from the repo: $ ...
    Otis Gospodnetic
    Oct 15, 2009 at 3:07 am
    Oct 15, 2009 at 9:25 am
  • Hello, I just exported Tika from Subversion, built the jars, and tried to use tika-app to extract data from a copy-protected PDF. This is the error I got: $ java -jar ...
    Daniel Higginbotham
    Oct 14, 2009 at 9:50 pm
    Oct 15, 2009 at 12:06 am
  • Dear Tika experts, I just tried to get Tika 0.4 running, but there seems to be a mismatch with the "getting started" page (at least on my computer :-)). "mvn install" passes without errors, but, ...
    Marc Bechler
    Oct 10, 2009 at 8:27 pm
    Oct 11, 2009 at 12:42 pm
  • tika-parsers:0.4 has a dependency on Xerces. Once Xerces is on the classpath, some simple operations with TraX fail with a DOMException NOT_FOUND_ERR. Putting Xalan onto the path makes everything ...
    Benson Margulies
    Oct 9, 2009 at 2:09 am
    Oct 16, 2009 at 6:17 am
  • I spent this afternoon wiring tika 0.4 into my code. For various reasons, what I want to get from Tika is an XHTML DOM tree, not just plain text. I made a few discoveries. To begin with, the HTML ...
    Benson Margulies
    Oct 9, 2009 at 2:07 am
    Oct 9, 2009 at 9:22 am
  • Using tika 0.4. I parse an XHTML document, and it has some div nodes. The xhtml that comes back the other way through the Parser content handler has lost all the div nodes. Is this mandatory? Is this ...
    Benson Margulies
    Oct 8, 2009 at 8:14 pm
    Oct 8, 2009 at 8:14 pm
  • Hi, Tika is a really great tool. Thanks for the great work. As an ardent user, I was wondering whether and when it's possible to upgrade Tika's Maven dependencies (mainly its parser libs, e.g. ...
    Wermter, Joachim
    Oct 7, 2009 at 2:15 pm
    Oct 24, 2009 at 6:44 pm
  • I have a Java program that passes a list of URLs to Tika to fetch a set of HTML pages. Most of the time, I can process a list of URLs with Tika. However, sometimes I get the following error: Error: ...
    Lhpangler
    Sep 30, 2009 at 4:41 pm
    Sep 30, 2009 at 5:24 pm
  • Hello list, I'm trying to index rich text format documents (pdf, docs etc) with Solr+Tika. At the moment the output of Tika is XHTML <str> ... </str>, which isn't exactly what I'm looking for. I'm ...
    Claudio Martella
    Sep 30, 2009 at 9:40 am
    Sep 30, 2009 at 9:40 am
  • Hi all, IIUC, the character encoding (as detected by CharsetDetector, for instance) is only relevant for text formats. It does not make sense, for example, to talk about the character encoding of a ...
    Kaspar Fischer
    Sep 27, 2009 at 11:14 am
    Sep 30, 2009 at 9:20 am
  • Hello, has a problem with some encodings in rtf files. I've seen it happen with some rtf files generated by Microsoft Word and having Czech characters (one example being "ř", Unicode code point ...
    Cristian Vat
    Sep 11, 2009 at 9:58 pm
    Sep 17, 2009 at 11:03 am
  • Hi, I'm trying to extract the text content of some text document formats. Unfortunately, I get no content at all. I use the tika-core-0.4.jar. This is my test class: public class ContentConverter { ...
    Fabian Lazarski
    Aug 26, 2009 at 9:00 am
    Aug 26, 2009 at 10:06 am
  • FYI Begin forwarded message:
    Grant Ingersoll
    Aug 25, 2009 at 3:20 pm
    Aug 25, 2009 at 3:20 pm
  • I am using the Solr nightly build 8/11/09 which uses the tika-core-0.4.jar. I have set the text field in the solrconfig.xml file to be stored. I index an MS Word document and when I search for a word ...
    Kevin Miller
    Aug 19, 2009 at 3:15 pm
    Aug 24, 2009 at 10:52 pm
  • New to tika, early user of Lucene. Particular interest in indexing and searching XML instances. I currently have about 800+ instances, with about 20 different schemas (XML based user documentation ...
    Dave Pawson
    Aug 17, 2009 at 5:29 am
    Nov 11, 2009 at 3:10 pm
  • Hi, I am trying to use AutoDetectParser for parsing files dynamically, but I am getting the following exception. I tried to google it but didn't find a proper solution. Please help. Exception in ...
    Chaitali Patel
    Aug 17, 2009 at 4:47 am
    Aug 17, 2009 at 9:46 am
  • Hi, I am using PDFParser to extract PDF content, but extraction is taking a lot of time. A 250KB file in my case took around 5 minutes to extract. A 600KB file took around 10 minutes. ...
    Chaitali Patel
    Aug 13, 2009 at 5:33 am
    Nov 23, 2009 at 3:24 pm
  • Forwarding the ApacheCon announcement. Also note we have a lot of Lucene ecosystem talks and a meetup scheduled, as well as training on both Lucene and Solr, so I hope you will join us. Cheers, Grant ...
    Grant Ingersoll
    Aug 12, 2009 at 1:59 pm
    Aug 12, 2009 at 1:59 pm
  • Hi, what is the Jackrabbit extractor and how does it compare to Tika? I am looking for an extractor for MS Outlook. Thank you, Mark
    Mark Kerzner
    Aug 6, 2009 at 4:46 am
    Aug 6, 2009 at 8:27 am
  • Hi, Tika is supposed to process *.msg files (email extracted out of Outlook). At first attempt this did not work; what should I look at? Incidentally, how do I get *.msg out of Outlook? Thank you, Mark
    Mark Kerzner
    Aug 4, 2009 at 9:18 pm
    Aug 5, 2009 at 2:44 pm
  • Hello, I try to build tika-0.4 with 'mvn install' and get the following build error: ... [WARNING] Warning building bundle org.apache.tika:tika-app:bundle:0.4 : Split package org/apache/tika/parser ...
    Florian Scholz
    Aug 3, 2009 at 5:45 pm
    Aug 6, 2009 at 2:36 pm
  • (...apologies for the cross posting...) The Apache Lucene project is pleased to announce the release of Apache Tika 0.4. The release contents have been pushed out to the main Apache release site and ...
    Mattmann, Chris A (388J)
    Jul 27, 2009 at 5:46 pm
    Jul 27, 2009 at 5:46 pm
  • Hi all, I'm just starting with Tika and trying to extract the text content of some HTML. Unfortunately, I get no content at all. This is my test method (in Scala): def testHtml() { val html = "<html ...
    Martin Grotzke
    Jul 24, 2009 at 10:48 pm
    Jul 29, 2009 at 10:29 pm
  • Hi, Tika extracts text from PDF in a grand way, but now I need a little more. I have a PDF that was created from a PowerPoint, and it has a few words on the bottom, kind of footer, but not real PDF ...
    Mark Kerzner
    Jul 22, 2009 at 6:00 pm
    Jul 23, 2009 at 12:11 am
  • The Travel Assistance Committee is taking in applications for those wanting to attend ApacheCon US 2009 (Oakland) which takes place between the 2nd and 6th November 2009. The Travel Assistance ...
    Grant Ingersoll
    Jul 22, 2009 at 10:49 am
    Jul 22, 2009 at 10:49 am
  • Hi, let's say I have a PDF that has both the images and the text. I can get all text with Tika, and I can do OCR with Ocropus. But the two results don't yet play together. What I would really like to ...
    Mark Kerzner
    Jul 21, 2009 at 10:01 pm
    Jul 30, 2009 at 7:05 pm
  • Hi, can I post .pst files directly to Solr Tika, or can only .msg files be posted? -- Regards, Brindha.KR
    Brindha karuppiah
    Jul 21, 2009 at 6:37 am
    Jul 30, 2009 at 10:59 am
  • Hi, I have to post Outlook .pst files to Solr Apache Tika. I don't know how to post them to Solr. -- Regards, Brindha.KR
    Brindha karuppiah
    Jul 21, 2009 at 5:43 am
    Jul 21, 2009 at 5:56 am
  • For those in NYC, there will be a Lucene ecosystem (Lucene/Solr/Mahout/Nutch/Tika/Droids/Lucene ports) Meetup on July 22, hosted by MTV Networks and co-sponsored with Lucid Imagination. For more ...
    Grant Ingersoll
    Jul 15, 2009 at 3:32 pm
    Jul 15, 2009 at 3:32 pm
  • I am trying to get Tika to work with Solr. My problem is that when I attempt to run "mvn install", the build fails. Is there something that I am missing to get the files to build? Kevin Miller ...
    Kevin Miller
    Jul 13, 2009 at 8:06 pm
    Jul 14, 2009 at 5:23 pm
  • Hi All, (sorry for the cross-post) For those in NYC, there will be a Lucene ecosystem (Lucene/Solr/Mahout/Nutch/Tika/Droids/Lucene ports) Meetup on July 22, hosted by MTV Networks and co-sponsored ...
    Grant Ingersoll
    Jul 3, 2009 at 12:11 pm
    Jul 3, 2009 at 12:11 pm
  • I'm parsing a package file, let's say foo.tar.gz. AutoDetectParser does the right thing in the sense that it returns an XHTML file that contains entries for each file in the tar file which is in the ...
    Jonathan Koren
    Jun 24, 2009 at 6:39 pm
    Jun 24, 2009 at 6:39 pm
  • Hi, I have a PowerPoint presentation with a background. The background contains text, such as "confidential" at the bottom, and company name, and Google knows these words, but Tika only gets text. Is ...
    Mark Kerzner
    Jun 3, 2009 at 3:15 pm
    Jun 3, 2009 at 3:15 pm
Group Overview
group: tika-user @
categories: lucene
discussions: 85
posts: 266
users: 50
website: lucene.apache.org

Top users

  • Jukka Zitting: 60 posts
  • Mark Kerzner: 32 posts
  • Daniel Knapp: 13 posts
  • Grant Ingersoll: 13 posts
  • Benson Margulies: 11 posts
  • Jonathan Koren: 8 posts
  • Alex Ott: 8 posts
  • Dave Pawson: 8 posts
  • Mattmann, Chris A: 8 posts
  • Li Leon: 6 posts
  • Anne Blankert: 6 posts
  • Georger Araujo: 6 posts
  • Uwe Schindler: 5 posts
  • Brindha karuppiah: 5 posts
  • Michael McCandless: 5 posts
  • Wermter, Joachim: 4 posts
  • Kevin Miller: 4 posts
  • Gargate, Siddharth: 4 posts
  • Martin Grotzke: 4 posts
  • Aaron Fulton: 4 posts