Grokbase Groups Tika user

750 discussions - 2,520 posts

  • I am running Tika rest API server locally, calling it via ajax, and getting the cross domain request issue, so I am trying to set --cors option but the server generates errors and stops. It's a very ...
    Allison Ahn
    Jun 16, 2016 at 1:13 am
    Jun 16, 2016 at 1:13 am
  • Oh, wow. Y, that's probably more than we'd want to support (unless any other Tika devs have an interest?)...very, very cool! -----Original Message----- From: Justin Lee Sent: Monday, June 13, 2016 ...
    Allison, Timothy B.
    Jun 14, 2016 at 12:53 pm
    Jun 14, 2016 at 7:55 pm
  • Hi all, I am using TIKA java library to read the content of some PDFs and it seems like it inserts some weird (hyphen-like) spacing. For example: The es tab lish ment of an in te grated Part ner Re ...
    Augusto Ribeiro Silva
    May 31, 2016 at 11:38 am
    May 31, 2016 at 1:17 pm
  • CVE-2016-4434: Apache Tika XML External Entity vulnerability Severity: Important Vendor: The Apache Software Foundation Versions Affected: Apache Tika 0.10 to 1.12 Description: Apache Tika parses XML ...
    Tim Allison
    May 26, 2016 at 4:00 pm
    May 26, 2016 at 4:00 pm
  • What is the complexity of text extraction, say from pdf, pptx or docx on Tika? To be precise, Is it logarithmic, quadratic or linear? Also is it possible for you to give me specifics regarding the ...
    Kavya Sree Bhagavatula
    May 26, 2016 at 5:29 am
    May 26, 2016 at 5:29 am
  • I tried downloading the three files from the tika download page (for 1.13): - - tika-app-1.13.jar - tika-server-1.13.jar First, the download page lists the filename for the ...
    Matt Work Coarr
    May 24, 2016 at 1:52 pm
    May 25, 2016 at 12:55 pm
  • Dear list, I am not sure this is the right place to ask, but since I don’t know a better place and some of you might use the tika-python package, I might as well give it a shot. If you know a better ...
    Philipp Steinkrüger
    May 22, 2016 at 7:12 pm
    May 23, 2016 at 9:20 am
  • Thank you Tim and everyone else involved for fixing this issue so quickly! All best, Philipp
    Philipp Steinkrüger
    May 18, 2016 at 11:27 am
    May 18, 2016 at 11:27 am
  • The Apache Tika project is pleased to announce the release of Apache Tika 1.13. The release contents have been pushed out to the main Apache release site and to the Central sync, so the releases ...
    David Meikle
    May 16, 2016 at 3:57 pm
    May 16, 2016 at 3:57 pm
  • Dear list, I am running Tika server 1.14 on a Debian jessie. I start the server with this command: java -jar tika-server-1.14-SNAPSHOT.jar If I send a file for metadata extraction like this curl -T ...
    Philipp Steinkrüger
    May 15, 2016 at 2:11 pm
    May 16, 2016 at 5:37 pm
  • Dear list, I started to explore the possibilities of TIKA and I have a couple of questions that I will send to the list in separate emails, to keep things tidy. To begin with, I noticed the following ...
    Philipp Steinkrüger
    May 15, 2016 at 1:58 pm
    May 16, 2016 at 9:02 am
  • Hi All For those who couldn't make it to Vancouver this week, the slides from my "What's new with Apache Tika 2.0" talk are now available online ...
    Nick Burch
    May 11, 2016 at 9:09 pm
    May 12, 2016 at 12:04 am
  • (X-posted from StackOverflow) I'm trying to write a Java application that ...
    Betsey Benagh
    May 11, 2016 at 3:21 pm
    May 20, 2016 at 1:03 am
  • Hi, I'm using Apache Tika to detect if some files uploaded by my team are in the mime type scope that I've defined. But now I'm facing of detecting some XML files but only some specifics, I can't ...
    May 10, 2016 at 5:39 pm
    May 11, 2016 at 8:02 pm
  • A candidate for the Tika 1.13 release is available at: The release candidate is a zip archive of the sources in ...
    David Meikle
    May 9, 2016 at 7:34 pm
    May 17, 2016 at 4:33 am
  • Thank you, Tilman! :) -----Original Message----- From: Tilman Hausherr Sent: Saturday, April 30, 2016 5:24 PM To: ...
    Allison, Timothy B.
    May 2, 2016 at 12:19 pm
    May 2, 2016 at 12:23 pm
  • Hello everyone, we are using TikaOCR to access tesseract OCR via Tika Server's web API, which is working perfectly satisfying. However, as we process documents in different languages, I was wondering ...
    Mirko Hering
    Apr 27, 2016 at 1:57 pm
    Apr 27, 2016 at 1:57 pm
  • Hi I recently upgraded to tika 1.12 from 1.7 and read the notes about Jempbox being no longer used. My pom now pulls in 1.12 versions of tika-core, tika-parsers, tika-xmp and tika-bundle. The app is ...
    Chris Bamford
    Apr 22, 2016 at 4:14 pm
    Apr 22, 2016 at 5:21 pm
  • Ha. I'm in the process of comparing mimetype detection results from DROID, Tika and 'file' on our TIKA-1302 corpus. After that, I was going to compare our different encoding detectors on the ...
    Allison, Timothy B.
    Apr 18, 2016 at 2:13 pm
    May 11, 2016 at 2:15 pm
  • Greetings to the Community! a simple question: is there a way to white-/black-list certain mime- or file-types for OCR? E.g. I'd like to extract and OCR embedded images from PDFs only (which is ...
    Apr 17, 2016 at 8:13 pm
    Apr 17, 2016 at 8:13 pm
  • Hi All, I made a Wikipedia page for Apache Tika: Please update and edit. Thank you. Cheers, Chris ...
    Mattmann, Chris A (3980)
    Apr 15, 2016 at 11:22 pm
    Apr 16, 2016 at 4:31 am
  • Hi, I've just happily discovered Tika and am sorting out how well it fits our needs. I'm trying to create a searchable index for PDF files that contain typed pages and pages with scanned text ...
    Apr 13, 2016 at 10:52 am
    Apr 13, 2016 at 1:58 pm
  • Hi all, I'm using Nutch for crawling the web, and one of its built-in HTML parsers uses Tika and its LinkContentHandler. I'm interested in collecting *all* links on a web page, but I'm surprised the ...
    Joseph Naegele
    Apr 5, 2016 at 7:27 pm
    Apr 6, 2016 at 10:10 pm
  • Hello, I'm having an issue where I'm getting back two or three metadata properties that are related to a temp file that tika is apparently creating under the hood: File Modified Date (the current ...
    Brian Young
    Mar 25, 2016 at 9:07 pm
    Mar 28, 2016 at 8:14 pm
  • Hi Tika experts, Question : How to enable multiple parsers for specific mimetypes? I am using tika to parse html pages. My requirement is that both *NamedEntityParser* and *HtmlParser* has to be ...
    Thamme Gowda N.
    Mar 23, 2016 at 9:53 pm
    Mar 23, 2016 at 10:29 pm
  • Tika appears to use two logging frameworks, Commons Logging and SLF4J. Is that correct? Commons Logging is used by; tika-app tika-parsers tika-server SLF4J is used by; tika-batch tika-core ...
    John Patrick
    Mar 2, 2016 at 11:13 pm
    Mar 3, 2016 at 9:19 am
  • Hi All, I am using Tika server REST api to extract the content from large files. I was able to extract the content up to 100 MB files. when i try to send files more than 100 MB giving me "Zip bomb ...
    Raghu vittal
    Mar 1, 2016 at 6:41 am
    Mar 22, 2016 at 12:21 pm
  • Hi, I am trying to use the ForkParser, but am getting an exception: org.apache.tika.exception.TikaException: Unable to serialize ParseContext to pass to the Forked Parser Caused by: Caused by ...
    Luke Noel-Storr
    Feb 25, 2016 at 3:35 pm
    Feb 25, 2016 at 3:35 pm
  • hiya, I'm working with an existing code base that is using Jackson 2.6.3. Now adding tika but because the tika-server jar contains Jackson 2.4.0 having lots of compile issues. 1) Was it intentional ...
    John Patrick
    Feb 23, 2016 at 5:51 pm
    Feb 25, 2016 at 5:27 pm
  • Hi all, I'm extracting some text from pdf. As result, some important words end with spaces between characters. For example, I could have the word "Subtitle" that I want to detect, written like "S u b ...
    Francisco Andrés Fernández
    Feb 23, 2016 at 3:00 pm
    Feb 23, 2016 at 3:00 pm
  • Hi, Is there a wiki or instructions on how to remove cryptographic software from Tika? Is it enough to simply remove the bouncy castle libraries from Tika's JAR and yet be able to use Tika to its ...
    Steven White
    Feb 19, 2016 at 5:05 pm
    Feb 19, 2016 at 9:45 pm
  • Hi All we have very large PDF,.docx,.xlsx. We are using Tika to extract content and dump data in Elastic Search for full-text search. sending very large files to Tika will cause out of memory ...
    Raghu vittal
    Feb 19, 2016 at 9:38 am
    Feb 29, 2016 at 3:13 pm
  • Hi , I am currently indexing individual outlook messages and searching is working fine. I have created solr core using following command. ./solr create -c sreenimsg1 -d data_driven_schema_configs I ...
    Sreenivasa Kallu
    Feb 16, 2016 at 11:39 pm
    Feb 17, 2016 at 12:42 am
  • The Apache Tika project is pleased to announce the release of Apache Tika 1.12. The release contents have been pushed out to the main Apache release site and to the Central sync, so the releases ...
    Chris Mattmann
    Feb 15, 2016 at 6:45 pm
    Feb 15, 2016 at 7:43 pm
  • Team, Sorry for the long delay. This VOTE has PASSED with the following tallies: +1 Chris Mattmann* Markus Jelsma Oleg Tikhonov* Ken Krugler* Tim Allison* Konstantin Gribov* David Meikle* Lewis John ...
    Mattmann, Chris A (3980)
    Feb 15, 2016 at 5:03 pm
    Feb 15, 2016 at 5:03 pm
  • x-post to Tika user's Y and n. If you run tika app as: java -jar tika-app.jar <input_dir> <output_dir> It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302). This creates a parent and ...
    Allison, Timothy B.
    Feb 11, 2016 at 7:45 pm
    Feb 12, 2016 at 1:01 am
  • Hello everyone! I hope this email finds you well. I hope everyone is as excited about ApacheCon as I am! I'd like to remind you all of a couple of important dates, as well as ask for your assistance ...
    Melissa Warnkin
    Feb 11, 2016 at 6:25 pm
    Feb 11, 2016 at 6:25 pm
  • Hi everyone, I'm including tika-app-1.11.jar with my application and see that Tika includes "slf4j". This is conflicting with my own "slf4j". If I remove it from Tika's JAR will that cause any ...
    Steven White
    Feb 10, 2016 at 11:04 pm
    Feb 11, 2016 at 4:36 pm
  • Hi everyone, I'm integrating Tika with my application and need your help to figure out if the OOM I'm getting is due to the way I'm using Tika or if it is an issue with parsing XML files. The ...
    Steven White
    Feb 8, 2016 at 6:37 pm
    Feb 10, 2016 at 1:26 am
  • Hello all, This was not an issue before but now it is. I had tried to check the manual and online to see what has changed so I can update my code but no success, hence decided to email the users list ...
    Carlos A
    Feb 6, 2016 at 12:24 am
    Feb 6, 2016 at 12:29 am
  • Hi everyone, How do I detect if a file type is supported or not? Also, how do I detect if a file type is supported but it cannot be processed because the parser for it is missing (the required JARs ...
    Steven White
    Feb 5, 2016 at 8:40 pm
    Feb 6, 2016 at 2:14 am
  • Hi, I'm having an exception when converting a RTF document with the standard new Tika().parseToString(). org.apache.tika.exception.TikaException: Unexpected RuntimeException from ...
    Andrea Asta
    Feb 3, 2016 at 2:35 pm
    Feb 3, 2016 at 2:42 pm
  • Hi everyone, I have written a standalone application that works with Solr 5.2. I'm using the existing JARs that come with Solr to index data off a file system. My applications scans the file system, ...
    Steven White
    Feb 3, 2016 at 12:01 am
    Feb 5, 2016 at 7:32 pm
  • Hello, I would gladly welcome the reply of the community on the following subject: We are using Tika embedded in Solr server. I would like to know if it is possible to give in input to TesseractOCR, ...
    Giovanni Usai
    Feb 2, 2016 at 2:28 pm
    Feb 2, 2016 at 2:28 pm
  • Hello Tika People, I am trying to add a custom content-type to Tika and am finding it difficult. Not sure if the tutorial I am following is out of date but it could be the case. I am using Tika 1.11, ...
    James Brooking
    Feb 2, 2016 at 1:16 pm
    Feb 9, 2016 at 10:43 am
  • Hi Folks, A first candidate for the Tika 1.12 release is available at: The release candidate is a zip archive of the sources in ...
    Mattmann, Chris A (3980)
    Jan 25, 2016 at 7:58 pm
    Jan 30, 2016 at 9:44 am
  • Fine by me. I can cut a 1.12-rc1 this weekend. If I don’t hear objections from the other devs, I’ll go for it on Friday. Also this will be the first Git release, so should be fun! :) Cheers, Chris ...
    Mattmann, Chris A (3980)
    Jan 21, 2016 at 8:30 pm
    Jan 25, 2016 at 3:07 pm
  • Hello PMC, With TIKA-1835 committed Apache Nutch can finally fully support text and link extraction via Boilerpipe, something many Nutch users (myself not included) have been looking forward to for ...
    Markus Jelsma
    Jan 21, 2016 at 8:27 pm
    Jan 21, 2016 at 8:27 pm
  • Anyone else have a workaround for reusing an input stream that has been given to Tika Detect? According to inline comments in the Tika code, it gives the impression that developers understand about ...
    John Patrick
    Jan 18, 2016 at 4:23 pm
    Jan 18, 2016 at 4:23 pm
  • Happy New Year everyone, I have a small program for simple text and metadata extraction. It is really not more than this (in Scala): val fileParser : AutoDetectParser = new AutoDetectParser() val ...
    Jan 5, 2016 at 8:32 am
    Jan 8, 2016 at 1:57 pm
Group Overview
group: user @
categories: tika, lucene

Top users

  • Jukka Zitting: 191 posts
  • Mattmann, Chris A: 172 posts
  • Nick Burch: 171 posts
  • Nick Burch: 170 posts
  • Allison, Timothy B.: 101 posts
  • Mark Kerzner: 75 posts
  • Markus Jelsma: 56 posts
  • Michael McCandless: 42 posts
  • Grant Ingersoll: 40 posts
  • Dave Meikle: 36 posts
  • Ken Krugler: 33 posts
  • Zabrane: 31 posts
  • Sergey Beryozkin: 28 posts
  • Benson Margulies: 27 posts
  • Uwe Schindler: 25 posts
  • Jan Høydahl: 24 posts
  • Chris Mattmann: 21 posts
  • Alex Ott: 21 posts
  • Public Network Services: 20 posts
  • Steven White: 18 posts