Hi all.

I'm looking for a way to load an arbitrary file from the
system and determine whether it is plain text.
The mimetypes module has some nice methods, but it doesn't
work for files without an extension, for example.

Any suggestions?

--
-- luca


  • Philip Semanchuk at Nov 14, 2009 at 5:51 pm

    On Nov 14, 2009, at 11:02 AM, Luca Fabbri wrote:

    Hi all.

    I'm looking for a way to load an arbitrary file from the
    system and determine whether it is plain text.
    The mimetypes module has some nice methods, but it doesn't
    work for files without an extension, for example.

    Hi Luca,
    You have to define what you mean by "text" file. It might seem
    obvious, but it's not.

    Do you mean just ASCII text? Or will you accept Unicode too? Unicode
    text can be more difficult to detect because you have to guess the
    file's encoding (unless it has a BOM; most don't).

    And do you need to verify that every single byte in the file is
    "text"? What if the file is 1GB, do you still want to examine every
    single byte?

    If you give us your own (specific!) definition of what "text" means,
    or perhaps a description of the problem you're trying to solve, then
    maybe we can help you better.

    Cheers
    Philip
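Two of the points above (a BOM removes the need to guess the encoding, and a huge file need not be read in full) can be sketched as follows. This is only an illustration: the function name `sniff_bom` and the 4096-byte default sample size are assumptions made here, not anything proposed in the thread.

```python
import codecs

# Known byte-order marks. The UTF-32 marks come first because the
# UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(path, sample_size=4096):
    """Return (encoding or None, sample), reading at most sample_size bytes."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    for bom, encoding in _BOMS:
        if sample.startswith(bom):
            return encoding, sample
    # No BOM: the encoding still has to be guessed some other way.
    return None, sample
```

A result of `None` says nothing about whether the file is text. It only means there was no BOM to rely on, which, as noted above, is the common case.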
  • Luca at Nov 15, 2009 at 12:49 pm

    On Sat, Nov 14, 2009 at 6:51 PM, Philip Semanchuk wrote:
    Hi Luca,
    You have to define what you mean by "text" file. It might seem obvious, but
    it's not.

    Do you mean just ASCII text? Or will you accept Unicode too? Unicode text
    can be more difficult to detect because you have to guess the file's
    encoding (unless it has a BOM; most don't).

    And do you need to verify that every single byte in the file is "text"? What
    if the file is 1GB, do you still want to examine every single byte?

    If you give us your own (specific!) definition of what "text" means, or
    perhaps a description of the problem you're trying to solve, then maybe we
    can help you better.

    Thanks all.

    I was fairly sure this would not be a simple task. For now, checking
    only for ASCII is not enough for me (my native language falls
    outside that encoding :-)
    Checking every single byte could be a good solution...

    I can start with the mimetypes module and, if the file has no
    extension, check the bytes one by one, as the "file" command
    (commonly) does.
    Better: I can use the "file" command if it is available.

    Again: thanks all!

    --
    -- luca
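The plan described above (extension first, content second) might look roughly like this. It is a sketch under stated assumptions: the name `looks_like_text`, the sample size, and the choice of a strict UTF-8 decode as the content check are all illustrative, and other encodings would need their own handling.

```python
import codecs
import mimetypes


def looks_like_text(path, sample_size=4096):
    """Trust the extension when there is one; otherwise inspect the bytes."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is not None:
        return mime_type.startswith("text/")

    # No recognised extension: try to decode a leading sample as UTF-8.
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    # An incremental decoder tolerates a multi-byte character that was
    # cut off at the end of the sample, but still rejects invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample)
    except UnicodeDecodeError:
        return False
    return True
```

Note that the extension wins whenever it is recognised, so a mislabelled file is classified by its name, and textual types outside `text/` (for example `application/json`) are reported as non-text.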
  • Nobody at Nov 15, 2009 at 6:56 pm

    On Sun, 15 Nov 2009 13:49:54 +0100, Luca wrote:

    I was fairly sure this would not be a simple task. For now, checking
    only for ASCII is not enough for me (my native language falls
    outside that encoding :-)
    Checking every single byte could be a good solution...

    I can start with the mimetypes module and, if the file has no
    extension, check the bytes one by one, as the "file" command
    (commonly) does.
    Better: I can use the "file" command if it is available.

    Another possible solution:

    Universal Encoding Detector
    Character encoding auto-detection in Python 2 and 3

    http://chardet.feedparser.org/
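For reference, a minimal sketch of calling the detector from Python. It assumes the third-party `chardet` package is installed; the wrapper name `guess_encoding` and the sample size are hypothetical choices made here. The result is a statistical guess with a confidence score, not a guarantee.

```python
def guess_encoding(path, sample_size=65536):
    """Return (encoding, confidence), or (None, 0.0) if chardet is missing."""
    try:
        import chardet  # third-party package, not in the standard library
    except ImportError:
        return None, 0.0
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    result = chardet.detect(sample)
    # The encoding is None when the detector cannot make any guess.
    return result["encoding"], result["confidence"]
```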
  • Nobody at Nov 15, 2009 at 12:06 pm

    On Sat, 14 Nov 2009 17:02:29 +0100, Luca Fabbri wrote:

    I'm looking for a way to load an arbitrary file from the
    system and determine whether it is plain text.
    The mimetypes module has some nice methods, but it doesn't
    work for files without an extension, for example.

    Any suggestions?

    You could use the "file" command. It's normally installed by default on
    Unix systems, but you can get a Windows version from:

    http://gnuwin32.sourceforge.net/packages/file.htm
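Calling the command from Python could be sketched like this. It assumes a `file` binary on the PATH that supports the `--brief` and `--mime-type` options (reasonably recent versions do; older ones may not). The helper name `file_mime_type` is made up for this example.

```python
import shutil
import subprocess


def file_mime_type(path):
    """Return the MIME type reported by `file`, or None if unavailable."""
    exe = shutil.which("file")
    if exe is None:
        return None  # `file` is not installed or not on the PATH
    result = subprocess.run(
        # "--" stops option parsing, so a path starting with "-" is safe.
        [exe, "--brief", "--mime-type", "--", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()
```

A caller could then treat any result starting with `text/` as text and fall back to another check when the function returns `None`.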
  • Chris Rebert at Nov 15, 2009 at 12:34 pm

    On Sun, Nov 15, 2009 at 4:06 AM, Nobody wrote:
    On Sat, 14 Nov 2009 17:02:29 +0100, Luca Fabbri wrote:

    I'm looking for a way to load an arbitrary file from the
    system and determine whether it is plain text.
    The mimetypes module has some nice methods, but it doesn't
    work for files without an extension, for example.

    Any suggestions?

    You could use the "file" command. It's normally installed by default on
    Unix systems, but you can get a Windows version from:

    FWIW, IIRC the heuristic `file` uses to check whether a file is text
    or not is whether it contains any null bytes; if it does, it
    classifies it as binary (i.e. not text).

    Cheers,
    Chris
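The null-byte heuristic itself is easy to sketch, independent of what `file` really does. The name `contains_nul` and the chunk size are illustrative. Reading in chunks keeps memory use flat on large files, and the scan stops at the first NUL.

```python
def contains_nul(path, chunk_size=65536):
    """Return True if the file contains at least one NUL (0x00) byte."""
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            if b"\x00" in chunk:
                return True
    return False
```

Under this heuristic a file would be treated as binary when `contains_nul(path)` is true.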
  • Nobody at Nov 15, 2009 at 6:50 pm

    On Sun, 15 Nov 2009 04:34:10 -0800, Chris Rebert wrote:

    I'm looking for a way to load an arbitrary file from the
    system and determine whether it is plain text.
    The mimetypes module has some nice methods, but it doesn't
    work for files without an extension, for example.

    Any suggestions?

    You could use the "file" command. It's normally installed by default on
    Unix systems, but you can get a Windows version from:

    FWIW, IIRC the heuristic `file` uses to check whether a file is text
    or not is whether it contains any null bytes; if it does, it
    classifies it as binary (i.e. not text).

    "file" provides more granularity than that, recognising many specific
    formats, both text and binary.

    First, it uses "magic number" checks, checking for known signature bytes
    (e.g. "#!" or "JFIF") at the beginning of the file. If those checks fail
    it checks for common text encodings. If those also fail, it reports "data".

    Also, UTF-16-encoded text is recognised as text, even though it may
    contain a high proportion of NUL bytes.
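The three stages described above (signature checks, then text encodings, then "data") can be imitated in a few lines. This is a toy sketch, not `file`'s actual logic: the signature table and labels are a small hand-picked assumption, and binary control characters are not rejected the way a real implementation would.

```python
import codecs

# A tiny, hand-picked signature table (an assumption for illustration).
SIGNATURES = [
    (b"#!", "script text"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"\xff\xd8\xff", "JPEG image"),
]


def classify(data):
    """Classify a byte string: signatures, then text encodings, then 'data'."""
    # Stage 1: known signature bytes at the start of the content.
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    # Stage 2a: UTF-16 with a BOM counts as text despite its NUL bytes.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            data.decode("utf-16")
            return "UTF-16 text"
        except UnicodeDecodeError:
            return "data"
    # Stage 2b: other common text encodings, strictest first.
    for encoding, label in (("ascii", "ASCII text"), ("utf-8", "UTF-8 text")):
        try:
            data.decode(encoding)
            return label
        except UnicodeDecodeError:
            pass
    # Stage 3: nothing matched.
    return "data"
```

For example, `classify("ciao".encode("utf-16"))` is reported as UTF-16 text even though half of its bytes are NUL, which is exactly the case a plain null-byte test gets wrong.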

Discussion Overview
group: python-list @
category: python
posted: Nov 14, '09 at 4:02p
active: Nov 15, '09 at 6:56p
posts: 7
users: 5
website: python.org
