Hi all,
I just tried to find some information about the unicodedata database
and the possibilities of updating it to the latest version of the
unicode standards (currently 5.2, while python supports 5.1 in the
latest versions).
An option to update this database individually might be useful as the
unicode standard updates seem to be more frequent than the official
python releases (and not every release is updated to the latest
available unicode db version either).
Am I right that this is not possible without recompiling python from source?
I eventually found the promising file
...Python-src--2.6.5\Python-2.6.5\Tools\unicode\makeunicodedata.py
which required the following files from the unicode database to be in
the same folder:
EastAsianWidth-3.2.0.txt
UnicodeData-3.2.0.txt
CompositionExclusions-3.2.0.txt
UnicodeData.txt
EastAsianWidth.txt
CompositionExclusions.txt

and also
Modules/unicodedata_db.h
Modules/unicodename_db.h,
Objects/unicodetype_db.h

After a minor correction - adding the missing "import re" - the script
was able to run and recreate the above .h files.
I guess, I am stuck here, as I use the precompiled version supplied in
the windows installer and can't compile python from source to obtain
the needed unicodedata.pyd.
Or are there any possibilities I missed to individually upgrade the
unicodedata database? (Using Python 2.6.5, Win XPh SP3)
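
For reference, the version of the Unicode database compiled into a given interpreter can be read from the unicodedata module itself. A minimal sketch follows; the helper name unicode_db_version is made up for illustration, and the code is written for a current Python 3 rather than the 2.6.5 discussed here.

```python
import unicodedata


def unicode_db_version():
    """Return the built-in Unicode database version as a tuple of ints."""
    # unidata_version is a dotted string such as "5.1.0".
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


if __name__ == "__main__":
    print("Unicode database version:", unicodedata.unidata_version)
    # True if the interpreter already ships at least Unicode 5.2.0.
    print("At least 5.2.0:", unicode_db_version() >= (5, 2, 0))
```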

Thanks in advance for any hints,
vbr


  • MRAB at Mar 23, 2010 at 1:27 am

    From the look of it, the Unicode data is compiled into the DLL, but I
    don't see any reason, other than speed, why preprocessed data couldn't
    be read from a file at startup by the DLL. That would work without
    affecting the DLL's interface to the rest of Python, provided that the
    format hasn't changed (e.g. by new fields being added).
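
    To illustrate the idea of reading the data at run time, here is a rough Python sketch that parses lines in the semicolon-separated UnicodeData.txt layout into a lookup table. It is not how the unicodedata module works; load_names and the three SAMPLE lines are made up for illustration, and the "First"/"Last" range entries of the real file are not handled.

```python
# Illustrative sample lines in the UnicodeData.txt layout:
# field 0 = code point (hex), field 1 = name, field 2 = general category.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
2150;VULGAR FRACTION ONE SEVENTH;No;0;ON;;;;1/7;N;;;;;
10B00;AVESTAN LETTER A;Lo;0;R;;;;;N;;;;;
"""


def load_names(lines):
    """Build {code point: (name, category)} from UnicodeData.txt-style lines."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(";")
        table[int(fields[0], 16)] = (fields[1], fields[2])
    return table


if __name__ == "__main__":
    table = load_names(SAMPLE.splitlines())
    print(table[0x2150])
```

    With a real file, the same function could be fed an open file object instead of SAMPLE.splitlines().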
  • Gabriel Genellina at Mar 23, 2010 at 7:22 am
    On Mon, 22 Mar 2010 21:19:04 -0300, Vlastimil Brom
    <vlastimil.brom at gmail.com> wrote:
    > I guess, I am stuck here, as I use the precompiled version supplied in
    > the windows installer and can't compile python from source to obtain
    > the needed unicodedata.pyd.
    You can recompile Python from source, on Windows, using the free
    Microsoft Visual C++ 2008 Express Edition.
    http://www.microsoft.com/express/Windows/

    Fetch the required dependencies using Tools\buildbot\external.bat, and
    then execute PCbuild\env.bat and build.bat. See readme.txt in that
    directory for details. It should build cleanly.

    --
    Gabriel Genellina
  • Vlastimil Brom at Mar 23, 2010 at 2:18 pm

    Thanks for the hints; I probably screwed some steps up in some way,
    but the result seems to be working for the most part; I'll try to
    summarise it just for the record (also hoping to get further
    suggestions):
    I used the official source tarball for python 2.6.5 from:
    http://www.python.org/download/

    In the unpacked sources, I edited the file:
    ...\Python-2.6.5-src\Tools\unicode\makeunicodedata.py

    import re # added
    ...
    # UNIDATA_VERSION = "5.1.0" # changed to:
    UNIDATA_VERSION = "5.2.0"

    Furthermore, the following text files were copied to the same directory
    as makeunicodedata.py

    CompositionExclusions-3.2.0.txt
    EastAsianWidth-3.2.0.txt
    UnicodeData-3.2.0.txt
    UnicodeData.txt
    EastAsianWidth.txt
    CompositionExclusions.txt

    from
    http://unicode.org/Public/3.2-Update/
    and
    http://unicode.org/Public/5.2.0/ucd/
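
    The file collection step could be scripted roughly as follows. This is only a sketch: planned_downloads and fetch_all are hypothetical helpers, and it assumes that the files are published under exactly these names in the two directories listed above. Nothing is downloaded unless fetch_all is called explicitly.

```python
import os
from urllib.request import urlretrieve

OLD_BASE = "http://unicode.org/Public/3.2-Update/"
NEW_BASE = "http://unicode.org/Public/5.2.0/ucd/"
NAMES = ("UnicodeData", "EastAsianWidth", "CompositionExclusions")


def planned_downloads():
    """Return (url, local file name) pairs for the six required files."""
    plan = []
    for name in NAMES:
        # Assumption: the 3.2.0 files carry a "-3.2.0" suffix on the server.
        plan.append((OLD_BASE + name + "-3.2.0.txt", name + "-3.2.0.txt"))
        plan.append((NEW_BASE + name + ".txt", name + ".txt"))
    return plan


def fetch_all(target_dir):
    """Download every planned file into target_dir (needs network access)."""
    for url, filename in planned_downloads():
        urlretrieve(url, os.path.join(target_dir, filename))


if __name__ == "__main__":
    # Only list what would be fetched; call fetch_all(path) to download.
    for url, filename in planned_downloads():
        print(url, "->", filename)
```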

    Furthermore, some files are needed in the subdirectories:
    ...\Python-2.6.5-src\Tools\unicode\Objects\unicodetype_db.h
    ...\Python-2.6.5-src\Tools\unicode\Modules\unicodedata_db.h
    ...\Python-2.6.5-src\Tools\unicode\Modules\unicodename_db.h

    After running makeunicodedata.py, the above headers are recreated from
    the new unicode database and can be copied to the original locations
    in the source

    ...\Python-2.6.5-src\Objects\unicodetype_db.h
    ...\Python-2.6.5-src\Modules\unicodedata_db.h
    ...\Python-2.6.5-src\Modules\unicodename_db.h

    (while keeping the backups)
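
    The copy-back step with backups could look roughly like this in Python. It is a sketch under assumptions: install_headers is a hypothetical helper, tools_dir stands for the Tools\unicode directory and source_root for the top of the unpacked source tree, and the ".bak" suffix is an arbitrary choice.

```python
import os
import shutil

# Header paths relative to both the Tools\unicode directory and the source root.
HEADERS = (
    os.path.join("Objects", "unicodetype_db.h"),
    os.path.join("Modules", "unicodedata_db.h"),
    os.path.join("Modules", "unicodename_db.h"),
)


def install_headers(tools_dir, source_root):
    """Copy regenerated headers into the source tree, backing up the originals."""
    copied = []
    for rel in HEADERS:
        new_file = os.path.join(tools_dir, rel)
        dest = os.path.join(source_root, rel)
        backup = dest + ".bak"
        # Keep the first backup only, so a second run cannot overwrite
        # the original header with an already regenerated one.
        if os.path.exists(dest) and not os.path.exists(backup):
            shutil.copy2(dest, backup)
        shutil.copy2(new_file, dest)
        copied.append(dest)
    return copied
```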

    Trying to run
    ...\Python-2.6.5-src\Tools\buildbot\external.bat and other bat files,
    I got quite a few path mismatches resulting in file ... not found
    errors;

    However, I was able to just open the solution file in Visual C++ 2008 Express:
    C:\install\Python-2.6.5-src\PCbuild\pcbuild.sln
    set the build configuration to "release" and try to build the sources.

    There were some errors in particular modules (which might be due to my
    mistakes or omissions, as this maybe shouldn't happen normally), but
    the wanted
    ...\Python-2.6.5-src\PCbuild\unicodedata.pyd
    was generated and can be used in the original python installation:
    C:\Python26\DLLs\unicodedata.pyd

    the newly added characters, cf.:
    http://www.unicode.org/Public/UNIDATA/DerivedAge.txt
    seem to be available

    ⅐ (dec.: 8528) (hex.: 0x2150) # ⅐ VULGAR FRACTION ONE SEVENTH (Number, Other)
    𐬀 (dec.: 68352) (hex.: 0x10b00) # 𐬀 AVESTAN LETTER A (Letter, Other)

    but some are not present; I noticed this for
    the new CJK block - CJK Unified Ideographs Extension C (U+2A700..U+2B73F).
    Probably this new range isn't taken into account for some reason.
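
    A check like the one above can be scripted with the unicodedata module. The sketch below uses a hypothetical helper, describe, and is written for a current Python 3; on Python 2.6 it would need unichr, which cannot handle code points above U+FFFF on a narrow build. What is printed for each code point depends on the database version of the interpreter running it.

```python
import unicodedata


def describe(code_point):
    """Return 'U+XXXX NAME (category)' for a code point in the built-in database."""
    ch = chr(code_point)
    # The second argument is returned when the database has no name for ch.
    name = unicodedata.name(ch, "<no name in this database>")
    return "U+%04X %s (%s)" % (code_point, name, unicodedata.category(ch))


if __name__ == "__main__":
    # Two of the characters added in 5.2 and the first code point of
    # CJK Unified Ideographs Extension C.
    for cp in (0x2150, 0x10B00, 0x2A700):
        print(describe(cp))
```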

    All in all, I am happy to have the current version of the unicode
    database available; I somehow expected this to be more complicated,
    but on the other hand I can't believe this is the standard way of
    preparing the built versions (with all the copying, checking and
    replacing of files); it might be possible that the actual
    distribution is built using some different tools (the trivial missing
    import in makeunicodedata.py would be found immediately, I guess).

    I also wanted to ask, whether the missing characters might be a result
    of my error in updating the unicode database, or could it be a problem
    with the makeunicodedata.py itself?

    Thanks in advance and sorry for this long post.
    vbr

Discussion Overview
group: python-list
categories: python
posted: Mar 23, '10 at 12:19a
active: Mar 23, '10 at 2:18p
posts: 4
users: 3
website: python.org
