Hi,
The strings I am writing in my reducer contain characters that may present a
problem, such as the char with decimal value 254, which is hex FE. In the
output I seem to get hex C3 instead, or something else is messed up. Or my
understanding is messed up :)

Any advice?

Thank you,
Mark


  • Todd Lipcon at Oct 9, 2009 at 11:51 pm
    Hi Mark,

    If you're using TextOutputFormat, it assumes you're dealing in UTF-8. Decimal
    254 (hex FE) isn't valid as a standalone byte in UTF-8 encoding.

    If you're dealing with binary (i.e. non-textual) data, you shouldn't use
    TextOutputFormat.

    -Todd
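
To illustrate the point above, here is a minimal sketch that uses only Java's standard charset API rather than Hadoop itself. It shows why a char with value 254 can turn up as hex C3 in UTF-8 output: the code point U+00FE is encoded as the two bytes C3 BE. The class name `Utf8ByteDemo` and the helper `toHex` are made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Hadoop.
class Utf8ByteDemo {
    // Render a byte array as space-separated upper-case hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00FE"; // the char with decimal value 254

        // UTF-8 needs two bytes for any code point from U+0080 to U+07FF.
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));      // C3 BE

        // A single-byte "extended ASCII" charset such as ISO-8859-1 keeps it as one byte.
        System.out.println(toHex(s.getBytes(StandardCharsets.ISO_8859_1))); // FE
    }
}
```

So a consumer that expects the single byte FE will not find it in UTF-8 text output; it will find C3 followed by BE.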
  • Mark Kerzner at Oct 13, 2009 at 2:57 am
    Thanks, that is a great answer.
    My problem is that the application that reads my output accepts a
    comma-separated file with extended ASCII delimiters. Following your answer,
    however, I will try to use low-value ASCII, like 9 or 11, unless someone has
    a better suggestion.

    Thank you,
    Mark
  • Todd Lipcon at Oct 13, 2009 at 3:16 am
    Hey Mark,

    The most commonly used delimiter for cases like this is ^A (character 1).

    -Todd
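
As a rough sketch of why a low control character works here: every code point below 128 is a single byte in UTF-8, so ^A (character 1) survives text output unchanged. The class `CtrlADelimiterDemo` and the method `joinFields` are hypothetical names for this example, and the sample field values are invented.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Hadoop.
class CtrlADelimiterDemo {
    static final char CTRL_A = '\u0001'; // ^A, character 1

    // Join the fields with ^A between them.
    static String joinFields(String... fields) {
        return String.join(String.valueOf(CTRL_A), fields);
    }

    public static void main(String[] args) {
        // Invented sample values; note the ordinary comma inside a field.
        String line = joinFields("id42", "Smith, John", "NY");
        byte[] utf8 = line.getBytes(StandardCharsets.UTF_8);

        // All chars here are below 128, so each one is exactly one UTF-8 byte.
        System.out.println(utf8.length == line.length()); // true

        // Splitting on ^A recovers the fields, commas and all.
        String[] back = line.split(String.valueOf(CTRL_A), -1);
        System.out.println(back.length); // 3
    }
}
```

This assumes the field values themselves never contain ^A.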
  • Mark Kerzner at Oct 13, 2009 at 3:20 am
    Thanks again, Todd. I need two delimiters, one for comma and one for quote.
    But I guess I can use ^A for quote, and keep the comma as is, and I will be
    good.
    Sincerely,
    Mark
  • Amr Awadallah at Oct 13, 2009 at 3:36 am
    ^A for quote, ^B for comma .. and so on.

    -- amr
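
One possible reading of this suggestion, sketched in plain Java: wrap each field in ^A as the quote character and separate fields with ^B. The class `CtrlQuotedRowDemo` and its methods are hypothetical, and the sketch assumes field values never contain ^A or ^B themselves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of Hadoop.
class CtrlQuotedRowDemo {
    static final char QUOTE = '\u0001'; // ^A stands in for the quote
    static final char SEP = '\u0002';   // ^B stands in for the comma

    // Wrap every field in QUOTE and put SEP between fields.
    static String formatRow(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(SEP);
            }
            sb.append(QUOTE).append(fields.get(i)).append(QUOTE);
        }
        return sb.toString();
    }

    // Split on SEP and strip the surrounding QUOTE chars where present.
    static List<String> parseRow(String row) {
        List<String> out = new ArrayList<>();
        for (String cell : row.split(String.valueOf(SEP), -1)) {
            boolean quoted = cell.length() >= 2
                    && cell.charAt(0) == QUOTE
                    && cell.charAt(cell.length() - 1) == QUOTE;
            out.add(quoted ? cell.substring(1, cell.length() - 1) : cell);
        }
        return out;
    }

    public static void main(String[] args) {
        // Ordinary commas and double quotes need no escaping in this scheme.
        List<String> fields = Arrays.asList("Smith, John", "said \"hi\"", "");
        String row = formatRow(fields);
        System.out.println(parseRow(row).equals(fields)); // true
    }
}
```

Both delimiters are below 128, so they remain single bytes when the row is written as UTF-8 text.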

  • Amandeep Khurana at Oct 13, 2009 at 3:36 am
    ^A is ASCII 1. You can use ASCII 2 (^B) for the comma.

    --


    Amandeep Khurana
    Computer Science Graduate Student
    University of California, Santa Cruz

Discussion Overview
group: common-user @
categories: hadoop
posted: Oct 9, '09 at 10:11p
active: Oct 13, '09 at 3:36a
posts: 7
users: 4
website: hadoop.apache.org...
irc: #hadoop
