Grokbase Groups Hive user April 2010
I have some sequence files in which all our data is in the key.

http://osdir.com/ml/hive-user-hadoop-apache/2009-10/msg00027.html

Has anyone tackled the above issue?

  • Zheng Shao at Apr 3, 2010 at 1:34 am
    The easiest way is to write a SequenceFileInputFormat subclass that returns a
    RecordReader which swaps the two: the file's key comes back as the value, and
    the file's value comes back as the key.

    Zheng
    --
    Yours,
    Zheng
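
The swap suggested above can be illustrated without Hadoop. This is only a minimal sketch of the wrapping idea: `PairReader`, `SwappingReader`, and `SwapSketch` are hypothetical names made up for the illustration, and none of them is part of the Hadoop or Hive API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;

// Hypothetical minimal reader interface; it only mimics the shape of a RecordReader.
interface PairReader<K, V> {
    // Returns the next (key, value) pair, or null when the input is exhausted.
    Map.Entry<K, V> next();
}

// Wraps any reader and exposes each record with key and value exchanged.
class SwappingReader<K, V> implements PairReader<V, K> {
    private final PairReader<K, V> delegate;

    SwappingReader(PairReader<K, V> delegate) {
        this.delegate = delegate;
    }

    @Override
    public Map.Entry<V, K> next() {
        Map.Entry<K, V> record = delegate.next();
        if (record == null) {
            return null; // end of input is passed through unchanged
        }
        return new SimpleEntry<V, K>(record.getValue(), record.getKey());
    }
}

class SwapSketch {
    // In-memory stand-in for a file whose data lives entirely in the key:
    // every record carries the given key and an empty value.
    static PairReader<String, String> keysOnly(String... keys) {
        Iterator<String> it = Arrays.asList(keys).iterator();
        return () -> it.hasNext() ? new SimpleEntry<String, String>(it.next(), "") : null;
    }

    public static void main(String[] args) {
        PairReader<String, String> swapped =
            new SwappingReader<>(keysOnly("row-1", "row-2"));
        Map.Entry<String, String> record;
        while ((record = swapped.next()) != null) {
            // The original key is now available on the value side.
            System.out.println("value=" + record.getValue());
        }
    }
}
```

A consumer that only looks at the value side, as Hive does when reading sequence files, would then see the data that was stored in the key.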
  • Edward Capriolo at Apr 13, 2010 at 9:43 pm

    I am attempting to do this for sequence files. Unfortunately, I have to copy
    much of the sequence file record reader code, since its reader field (in) is
    private.
    ----------------------------------------
    public class SequenceKeyOnlyInputFormat<K extends WritableComparable, V extends Writable>
        extends SequenceFileInputFormat<K, V> {

      public RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
          throws IOException {
        reporter.setStatus(split.toString());
        return new SequenceKeyOnlyRecordReader<K, V>(job, (FileSplit) split);
      }

    }
    --------------------------------------------
    @SuppressWarnings({ "unchecked", "deprecation" })
    public class SequenceKeyOnlyRecordReader<K extends WritableComparable, V extends Writable>
        implements RecordReader<K, V> {

      private SequenceFile.Reader in;
      private long start;
      private long end;
      private boolean more = true;
      protected Configuration conf;

      public SequenceKeyOnlyRecordReader(Configuration conf, FileSplit split) throws IOException {
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(conf);
        this.in = new SequenceFile.Reader(fs, path, conf);
        this.end = split.getStart() + split.getLength();
        this.conf = conf;

        if (split.getStart() > in.getPosition()) in.sync(split.getStart()); // sync to start

        this.start = in.getPosition();
        more = start < end;
      }

      /**
       * The class of key that must be passed to {@link #next(Object, Object)}..
       */
      public Class getKeyClass() {
        return in.getKeyClass();
      }

      /**
       * The class of value that must be passed to {@link #next(Object, Object)}..
       */
      public Class getValueClass() {
        return in.getKeyClass();
      }

      public K createKey() {
        return (K) ReflectionUtils.newInstance(getKeyClass(), conf);
      }

      public V createValue() {
        return (V) ReflectionUtils.newInstance(getKeyClass(), conf);
      }

      public synchronized boolean next(K key, V value) throws IOException {
        if (!more) return false;
        long pos = in.getPosition();

        boolean remaining = in.next(key);
        if (remaining) {
          getCurrentValue(value);
        }
        if (pos >= end && in.syncSeen()) {
          more = false;
        } else {
          more = remaining;
        }
        return more;
      }

      protected synchronized boolean next(K key) throws IOException {
        if (!more) return false;
        long pos = in.getPosition();
        boolean remaining = in.next(key);
        if (pos >= end && in.syncSeen()) {
          more = false;
        } else {
          more = remaining;
        }
        return more;
      }

      protected synchronized void getCurrentValue(V value) throws IOException {
        in.getCurrentValue(value);
        //in.next(value);
      }

      /**
       * Return the progress within the input split
       *
       * @return 0.0 to 1.0 of the input byte range
       */
      public float getProgress() throws IOException {
        if (end == start) {
          return 0.0f;
        } else {
          return Math.min(1.0f, (in.getPosition() - start) / (float) (end - start));
        }
      }

      public synchronized long getPos() throws IOException {
        return in.getPosition();
      }

      protected synchronized void seek(long pos) throws IOException {
        in.seek(pos);
      }

      public synchronized void close() throws IOException {
        in.close();
      }

    }

    It seems that this version:

    protected synchronized void getCurrentValue(V value) throws IOException {
      in.getCurrentValue(value);
    }

    returns nulls, while this one:

    protected synchronized void getCurrentValue(V value) throws IOException {
      in.next(value);
    }

    returns every other row.

    Do you have any idea what I am doing wrong? I hope to contribute it if I
    can get this working correctly.

    Thanks,
    Edward
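
One plausible reading of the "every other row" symptom is that the reader gets advanced twice per record: once for the key holder and once more for the value holder. The sketch below reproduces that pattern with a toy in-memory reader. `ToyReader` and `SkipSketch` are hypothetical stand-ins written for this illustration, and they only assume that `next(...)` advances the reader while `getCurrentValue(...)` does not; they are not the Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a sequence file reader; NOT the Hadoop API.
class ToyReader {
    private final String[][] records; // each record is {key, value}
    private int index = -1;

    ToyReader(String[][] records) {
        this.records = records;
    }

    // Advances to the next record and copies its key into holder[0].
    boolean next(String[] holder) {
        if (index + 1 >= records.length) {
            return false;
        }
        index++;
        holder[0] = records[index][0];
        return true;
    }

    // Copies the current record's stored value into holder[0] without advancing.
    void getCurrentValue(String[] holder) {
        holder[0] = records[index][1];
    }
}

class SkipSketch {
    // First attempt: the stored value is read, but the data lives in the key.
    static List<String> readStoredValue(String[][] data) {
        ToyReader in = new ToyReader(data);
        List<String> out = new ArrayList<>();
        String[] key = new String[1];
        String[] value = new String[1];
        while (in.next(key)) {
            in.getCurrentValue(value);
            out.add(value[0]);
        }
        return out;
    }

    // Second attempt: next() runs once for the key and once more for the value,
    // so each returned row silently consumes two records.
    static List<String> readAdvancingTwice(String[][] data) {
        ToyReader in = new ToyReader(data);
        List<String> out = new ArrayList<>();
        String[] key = new String[1];
        String[] value = new String[1];
        while (in.next(key)) {
            if (!in.next(value)) {
                break;
            }
            out.add(value[0]);
        }
        return out;
    }

    // Possible fix in this toy model: advance once, straight into the value holder.
    static List<String> readAdvancingOnce(String[][] data) {
        ToyReader in = new ToyReader(data);
        List<String> out = new ArrayList<>();
        String[] value = new String[1];
        while (in.next(value)) {
            out.add(value[0]);
        }
        return out;
    }
}
```

In this model the first variant yields only nulls, the second yields the second and fourth keys of a four-record input, and the third yields every key. Whether the real reader behaves the same way depends on the Hadoop version in use, so treat this as a way to reason about the symptom and not as a confirmed diagnosis.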
  • Edward Capriolo at Apr 14, 2010 at 12:16 am
    I was looking at the code, and it looks like Hive uses an ignore-key
    output format. So rather than trying to swap values in the input format,
    I could just write an ignore-value output format.
  • John Sichi at Apr 14, 2010 at 12:46 am
    Coincidentally, you can find a HiveNullValueSequenceFileOutputFormat in my HIVE-1295.1.patch:

    https://issues.apache.org/jira/browse/HIVE-1295

    (I needed this because that's what TotalOrderPartitioner wanted...)

    JVS
  • Zheng Shao at Apr 14, 2010 at 4:21 am
    Try this:

    protected synchronized void getCurrentValue(V value) throws IOException {
      in.getCurrentKey(value);
    }

    --
    Yours,
    Zheng
    http://www.linkedin.com/in/zshao
