Hello everybody,

currently I've got the task of converting ~9500 CSV files to RDF (a corpus
extracted from the publicdata.eu portal), and I use the Python csv module
to extract the headers from the CSV files.
I tried to use the sniff method, as in the following example:
try:
    with open(self.resource_dir + self.filename, 'rU') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(1024))
        csvfile.seek(0)  # rewind after sampling, so the reader starts at the header
        reader = csv.reader(csvfile, dialect)
        for row in reader:
            return row
except BaseException as e:
    print str(e)
    return []
But it fails to detect the comma ',' as the delimiter in some cases (for
instance, it can pick 'i' as the delimiter, which is nonsense in
real-world data). This is really bad, because the comma is the most
frequently used delimiter and should be detected without mistakes.

If I know which delimiters are possible in my corpus, is there a way to
tell the sniffer to choose between them?
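Ideally I'd like something along these lines (the sample data here is made up; sniff() does document an optional delimiters argument, a string of candidate delimiter characters, which looks like it might do this):

```python
import csv

sample = "name,age\nalice,30\nbob,25\n"

# Restrict the sniffer to a known set of candidate delimiters:
# the optional second argument of sniff() is a string containing
# the possible valid delimiter characters.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
# dialect.delimiter should now be one of ',', ';' or '\t'
```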

Kind regards,
Ivan Ermilov.

  • Tony Wallace at Dec 22, 2012 at 4:22 am
    If I were importing 9500 CSV files generated as output from a single
    database I would not even try to use dialect detection. Better to
    determine what the correct dialect is and parse it with a statically
    assigned dialect. This dialect could be stored in your application
    metadata or assigned in code.

    The reason is that in handling production data quantities there are
    always a few records that trip up code or detection algorithms. Better
    to find out what the gotchas are and deal with them once and for all.
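    Something along these lines (the dialect parameters and the helper
    below are only an illustration; adjust them to whatever your corpus
    actually uses):

```python
import csv
import io

# A statically assigned dialect instead of sniffing. The parameter
# values here are assumptions -- set them to match your corpus.
csv.register_dialect("pubdata", delimiter=",", quotechar='"',
                     skipinitialspace=True)

def read_header(fileobj):
    """Return the first row (the header) of a CSV file object."""
    reader = csv.reader(fileobj, dialect="pubdata")
    for row in reader:
        return row
    return []

# In-memory file standing in for one of the 9500 CSVs:
header = read_header(io.StringIO("name,age\nalice,30\n"))
```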


Discussion overview: group csv; posted Dec 5, '12 at 11:40p; active until Dec 22, '12 at 4:22a
