Hello everybody,


I am currently converting ~9500 CSV files to RDF (a corpus extracted
from the publicdata.eu portal), and I use the Python csv module to
extract the header row from each CSV file.
I tried the sniff method, as in the following example:
    with open(self.resource_dir + self.filename, 'rU') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(1024))
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)
        try:
            for row in reader:
                return row
        except BaseException as e:
            print str(e)
            return []
But it fails to detect the comma ',' as the delimiter in some cases (for
instance, it can pick 'i' as the delimiter, which makes no sense in
real-world applications). This is really bad, because the comma is the
most frequently used delimiter and should be detected reliably.


If I know which delimiters are possible in my corpus, is there a way to
tell the sniffer to choose only among them?


Kind regards,
Ivan Ermilov.
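
On the question above: `csv.Sniffer.sniff` accepts an optional `delimiters` argument, a string of the characters it is allowed to pick as the delimiter. The following is a minimal sketch in Python 3, not the poster's code. The function name `sniff_header`, the candidate set `CANDIDATE_DELIMITERS`, and the fallback to `csv.excel` are assumptions made for illustration.

```python
import csv

# Hypothetical candidate set; adjust it to the delimiters known to occur in the corpus.
CANDIDATE_DELIMITERS = ",;\t|"


def sniff_header(path, delimiters=CANDIDATE_DELIMITERS, sample_size=1024):
    """Return the first row of a CSV file, letting the sniffer choose the
    delimiter only from the given candidate characters."""
    with open(path, newline="") as csvfile:
        sample = csvfile.read(sample_size)
        csvfile.seek(0)
        try:
            dialect = csv.Sniffer().sniff(sample, delimiters=delimiters)
        except csv.Error:
            # The sniffer could not decide; assume plain comma-separated data.
            dialect = csv.excel
        # next() with a default returns [] for an empty file.
        return next(csv.reader(csvfile, dialect), [])
```

With the candidates restricted this way, a letter such as 'i' can no longer be selected, although the sniffer may still guess wrong between the listed characters on unusual files.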


  • Tony Wallace at Dec 22, 2012 at 4:22 am
    If I were importing 9500 CSV files generated as output from a single
    database, I would not even try to use dialect detection. Better to
    determine what the correct dialect is and parse the files with a
    statically assigned dialect. This dialect could be stored in your
    application metadata or assigned in code.


    The reason is that when handling production-scale data there are
    always a few records that trip up code or detection algorithms. Better
    to find out what the gotchas are and deal with them once and for all.


    Tony
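
The statically assigned dialect suggested in this reply could look roughly like the sketch below (Python 3). The class name `CorpusDialect`, the registered name "corpus", the helper `read_header`, and the concrete dialect values are assumptions for illustration; the real values have to come from inspecting the corpus.

```python
import csv


class CorpusDialect(csv.Dialect):
    """Hypothetical fixed dialect, assumed here to be comma-separated with
    double-quote quoting; replace the values with what the corpus really uses."""
    delimiter = ","
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = "\r\n"
    quoting = csv.QUOTE_MINIMAL


# Register the dialect under a name so it can be selected from application metadata.
csv.register_dialect("corpus", CorpusDialect)


def read_header(path, dialect="corpus"):
    """Return the first row of the file, parsed with the fixed dialect."""
    with open(path, newline="") as csvfile:
        # next() with a default returns [] for an empty file.
        return next(csv.reader(csvfile, dialect), [])
```

Because no detection is involved, a file that does not match the dialect yields a visibly wrong header, such as a single long field, instead of a silently mis-guessed delimiter, which makes the problem files easier to find and handle explicitly.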

Discussion Overview
group: csv @
categories: python
posted: Dec 5, '12 at 11:40p
active: Dec 22, '12 at 4:22a
posts: 2
users: 2
website: python.org
