I'm doing some simple file manipulation work and the process gets
"Killed" every time I run it. No traceback, no segfault... just the
word "Killed" in the bash shell and the process ends. The first few
batch runs would only succeed with one or two files being processed
(out of 60) before the process was "Killed". Now it makes no
successful progress at all. Just a little processing then "Killed".
That isn't a Python thing. Run "sleep 60" in one shell, then "kill -9"
the process in another shell, and you'll get the same message.
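If you would rather see the same thing from Python, here is a minimal sketch, assuming a POSIX system. The function name kill_child_and_get_status is made up for illustration. It starts a sleeping child, sends it SIGKILL, and returns the exit status the parent sees.

```python
import signal
import subprocess
import sys


def kill_child_and_get_status():
    """Hypothetical helper: start a sleeping child, SIGKILL it, return its status."""
    # The child would otherwise sleep for 60 seconds.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # SIGKILL cannot be caught, so the child never gets to print a traceback.
    child.send_signal(signal.SIGKILL)
    # On POSIX, a negative return code -N means "terminated by signal N".
    return child.wait()


if __name__ == "__main__":
    print(kill_child_and_get_status())
```

The child dies without running any cleanup code, which is why the only trace left is the shell's "Killed" message.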
I know my shared web host has a daemon that does that to processes that
consume too many resources.
Wait a minute. If you ran this multiple times, won't it have removed the
first two lines from the first files multiple times, deleting some data
you actually care about? I hope you have backups...
Any Ideas? Is there a buffer limitation? Do you think it could be the
Any suggestions appreciated.... Thanks.
The code I'm running:
from glob import glob
filePathList = glob('/data/ascii/*.dat')
If that dir is very large, that could be slow. Both because glob will
run a regexp over every filename, and because it will return a list of
every file that matches.
If you have Python 2.5, you could use glob.iglob() instead of
glob.glob(), which returns an iterator instead of a list.
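A small sketch of the difference, assuming a modern Python. count_matches is a made-up name. The loop touches one path at a time, so memory use stays flat no matter how many files match.

```python
import glob


def count_matches(pattern):
    """Hypothetical helper: count matching paths without building a list."""
    count = 0
    # iglob yields paths lazily; glob.glob(pattern) would return them all at once.
    for _path in glob.iglob(pattern):
        count += 1
    return count


if __name__ == "__main__":
    # The pattern is the one from the original post.
    print(count_matches('/data/ascii/*.dat'))
```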
for filePath in filePathList:
    f = open(filePath, 'r')
    lines = f.readlines()[2:]
This reads the entire file into memory. Worse, slicing builds a second
list of the lines before the first one is destroyed.
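One way to avoid holding the whole file, sketched under the assumption of a modern Python. lines_after_header is a made-up name. It iterates over the file object and lets itertools.islice drop the first two lines lazily.

```python
from itertools import islice


def lines_after_header(path, skip=2):
    """Hypothetical helper: yield every line of `path` after the first `skip`."""
    with open(path, "rb") as f:
        # Iterating a file reads one line at a time; islice discards the
        # first `skip` lines without building a list of the whole file.
        for line in islice(f, skip, None):
            yield line
```

Only one line is alive at any moment, so a 146 MB input should not need anything like 146 MB of memory.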
    f = open(filePath, 'w')
This is unrelated, but "print file" will just say "<type 'file'>",
because it's the name of a built-in object, and you didn't assign to it
(which you shouldn't anyway).
Actually, if you *only* ran that exact code, it should exit almost
instantly, since it does one import, defines a function, but doesn't
actually call anything. ;-)
Sample lines in File:
# time, ap, bp, as, bs, price, vol, size, seq, isUpLast, isUpVol,
1062993789 0 0 0 0 1022.75 1 1 0 1 0 0
1073883668 1120 1119.75 28 33 0 0 0 0 0 0 0
- The file sizes range from 76 KB to 146 MB
- I'm running on a Gentoo Linux OS
- The filesystem is partitioned and using: XFS for the data
repository, Reiser3 for all else.
How about this version? (note: untested)
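import glob
import os
# If you don't have Python 2.5, use "glob.glob" instead.
filePaths = glob.iglob('/data/ascii/*.dat')
for filePath in filePaths:
    fin = open(filePath, 'rb')
    fout = open(filePath + '.out', 'wb')
    # Discard two lines
    fin.readline()
    fin.readline()
    fout.writelines(fin)  # copy the rest without reading it all at once
    fin.close()
    fout.close()
    os.rename(filePath + '.out', filePath)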
I don't know how light it will be on CPU, but it should use very little
memory (unless you have some extremely long lines, I guess). You could
write a version that just uses .read() and .write() in chunks.
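A rough sketch of such a chunked version, assuming a modern Python. The name strip_header_chunked and the 64 KB default chunk size are my own choices, not anything from the original post. The two header lines are still read with readline(), so only the body is immune to very long lines.

```python
def strip_header_chunked(src_path, dst_path, skip=2, chunk_size=64 * 1024):
    """Hypothetical helper: copy src to dst minus the first `skip` lines."""
    with open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        # Throw away the header lines.
        for _ in range(skip):
            fin.readline()
        # Copy the remainder in fixed-size chunks, ignoring line boundaries.
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
```

Memory use is then bounded by chunk_size rather than by the longest line in the body.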
Also, it temporarily duplicates "whatever.dat" to "whatever.dat.out",
and if "whatever.dat.out" already exists, it will blindly overwrite it.
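If the blind overwrite worries you, one option is to let tempfile.mkstemp pick a name that does not exist yet. This is only a sketch, assuming Python 3.3 or later (for os.replace), and rewrite_without_header is a made-up name.

```python
import os
import tempfile


def rewrite_without_header(path, skip=2):
    """Hypothetical helper: drop the first `skip` lines of `path` in place."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates a new, uniquely named file; keeping it in the same
    # directory means the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix=".out", dir=directory)
    try:
        with os.fdopen(fd, "wb") as fout, open(path, "rb") as fin:
            for _ in range(skip):
                fin.readline()
            for line in fin:
                fout.write(line)
        # Swap the stripped copy into place only after it is fully written.
        os.replace(tmp_path, path)
    except BaseException:
        # Something failed before the swap; remove the partial copy.
        os.unlink(tmp_path)
        raise
```

Note that mkstemp creates the file readable and writable only by its owner, so the permissions of the original file are not carried over.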
Also, if this is anything but a one-shot script, you should use
"try...finally" statements to make sure the file objects get closed (or,
in Python 2.5, the "with" statement).
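For reference, the two forms side by side, as a minimal sketch with made-up function names. On Python 2.5 itself the "with" form also needs "from __future__ import with_statement" at the top of the module.

```python
def first_line_try_finally(path):
    """Hypothetical helper: read one line, closing the file via try...finally."""
    f = open(path, "rb")
    try:
        return f.readline()
    finally:
        # Runs whether readline() returned normally or raised.
        f.close()


def first_line_with(path):
    """Hypothetical helper: same behaviour using the with statement."""
    with open(path, "rb") as f:
        # The file is closed automatically when the block exits.
        return f.readline()
```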