This is a very interesting discussion. My comments below.
Salvatore Sanfilippo wrote 2012-06-13 11:47:
# 1) Why the RDB is not used as base type in AOF rewrites?
So it's still one file, with identical semantics, but just the initial
base type is binary.
There are only two disadvantages with this approach (and many advantages):
1) You lose the ability to easily process the AOF with a script or the like.
2) When RDB and AOF are enabled you lose the advantage of having data
encoded in two formats by two different code paths (resistance to bugs
corrupting the DB).
Agree. In particular 2 is a potential disadvantage. It's also an
advantage, though. By using the same code path for AOF rewrites and RDB
saves, we can focus more on optimizing and testing that single
implementation. There will be less code that can go wrong.
The only reason this was not done before is other priorities in the
development, and that this requires very careful design and very
Indeed. This needs to be done really carefully before we can trust it.
# 2) About 2.6 AOF.
The good news is that in 2.6 the AOF rewrite uses variadic commands to
reduce the space used to generate the AOF *a lot*.
It's not going to be as small as the RDB but to output a 10 element
list a single RPUSH command will be used instead of 10 RPUSH commands.
Not the final solution as we can improve this with RDB format indeed.
This is really good =)
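To make the size difference concrete, here is a rough Python sketch that encodes the same 10 element list both ways in the protocol format and compares the byte counts. The helper `resp_command`, the key name `mylist` and the element values are made up for illustration; this is not the Redis rewrite code.

```python
def resp_command(*args: str) -> bytes:
    """Encode one command in the Redis protocol multi-bulk form."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


# Hypothetical 10 element list.
items = [f"item{i}" for i in range(10)]

# One RPUSH per element (10 commands).
one_by_one = b"".join(resp_command("RPUSH", "mylist", item) for item in items)

# A single variadic RPUSH carrying all 10 elements.
variadic = resp_command("RPUSH", "mylist", *items)

print(len(one_by_one), len(variadic))
```

The saving comes from writing the command name and the key only once instead of once per element.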
# 3) Two files, one database.
In this thread it was proposed that we can have RDB dumps *and* the
AOF file expressing what changed since the RDB file, so not just a
single file. So you could have:
That combined together would form the final dataset. I think that
while this allows some optimization it is a somewhat dangerous path
practically speaking and for "operations" guys. Either we need to make
things more complex and reference each file inside each other in some
way, like taking checksums, or if for an error a server is restarted
with a mismatching set of files something really bad can happen.
Sorry for the long comment here, but I hope it adds something to the
Having separate files is not necessarily more complicated IMHO. With
this system, the RDB snapshots would work (essentially) the same
regardless of whether AOF is used or not and as long as you have your
RDB-file you will always be able to restore a snapshot of your data (so
this is an excellent candidate for backups). If you always shut down
Redis cleanly, then you actually *never* need anything else.
If the AOF files happen to be around, then they will be replayed
automatically after loading the RDB (the RDB references the AOF, which I
think can be made robust enough easily). The AOF will only add crash
recovery on top of the RDB-snapshots, though, which is a very clean
separation IMO. The RDB snapshots would be the primary persistence
method and AOF would only do some optional logging in addition to that
(if enabled). The sysadmin would usually only need to deal with the RDBs
(and there will be no difference between taking a snapshot by doing a
BGSAVE or a BGREWRITEAOF, which simplifies the concepts a little). Redis
can then take care of the AOF more or less automatically, similar to how
redo logs and WALs are dealt with in other databases. The AOF is
completely optional and will be much smaller when they are separate
file(s) that only contain the differences from the last snapshot, so
they will be less of a "problem". In fact, after a clean shutdown (with
a save) the AOF could be safely deleted.
When it comes to implementation, I think it's actually very easy to
implement this using separate AOF files (see my experimental branch in
another e-mail). Especially with a segmented AOF, it's possible to
remove a lot of code. Most of the RDB-saving code is reused exactly as
is and all the AOF rewrite code can be simply deleted. No changes need
to be buffered in memory or similar either and the new code (mostly for
loading and creating new segments) is quite short and simple.
One small problem with this approach would definitely be to make sure
that Redis doesn't try to load a set of mismatching files together. One
way to add some extra safety here would be to give each AOF a random id
(stored inside the file) and mention it in each reference (so we can
check that it's not another AOF with the same filename). Mismatches
could happen if someone restores a backup of an RDB and keeps some old
AOFs around, but something like this should catch even that.
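A minimal sketch of that check, in Python for brevity. Everything here is hypothetical (the `AofSegment` and `RdbSnapshot` types, the field names, the 128-bit id); it only illustrates the idea of comparing an id stored inside the AOF with the id recorded in the RDB reference.

```python
import secrets
from dataclasses import dataclass


@dataclass
class AofSegment:
    filename: str
    aof_id: str  # random id stored inside the AOF file (assumed header field)


@dataclass
class RdbSnapshot:
    ref_filename: str  # AOF to replay after loading this snapshot
    ref_aof_id: str    # id that AOF had when the snapshot was taken


def new_segment(filename: str) -> AofSegment:
    """Create a segment with a fresh random id."""
    return AofSegment(filename, secrets.token_hex(16))


def snapshot_referencing(segment: AofSegment) -> RdbSnapshot:
    """Record both the filename and the id in the snapshot's reference."""
    return RdbSnapshot(segment.filename, segment.aof_id)


def can_replay(snapshot: RdbSnapshot, segment: AofSegment) -> bool:
    """Replay only if the filename and the id both match."""
    return (snapshot.ref_filename == segment.filename
            and snapshot.ref_aof_id == segment.aof_id)


current = new_segment("appendonly.aof")
snapshot = snapshot_referencing(current)

# An old AOF left around under the same filename gets a different id.
stale = new_segment("appendonly.aof")

print(can_replay(snapshot, current), can_replay(snapshot, stale))
```

On a mismatch the safest behaviour is probably to load only the RDB and refuse to replay the AOF, but that is a policy choice this sketch does not make.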
IMHO a better approach is to have just a single file as a first step,
probably an extension of the current RDB file that can optionally get
commands in the protocol format (like the AOF itself) logged at the
end, with a single utility to check if the file is sane, with a single
file extension for all the kind of Redis dumps (.rdb), and so forth.
So the rewrite would work like that:
1) Start writing the RDB file in background.
2) Append changes in memory (or inside a file, see next sections).
3) Flush AOF-format changes at the end of the RDB file.
4) Rename into the final place.
Of course the command line tools should be able to tell us easily how
big is the RDB section, how big is the appended part, and so forth.
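The four steps above could be sketched roughly as follows, in Python for brevity. This is only an illustration under assumptions: the function name `rewrite` and the temp file prefix are invented, the snapshot is passed in as opaque bytes rather than produced by a background child, and the accumulated changes are assumed to be already encoded in the protocol format.

```python
import os
import tempfile
from typing import List, Tuple


def rewrite(final_path: str, rdb_payload: bytes,
            buffered_changes: List[bytes]) -> Tuple[int, int]:
    """Write snapshot plus logged changes to a temp file, then rename it.

    Returns (rdb_section_size, appended_section_size).
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="temp-rewrite-")
    with os.fdopen(fd, "wb") as f:
        # 1) Write the RDB section (placeholder bytes in this sketch).
        f.write(rdb_payload)
        rdb_size = f.tell()
        # 2) Changes accumulated while the snapshot was being written.
        tail = b"".join(buffered_changes)
        # 3) Flush the AOF-format changes at the end of the RDB section.
        f.write(tail)
        f.flush()
        os.fsync(f.fileno())
    # 4) Rename into the final place; os.replace swaps the file atomically
    #    when source and destination are on the same filesystem.
    os.replace(tmp_path, final_path)
    return rdb_size, len(tail)
```

The two sizes returned are the kind of information a command line tool could report for the RDB section and the appended part.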
I think it's very useful to have the pure RDB-snapshot as a separate
file, so that it's easy to make backups of that (without having to
include the AOF changes). If we store both kinds, then we get some data
duplication, which means more disk space and disk I/O required. Having
to write more to the disk could affect performance a lot in some cases,
I think. It would still be an improvement, though.
# 4) About AOF and replication link: format is different.
It's important to remember that the AOF format and the replication
link format are not compatible. We rewrite certain commands for the
replication link, so I think that mixing the two is not ok, maybe in
the future, but given the sensitive nature of these changes it is better to
move forward step by step :)
Agree. Let's try to unify replication and persistence later, if that
would be useful.
# 5) Accumulate AOF differences while rewriting: Memory or Disk?
Currently we accumulate writes during the rewrite using an in-memory
buffer. This uses memory indeed, but especially if the base format
will be the RDB, it is generated in a decent amount of time so not too
much memory is used.
It is still possible to write the difference into a file that is later
appended in the RDB file by the parent, but my feeling is that the
current approach using a single write(2) call is a lot less time
consuming to perform in the main thread, leading to less latency.
Btw the important thing here is: we can do that in the future, but
just start trying to make RDB persistence as fast as possible. Maybe
we'll realize that the in memory buffer is good because for sure it
takes a memory that is proportional to the memory already used for
data (since RDB write time is proportional to data size), and this may
often be a small percentage of the memory currently used.
If we instead find that it's cool to provide an option to use a file
instead of memory, we can do it later.
I don't think the memory use is the main problem. We need source code to
manage the buffer (more code is always bad) and the buffered data needs
to be written twice to disk here (disk I/O is limited and especially
when writing a lot at once we could saturate the disk for a short
while). Overall, it's probably not a huge problem, though ;)
# 6) On segmented AOF.
Segmented AOF is a mess compared to a single file from an operational
point of view, however it offers many advantages like the ones in this
context, but also the ability to offline compact pieces of the AOF
file. If the benefits are huge then it's worth it, but at this stage
it's not going to be a good idea probably, just to save the in-memory
AOF rewrite buffer.
Agree, it's probably not worth it just to save the memory and
double-write. IMHO, it also simplifies the implementation and
operations, though. It does require a shift of view, however. The AOF
will no longer be used on its own, but will simply be a complement to
I think we can make this kind of segmented AOF work in a very automatic
and invisible way (so that nobody really needs to care much about it
being enabled or not, except for performance and durability) and without
complicating the source code needlessly. Overall, I think this would
benefit both users and the implementation.
That's just my opinion, though. I like this approach, as you probably
have noticed by now =)
So... this is how I see this issue:
1) Move forward slowly starting with just the replacement of RDB in
the AOF rewrite.
2) Keep the single-file persistence.
3) Use only a single file format, the RDB one, that will start always
with an RDB dump plus an optional AOF section.
4) Still make the new Redis version able to read old AOF files, we'll
no longer have 'appendonly.aof' files in the future but just RDB so
this will be easy.
5) Optimize the RDB generation in order to make it as fast as possible.
Personally, I think it would be easier and better to directly solve this
with my approach (which may need some elaboration). This is definitely a
viable alternative, however.
6) Switch the AOF section to a binary format, so that it will be much
better to both store and transfer it to the replication link.
Conceptually it will be still command arg arg arg, but instead of
*3\r\n...$2342.... it will be a binary thing with like 16-bit command
opcode, number of arguments and lengths as 32 bit numbers, followed by
data composing the arguments.
That's a good idea!
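A rough sketch of what such an entry might look like, in Python for brevity. Only the 16-bit opcode, the argument count and the 32-bit lengths come from the description above; the opcode table, the 32-bit width of the argument count, the little-endian byte order and the exact field order are assumptions made for illustration.

```python
import struct

# Hypothetical opcode table; real opcode assignments are not defined anywhere.
OPCODES = {"SET": 1, "RPUSH": 2}
NAMES = {code: name for name, code in OPCODES.items()}

HEADER = "<HI"  # 16-bit opcode, 32-bit argument count (little-endian assumed)


def encode(name: str, *args: bytes) -> bytes:
    """Pack one command as: header, 32-bit length per argument, then the data."""
    header = struct.pack(HEADER, OPCODES[name], len(args))
    lengths = struct.pack("<%dI" % len(args), *(len(a) for a in args))
    return header + lengths + b"".join(args)


def decode(buf: bytes):
    """Inverse of encode(); returns (command name, list of arguments)."""
    opcode, argc = struct.unpack_from(HEADER, buf, 0)
    offset = struct.calcsize(HEADER)
    lengths = struct.unpack_from("<%dI" % argc, buf, offset)
    offset += 4 * argc
    args = []
    for n in lengths:
        args.append(buf[offset:offset + n])
        offset += n
    return NAMES[opcode], args


entry = encode("RPUSH", b"mylist", b"a", b"b")
print(decode(entry))
```

Putting all the lengths before the data means a reader can compute the total entry size from the header alone, which helps when skipping or validating entries.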
# User interface
How to expose this to the user? With just a persistence format we want
to also tell a single persistence story to our users inside
I think something like that will work:
1) We can retain the "save" thing, that will simply force an AOF
rewrite when the save point is triggered, generating the new RDB file.
2) a new option like 'rdb-append-changes yes/no' will be added to
select if the user just want snapshots or snapshots with logs of
3) kill the command BGREWRITEAOF that will just be BGSAVE. If
rdb-append-changes is set to yes it will work as a rewrite.
4) Add a new redis.conf option so that on rewrites the RDB file will
also be copied into dump.rdb.base that is just the base file without
any appending performed on it, so that people still can have a
single-file point-in-time stuff that you can copy around.
Feedback? Does it make sense?
I think that in this simple form we can have this stuff in 2.8 with
little effort and hopefully few bugs.
We can then iterate again to improve it for 3.0.
It does make sense. We clearly have two candidate solutions here. I feel
that the main question here is whether to continue saving all related
data in a single file or not. I think it might be time for Redis to take
the step and move to multi-file persistence (where classic RDB snapshots
and AOF logs cooperate more tightly).
I personally think it would be best to solve this with a "segmented AOF"
and separate RDB-snapshots that have references to the AOF, but either
solution would be a step forward. Maybe it would be useful to try out
these solutions outside the main branch for a while before making a
final decision? I will probably maintain my git branch for a while and
I'm interested in seeing how well that can be made to work. Feel free to
try it out, everyone (although, it's highly experimental right now!).
It's completely ok if it's never included in the main Redis branch, of
Those are my comments. It would be cool to hear what more people think. Let
me know if there's anything more I could do here that would be helpful.
I think improvements related to this are very interesting...