I think there is a problem in my "dependency graph". As an example, here is the index ikiwiki generated for my site (note that the site has changed since this index was generated).

Some HUGE dependencies appear, clearly non-optimal, like:

depends = A | B | A | C | A | D | A | E | A | F | A | G | ....

or

depends = A | B | C | D | A | B | C | D | A | B | C | D | ....

I couldn't isolate the cause, but the problem may be:

  • related to the img module
  • easily observable in my site because one of my pages includes 80 resized images

Other special things in my templates and site:

  • a sidebar with [[!include pages="notes/*" template=foo]] while notes.mdwn has a [[!include pages="notes/*"]] and uses the sidebar; removing it doesn't change anything
  • a template (biblio.tmpl) calling the "img" plugin with a template parameter as the image filename; removing it doesn't change anything
  • some strange games with tags whose page calls a "map" directive to show other tags while tags are also used in tagclouds (in the sidebar and in the main pages)
  • ...

I observed these problems (same kind; I didn't check in detail) on

  • ikiwiki 2.00gpa1 + v5.8.4 + Debian 3.1
  • ikiwiki 2.3 + v5.8.8 + Ubuntu 7.04

I can think about reducing the size of my wiki source and making it available online for analysis.

-- NicolasLimare

As long as these dependencies don't grow over time (ie, when a page is edited and nothing changes that should add a dependency), I wouldn't worry about them. There are many things that can cause non-optimal dependencies to be recorded. For one thing, if you inline something, ikiwiki creates a dependency like:

(PageSpec) or (file1 or file2 or file3 ...)

Where fileN are all the files that the PageSpec currently matches. (This is necessary to detect when a currently inlined file is deleted, and know the inlining page needs an update.) Now consider what it does if you have a single page with two inline statements that inline the same set of pages:

((PageSpec) or (file1 or file2 or file3 ...)) or ((PageSpec) or (file1 or file2 or file3 ...))

Clearly non-optimal, indeed.
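
For reference, the merging itself is just string concatenation; pagespec_merge is roughly the following (a paraphrase of the function in IkiWiki.pm, shown here only to make the duplication mechanism concrete):

    sub pagespec_merge ($$) {
        my $a = shift;
        my $b = shift;

        # Identical specs collapse; anything else is glued together with
        # "or", which is why repeated inlining keeps growing the spec.
        return $a if $a eq $b;
        return "($a) or ($b)";
    }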

Ikiwiki doesn't bother to simplify complex PageSpecs because it's difficult to do, and because all they cost is some disk space. Consider what ikiwiki uses these dependencies for. All it wants to know is: does the PageSpec for the page it's considering rebuilding match any of the pages that have changed? Determining this is a simple operation: the PageSpec is converted to perl code, and that code is run.
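
To illustrate why evaluation stays cheap even for an ugly spec, here is a hedged sketch of the kind of check that ends up running (the toy match_glob below is a stand-in for illustration; the real translated code calls ikiwiki's own match functions):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy stand-in for ikiwiki's glob matcher, for illustration only.
    sub match_glob {
        my ($page, $glob) = @_;
        my $re = quotemeta $glob;
        $re =~ s/\\\*/.*/g;          # a * in a pagespec matches anything
        return $page =~ /^$re$/;
    }

    # The duplicated dependency from above compiles, conceptually, to:
    my $check = sub {
        my $page = shift;
        return (match_glob($page, "notes/*")
            || match_glob($page, "notes/a") || match_glob($page, "notes/b"))
            || (match_glob($page, "notes/*")
            || match_glob($page, "notes/a") || match_glob($page, "notes/b"));
    };

    # || short-circuits on the first match, so the duplicated right-hand
    # half is only evaluated when the page doesn't match at all.
    print $check->("notes/a") ? "depends\n" : "no dependency\n";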

So the total impact of an ugly dependency like this is:

  1. Some extra data read/written to disk.
  2. Some extra space in memory.
  3. A bit more data for the PageSpec translation code to handle. But that code is quite fast.
  4. Typically one extra function call when the generated perl code is run. Ie, when the expression on the left-hand side fails, which typically happens after one (inexpensive) function call, it has to check the identical expression on the right-hand side.

So this is at best a wishlist todo item, not a bug. A PageSpec simplifier (or an improved pagespec_merge() function) could be written and would improve ikiwiki's memory and disk usage, but would it actually speed it up any? We'd have to see the simplifier's code to know.

--Joey

I've been looking at optimizing ikiwiki for a site using album (which produces a lot of pages) and it seems that checking which pages depend on which pages does take a significant amount of time. The optimize-depends branch in my git repository avoids using pagespec_merge() for this (indeed it's no longer used at all), and instead represents dependencies as a list of pagespecs rather than a single pagespec. This does turn out to be faster, although not as much as I'd like. --smcv
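
A hedged sketch of the approach (the helper name and the shape of the dependency storage are illustrative, not the actual branch code):

    # Dependencies stored as a list of independent pagespecs per page,
    # rather than one merged pagespec. Assumes IkiWiki::pagespec_match.
    sub depends_on_changed {
        my ($page, $dependslist, $changed) = @_;
        foreach my $spec (@$dependslist) {
            # Stop at the first pagespec that matches the changed page.
            return 1 if IkiWiki::pagespec_match($changed, $spec, location => $page);
        }
        return 0;
    }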

Merged --smcv

I just wanted to note that there is a whole long discussion of dependencies and pagespecs on the tracking bugs with dependencies page. -- Will

Yeah, I had a look at that (as the only other mention of pagespec_merge). I think I might have solved some of the problems mentioned there, actually - pagespec_merge no longer needs to exist in my branch (although I haven't actually deleted it), because the "or" operation is now done in the Perl code, rather than by merging pagespecs and translating. --smcv

I've now added a patch to the end of that branch that deletes pagespec_merge almost entirely (we do need to keep a copy around, in ikiwiki-transition, but that copy doesn't have to be optimal or support future features like tracking bugs with dependencies). --smcv


Some questions on your optimize-depends branch. --Joey

In saveindex it still or'd together the depends list, but the {depends} field seems only useful for backwards compatibility (ie, ikiwiki-transition still uses it), and otherwise just bloats the index.

If it's acceptable to declare that downgrading IkiWiki requires a complete rebuild, I'm happy with that. I'd prefer to keep the (simple form of the) transition done automatically during a load/save cycle, rather than requiring ikiwiki-transition to be run; we should probably say in NEWS that the performance increase won't fully apply until the next rebuild. --smcv

It is acceptable not to support downgrades. I don't think we need a NEWS file update since any sort of refresh, not just a full rebuild, will cause the indexdb to be loaded and saved, enabling the optimisation. --Joey

A refresh will load the current dependencies from {depends} and save them as-is as a one-element {dependslist}; only a rebuild will replace the single complex pagespec with a long list of simpler pagespecs. --smcv
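
Roughly, the compatibility path would look like this (a sketch; the {depends} and {dependslist} field names are from the discussion above, the rest is illustrative):

    # On load: an old-style index entry carries one big merged pagespec.
    # Wrap it as a one-element list so the rest of the code only deals
    # with lists of pagespecs.
    if (exists $pages{$page}{depends} && !exists $pages{$page}{dependslist}) {
        $pages{$page}{dependslist} = [ $pages{$page}{depends} ];
    }
    # Only a full rebuild re-records dependencies as many simple
    # pagespecs; that is where the real speed-up comes from.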

Is an array the right data structure? add_depends has to loop through the array to avoid dups; it would be better if a hash were used there. Since inline (and other plugins) explicitly add all linked pages, each as a separate item, the list can get rather long, and that single add_depends loop has suddenly become O(N^2) in the number of pages, which is something to avoid..

I was also thinking about this (I've been playing with some stuff based on the remove-pagespec-merge branch). A hash, by itself, is not optimal because the dependency list holds two things: page names and page specs. The hash would work well for the page names, but you'll still need to iterate through the page specs. I was thinking of keeping a list and a hash. You use the list for pagespecs and the hash for individual page names. To make this work you need to adjust the API so it knows which you're adding. -- Will

I wasn't thinking about a lookup hash, just a dedup hash, FWIW. --Joey

I was under the impression from previous code review that you preferred to represent unordered sets as lists, rather than hashes with dummy values. If I was wrong, great, I'll fix that and it'll probably go a bit faster. --smcv

It depends, really. And it'd certainly make sense to benchmark such a change. --Joey

Benchmarked, below. --smcv
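
For concreteness, the dedup-hash version of add_depends amounts to something like this (a sketch; the storage shape is an assumption):

    our %depends;    # page => { pagespec => 1, ... }

    # Registering a duplicate dependency becomes an O(1) hash store
    # instead of a linear scan over a growing list.
    sub add_depends {
        my ($page, $pagespec) = @_;
        $depends{$page}{$pagespec} = 1;
        return 1;
    }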

Also, since a lot of places are calling add_depends in a loop, it probably makes sense to just make it accept a list of dependencies to add. It'll be marginally faster, probably, and should allow for better optimisation when adding a lot of depends at once.

That'd be an API change; perhaps marginally faster, but I don't see how it would allow better optimisation if we're de-duplicating anyway? --smcv

Well, I was thinking that it might be sufficient to build a %seen hash of dependencies inside add_depends, if the places that call it lots were changed to just call it once. Of course the only way to tell is benchmarking. --Joey

It doesn't seem that it significantly affects performance either way. --smcv
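
The list-accepting variant being discussed might look like this (again a sketch, keeping the list representation and using a throwaway %seen hash for de-duplication):

    our %depends;    # page => [ pagespec, pagespec, ... ]

    # Accept many pagespecs in one call so callers don't loop over
    # add_depends themselves; de-duplicate with a %seen hash.
    sub add_depends {
        my ($page, @pagespecs) = @_;
        my %seen = map { $_ => 1 } @{ $depends{$page} || [] };
        foreach my $spec (@pagespecs) {
            push @{ $depends{$page} }, $spec unless $seen{$spec}++;
        }
        return 1;
    }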

In Render.pm, we now have a triply nested loop, which is a bit scary for efficiency. It seems there should be a way to rework this code so it can use the optimised pagespec_match_list, and/or hoist some of the inner loop calculations (like the pagename) out.

I don't think the complexity is any greater than it was: I've just moved one level of "loop" out of the generated Perl, to be in visible code. I'll see whether some of it can be hoisted, though. --smcv

The call to pagename is the only part I can see that's clearly run more often than before. That function is pretty inexpensive, but.. --Joey

I don't see anything that can be hoisted without significant refactoring, actually. Beware that there are two pagename calls in the loop: one for $f (which is the page we might want to rebuild), and one for $file (which is the changed page that it might depend on). Note that I didn't choose those names!

The three loops are over source files, their lists of dependency pagespecs, and files that might have changed. I see the following things we might be doing redundantly:

  • If $file is considered as a potential dependency for more than one $f, we evaluate pagename($file) more than once. Potential fix: cache them (this turns out to save about half a second on the docwiki, see below; a sketch follows after this list).
  • If several pages depend on the same pagespec, we evaluate whether each changed page matches that pagespec more than once: however, we do so with a different location parameter every time, so repeated calls are, in the general case, the only correct thing to do. Potential fix: perhaps special-case "page x depends on page y and nothing else" (i.e. globs that have no wildcards) into a separate hash? I haven't done anything in this direction.
  • Any preparatory work done by pagespec_match (converting the pagespec into Perl, mostly?) is done in the inner loop; switching to pagespec_match_list (significant refactoring) saves more than half a second on the docwiki.

--smcv
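
To make the shape of the three loops concrete, here is a hedged sketch (not the actual Render.pm code; variable names are made up) with the pagename() results for changed files cached outside the inner loops:

    # Compute pagename() once per changed file instead of once per
    # (page, pagespec, changed file) combination.
    my %changedpage = map { $_ => pagename($_) } @changedfiles;

    foreach my $f (@allfiles) {                           # every source file
        my $page = pagename($f);                          # page we may rebuild
        SPEC: foreach my $spec (@{ $depends{$page} }) {   # its dependency pagespecs
            foreach my $file (@changedfiles) {            # files that changed
                if (pagespec_match($changedpage{$file}, $spec, location => $page)) {
                    push @needsbuild, $f;
                    last SPEC;    # one match is enough to schedule a rebuild
                }
            }
        }
    }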

Very good catch on img/meta using the wrong dependency; verified in the wild! (I've cherry-picked those bug fixes.)


Benchmarking results: I benchmarked by altering docwiki.setup to switch off verbose, running "make clean && ./Makefile.PL && make", and timing one rebuild of the docwiki followed by three refreshes. Before each refresh I used touch plugins/*.mdwn to have something significant to refresh.

I'm assuming that "user" CPU time is the important thing here (system time was relatively small in all cases, up to 0.35 seconds per run).

master at the time of rebasing: 14.20s to rebuild, 10.04/12.07/14.01s to refresh. I think you can see the bug clearly here - the pagespecs are getting more complicated every time!

I can totally see a bug here, and it's one I didn't think existed. Ie, I thought that after the first refresh, the pagespec should stabilize, and what it stabilized to was probably unnecessarily long, but not growing w/o bounds!

a) Explains why ikiwiki.info has been so slow lately. Well that and some other things that overloaded the system. b) Suggests to me we will probably want to force a rebuild on upgrade when fixing this (via the mechanism in the postinst).

I've investigated why the pagespecs keep growing: When page A changes, its old depends are cleared. Then page B that inlines A gets rebuilt, and its old depends are also cleared. But page B also inlines page C; which means C gets re-rendered. And this happens w/o its old depends being cleared, so C's depends are doubled. --Joey

After the initial optimization: 14.27s to rebuild, 8.26/8.33/8.26 to refresh. Success!

Not pre-joining dependencies actually took about 0.2s more; I don't know why. I'm worried that duplicates will just build up (again) in less simple cases, though, so 0.2s is probably a small price to pay for that not happening (it might well be experimental error, for that matter).

It's weird that the suggested optimisations to add_depends had no effect. So, the commit message to b6fcb1cb0ef27e5a63184440675d465fad652acf is actually wrong.. ? --Joey

I'll try benchmarking again on the non-public wiki where I had the 4% speedup. The docwiki is so small that 4% is hard to measure... --smcv

Not saving {depends} to the index, using a hash instead of a list to de-duplicate, and allowing add_depends to take an arrayref instead of a single pagespec had no noticeable positive or negative effect on this test.

I see e4cd168ebedd95585290c97ff42234344bfed46c is still in your branch, though. I don't like using an arrayref; it could just take ($page, @depends), and I don't see the need to keep it if it doesn't currently help.

I'll drop it. --smcv

Is there any reason to keep 7227c2debfeef94b35f7d81f42900aa01820caa3 if it doesn't improve speed? --Joey

I'll try benchmarking on a more complex wiki and see whether it has a positive or negative effect. It does avoid being O(n**2) in the number of dependencies. --smcv

Memoizing the results of pagename brought the rebuild time down to 14.06s and the refresh time down to 7.96/7.92/7.92, a significant win.

Ok, that seems safe to memoize. (It's a real function and it isn't called with a great many inputs.) Why did you choose to memoize it explicitly rather than adding it to the memoize list at the top?

It does depend on global variables, so using Memoize seemed like asking for trouble. I suppose what I did is equivalent to Memoize though... --smcv
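
Presumably something equivalent to this (a sketch; the cache and wrapper names are made up):

    # Explicit memoization: for a given configuration, pagename() is a
    # pure function of its argument, so cache results in a plain hash.
    my %pagename_cache;
    sub cached_pagename {
        my $file = shift;
        $pagename_cache{$file} = pagename($file)
            unless exists $pagename_cache{$file};
        return $pagename_cache{$file};
    }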

Refactoring to use pagespec_match_list looks more risky from a code churn point of view; rebuild now takes 14.35s, but refresh is only 7.30/7.29/7.28, another significant win.

--smcv

I had mostly convinced myself that pagespec_match_list would not lead to a speed gain here. My reasoning was that you want to stop after finding one match, while pagespec_match_list checks all pages for matches. So what we're seeing is that on a rebuild, @changed is all pages, and not short-circuiting leads to unnecessary work. OTOH, on refresh, @changed is small and I suppose pagespec_match_list's other slight efficiencies win out somehow.

Welcome to the "I made ikiwiki twice as fast and all I got was this lousy git sha1sum" club BTW :-) --Joey